dictionary.csv

2010-12-23 Thread Black, Michael (IS)
Using hadoop-0.20.2+737 on Red Hat's distribution.

I'm trying to use a dictionary.csv file from a Lucene index inside a map 
function, plus another comma-delimited file.

It's just a simple loop: read a line, split the line on commas, and add 
the dictionary entry to a hash map.

It's about an 8MB file with 1.5M lines.  I'm using an absolute path so the file 
is read locally (not from HDFS).  I've verified from the job status that no 
HDFS reads are occurring.
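
Roughly, the loading loop looks like the sketch below (the column layout and 
the key/value choice are simplified placeholders, not my actual code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class DictionaryLoader {
    // Read a local dictionary.csv, split each line on commas,
    // and keep the entries in a HashMap keyed on the first column.
    public static Map<String, String> load(String localPath) throws IOException {
        Map<String, String> dict = new HashMap<String, String>();
        BufferedReader in = new BufferedReader(new FileReader(localPath));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",");
                if (fields.length >= 2) {
                    dict.put(fields[0], fields[1]);  // placeholder columns
                }
            }
        } finally {
            in.close();
        }
        return dict;
    }
}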

When I run this outside of Hadoop it executes in 6 seconds.

Inside Hadoop it takes 13 seconds and the Java process is at 100% CPU the whole 
time...

This makes absolutely no sense to me... I would've thought it would execute in 
the same time frame, seeing as it's just reading a local file (I'm only 
running one task at the moment).

I'm also reading another file in a similar fashion and see 3.4 seconds vs 0.3 
seconds (longer lines that are also getting split).  This one is 45 lines and 
278KB.

It appears that the split function may be running slower, since the smaller 
file with more columns runs 10X slower, while the large file is only 2X 
slower.
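
A standalone micro-benchmark along these lines would be one way to test that 
theory, since on the Java 6 JDKs I've seen, String.split compiles a regex on 
every call while a hand-rolled character scan doesn't (the test row and 
iteration count below are made up):

import java.util.ArrayList;
import java.util.List;

public class SplitCheck {
    // Split on a single character without going through String.split's regex path.
    static List<String> manualSplit(String line, char sep) {
        List<String> out = new ArrayList<String>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == sep) {
                out.add(line.substring(start, i));
                start = i + 1;
            }
        }
        out.add(line.substring(start));
        return out;
    }

    public static void main(String[] args) {
        String row = "a,b,c,d,e,f,g,h,i,j";  // made-up test row
        int iterations = 1000000;
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) row.split(",");
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) manualSplit(row, ',');
        long t2 = System.nanoTime();
        System.out.println("String.split: " + (t1 - t0) / 1000000 + " ms");
        System.out.println("manualSplit:  " + (t2 - t1) / 1000000 + " ms");
    }
}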

Anybody have any idea why file input is slower under hadoop?




Re: dictionary.csv

2010-12-23 Thread Todd Lipcon
Hi Michael,

Just a guess - maybe when you run outside of Hadoop you're running with a
much larger Java heap? You can set mapred.child.java.opts to control the
heap size of the task processes.
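
For example, in the job setup (the heap value here is only illustrative):

import org.apache.hadoop.mapred.JobConf;

public class HeapConfigExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Give each task JVM a larger heap; 1024m is just an example value.
        conf.set("mapred.child.java.opts", "-Xmx1024m");
        System.out.println(conf.get("mapred.child.java.opts"));
    }
}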

Also double-check that the same JVM is getting used. There are some
functions that I've found to be significantly faster or slower in OpenJDK vs
the Sun JDK.
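
One quick way to check is to print the VM properties from the environment the
task actually runs in, e.g.:

public class JvmInfo {
    public static void main(String[] args) {
        // Shows which JVM binary and vendor are actually in use.
        System.out.println(System.getProperty("java.vm.name"));
        System.out.println(System.getProperty("java.vm.vendor"));
        System.out.println(System.getProperty("java.version"));
        System.out.println(System.getProperty("java.home"));
    }
}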

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera