Hi Michael,

Just a guess - maybe when you run outside of Hadoop you're running with a
much larger Java heap? You can set mapred.child.java.opts to determine the
heap size of the task processes.
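
For example, in the job driver (just a sketch; the class name and the 1 GB
value are placeholders, so match whatever heap you use outside Hadoop):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class DictionaryJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Give each task JVM a larger heap; 1024m is only an example value.
            conf.set("mapred.child.java.opts", "-Xmx1024m");
            Job job = new Job(conf, "dictionary-load");
            // ... set mapper, input/output formats and paths, then:
            job.waitForCompletion(true);
        }
    }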

Also double-check that the same JVM is getting used. There are some
functions that I've found to be significantly faster or slower in OpenJDK vs.
Sun JDK.
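
An easy way to check is to log the JVM properties from inside the task; the
output shows up in the task's stderr log. Something like:

    // e.g. in the mapper's setup(), or anywhere in the task code
    System.err.println("java.vm.name   = " + System.getProperty("java.vm.name"));
    System.err.println("java.vm.vendor = " + System.getProperty("java.vm.vendor"));
    System.err.println("java.version   = " + System.getProperty("java.version"));
    System.err.println("java.home      = " + System.getProperty("java.home"));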

-Todd

On Thu, Dec 23, 2010 at 6:28 AM, Black, Michael (IS) <michael.bla...@ngc.com
> wrote:

> Using hadoop-0.20.2+737 on Redhat's distribution.
>
> I'm trying to use a dictionary.csv file from a Lucene index inside a map
> function plus another comma delimited file.
>
> It's just a simple loop: read a line, split it on commas, and add the
> dictionary entry to a hash map.
>
> It's about an 8 MB file with 1.5M lines.  I'm using an absolute path, so the
> file read is local (not HDFS).  I've verified from the job status that no
> HDFS reads are occurring.
>
> When I run this outside of Hadoop it executes in 6 seconds.
>
> Inside Hadoop it takes 13 seconds, and the java process is at 100% CPU the
> whole time...
>
> This makes absolutely no sense to me... I would have thought it should
> execute in the same time frame, seeing as it's just reading a local file
> (I'm only running one task at the moment).
>
> I'm also reading another file in a similar fashion and see 3.4 seconds vs.
> 0.3 seconds (longer lines that are also getting split).  This one is 45
> lines and 278 KB.
>
> It appears that the split function may be running slower, since the smaller
> file with more columns runs 10x slower than the large file, which is "only"
> 2x slower.
>
> Anybody have any idea why file input is slower under Hadoop?
>
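
For reference, the dictionary load described above would look roughly like
this (just a sketch; the class name, column positions, and missing-field
handling are my assumptions):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class DictionaryLoader {
        // Loads key -> value pairs from a comma-delimited file on the local
        // filesystem. Assumes the key is in column 0 and the value in column 1.
        public static Map<String, String> load(String path) throws IOException {
            Map<String, String> dict = new HashMap<String, String>();
            BufferedReader reader = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",");
                    if (fields.length >= 2) {
                        dict.put(fields[0], fields[1]);
                    }
                }
            } finally {
                reader.close();
            }
            return dict;
        }
    }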



-- 
Todd Lipcon
Software Engineer, Cloudera
