Re: Seattle Hadoop/Scalability/NoSQL Meetup Tonight!

2010-02-25 Thread Bradford Stephens
Thanks for coming, everyone! We had around 25 people. A *huge*
success, for Seattle. And a big thanks to 10gen for sending Richard.

Can't wait to see you all next month.

On Wed, Feb 24, 2010 at 2:15 PM, Bradford Stephens
bradfordsteph...@gmail.com wrote:
 The Seattle Hadoop/Scalability/NoSQL (yeah, we vary the title) meetup
 is tonight! We're going to have a guest speaker from MongoDB :)

 As always, it's at the University of Washington, Allen Computer
 Science building, Room 303 at 6:45pm. You can find a map here:
 http://www.washington.edu/home/maps/southcentral.html?cse

 If you can, please RSVP here (not required, but very nice):
 http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/

 --
 http://www.drawntoscalehq.com --  The intuitive, cloud-scale data
 solution. Process, store, query, search, and serve all your data.

 http://www.roadtofailure.com -- The Fringes of Scalability, Social
 Media, and Computer Science




-- 
http://www.drawntoscalehq.com --  The intuitive, cloud-scale data
solution. Process, store, query, search, and serve all your data.

http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science


Reduce step never starts, can't read output from mappers? (Too many fetch-failures)

2010-02-25 Thread Martin Häger
(re-posted from the mapreduce-user list in case anyone here might have
an answer)

Hello,

I have set up a cluster with one NameNode/JobTracker and three
DataNode/TaskTrackers, and I am having some issues with the reduce step
being unable to start. Masters and slaves can ping and ssh to each other.
I'm attaching the conf files (same on all machines).

Is there anything else I should be looking at?

Log output for JobTracker and one of the TaskTrackers that seems suspicious:

JobTracker
==========

exj...@exjobb-1:~$ hadoop jar /opt/hadoop/hadoop-0.20.1-examples.jar
wordcount input/sessions-20100205145800.txt output-wordcount
10/02/24 11:15:24 INFO input.FileInputFormat: Total input paths to process : 1
10/02/24 11:15:25 INFO mapred.JobClient: Running job: job_201002240852_0003
10/02/24 11:15:26 INFO mapred.JobClient:  map 0% reduce 0%
10/02/24 11:15:49 INFO mapred.JobClient:  map 1% reduce 0%
10/02/24 11:15:58 INFO mapred.JobClient:  map 2% reduce 0%
10/02/24 11:16:06 INFO mapred.JobClient:  map 3% reduce 0%
10/02/24 11:16:15 INFO mapred.JobClient:  map 4% reduce 0%
10/02/24 11:16:23 INFO mapred.JobClient:  map 5% reduce 0%
10/02/24 11:16:32 INFO mapred.JobClient:  map 6% reduce 0%
10/02/24 11:16:40 INFO mapred.JobClient:  map 7% reduce 0%
10/02/24 11:16:51 INFO mapred.JobClient:  map 8% reduce 0%
10/02/24 11:16:59 INFO mapred.JobClient:  map 9% reduce 0%
10/02/24 11:17:07 INFO mapred.JobClient:  map 10% reduce 0%
10/02/24 11:17:31 INFO mapred.JobClient:  map 11% reduce 0%
10/02/24 11:17:39 INFO mapred.JobClient:  map 12% reduce 0%
10/02/24 11:17:49 INFO mapred.JobClient:  map 13% reduce 0%
10/02/24 11:17:57 INFO mapred.JobClient:  map 14% reduce 0%
10/02/24 11:18:05 INFO mapred.JobClient:  map 15% reduce 0%
10/02/24 11:18:15 INFO mapred.JobClient:  map 16% reduce 0%
10/02/24 11:18:23 INFO mapred.JobClient:  map 17% reduce 0%
10/02/24 11:18:32 INFO mapred.JobClient:  map 18% reduce 0%
10/02/24 11:18:42 INFO mapred.JobClient:  map 19% reduce 0%
10/02/24 11:18:51 INFO mapred.JobClient:  map 20% reduce 0%
10/02/24 11:19:11 INFO mapred.JobClient:  map 21% reduce 0%
10/02/24 11:19:22 INFO mapred.JobClient:  map 22% reduce 0%
10/02/24 11:19:32 INFO mapred.JobClient:  map 23% reduce 0%
10/02/24 11:19:40 INFO mapred.JobClient:  map 24% reduce 0%
10/02/24 11:19:49 INFO mapred.JobClient:  map 25% reduce 0%
10/02/24 11:19:57 INFO mapred.JobClient:  map 26% reduce 0%
10/02/24 11:20:05 INFO mapred.JobClient:  map 27% reduce 0%
10/02/24 11:20:15 INFO mapred.JobClient:  map 28% reduce 0%
10/02/24 11:20:24 INFO mapred.JobClient:  map 29% reduce 0%
10/02/24 11:20:34 INFO mapred.JobClient:  map 30% reduce 0%
10/02/24 11:20:52 INFO mapred.JobClient:  map 31% reduce 0%
10/02/24 11:21:02 INFO mapred.JobClient:  map 32% reduce 0%
10/02/24 11:21:12 INFO mapred.JobClient:  map 33% reduce 0%
10/02/24 11:21:21 INFO mapred.JobClient:  map 34% reduce 0%
10/02/24 11:21:31 INFO mapred.JobClient:  map 35% reduce 0%
10/02/24 11:21:40 INFO mapred.JobClient:  map 36% reduce 0%
10/02/24 11:21:49 INFO mapred.JobClient:  map 37% reduce 0%
10/02/24 11:21:58 INFO mapred.JobClient:  map 38% reduce 0%
10/02/24 11:22:07 INFO mapred.JobClient:  map 39% reduce 0%
10/02/24 11:22:17 INFO mapred.JobClient:  map 40% reduce 0%
10/02/24 11:22:35 INFO mapred.JobClient:  map 41% reduce 0%
10/02/24 11:22:44 INFO mapred.JobClient:  map 42% reduce 0%
10/02/24 11:22:53 INFO mapred.JobClient:  map 43% reduce 0%
10/02/24 11:23:05 INFO mapred.JobClient:  map 44% reduce 0%
10/02/24 11:23:14 INFO mapred.JobClient:  map 45% reduce 0%
10/02/24 11:23:22 INFO mapred.JobClient:  map 46% reduce 0%
10/02/24 11:23:32 INFO mapred.JobClient:  map 47% reduce 0%
10/02/24 11:23:40 INFO mapred.JobClient:  map 48% reduce 0%
10/02/24 11:23:50 INFO mapred.JobClient:  map 49% reduce 0%
10/02/24 11:23:59 INFO mapred.JobClient:  map 50% reduce 0%
10/02/24 11:24:17 INFO mapred.JobClient:  map 51% reduce 0%
10/02/24 11:24:27 INFO mapred.JobClient:  map 52% reduce 0%
10/02/24 11:24:34 INFO mapred.JobClient:  map 53% reduce 0%
10/02/24 11:24:45 INFO mapred.JobClient:  map 54% reduce 0%
10/02/24 11:24:57 INFO mapred.JobClient:  map 55% reduce 0%
10/02/24 11:25:04 INFO mapred.JobClient:  map 56% reduce 0%
10/02/24 11:25:15 INFO mapred.JobClient:  map 57% reduce 0%
10/02/24 11:25:22 INFO mapred.JobClient:  map 58% reduce 0%
10/02/24 11:25:32 INFO mapred.JobClient:  map 59% reduce 0%
10/02/24 11:25:42 INFO mapred.JobClient:  map 60% reduce 0%
10/02/24 11:25:57 INFO mapred.JobClient:  map 61% reduce 0%
10/02/24 11:26:07 INFO mapred.JobClient:  map 62% reduce 0%
10/02/24 11:26:16 INFO mapred.JobClient:  map 63% reduce 0%
10/02/24 11:26:24 INFO mapred.JobClient:  map 64% reduce 0%
10/02/24 11:26:34 INFO mapred.JobClient:  map 65% reduce 0%
10/02/24 11:26:45 INFO mapred.JobClient:  map 66% reduce 0%
10/02/24 11:26:56 INFO mapred.JobClient:  map 67% reduce 0%
10/02/24 11:27:05 INFO mapred.JobClient:  map 68% reduce 0%
10/02/24 11:27:13 INFO mapred.JobClient:  map 69% reduce 0%
10/02/24 11:27:17 INFO 

Re: Hadoop key mismatch

2010-02-25 Thread Edward Capriolo
On Wed, Feb 24, 2010 at 3:30 PM, Larry Homes larr.ho...@gmail.com wrote:
 Hello,

 I am trying to sort some values by using a simple map and reduce
 without any processing, but I think I messed up my data types somehow.

 Rather than try to paste code in an email, I have described the
 problem and pasted all the code (nicely formatted) here:
 http://www.coderanch.com/t/484435/Distributed-Java/java/Hadoop-key-mismatch

 Thanks


I think the first problem you are having is that you changed the
signature of the map method incorrectly.

 public void map(Text key, Text value, Context context)

The type of the key should be LongWritable: with the default TextInputFormat,
the key is the byte offset of the line within the file, and the value is the
entire line of text.

Try :
public void map(LongWritable key, Text value, Context context)

Adjust accordingly and you should be ok. (at least until the next problem :)
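
(For the archives, a minimal sketch of the adjusted mapper Edward describes.
The class name and output types are made up for illustration; it assumes the
default TextInputFormat and the org.apache.hadoop.mapreduce API that the
Context parameter implies.)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line (ignored here);
        // value is the full line of text, emitted as the new sort key.
        context.write(value, new Text(""));
    }
}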


Re: CDH2 or Apache Hadoop - Official Debian packages

2010-02-25 Thread Thomas Koch
Allen, 
 For all intents and purposes, the Debian package sounds just like a
 re-packaging of the Apache distribution in .deb form.
You're perfectly right. Most Debian packages are just a re-packaging of the 
upstream projects, but with additional management information and logic to 
ease the installation and make them work well on the platform and together 
with other programs.
It's the beautiful world of package management:
apt-get install hadoop
less /usr/share/doc/hadoop/README
... Have fun with hadoop

  - no version namespace, everything is called just hadoop, not
  hadoop-0.18 or hadoop-0.20 as in the cloudera package
 
 ... and thus making upgrades really hard and not suitable for anything
 real.
Actually, my hope is that Hadoop will eventually establish a stable API (as 
planned) so that upgrades will be backwards compatible.
As long as that isn't the case, the Debian package is intended for only three 
audiences:
- People who are willing to deal with any upgrade hassles for the benefit of 
an official Debian package
- People who'd like to try out and learn hadoop with an easily installable 
package
- Me

That said, I'm going to use the Debian package on a tiny production cluster of 
5 machines.
 
Thomas Koch, http://www.koch.ro


Re: Sun JVM 1.6.0u18

2010-02-25 Thread Todd Lipcon
On Thu, Feb 25, 2010 at 11:09 AM, Scott Carey sc...@richrelevance.com wrote:

 On Feb 15, 2010, at 9:54 PM, Todd Lipcon wrote:

  Hey all,
 
  Just a note that you should avoid upgrading your clusters to 1.6.0u18.
  We've seen a lot of segfaults or bus errors on the DN when running
  with this JVM - Stack found the same thing on one of his clusters as
  well.
 

 Have you seen this for 32bit, 64 bit, or both?  If 64 bit, was it with
 -XX:+UseCompressedOops?


Just 64-bit, no compressed oops. But I haven't tested other variables.



 Any idea if there are Sun bugs open for the crashes?


I opened one, yes. I think Stack opened a separate one. Haven't heard back.


 I have found some notes that suggest that -XX:-ReduceInitialCardMarks
 will work around some known crash problems with 6u18, but that may be
 unrelated.


Yep, I think that is a likely workaround as well. For now I'm
recommending a downgrade to our clients, rather than introducing cryptic -XX
flags :)
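
(For anyone who would rather experiment with the flag than downgrade: a
hedged sketch of where it could go, assuming the stock conf/hadoop-env.sh
shipped with 0.20; untested as an actual fix.)

# conf/hadoop-env.sh -- append the workaround flag to the DataNode JVM options
export HADOOP_DATANODE_OPTS="-XX:-ReduceInitialCardMarks $HADOOP_DATANODE_OPTS"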



 Lastly, I assume that Java 6u17 should work the same as 6u16, since it is a
 minor patch over 6u16, whereas 6u18 includes a new version of HotSpot.  Can
 anyone confirm that?



I haven't heard anything bad about u17 either. But since we know 16 to be
very good and nothing important is new in 17, I like to recommend 16 still.

-Todd


Use intermediate compression for Map output or not?

2010-02-25 Thread jiang licht
Hi hadoop Gurus, here's a question about intermediate compression.

As I understand it, the point of compressing Map output is to reduce the
network traffic that occurs when feeding intermediate map output from Map
tasks to Reduce tasks that do not reside on the same boxes. So, depending on
various factors such as how the cluster is set up, the data size, the nature
of the problem to solve and the quality of the m/r program (e.g. a Pig
script), this reduction in network traffic (due to compressed data) may or
may not compensate for the time spent on compression and decompression. In
other words, intermediate compression may not reach its goal of reducing the
overall time cost of an m/r job.
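
(For anyone looking for the knobs: a minimal sketch of turning on map-output
compression in an old-API driver. The class name is a placeholder, and
whether it pays off is exactly the open question in this thread.)

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressMapOutputExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(CompressMapOutputExample.class);
        // Equivalent to setting mapred.compress.map.output=true
        conf.setCompressMapOutput(true);
        // Gzip shown here; a faster codec such as LZO (installed separately)
        // trades compression ratio for speed.
        conf.setMapOutputCompressorClass(GzipCodec.class);
        // ... rest of job setup (input/output paths, mapper, reducer) omitted ...
    }
}

(From a Pig script the same property can usually be passed through on the
command line, e.g. pig -Dmapred.compress.map.output=true, though that depends
on the Pig version.)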

As far as I know, a blog post
(http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html)
gives compression and decompression ratios and speeds, and reports a positive
result from using compression on the raw data fed to an m/r job as input, but
offers no test or insight about intermediate compression. So I am wondering
if there is any case study or test result guiding when to use intermediate
compression: pros and cons, settings, pitfalls and gains...

Thanks,

Michael


  

Re: CDH2 or Apache Hadoop - Official Debian packages

2010-02-25 Thread Owen O'Malley


On Feb 25, 2010, at 10:20 AM, Allen Wittenauer wrote:

 Actually, my hope is that Hadoop will eventually establish a stable API (as
 planned) so that upgrades will be backwards compatible.

 History shows you are in for a long wait.


I hope not and I'm trying to make sure that isn't true. At this point,  
we have a lot of customers inside Yahoo who yell at our SVP when  
anyone breaks API compatibility with the previous release.


My hope is to get to the point where we do one major release a year and
each major release is backwards compatible with the previous major release
(as in: you don't need to recompile your code). Bonus points if we can get a
minor release out at the half-year point. And of course bug fix releases as
needed...


-- Owen


Re: CDH2 or Apache Hadoop - Official Debian packages

2010-02-25 Thread Allen Wittenauer



On 2/25/10 8:39 AM, Thomas Koch tho...@koch.ro wrote:
 - no version namespace, everything is called just hadoop, not
 hadoop-0.18 or hadoop-0.20 as in the cloudera package
 
 ... and thus making upgrades really hard and not suitable for anything
 real.
 Actually, my hope is that Hadoop will eventually establish a stable API (as
 planned) so that upgrades will be backwards compatible.

History shows you are in for a long wait.

It is also worth pointing out that API compat is only part of the issue.
Without ABI compat, it is still a very rough road.  [A point lost on way too
many in the Hadoop community; too many devs, not enough ops.]




Re: Sun JVM 1.6.0u18

2010-02-25 Thread Scott Carey
On Feb 15, 2010, at 9:54 PM, Todd Lipcon wrote:

 Hey all,
 
 Just a note that you should avoid upgrading your clusters to 1.6.0u18.
 We've seen a lot of segfaults or bus errors on the DN when running
 with this JVM - Stack found the same thing on one of his clusters as
 well.
 

Have you seen this for 32bit, 64 bit, or both?  If 64 bit, was it with 
-XX:+UseCompressedOops?

Any idea if there are Sun bugs open for the crashes?

I have found some notes that suggest that -XX:-ReduceInitialCardMarks will 
work around some known crash problems with 6u18, but that may be unrelated.  

Lastly, I assume that Java 6u17 should work the same as 6u16, since it is a 
minor patch over 6u16, whereas 6u18 includes a new version of HotSpot.  Can 
anyone confirm that?


 We've found 1.6.0u16 to be very stable.
 
 -Todd



Hadoop freeze?

2010-02-25 Thread jiang licht
I ran into the following problem running a Hadoop job written in Pig. Please 
help check what caused the issue. As far as I can tell, it seems the job/task 
tracker failed for some reason but the name/data nodes are still functioning.

The job simply seems to make no progress at all (no output, no log), but a 
couple of other Hadoop jobs ran successfully before this one. hadoop fs -ls 
can still list files. But when I ran hadoop job -list, it took too long and 
then failed with the error message below.

Exception in thread "main" java.io.IOException: Call to hostname/ip-address:50002 failed on local exception: Connection reset by peer
        at org.apache.hadoop.ipc.Client.call(Client.java:699)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.mapred.$Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
        at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:435)
        at org.apache.hadoop.mapred.JobClient.init(JobClient.java:429)
        at org.apache.hadoop.mapred.JobClient.run(JobClient.java:1512)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1727)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:271)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:493)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:438)
The web interface to the JobTracker on port 50030 simply gave no response at all.

Checking netstat, port 50030 sometimes shows up and sometimes doesn't; the 
connections and ports to the data nodes were shown there.

Then, when I ran another Pig job, it failed with the following error:

Error before Pig is launched
ERROR 6009: Failed to create job client: Call to hostname/ip-address:50002 failed on local exception: Connection reset by peer

org.apache.pig.backend.executionengine.ExecException: ERROR 6009: Failed to create job client: Call to hostname/ip-address:50002 failed on local exception: Connection reset by peer
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:217)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:137)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:199)
        at org.apache.pig.PigServer.init(PigServer.java:169)
        at org.apache.pig.PigServer.init(PigServer.java:158)
        at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:54)
        at org.apache.pig.Main.main(Main.java:395)
Caused by: java.io.IOException: Call to hostname/ip-address:50002 failed on local exception: Connection reset by peer
        at org.apache.hadoop.ipc.Client.call(Client.java:699)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.mapred.$Proxy1.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
        at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:435)
        at org.apache.hadoop.mapred.JobClient.init(JobClient.java:429)
        at org.apache.hadoop.mapred.JobClient.init(JobClient.java:398)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
        ... 6 more
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
        at

Re: On CDH2, (Cloudera EC2) No valid local directories in property: mapred.local.dir

2010-02-25 Thread Saptarshi Guha
Hello,
I fixed this by running >= 2 slaves.
I was testing with 1 when this error occurred.

Regards
Saptarshi
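
(For the archives: Todd's checks quoted below amount to something like the
following on the JobTracker node, assuming the usual CDH-on-EC2 layout where
mapred.local.dir points at /mnt/hadoop/mapred/local, as reported further down.)

df -h /mnt                         # is the volume mounted and not full?
ls -ld /mnt/hadoop/mapred/local    # does it exist, owned/writable by hadoop?
sudo -u hadoop sh -c 'touch /mnt/hadoop/mapred/local/probe && rm /mnt/hadoop/mapred/local/probe'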

On Tue, Feb 23, 2010 at 2:57 PM, Todd Lipcon t...@cloudera.com wrote:
 Hi Saptarshi,

 Can you please ssh into the JobTracker node and check that this
 directory is mounted, writable by the hadoop user, and not full?

 -Todd

 On Fri, Feb 19, 2010 at 2:13 PM, Saptarshi Guha
 saptarshi.g...@gmail.com wrote:
 Hello,
 Not sure if I should post this here or on Cloudera's message board,
 but here goes.
 When I run EC2 using the latest CDH2 and Hadoop 0.20 (by setting the
 env variables for hadoop-ec2),
 and launch a job
 hadoop jar ...

 I get the following error


 10/02/19 17:04:55 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the
 same.
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid
 local directories in property: mapred.local.dir
        at 
 org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:975)
        at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:279)
        at 
 org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:256)
        at 
 org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:240)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3026)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)

        at org.apache.hadoop.ipc.Client.call(Client.java:740)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at org.apache.hadoop.mapred.$Proxy0.submitJob(Unknown Source)
        at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:841)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)

        at org.godhuli.f.RHMR.submitAndMonitorJob(RHMR.java:195)

 but the value  of mapred.local.dir is /mnt/hadoop/mapred/local

 Any ideas?




Re: cluster involvement trigger

2010-02-25 Thread Amogh Vasekar
Hi,
The number of mappers initialized depends largely on your input format (the 
getSplits of your input format). (Almost all) input formats available in 
Hadoop derive from FileInputFormat, hence the 1-mapper-per-file-block notion 
(this actually is 1 mapper per split).
You say that you have too many small files. In general each of these small 
files (< 64 MB) will be processed by a single mapper. However, I would 
suggest looking at CombineFileInputFormat, which does the job of packaging 
many small files together depending on data locality, for better performance 
(initialization time is a significant factor in Hadoop's performance).
On the other side, many small files will hamper your NameNode's performance, 
since file metadata is stored in memory, and will limit its overall capacity 
with respect to the number of files.
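
(A minimal sketch of the driver side of this, using the old mapred API with
identity mapper/reducer; the paths and the 128 MB figure are made up. Note
that plain FileInputFormat never packs more than one file into a split, which
is exactly why CombineFileInputFormat is the thing to look at when there are
lots of tiny files.)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SplitDemo.class);
        // One split (hence one map task) per file smaller than a block;
        // a 200 MB file gets ~4 splits with the default 64 MB block size.
        conf.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("input/"));
        FileOutputFormat.setOutputPath(conf, new Path("output-splitdemo"));
        // Raising the minimum split size merges blocks *within* a file,
        // but splits never span files, so 1M tiny files still mean 1M map
        // tasks unless a combining input format is used.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        JobClient.runJob(conf);
    }
}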

Amogh


On 2/25/10 11:15 PM, Michael Kintzer michael.kint...@zerk.com wrote:

Hi,

We are using the streaming API. We are trying to understand what Hadoop uses 
as a threshold or trigger to involve more TaskTracker nodes in a given 
Map-Reduce execution.

With default settings (64MB chunk size in HDFS), if the input file is less than 
64MB, will the data processing only occur on a single TaskTracker Node, even if 
our cluster size is greater than 1?

For example, we are trying to figure out if hadoop is more efficient at 
processing:
a) a single input file which is just an index file that refers to a jar archive 
of 100K or 1M individual small files, where the jar file is passed as the 
-archives argument, or
b) a single input file containing all the raw data represented by the 100K or 
1M small files.

With (a), our input file is < 64MB.   With (b) our input file is very large.

Thanks for any insight,

-Michael