Is it wrong to bypass HDFS?

2014-11-09 Thread Trevor Harmon
Hi, I’m trying to model an embarrassingly parallel problem as a map-reduce job. The amount of data is small -- about 100MB per job, and about 0.25MB per work item -- but the reduce phase is very CPU-intensive, requiring about 30 seconds to reduce each mapper's output to a single value. The …

Re: Is it wrong to bypass HDFS?

2014-11-09 Thread Dieter De Witte
100MB is very small, so the overhead of putting the data in HDFS is also very small. Does it even make sense to optimize this? (Reading/writing will only take a second or so.) If you don't want to stream data to HDFS and you have very little data, then you should look into alternative high …

Re: Is it wrong to bypass HDFS?

2014-11-09 Thread Steve Lewis
You should consider writing a custom InputFormat which reads directly from the database - while FileInputFormat is the most common InputFormat, nothing in the InputFormat specification, or in what the critical method getSplits does, requires HDFS - a custom version can return database entries as …
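For illustration, a rough sketch of what such an InputFormat could look like with the org.apache.hadoop.mapreduce API follows. This is not code from the thread: the WorkItemStore class is a hypothetical stand-in for whatever database client actually holds the work items, and the key/value types are just one plausible choice.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class DatabaseInputFormat extends InputFormat<Text, BytesWritable> {

  /** Hypothetical database client; replace with your real data-access code. */
  static class WorkItemStore {
    static List<String> listItemIds(Configuration conf) {
      throw new UnsupportedOperationException("replace with a real query");
    }
    static byte[] fetchItem(String id) {
      throw new UnsupportedOperationException("replace with a real lookup");
    }
  }

  /** One split per work item; the split only carries the item's ID. */
  public static class WorkItemSplit extends InputSplit implements Writable {
    private String itemId;

    public WorkItemSplit() {}                        // required for deserialization
    public WorkItemSplit(String itemId) { this.itemId = itemId; }

    @Override public long getLength() { return 0; }                  // size unknown up front
    @Override public String[] getLocations() { return new String[0]; } // no data locality

    @Override public void write(DataOutput out) throws IOException { out.writeUTF(itemId); }
    @Override public void readFields(DataInput in) throws IOException { itemId = in.readUTF(); }

    public String getItemId() { return itemId; }
  }

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    // Ask the database for the list of work items instead of listing HDFS blocks.
    List<InputSplit> splits = new ArrayList<>();
    for (String id : WorkItemStore.listItemIds(context.getConfiguration())) {
      splits.add(new WorkItemSplit(id));
    }
    return splits;
  }

  @Override
  public RecordReader<Text, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new RecordReader<Text, BytesWritable>() {
      private WorkItemSplit itemSplit;
      private boolean consumed = false;
      private Text key;
      private BytesWritable value;

      @Override
      public void initialize(InputSplit s, TaskAttemptContext ctx) {
        itemSplit = (WorkItemSplit) s;
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (consumed) return false;
        // Each mapper fetches its ~0.25MB work item straight from the database.
        key = new Text(itemSplit.getItemId());
        value = new BytesWritable(WorkItemStore.fetchItem(itemSplit.getItemId()));
        consumed = true;
        return true;
      }

      @Override public Text getCurrentKey() { return key; }
      @Override public BytesWritable getCurrentValue() { return value; }
      @Override public float getProgress() { return consumed ? 1.0f : 0.0f; }
      @Override public void close() {}
    };
  }
}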

Re: Is it wrong to bypass HDFS?

2014-11-09 Thread Trevor Harmon
You’re right, 100MB is small, but if there are 100,000 jobs, the overhead of copying data to HDFS adds up. I guess my main concern was whether allowing mappers to fetch the input data would violate some technical rule or map-reduce principle. I have considered alternative solutions like …

Re: Is it wrong to bypass HDFS?

2014-11-09 Thread Trevor Harmon
Ah, yes, I remember reading about custom InputFormats but did not realize they could bypass HDFS entirely. Sounds like a good solution; I will look into it. Thanks, Trevor
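For completeness, a hypothetical driver showing how an InputFormat like the sketch above might be wired into a job. The class names and output path are illustrative only; with no mapper or reducer set, Hadoop's identity classes are used, so a real job would plug in its own CPU-intensive reducer here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BypassHdfsDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "bypass-hdfs-example");
    job.setJarByClass(BypassHdfsDriver.class);

    // Input comes straight from the database via the custom InputFormat;
    // nothing has to be copied into HDFS first.
    job.setInputFormatClass(DatabaseInputFormat.class);

    // Output types match what the custom RecordReader emits.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}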