actually - he just put it on github :)
https://github.com/edwardcapriolo/filecrush

On Nov 30, 2011, at 9:03 AM, Jeremy Hanna wrote:

> We went through some grief with small files and the inefficiencies they cause.  
> First we went the route of CombinedInputFormat.  That worked for us for a while, 
> but then we started getting errors related to the number of open files.  So we 
> used a utility called FileCrush, written by Ed Capriolo of the 
> Hadoop/Hive/Cassandra community, that crushes files below a certain size 
> threshold into larger files so that Hadoop can deal with them more efficiently - 
> http://www.jointhegrid.com/hadoop_filecrush/index.jsp.  Ultimately we wrote 
> some custom code for our specific problem - a parallel downloader from S3 
> (tons of small files) to HDFS that writes directly to larger sequence files.  
> FileCrush worked well though, fwiw.
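> 
> For what it's worth, the basic shape of "write lots of small inputs into one 
> big sequence file" looks roughly like the sketch below.  It's untested, 
> written against the Hadoop 0.20-era API, and the SmallFilePacker / 
> packSmallFiles names are just made up for illustration:
> 
> import java.io.IOException;
> 
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IOUtils;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> 
> public class SmallFilePacker {
>   // Pack every file directly under inputDir into a single SequenceFile,
>   // keyed by the original file name, with the file's whole contents as the value.
>   public static void packSmallFiles(Configuration conf, Path inputDir,
>       Path outFile) throws IOException {
>     FileSystem fs = FileSystem.get(conf);
>     SequenceFile.Writer writer =
>         SequenceFile.createWriter(fs, conf, outFile, Text.class, Text.class);
>     try {
>       for (FileStatus status : fs.listStatus(inputDir)) {
>         if (status.isDir()) {
>           continue;  // this sketch doesn't recurse into subdirectories
>         }
>         byte[] contents = new byte[(int) status.getLen()];
>         FSDataInputStream in = fs.open(status.getPath());
>         try {
>           in.readFully(0, contents);
>         } finally {
>           IOUtils.closeStream(in);
>         }
>         writer.append(new Text(status.getPath().getName()), new Text(contents));
>       }
>     } finally {
>       IOUtils.closeStream(writer);
>     }
>   }
> }
> 
> FileCrush does the real work here, of course - the sketch is just to show the 
> idea.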
> 
> Anyway, you might try CombinedInputFormat and if that doesn't work, perhaps 
> FileCrush - if that's indeed what your bottleneck is.
> 
> Hope that helps.
> 
> Jeremy
> 
> On Nov 30, 2011, at 7:43 AM, Alex Rovner wrote:
> 
>> David,
>> 
>> Hadoop was engineered to efficiently process a small number of large files,
>> not the other way around. Since PIG utilizes Hadoop, it will have a
>> similar limitation. Some improvements have been made on that front
>> (CombinedInputFormat), but the performance is still lacking.
>> 
>> The version of PIG you are using is plagued with memory issues, which would
>> affect performance, especially when accumulating large amounts of data in
>> the reducers.
>> 
>> It would be hard to give you any pointers without seeing the script that
>> you are using.
>> 
>> Generally though, you might want to consider the following:
>> 
>> 1. Using a newer version of PIG (there might be a workaround that can be
>> put in place on EMR to do that).
>> 2. Increasing the memory available to each mapper / reducer (see the
>> snippet after this list).
>> 3. Reducing the number of input files by concatenating small files into
>> larger ones - for example, combining daily files into monthly or yearly files.
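>> 
>> A minimal illustration of point 2 in code (the property below is the
>> old-style, pre-YARN Hadoop setting, and the 2 GB value is just an example;
>> on EMR you would normally set this through the cluster configuration or a
>> bootstrap action rather than in code):
>> 
>> import org.apache.hadoop.mapred.JobConf;
>> 
>> public class TaskMemoryExample {
>>   // Raise the heap of every map/reduce child JVM for a job; the same
>>   // property can also live in mapred-site.xml so that Pig-launched jobs
>>   // pick it up as the cluster default.
>>   public static void raiseTaskHeap(JobConf conf) {
>>     conf.set("mapred.child.java.opts", "-Xmx2048m");
>>   }
>> }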
>> 
>> Lastly, do not take the execution speed on your laptop as a benchmark.
>> Hadoop gets its power from running in distributed mode on multiple nodes.
>> Local mode will generally perform much worse than a single-threaded
>> process, since it is trying to mimic what happens on the cluster, which
>> requires quite a bit of coordination between mappers and reducers.
>> 
>> Alex
>> 
>> 
>> On Wed, Nov 30, 2011 at 1:05 AM, David King <dk...@ketralnis.com> wrote:
>> 
>>> I have a pig script that I've translated from an old Python job. The old
>>> script worked by reading a bunch of lines of JSON into sqlite and running
>>> queries against that. The sqlite DB ended up being about 1gb on disk by
>>> the end of the job (it's about a year's worth of data), and the whole job
>>> ran in 40 to 60 minutes single-threaded on a single machine.
>>> 
>>> The pig version (sadly I require pig 0.6, as I'm running on EMR), much to
>>> my surprise, is *way* slower. Run against a few days' worth of data it
>>> takes about 15 seconds, against a month's worth takes about 1.1 minutes,
>>> against two months takes about 3 minutes, but against 3 months of data in
>>> local mode on my 4-proc laptop it takes hours. So long in fact that it
>>> didn't finish in an entire work-day and I had to kill it. On Amazon's EMR I
>>> did get a job to finish after 5h44m using 10 m1.small nodes, which is
>>> pretty nuts compared to the single-proc Python version.
>>> 
>>> There are about 15 thousand JSON files totalling 2.1 GB (uncompressed), so
>>> it's not that big. And the code is, I think, pretty simple. Take a look:
>>> http://pastebin.com/3y7e2ZTq . The loader mentioned there is pretty
>>> simple too; it's basically a hack of ElephantBird's JSON loader that dives
>>> deeper into the JSON and makes bags out of JSON lists, in addition to the
>>> simpler maps that EB produces: http://pastebin.com/dFKX3AJc
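>>> 
>>> The core of the change is just something like the following (a simplified,
>>> untested sketch, not the actual pastebin code; the JsonListToBag / toBag
>>> names are made up, and it assumes json-simple's JSONArray for the parsed
>>> list):
>>> 
>>> import org.apache.pig.data.BagFactory;
>>> import org.apache.pig.data.DataBag;
>>> import org.apache.pig.data.Tuple;
>>> import org.apache.pig.data.TupleFactory;
>>> import org.json.simple.JSONArray;
>>> 
>>> public class JsonListToBag {
>>>   private static final TupleFactory TUPLES = TupleFactory.getInstance();
>>>   private static final BagFactory BAGS = BagFactory.getInstance();
>>> 
>>>   // Wrap each element of a JSON list in a one-field tuple and collect the
>>>   // tuples into a bag, so downstream Pig code can FLATTEN or iterate it.
>>>   public static DataBag toBag(JSONArray list) {
>>>     DataBag bag = BAGS.newDefaultBag();
>>>     for (Object element : list) {
>>>       Tuple t = TUPLES.newTuple(1);
>>>       try {
>>>         t.set(0, element == null ? null : element.toString());
>>>       } catch (Exception e) {
>>>         throw new RuntimeException(e);
>>>       }
>>>       bag.add(t);
>>>     }
>>>     return bag;
>>>   }
>>> }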
>>> 
>>> While it's running in local mode on my laptop it outputs a lot (about one
>>> per minute) of messages like this:
>>> 
>>> 2011-11-29 18:34:21,518 [Low Memory Detector] INFO
>>> org.apache.pig.impl.util.SpillableMemoryManager - low memory handler
>>> called (Collection threshold exceeded) init = 65404928(63872K) used =
>>> 1700522216(1660666K) committed = 2060255232(2011968K) max =
>>> 2060255232(2011968K)
>>> 2011-11-29 18:34:30,773 [Low Memory Detector] INFO
>>> org.apache.pig.impl.util.SpillableMemoryManager - low memory handler
>>> called (Collection threshold exceeded) init = 65404928(63872K) used =
>>> 1700519216(1660663K) committed = 2060255232(2011968K) max =
>>> 2060255232(2011968K)
>>> 2011-11-29 18:34:40,953 [Low Memory Detector] INFO
>>> org.apache.pig.impl.util.SpillableMemoryManager - low memory handler
>>> called (Collection threshold exceeded) init = 65404928(63872K) used =
>>> 1700518024(1660662K) committed = 2060255232(2011968K) max =
>>> 2060255232(2011968K)
>>> 
>>> But I'm not sure how to read EMR's debugging logs to know if it's doing
>>> that in mapreduce mode on EMR too. So my questions are:
>>> 
>>> 1. Is that pig script doing anything that's really that
>>> performance-intensive? Is the loader doing anything obviously bad? Why on
>>> earth is this so slow? This whole dataset should fit in RAM on 2 nodes,
>>> let alone 10.
>>> 
>>> 2. How do people generally go about profiling these scripts?
>>> 
>>> 3. Is that "Low Memory Detector" error in local mode anything to be
>>> worried about? Or is it just telling me that some intermediate dataset
>>> doesn't fit in RAM and is being spilled to disc?
>>> 
>>> 
> 
