actually - he just put it on github :) https://github.com/edwardcapriolo/filecrush
On Nov 30, 2011, at 9:03 AM, Jeremy Hanna wrote:

> We went through some grief with small files and inefficiencies there. First we went the route of CombinedInputFormat. That worked for us for a while, but then we started getting errors relating to the number of open files. So we used a utility that Ed Capriolo in the Hadoop/Hive/Cassandra community wrote called FileCrush, which crushes files below a certain size threshold into larger files so that Hadoop can deal with them more efficiently - http://www.jointhegrid.com/hadoop_filecrush/index.jsp. Ultimately we have some custom code for our specific problem - a parallel downloader from S3 (tons of small files) to HDFS that writes directly to larger sequence files. FileCrush worked well though, fwiw.
>
> Anyway, you might try CombinedInputFormat and, if that doesn't work, perhaps FileCrush - if that's indeed what your bottleneck is.
>
> Hope that helps.
>
> Jeremy
>
> On Nov 30, 2011, at 7:43 AM, Alex Rovner wrote:
>
>> David,
>>
>> Hadoop was engineered to efficiently process a small number of large files, not the other way around. Since Pig utilizes Hadoop, it has a similar limitation. Some improvements have been made on that front (CombinedInputFormat), but the performance is still lacking.
>>
>> The version of Pig you are using is plagued with memory issues, which affect performance especially when accumulating large amounts of data in the reducers.
>>
>> It would be hard to give you any pointers without seeing the script you are using.
>>
>> Generally, though, you might want to consider the following:
>>
>> 1. Using a newer version of Pig (there might be a workaround that can be put in place on EMR to do that).
>> 2. Increasing the memory available to each mapper / reducer.
>> 3. Reducing the number of input files by concatenating small files into larger ones. For example, combine daily files into monthly or yearly files.
>>
>> Lastly, do not take execution speed on your laptop as a benchmark. Hadoop gets its power from running in distributed mode on multiple nodes. Local mode will generally perform much worse than a single-threaded process, since it is trying to mimic what happens on the cluster, which requires quite a bit of coordination between mappers and reducers.
>>
>> Alex
>>
>>
>> On Wed, Nov 30, 2011 at 1:05 AM, David King <dk...@ketralnis.com> wrote:
>>
>>> I have a pig script that I've translated from an old Python job. The old script worked by reading a bunch of lines of JSON into sqlite and running queries against that. The sqlite DB ended up being about 1gb on disk by the end of the job (it's about a year's worth of data), and the whole job ran in 40 to 60 minutes single-threaded on a single machine.
>>>
>>> The pig version (sadly I require pig 0.6, as I'm running on EMR), much to my surprise, is *way* slower. Run against a few days' worth of data it takes about 15 seconds, against a month's worth about 1.1 minutes, against two months about 3 minutes, but against 3 months of data in local mode on my 4-proc laptop it takes hours. So long, in fact, that it didn't finish in an entire work-day and I had to kill it. On Amazon's EMR I did get a job to finish after 5h44m using 10 m1.small nodes, which is pretty nuts compared to the single-proc Python version.
>>>
>>> There are about 15 thousand JSON files totalling 2.1gb (uncompressed), so it's not that big. And the code is, I think, pretty simple.
>>> Take a look: http://pastebin.com/3y7e2ZTq . The loader mentioned there is pretty simple too; it's basically a hack of ElephantBird's JSON loader to dive deeper into the JSON and make bags out of JSON lists, in addition to the simpler maps that EB does: http://pastebin.com/dFKX3AJc
>>>
>>> While it's running in local mode on my laptop it outputs a lot of messages (about one per minute) like this:
>>>
>>> 2011-11-29 18:34:21,518 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700522216(1660666K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
>>> 2011-11-29 18:34:30,773 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700519216(1660663K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
>>> 2011-11-29 18:34:40,953 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700518024(1660662K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
>>>
>>> But I'm not sure how to read EMR's debugging logs to know whether it's doing that in mapreduce mode on EMR too. So my questions are:
>>>
>>> 1. Is that pig script doing anything that's really that performance intensive? Is the loader doing anything obviously bad? Why on earth is this so slow? This whole dataset should fit in RAM on 2 nodes, let alone 10.
>>>
>>> 2. How do people generally go about profiling these scripts?
>>>
>>> 3. Is that "Low Memory Detector" message in local mode anything to be worried about? Or is it just telling me that some intermediate dataset doesn't fit in RAM and is being spilled to disk?
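For anyone finding this thread later: a minimal sketch of the "concatenate small files into bigger ones" approach Jeremy and Alex describe, written against the 0.20-era Hadoop Java API. It rolls a directory of small JSON files into a single SequenceFile. The class name, paths, and the Text key/value choice are illustrative assumptions; this is not FileCrush's actual implementation, just the same idea in its simplest form.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Sketch: roll many small JSON files under one directory into a single
    // SequenceFile so Hadoop/Pig sees a few big inputs instead of ~15k tiny ones.
    public class SmallFileRoller {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inDir = new Path(args[0]);   // directory of small JSON files
        Path outFile = new Path(args[1]); // single large SequenceFile

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, outFile, Text.class, Text.class);
        try {
          for (FileStatus stat : fs.listStatus(inDir)) {
            if (stat.isDir()) continue;
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(stat.getPath())));
            try {
              String line;
              while ((line = reader.readLine()) != null) {
                // key = source file name, value = one JSON record
                writer.append(new Text(stat.getPath().getName()), new Text(line));
              }
            } finally {
              reader.close();
            }
          }
        } finally {
          writer.close();
        }
      }
    }

Running something like "hadoop jar roller.jar SmallFileRoller /raw/json /combined/records.seq" (names hypothetical) before the Pig job means the load sees a handful of large splits rather than ~15,000 tiny files; automating that, with size thresholds and parallelism, is essentially what FileCrush does.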
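On Alex's second suggestion (more memory per mapper/reducer): the pre-YARN knob is the mapred.child.java.opts property. A hedged sketch of setting it from a Java driver is below; on EMR or from Pig you would set the same property through the cluster's mapred-site.xml or the job's properties rather than in code, and the -Xmx value shown is an arbitrary example, not a recommendation.

    import org.apache.hadoop.mapred.JobConf;

    public class ChildHeapExample {
      // Returns a JobConf whose map/reduce task JVMs get a larger heap.
      public static JobConf withBiggerChildHeap() {
        JobConf conf = new JobConf();
        // mapred.child.java.opts: JVM options passed to each spawned task
        conf.set("mapred.child.java.opts", "-Xmx1024m");
        return conf;
      }
    }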