You can either use sc.wholeTextFiles (which gives far fewer partitions than sc.textFile over many small files) followed by a flatMap to split the file contents into records, or give the driver process more memory with --driver-memory 20g and then call RDD.repartition(small number) right after you load the data. -Xiangrui
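For concreteness, a minimal sketch of both options; the input path and the partition count are placeholders, not values from this thread:

import org.apache.spark.{SparkConf, SparkContext}

object SmallFilesLoad {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-files-load"))

    // Option 1: wholeTextFiles returns one (path, content) pair per file, so the
    // number of partitions stays small even with ~160,000 files; flatMap then
    // splits each file's content back into individual records.
    val viaWholeTextFiles = sc
      .wholeTextFiles("/data/postprocessed/*")
      .flatMap { case (_, content) => content.split("\n") }

    // Option 2: keep sc.textFile (submit the job with --driver-memory 20g) and
    // collapse the large number of partitions immediately after loading.
    val viaTextFile = sc
      .textFile("/data/postprocessed/*")
      .repartition(200) // a small, cluster-appropriate partition count

    println(viaWholeTextFiles.count())
    println(viaTextFile.count())
    sc.stop()
  }
}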
On Mon, Jul 7, 2014 at 7:38 PM, innowireless TaeYun Kim <taeyun....@innowireless.co.kr> wrote:
> Hi,
>
> I need advice on the best practice for this implementation.
>
> The operating environment is as follows:
>
> - Log data files arrive irregularly.
> - The size of a log data file ranges from 3.9 KB to 8.5 MB; the average is about 1 MB.
> - The number of records in a data file ranges from 13 to 22,000 lines; the average is about 2,700 lines.
> - Data files must be post-processed before aggregation.
> - The post-processing algorithm can be changed.
> - Post-processed files are managed separately from the original data files, since the post-processing algorithm might change.
> - Daily aggregation is performed: all post-processed data files must be filtered record-by-record, and aggregations (average, max, min…) are calculated.
> - Since the aggregation is fine-grained, the number of records after aggregation is not so small; it can be about half the number of the original records.
> - At some point, the number of post-processed files can reach about 200,000.
> - A data file should be able to be deleted individually.
>
> In a test, I tried to process 160,000 post-processed files with Spark, starting from sc.textFile() with a glob path, and it failed with an OutOfMemory exception on the driver process.
>
> What is the best practice for handling this kind of data?
> Should I use HBase instead of plain files to save the post-processed data?
>
> Thank you.
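For reference, a minimal sketch of the record-by-record filtering and fine-grained aggregation described above; the "key,value" record layout, paths, and field positions are assumptions for illustration, not details from the original post:

import org.apache.spark.{SparkConf, SparkContext}

object DailyAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-aggregation"))

    // Loading ~160,000 small files through a glob like this is the step that ran
    // the driver out of memory in the test described above.
    val lines = sc.textFile("/data/postprocessed/*")

    // Record-by-record filtering, then a fine-grained aggregation
    // (average, max, min) per key.
    val stats = lines
      .map(_.split(","))
      .filter(_.length == 2) // keep only well-formed records
      .map(f => (f(0), (f(1).toDouble, 1L, f(1).toDouble, f(1).toDouble)))
      .reduceByKey((a, b) =>
        (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3), math.min(a._4, b._4)))
      .mapValues { case (sum, count, max, min) => (sum / count, max, min) }

    stats.saveAsTextFile("/data/aggregated")
    sc.stop()
  }
}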