You can either use sc.wholeTextFiles (which gives far fewer partitions than sc.textFile over many small files) followed by a flatMap to split the file contents into records, or give the driver process more memory with --driver-memory 20g and then call RDD.repartition(small number) right after you load the data. -Xiangrui
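For concreteness, a minimal sketch of both options; the input path and the partition count are placeholders, not values from this thread:

import org.apache.spark.{SparkConf, SparkContext}

object SmallFilesLoad {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-files-load"))

    // Option 1: wholeTextFiles returns one (path, content) pair per file, so the
    // number of partitions stays small even with ~160,000 files; flatMap then
    // splits each file's content back into individual records.
    val viaWholeTextFiles = sc
      .wholeTextFiles("/data/postprocessed/*")
      .flatMap { case (_, content) => content.split("\n") }

    // Option 2: keep sc.textFile (submit the job with --driver-memory 20g) and
    // collapse the large number of partitions immediately after loading.
    val viaTextFile = sc
      .textFile("/data/postprocessed/*")
      .repartition(200) // a small, cluster-appropriate partition count

    println(viaWholeTextFiles.count())
    println(viaTextFile.count())
    sc.stop()
  }
}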
On Mon, Jul 7, 2014 at 7:38 PM, innowireless TaeYun Kim <taeyun....@innowireless.co.kr> wrote:
> Hi,
>
> I need advice on the best practice for this implementation.
>
> The operating environment is as follows:
>
> - Log data files arrive irregularly.
> - The size of a log data file ranges from 3.9 KB to 8.5 MB; the average is about 1 MB.
> - The number of records in a data file ranges from 13 to 22,000 lines; the average is about 2,700 lines.
> - Data files must be post-processed before aggregation.
> - The post-processing algorithm can be changed.
> - Post-processed files are managed separately from the original data files, since the post-processing algorithm might change.
> - Daily aggregation is performed: all post-processed data files must be filtered record-by-record, and aggregations (average, max, min…) are calculated.
> - Since the aggregation is fine-grained, the number of records after aggregation is not so small; it can be about half the number of the original records.
> - At some point, the number of post-processed files can reach about 200,000.
> - A data file should be able to be deleted individually.
>
> In a test, I tried to process 160,000 post-processed files with Spark, starting from sc.textFile() with a glob path, and it failed with an OutOfMemory exception on the driver process.
>
> What is the best practice for handling this kind of data?
> Should I use HBase instead of plain files to save the post-processed data?
>
> Thank you.
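For reference, a minimal sketch of the record-by-record filtering and fine-grained aggregation described above; the "key,value" record layout, paths, and field positions are assumptions for illustration, not details from the original post:

import org.apache.spark.{SparkConf, SparkContext}

object DailyAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-aggregation"))

    // Loading ~160,000 small files through a glob like this is the step that ran
    // the driver out of memory in the test described above.
    val lines = sc.textFile("/data/postprocessed/*")

    // Record-by-record filtering, then a fine-grained aggregation
    // (average, max, min) per key.
    val stats = lines
      .map(_.split(","))
      .filter(_.length == 2) // keep only well-formed records
      .map(f => (f(0), (f(1).toDouble, 1L, f(1).toDouble, f(1).toDouble)))
      .reduceByKey((a, b) =>
        (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3), math.min(a._4, b._4)))
      .mapValues { case (sum, count, max, min) => (sum / count, max, min) }

    stats.saveAsTextFile("/data/aggregated")
    sc.stop()
  }
}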