Hi

Thanks for your mail. I have read a few of those posts, but the solutions
always assume the data is already on HDFS. My problem is getting the data
onto HDFS in the first place.

One way I can think of is to spread the small files across the cluster
machines, putting each machine's share in the same local folder: for
example, files 1-0.3 mil on machine 1, files 0.3-0.6 mil on machine 2, and
so on. Then I can run a job on each machine that reads its files locally,
along the lines of the sketch below.
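To make that concrete, here is a rough, untested sketch of what I would run
on each machine against its local slice, packing the files into one
SequenceFile per machine under a shared HDFS staging dir (the namenode URI
and all paths are placeholders):

import java.nio.file.{Files, Paths}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}
import scala.collection.JavaConverters._

// Packs every file in a local directory into one SequenceFile on HDFS:
// key = file name, value = raw bytes. Run one instance per machine, each
// pointed at that machine's slice and a distinct output path.
object PackToSeq {
  def main(args: Array[String]): Unit = {
    val localDir = args(0) // e.g. /data/slice (this machine's share)
    val hdfsOut  = args(1) // e.g. hdfs://namenode:8020/staging/part-host1.seq
    val writer = SequenceFile.createWriter(new Configuration(),
      SequenceFile.Writer.file(new Path(hdfsOut)),
      SequenceFile.Writer.keyClass(classOf[Text]),
      SequenceFile.Writer.valueClass(classOf[BytesWritable]))
    try {
      for (p <- Files.list(Paths.get(localDir)).iterator().asScala
           if Files.isRegularFile(p)) {
        writer.append(new Text(p.getFileName.toString),
                      new BytesWritable(Files.readAllBytes(p)))
      }
    } finally writer.close()
  }
}

That way each machine uploads and consolidates its own slice in parallel,
and the cluster ends up with one large seq file per machine instead of 1.6
mil small files.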

Is there a better solution? Can Flume help here?
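Alternatively, if I first push the raw files as-is into an HDFS staging
directory, I guess the consolidation itself could be a small Spark job
along these lines (again just a sketch; the paths and partition count are
placeholders, and I am not sure how well binaryFiles copes with 1.6 mil
input paths):

// Consolidate small files already staged on HDFS into sequence files:
// key = file path, value = raw bytes.
val pairs = sc.binaryFiles("hdfs:///staging/raw", minPartitions = 256)
  .mapValues(_.toArray)
pairs.saveAsSequenceFile("hdfs:///staging/consolidated")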

Any idea is appreciated.

Best
Ayan
On 12 Sep 2016 20:54, "Alonso Isidoro Roman" <alons...@gmail.com> wrote:

> That is a good question, Ayan. A few searches on SO return:
>
> http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark
>
> http://stackoverflow.com/questions/29025147/how-can-i-merge-spark-results-files-without-repartition-and-copymerge
>
>
> Good luck, and let us know how this issue turns out.
>
> Alonso
>
>
> Alonso Isidoro Roman
> https://about.me/alonso.isidoro.roman
>
> 2016-09-12 12:39 GMT+02:00 ayan guha <guha.a...@gmail.com>:
>
>> Hi
>>
>> I have a general question: I have 1.6 mil small files, about 200G all put
>> together, and I want to put them on HDFS for Spark processing.
>> I know sequence files are the way to go, because putting lots of small
>> files on HDFS is bad practice. I could also write code to consolidate the
>> small files into seq files locally.
>> My question: is there any way to do this in parallel, for example using
>> Spark or MR or anything else?
>>
>> Thanks
>> Ayan
>>
>
>
