That's a good question, Ayan. A few searches on SO turn up these:

http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark

http://stackoverflow.com/questions/29025147/how-can-i-merge-spark-results-files-without-repartition-and-copymerge
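
For what it's worth, here is a rough, untested sketch along the lines of those answers, assuming the files are plain text and already reachable from the cluster (the paths, app name, and partition counts below are just placeholders you would tune for your setup):

import org.apache.spark.{SparkConf, SparkContext}

object SmallFilesToSeqFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-files-to-seqfile"))

    // wholeTextFiles returns (path, content) pairs; the second argument is a hint
    // for the minimum number of partitions so ~1.6M files are read in parallel.
    val files = sc.wholeTextFiles("hdfs:///input/small-files", 1000)

    // repartition controls how many (larger) output files you end up with;
    // saveAsSequenceFile writes each record keyed by its original path.
    files
      .repartition(200)
      .saveAsSequenceFile("hdfs:///output/consolidated-seqfile")

    sc.stop()
  }
}

Keep in mind wholeTextFiles pulls each whole file into memory as a single string, which should be fine for small files but not for large ones.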


Good luck, and let us know how it turns out.

Alonso


Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2016-09-12 12:39 GMT+02:00 ayan guha <guha.a...@gmail.com>:

> Hi
>
> I have a general question: I have 1.6 million small files, about 200 GB all
> put together. I want to put them on HDFS for Spark processing.
> I know sequence files are the way to go, because putting lots of small files
> on HDFS is not good practice. I can also write code to consolidate the small
> files into sequence files locally.
> My question: is there any way to do this in parallel, for example using
> Spark, MR, or anything else?
>
> Thanks
> Ayan
>
