That is a good question, Ayan. A few searches on SO turn up:
http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark
http://stackoverflow.com/questions/29025147/how-can-i-merge-spark-results-files-without-repartition-and-copymerge

Good luck, and let us know how it goes.

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2016-09-12 12:39 GMT+02:00 ayan guha <guha.a...@gmail.com>:

> Hi
>
> I have a general question: I have 1.6 million small files, about 200G all put
> together. I want to put them on HDFS for Spark processing.
> I know sequence files are the way to go, because putting small files
> directly on HDFS is bad practice. I can also write code to consolidate the
> small files into sequence files locally.
> My question: is there any way to do this in parallel, for example using
> Spark or MR or anything else?
>
> Thanks
> Ayan
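Since the question asks how to build the sequence files in parallel, here is a minimal sketch of doing it with Spark itself, along the lines of the answers in those SO threads. The input/output paths and the partition count (400) are illustrative assumptions, not from the original question; `wholeTextFiles` assumes the small files are text and each fits comfortably in memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConsolidateSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("consolidate-small-files"))

    // wholeTextFiles reads each file as a (path, content) pair, so the
    // original filename survives as the key in the sequence file.
    // minPartitions spreads the read across the cluster.
    val files = sc.wholeTextFiles("hdfs:///landing/small-files/*", minPartitions = 400)

    // coalesce controls how many (larger) output part-files are produced;
    // ~200G / 400 gives ~512MB per sequence file, a few HDFS blocks each.
    files.coalesce(400)
         .saveAsSequenceFile("hdfs:///consolidated/seq")

    sc.stop()
  }
}
```

Reading the result back later is just `sc.sequenceFile[String, String]("hdfs:///consolidated/seq")`. If the files were copied to HDFS first, the whole job parallelizes across the cluster; if they are still on one local disk, the read itself remains the bottleneck regardless of the framework.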