Hi,

Thanks for your mail. I have read a few of those posts, but the solutions always assume the data is already on HDFS. My problem is getting the data onto HDFS in the first place.
One way I can think of is to load batches of the small files into the same folder on each cluster machine: for example, files 1 to 0.3 mil on machine 1, 0.3 to 0.6 mil on machine 2, and so on. Then I can run Spark jobs that read the files locally. Any better solution? Can Flume help here? Any idea is appreciated. (For one possible parallel approach, see the sketch at the bottom of this mail.)

Best
Ayan

On 12 Sep 2016 20:54, "Alonso Isidoro Roman" <alons...@gmail.com> wrote:

> That is a good question, Ayan. A few searches on SO return these:
>
> http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark
>
> http://stackoverflow.com/questions/29025147/how-can-i-merge-spark-results-files-without-repartition-and-copymerge
>
> Good luck, and tell us how this issue goes.
>
> Alonso
>
> Alonso Isidoro Roman
> https://about.me/alonso.isidoro.roman
>
> 2016-09-12 12:39 GMT+02:00 ayan guha <guha.a...@gmail.com>:
>
>> Hi
>>
>> I have a general question: I have 1.6 mil small files, about 200G all
>> put together. I want to put them on HDFS for Spark processing.
>> I know sequence files are the way to go, since putting lots of small
>> files directly on HDFS is bad practice. I could also write code to
>> consolidate the small files into seq files locally.
>> My question: is there any way to do this in parallel, for example
>> using Spark or MR or anything else?
>>
>> Thanks
>> Ayan
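For the consolidation step in the quoted question, here is a minimal sketch of doing it in parallel with Spark, assuming the staged files are text and are reachable from every executor (e.g. via a shared NFS mount at the hypothetical path /data/staging; both paths below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object ConsolidateSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ConsolidateSmallFiles"))

    // wholeTextFiles yields one (path, content) record per small file,
    // so the 1.6 mil files become RDD records instead of 1.6 mil HDFS
    // files. minPartitions spreads the read across the cluster.
    val files = sc.wholeTextFiles("file:///data/staging/*", minPartitions = 512)

    // Write the records back out as a handful of large sequence files
    // (one per partition), which is the HDFS-friendly layout.
    files.coalesce(64).saveAsSequenceFile("hdfs:///data/consolidated")

    sc.stop()
  }
}

Submit it with spark-submit as usual; for binary inputs, sc.binaryFiles is the analogous call. The partition counts (512 in, 64 out) are guesses to tune against the actual file sizes, e.g. 200G / 64 gives roughly 3G per output file.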