Hi Ayan,

"My problem is to get data on to HDFS for the first time."

Well, you have to copy them onto the cluster. You can load them into HDFS
with this simple command:

hdfs dfs -put $LOCAL_SRC_DIR $HDFS_PATH

Then I think you would have to use coalesce in order to create one uber,
super-mega file :) but I have not had to do it myself yet, so maybe I am
wrong.
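
Something along these lines might do it (an untested sketch; the paths and
the partition count are placeholders, not anything from your setup):

import org.apache.spark.{SparkConf, SparkContext}

object MergeSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-small-files"))

    // Read everything under the staging directory; Spark creates roughly one
    // partition per small file / HDFS block.
    val lines = sc.textFile("hdfs:///staging/raw/*")

    // coalesce shrinks the number of output partitions, so you end up with a
    // few large files instead of millions of tiny ones. coalesce(1) would
    // give you a single (possibly huge) file.
    lines.coalesce(16).saveAsTextFile("hdfs:///data/merged")

    sc.stop()
  }
}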

Please take a look at these posts and let us know how you deal with it.

https://stuartsierra.com/2008/04/24/a-million-little-files

http://blog.cloudera.com/blog/2009/02/the-small-files-problem/


"One way I can think of is to load small files on each cluster machines on
the same folder. For example load file 1-0.3 mil on machine 1, 0.3-0.6 mil
on machine 2 and so on. Then I can run spark jobs which will locally read
files. "


Well, Hadoop does not work that way. When you load data into a Hadoop
cluster, the data is distributed across every machine belonging to your
cluster, and the files are split into blocks spread over those machines. I
think what you are describing is data locality, isn't it?
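
In other words, once the files are on HDFS you get the locality for free:
Spark creates one task per HDFS block and tries to run it on a node that
holds a replica of that block. You can even see it from the API (untested
sketch, the path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("locality-check"))

    // One partition per HDFS block of the input.
    val rdd = sc.textFile("hdfs:///data/input")

    // preferredLocations lists the datanodes holding each block's replicas;
    // the scheduler tries to run the task for that partition on one of them.
    rdd.partitions.take(3).foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}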

"Any better solution? Can flume help here?"

Of course Flume can do the job, but you will have the small-files problem
anyway: you still have to create an uber file before you upload it to HDFS.
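
That said, if the raw files do land on HDFS first (for example after a
plain hdfs dfs -put), you can do the consolidation in parallel with Spark,
which was your original question: pack them into a SequenceFile of
(path, content) pairs. A rough, untested sketch, with placeholder paths:

import org.apache.spark.{SparkConf, SparkContext}

object SmallFilesToSequenceFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-files-to-seqfile"))

    // wholeTextFiles returns an RDD of (fileName, fileContent) pairs,
    // one pair per small file.
    val files = sc.wholeTextFiles("hdfs:///staging/small-files")

    // Write a handful of large SequenceFiles instead of 1.6 million tiny files.
    files.coalesce(64).saveAsSequenceFile("hdfs:///data/consolidated")

    sc.stop()
  }
}

wholeTextFiles loads each file completely into memory, which is fine here
precisely because the files are small.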

Regards




Alonso Isidoro Roman
about.me/alonso.isidoro.roman

2016-09-12 14:11 GMT+02:00 ayan guha <guha.a...@gmail.com>:

> Hi
>
> Thanks for your mail. I have read few of those posts. But always I see
> solutions assume data is on hdfs already. My problem is to get data on to
> HDFS for the first time.
>
> One way I can think of is to load small files on each cluster machines on
> the same folder. For example load file 1-0.3 mil on machine 1, 0.3-0.6 mil
> on machine 2 and so on. Then I can run spark jobs which will locally read
> files.
>
> Any better solution? Can flume help here?
>
> Any idea is appreciated.
>
> Best
> Ayan
> On 12 Sep 2016 20:54, "Alonso Isidoro Roman" <alons...@gmail.com> wrote:
>
>> That is a good question Ayan. A few searches on so returns me:
>>
>> http://stackoverflow.com/questions/31009834/merge-multiple-
>> small-files-in-to-few-larger-files-in-spark
>>
>> http://stackoverflow.com/questions/29025147/how-can-i-merge-
>> spark-results-files-without-repartition-and-copymerge
>>
>>
>> good luck, tell us something about this issue
>>
>> Alonso
>>
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> <https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>
>>
>> 2016-09-12 12:39 GMT+02:00 ayan guha <guha.a...@gmail.com>:
>>
>>> Hi
>>>
>>> I have a general question: I have 1.6 mil small files, about 200G all
>>> put together. I want to put them on hdfs for spark processing.
>>> I know sequence file is the way to go because putting small files on
>>> hdfs is not correct practice. Also, I can write a code to consolidate small
>>> files to seq files locally.
>>> My question: is there any way to do this in parallel, for example using
>>> spark or mr or anything else.
>>>
>>> Thanks
>>> Ayan
>>>
>>
>>
