Thanks Philippe,

I am looking for an answer restricted to HDFS only, because we can do read
and write operations from the CLI using commands like "hadoop fs
-copyFromLocal /(local disk location) /(hdfs path)" and read using "hadoop
fs -text /(hdfs file)" as well.
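
For context, the -copyFromLocal case is also what I do programmatically. A
minimal sketch through the Java FileSystem API (the namenode address and
the paths here are placeholders I made up, not my real setup):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsCopyFromLocal {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Placeholder; normally this is picked up from core-site.xml on the client.
      conf.set("fs.defaultFS", "hdfs://namenode:8020");
      FileSystem fs = FileSystem.get(conf);

      // Equivalent of: hadoop fs -copyFromLocal <local path> <hdfs path>
      fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                           new Path("/user/sidharth/sample.txt"));
    }
  }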

So my questions are:
1) When I write data using the -copyFromLocal command, how is data from the
data queue pushed to the DataStreamer? Do we have only one DataStreamer
which listens to the data queue and stores data on individual datanodes one
by one, or do we have multiple streamers which listen to the data queue and
create a pipeline for each individual packet? (I have put a small write
sketch below to show the kind of write I mean.)

2) Similarly, when we read data, does the client receive packets one after
another in a sequential manner, i.e. does the 2nd datanode wait for the 1st
datanode to send its block first, or is it a parallel process? (A read
sketch is below as well.)
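
Write sketch (question 1): this is the kind of client-side write I am
asking about. My understanding, which may well be wrong, is that the
DFSOutputStream behind this stream buffers the bytes into packets on a data
queue and a DataStreamer thread sends them down the datanode pipeline; the
path is just a placeholder:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());

      // create() returns an FSDataOutputStream; as I understand it, the bytes
      // written here are split into packets and queued for this stream's
      // DataStreamer, which pushes each packet to the first datanode of the
      // pipeline.
      try (FSDataOutputStream out = fs.create(new Path("/tmp/out.txt"))) {
        out.writeBytes("hello hdfs\n");
      }
    }
  }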
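
Read sketch (question 2): and this is the plain client-side read I am
asking about. From the Definitive Guide my understanding is that the
DFSInputStream behind the stream connects to one datanode per block and
moves on to the next block's datanode when the current block ends, while
this loop just sees a continuous stream of bytes; again the path is a
placeholder:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      byte[] buffer = new byte[4096];

      // The client only sees one continuous stream; block switching is
      // handled inside the stream implementation.
      try (FSDataInputStream in = fs.open(new Path("/tmp/out.txt"))) {
        int n;
        while ((n = in.read(buffer)) != -1) {
          System.out.write(buffer, 0, n);
        }
        System.out.flush();
      }
    }
  }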


Thanks for your help in advance.

Sidharth


On 10-Apr-2017 1:50 PM, "Philippe Kernévez" <pkerne...@octo.com> wrote:

> Hi Sidharth,
>
> As it has been explained, HDFS is not just a file system. It's a part of
> the Hadoop platform. To take advantage of HDFS you have to understand how
> Hadoop storage (HDFS) AND YARN processing (say MapReduce) work together
> to implement jobs and parallel processing.
> That means you will have to rethink the design of your programs to
> take advantage of HDFS.
>
> You may start with this kind of tutorial
> https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm
>
> Then have a deeper read of the Hadoop documentation
> http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
>
> Regards,
> Philippe
>
>
>
> On Sun, Apr 9, 2017 at 11:13 PM, daemeon reiydelle <daeme...@gmail.com>
> wrote:
>
>> Readers ARE parallel processes, one per map task. There are defaults in
>> the map phase for how many readers there are for the input file(s). The
>> default is one mapper task per block (or per file, where a file is smaller
>> than the HDFS block size). There is no Java framework per se for splitting
>> up a file (technically not so, but let's simplify, outside of your own
>> custom code).
>>
>>
>> .......
>>
>>
>>
>> Daemeon C.M. Reiydelle
>> USA (+1) 415.501.0198
>> London (+44) (0) 20 8144 9872
>>
>> On Sun, Apr 9, 2017 at 2:50 AM, Sidharth Kumar <
>> sidharthkumar2...@gmail.com> wrote:
>>
>>> Thanks Tariq, it really helped me to understand, but I have one more
>>> doubt: if reading is not a parallel process, then reading a 100 GB file
>>> with an HDFS block size of 128 MB should take a very long time, yet that
>>> is not the case in practice. And my second question: is the write
>>> operation a sequential process as well? Will every datanode have its own
>>> DataStreamer which listens to the data queue to get the packets and create
>>> the pipeline? So, can you kindly help me get a clear idea of HDFS read and
>>> write operations?
>>>
>>> Regards
>>> Sidharth
>>>
>>> On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <donta...@gmail.com> wrote:
>>>
>>> Hi Sidharth,
>>>
>>> When you read data from HDFS using a framework like MapReduce, blocks
>>> of an HDFS file are read in parallel by multiple mappers created in that
>>> particular program. Input splits, to be precise.
>>>
>>> On the other hand, if you have a standalone Java program then it's just a
>>> single-threaded process and will read the data sequentially.
>>>
>>>
>>> On Friday, April 7, 2017, Sidharth Kumar <sidharthkumar2...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for your response. But I didn't understand it yet; if you don't
>>>> mind, can you tell me what you mean by "With Hadoop, the idea is to
>>>> parallelize the readers (one per block for the mapper) with a processing
>>>> framework like MapReduce."?
>>>>
>>>> And also, how will the concept of parallelizing the readers work with HDFS?
>>>>
>>>> Thanks a lot in advance for your help.
>>>>
>>>>
>>>> Regards
>>>> Sidharth
>>>>
>>>> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pkerne...@octo.com> wrote:
>>>>
>>>> Hi Sidharth,
>>>>
>>>> The reads are sequential.
>>>> With Hadoop, the idea is to parallelize the readers (one per block for
>>>> the mapper) with a processing framework like MapReduce.
>>>>
>>>> Regards,
>>>> Philippe
>>>>
>>>>
>>>> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
>>>> sidharthkumar2...@gmail.com> wrote:
>>>>
>>>>> Hi Genies,
>>>>>
>>>>> I have a small doubt about whether the HDFS read operation is a parallel
>>>>> or a sequential process. From my understanding it should be parallel, but
>>>>> if I read "Hadoop: The Definitive Guide" (4th edition), in the anatomy of
>>>>> a file read it says: "Data is streamed from the datanode back to the
>>>>> client, which calls read() repeatedly on the stream (step 4). When the
>>>>> end of the block is reached, DFSInputStream will close the connection to
>>>>> the datanode, then find the best datanode for the next block (step 5).
>>>>> This happens transparently to the client, which from its point of view is
>>>>> just reading a continuous stream."
>>>>>
>>>>> So can you kindly explain to me how the read operation exactly happens?
>>>>>
>>>>>
>>>>> Thanks for your help in advance
>>>>>
>>>>> Sidharth
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Philippe Kernévez
>>>>
>>>>
>>>>
>>>> Technical Director (Switzerland),
>>>> pkerne...@octo.com
>>>> +41 79 888 33 32
>>>>
>>>> Find OCTO on OCTO Talk: http://blog.octo.com
>>>> OCTO Technology http://www.octo.ch
>>>>
>>>>
>>>>
>>>
>>> --
>>>
>>>
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>> <http://about.me/mti>
>>>
>>>
>>>
>>>
>>
>
>
> --
> Philippe Kernévez
>
>
>
> Technical Director (Switzerland),
> pkerne...@octo.com
> +41 79 888 33 32
>
> Find OCTO on OCTO Talk: http://blog.octo.com
> OCTO Technology http://www.octo.ch
>
