I've tracked the problem down: it only occurs in standalone mode. In
pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu
12.04 64-bit. When I access the directory directly in Linux, everything
looks normal; it's only when I access it through Hadoop. Has anyone seen
this problem before and found a solution?

Thanks,
Sigurd

2012/9/17 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>

> I'm experiencing a strange problem right now. I'm writing part-files to
> HDFS to provide initial data and (although it should not make a
> difference anyway) I write them in ascending order, i.e. part-00000,
> part-00001 etc. -- in that order. But when I do "hadoop dfs -ls xyz",
> they are listed in the order part-00001, part-00000, part-00002,
> part-00003 etc. How is that possible? Why aren't they shown in natural
> order? The map-side join package also considers them in this order,
> which causes problems.
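>
> For what it's worth, I can force the order on the client side by sorting
> the listing explicitly before using it -- a minimal sketch against the
> FileSystem API (the directory name is made up):
>
>   import java.util.Arrays;
>   import java.util.Comparator;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class SortedParts {
>     public static void main(String[] args) throws Exception {
>       FileSystem fs = FileSystem.get(new Configuration());
>       // listStatus() makes no ordering guarantee, so sort by name.
>       FileStatus[] parts = fs.listStatus(new Path("xyz"));
>       Arrays.sort(parts, new Comparator<FileStatus>() {
>         public int compare(FileStatus a, FileStatus b) {
>           return a.getPath().getName().compareTo(b.getPath().getName());
>         }
>       });
>       for (FileStatus part : parts) {
>         System.out.println(part.getPath());
>       }
>     }
>   }
>
> But that doesn't help with the join package, which does its own listing.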
>
>
> 2012/9/10 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
>
>> OK, interesting. Just to confirm: is it okay to distribute quite large
>> files through the DistributedCache? Dataset B could be on the order of
>> gigabytes. Also, if I have far fewer nodes than elements/blocks in A,
>> then the probability that every node will have to read (almost) every
>> block of B is quite high, so assuming the DistributedCache is okay here
>> in general, it would be more efficient to use it than to read from HDFS.
>> But what about the case where I have m*n nodes? Then every node would
>> receive all of B while only needing a small fraction of it, right? Could
>> you maybe elaborate on this in a few sentences just to be sure I
>> understand Hadoop correctly?
>>
>> Thanks,
>> Sigurd
>>
>> 2012/9/10 Harsh J <ha...@cloudera.com>
>>
>>> Sigurd,
>>>
>>> Hemanth's recommendation of DistributedCache does fit your requirement
>>> - it is a generic way of distributing files and archives to the tasks
>>> of a job. It does not push anything into memory automatically; instead,
>>> it places the files on the local disk of the TaskTracker your task runs
>>> on. You can then use a LocalFileSystem impl. to read them from there,
>>> which would end up being (slightly) faster than the same approach
>>> applied to MapFiles on HDFS.
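>>>
>>> Roughly like this -- an untested sketch against the old mapred API,
>>> with a made-up cache path:
>>>
>>>   import java.io.IOException;
>>>   import org.apache.hadoop.filecache.DistributedCache;
>>>   import org.apache.hadoop.fs.FileSystem;
>>>   import org.apache.hadoop.fs.Path;
>>>   import org.apache.hadoop.mapred.JobConf;
>>>   import org.apache.hadoop.mapred.MapReduceBase;
>>>
>>>   // Driver side you would do something like:
>>>   //   DistributedCache.addCacheFile(new URI("/data/b/part-00000"), conf);
>>>   public class CacheAwareMapper extends MapReduceBase {
>>>     private FileSystem localFs;
>>>     private Path cachedFile;
>>>
>>>     @Override
>>>     public void configure(JobConf conf) {
>>>       try {
>>>         // Files added via addCacheFile() show up here as paths on the
>>>         // TaskTracker's local disk, not on HDFS.
>>>         Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
>>>         cachedFile = localFiles[0];
>>>         localFs = FileSystem.getLocal(conf);
>>>         // Read it with localFs.open(cachedFile); no network round trips.
>>>       } catch (IOException e) {
>>>         throw new RuntimeException(e);
>>>       }
>>>     }
>>>   }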
>>>
>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
>>> <sigurd.spieckerm...@gmail.com> wrote:
>>> > I checked DistributedCache, but in general I have to assume that
>>> > none of the datasets fits in memory... That's why I was considering
>>> > a map-side join, but by default it doesn't fit my problem. I could
>>> > probably get it to work, though I would have to enforce the
>>> > requirements of the map-side join.
>>> >
>>> >
>>> > 2012/9/10 Hemanth Yamijala <yhema...@thoughtworks.com>
>>> >>
>>> >> Hi,
>>> >>
>>> >> You could check DistributedCache
>>> >> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
>>> >> It would allow you to distribute data to the nodes where your tasks
>>> >> are run.
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >>
>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>>> >> <sigurd.spieckerm...@gmail.com> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> I would like to perform a map-side join of two large datasets where
>>> >>> dataset A consists of m*n elements and dataset B consists of n
>>> >>> elements. For the join, every element of dataset B needs to be
>>> >>> accessed m times. Each mapper would join one element from A with the
>>> >>> corresponding element from B. Elements here are actually data
>>> >>> blocks. Is there a performance problem (and a difference compared to
>>> >>> a slightly modified map-side join using the join package) if I set
>>> >>> dataset A as the map-reduce input and load the relevant element from
>>> >>> dataset B directly from HDFS inside the mapper? I could store the
>>> >>> elements of B in a MapFile for faster random access. In the second
>>> >>> case, without the join package, I would not have to partition the
>>> >>> datasets manually, which would allow a bit more flexibility, but I'm
>>> >>> wondering whether HDFS access from inside a mapper is strictly bad.
>>> >>> Also, does Hadoop have a cache for such situations by any chance?
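>>> >>>
>>> >>> To make the second variant concrete, this is roughly what I have in
>>> >>> mind inside the mapper (just a sketch; the path and key/value types
>>> >>> are made up):
>>> >>>
>>> >>>   import java.io.IOException;
>>> >>>   import org.apache.hadoop.conf.Configuration;
>>> >>>   import org.apache.hadoop.fs.FileSystem;
>>> >>>   import org.apache.hadoop.io.IntWritable;
>>> >>>   import org.apache.hadoop.io.MapFile;
>>> >>>   import org.apache.hadoop.io.Text;
>>> >>>
>>> >>>   public class BLookup {
>>> >>>     public static Text lookup(Configuration conf, int key)
>>> >>>         throws IOException {
>>> >>>       FileSystem fs = FileSystem.get(conf);
>>> >>>       // A MapFile keeps a sorted index, so get() is a random
>>> >>>       // access rather than a full scan.
>>> >>>       MapFile.Reader reader =
>>> >>>           new MapFile.Reader(fs, "/data/B.map", conf);
>>> >>>       try {
>>> >>>         Text value = new Text();
>>> >>>         return (Text) reader.get(new IntWritable(key), value);
>>> >>>       } finally {
>>> >>>         reader.close();
>>> >>>       }
>>> >>>     }
>>> >>>   }
>>> >>>
>>> >>> In practice I would open the reader once in configure() and reuse
>>> >>> it across map() calls instead of reopening it per lookup.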
>>> >>>
>>> >>> I appreciate any comments!
>>> >>>
>>> >>> Sigurd
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>
