I've tracked the problem down: it occurs only in standalone mode. In pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu 12.04 64-bit. When I access the directory in Linux directly, everything looks normal; it's only when I access it through Hadoop. Has anyone seen this problem before and does anyone know a solution?
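For now I can work around it on the client side by sorting the listing myself before handing the paths to the job. A rough sketch against the FileSystem API (the "xyz" path is just a placeholder; listStatus() makes no ordering guarantee, so I impose one explicitly):

    import java.util.Arrays;
    import java.util.Comparator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SortedListing {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // listStatus() does not guarantee any particular order, so sort by name.
        FileStatus[] parts = fs.listStatus(new Path("xyz"));
        Arrays.sort(parts, new Comparator<FileStatus>() {
          public int compare(FileStatus a, FileStatus b) {
            return a.getPath().getName().compareTo(b.getPath().getName());
          }
        });

        for (FileStatus part : parts) {
          System.out.println(part.getPath());
        }
      }
    }

Sorting by name works here because the part-NNNNN names are zero-padded to a fixed width, so lexicographic order matches numeric order.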
Thanks,
Sigurd

2012/9/17 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>:

I'm experiencing a strange problem right now. I'm writing part-files to HDFS to provide initial data, and (though it should not make a difference anyway) I write them in ascending order, i.e. part-00000, part-00001, etc. -- in that order. But when I do "hadoop dfs -ls xyz", they are listed in the order part-00001, part-00000, part-00002, part-00003, etc. How is that possible? Why aren't they shown in natural order? The map-side join package also considers them in this order, which causes problems.

2012/9/10 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>:

OK, interesting. Just to confirm: is it okay to distribute quite large files through the DistributedCache? Dataset B could be on the order of gigabytes. Also, if I have far fewer nodes than elements/blocks in A, the probability that every node will have to read (almost) every block of B is quite high, so assuming the DistributedCache is okay here in general, it would be more efficient to use it than to read from HDFS. But what about the case where I have m*n nodes? Then every node would receive all of B while only needing a small fraction of it, right? Could you maybe elaborate on this in a few sentences, just to be sure I understand Hadoop correctly?

Thanks,
Sigurd

2012/9/10 Harsh J <ha...@cloudera.com>:

Sigurd,

Hemanth's recommendation of DistributedCache does fit your requirement - it is a generic way of distributing files and archives to the tasks of a job. It is not something that pushes things into memory automatically; it places them on the local disk of the TaskTracker your task runs on. You can then choose to use a LocalFileSystem impl. to read them out from there, which would end up being (slightly) faster than the same approach applied to MapFiles on HDFS.

--
Harsh J

On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann <sigurd.spieckerm...@gmail.com> wrote:

I checked DistributedCache, but in general I have to assume that none of the datasets fits in memory... That's why I was considering a map-side join, but by default it doesn't fit my problem. I could probably get it to work, though I would have to enforce the requirements of the map-side join.

2012/9/10 Hemanth Yamijala <yhema...@thoughtworks.com>:

Hi,

You could check DistributedCache (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache). It would allow you to distribute data to the nodes where your tasks are run.

Thanks
Hemanth
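Roughly, the usage looks like this (a sketch against the 1.x mapred API; the HDFS path is a placeholder):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSketch {
      // Driver side: ship an HDFS file to the local disk of every task node.
      public static void addToCache(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/sigurd/datasetB/part-00000"), conf);
      }

      // Task side (e.g. in Mapper.configure()): read the localized copy.
      public static void readFromCache(JobConf conf) throws Exception {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
        BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()));
        try {
          String line;
          while ((line = reader.readLine()) != null) {
            // process one record of the cached file
          }
        } finally {
          reader.close();
        }
      }
    }

The file is localized once per node per job, so all tasks on the same node share one local copy.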
On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann <sigurd.spieckerm...@gmail.com> wrote:

Hi,

I would like to perform a map-side join of two large datasets where dataset A consists of m*n elements and dataset B consists of n elements. For the join, every element of dataset B needs to be accessed m times. Each mapper would join one element from A with the corresponding element from B. Elements here are actually data blocks. Is there a performance problem (and a difference compared to a slightly modified map-side join using the join package) if I set dataset A as the map-reduce input and load the relevant element of dataset B directly from HDFS inside the mapper? I could store the elements of B in a MapFile for faster random access (see the sketch below). In the second case, without the join package, I would not have to partition the datasets manually, which would allow a bit more flexibility, but I'm wondering whether HDFS access from inside a mapper is strictly bad. Also, does Hadoop have a cache for such situations, by any chance?

I appreciate any comments!

Sigurd
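The MapFile lookup mentioned above would be roughly the following (a sketch; the path and the key/value types are placeholders for however B is actually keyed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileLookup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Open the MapFile directory holding dataset B (placeholder path).
        MapFile.Reader reader = new MapFile.Reader(fs, "/user/sigurd/datasetB", conf);
        try {
          IntWritable key = new IntWritable(42); // the element of B this mapper needs
          Text value = new Text();
          if (reader.get(key, value) != null) {
            // join the looked-up element of B with the current element of A
            System.out.println(value);
          }
        } finally {
          reader.close();
        }
      }
    }

The reader keeps the MapFile's index in memory, so each get() costs roughly one seek into the data file plus a short forward scan.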