Re: how to differentiate which input directory current record comes from?

2012-12-15 Thread Sigurd Spieckermann
If you use the new API, you can access the MapContext object in the setup method of the mapper. Then, you can get the input split with MapContext#getInputSplit(), cast it to FileSplit and obtain the path of the file the current split is part of through the FileSplit#getPath() method. All records of
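The approach described above can be sketched as follows — a minimal sketch against the new (org.apache.hadoop.mapreduce) API; the key/value types and the use of the parent directory name are placeholder assumptions:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: resolve which input file (and hence which input directory)
// the current map task is reading.
public class PathAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Path inputPath;

    @Override
    protected void setup(Context context) {
        // For file-based input formats, getInputSplit() returns a
        // FileSplit, whose getPath() names the file this split belongs to.
        FileSplit split = (FileSplit) context.getInputSplit();
        inputPath = split.getPath();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // inputPath.getParent() is the directory the record came from.
        context.write(new Text(inputPath.getParent().getName()), value);
    }
}
```

All records handled by one map task come from the same split, so the path only needs to be resolved once in setup().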

Re: Tell Hadoop to store pairs of files at the same location(s) on HDFS

2012-12-05 Thread Sigurd Spieckermann
but may arrive soon. On Wed, Dec 5, 2012 at 11:23 PM, Sigurd Spieckermann wrote: Hi guys, I have been wondering if there's a way (hack'ish would be okay too) to tell Hadoop that two files shall be stored together at the same location(s). It would benefit map-side join performance if it

Tell Hadoop to store pairs of files at the same location(s) on HDFS

2012-12-05 Thread Sigurd Spieckermann
Hi guys, I have been wondering if there's a way (hack'ish would be okay too) to tell Hadoop that two files shall be stored together at the same location(s). It would benefit map-side join performance if it could be done somehow because all map tasks would be able to read data from a local copy. Do
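Stock HDFS exposes no API to pin two files to the same datanodes (block placement is a namenode-side policy), so this would indeed need a hack or a custom placement policy. What can be done from client code is to check how often the blocks of the two files happen to be co-located — a sketch, assuming both paths live on the default file system:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: print the datanodes hosting each block of a file, to inspect
// whether the partitions to be joined share hosts by chance.
public class BlockLocationCheck {
    public static void printLocations(FileSystem fs, Path file) throws IOException {
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(file + " @" + block.getOffset()
                    + " -> " + String.join(",", block.getHosts()));
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        printLocations(fs, new Path(args[0]));
        printLocations(fs, new Path(args[1]));
    }
}
```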

Re: Spill file compression

2012-11-07 Thread Sigurd Spieckermann
same key that are in the in-memory buffer before the spill and it should be at least a few per spill in my case. This is confusing... 2012/11/7 Sigurd Spieckermann > Hm, maybe I need some clarification on what the combiner exactly does. > From what I understand from "Hadoop - The

Re: Spill file compression

2012-11-07 Thread Sigurd Spieckermann
thout* compression. It seems they aren't compressed, but that's strange because I definitely enabled compression the way I described. 2012/11/7 Sigurd Spieckermann > OK, just wanted to confirm. Maybe there is another problem then. I just > looked at the task logs and there were ~20

Re: Spill file compression

2012-11-07 Thread Sigurd Spieckermann
> may be counting decompressed values of the records written, and not > post-compressed ones. > > On Wed, Nov 7, 2012 at 6:02 PM, Sigurd Spieckermann > wrote: > > Hi guys, > > > > I've encountered a situation where the ratio between "Map output bytes"
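For reference, enabling intermediate (map output / spill) compression in a Hadoop 1.x-era setup looks roughly like this — the Snappy codec is an assumption, and as the reply above suggests, counters such as "Map output bytes" may still report pre-compression sizes:

```xml
<!-- mapred-site.xml (Hadoop 1.x-era property names) -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```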

Re: Data locality of map-side join

2012-11-06 Thread Sigurd Spieckermann
. There is zero use of DistributedCache - the only decisions are >> made based on the expression (i.e. to select which form of joining >> record reader to use). >> >> Enhancements to this may be accepted though, so feel free to file some >> JIRAs if you have someth

Re: Data locality of map-side join

2012-10-23 Thread Sigurd Spieckermann
specify the > smaller data set using some hints in query. > > > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > > -Original Message- > From: Sigurd Spieckermann > Date: Mon, 22 Oct 2012 22:29:15 > To: > Reply-To: user@hadoop.apache.org >

Data locality of map-side join

2012-10-22 Thread Sigurd Spieckermann
Hi guys, I've been trying to figure out whether a map-side join using the join-package does anything clever regarding data locality with respect to at least one of the partitions to join. To be more specific, if I want to join two datasets and some partition of dataset A is larger than the co
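For reference, wiring up such a map-side join with the old-API join package looks roughly like this — a sketch where the inner join and KeyValueTextInputFormat are assumptions, and both inputs must be sorted by key and partitioned identically (same number of part files, same partitioner):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

// Sketch: configure a map-side inner join over two pre-partitioned,
// pre-sorted datasets with the old-API join package.
public class MapSideJoinSetup {
    public static void configure(JobConf conf, Path a, Path b) {
        conf.setInputFormat(CompositeInputFormat.class);
        // compose() builds the join expression that CompositeInputFormat
        // reads from the mapred.join.expr property.
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class, a, b));
    }
}
```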

Join-package in new API?

2012-10-10 Thread Sigurd Spieckermann
Hi, I've just noticed that the join-package only exists in the old map-reduce API. Is there a particular reason why it's not in the new API? (deprecated maybe?) If so, what is the approach to take in order to perform this map-side join strategy with the new API? Thanks, Sigurd

Re: Join-package combiner number of input and output records the same

2012-09-25 Thread Sigurd Spieckermann
each map task takes one split and the combiner operates only on the key-value pairs within one split. That's why the combiner has no effect in my case. Is there a way to combine the mapper outputs of multiple splits before they are sent off to the reducer? 2012/9/25 Sigurd Spieckermann mailto

Re: Join-package combiner number of input and output records the same

2012-09-25 Thread Sigurd Spieckermann
split. That's why the combiner has no effect in my case. Is there a way to combine the mapper outputs of multiple splits before they are sent off to the reducer? 2012/9/25 Sigurd Spieckermann > Maybe one more note: the combiner and the reducer class are the same and > in the reduce-phas

Re: Reading from HDFS from inside the mapper

2012-09-17 Thread Sigurd Spieckermann
ed to appear in alphabetical order.". That may > just be what is biting you, since standalone mode uses LFS. > > On Mon, Sep 17, 2012 at 6:45 PM, Sigurd Spieckermann > wrote: > > I've tracked down the problem to only occur in standalone mode. In > > pseudo-distributed
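Two things interact here: raw directory listings (notably on the local file system in standalone mode) are not guaranteed to be sorted at all, so the caller must sort them, and lexicographic order diverges from numeric order once there are ten or more unpadded part names — which is presumably why Hadoop's own output files are zero-padded like part-00000. A pure-Java illustration:

```java
import java.util.Arrays;

// Demonstration: lexicographic (String) ordering of part-file names is
// not numeric ordering once a job has ten or more unpadded part files.
public class PartFileOrder {
    public static String[] sorted(String[] names) {
        String[] copy = names.clone();
        Arrays.sort(copy);  // plain lexicographic sort, as the join package expects
        return copy;
    }

    public static void main(String[] args) {
        String[] parts = {"part-2", "part-10", "part-0", "part-1"};
        // Lexicographically, "part-10" sorts before "part-2".
        System.out.println(Arrays.toString(sorted(parts)));
        // -> [part-0, part-1, part-10, part-2]
    }
}
```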

Re: Reading from HDFS from inside the mapper

2012-09-17 Thread Sigurd Spieckermann
is problem before and knows a solution? Thanks, Sigurd 2012/9/17 Sigurd Spieckermann > I'm experiencing a strange problem right now. I'm writing part-files to > the HDFS providing initial data and (which should actually not make a > difference anyway) write them in ascending o

Re: Reading from HDFS from inside the mapper

2012-09-17 Thread Sigurd Spieckermann
", they are in the order part-1, part-0, part-2, part-3 etc. How is that possible? Why aren't they shown in natural order? Also the map-side join package considers them in this order which causes problems. 2012/9/10 Sigurd Spieckermann > OK, interesting. Just to confirm: is

Re: Reading from HDFS from inside the mapper

2012-09-10 Thread Sigurd Spieckermann
omatically in memory, > but on the local disk of the TaskTracker your task runs on. You can > choose to then use a LocalFileSystem impl. to read it out from there, > which would end up being (slightly) faster than your same approach > applied to MapFiles on HDFS. > > On Mon, Sep 10

Re: Reading from HDFS from inside the mapper

2012-09-10 Thread Sigurd Spieckermann
> > On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann < > sigurd.spieckerm...@gmail.com> wrote: > >> Hi, >> >> I would like to perform a map-side join of two large datasets where >> dataset A consists of m*n elements and dataset B consists of n ele

Reading from HDFS from inside the mapper

2012-09-10 Thread Sigurd Spieckermann
Hi, I would like to perform a map-side join of two large datasets where dataset A consists of m*n elements and dataset B consists of n elements. For the join, every element in dataset B needs to be accessed m times. Each mapper would join one element from A with the corresponding element from B. E
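One way to do the lookups described above is to keep dataset B as a MapFile on HDFS and open a reader once per map task — a sketch where the key/value types and the "b.mapfile.dir" configuration key are assumptions for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: join each A-record against dataset B by random lookup into a
// MapFile on HDFS; the reader is opened once in setup(), not per record.
public class LookupJoinMapper extends Mapper<IntWritable, Text, IntWritable, Text> {
    private MapFile.Reader bReader;
    private final Text bValue = new Text();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        bReader = new MapFile.Reader(fs, conf.get("b.mapfile.dir"), conf);
    }

    @Override
    protected void map(IntWritable key, Text aValue, Context context)
            throws IOException, InterruptedException {
        if (bReader.get(key, bValue) != null) {
            context.write(key, new Text(aValue + "\t" + bValue));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        bReader.close();
    }
}
```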

Hadoop CompositeInputFormat block matrix-vector multiplication

2012-09-04 Thread Sigurd Spieckermann
Hi guys, I am trying to implement a block matrix-vector multiplication algorithm with Hadoop according to the schematics from http://i.stanford.edu/~ullman/mmds/ch5.pdf page 162. My matrix is going to be sparse and the vector dense which is exactly what is required in PageRank as well. The vector
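In the block scheme from the MMDS chapter referenced above, the matrix is cut into blocks A_ij and the vector into stripes x_j; each map task computes the partial product A_ij * x_j, and the reduce step sums all partials belonging to result stripe i. The per-block work can be sketched in plain Java, with the sparse block stored as row/column/value triples whose indices are local to the block:

```java
import java.util.Arrays;

// Sketch of the per-block map work: sparse block A_ij (triples) times
// vector stripe x_j yields a partial result for output stripe i; the
// reduce step then sums the partials across j.
public class BlockMatVec {
    public static double[] partialProduct(int[] rows, int[] cols,
                                          double[] vals, double[] xStripe,
                                          int blockRows) {
        double[] partial = new double[blockRows];
        for (int k = 0; k < vals.length; k++) {
            partial[rows[k]] += vals[k] * xStripe[cols[k]];
        }
        return partial;
    }

    public static void main(String[] args) {
        // 2x2 block [[1, 0], [2, 3]] times stripe [10, 1]
        double[] p = partialProduct(new int[]{0, 1, 1}, new int[]{0, 0, 1},
                new double[]{1, 2, 3}, new double[]{10, 1}, 2);
        System.out.println(Arrays.toString(p));  // -> [10.0, 23.0]
    }
}
```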