Re: Distributing Keys across Reducers

2012-07-20 Thread John Armstrong
On 07/20/2012 09:20 AM, Dave Shine wrote: I believe this is referred to as a “key skew problem”, which I know is heavily dependent on the actual data being processed. Can anyone point me to any blog posts, white papers, etc. that might give me some options on how to deal with this issue? I don
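
One commonly used option for key skew is to "salt" a hot key: append a small suffix so that its records spread across several reducers, then strip the suffix and merge the partial results in a second pass. The sketch below is plain Java with no Hadoop dependency. The class name, the `#` separator, and the round-robin salt are all assumptions made for illustration; `partition` applies the usual non-negative-hash-modulo formula to a Java `String`, which is not necessarily the hash Hadoop computes for its own key types.

```java
import java.util.Set;
import java.util.TreeSet;

public class SaltedKeys {
    // Non-negative hash modulo reducer count (the usual hash-partitioning formula).
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // First pass: append a round-robin salt so one hot key becomes numSalts keys.
    static String salt(String key, long recordIndex, int numSalts) {
        return key + "#" + (recordIndex % numSalts);
    }

    // Second pass: strip the salt so partial results merge under the original key.
    static String unsalt(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }

    // Which reducers receive records for this key when it is salted numSalts ways.
    static Set<Integer> reducersUsed(String key, int records, int numSalts, int numReducers) {
        Set<Integer> used = new TreeSet<>();
        for (long i = 0; i < records; i++) {
            used.add(partition(salt(key, i, numSalts), numReducers));
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.println("unsalted: " + reducersUsed("hot", 1000, 1, 10));
        System.out.println("salted:   " + reducersUsed("hot", 1000, 8, 10));
    }
}
```

Salting only helps when the reduce operation can be applied in two stages (sums, counts, maxima and the like), and it costs a second aggregation step over the unsalted keys.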

Re: Sharing data between maps

2012-04-04 Thread John Armstrong
On 04/04/2012 05:00 PM, Kevin Savage wrote: However, what we have is one big file of design data that needs to go to all the maps and many big files of climate data that need to go to one map each. I've not been able to work out if there is a good way of doing this in Hadoop. It sounds like "

Re: Changing default task JVM classpath

2012-02-16 Thread John Armstrong
On 02/16/2012 10:15 AM, Harsh J wrote: That is how HBase does it: HBaseConfiguration at driver loads up HBase *xml file configs from driver classpath (or user set() entries, either way), and then submits that as part of job.xml. These configs should be all you need. It should be, and yet I'm ru

Changing default task JVM classpath

2012-02-16 Thread John Armstrong
Hi, everybody. I'm having some difficulties, which I've traced to not having the Accumulo libraries and configuration available in my task JVMs. The most elegant solution -- especially since I will not always have control over the Accumulo configuration files -- would be to make them available t

Re: Overriding remote classes

2011-12-14 Thread John Armstrong
On Wed, 14 Dec 2011 11:04:37 -0500, David Rosenstrauch wrote:
> I ran into the same (known) issue. (See: https://issues.apache.org/jira/browse/MAPREDUCE-1700)
>
> Doesn't look like there's a solution yet.
Thanks; good to know that I'm actually doing the best I can be writing everything to be

Overriding remote classes

2011-12-14 Thread John Armstrong
Hi, there. I've run into an odd situation, and I'm wondering if there's a way around it; I'm trying to use Jackson for some JSON serialization in my program, and I wrote/unit-tested it to work with Jackson 1.9. Then, in integration testing, I started to see some weird version incompatibilities an

Re: Question about how input data is presented to the map function

2011-09-16 Thread John Armstrong
On Fri, 16 Sep 2011 08:26:35 -0500, harry lippy wrote:
> The keys are file offsets into the input file. My question: how did the 'are presented to the map function as key-value pairs' happen? I've run the example on the input file using the java Mapper, Reducer, and the code that runs
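
To make the "keys are file offsets" statement concrete: with Hadoop's text input format, a record reader walks the input and hands the mapper one (byte offset of line start, line text) pair per line. The following is a minimal plain-Java imitation of that idea, not Hadoop's own reader. The class and method names are hypothetical, and it assumes UTF-8 input with `\n` line endings only.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineOffsets {
    // Returns (byte offset of line start -> line text) in file order.
    static Map<Long, String> records(byte[] data) {
        Map<Long, String> out = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                out.put((long) start,
                        new String(data, start, i - start, StandardCharsets.UTF_8));
                start = i + 1; // next line begins after the newline byte
            }
        }
        if (start < data.length) { // final line without a trailing newline
            out.put((long) start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] input = "ab\ncde\nf".getBytes(StandardCharsets.UTF_8);
        records(input).forEach((offset, line) -> System.out.println(offset + "\t" + line));
    }
}
```

The mapper never opens the file itself; it only sees the pairs that the reader produces, which is why the key type is a long offset rather than a line number.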

Re: Passing a Global Variable into a Mapper

2011-09-15 Thread John Armstrong
On Thu, 15 Sep 2011 12:43:57 -0500, Arko Provo Mukherjee wrote:
> Is there a way to pass some data from the driver class to the Mapper class without going through the HDFS?
I generally use the Configuration object embedded in the Job for that. My Tool implements Configurable so I create my job
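
The pattern described here works because the job configuration is a set of string properties that is serialized at submission time and reloaded inside each task. In Hadoop itself the driver would call `conf.set(name, value)` before submitting, and the mapper would read `context.getConfiguration().get(name)` in `setup()`. The sketch below imitates that round trip without Hadoop, using `java.util.Properties` as a stand-in for the Configuration object; the class name and the property name `example.threshold` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ConfPassing {
    static final String KEY = "example.threshold"; // hypothetical property name

    // Driver side: store a small value as a string, then serialize the whole
    // configuration (a stand-in for what is shipped with the job).
    static byte[] driverSide(int threshold) {
        Properties conf = new Properties();
        conf.setProperty(KEY, Integer.toString(threshold));
        try {
            ByteArrayOutputStream shipped = new ByteArrayOutputStream();
            conf.store(shipped, null);
            return shipped.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Task side: reload the shipped configuration and parse the value,
    // falling back to a default when the property is absent.
    static int taskSide(byte[] shipped) {
        Properties conf = new Properties();
        try {
            conf.load(new ByteArrayInputStream(shipped));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Integer.parseInt(conf.getProperty(KEY, "0"));
    }

    public static void main(String[] args) {
        System.out.println(taskSide(driverSide(42)));
    }
}
```

Two consequences follow from the serialization step: values must be small and string-encodable, and anything set after the job has been submitted never reaches the tasks.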

Re: Extending ArrayWritable, Using Combiner and Spill Failed error

2011-08-18 Thread John Armstrong
On Thu, 18 Aug 2011 13:44:22 -0700, vipul sharma wrote:
> *I think the error is due to using combiner. Since combiner is output data in Text and Reducer is expecting IntArrayWritable. If I remove combiner everything works. What am I doing wrong and how can I get the combiner to work? Any hel

Re: Deep Magic on the distributed classpath

2011-07-27 Thread John Armstrong
On Wed, 27 Jul 2011 10:58:17 -0400, David Rosenstrauch wrote:
> There is another, easier approach: if your app inherits from the Tool class / runs via ToolRunner, then your app can inherit the -libjars command line functionality itself.
This is true; the problem with this approach is that

Deep Magic on the distributed classpath

2011-07-27 Thread John Armstrong
So I think I've figured out how to fix my problem with putting files on the distributed classpath by digging through the code Hadoop uses to process -libjars. If I say DistributedCache.addFileToClassPath(hdfsFile,conf); then hdfsFile is added to the distributed cache, but doesn't show up on the c

Re: Adding files to map/reduce classpath

2011-07-27 Thread John Armstrong
On Tue, 26 Jul 2011 12:35:48 -0700, Shrijeet Paliwal wrote:
> **
> See if this (very old) reply from Mikhail helps.
> http://search-hadoop.com/m/QFVD1kEmQT
> Here is the patch he is referring to.
> http://m1.archiveorange.com/m/att/RNVYm/ArchiveOrange_8dEcdJI4bXFkKHBnsll8YzTc8u8a.patch
>
> **repl

Adding files to map/reduce classpath

2011-07-26 Thread John Armstrong
I'm back to trying to add libraries to the classpath instead of handing around a fat JAR. This time I've served up my directory full of JARs on NFS, which each node in my cluster has mounted at /mnt/hadoop-libs. Now my question is how to add that (local) directory to the classpath of the mapper a

Re: Can I use MapWritable as a key?

2011-07-20 Thread John Armstrong
On Tue, 19 Jul 2011 17:02:32 -0700, Choonho Son wrote:
> is it possible job.setOutputKeyClass(MapWritable.class);
As others have said, MapWritable doesn't implement Comparable, so it can't be used as a key. The ArrayWritable of Texts is one idea, but I'd suggest instead implementing your OWN Wri
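
A key type has to do two things: serialize itself, and define a total order so the framework can sort and group it. In Hadoop those two duties are combined in the `WritableComparable` interface. The sketch below shows the shape of such a key in plain Java, using the same `java.io.DataOutput` / `java.io.DataInput` types that Hadoop's `write` and `readFields` methods take. The class `PairKey`, its two fields, and the `roundTrip` helper are hypothetical; a real key would additionally declare `implements WritableComparable<PairKey>`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

public class PairKey implements Comparable<PairKey> {
    private String name = "";
    private int id;

    public PairKey() { } // no-arg constructor so an instance can be created, then filled

    public PairKey(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(id);
    }

    // Deserialize in exactly the same order as write().
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        id = in.readInt();
    }

    // Sort order: by name first, then by id.
    @Override
    public int compareTo(PairKey other) {
        int c = name.compareTo(other.name);
        return c != 0 ? c : Integer.compare(id, other.id);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof PairKey && compareTo((PairKey) o) == 0;
    }

    // Must agree with equals(), since hash-based partitioning relies on it.
    @Override
    public int hashCode() {
        return Objects.hash(name, id);
    }

    // Helper for illustration: serialize a key and read it back into a new instance.
    static PairKey roundTrip(PairKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            PairKey copy = new PairKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping `hashCode` consistent with `equals` matters here because equal keys must land on the same reducer.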

Re: Algorithm for cross product

2011-06-23 Thread John Armstrong
On Wed, 22 Jun 2011 15:16:02 -0700, Steve Lewis wrote:
> Assume I have two data sources A and B
> Assume I have an input format and can generate key values for both A and B
> I want an algorithm which will generate the cross product of all values in A having the key K and all values in B havin
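
One standard way to get a per-key cross product is a reduce-side join: records from both sources are grouped by key, and for each key every value from A is paired with every value from B. The sketch below simulates that in memory with plain Java. It is an illustration under assumptions, not the approach taken in the thread: the class and method names are hypothetical, records are modelled as `{key, value}` string pairs, and each key's values are assumed to fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyedCrossProduct {
    // Convenience constructor for a {key, value} record.
    static String[] rec(String key, String value) {
        return new String[] {key, value};
    }

    // Stand-in for the shuffle: collect all values that share a key.
    static Map<String, List<String>> group(List<String[]> records) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] r : records) {
            grouped.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        }
        return grouped;
    }

    // Stand-in for the reduce step: for each key present in both sources,
    // emit every (A value, B value) combination as "key<TAB>a<TAB>b".
    static List<String> cross(List<String[]> a, List<String[]> b) {
        Map<String, List<String>> groupedA = group(a);
        Map<String, List<String>> groupedB = group(b);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : groupedA.entrySet()) {
            List<String> bValues = groupedB.get(entry.getKey());
            if (bValues == null) {
                continue; // key appears only in A, so it contributes no pairs
            }
            for (String aValue : entry.getValue()) {
                for (String bValue : bValues) {
                    out.add(entry.getKey() + "\t" + aValue + "\t" + bValue);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> a = Arrays.asList(rec("k1", "a1"), rec("k1", "a2"), rec("k2", "a3"));
        List<String[]> b = Arrays.asList(rec("k1", "b1"), rec("k2", "b2"), rec("k2", "b3"));
        cross(a, b).forEach(System.out::println);
    }
}
```

In an actual job the mappers would tag each value with its source so the reducer can tell A from B, and the reducer would need to buffer at least one side per key, which is where large groups become a memory concern.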

Re: Large startup time in remote MapReduce job

2011-06-22 Thread John Armstrong
On Wed, 22 Jun 2011 00:15:56 +0200, Gabor Makrai wrote:
> Fortunately, DistributedCache solved my problem! I put a jar file to HDFS. which contains the necessary classes for the job and I used this:
> *DistributedCache.addFileToClassPath(new Path("/myjar/myjar.jar"), conf);*
Can I ask which ver

Re: When is mapred-site.xml read?

2011-06-21 Thread John Armstrong
On Tue, 21 Jun 2011 06:37:50 -0700, Alex Kozlov wrote:
> However, the job's tasks are executed in a separate JVM and some of the parameters, like max heap from *mapred.java.child.opts*, are set during the job execution. In this case the parameter is coming from the client side where the who

When is mapred-site.xml read?

2011-06-21 Thread John Armstrong
One of my colleagues and I have a little confusion between us as to exactly when mapred-site.xml is read. The pages on hadoop.apache.org don't seem to specify it very clearly. One position is that mapred-site.xml is read by the daemon processes at startup, and so changing a parameter in mapred-si

Re: How is reduce completion % calculated?

2011-06-08 Thread John Armstrong
On Wed, 8 Jun 2011 15:09:41 +0100, Virajith Jalaparti wrote:
> I was looking at the syslog generated by my job run and it looks like the reducers start before the mappers complete. I figured this was the case because even when the Map had <100% completion, the reduce completion % was greater

Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2

2011-06-01 Thread John Armstrong
On Wed, 1 Jun 2011 12:48:51 -0700, Alejandro Abdelnur wrote:
> Do you have all JARs used by your classes in Needed.jar in the DC classpath as well?
needed.jar contains the class Needed, which my mappers need. If the class Needed calls for another class AlsoNeeded in another jar, wouldn't I ge

Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2

2011-06-01 Thread John Armstrong
On Tue, 31 May 2011 15:09:28 -0400, John Armstrong wrote:
> On Tue, 31 May 2011 12:02:28 -0700, Alejandro Abdelnur wrote:
>> What is exactly that does not work?
In the hopes that more information can help, I've dug into the local filesystems on each of my four nodes and retr

Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2

2011-05-31 Thread John Armstrong
On Tue, 31 May 2011 12:02:28 -0700, Alejandro Abdelnur wrote:
> What is exactly that does not work?
Oozie launches a wrapper MapReduce job to run a Java job J1. Oozie's /lib/ directory is provided to the classpath of J1 as expected. This part works. The Java job J1 configures and launches a Ma

Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2

2011-05-30 Thread John Armstrong
On Mon, 30 May 2011 09:43:14 -0700, Alejandro Abdelnur wrote:
> If you still want to start your MR job from your Java action, then your Java action should do all the setup the MapReduceMain class does before starting the MR job (this will ensure delegation tokens and distributed cache is a

Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2

2011-05-30 Thread John Armstrong
On Fri, 27 May 2011 15:47:23 -0700, Alejandro Abdelnur wrote:
> John,
>
> If you are using Oozie, dropping all the JARs your MR jobs needs in the Oozie WF lib/ directory should suffice. Oozie will make sure all those JARs are in the distributed cache.
That doesn't seem to work. I have this

Re: Hadoop problem

2011-05-27 Thread John Armstrong
On Fri, 27 May 2011 13:52:04 +0200, Laurent Hatier wrote:
> I'm a newbie with Hadoop/MapReduce. I've a problem with hadoop. I set some variables in the run function but when Map running, he can't get the value of theses variables...
> If anyone knows the solution :)
By the "run function" do y

Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2

2011-05-26 Thread John Armstrong
On Thu, 26 May 2011 23:17:43 +0530, vishnu krishnan wrote:
> thanks,
>
> if am not using using the map/reduce here, that just i directly sent dat data to the db, what will be the problems?
Look, I hate to be That Guy, especially on my first day on the list but would you mind moving to your

Problems adding JARs to distributed classpath in Hadoop 0.20.2

2011-05-26 Thread John Armstrong
Hi, everybody. I'm running into some difficulties getting needed libraries to map/reduce tasks using the distributed cache. I'm using Hadoop 0.20.2, which from what I can tell is a hard requirement by the client, so more current versions are not really viable options. The code I've inherited is