Duplicate Output Directories in S3

2009-03-22 Thread S D
I have a Hadoop Streaming program that crawls the web for data items,
processes each retrieved item, and then stores the results on S3. For each
processed item a directory is created on S3 to hold the results of the
processing. At the conclusion of a run I've been getting a duplicate of
each directory: e.g., if I process items A1 and A2, I get two directories
for the results of A1 and two for the results of A2, and the corresponding
directories are identical. I've checked my code and don't see anything
obvious that could lead to this. Furthermore, it appears that only one map
task is handling a given data item. Any suggestions as to what might be
causing this?
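
One thing I plan to try, though this is just a guess on my part, is turning
off speculative execution, since duplicate task attempts writing the same
side-effect output to S3 could plausibly produce identical directories. For
a streaming job that would look something like the following (the input and
output paths and the mapper script are placeholders; if your streaming
version doesn't accept -D, -jobconf should work instead):

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -D mapred.map.tasks.speculative.execution=false \
    -D mapred.reduce.tasks.speculative.execution=false \
    -input crawl/items -output crawl/results \
    -mapper process_item.py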

Thanks,
John


Re: Unable to access job details

2009-03-22 Thread Amareshwari Sriramadasu
Can you look for the exception from Jetty in the JobTracker logs and report
it here? That would tell us the cause of the ERROR 500.
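
Something like the following should pull out the relevant stack trace (the
log path and the hadoop-<user>-jobtracker-<host>.log naming are assumptions
about a default installation; adjust for yours):

grep -A 20 "jobdetails" $HADOOP_HOME/logs/hadoop-*-jobtracker-*.log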


Thanks
Amareshwari
Nathan Marz wrote:
Sometimes I am unable to access a job's details and instead only see the
error below. I am seeing this on the 0.19.2 branch.


HTTP ERROR: 500

Internal Server Error

RequestURI=/jobdetails.jsp

Powered by Jetty://

Does anyone know the cause of this?




Re: hadoop migration

2009-03-22 Thread Stuart Sierra
On Sun, Mar 22, 2009 at 2:17 PM, nitesh bhatia wrote:
> Bigtable?
> Is it open source? I am not sure if Google has released any code for
> Bigtable. So far only one research paper is available.

No, Google has never released source code for MapReduce and BigTable.
Hadoop and HBase attempt to fill that gap.

-Stuart


Re: hadoop migration

2009-03-22 Thread Philip M. White
On Sun, Mar 22, 2009 at 11:47:35PM +0530, nitesh bhatia wrote:
> Bigtable?
> Is it open source? I am not sure if Google has released any code for
> Bigtable. So far only one research paper is available.

HBase is an implementation of BigTable.

-- 
Philip




Re: hadoop migration

2009-03-22 Thread nitesh bhatia
Bigtable?
Is it open source? I am not sure if Google has released any code for
Bigtable. So far only one research paper is available.

--nitesh


On Tue, Mar 17, 2009 at 11:01 AM, Amandeep Khurana wrote:

> AFAIK, Google uses BigTable for most of their backend stuff. The thing
> to note here is that BigTable is much more mature than HBase.
>
> You can try it out and see how it works out for you. Do share your
> results on the mailing list...
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Mon, Mar 16, 2009 at 10:28 PM, W wrote:
>
> > Thanks for the quick response Aman,
> >
> > OK, I see the point now.
> >
> > Currently I'm doing some research on creating a Google Books-like
> > application, using HBase as a backend for storing the files and Solr
> > as the indexer. From this prototype, maybe I can measure how fast
> > HBase is at serving data to clients... (Google uses BigTable for
> > books.google.com, right?)
> >
> > Thanks!
> >
> > Regards,
> > Wildan
> >
> > On Tue, Mar 17, 2009 at 12:13 PM, Amandeep Khurana wrote:
> > > Hypertable is not as mature as HBase yet. The next release of HBase,
> > > 0.20.0, includes some patches which reduce the latency of responses
> > > and make it suitable for use as a backend for a webapp. However, the
> > > current release isn't optimized for this purpose.
> > >
> > > The idea behind Hadoop and the rest of the tools around it is more
> > > of a data processing system than a backend datastore for a website.
> > > The output of the processing that Hadoop does is typically loaded
> > > into a MySQL cluster, which feeds a website.
> > >
> > >
> > >
> >
> >
> > --
> > ---
> > OpenThink Labs
> > www.tobethink.com
> >
> > Aligning IT and Education
> >
> > >> 021-99325243
> > Y! : hawking_123
> > LinkedIn : http://www.linkedin.com/in/wildanmaulana
> >
>



-- 
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun


Subtle Classloader Issue

2009-03-22 Thread Jeff Eastman
I'm trying to run the Dirichlet clustering example from
http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. The command
line:


$HADOOP_HOME/bin/hadoop jar 
$MAHOUT_HOME/examples/target/mahout-examples-0.1.job 
org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job


... loads our example jar file which contains the following structure:

>jar -tf mahout-examples-0.1.job
META-INF/
...
org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.class
org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.class
org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModelDistribution.class
org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.class
...
lib/mahout-core-0.1-tests.jar
lib/mahout-core-0.1.jar
lib/hadoop-core-0.19.1.jar
...

The dirichlet/Job first runs a map-reduce job to convert the input data
into Mahout Vector format and then runs the DirichletDriver.runJob()
method contained in lib/mahout-core-0.1.jar. This method calls
DirichletDriver.createState(), which initializes a
NormalScModelDistribution with a set of NormalScModels that represent
the prior state of the clustering. This state is then written to HDFS,
and the job begins running the iterations that assign input data points
to the models. So far so good.


 public static DirichletState createState(String modelFactory,
     int numModels, double alpha_0) throws ClassNotFoundException,
     InstantiationException, IllegalAccessException {
   // Resolve the model distribution class by name using the calling
   // thread's context class loader, then instantiate it reflectively.
   ClassLoader ccl = Thread.currentThread().getContextClassLoader();
   Class cl = ccl.loadClass(modelFactory);
   ModelDistribution factory = (ModelDistribution) cl.newInstance();
   return new DirichletState(factory, numModels, alpha_0, 1, 1);
 }


In the DirichletMapper, which also lives in the lib/mahout jar, the
configure() method reads in the current model state by calling
DirichletDriver.createState(). In this invocation, however, it throws a
ClassNotFoundException:

09/03/22 09:33:03 INFO mapred.JobClient: Task Id : attempt_200903211441_0025_m_00_2, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.clustering.syntheticcontrol.dirichlet.NormalScModelDistribution
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.getDirichletState(DirichletMapper.java:97)
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.configure(DirichletMapper.java:61)


The kMeans job, which uses the same class loader code to load its 
distance measure in similar driver code, works fine. The difference is 
that the referenced distance measure is contained in the 
mahout-core-0.1.jar, not the mahout-examples-0.1.job. Both jobs run fine 
in test mode from Eclipse.


It would seem that there is some subtle difference between the class loader
structures used by the DirichletDriver and DirichletMapper invocations. In
the former, the driver code is called by code living in the example jar; in
the latter, the driver code is called by code living in the mahout jar.
It's as if the first case can see into the lib/mahout classes, but the
second cannot see out to the classes in the example jar.
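
One workaround I'm considering, though I haven't verified it, is to resolve
the class through the job configuration instead of the thread context class
loader, on the assumption that the class loader the task runtime hands to
the JobConf can see the unpacked job jar and its lib/ directory. A sketch:

 public static DirichletState createState(JobConf conf, String modelFactory,
     int numModels, double alpha_0) throws ClassNotFoundException,
     InstantiationException, IllegalAccessException {
   // getClassByName() resolves against the configuration's class loader,
   // which in a task JVM should include the job jar and its lib/ jars.
   Class cl = conf.getClassByName(modelFactory);
   ModelDistribution factory = (ModelDistribution) cl.newInstance();
   return new DirichletState(factory, numModels, alpha_0, 1, 1);
 }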


Can anybody clarify what is going on and how to fix it?

Jeff