Duplicate Output Directories in S3
I have a Hadoop Streaming program that crawls the web for data items, processes each retrieved item, and then stores the results on S3. For each processed item a directory is created on S3 to hold the results of the processing. At the conclusion of a program run I've been getting a duplicate of each directory. E.g., if I process item A1 and item A2, I get two directories for the results of A1 and two directories for the results of A2. The corresponding directories are identical. I've checked my code and don't see anything obvious that could lead to this. Furthermore, it appears that only one map task is handling any given data item. Any suggestions on what might cause this? Thanks, John
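[Editor's note, not part of the original question: one well-known way a map task's side effects can run twice in Hadoop is speculative execution or a retried task attempt; if the output path includes anything attempt-specific, each re-run produces a fresh sibling copy. Below is a hedged sketch, with local files standing in for S3 and all names invented, of making the per-item write idempotent so a re-executed attempt overwrites rather than duplicates.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: derive the output directory ONLY from the item id,
// never from the task-attempt id, so a speculative or retried attempt lands
// in the same place instead of creating a second directory.
// (In 0.19-era Hadoop, speculative execution can also be turned off with
// mapred.map.tasks.speculative.execution=false, e.g. via -jobconf.)
public class IdempotentStore {
    static Path store(Path base, String itemId, String result) throws IOException {
        // One deterministic directory per item.
        Path dir = base.resolve("item-" + itemId);
        Files.createDirectories(dir);             // no-op if it already exists
        Path out = dir.resolve("result.txt");
        Files.write(out, result.getBytes());      // overwrite, don't append
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("results");
        store(base, "A1", "first attempt");
        store(base, "A1", "second attempt");      // simulated re-execution
        // Still exactly one directory for A1:
        System.out.println(Files.list(base).count());  // prints 1
    }
}
```

Whether speculative execution is actually the culprit here can't be told from the description alone; checking the JobTracker UI for more than one attempt per task would confirm or rule it out.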
Re: Unable to access job details
Can you look for an exception from Jetty in the JT logs and report it here? That would tell us the cause of the ERROR 500.

Thanks,
Amareshwari

Nathan Marz wrote:
> Sometimes I am unable to access a job's details and instead only see:
>
>   HTTP ERROR: 500
>   Internal Server Error
>   RequestURI=/jobdetails.jsp
>   Powered by Jetty://
>
> I am seeing this on the 0.19.2 branch. Does anyone know the cause of this?
Re: hadoop migration
On Sun, Mar 22, 2009 at 2:17 PM, nitesh bhatia wrote:
> Bigtable ??? Is it opensource ? I am not sure if google has released any
> code of bigtable. So far only 1 research paper is available.

No, Google has never released source code for MapReduce and BigTable. Hadoop and HBase attempt to fill that gap.

-Stuart
Re: hadoop migration
On Sun, Mar 22, 2009 at 11:47:35PM +0530, nitesh bhatia wrote:
> Bigtable ??? Is it opensource ? I am not sure if google has released any
> code of bigtable. So far only 1 research paper is available.

HBase is an implementation of BigTable.

-- Philip
Re: hadoop migration
Bigtable ??? Is it opensource ? I am not sure if google has released any code of bigtable. So far only 1 research paper is available.

--nitesh

On Tue, Mar 17, 2009 at 11:01 AM, Amandeep Khurana wrote:
> AFAIK, Google uses BigTable for pretty much most of their backend stuff.
> The thing to note here is that BigTable is much more mature than Hbase.
>
> You can try it out and see how it works out for you. Do share your results
> on the mailing list...
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
> On Mon, Mar 16, 2009 at 10:28 PM, W wrote:
>
> > Thanks for the quick response Aman,
> >
> > Ok .., i see the point now.
> >
> > currently i'm doing some research on creating a google books like
> > application using hbase as a backend for storing the files and solr as
> > indexer. From this prototype, may be i can measure how fast hbase is
> > on serving data to the client ... (google using bigTable for their
> > books.google.com right ?)
> >
> > Thanks!
> >
> > Regards,
> > Wildan
> >
> > On Tue, Mar 17, 2009 at 12:13 PM, Amandeep Khurana wrote:
> > > Hypertable is not as mature as Hbase yet. The next release of Hbase,
> > > 0.20.0, includes some patches which reduce the latency of responses
> > > and makes it suitable to be used as a backend for a webapp. However
> > > the current release isnt optimized for this purpose.
> > >
> > > The idea behind Hadoop and the rest of the tools around it is more of
> > > a data processing system than a backend datastore for a website. The
> > > output of the processing that Hadoop does is typically taken into a
> > > MySQL cluster which feeds a website.
> >
> > --
> > ---
> > OpenThink Labs
> > www.tobethink.com
> >
> > Aligning IT and Education
> >
> > 021-99325243
> > Y! : hawking_123
> > Linkedln : http://www.linkedin.com/in/wildanmaulana

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar, Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
Subtle Classloader Issue
I'm trying to run the Dirichlet clustering example from http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. The command line:

  $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

... loads our example jar file, which contains the following structure:

  >jar -tf mahout-examples-0.1.job
  META-INF/
  ...
  org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.class
  org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.class
  org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModelDistribution.class
  org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.class
  ...
  lib/mahout-core-0.1-tests.jar
  lib/mahout-core-0.1.jar
  lib/hadoop-core-0.19.1.jar
  ...

The dirichlet/Job first runs a map-reduce job to convert the input data into Mahout Vector format, and then runs the DirichletDriver.runJob() method contained in lib/mahout-core-0.1.jar. That method calls DirichletDriver.createState(), which initializes a NormalScModelDistribution with a set of NormalScModels that represent the prior state of the clustering. This state is then written to HDFS, and the job begins running the iterations which assign input data points to the models. So far so good.

  public static DirichletState createState(String modelFactory, int numModels, double alpha_0)
      throws ClassNotFoundException, InstantiationException, IllegalAccessException {
    // Load the model factory via the thread's context class loader.
    ClassLoader ccl = Thread.currentThread().getContextClassLoader();
    Class<?> cl = ccl.loadClass(modelFactory);
    ModelDistribution factory = (ModelDistribution) cl.newInstance();
    DirichletState state = new DirichletState(factory, numModels, alpha_0, 1, 1);
    return state;
  }

In the DirichletMapper, also in the lib/mahout jar, the configure() method reads in the current model state by calling DirichletDriver.createState(). In this invocation, however, it throws a ClassNotFoundException:
  09/03/22 09:33:03 INFO mapred.JobClient: Task Id : attempt_200903211441_0025_m_00_2, Status : FAILED
  java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.clustering.syntheticcontrol.dirichlet.NormalScModelDistribution
      at org.apache.mahout.clustering.dirichlet.DirichletMapper.getDirichletState(DirichletMapper.java:97)
      at org.apache.mahout.clustering.dirichlet.DirichletMapper.configure(DirichletMapper.java:61)

The kMeans job, which uses the same class-loader code to load its distance measure in similar driver code, works fine. The difference is that the referenced distance measure is contained in mahout-core-0.1.jar, not in mahout-examples-0.1.job. Both jobs run fine in test mode from Eclipse.

It would seem that there is some subtle difference in the class-loader structures used by the DirichletDriver and DirichletMapper process invocations. In the former, the driver code is called by code living in the example jar; in the latter, the driver code is called by code living in the mahout jar. It's as if the first case can see into the lib/mahout classes but the second cannot see out to the classes in the example jar. Can anybody clarify what is going on and how to fix it?

Jeff
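[Editor's note: the asymmetry described above matches ordinary parent-delegation in Java class loading: a child loader can see everything its parent can, but a parent cannot see classes known only to a child. The following self-contained sketch (plain JDK, all names invented here; it is not Mahout or Hadoop code) reproduces exactly that one-way visibility by compiling a throwaway class at runtime and loading it through a child URLClassLoader.]

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class LoaderDemo {
    // Compile a tiny class into a temp directory and return a child loader
    // over that directory, parented on this class's own loader.
    static URLClassLoader childLoader() throws IOException {
        File dir = Files.createTempDirectory("loaderdemo").toFile();
        File src = new File(dir, "ChildOnly.java");
        Files.write(src.toPath(), "public class ChildOnly {}".getBytes());
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler(); // requires a JDK
        javac.run(null, null, null, src.getPath());
        return new URLClassLoader(new URL[] { dir.toURI().toURL() },
                                  LoaderDemo.class.getClassLoader());
    }

    public static void main(String[] args) throws Exception {
        URLClassLoader child = childLoader();
        // The child delegates upward, so parent classes resolve through it...
        System.out.println(child.loadClass("java.lang.String").getName());
        // ...and classes known only to the child also resolve through it...
        System.out.println(child.loadClass("ChildOnly").getName());
        // ...but the parent cannot see "down" into the child:
        try {
            LoaderDemo.class.getClassLoader().loadClass("ChildOnly");
        } catch (ClassNotFoundException e) {
            System.out.println("parent: ClassNotFoundException");
        }
    }
}
```

In the Dirichlet case this suggests (hedged, since the task-side loader layout is not shown in the post) that the mapper's context class loader does not have the example job jar's own classes on its path, only the lib/ jars, so a class that lives in mahout-examples-0.1.job cannot be resolved from code running under a loader rooted at lib/mahout-core-0.1.jar.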