RE: Merge of the inmemory files threw an exception and diffs between 0.17.2 and 0.18.1

2008-10-31 Thread Deepika Khera
Hi Devaraj,

It was pretty consistent with my comparator class from my old email (the
one that uses UTF8). While trying to resolve the issue, I changed UTF8
to Text. That made the problem disappear for a while, but then it came back.
My new comparator class (with Text) is:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// CUID is an application-specific comparison utility, not a Hadoop class
// (its import is omitted here).
public class IncrementalURLIndexKey implements WritableComparable {
  private Text url;
  private long userid;

  public IncrementalURLIndexKey() {
  }

  public IncrementalURLIndexKey(Text url, long userid) {
    this.url = url;
    this.userid = userid;
  }

  public Text getUrl() {
    return url;
  }

  public long getUserid() {
    return userid;
  }

  public void write(DataOutput out) throws IOException {
    url.write(out);
    out.writeLong(userid);
  }

  public void readFields(DataInput in) throws IOException {
    url = new Text();
    url.readFields(in);
    userid = in.readLong();
  }

  public int compareTo(Object o) {
    IncrementalURLIndexKey other = (IncrementalURLIndexKey) o;
    int result = url.compareTo(other.getUrl());
    if (result == 0) result = CUID.compare(userid, other.userid);
    return result;
  }

  /**
   * A Comparator optimized for IncrementalURLIndexKey; it groups on url only.
   */
  public static class GroupingComparator extends WritableComparator {
    public GroupingComparator() {
      super(IncrementalURLIndexKey.class, true);
    }

    public int compare(WritableComparable a, WritableComparable b) {
      IncrementalURLIndexKey key1 = (IncrementalURLIndexKey) a;
      IncrementalURLIndexKey key2 = (IncrementalURLIndexKey) b;

      return key1.getUrl().compareTo(key2.getUrl());
    }
  }

  static {
    // Register the grouping comparator as the default comparator for this key class.
    WritableComparator.define(IncrementalURLIndexKey.class, new GroupingComparator());
  }
}
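
A possible workaround (a sketch only, not something suggested in this
thread): since the EOFException in the stack traces quoted below is thrown
while WritableComparator deserializes keys during the merge, the grouping
comparator could be given a raw, byte-level compare so the merge never has
to call readFields. This assumes the key is serialized exactly as in the
class above (a Text, i.e. a vint length followed by UTF-8 bytes, and then a
long); the class name RawGroupingComparator is hypothetical.

import java.io.IOException;

import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

// Hypothetical raw comparator that groups IncrementalURLIndexKey records
// by url without deserializing them. Assumed serialized layout:
//   [vint url length][url bytes][8-byte userid]
public class RawGroupingComparator extends WritableComparator {

  public RawGroupingComparator() {
    // No need to instantiate keys; only raw bytes are compared.
    super(IncrementalURLIndexKey.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      // Size (in bytes) of the vint that encodes each url length.
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      // The url lengths themselves.
      int len1 = readVInt(b1, s1);
      int len2 = readVInt(b2, s2);
      // Compare only the url portion, ignoring the trailing userid.
      return compareBytes(b1, s1 + n1, len1, b2, s2 + n2, len2);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}

It would be wired up with conf.setOutputValueGroupingComparator(RawGroupingComparator.class)
in the old mapred API, again assuming that is how the job is configured.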

Thanks,
Deepika


-Original Message-
From: Devaraj Das [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, October 28, 2008 9:01 PM
To: core-user@hadoop.apache.org
Subject: Re: Merge of the inmemory files threw an exception and diffs
between 0.17.2 and 0.18.1

Quick question (I haven't looked at your comparator code yet) - is this
reproducible/consistent?


On 10/28/08 11:52 PM, Deepika Khera [EMAIL PROTECTED] wrote:

I am getting a similar exception too with Hadoop 0.18.1 (see stack trace
below), though it's an EOFException. Does anyone have any idea what it
means and how it can be fixed?
 
2008-10-27 16:53:07,407 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810241922_0844_r_06_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermedate merge failed
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2147)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2078)
Caused by: java.lang.RuntimeException: java.io.EOFException
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
    at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:269)
    at org.apache.hadoop.util.PriorityQueue.upHeap(PriorityQueue.java:122)
    at org.apache.hadoop.util.PriorityQueue.put(PriorityQueue.java:49)
    at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:321)
    at org.apache.hadoop.mapred.Merger.merge(Merger.java:72)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2123)
    ... 1 more
Caused by: java.io.EOFException
    at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:103)
    at com.collarity.io.IOUtil.readUTF8(IOUtil.java:213)
    at com.collarity.url.IncrementalURLIndexKey.readFields(IncrementalURLIndexKey.java:40)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:97)
    ... 7 more

2008-10-27 16:53:07,407 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810241922_0844_r_06_0 Merging of the local FS files threw an exception: java.io.IOException: java.lang.RuntimeException: java.io.EOFException
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
    at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:269)
    at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:135)
    at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:102)
    at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:226)
    at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:242)
    at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:83)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2035)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
    at com.collarity.io.IOUtil.readUTF8(IOUtil.java:213)
    at com.collarity.url.IncrementalURLIndexKey.readFields(IncrementalURLIndexKey.java:40)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:97)
    ... 7 more

RE: Merge of the inmemory files threw an exception and diffs between 0.17.2 and 0.18.1

2008-10-31 Thread Deepika Khera
Wow, if the issue is fixed in version 0.20, could we please have
a patch for version 0.18?

Thanks,
Deepika

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 30, 2008 12:19 PM
To: core-user@hadoop.apache.org
Subject: Re: Merge of the inmemory files threw an exception and diffs
between 0.17.2 and 0.18.1

So, Philippe reports that the problem goes away with 0.20-dev (trunk?):
http://mahout.markmail.org/message/swmzreg6fnzf6icv
We aren't totally clear on the structure of SVN for Hadoop, but it seems
like it is not fixed by this patch.



On Oct 29, 2008, at 10:28 AM, Grant Ingersoll wrote:

 We'll try it out...

 On Oct 28, 2008, at 3:00 PM, Arun C Murthy wrote:


 On Oct 27, 2008, at 7:05 PM, Grant Ingersoll wrote:

 Hi,

 Over in Mahout (lucene.a.o/mahout), we are seeing an oddity with  
 some of our clustering code and Hadoop 0.18.1.  The thread in  
 context is at:  http://mahout.markmail.org/message/vcyvlz2met7fnthr

 The problem seems to occur when going from 0.17.2 to 0.18.1.  In  
 the user logs, we are seeing the following exception:
2008-10-27 21:18:37,014 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 5011 bytes
2008-10-27 21:18:37,033 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810272112_0011_r_00_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermedate merge failed
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2147)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2078)
Caused by: java.lang.NumberFormatException: For input string: [

If you are sure that this isn't caused by your application logic,
you could try running with the patch from
http://issues.apache.org/jira/browse/HADOOP-4277.

 That bug caused many a ship to sail in large circles, hopelessly.

 Arun


    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1224)
    at java.lang.Double.parseDouble(Double.java:510)
    at org.apache.mahout.matrix.DenseVector.decodeFormat(DenseVector.java:60)
    at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:256)
    at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:38)
    at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:31)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.combineAndSpill(ReduceTask.java:2174)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$3100(ReduceTask.java:341)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2134)

And in the main output log (from running bin/hadoop jar
mahout/examples/build/apache-mahout-examples-0.1-dev.job
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job) we see:
08/10/27 21:18:41 INFO mapred.JobClient: Task Id : attempt_200810272112_0011_r_00_0, Status : FAILED
java.io.IOException: attempt_200810272112_0011_r_00_0The reduce copier failed
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

If I run this exact same job on 0.17.2 it all runs fine. I suppose
either a bug was introduced in 0.18.1 or a bug was fixed that we were
relying on. Looking at the release notes for the changes between the two
versions, nothing in particular struck me as related. If it helps, I can
provide the instructions for how to run the example in question (they
need to be written up anyway!)

I see some related things at
http://hadoop.markmail.org/search/?q=Merge+of+the+inmemory+files+threw+an+exception
but those are older, it seems, so not sure what to make of them.

 Thanks,
 Grant


 --
 Grant Ingersoll
 Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
 http://www.lucenebootcamp.com


 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ









RE: Merge of the inmemory files threw an exception and diffs between 0.17.2 and 0.18.1

2008-10-28 Thread Deepika Khera
I am getting a similar exception too with Hadoop 0.18.1 (see stack trace
below), though it's an EOFException. Does anyone have any idea what it
means and how it can be fixed?

2008-10-27 16:53:07,407 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810241922_0844_r_06_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermedate merge failed
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2147)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2078)
Caused by: java.lang.RuntimeException: java.io.EOFException
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
    at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:269)
    at org.apache.hadoop.util.PriorityQueue.upHeap(PriorityQueue.java:122)
    at org.apache.hadoop.util.PriorityQueue.put(PriorityQueue.java:49)
    at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:321)
    at org.apache.hadoop.mapred.Merger.merge(Merger.java:72)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2123)
    ... 1 more
Caused by: java.io.EOFException
    at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:103)
    at com.collarity.io.IOUtil.readUTF8(IOUtil.java:213)
    at com.collarity.url.IncrementalURLIndexKey.readFields(IncrementalURLIndexKey.java:40)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:97)
    ... 7 more

2008-10-27 16:53:07,407 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810241922_0844_r_06_0 Merging of the local FS files threw an exception: java.io.IOException: java.lang.RuntimeException: java.io.EOFException
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
    at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:269)
    at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:135)
    at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:102)
    at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:226)
    at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:242)
    at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:83)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2035)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
    at com.collarity.io.IOUtil.readUTF8(IOUtil.java:213)
    at com.collarity.url.IncrementalURLIndexKey.readFields(IncrementalURLIndexKey.java:40)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:97)
    ... 7 more

    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2039)

2008-10-27 16:53:07,907 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: attempt_200810241922_0844_r_06_0The reduce copier failed
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)




My WritableComparable class looks like this:




import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// IOUtil and CUID are application-specific helpers (com.collarity.*),
// not Hadoop classes; their imports are omitted here.
public class IncrementalURLIndexKey implements WritableComparable {
  private UTF8 url;
  private long userid;

  public IncrementalURLIndexKey() {
  }

  public IncrementalURLIndexKey(UTF8 url, long userid) {
    this.url = url;
    this.userid = userid;
  }

  public UTF8 getUrl() {
    return url;
  }

  public long getUserid() {
    return userid;
  }

  public void write(DataOutput out) throws IOException {
    IOUtil.writeUTF8(out, url);
    out.writeLong(userid);
  }

  public void readFields(DataInput in) throws IOException {
    url = IOUtil.readUTF8(in, url);
    userid = in.readLong();
  }

  public int compareTo(Object o) {
    IncrementalURLIndexKey other = (IncrementalURLIndexKey) o;
    int result = url.compareTo(other.getUrl());
    if (result == 0) result = CUID.compare(userid, other.userid);
    return result;
  }

  /**
   * A Comparator optimized for IncrementalURLIndexKey; it groups on url only.
   */
  public static class GroupingComparator extends WritableComparator {
    public GroupingComparator() {
      super(IncrementalURLIndexKey.class, true);
    }

    public int compare(WritableComparable a, WritableComparable b) {
      IncrementalURLIndexKey key1 = (IncrementalURLIndexKey) a;
      IncrementalURLIndexKey key2 = (IncrementalURLIndexKey) b;

      return key1.getUrl().compareTo(key2.getUrl());
    }
  }

  static {
    // Register the grouping comparator as the default comparator for this key class.
    WritableComparator.define(IncrementalURLIndexKey.class, new GroupingComparator());
  }
}


Thanks,
Deepika

-Original Message-
From: Grant 

Katta presentation slides

2008-09-22 Thread Deepika Khera
Hi Stefan,

 

Are the slides from the Katta presentation up somewhere? If not, could
you please post them?

 

Thanks,
Deepika



Hadoop 0.18 stable?

2008-09-09 Thread Deepika Khera
Hi,

 

When is Hadoop 0.18 expected to be stable? I was looking into upgrading
to it.

 

Are there any known critical issues people have run into with this version?

 

Thanks,

Deepika



RE: Cannot read reducer values into a list

2008-08-20 Thread Deepika Khera
Thanks...this works beautifully :) !

Deepika

-Original Message-
From: Owen O'Malley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 20, 2008 7:52 AM
To: core-user@hadoop.apache.org
Subject: Re: Cannot read reducer values into a list


On Aug 19, 2008, at 4:57 PM, Deepika Khera wrote:

 Thanks for the clarification on this.

 So, it seems like cloning the object before adding to the list is the
 only solution for this problem. Is that right?

Yes. You can use WritableUtils.clone to do the job.

-- Owen
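
For reference, a minimal sketch of what that looks like with the old
mapred API; the reducer name and the Text key/value types here are
placeholders, not taken from this thread.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer showing WritableUtils.clone; key/value types are
// placeholders.
public class CloningReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private JobConf conf;

  public void configure(JobConf job) {
    // Keep the conf around so clone() can instantiate the copies.
    this.conf = job;
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output,
                     Reporter reporter) throws IOException {
    List<Text> valueList = new ArrayList<Text>();
    while (values.hasNext()) {
      // Hadoop reuses one Writable instance across values.next() calls,
      // so each value is cloned before it goes into the list.
      valueList.add(WritableUtils.clone(values.next(), conf));
    }
    for (Text value : valueList) {
      output.collect(key, value);
    }
  }
}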


RE: Cannot read reducer values into a list

2008-08-19 Thread Deepika Khera
Hi,

Are we sure that this issue was fixed in 0.17.0 (or do we need to apply a
patch)? I am using this version and I still see the issue.


Thanks,
Deepika 

-Original Message-
From: Arun C Murthy [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, August 19, 2008 1:04 PM
To: core-user@hadoop.apache.org
Subject: Re: Cannot read reducer values into a list


On Aug 19, 2008, at 12:17 PM, Stuart Sierra wrote:

 Hello list,
 Thought I would share this tidbit that frustrated me for a couple of
 hours.  Beware!  Hadoop reuses the Writable objects given to the
 reducer.  For example:


Yes.
http://issues.apache.org/jira/browse/HADOOP-2399 - fixed in 0.17.0.

Arun

public void reduce(K key, Iterator<V> values,
                   OutputCollector<K, V> output,
                   Reporter reporter)
    throws IOException {

  List<V> valueList = new ArrayList<V>();
  while (values.hasNext()) {
    valueList.add(values.next());
  }

  // Say there were 10 values.  valueList now contains 10
  // pointers to the same object.
}

 I assume this is done for efficiency, but a warning in the Reducer
 documentation would be nice.

 -Stuart



RE: Distributed Lucene - from hadoop contrib

2008-08-12 Thread Deepika Khera
Thank you for your response.

I was imagining the two concepts of (i) using hadoop.contrib.index to index
documents and (ii) providing search in a distributed fashion to be all in
one box.

So basically, hadoop.contrib.index is used to create Lucene indexes in a
distributed fashion (by creating shards, each shard being a Lucene
instance), and then I can use Katta or any other distributed Lucene
application to serve the Lucene indexes distributed over many servers.

Deepika 


-Original Message-
From: Ning Li [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 08, 2008 7:08 AM
To: core-user@hadoop.apache.org
Subject: Re: Distributed Lucene - from hadoop contrib

 1) Katta and Distributed Lucene are different projects though, right?
 Both being based on kind of the same paradigm (Distributed Index)?

The design of Katta and that of Distributed Lucene are quite different
last time I checked. I pointed out the Katta project because you can
find the code for Distributed Lucene there.

 2) So, I should be able to use hadoop.contrib.index with HDFS. Though,
 it would be much better if it were integrated with Distributed Lucene or
 the Katta project, as these are designed with the structure and behavior
 of indexes in mind. Right?

As described in the README file, hadoop.contrib.index uses map/reduce
to build Lucene instances. It does not contain a component that serves
queries. If that's not sufficient for you, you can check out the
designs of Katta and Distributed Index and see which one suits your
use better.

Ning


RE: Distributed Lucene - from hadoop contrib

2008-08-07 Thread Deepika Khera
Hey guys,

I would appreciate any feedback on this

Deepika

-Original Message-
From: Deepika Khera [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 06, 2008 5:39 PM
To: core-user@hadoop.apache.org
Subject: Distributed Lucene - from hadoop contrib

Hi,

 

I am planning to use distributed lucene from hadoop.contrib.index for
indexing. Has anyone used this or tested it? Any issues or comments?

 

I see that the design described is different from HDFS (the Namenode is
stateless, stores no information regarding blocks for files, etc.). Does
anyone know how hard it would be to set up this kind of system, or is
there something that can be reused?

 

A reference link:

 

http://wiki.apache.org/hadoop/DistributedLucene

 

Thanks,
Deepika



Distributed Lucene - from hadoop contrib

2008-08-06 Thread Deepika Khera
Hi,

 

I am planning to use distributed lucene from hadoop.contrib.index for
indexing. Has anyone used this or tested it? Any issues or comments?

 

I see that the design described is different from HDFS (the Namenode is
stateless, stores no information regarding blocks for files, etc.). Does
anyone know how hard it would be to set up this kind of system, or is
there something that can be reused?

 

A reference link:

 

http://wiki.apache.org/hadoop/DistributedLucene

 

Thanks,
Deepika