Thanks Andrzej. I don't think my scenario would be applicable in real-life
situations. However, it would be great to know where the root of the problem
lies.

I have managed to dedup a larger index, and it is working perfectly. So your
theory is correct. I guess it's a matter of digging a little deeper to
eliminate this once and for all.

Thanks.




-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: 01 February 2007 17:59
To: [email protected]
Subject: Re: Dedup index error

Hetal Shah wrote:
> Another quick update:
>
> I ran Luke on the index, and part-00000 works fine, whereas part-00001 
> comes up as corrupt or missing. Now seeing from the list of files in 
> both these directories, we know that there is nothing in part-00001 - 
> so why does it get generated? And if it does, why does dedup not handle it
> gracefully?
>
> I also ran a merge on the two indexes, and it worked fine. 
>
> So that rests the case that both the indexes are corrupted. This 
> brings me to understand that since I only had two pages indexed and 
> the index was small, part-00001 came up with nothing, and dedup does not
> handle it????
>
> Any thoughts?
>   

There seems to be an issue with the document partitioning - it seems that
for larger numbers of documents the partitioning scheme generates at least
one document per partition, but in your case there were too few documents to
fill the second partition ... I need to check where the problem originates -
however, this should not happen if you index more documents than 2 * the
number of reduce tasks.
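
To illustrate how a partition can end up empty, here is a minimal sketch. It
assumes documents are assigned to partitions by key hash modulo the number of
reduce tasks, which is the arithmetic Hadoop's default HashPartitioner uses;
whether that exact partitioner is involved in this case is an assumption. The
class name and the keys "b" and "d" are hypothetical stand-ins for two
document keys.

```java
import java.util.Arrays;

class PartitionSketch {
    // Hash-modulo assignment: clear the sign bit, then take the remainder.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Count how many keys land in each partition.
    static int[] countPerPartition(String[] keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two hypothetical document keys, two reduce tasks.
        String[] keys = {"b", "d"};
        // Both keys hash to even values, so the second partition stays empty.
        System.out.println(Arrays.toString(countPerPartition(keys, 2))); // prints [2, 0]
    }
}
```

With only two keys, nothing forces them into different partitions, so one
output part can be written with no documents in it.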

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
