[ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552935 ]
Joseph Chen commented on NUTCH-579:
-----------------------------------

I changed the db.signature.class and this seems to solve the problem when I first do a crawl. Now I'm seeing a similar problem when I try to merge the results of two crawls.

I performed two separate crawls using the crawl tool and wanted to merge their results. Here are the steps I did:

1) Merged the segments from the two crawls
2) Inverted links
3) Merged the crawldb
4) Indexed the segments
5) Deduped the index
6) Merged the indexes

I noticed a problem after running the dedup. My original index had about 8000 documents (corresponding to feed posts), and after merging I ended up with about half that number (4000 documents). Examining the index via Luke shows that I'm back down to one post per feed - each document has a unique digest value.

When I skip the dedup step (step 5), the number of documents is around 17000, and examining this index shows multiple posts from a feed.

I searched for the db.signature.class value in DeleteDuplicates.java, which is the class that gets called when running bin/nutch dedup, but I didn't see any references to this value.

Any ideas about this issue?

> Feed plugin only indexes one post per feed due to identical digest
> ------------------------------------------------------------------
>
>                 Key: NUTCH-579
>                 URL: https://issues.apache.org/jira/browse/NUTCH-579
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Joseph Chen
>
> When parsing an RSS feed, only one post will be indexed per feed. The reason
> for this is that the digest, which is calculated based on the content (or
> the url if the content is null), is always the same for each post in a feed.
> I noticed this when I was examining my lucene indexes using Luke. All of the
> individual feed entries were being indexed properly but then when the dedup
> step ran, my merged index ended up with only one document.
> As a quick fix, I simply overrode the digest in the FeedIndexingFilter.java,
> by adding the following code to the filter function:
>
>     byte[] signature = MD5Hash.digest(url.toString()).getDigest();
>     doc.removeField("digest");
>     doc.add(new Field("digest", StringUtil.toHexString(signature),
>         Field.Store.YES, Field.Index.NO));
>
> This seems to fix the issue as the index now contains the proper number of
> documents.
> Anyone have any comments on whether this is a good solution or if there is a
> better solution?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
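
To illustrate the behavior described in the issue, here is a minimal, JDK-only sketch of digest-based deduplication. It is not the code in DeleteDuplicates.java; the class name DigestDedupSketch, the helper names, and the example URLs are hypothetical. It also assumes that every post from a feed ends up hashing the same content, which is one way the identical digests reported above could arise.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DigestDedupSketch {

    /** Returns the hex-encoded MD5 digest of a string (JDK only, no Nutch classes). */
    public static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Simplified model of dedup: documents with the same digest collapse
     * into one, so the survivor count is the number of distinct digests.
     */
    public static int survivors(List<String> digestInputs) {
        Set<String> digests = new LinkedHashSet<>();
        for (String input : digestInputs) {
            digests.add(md5Hex(input));
        }
        return digests.size();
    }

    public static void main(String[] args) {
        // Assumption: all three posts are hashed over the same feed content.
        String feedContent = "<rss>whole feed body</rss>";
        List<String> byContent = Arrays.asList(feedContent, feedContent, feedContent);

        // The quick fix hashes each post's URL instead (hypothetical URLs).
        List<String> byUrl = Arrays.asList(
            "http://example.com/feed/post-1",
            "http://example.com/feed/post-2",
            "http://example.com/feed/post-3");

        System.out.println(survivors(byContent)); // prints 1
        System.out.println(survivors(byUrl));     // prints 3
    }
}
```

Hashing the URL keeps every post, but in this model it also means two different URLs serving identical content would no longer be treated as duplicates, so the trade-off depends on what dedup is meant to catch.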