[jira] Updated: (NUTCH-471) Fix synchronization in NutchBean creation
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-471:
--------------------------------

    Attachment: NutchBeanCreationSync_v2.patch

From http://www-128.ibm.com/developerworks/java/library/j-dcl.html:

    The bottom line is that double-checked locking, in whatever form, should not be used because you cannot guarantee that it will work on any JVM implementation. JSR-133 is addressing issues regarding the memory model, however, double-checked locking will not be supported by the new memory model. Therefore, you have two options:
      * Accept the synchronization of a getInstance() method as shown in Listing 2.
      * Forgo synchronization and use a static field.

We don't want to compromise performance in NutchBean.get(), so synchronization is not a solution. Thus, as Sami suggested, I have written a ServletContextListener and moved the NutchBean construction code there, and modified web.xml to register the listener class. Also, during servlet initialization the Configuration object is initialized and cached by NutchConfiguration, so we avoid the same problem in NutchConfiguration.get().

I have tested the implementation and it seems OK.

> Fix synchronization in NutchBean creation
> -----------------------------------------
>
>                 Key: NUTCH-471
>                 URL: https://issues.apache.org/jira/browse/NUTCH-471
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Enis Soztutar
>             Fix For: 1.0.0
>
>         Attachments: NutchBeanCreationSync_v1.patch, NutchBeanCreationSync_v2.patch
>
>
> NutchBean is created and then cached in the servlet context. But
> NutchBean.get(ServletContext app, Configuration conf) is not synchronized,
> which causes more than one instance of the bean (and
> DistributedSearch$Client) to be created if the servlet container is accessed
> rapidly during startup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
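
For illustration, the listener approach described in the comment above can be sketched in plain Java. This is a minimal sketch and not the code in NutchBeanCreationSync_v2.patch: FakeContext, SearchBean and BeanStartupListener are hypothetical stand-ins for ServletContext, NutchBean and the real listener, chosen so the example runs without a servlet container. What it tries to show is that once the bean is built a single time at context startup, the request-time lookup is a plain read and needs neither locking nor lazy creation.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the attribute storage of a ServletContext.
final class FakeContext {
    private final ConcurrentHashMap<String, Object> attributes = new ConcurrentHashMap<>();

    void setAttribute(String name, Object value) {
        attributes.put(name, value);
    }

    Object getAttribute(String name) {
        return attributes.get(name);
    }
}

// Hypothetical stand-in for NutchBean; counts how many instances were built.
final class SearchBean {
    static final AtomicInteger CREATED = new AtomicInteger();

    SearchBean() {
        CREATED.incrementAndGet();
    }
}

// Plays the role of a ServletContextListener (assumption: the container calls
// contextInitialized exactly once, before any request is served).
final class BeanStartupListener {
    static final String KEY = "searchBean"; // hypothetical attribute name

    // Startup hook: build the expensive bean once and cache it in the context.
    static void contextInitialized(FakeContext app) {
        app.setAttribute(KEY, new SearchBean());
    }

    // Request-time lookup: only reads the cached bean, never creates one.
    static SearchBean get(FakeContext app) {
        return (SearchBean) app.getAttribute(KEY);
    }
}
```

In a real web application the container notifies registered context listeners before any servlet or filter is initialized, which is what would make the unsynchronized read in get() safe; the sketch only imitates that ordering by calling contextInitialized first.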
Re: retrieving original html from database
If you need an API for getting the content, can't you just look at cachedContent.jsp in the demo search application? It shows how to retrieve the original text/HTML that is stored within the segments.

Perhaps I am missing something.

On 4/27/07, songjue <[EMAIL PROTECTED]> wrote:

You can try this command: bin/nutch readseg (-dump ... | -get ...).
If you need an API instead of the command line, you may have to hack segment/SegmentReader.java. I'm also wondering about this.

BTW, make sure you set the 'http.content.limit' property to -1 to avoid content truncation.

songjue
2007-04-27

From: Charlie Williams
Sent: 2007-04-25 22:43:12
To: nutch-dev@lucene.apache.org
Cc:
Subject: retrieving original html from database

I have an index of pages from the web, a bit over 1 million. The fetch took several weeks to complete, since it was mainly over a small set of domains. Once we had a completed fetch and index, we began trying to work with the retrieved text, and found that the cached text is just that: flat text.

Is the original HTML cached anywhere that it can be accessed after the initial fetch? It would be a shame to have to recrawl all those pages.

We are using Nutch 0.8.

Thanks for any help.

-Charlie

-- 
"Conscious decisions by conscious minds are what make reality real"
[jira] Commented: (NUTCH-468) Scoring filter should distribute score to all outlinks at once
[ https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492386 ]

Andrzej Bialecki commented on NUTCH-468:
-----------------------------------------

+1. I'm writing a scoring plugin now where it's impossible to correctly create the adjust value without this change.

> Scoring filter should distribute score to all outlinks at once
> --------------------------------------------------------------
>
>                 Key: NUTCH-468
>                 URL: https://issues.apache.org/jira/browse/NUTCH-468
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Doğacan Güney
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: scoring-v2.patch, scoring.patch
>
>
> Currently ScoringFilter.distributeScoreToOutlink, as its name implies, takes
> only a single outlink and works on that. I would suggest that we change it to
> distributeScoreToOutlinks so that it takes all the outlinks of a page
> at once. This has several advantages:
> 1) A ScoringFilter plugin returns a single adjust datum to set its score
> instead of returning several.
> 2) A ScoringFilter plugin can change the score of the original page (via the
> adjust datum) even if there are no outlinks. This is useful if you have a
> ScoringFilter plugin that, say, scores pages based on content instead of
> outlinks.
> 3) Since the ScoringFilter plugin receives all outlinks at once, it can make
> better decisions on how to distribute the score. For example, right now it is
> not possible to create a plugin that always distributes exactly a page's
> 'cash' to its outlinks (that is, if a page has score 5, it always distributes
> exactly 5 points to its outlinks, whatever the internal/external score
> factors are) unless the internal and external score factors are both 1.
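
To make point 3 concrete, here is a minimal sketch of why seeing every outlink at once matters. It is not the ScoringFilter API and not the code in scoring-v2.patch: OutlinkScoreSketch, Outlink and distribute are hypothetical names, and the weighting scheme is only one possible choice. The idea is that each link is weighted by the internal or external factor and the weights are then normalized, so the shares always add up to the page's score. The normalization needs the total weight over all outlinks, which a per-outlink method cannot know.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Nutch.
final class OutlinkScoreSketch {

    // Hypothetical outlink: target URL plus whether it stays on the same host.
    static final class Outlink {
        final String url;
        final boolean internal;

        Outlink(String url, boolean internal) {
            this.url = url;
            this.internal = internal;
        }
    }

    // Splits pageScore across all outlinks so that the shares sum to pageScore,
    // weighting each link by the internal or external factor.
    static List<Double> distribute(double pageScore, List<Outlink> outlinks,
                                   double internalFactor, double externalFactor) {
        // First pass: total weight, only computable with every outlink in hand.
        double totalWeight = 0.0;
        for (Outlink o : outlinks) {
            totalWeight += o.internal ? internalFactor : externalFactor;
        }

        // Second pass: each link gets its normalized fraction of the score.
        List<Double> shares = new ArrayList<>();
        for (Outlink o : outlinks) {
            double weight = o.internal ? internalFactor : externalFactor;
            shares.add(totalWeight == 0.0 ? 0.0 : pageScore * weight / totalWeight);
        }
        return shares;
    }
}
```

For a page with score 5, one internal and two external outlinks, and assumed factors of 0.5 (internal) and 1.0 (external), the weights are 0.5, 1 and 1, so the shares come out as 1, 2 and 2 and still sum to 5.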
[jira] Created: (NUTCH-476) Would like to add a field to the document class for its MD5 signature
Would like to add a field to the document class for its MD5 signature
----------------------------------------------------------------------

                 Key: NUTCH-476
                 URL: https://issues.apache.org/jira/browse/NUTCH-476
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
         Environment: all
            Reporter: Linh Pham
            Priority: Minor


If an MD5 signature were calculated during indexing and stored with each document by default, it could then be used to remove duplicates from the results at retrieval time.
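
As a rough illustration of the request, the following sketch computes an MD5 signature when a document is indexed and uses it to drop duplicate hits at retrieval time. It is only a sketch under stated assumptions: SignatureDedupSketch, Doc, index and dedupe are hypothetical names and do not correspond to Nutch's document class or indexer; only java.security.MessageDigest is a real API here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not part of Nutch.
final class SignatureDedupSketch {

    // Hypothetical indexed document carrying its stored signature field.
    static final class Doc {
        final String url;
        final String signature;

        Doc(String url, String signature) {
            this.url = url;
            this.signature = signature;
        }
    }

    // Returns the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // "Indexing": compute the signature once and store it with the document.
    static Doc index(String url, String content) {
        return new Doc(url, md5Hex(content));
    }

    // "Retrieval": keep only the first hit for each stored signature.
    static List<Doc> dedupe(List<Doc> hits) {
        Set<String> seen = new HashSet<>();
        List<Doc> unique = new ArrayList<>();
        for (Doc d : hits) {
            if (seen.add(d.signature)) {
                unique.add(d);
            }
        }
        return unique;
    }
}
```

Because the signature is stored at index time, the retrieval step only compares short strings and never has to reread page content. Note that an exact digest like this only catches byte-identical content; near-duplicates would need a different signature.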