date:20070427

[jira] Updated: (NUTCH-471) Fix synchronization in NutchBean creation

2007-04-27 Thread Enis Soztutar (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-471:


Attachment: NutchBeanCreationSync_v2.patch

From http://www-128.ibm.com/developerworks/java/library/j-dcl.html:

The bottom line is that double-checked locking, in whatever form, should not be 
used because you cannot guarantee that it will work on any JVM implementation. 
JSR-133 is addressing issues regarding the memory model, however, 
double-checked locking will not be supported by the new memory model. 
Therefore, you have two options:
* Accept the synchronization of a getInstance() method as shown in Listing 2.
* Forgo synchronization and use a static field.
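
As an illustration of the second option in the quoted list, here is a minimal Java sketch. ExpensiveBean and its CREATED counter are hypothetical names invented for this example; they are not Nutch classes. The sketch relies only on the JVM's guarantee that class initialization is thread-safe.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive shared object; not a Nutch class.
final class ExpensiveBean {
    // Counts constructions so the single-instance property can be checked.
    static final AtomicInteger CREATED = new AtomicInteger();

    // Built once during class initialization, which the JVM performs
    // in a thread-safe way; no explicit locking is needed.
    private static final ExpensiveBean INSTANCE = new ExpensiveBean();

    private ExpensiveBean() {
        CREATED.incrementAndGet();
    }

    // Unsynchronized accessor: callers only ever see the fully built instance.
    static ExpensiveBean getInstance() {
        return INSTANCE;
    }
}
```

The trade-off is that the instance is built when the class is first initialized, so this only fits objects that need no per-request arguments at construction time.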

We don't want to sacrifice performance in NutchBean.get(), so synchronization is 
not a solution. Thus, as Sami suggested, I have written a ServletContextListener 
and moved the NutchBean construction code there, and modified web.xml to register 
the listener class. Also, during servlet initialization the Configuration 
object is initialized and cached by NutchConfiguration, so we avoid the same 
problem in NutchConfiguration.get(). 

I have tested the implementation and it seems OK. 
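
The idea of building the bean once at startup and looking it up without locks afterwards can be sketched as below. This is not the attached patch: FakeContext, SearchBean, StartupHook and the "searchBean" attribute name are hypothetical stand-ins, used so the example compiles without the servlet API. In a real web application the hook would implement javax.servlet.ServletContextListener and be registered in web.xml with a listener element.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the servlet context's attribute store.
final class FakeContext {
    private final Map<String, Object> attributes = new ConcurrentHashMap<>();

    void setAttribute(String key, Object value) {
        attributes.put(key, value);
    }

    Object getAttribute(String key) {
        return attributes.get(key);
    }
}

// Hypothetical stand-in for the searcher bean.
final class SearchBean {
    static final String KEY = "searchBean"; // assumed attribute name

    // Lock-free lookup; assumes the startup hook has already run.
    static SearchBean get(FakeContext app) {
        return (SearchBean) app.getAttribute(KEY);
    }
}

// Plays the role of a context listener: called once, before any request.
final class StartupHook {
    void contextInitialized(FakeContext app) {
        app.setAttribute(SearchBean.KEY, new SearchBean());
    }
}
```

Because construction happens in the startup hook, the accessor never creates anything, so concurrent requests during startup cannot produce a second instance.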


> Fix synchronization in NutchBean creation
> -
>
> Key: NUTCH-471
> URL: https://issues.apache.org/jira/browse/NUTCH-471
> Project: Nutch
>  Issue Type: Bug
>  Components: searcher
>Affects Versions: 1.0.0
>Reporter: Enis Soztutar
> Fix For: 1.0.0
>
> Attachments: NutchBeanCreationSync_v1.patch, 
> NutchBeanCreationSync_v2.patch
>
>
> NutchBean is created and then cached in the servlet context. But 
> NutchBean.get(ServletContext app, Configuration conf) is not synchronized, 
> so more than one instance of the bean (and of 
> DistributedSearch$Client) can be created if the servlet container is 
> accessed rapidly during startup. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: retrieving original html from database

2007-04-27 Thread Briggs


If you need an api for getting the content, can't you just look into
the cachedContent.jsp of the demo search application?  That shows how
to retrieve the original text/html that is stored within the segments.

Perhaps I am missing something.





On 4/27/07, songjue <[EMAIL PROTECTED]> wrote:

You can try this command:  bin/nutch readseg (-dump ... | -get ...) .
If you need an API instead of the command line, you may have to adapt
segment/SegmentReader.java. I'm wondering about this too.

BTW, make sure you set the 'http.content.limit' property to -1 to avoid
content truncation.




songjue
2007-04-27



From: Charlie Williams
Sent: 2007-04-25 22:43:12
To: nutch-dev@lucene.apache.org
Cc:
Subject: retrieving original html from database

I have an index of a bit over 1 million web pages. The fetch took
several weeks to complete, since it was mainly over a small set of domains.
Once the fetch and index were complete, we began working with the
retrieved text and found that the cached text is just that: flat text. Is
the original HTML cached anywhere that it can be accessed after the initial
fetch? It would be a shame to have to recrawl all those pages. We are using
Nutch 0.8.

Thanks for any help.

-Charlie




--
"Conscious decisions by conscious minds are what make reality real"

[jira] Commented: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-27 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492386
 ] 

Andrzej Bialecki  commented on NUTCH-468:
-

+1. I'm writing a scoring plugin now where it's impossible to correctly create 
the adjust value without this change.

> Scoring filter should distribute score to all outlinks at once
> --
>
> Key: NUTCH-468
> URL: https://issues.apache.org/jira/browse/NUTCH-468
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Doğacan Güney
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: scoring-v2.patch, scoring.patch
>
>
> Currently ScoringFilter.distributeScoreToOutlink, as its name implies, takes 
> only a single outlink and works on that. I would suggest that we change it to 
> distributeScoreToOutlink_s_ so that it would take all the outlinks of a page 
> at once. This has several advantages:
> 1) A ScoringFilter plugin returns a single adjust datum to set its score 
> instead of returning several.
> 2) A ScoringFilter plugin can change the score of the original page (via 
> adjust datum) even if there are no outlinks. This is useful if you have a 
> ScoringFilter plugin that, say, scores pages based on content instead of 
> outlinks.
> 3) Since the ScoringFilter plugin receives all outlinks at once, it can make 
> better decisions on how to distribute the score. For example, right now it is 
> not possible to create a plugin that always distributes exactly a page's 
> 'cash' to outlinks (that is, if a page has score 5, it will always distribute 
> exactly 5 points to its outlinks no matter what the internal/external factors 
> are) if internal / external score factors are not 1.
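
To illustrate point 3, here is a minimal Java sketch of splitting a page's score across all of its outlinks at once. ScoreSplitter, Outlink and the factor parameters are hypothetical names for this example and do not reflect the actual ScoringFilter interface or the attached patches. Seeing every outlink together lets the weights be normalized, so the shares always sum to the page's score whatever the factors are.

```java
import java.util.ArrayList;
import java.util.List;

final class ScoreSplitter {
    // Hypothetical outlink record: the URL and whether it stays on the same host.
    static final class Outlink {
        final String url;
        final boolean internal;

        Outlink(String url, boolean internal) {
            this.url = url;
            this.internal = internal;
        }
    }

    // Splits pageScore across all outlinks in proportion to their factor,
    // so the shares sum to pageScore regardless of the factor values.
    static List<Double> distribute(double pageScore, List<Outlink> outlinks,
                                   double internalFactor, double externalFactor) {
        double totalWeight = 0.0;
        for (Outlink o : outlinks) {
            totalWeight += o.internal ? internalFactor : externalFactor;
        }
        List<Double> shares = new ArrayList<>();
        for (Outlink o : outlinks) {
            double weight = o.internal ? internalFactor : externalFactor;
            // With zero total weight there is nothing meaningful to distribute.
            shares.add(totalWeight == 0.0 ? 0.0 : pageScore * weight / totalWeight);
        }
        return shares;
    }
}
```

With a per-outlink callback the total weight is unknown when each link is scored, which is why this normalization needs the whole list.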


[jira] Created: (NUTCH-476) Would like to add a field to the document class for its MD5 signature

2007-04-27 Thread Linh Pham (JIRA)

Would like to add a field to the document class for its MD5 signature 
--

 Key: NUTCH-476
 URL: https://issues.apache.org/jira/browse/NUTCH-476
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
 Environment: all
Reporter: Linh Pham
Priority: Minor


If an MD5 signature were calculated during indexing and stored with each 
document by default, it could then be used to remove duplicates from the 
results at retrieval time.
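
A minimal Java sketch of the idea, using only the standard library. SignatureDedup, md5Hex and dedupe are hypothetical names for this example, and the sketch works on plain strings rather than on Nutch's document class or index fields.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SignatureDedup {
    // Returns the MD5 digest of the text as a 32-character lowercase hex string.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keeps the first document seen for each signature, preserving order.
    static List<String> dedupe(List<String> contents) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String content : contents) {
            if (seen.add(md5Hex(content))) {
                unique.add(content);
            }
        }
        return unique;
    }
}
```

Note that an exact-content signature like this only catches byte-identical duplicates; pages that differ by a single character get different signatures.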


