Re: retrieving original html from database

2007-04-27 Thread songjue
You can try this command:  bin/nutch readseg (-dump ... | -get ...)
If you need an API rather than the command line, you may have to hack
segment/SegmentReader.java. I'm wondering about this too.

BTW, make sure you set the 'http.content.limit' property to -1 to avoid 
content truncation.
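
For reference, a rough sketch of how that property could be overridden. I am assuming the override goes in conf/nutch-site.xml (the thread only names the property, not the file), and that the limit is applied while fetching, so it would presumably help only future fetches and not content that was already truncated.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element -->
<property>
  <name>http.content.limit</name>
  <!-- -1 as suggested above, to avoid content truncation -->
  <value>-1</value>
</property>
```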
 



songjue
2007-04-27



From: Charlie Williams
Sent: 2007-04-25 22:43:12
To: nutch-dev@lucene.apache.org
Cc: 
Subject: retrieving original html from database

I have an index of pages from the web, a bit over 1 million. The fetch took
several weeks to complete, since it was mainly over a small set of domains.
Once the fetch and index were complete, we began working with the
retrieved text and found that the cached text is just that: flat text. Is
the original HTML cached anywhere that it can be accessed after the initial
fetch? It would be a shame to have to recrawl all those pages. We are using
Nutch 0.8.

Thanks for any help.

-Charlie


[jira] Updated: (NUTCH-471) Fix synchronization in NutchBean creation

2007-04-27 Thread Enis Soztutar (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-471:


Attachment: NutchBeanCreationSync_v2.patch

From http://www-128.ibm.com/developerworks/java/library/j-dcl.html

The bottom line is that double-checked locking, in whatever form, should not be 
used because you cannot guarantee that it will work on any JVM implementation. 
JSR-133 is addressing issues regarding the memory model, however, 
double-checked locking will not be supported by the new memory model. 
Therefore, you have two options:
* Accept the synchronization of a getInstance() method as shown in Listing 2.
* Forgo synchronization and use a static field.
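
To make the two options in the quoted passage concrete, here is a minimal sketch in plain Java. The class names are made up for illustration and have nothing to do with NutchBean or the attached patch; the second variant relies on the JVM's guarantee that static initializers run once, under the class-initialization lock.

```java
/** Option 1: accept synchronization on every getInstance() call. */
final class SynchronizedHolder {
    private static SynchronizedHolder instance;

    private SynchronizedHolder() {}

    // The lock is taken on each call, even after the instance exists.
    static synchronized SynchronizedHolder getInstance() {
        if (instance == null) {
            instance = new SynchronizedHolder();
        }
        return instance;
    }
}

/** Option 2: forgo synchronization and use a static field. */
final class StaticFieldHolder {
    // Created once during class initialization, which the JVM serializes.
    private static final StaticFieldHolder INSTANCE = new StaticFieldHolder();

    private StaticFieldHolder() {}

    // No lock needed: readers only ever see the fully built instance.
    static StaticFieldHolder getInstance() {
        return INSTANCE;
    }
}

final class SingletonOptionsDemo {
    public static void main(String[] args) {
        System.out.println(SynchronizedHolder.getInstance() == SynchronizedHolder.getInstance());
        System.out.println(StaticFieldHolder.getInstance() == StaticFieldHolder.getInstance());
    }
}
```

The static-field variant trades lazy creation for a lock-free read path, which is roughly the same trade-off as building the object once at startup.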

We don't want to compromise performance in NutchBean.get(), so synchronization 
is not a solution. Thus, as Sami suggested, I have written a 
ServletContextListener and added the NutchBean construction code there, and 
modified web.xml to register the event listener class. In addition, the 
Configuration object is initialized and cached by NutchConfiguration during 
servlet initialization, so we avoid the same problem in 
NutchConfiguration.get(). 

 I have tested the implementation and it seems OK. 


 Fix synchronization in NutchBean creation
 -

 Key: NUTCH-471
 URL: https://issues.apache.org/jira/browse/NUTCH-471
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
 Fix For: 1.0.0

 Attachments: NutchBeanCreationSync_v1.patch, 
 NutchBeanCreationSync_v2.patch


 NutchBean is created and then cached in the servlet context. But 
 NutchBean.get(ServletContext app, Configuration conf) is not synchronized, 
 so more than one instance of the bean (and 
 DistributedSearch$Client) can be created if the servlet container is 
 accessed rapidly during startup. 
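
 The race described above is a check-then-act on a shared cache. Below is a small, self-contained sketch of one way to make that step atomic. It is not the attached patch: Bean and CONTEXT are hypothetical stand-ins (a ConcurrentHashMap plays the role of the servlet context's attribute map), and computeIfAbsent is swapped in for whatever the real fix does.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

final class BeanCacheDemo {
    /** Hypothetical stand-in for an expensive bean; counts how often it is built. */
    static final class Bean {
        static final AtomicInteger CREATED = new AtomicInteger();
        Bean() { CREATED.incrementAndGet(); }
    }

    // Stand-in for the servlet context's attributes (an assumption for this sketch).
    static final ConcurrentHashMap<String, Bean> CONTEXT = new ConcurrentHashMap<>();

    // computeIfAbsent runs the factory at most once per key, so the
    // "is it cached? if not, create it" step cannot interleave between threads.
    static Bean get() {
        return CONTEXT.computeIfAbsent("bean", key -> new Bean());
    }

    /** Starts the given number of threads that call get() at once; returns beans built so far. */
    static int hammer(int threads) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // line all workers up to mimic rapid startup access
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                get();
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return Bean.CREATED.get();
    }

    public static void main(String[] args) {
        System.out.println("beans created: " + hammer(16));
    }
}
```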

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: retrieving original html from database

2007-04-27 Thread Briggs

If you need an API for getting the content, can't you just look at
cachedContent.jsp in the demo search application?  It shows how
to retrieve the original text/HTML that is stored within the segments.

Perhaps I am missing something.





--
Conscious decisions by conscious minds are what make reality real


[jira] Commented: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-27 Thread Andrzej Bialecki (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492386 ]

Andrzej Bialecki  commented on NUTCH-468:
-

+1. I'm writing a scoring plugin now where it's impossible to correctly create 
the adjust value without this change.

 Scoring filter should distribute score to all outlinks at once
 --

 Key: NUTCH-468
 URL: https://issues.apache.org/jira/browse/NUTCH-468
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: scoring-v2.patch, scoring.patch


 Currently ScoringFilter.distributeScoreToOutlink, as its name implies, takes 
 only a single outlink and works on that. I would suggest that we change it to 
 distributeScoreToOutlink_s_ so that it would take all the outlinks of a page 
 at once. This has several advantages:
 1) A ScoringFilter plugin returns a single adjust datum to set its score 
 instead of returning several.
 2) A ScoringFilter plugin can change the score of the original page (via 
 adjust datum) even if there are no outlinks. This is useful if you have a 
 ScoringFilter plugin that, say, scores pages based on content instead of 
 outlinks.
 3) Since the ScoringFilter plugin receives all outlinks at once, it can make 
 better decisions on how to distribute the score. For example, right now it is 
 not possible to create a plugin that always distributes exactly a page's 
 'cash' to outlinks (that is, if a page has score 5, it will always distribute 
 exactly 5 points to its outlinks no matter what the internal/external factors 
 are) if internal/external score factors are not 1.
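
 To illustrate point 3, here is a rough sketch of what becomes possible once a filter sees all outlinks together: it can normalize the weights so the shares always add up to the page's score. Everything here (Link, distribute, the factor parameters) is a hypothetical stand-in and not the ScoringFilter API or the attached patch.

```java
import java.util.Arrays;
import java.util.List;

final class OutlinkScoring {
    /** Hypothetical stand-in for an outlink; not Nutch's Outlink class. */
    static final class Link {
        final String url;
        final boolean internal;

        Link(String url, boolean internal) {
            this.url = url;
            this.internal = internal;
        }
    }

    /**
     * Splits pageScore across all links in proportion to their factor.
     * Because every link is visible at once, the weights can be normalized
     * so the shares sum to pageScore whatever the factors are.
     */
    static double[] distribute(double pageScore, List<Link> links,
                               double internalFactor, double externalFactor) {
        double[] shares = new double[links.size()];
        double totalWeight = 0.0;
        for (Link link : links) {
            totalWeight += link.internal ? internalFactor : externalFactor;
        }
        if (totalWeight == 0.0) {
            return shares; // no outlinks (or all-zero factors): nothing to hand out
        }
        for (int i = 0; i < shares.length; i++) {
            double weight = links.get(i).internal ? internalFactor : externalFactor;
            shares[i] = pageScore * weight / totalWeight;
        }
        return shares;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
                new Link("http://example.com/a", true),
                new Link("http://example.com/b", true),
                new Link("http://example.org/c", false));
        System.out.println(Arrays.toString(distribute(5.0, links, 0.5, 2.0)));
    }
}
```

 With a one-outlink-at-a-time call, the total weight is unknown when each share is computed, which is why this normalization cannot be done there.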




[jira] Created: (NUTCH-476) Would like to add a field to the document class for its MD5 signature

2007-04-27 Thread Linh Pham (JIRA)
Would like to add a field to the document class for its MD5 signature 
--

 Key: NUTCH-476
 URL: https://issues.apache.org/jira/browse/NUTCH-476
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
 Environment: all
Reporter: Linh Pham
Priority: Minor


If an MD5 signature were calculated while indexing a file and stored along 
with the document by default,
it could then be used to remove duplicates from the results on retrieval.
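
A minimal sketch of the idea, using only the JDK: compute an MD5 hex signature per document and keep the first result for each signature. The names (Md5Dedup, md5Hex, dedupe) are made up for illustration and plain strings stand in for documents; this is not the indexer's document class. Note that it would only catch byte-identical content.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Md5Dedup {
    /** Returns the MD5 digest of the content as 32 lowercase hex characters. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Keeps the first result for each signature, preserving result order. */
    static List<String> dedupe(List<String> results) {
        Map<String, String> bySignature = new LinkedHashMap<>();
        for (String content : results) {
            bySignature.putIfAbsent(md5Hex(content), content);
        }
        return new ArrayList<>(bySignature.values());
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("page A", "page B", "page A");
        System.out.println(dedupe(hits));
    }
}
```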

