Re: Duplicate Content Issues

2006-02-28 Thread Jérôme Charron
> How to avoid duplicate content? You can use the org.apache.nutch.crawl.TextProfileSignature implementation instead of the default MD5Signature or provide your own Signature implementation. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
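
For readers following the suggestion above, a minimal sketch of how the signature implementation might be selected in nutch-site.xml. The property name `db.signature.class` is an assumption taken from the 0.8-era nutch-default.xml and is not stated in this message; check the nutch-default.xml shipped with your version before relying on it.

```xml
<!-- Hypothetical nutch-site.xml fragment; property name is an assumption, see note above. -->
<property>
  <name>db.signature.class</name>
  <!-- Class named in the message; the default is reported to be MD5Signature. -->
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```

A custom implementation, as the message suggests, would be referenced the same way by its fully qualified class name.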

Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread John X
On Tue, Feb 28, 2006 at 09:55:18AM -0500, Richard Braman wrote: > Thanks for the help. I don't know what happened, but it is working now. > Did any other contributors read what I sent about parsing PDFs? > I don't think Nutch is capable of this, based on the text stripper code > in parse pdf > >

truncation despite 0

2006-02-28 Thread Richard Braman
I am still getting content truncated even though I set the size to 0 (no truncate) for http, ftp and file. Some of them are getting truncated 2602 bytes. Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http://www.taxcodesoftware.org Free Open So

Duplicate Content Issues

2006-02-28 Thread Jack Tang
Hi, How to avoid duplicate content? 1. Mirror sites: 1 website, 2 domains. 2. Confusing the bot: dynamic URLs. As robots find dynamic content, the site may be returning a different URL with the same content… 3. Print-friendly pages? Will Nutch enhance the dedup code? /Jack -- Keep Discovering ..

RE: Index aborted crawl.

2006-02-28 Thread Richard Braman
Jerome and Jeff, thanks for the help :) I found the answers in the wiki FAQ, to recover an aborted fetch, which has insightful It also mentions you can "index what was already crawled": "You should be able to index the part of the segment for crawling which is allready fetched. " I tried the com

RE: PDF Parse Error

2006-02-28 Thread Richard Braman
I set it to 0; there are some big PDFs on the sites I am crawling. Thanks Jeff. -Original Message- From: Jeff Ritchie [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 4:37 PM To: nutch-dev@lucene.apache.org Subject: Re: PDF Parse Error In nutch-site.xml Set it to something lik

FW: Index aborted crawl.

2006-02-28 Thread Richard Braman
I had to abort a crawl mid-crawl (after 2 days of crawling) because I realized I had an error in my filter. I know at least 6 segments were fetched. I tried the command bin/nutch index indexes crawled/linkdb crawled/segments/* but it failed. I would like to review the results of the crawl, but if

FW: pdf to xml

2006-02-28 Thread Richard Braman
-Original Message- From: Mark D. Anderson [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 4:40 PM To: [EMAIL PROTECTED] Subject: Re: pdf to xml well, i began to dislike that particular xml format, so started drafting up a real spec: http://discerning.com/hacks/docutils/pdf

Re: PDF Parse Error

2006-02-28 Thread Jeff Ritchie
In nutch-site.xml Set it to something like http.content.limit 655360 Jeff. Richard Braman wrote: I get the following errors regarding pdf: 060228 160518 fetch okay, but can't parse http://taxpros.marylandtaxes.com/publications/revenews/archives/spr05_hi .pdf, reason: failed(2,202): Conten
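
The property in the message above lost its XML tags in the archive. A minimal sketch of how it would probably look in nutch-site.xml, assuming the usual property/name/value layout of Nutch configuration files; the property name and value are the ones given in the message.

```xml
<!-- Sketch only: assumes the standard <property> layout of nutch-site.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- 655360 bytes, the value suggested in the message, so that larger PDFs
       are less likely to be cut off before the PDF parser sees them. -->
  <value>655360</value>
</property>
```

This matches the error quoted in the same message, where the parser rejects a PDF because its content was truncated.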

PDF Parse Error

2006-02-28 Thread Richard Braman
I get the following errors regarding pdf: 060228 160518 fetch okay, but can't parse http://taxpros.marylandtaxes.com/publications/revenews/archives/spr05_hi .pdf, reason: failed(2,202): Content truncated at 66005 bytes. Parser can't handle incomplete pdf file. 060228 160354 fetch okay, but can'

Re: Release Planning

2006-02-28 Thread Doug Cutting
Nutch developer wrote: What is the estimated date for a stable version of 0.8? I'm hoping to have a stable release of Hadoop by April 15th. This should substantially stabilize Nutch. So a 0.8 release of Nutch should probably follow shortly thereafter. By the way: What are the criteria f

Re: OPIC score calculation issues

2006-02-28 Thread Doug Cutting
Andrzej Bialecki wrote: * CrawlDBReducer (used by CrawlDB.update()) collects all CrawlDatum-s from crawl_parse with the same URL, which means that we get: * the original CrawlDatum * (optionally a CrawlDatum that contains just a Signature) * all CrawlDatum.LINKED entries pointing to ou

[jira] Updated: (NUTCH-218) need DOAP file for Nutch

2006-02-28 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-218?page=all ] Chris A. Mattmann updated NUTCH-218: Attachment: doap_Nutch.rdf I generated this off the DOAP generator page. Feel free to use it, or not. > need DOAP file for Nutch > -

[jira] Assigned: (NUTCH-218) need DOAP file for Nutch

2006-02-28 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-218?page=all ] Jerome Charron reassigned NUTCH-218: Assign To: Jerome Charron > need DOAP file for Nutch > > > Key: NUTCH-218 > URL: http://issues.apache.org/jira

[jira] Created: (NUTCH-218) need DOAP file for Nutch

2006-02-28 Thread Doug Cutting (JIRA)
need DOAP file for Nutch Key: NUTCH-218 URL: http://issues.apache.org/jira/browse/NUTCH-218 Project: Nutch Type: Task Reporter: Doug Cutting Can someone please draft a DOAP file for Nutch, so that we're listed at http://projects.apache

FW: Index aborted crawl.

2006-02-28 Thread Richard Braman
-Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 4:14 AM To: nutch-user@lucene.apache.org Subject: Index aborted crawl. I had to abort a crawl mid-crawl (after 2 days of crawling) because I realized I had an error in my filter. I know at

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
Thanks for the help. I don't know what happened, but it is working now. Did any other contributors read what I sent about parsing PDFs? I don't think Nutch is capable of this, based on the text stripper code in parse pdf http://64.233.179.104/search?q=cache:QOwcLFXNw5oJ:www.irs.gov/pub/irs-pd f/

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
I don't know; it seems to be working now. -Original Message- From: Jérôme Charron [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 8:46 AM To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: Nutch Parsing PDFs, and general PDF extraction > Putting the wellformed ver

Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Jérôme Charron
> Putting the well-formed version of the plugin code you provided generated > the following exception: Is the nutch-extensionpoints plugin activated?

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
Putting the well-formed version of the plugin code you provided generated the following exception: 060228 083159 SEVERE org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.parse.Parser does not exist. Exception in thread "main" java.lang.ExceptionInInitializerError

Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread YourSoft
Richard Braman wrote: No, you should add it to "plugin.includes" (in nutch-site.xml), e.g.: plugin.includes protocol-http|urlfilter-(regex|prefix)|parse-(text|html|pdf) Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. By def
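
The example in the message above lost its XML tags in the archive. A hedged reconstruction, assuming the usual property/name/value/description layout of nutch-site.xml; the name, value, and description text come from the message, and the truncated remainder of the description is left out.

```xml
<!-- Sketch only: reconstructed from the flattened text of the message. -->
<property>
  <name>plugin.includes</name>
  <!-- parse-(text|html|pdf) is what enables the PDF parser plugin. -->
  <value>protocol-http|urlfilter-(regex|prefix)|parse-(text|html|pdf)</value>
  <description>Regular expression naming plugin directory names to include.
  Any plugin not matching this expression is excluded.</description>
</property>
```

The question elsewhere in this thread about nutch-extensionpoints suggests that plugin may also need to match this expression when the "extension point ... does not exist" exception appears; that reading is an assumption, not something this message states.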

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
Should I add this to nutch site: -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 7:58 AM To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED] Subject: RE: Nutch Parsing P

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
I don't have the plugin configured; what's the code for doing that? -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 7:52 AM To: nutch-dev@lucene.apache.org Subject: RE: Nutch Parsing PDFs, and general PDF extraction 060228 045534 fetch

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
060228 045534 fetch okay, but can't parse http://www.irs.gov/pub/irs-pdf/f1040sab.pdf?portlet=3, reason: failed(2,203): Content-Type not text/html: application/pdf -Original Message- From: YourSoft [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 8:00 AM To: nutch-dev@lucene.apa

Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Jérôme Charron
> ge-summary.html> org.apache.nutch.parse.pdf (Nutch 0.7.1 API) > but I don't see it in the source of 0.7.1 downloaded > > I see it on cvs here: > http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-pdf/s > rc/

Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread YourSoft
Richard Braman wrote: but my nutch doesn't seem to run the pdf parse class as my log file shows it fetching pdfs, but saying nutch is unable to parse content type application/pdf Can you send the complete error message?

Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
I see that there is a class for parsing PDFs in Nutch using PDFBox, org.apache.nutch.parse.pdf (Nutch 0.7.1 API), but I don't see it in the downloaded 0.7.1 source. I see it on CVS here: http://cvs.sourcefo