I am using nutch 0.9 dev, latest from svn.
I have been running a crawl successfully for about a week now. I have over 100K
documents in my index. I have 21 segments. I just finished a segment and
when going to updatedb I get an error like this:
CrawlDb update: starting
CrawlDb update: db:
.
-Original Message-
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 05, 2006 8:38 PM
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: problem with hadoop
No matter what command I run, I get this error. index, updatedb, addurl,
every class
If I wanted to submit a URL to nutch and see what keywords it scored on,
how would I do that?
Richard
[
http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372310 ]
Richard Braman commented on NUTCH-220:
--
I upgraded nutch 0.8 trunk to PDFBox HEAD.
The NullPointerException seems to be resolved by upgrading nutch to PDFBox
0.7.3
[
http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12371887 ]
Richard Braman commented on NUTCH-220:
--
Here is an example of the error from my log file. It seems it was fixed with
the latest PDFBox, per Ben Litchfield, developer
Maybe the HTTP POST functionality should be moved somewhere else, as
it might certainly prove useful for other
things.
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 09, 2006 8:43 PM
To: nutch-dev@lucene.apache.org
I too have noticed menu text appearing in the search results.
-Original Message-
From: jamie [mailto:[EMAIL PROTECTED]
Sent: Friday, March 10, 2006 4:39 AM
To: nutch-dev@lucene.apache.org
Subject: quality of search text
Hi everyone,
I don't know if we're doing something wrong, but the
please
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, March 10, 2006 1:57 PM
To: nutch-dev@lucene.apache.org
Subject: Re: quality of search text
Richard Braman wrote:
I too have noticed menu text appearing in the search results.
The proper place
.
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, March 10, 2006 2:51 PM
To: nutch-dev@lucene.apache.org
Subject: Re: quality of search text
Richard Braman wrote:
Here is a potential algorithm:
Look first to Meta Description, if none exists
Look for continuous
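The first step of the algorithm quoted above (prefer the meta description, fall back to something else when none exists) can be sketched as follows. This is only an illustration: the class name `SummarySourceSketch`, the method `chooseSummary`, and the idea of passing the fallback text in as a parameter are assumptions, since the rest of the proposed algorithm is cut off in the quoted message. It does not use any Nutch API.

```java
final class SummarySourceSketch {
    /**
     * Returns the meta description when it is present and non-blank;
     * otherwise returns the caller-supplied fallback text (hypothetical
     * stand-in for whatever the truncated second step would select).
     */
    static String chooseSummary(String metaDescription, String fallbackText) {
        if (metaDescription != null && !metaDescription.trim().isEmpty()) {
            return metaDescription.trim();
        }
        return fallbackText == null ? "" : fallbackText.trim();
    }
}
```

Keeping the fallback as a plain parameter leaves the choice of the second source open, so menu text can be filtered out before it ever reaches this step.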
+1. No need for 2 tutorials. The only discrepancy I saw was that the
invertlinks command is not in 0.7. I updated the wiki to note that the
command only applies to 0.8.
-Original Message-
From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 08, 2006 9:30 AM
To:
No that sounds good to me. I also think that the whole web vs. crawl
needs to be better explained. I will write a bug/patch for it tomorrow.
-Original Message-
From: Piotr Kosiorowski [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 07, 2006 1:13 AM
To: nutch-dev@lucene.apache.org
: Richard Braman
When trying to invertlinks before indexing, following the tutorial, I
get the following error.
[EMAIL PROTECTED] /cygdrive/t/nutch-0.7.1
$ bin/nutch invertlink taxcrawl/db/ -dir taxcrawl/segments/*
run java in C:\Program Files\Java\jdk1.5.0_04
Exception in thread main
[
http://issues.apache.org/jira/browse/NUTCH-222?page=comments#action_12368866 ]
Richard Braman commented on NUTCH-222:
--
When I look at the nutch script from my 0.7.1 distribution there is no
invertlinks class. Is this something that's only found
...)
This is a new command since nutch 0.8. Please check that you have the
latest nutch 0.8 (nightly) distribution and are not using a nutch 0.7
script to run a nutch 0.8 command.
Stefan
On 04.03.2006 at 17:24, Richard Braman wrote:
That was a typo. Same thing happens with invertlinks.
$ bin/nutch invertlinks
Another compelling reason for better PDF parsing is that it should enable
in-document highlighting sometime in the future.
Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org/
Free Open Source Tax Software
To change the skin, go to TOMCAT_HOME/webapps/ROOT/en/ (ROOT is where the
nutch web app should be installed if you did it right)
and edit search.html, help, faq, etc. Also edit the header and the
footer in the include directory. It's pretty confusing. I had to spend
hours looking through the mailing list
PM
To: Richard Braman
Cc: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org;
[EMAIL PROTECTED]
Subject: Re: [PDFBox-user] PDF Parse Error
I believe these errors are due to a parsing bug in PDFBox that has been
fixed since the 0.7.2 release. Please give the nightly build (should be
a drop
[mailto:[EMAIL PROTECTED]
Sent: Thursday, March 02, 2006 4:46 PM
To: Richard Braman
Cc: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: Nutch Parsing PDFs, and general PDF extraction
To chime in and give my comments.
It is true that better search engine results could be obtained
https://issues.apache.org/jira/browse/NUTCH-219
-Original Message-
From: Jérôme Charron [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 02, 2006 5:41 AM
To: nutch-dev@lucene.apache.org
Subject: Re: PDF Parse Error
Yes, but please do not cross-post - many of us are subscribed to both
, the NPE should be fixed.
Ben
Richard Braman wrote:
Hi Ben,
We actually got to the bottom of all of them except for 1... The
content truncation was due to an inconsistency bug in the nutch config.
The no permission to extract text is actually true, the author, the NC
Department of Revenue put
: SEVERE error
logged. Exiting fetcher.
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)
Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http
Message-
From: John X [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 01, 2006 2:12 AM
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction
On Tue, Feb 28, 2006 at 09:55:18AM -0500, Richard Braman wrote:
thanks
use.
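// Formats the given date as text; returns null when no date is supplied.
private String formatDate(Calendar date) {
    String retval = null;
    if (date != null) {
        // The no-argument constructor uses the default locale's date/time pattern.
        SimpleDateFormat formatter = new SimpleDateFormat();
        retval = formatter.format(date.getTime());
    }
    return retval;
}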
}
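The formatDate helper above depends on the default locale and time zone, so the same PDF date can be rendered differently on different machines. A minimal sketch of a variant with a fixed output format is shown below. The class name `DateFormatSketch`, the ISO-style pattern, and the choice of UTC are all assumptions made for illustration and are not taken from the original message.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

final class DateFormatSketch {
    /** Formats a date as UTC text in a fixed pattern, or returns null for a null date. */
    static String formatDate(Calendar date) {
        if (date == null) {
            return null;
        }
        // Explicit pattern, locale and time zone (assumed values) make the output deterministic.
        SimpleDateFormat formatter =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        return formatter.format(date.getTime());
    }
}
```

A fixed pattern also makes the resulting string sortable, which can help if the value ends up stored as an index field.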
Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http
@lucene.apache.org
Subject: Re: Nutch Parsing PDFs, and general PDF extraction
Richard Braman wrote:
but my nutch doesn't seem to run the pdf parse class as my log file
shows it fetching pdfs, but saying nutch is unable to parse content
type application/pdf
Can you send the complete error message?
I don't have the plugin configured, what's the code for doing that?
-Original Message-
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 28, 2006 7:52 AM
To: nutch-dev@lucene.apache.org
Subject: RE: Nutch Parsing PDFs, and general PDF extraction
060228 045534
/pdf
pathSuffix=/
/extension
/plugin
-Original Message-
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 28, 2006 7:58 AM
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: Nutch Parsing PDFs, and general PDF extraction
I don't
, and general PDF extraction
Richard Braman wrote:
No, you should add it to plugin.includes (in nutch-site.xml), e.g.:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-(regex|prefix)|parse-(text|html|pdf)</value>
<description>Regular
expression naming plugin directory names to
include
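Since the plugin.includes value quoted above is described as a regular expression naming plugin directory names, it can be checked offline to see which plugins it would select. The sketch below assumes the expression must match the whole directory name; that is an assumption about how Nutch applies it, and `PluginIncludesSketch` and `isIncluded` are hypothetical names, not part of Nutch.

```java
import java.util.regex.Pattern;

final class PluginIncludesSketch {
    /** Returns true when the whole plugin directory name matches the plugin.includes expression. */
    static boolean isIncluded(String pluginIncludes, String pluginName) {
        // matches() requires a full match, so "protocol-http" does not select "protocol-httpclient".
        return Pattern.compile(pluginIncludes).matcher(pluginName).matches();
    }
}
```

With the value from the message, `parse-pdf` is selected while a name such as `parse-msword` is not, which is consistent with PDFs only being parsed once `pdf` is added to the `parse-(...)` group.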
I don't know, it seems to be working now.
-Original Message-
From: Jérôme Charron [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 28, 2006 8:46 AM
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction
Putting the wellformed
-Original Message-
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 28, 2006 4:14 AM
To: nutch-user@lucene.apache.org
Subject: Index aborted crawl.
I had to abort a crawl midcrawl (after 2 days of crawling, because I
realized I had an error in my filter). I know
(2,0): Can't be handled as pdf document.
java.io.IOException: You do not have permission to extract text
I have a number of errors like this in my log, mostly the content
truncated one.
The thing is, these files all open fine in Acrobat.
Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002
message to 2 open source PDF projects (PDFBox and iText). If
there is interest from nutch developers in the responses I have
received, and in how a collaborative solution may be reached, let me know.
-Original Message-
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Tuesday
of the outcome. This
might, however, take some time.
I will keep you updated.
Best regards,
Tamir
Richard Braman wrote:
I read your final report, as well as Christian's report on converting
PDF to XML. I am actually quite interested in these developments, and
would be willing to contribute time
32 matches