problem with hadoop

2006-09-05 Thread Richard Braman
I am using nutch 0.9 dev, latest from svn. I have been running a crawl successfully for about a week now. I have over 100K documents in my index. I have 21 segments. I just finished a segment and when going to updatedb I get an error like this: CrawlDb update: starting CrawlDb update: db:

RE: problem with hadoop

2006-09-05 Thread Richard Braman
. -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 05, 2006 8:38 PM To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED] Subject: RE: problem with hadoop No matter what command I run, I get this error. index, updatedb, addurl, every class

Search for keywords by url

2006-04-15 Thread Richard Braman
If I wanted to submit a url to nutch and see what keywords it scored on, how would I do that? Richard

[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

2006-03-29 Thread Richard Braman (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372310 ] Richard Braman commented on NUTCH-220: -- I upgraded nutch .8 trunk to PDFBox HEAD. The NullPointer exception seems to be resolved by upgrading nutch to PDFBox 0.7.3

[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

2006-03-25 Thread Richard Braman (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12371887 ] Richard Braman commented on NUTCH-220: -- Here is an example of the error from my log file. It seems it was fixed with the latest PDFBox per Ben Litchfield, developer

RE: Nutch 0.7.2

2006-03-10 Thread Richard Braman
Maybe the http post functionality should be moved to somewhere else, as certainly the http post functionality might prove useful for other things. -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, March 09, 2006 8:43 PM To: nutch-dev@lucene.apache.org

RE: quality of search text

2006-03-10 Thread Richard Braman
I too have noticed menu text appearing in the search results. -Original Message- From: jamie [mailto:[EMAIL PROTECTED] Sent: Friday, March 10, 2006 4:39 AM To: nutch-dev@lucene.apache.org Subject: quality of search text hi everyone i dont know if we're doing something wrong, but the

RE: quality of search text

2006-03-10 Thread Richard Braman
please -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Friday, March 10, 2006 1:57 PM To: nutch-dev@lucene.apache.org Subject: Re: quality of search text Richard Braman wrote: I too have noticed menu text appearing in the search results. The proper place

RE: quality of search text

2006-03-10 Thread Richard Braman
. -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Friday, March 10, 2006 2:51 PM To: nutch-dev@lucene.apache.org Subject: Re: quality of search text Richard Braman wrote: Here is a potential algorithm: Look first to Meta Description, if none exists Look for continuous

RE: Tutorial

2006-03-08 Thread Richard Braman
+1. No need for 2 tutorials. The only discrepancy I saw was the invertlinks command not being in 0.7. I updated the wiki to note that that command only applies to 0.8 -Original Message- From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 08, 2006 9:30 AM To:

RE: Nutch web site

2006-03-06 Thread Richard Braman
No that sounds good to me. I also think that the whole web vs. crawl needs to be better explained. I will write a bug/patch for it tomorrow. -Original Message- From: Piotr Kosiorowski [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 07, 2006 1:13 AM To: nutch-dev@lucene.apache.org

RE: [jira] Closed: (NUTCH-222) Exception in thread main java.lang.NoClassDefFoundError: invertlink

2006-03-04 Thread Richard Braman
: Richard Braman When trying to invertlinks before indexing, following the tutorial, I get the following error. [EMAIL PROTECTED] /cygdrive/t/nutch-0.7.1 $ bin/nutch invertlink taxcrawl/db/ -dir taxcrawl/segments/* run java in C:\Program Files\Java\jdk1.5.0_04 Exception in thread main

[jira] Commented: (NUTCH-222) Exception in thread main java.lang.NoClassDefFoundError: invertlink

2006-03-04 Thread Richard Braman (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-222?page=comments#action_12368866 ] Richard Braman commented on NUTCH-222: -- When I look at the nutch script from my 0.7.1 distribution there is no invertlinks class. Is this something that's only found

RE: [jira] Closed: (NUTCH-222) Exception in thread main java.lang.NoClassDefFoundError: invertlink

2006-03-04 Thread Richard Braman
...) This is a new command since nutch 0.8. Please check that you have the latest nutch 0.8 (nightly) distribution and do not use a nutch 0.7 script to run a nutch 0.8 command. Stefan On 04.03.2006 at 17:24, Richard Braman wrote: That was a typo. Same thing happens with invertlinks. $ bin/nutch invertlinks

in document highlighting

2006-03-04 Thread Richard Braman
Another compelling reason for better pdf parsing is that it should enable in-document highlighting sometime in the future. Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http://www.taxcodesoftware.org/ Free Open Source Tax Software

RE: compile search.jsp

2006-03-04 Thread Richard Braman
To change the skin, go to TOMCAT_HOME/webapps/ROOT (this is where nutch web should be installed if you did it right)/en/ and edit search.html, help, faq, etc. Also edit the header and the footer in the include directory. It's pretty confusing. I had to spend hours looking through the mailing list

RE: [PDFBox-user] PDF Parse Error

2006-03-02 Thread Richard Braman
PM To: Richard Braman Cc: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: [PDFBox-user] PDF Parse Error I believe these errors are due to a parsing bug in PDFBox that has been fixed since the 0.7.2 release. Please give the nightly build (should be a drop

RE: Nutch Parsing PDFs, and general PDF extraction

2006-03-02 Thread Richard Braman
[mailto:[EMAIL PROTECTED] Sent: Thursday, March 02, 2006 4:46 PM To: Richard Braman Cc: nutch-dev@lucene.apache.org; [EMAIL PROTECTED] Subject: RE: Nutch Parsing PDFs, and general PDF extraction To chime in and give my comments. It is true that better search engine results could be obtained

RE: PDF Parse Error

2006-03-02 Thread Richard Braman
https://issues.apache.org/jira/browse/NUTCH-219 -Original Message- From: Jérôme Charron [mailto:[EMAIL PROTECTED] Sent: Thursday, March 02, 2006 5:41 AM To: nutch-dev@lucene.apache.org Subject: Re: PDF Parse Error Yes, but please do not cross-post - many of us are subscribed to both

[jira] Created: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

2006-03-02 Thread Richard Braman (JIRA)
, the NPE should be fixed. Ben Richard Braman wrote: Hi Ben, We actually got to the bottom of all of them except for 1... The content truncation was due to an inconsistency bug in nutch config. The no permission to extract text is actually true, the author, the NC Department of Revenue put

OutOfMemoryError/Restarting Crawl/Indexing what has already been crawled

2006-03-02 Thread Richard Braman
: SEVERE error logged. Exiting fetcher. at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354) at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488) at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140) Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http

RE: Nutch Parsing PDFs, and general PDF extraction

2006-03-01 Thread Richard Braman
Message- From: John X [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 01, 2006 2:12 AM To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: Nutch Parsing PDFs, and general PDF extraction On Tue, Feb 28, 2006 at 09:55:18AM -0500, Richard Braman wrote: thanks

Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
use. private String formatDate(Calendar date) { String retval = null; if(date != null) { SimpleDateFormat formatter = new SimpleDateFormat(); retval = formatter.format(date.getTime()); } return retval; } } Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http
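The formatDate helper quoted in the excerpt above is flattened onto one line. A minimal sketch of the same logic, laid out so it can be compiled on its own, might look like this. The class name FormatDateSketch, the static modifier, and the main method are assumptions added for illustration; they are not part of the original message, where the method is a private instance method of a class that is cut off in the excerpt.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;

// Hypothetical wrapper class; the enclosing class in the original message is not shown.
public class FormatDateSketch {

    // Same logic as the excerpt: returns null for a null Calendar,
    // otherwise formats the date with the default locale pattern.
    static String formatDate(Calendar date) {
        String retval = null;
        if (date != null) {
            // No-argument constructor: uses the default pattern for the default locale.
            SimpleDateFormat formatter = new SimpleDateFormat();
            retval = formatter.format(date.getTime());
        }
        return retval;
    }

    public static void main(String[] args) {
        // Null input yields null rather than throwing.
        System.out.println(formatDate(null));
        // The exact text printed here depends on the JVM's default locale.
        System.out.println(formatDate(new GregorianCalendar(2006, Calendar.FEBRUARY, 28)));
    }
}
```

Because the no-argument SimpleDateFormat follows the default locale, the formatted string differs between machines; passing an explicit pattern and locale would make the output stable if that matters for indexing.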

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
@lucene.apache.org Subject: Re: Nutch Parsing PDFs, and general PDF extraction Richard Braman wrote: but my nutch doesn't seem to run the pdf parse class as my log file shows it fetching pdfs, but saying nutch is unable to parse content type application/pdf Can you send the complete error message?

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
I don't have the plugin configured, what's the code for doing that? -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 7:52 AM To: nutch-dev@lucene.apache.org Subject: RE: Nutch Parsing PDFs, and general PDF extraction 060228 045534

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
/pdf pathSuffix=/ /extension /plugin -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 7:58 AM To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED] Subject: RE: Nutch Parsing PDFs, and general PDF extraction I don't

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
, and general PDF extraction Richard Braman wrote: No, you should add it to the plugin includes (in nutch-site.xml), e.g.: <property> <name>plugin.includes</name> <value>protocol-http|urlfilter-(regex|prefix)|parse-(text|html|pdf)</value> <description>Regular expression naming plugin directory names to include

RE: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
I don't know, it seems to be working now. -Original Message- From: Jérôme Charron [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 8:46 AM To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: Nutch Parsing PDFs, and general PDF extraction Putting the well-formed

FW: Index aborted crawl.

2006-02-28 Thread Richard Braman
-Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 28, 2006 4:14 AM To: nutch-user@lucene.apache.org Subject: Index aborted crawl. I had to abort a crawl mid-crawl (after 2 days of crawling because I realized I had an error in my filter). I know

PDF Parse Error

2006-02-28 Thread Richard Braman
(2,0): Can't be handled as pdf document. java.io.IOException: You do not have permission to extract text I have a number of errors like this in my log, mostly the content truncated one. The thing is, these files all open fine in Acrobat. Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002

FW: Good reading/research on PDF text extraction

2006-02-26 Thread Richard Braman
message to 2 Open source pdf projects (PDFBox and iText). If there is interest from nutch developers in what responses I have received, and how a collaborative solution may be reached, let me know. -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Tuesday

RE: FW: Good reading/research on PDF text extraction

2006-02-26 Thread Richard Braman
of the outcome. This might, however, take some time. I will keep you updated. Best regards, Tamir Richard Braman wrote: I read your final report, as well as Christian's report on converting PDF to XML. I am actually quite interested in these developments, and would be to contribute time