RE: Document Classification - indexing question

2007-05-08 Thread Armel T. Nene
Bastian, When trying to classify documents using the approach of dynamic classification, depending on the file type Nutch can take a while to parse the data. While working with Nutch I have encountered some null pointer exceptions due to parsing processes. This is due to a Hadoop configuration

Nutch ERROR parse.OutlinkExtractor - getOutlinks

2007-04-17 Thread Armel T. Nene
the application know what kind of url should be included in the url. Also, Nutch should not crash if the url in the outlink is not valid. Is there any other HTML parser in Nutch that I can try? Awaiting your kind reply. Regards, Armel === Armel T. Nene iDNA

Nutch java.io.exception

2007-04-10 Thread Armel T. Nene
) === Armel T. Nene iDNA Solutions LTD Tel: +44 (20) 7257 6124 Mobile: +44 (7886)950 483 Web: http://www.idna-solutions.com Blog: http://blog.idna-solutions.com

RE: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

2007-02-13 Thread Armel T. Nene
Dennis, I was wondering if this patch could fix my problem, which is, if not the same, very similar to this one. I am using Nutch 0.8.2-dev; I made a checkout from SVN a while ago but never updated it. I was able to crawl 1 xml files before with no error whatsoever. This is the following

Nutch error messages

2007-02-06 Thread Armel T. Nene
Hi guys, I wrote a parser for parsing proprietary file formats. The plugin used to work until recently. Now when I try to parse simple CAD files I get the following error messages: INFO fetcher.Fetcher - fetching

RE: Fetcher2

2007-01-25 Thread Armel T. Nene
Kauu, The URL for Fetcher2 is: https://issues.apache.org/jira/browse/NUTCH-339 Armel - Armel T. Nene iDNA Solutions Tel: +44 (207) 257 6124 Mobile: +44 (788) 695 0483 http://blog.idna-solutions.com -Original Message- From: kauu [mailto

Modified date in crawldb

2007-01-25 Thread Armel T. Nene
- Armel T. Nene iDNA Solutions Tel: +44 (207) 257 6124 Mobile: +44 (788) 695 0483 http://blog.idna-solutions.com

RE: Modified date in crawldb

2007-01-25 Thread Armel T. Nene
this function from the core code. I am also new to Nutch; if anything is wrong, please feel free to point it out. - Original Message - From: Armel T. Nene [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Thursday, January 25, 2007 7:52 PM Subject: Modified date in crawldb Hi guys, I am

threads-safe methods in Nutch

2007-01-25 Thread Armel T. Nene
, Armel - Armel T. Nene iDNA Solutions Tel: +44 (207) 257 6124 Mobile: +44 (788) 695 0483 http://blog.idna-solutions.com

RE: Fetcher2

2007-01-24 Thread Armel T. Nene
Chee, Can you make the code available through Jira? Thanks, Armel - Armel T. Nene iDNA Solutions Tel: +44 (207) 257 6124 Mobile: +44 (788) 695 0483 http://blog.idna-solutions.com -Original Message- From: chee wu [mailto:[EMAIL PROTECTED

How to modify crawldb values

2007-01-23 Thread Armel T. Nene
to your kind support. Armel - Armel T. Nene iDNA Solutions Tel: +44 (207) 257 6124 Mobile: +44 (788) 695 0483 http://blog.idna-solutions.com

RE: How to modify crawldb values

2007-01-23 Thread Armel T. Nene
Thanks for the reply. I'll try this, and if I encounter any problem I'll send another email. This will be a good feature to have and will probably keep the project from branching into different subprojects. Regards, Armel - Armel T. Nene iDNA

is crawldb format in Nutch 0.8 compatible with Nutch0.7

2007-01-23 Thread Armel T. Nene
iterate over the values contained in the crawldb using the Nutch 0.7 API; I should think this will fix the issue. So the question is: is Nutch 0.8 backward compatible with Nutch 0.7.2? Thanks, Armel - Armel T. Nene iDNA Solutions Tel: +44

java.lang.IllegalStateException

2007-01-19 Thread Armel T. Nene
. Thanks. Armel - Armel T. Nene iDNA Solutions Tel: +44 (207) 257 6124 Mobile: +44 (788) 695 0483 http://blog.idna-solutions.com

RE: Next Nutch release

2007-01-17 Thread Armel T. Nene
- Armel T. Nene iDNA Solutions Tel: +44 (207) 257 6124 Mobile: +44 (788) 695 0483 http://blog.idna-solutions.com -Original Message- From: Enis Soztutar [mailto:[EMAIL PROTECTED] Sent: 17 January 2007 15:39 To: nutch-dev@lucene.apache.org Subject: Re: Next Nutch

protocol-smb: a new protocol plugin for Windows Shares

2007-01-05 Thread Armel T. Nene
. Best regards, Armel T. Nene

Nutch site crawling

2006-12-07 Thread Armel T. Nene
Hi, Is it possible to let Nutch crawl a set of documents at a time? I have set up Nutch with the following options: topN 20, depth 2. Therefore I wanted Nutch to crawl my directory and go just as deep as 2 links from the root directory. Now the root directory itself contains more
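A rough way to picture how these two options interact, assuming (as the Nutch tutorial describes them) that depth is the number of generate/fetch rounds and topN caps the number of pages fetched in each round. The sketch below is a plain-Java simulation, not Nutch code: the class CrawlRoundSketch, its methods, and the file:/docs/ URLs are all hypothetical, and real Nutch selects the top URLs by score rather than by insertion order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation of round-based crawling; not part of Nutch.
class CrawlRoundSketch {

    /** Builds a link graph with one root directory that lists fileCount files. */
    static Map<String, List<String>> directoryWith(int fileCount) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < fileCount; i++) {
            files.add("file:/docs/f" + i + ".txt");
        }
        Map<String, List<String>> links = new HashMap<>();
        links.put("file:/docs/", files);
        return links;
    }

    /** Runs `depth` rounds, fetching at most `topN` pending URLs per round. */
    static List<String> crawl(Map<String, List<String>> links, String seed, int depth, int topN) {
        List<String> fetched = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        List<String> pending = new ArrayList<>();
        pending.add(seed);
        seen.add(seed);
        for (int round = 0; round < depth && !pending.isEmpty(); round++) {
            int n = Math.min(topN, pending.size());
            List<String> batch = new ArrayList<>(pending.subList(0, n));
            // URLs that did not make the cut stay pending for a later round.
            List<String> next = new ArrayList<>(pending.subList(n, pending.size()));
            for (String url : batch) {
                fetched.add(url);
                for (String out : links.getOrDefault(url, Collections.<String>emptyList())) {
                    if (seen.add(out)) {
                        next.add(out); // newly discovered outlink
                    }
                }
            }
            pending = next;
        }
        return fetched;
    }

    public static void main(String[] args) {
        // Round 1 fetches the directory listing, round 2 fetches 20 of its 30 files.
        List<String> fetched = crawl(directoryWith(30), "file:/docs/", 2, 20);
        System.out.println(fetched.size()); // prints 21
    }
}
```

Under these assumptions, a directory holding more than topN files is never fully covered in one crawl of depth 2; the remaining files would only be picked up by further rounds.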

Nutch Re-crawl same file over and over again

2006-12-06 Thread Armel T. Nene
Hi, I have set up Nutch to crawl my local filesystem. I set topN 20 and depth 2. But when Nutch re-crawls, it re-crawls the same files over and over again. The directory doesn't contain any other sub-directories; can someone let me know what might be the cause? There are more than 20 files in the

RE: Indexing and Re-crawling site

2006-12-05 Thread Armel T. Nene
Lukas, I was wondering about running Nutch as a Windows service. I was able to implement it as follows: 1. Create a Java program that will act as a Nutch launcher and re-crawler. 2. Download JavaService from http://javaservice.objectweb.org/ 3. Follow the tutorial to turn

Indexing and Re-crawling site

2006-11-28 Thread Armel T. Nene
Hi guys, I have a few questions regarding the way Nutch indexes and the best way a recrawl can be implemented. 1. Why does Nutch have to create a new index every time when indexing, when it could just merge it with the old existing index? I tried to change the value in the IndexMerger

RE: [jira] Created: (NUTCH-408) Plugin development documentation

2006-11-25 Thread Armel T. Nene
I agree with you that documentation is vital, not just for extending the current version but also for any plugins and patches created. I have spent almost two weeks trying to adapt Nutch to my project, but I spend more time reading code and trying to understand what it does before I can

RE: Nutch folder configuration

2006-11-21 Thread Armel T. Nene
Also, can Nutch be run as a Windows service? Let me know so that I don't waste my time trying to code something that won't work. -Original Message- From: Armel T. Nene [mailto:[EMAIL PROTECTED] Sent: 21 November 2006 21:56 To: nutch-dev@lucene.apache.org Subject: Nutch folder

RE: [jira] Commented: (NUTCH-185) XMLParser is configurable xml parser plugin.

2006-11-21 Thread Armel T. Nene
Rida, There is something I would like to clarify: when using a namespace and XPath to store content in the index, can this be seen as multi-field? For example, if we are storing customer name and customer address, which are declared in an XML configuration file, is that multi-field? Please
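To make the question concrete, here is a minimal sketch of mapping namespaced XPath expressions to named fields, using only the JDK's javax.xml.xpath API. It does not use the parse-xml plugin or its configuration format: the class XPathFieldSketch, the field names customerName and customerAddress, the prefix c, and the namespace urn:example:customer are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; not the XMLParser plugin's actual implementation.
class XPathFieldSketch {

    /** Evaluates each XPath against the document and returns field name -> text value. */
    static Map<String, String> extractFields(String xml, final String prefix, final String nsUri,
                                             Map<String, String> fieldToXPath) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed XPath steps to match
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String p) {
                    return prefix.equals(p) ? nsUri : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return nsUri.equals(uri) ? prefix : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return nsUri.equals(uri)
                            ? Collections.singletonList(prefix).iterator()
                            : Collections.<String>emptyIterator();
                }
            });

            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : fieldToXPath.entrySet()) {
                fields.put(e.getKey(), xpath.evaluate(e.getValue(), doc));
            }
            return fields;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not extract fields", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<c:customer xmlns:c=\"urn:example:customer\">"
                + "<c:name>Jane Doe</c:name><c:address>1 Main St</c:address></c:customer>";
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("customerName", "/c:customer/c:name");
        mapping.put("customerAddress", "/c:customer/c:address");
        System.out.println(extractFields(xml, "c", "urn:example:customer", mapping));
    }
}
```

Each entry of the returned map could then be added to the index as its own field, which is one reading of "multi-field" in the question above.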

RE: What's the status of Nutch-GUI?

2006-11-21 Thread Armel T. Nene
: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: 20 November 2006 23:30 To: nutch-dev@lucene.apache.org Subject: Re: What's the status of Nutch-GUI? Hi Armel, On 11/20/06 1:44 PM, Armel T. Nene [EMAIL PROTECTED] wrote: Hi Chris, I am trying to extend parse-xml to enable the creation of lucene

File Protocol

2006-11-15 Thread Armel T. Nene
I want to have Nutch crawl a filesystem and, if the content of the filesystem has changed since it was last crawled, fetch it again. I studied the code for the Adaptive Re-Fetch cycle but the patch is out of date as Nutch has implemented other features. Also, I don't want
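One simple way to approximate "changed since last crawl" on a local filesystem is to compare each file's last-modified timestamp with the time of the previous crawl. The sketch below uses only java.io.File and is independent of Nutch and of the adaptive re-fetch patch; the class ModifiedSinceSketch, its method names, and the /data/docs path are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; does not use any Nutch API.
class ModifiedSinceSketch {

    /** A file needs re-fetching if it was modified after the last fetch. */
    static boolean needsRefetch(long fileModifiedMillis, long lastFetchMillis) {
        return fileModifiedMillis > lastFetchMillis;
    }

    /** Recursively collects regular files under dir modified after lastCrawlMillis. */
    static List<File> changedSince(File dir, long lastCrawlMillis) {
        List<File> changed = new ArrayList<>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            return changed; // not a directory, or not readable
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                changed.addAll(changedSince(entry, lastCrawlMillis));
            } else if (needsRefetch(entry.lastModified(), lastCrawlMillis)) {
                changed.add(entry);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Assume the previous crawl finished 24 hours ago.
        long lastCrawl = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
        File root = new File(args.length > 0 ? args[0] : "/data/docs");
        for (File f : changedSince(root, lastCrawl)) {
            System.out.println("re-fetch: " + f.getPath());
        }
    }
}
```

Timestamps can miss changes (for example when a file is restored with its old modification time), so a content hash comparison would be a stricter alternative.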

RE: [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2006-11-12 Thread Armel T. Nene
Andrzej, the feature that I am after can be implemented by this patch if I just adapt it right. I am not sure of this but the patch seems a little bit old to be implemented in the latest release of Nutch 0.8.1. I want to implement a feature where the fetcher will fetch files but only add them if