Bastian,
When trying to classify documents using the dynamic classification
approach, Nutch can take a while to parse the data, depending on the file
type. While working with Nutch I have encountered some null pointer
exceptions due to the parsing process. This is due to a Hadoop configuration
the application know what kind of url should be included
in the url. Also, Nutch should not crash if the url in the outlink is not
valid. Is there any other HTML parser in Nutch that I can try?
Awaiting your kind reply.
Regards,
Armel
===
Armel T. Nene
iDNA Solutions LTD
Tel: +44 (20) 7257 6124
Mobile: +44 (7886)950 483
Web: http://www.idna-solutions.com
Blog: http://blog.idna-solutions.com
Dennis
I was wondering if this patch could fix my problem, which is, if not the
same, very similar to this one. I am using Nutch 0.8.2-dev; I made a
checkout from SVN a while ago but never updated it again. I was able to crawl
1 xml files before with no error whatsoever. This is the following
Hi guys,
I wrote a parser for parsing proprietary file formats. The plugin used to work
until recently. Now when I try to parse simple CAD files I get the following
error messages:
INFO fetcher.Fetcher - fetching
Kauu,
The url for fetcher too is: https://issues.apache.org/jira/browse/NUTCH-339
Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com
-Original Message-
From: kauu [mailto
this function from the core
code.
I am also new to Nutch; if anything is wrong, please feel free to point it out.
- Original Message -
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 7:52 PM
Subject: Modified date in crawldb
Hi guys,
I am
,
Armel
Chee,
Can you make the code available through Jira?
Thanks,
Armel
-Original Message-
From: chee wu [mailto:[EMAIL PROTECTED
to your kind support.
Armel
Thanks for the reply. I'll try this, and if I encounter any problem I'll
send another email. This will be a good feature to have and will probably
keep the project from branching into different subprojects.
Regards,
Armel
iterate over the values contained in the crawldb using the
Nutch 0.7 API; I should think this will fix the issue. So the question is:
is Nutch 0.8 backward compatible with Nutch 0.7.2?
Thanks,
Armel
Thanks.
Armel
-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: 17 January 2007 15:39
To: nutch-dev@lucene.apache.org
Subject: Re: Next Nutch
Best regards,
Armel T. Nene
Hi,
Is it possible to let Nutch crawl a set of documents at a time?
I have set up Nutch with the following options:
topN 20
depth 2
I wanted Nutch to crawl my directory only as deep as 2 links
from the root directory. Now the root directory itself contains more
Hi,
I have set up Nutch to crawl my local filesystem. I set topN to 20 and depth
to 2. But when Nutch re-crawls, it re-crawls the same files over and over
again. The directory doesn't contain any sub-directories; can someone
let me know what might be the cause? There are more than 20 files in the
Lukas,
I was wondering about running Nutch as a Windows service. I was able to
implement it as follows:
1. Create a Java program that will act as a Nutch launcher and
re-crawler.
2. Download JavaService from http://javaservice.objectweb.org/
3. Follow the tutorial to turn
Hi guys,
I have a few questions regarding the way Nutch indexes and the best way a
recrawl can be implemented.
1. Why does Nutch have to create a new index every time it indexes,
when it could just merge into the existing index? I tried to change the
value in the IndexMerger
I agree with you that documentation is vital, not just for extending the
current version but also for any plugins and patches created. I have
spent almost two weeks trying to adapt Nutch to my project, but I spend
more time reading code and trying to understand what it does before I can
Also, can Nutch be run as a Windows service? Let me know so that I don't
waste my time trying to code something that won't work.
-Original Message-
From: Armel T. Nene [mailto:[EMAIL PROTECTED]
Sent: 21 November 2006 21:56
To: nutch-dev@lucene.apache.org
Subject: Nutch folder
Rida,
There is something I would like to clarify: when using a namespace and XPath
to store content in the index, can this be seen as multi-field? For example,
if we are storing customer name and customer address, which are declared
in an XML configuration file, is that multi-field? Please
: Chris Mattmann [mailto:[EMAIL PROTECTED]
Sent: 20 November 2006 23:30
To: nutch-dev@lucene.apache.org
Subject: Re: What's the status of Nutch-GUI?
Hi Armel,
On 11/20/06 1:44 PM, Armel T. Nene [EMAIL PROTECTED] wrote:
Hi Chris,
I am trying to extend parse-xml to enable the creation of lucene
I want Nutch to crawl a filesystem and, if the content of the
filesystem has changed since the last crawl, fetch it
again. I studied the code for the Adaptive Re-Fetch cycle, but the
patch is out of date as Nutch has since implemented other features. Also, I don't
want
Andrzej, the feature that I am after can be implemented by this patch if I
adapt it correctly. I am not sure of this, but the patch seems a little too
old to be applied to the latest release, Nutch 0.8.1.
I want to implement a feature where the fetcher will fetch files but only
add them if