Hello,
I have made some additions (a new parser) to the Apache Tika application and
I am trying to see if I can run my new changes using the crawl mechanism in
Nutch, but I am having some trouble updating Nutch with my modified Tika
application.
The Tika updates I made run fine if I run Tika as
Hi,
Please see [1] for the up-to-date 1.3 tutorial on the wiki.
Please try it out, and take on board Markus' points regarding Nutch trunk, as the
problems you are experiencing are usual with trunk as it stands.
[1] http://wiki.apache.org/nutch/RunningNutchAndSolr
I must admit, Markus, that I agree with you: for making ad-hoc changes to
your configuration it is usually more time-efficient to use a text editor.
Hi C.B.,
Is there any reason in particular you are interested in getting it up and working
with an IDE? I had contemplated getting a revised tutorial
Hi Fernando,
One point to mention which I did not pick up from your post: did you
rebuild your Nutch dist after making the changes, to include your new parser?
I know that this is a pretty simple suggestion but hopefully it might be the
right one (a rebuild sketch follows below).
Also, can you please provide more details
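For reference, a minimal sketch of the rebuild step meant above, assuming the
1.3 source layout and its Ant targets (verify against your build.xml):

  # regenerate runtime/local so the new parser plugin gets picked up
  cd $NUTCH_HOME
  ant clean runtime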
There seems to be no crawl-urlfilter file indeed. I don't know why it's gone
since the crawl command is still there. You can find the file in the 1.2
release: http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/
Crawl-urlfilter has been removed purposefully as it did not add anything
Hey,
I'm still stuck on this. Can anyone give me a hint where to start searching for
the built-in HTML link extractor so I can perhaps patch this error?
Or would it be easier to just swap the built-in parser for the Tika HTML parser
(do these two behave identically, or are there more changes to be made
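If you do want to try the swap, a sketch of the usual approach is to change
plugin.includes in nutch-site.xml so that parse-tika handles HTML instead of
parse-html. The value below is adapted from the 1.3 default and is an
assumption; check it against your nutch-default.xml:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Use parse-tika rather than parse-html for HTML pages.</description>
  </property>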
Ah, I seem to have found what I'm looking for: io.sort.factor=K is the lever
and Finished Spill=N is the indicator.
With large map output, such as that produced by the fetcher, we need to tune
the system to get N down as much as possible by increasing K. There's probably
a sweet spot for our
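For anyone following along, a sketch of the two standard Hadoop knobs involved,
set e.g. in mapred-site.xml on the cluster (values are illustrative, not
recommendations):

  <property>
    <name>io.sort.factor</name>
    <value>100</value>
    <description>Streams merged at once while sorting; default is 10.</description>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
    <description>Buffer for sorting map output; default is 100. A bigger
    buffer means fewer spills to merge.</description>
  </property>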
Thanks for the answers. I'm not sure if the 'http.agent.name' is the
problem since I set it:
This is the configuration I'm using from nutch-1.3/conf/nutch-default.xml:

  <!-- HTTP properties -->
  <property>
    <name>http.agent.name</name>
    <value>MyFirstNutchCrawler</value>
    <description>HTTP 'User-Agent'
Have just updated the tutorial: as of 1.3 the files should be changed in
$NUTCH_HOME/runtime/local/conf/ unless you rebuild with Ant.
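A sketch of the two workflows, assuming the 1.3 layout and its Ant targets:

  # either edit the runtime config directly:
  vi $NUTCH_HOME/runtime/local/conf/nutch-site.xml
  # or edit $NUTCH_HOME/conf/ and rebuild, which regenerates runtime/local:
  cd $NUTCH_HOME && ant runtime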
On 12 July 2011 10:43, Paul van Hoven paul.van.ho...@googlemail.com wrote:
Thanks for the answers. I'm not sure if the 'http.agent.name' is the
problem since I
On 12 July 2011 10:30, Julien Nioche lists.digitalpeb...@gmail.com wrote:
There seems to be no crawl-urlfilter file indeed. I don't know why it's
gone since
the crawl command is still there. You can find the file in the 1.2
release:
Hi
I have duly updated both the Nutch Gotchas [1] and the tutorial [2] to
incorporate these gotchas which have been highlighted. Thanks for pointing
these out.
[1] http://wiki.apache.org/nutch/NutchGotchas
[2] http://wiki.apache.org/nutch/RunningNutchAndSolr
I'm starting with Nutch and I ran a simple job as described in the
Nutch tutorial. After a while I get the following error:
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
LinkDb: starting at
Actually I'm not sure if I'm looking at the right log lines. Please
explain in more detail what exactly I should look for. Anyway, I
found the following line just before the error:
Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
failed(2,0): Can't retrieve Tika parser for
I'm not sure if I understood you correctly. Here is the complete output
of my crawl:
tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled
-dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
solrUrl is not set, indexing will be skipped...
crawl started in:
Hi,
I want to write my own parser and separate all the interesting URLs into a new
segment other than Nutch’s default segments. Can I do so? How?
Thanks.
I don't see this segment 20110712114256 being parsed.
On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote:
I'm not sure if I understood you correctly. Here is the complete output
of my crawl:
tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled
-dir
Okay, and what does that mean? How can I repair the error?
2011/7/12 Markus Jelsma markus.jel...@openindex.io:
I don't see this segment 20110712114256 being parsed.
On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote:
I'm not sure if I understood you correctly. Here is the complete output
of
Hi,
I'm a beginner using Nutch 1.3 on Windows 7 with Cygwin, and I followed the
tutorial:
http://wiki.apache.org/nutch/NutchTutorial
I'm trying to crawl wikipedia.org as a start, and I'm having a similar problem
with the segments/content path that does not exist. The path does indeed not
exist
Hello,
I got my development environment up with Eclipse and the provided Ant script.
Is there a 1.3 version of this document? Or are there any bare-minimum changes
that I should apply?
I will just test building the recommended plugin as written in the
document, and post back the results.
By the way,
This concerns the 1.3 distribution and I don't know if this is fixed in some
newer revision.
From nutch-default.xml:

  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>30</value>
    <description>
    If the Crawl-Delay in robots.txt is set to greater than this value (in
    seconds) then the fetcher will skip
Please open an issue on JIRA (https://issues.apache.org/jira/browse/NUTCH)
Thanks
On 12 July 2011 14:25, Nutch User - 1 nutch.use...@gmail.com wrote:
This concerns 1.3 distribution and I don't know if this is fixed in some
newer revision.
From nutch-default.xml:
  <property>
On 07/12/2011 04:34 PM, Julien Nioche wrote:
Please open an issue on JIRA (https://issues.apache.org/jira/browse/NUTCH)
Thanks
It's now here: https://issues.apache.org/jira/browse/NUTCH-1042
You have internet access now. Build it here and deploy it elsewhere.
I need the ability to build Nutch 1.3 with Ant without being connected to
the internet (it looks like Ivy is used to download dependent libs). Is this
possible? What do I have to modify to make this happen?
Thanks!!
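A sketch of one way to do this, assuming the default Ivy cache location
(~/.ivy2) and the 1.3 Ant targets:

  # on a machine with internet access, resolve all dependencies once:
  cd $NUTCH_HOME && ant runtime
  # copy the populated Ivy cache to the offline machine:
  tar czf ivy-cache.tar.gz -C $HOME .ivy2
  # on the offline machine, unpack to $HOME and build; Ivy should find
  # everything in its cache (if it still tries to go online, point it at
  # a filesystem resolver in an ivysettings.xml):
  tar xzf ivy-cache.tar.gz -C $HOME
  cd $NUTCH_HOME && ant runtime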
At the root of the Nutch 1.3 project, what is the magic ant incantation to run
only the tests for the plugin I'm currently hacking away on? I'm looking for
the command line syntax.
Blessings,
TwP
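Not a definitive answer, but a sketch of what has worked with the 1.3 layout,
where each plugin directory under src/plugin/ carries its own build.xml (the
exact targets are assumptions; check src/plugin/build.xml):

  # run the whole plugin test suite from the project root:
  ant test-plugins
  # or drive a single plugin's build file directly, e.g. parse-tika:
  ant -f src/plugin/parse-tika/build.xml test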
From the looks of it you need to parse all segments before
attempting to index them.
As Markus has pointed out, the specific segment hasn't been parsed. Try
parsing as per the following link:
http://wiki.apache.org/nutch/bin/nutch_parse
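A sketch of the command meant, with the segment name taken from earlier in the
thread and an illustrative crawl directory:

  # parse the segment that was fetched but never parsed:
  bin/nutch parse crawled/segments/20110712114256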
On Tue, Jul 12, 2011 at 1:50 PM, Paul van Hoven
Well, I think in order to address the problem directly it would be better to
focus on getting something working with a distribution of Nutch you are most
comfortable working with. For the time being I would avoid working with
trunk (2.0) unless you can justify otherwise. I would also either make a
Hello,
Thanks for the replies.
I have started trying to use Nutch 1.3 after your suggestions, especially
since I am using Tika 0.9, but I am not getting anywhere with it. I am able
to build fine, but whenever I try to run any command it gives an error
stating that it cannot find C:\Program. For
What plugin are you hacking away on? Your own custom one, or one already
shipped with Nutch? Just so we are reading from the same page.
This, along with some further documentation for running various classes from
the command line, is definitely worth inclusion in the CommandLineOptions
page of
Have a good look at your hadoop.log, which should be created when you
initiate a crawl with Nutch; this will be extremely valuable. In addition
there are various properties in nutch-site.xml which can be set to make
logging more verbose at various levels, e.g. fetching (a sketch follows
below). In order to root out
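A sketch of the kind of overrides meant, using two properties that exist in
nutch-default.xml (verify the names in your copy), placed in nutch-site.xml:

  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <description>If true, fetcher will log more verbosely.</description>
  </property>
  <property>
    <name>http.verbose</name>
    <value>true</value>
    <description>If true, HTTP will log more verbosely.</description>
  </property>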
No URLs to fetch - check your seed list and URL filters
The error is quite clear: you injected URLs that did not pass your URL
filters. Check your URL filters, likely crawl-urlfilter, since you seem to use
the crawl command.
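For reference, a sketch of the accept/reject pattern such a filter file
typically contains (the domain is illustrative, borrowed from the wikipedia.org
crawl mentioned earlier in the thread):

  # accept anything under the target domain:
  +^http://([a-z0-9]*\.)*wikipedia.org/
  # skip everything else (this catch-all usually ends the file):
  -.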
Thanks for updating the tutorial. I tried my setup, the crawl
Thanks, I really appreciate all the help. I used the ParserChecker and I
could see the metadata my parser extracted!
I have one more question though: I could only see the metadata my parser
extracted if I used the -forceAs mimetype option. Otherwise it is detected
as a text/plain file and my
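A sketch of the invocation meant, with a hypothetical URL and MIME type
(ParserChecker in 1.3 takes [-dumpText] [-forceAs mimeType] url):

  # force ParserChecker to use the parser registered for the given MIME
  # type instead of the auto-detected text/plain:
  bin/nutch parsechecker -dumpText -forceAs application/x-myformat http://example.com/sample.myf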