Updating Tika in Nutch

2011-07-12 Thread Fernando Arreola
Hello, I have made some additions (a new parser) to the Apache Tika application and I am trying to see if I can run my new changes using the crawl mechanism in Nutch, but I am having some trouble updating Nutch with my modified Tika application. The Tika updates I made run fine if I run Tika as

Re: Nutch Novice help

2011-07-12 Thread lewis john mcgibbney
Hi, please see this tutorial [1] for an up-to-date 1.3 tutorial on the wiki. Please try it out and take on Markus' points regarding Nutch trunk, as the problems you are experiencing are usual with trunk as it stands. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr On Mon, Jul 11, 2011 at 10:50 PM,

Re: developing nutch, either in eclipse or netbeans

2011-07-12 Thread lewis john mcgibbney
I must admit, Markus, that I agree with you that for making ad-hoc changes to your configuration it is usually more time-efficient to use a text editor. Hi C.B. Is there any reason in particular you are interested in getting it up and working with an IDE? I had contemplated getting a revised tutorial

Re: Updating Tika in Nutch

2011-07-12 Thread lewis john mcgibbney
Hi Fernando, One point for me to mention which I did not pick up from your post. Did you rebuild your Nutch dist after making the changes to include your new parser? I know that this is a pretty simple suggestion but hopefully it might be the right one. Also can you please provide more details

Re: Nutch Novice help

2011-07-12 Thread Julien Nioche
There seems to be no crawl-urlfilter file indeed. Don't know why it's gone since the crawl command is still there. You can find the file in the 1.2 release: http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/ Crawl-urlfilter has been removed purposefully as it did not add anything

Re: Problem with href=?param=value links

2011-07-12 Thread Matthias Naber
Hey, I'm still stuck on this. Can anyone give me a hint where to start searching for the built-in HTML link extractor so I can patch this error? Or would it be easier to just swap the built-in parser for the Tika HTML parser (do these two behave identically, or are there more changes to be made

Re: Nutch Novice help

2011-07-12 Thread Markus Jelsma
There seems to be no crawl-urlfilter file indeed. Don't know why it's gone since the crawl command is still there. You can find the file in the 1.2 release: http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/ Crawl-urlfilter has been removed purposefully as it did not add

Re: High CPU-time when finishing fetch job

2011-07-12 Thread Markus Jelsma
Ah, I seem to have found what I'm looking for: io.sort.factor=K is the lever and Finished Spill=N is the indicator. With large map output, such as that produced by the fetcher, we need to tune the systems to get N down as much as possible by increasing K. There's probably a sweet spot for our
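One way to picture why the merge factor matters is to count how many merge rounds a given number of spill files needs. The sketch below is a simplified model, not Hadoop's actual merge algorithm: it assumes each round combines at most K files into one, and the helper name merge_passes and the figure of 400 spills are made up for illustration.

```python
import math


def merge_passes(num_spills: int, sort_factor: int) -> int:
    """Count the merge rounds needed to reduce spill files to a single file.

    Simplified model (an assumption, not Hadoop's exact behaviour):
    every round merges groups of at most `sort_factor` files.
    """
    if sort_factor < 2:
        raise ValueError("sort_factor must be at least 2")
    passes = 0
    remaining = num_spills
    while remaining > 1:
        # Each group of up to sort_factor files becomes one merged file.
        remaining = math.ceil(remaining / sort_factor)
        passes += 1
    return passes


# Hypothetical workload: 400 spill files, three candidate merge factors.
for k in (10, 50, 100):
    print(f"io.sort.factor={k}: {merge_passes(400, k)} merge pass(es)")
```

Under this model, raising K pays off mainly when it removes a whole merge round, which fits the idea of a sweet spot rather than "bigger is always better".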

Re: Problems with tutorial

2011-07-12 Thread Paul van Hoven
Thanks for the answers. I'm not sure if the 'http.agent.name' is the problem since I set it. This is the configuration I'm using from nutch-1.3/conf/nutch-default.xml: <!-- HTTP properties --> <property> <name>http.agent.name</name> <value>MyFirstNutchCrawler</value> <description>HTTP 'User-Agent'

Re: Problems with tutorial

2011-07-12 Thread Julien Nioche
Have just updated the tutorial: as of 1.3 the files should be changed in $NUTCH_HOME/runtime/local/conf/ unless you rebuild with ANT. On 12 July 2011 10:43, Paul van Hoven paul.van.ho...@googlemail.com wrote: Thanks for the answers. I'm not sure if the 'http.agent.name' is the problem since I
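Because the same property can live in more than one conf directory, it can help to read the value straight out of the XML file that the runtime actually uses. The following is a minimal sketch, assuming the usual configuration/property/name/value layout; read_property is a hypothetical helper and the inline SAMPLE stands in for a real file under $NUTCH_HOME/runtime/local/conf/.

```python
import xml.etree.ElementTree as ET


def read_property(conf_xml: str, name: str):
    """Return the value of the named property, or None if it is absent or empty."""
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


# Illustrative content only; in practice, read the text of the conf file instead.
SAMPLE = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyFirstNutchCrawler</value>
  </property>
</configuration>"""

agent = read_property(SAMPLE, "http.agent.name")
print(agent or "http.agent.name is not set")
```

Running the same check against both the source conf directory and the runtime one would show whether an edit was made in a place the crawl never reads.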

Re: Nutch Novice help

2011-07-12 Thread Julien Nioche
On 12 July 2011 10:30, Julien Nioche lists.digitalpeb...@gmail.com wrote: There seems to be no crawl-urlfilter file indeed. Don't know why it's gone since the crawl command is still there. You can find the file in the 1.2 release:

Re: Nutch Gotchas as of release 1.3

2011-07-12 Thread lewis john mcgibbney
Hi I have duly updated both the Nutch Gotchas [1] and the tutorial [2] to incorporate these gotchas which have been highlighted. Thanks for pointing these out. [1] http://wiki.apache.org/nutch/NutchGotchas [2] http://wiki.apache.org/nutch/RunningNutchAndSolr On Tue, Jul 12, 2011 at 12:03 AM,

nutch crashes for unknown reason

2011-07-12 Thread Paul van Hoven
I'm starting with nutch and I ran a simple job as described in the nutch tutorial. After a while I get the following error: CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03 LinkDb: starting at

Re: nutch crashes for unknown reason

2011-07-12 Thread Paul van Hoven
Actually I'm not sure if I am looking at the right log lines. Please explain in more detail what exactly I should look for. Anyway, I found the following line just before the error: Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js: failed(2,0): Can't retrieve Tika parser for

Re: nutch crashes for unknown reason

2011-07-12 Thread Markus Jelsma
Actually I'm not sure if I am looking at the right log lines. Please explain in more detail what exactly I should look for. Anyway, I found the following line just before the error: Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js: failed(2,0): Can't retrieve Tika parser for

Re: nutch crashes for unknown reason

2011-07-12 Thread Paul van Hoven
I'm not sure I understood you correctly. Here is the complete output of my crawl: tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50 solrUrl is not set, indexing will be skipped... crawl started in:

Can I create my own segment containing specific URLs and other information?

2011-07-12 Thread jeffersonzhou
Hi, I want to do my own parser and separate all the interesting URLs into a new segment other than Nutch’s default segments. Can I do so? How? Thanks.

Re: nutch crashes for unknown reason

2011-07-12 Thread Markus Jelsma
I don't see this segment 20110712114256 being parsed. On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote: I'm not sure I understood you correctly. Here is the complete output of my crawl: tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir

Re: nutch crashes for unknown reason

2011-07-12 Thread Paul van Hoven
Okay, and what does that mean? How can I repair the error? 2011/7/12 Markus Jelsma markus.jel...@openindex.io: I don't see this segment 20110712114256 being parsed. On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote: I'm not sure I understood you correctly. Here is the complete output of

Re: Crawl fails - Input path does not exist

2011-07-12 Thread robertito
Hi, I'm a beginner using Nutch 1.3 on Windows 7 with Cygwin and followed the tutorial: http://wiki.apache.org/nutch/NutchTutorial I'm trying to crawl wikipedia.org as a start, and having a similar problem with the segments/content path that does not exist. The path does indeed not exist

http://wiki.apache.org/nutch/WritingPluginExample-1.2

2011-07-12 Thread Cam Bazz
Hello, I got my development environment up, with Eclipse and the provided ant script. Is there a 1.3 version of this document? Or are there any changes that I should apply? I will just test building the recommended plugin as written in the document, and post back the results. By the way,

A possible bug or misleading documentation

2011-07-12 Thread Nutch User - 1
This concerns 1.3 distribution and I don't know if this is fixed in some newer revision. From nutch-default.xml: <property> <name>fetcher.max.crawl.delay</name> <value>30</value> <description> If the Crawl-Delay in robots.txt is set to greater than this value (in seconds) then the fetcher will skip

Re: A possible bug or misleading documentation

2011-07-12 Thread Julien Nioche
Please open an issue on JIRA (https://issues.apache.org/jira/browse/NUTCH) Thanks On 12 July 2011 14:25, Nutch User - 1 nutch.use...@gmail.com wrote: This concerns 1.3 distribution and I don't know if this is fixed in some newer revision. From nutch-default.xml: property

Re: A possible bug or misleading documentation

2011-07-12 Thread Nutch User - 1
On 07/12/2011 04:34 PM, Julien Nioche wrote: Please open an issue on JIRA (https://issues.apache.org/jira/browse/NUTCH) Thanks It's now here: (https://issues.apache.org/jira/browse/NUTCH-1042).

Re: How to build nutch 1.3 without an internet connection

2011-07-12 Thread Markus Jelsma
You have internet now. Build it here and deploy elsewhere. I need the ability to build nutch 1.3 with ANT without being connected to the internet (looks like ivy is used to download dependent libs). Is this possible? What do I have to modify to make this happen? Thanks!! -- View this

running tests from the command line

2011-07-12 Thread Tim Pease
At the root of the Nutch 1.3 project, what is the magic ant incantation to run only the tests for the plugin I'm currently hacking away on? I'm looking for the command line syntax. Blessings, TwP

Re: nutch crashes for unknown reason

2011-07-12 Thread lewis john mcgibbney
From the looks of it, you need to parse all segments before attempting to index them. As Markus has pointed out, the specific segment hasn't been parsed. Try parsing as per the following link: http://wiki.apache.org/nutch/bin/nutch_parse On Tue, Jul 12, 2011 at 1:50 PM, Paul van Hoven
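A quick check for unparsed segments can be scripted before indexing. The sketch below assumes that a parsed segment contains a parse_data subdirectory, which is an assumption about the on-disk segment layout rather than something stated in this thread. unparsed_segments is a hypothetical helper, and the demo builds a throwaway directory tree instead of touching a real crawl; the second segment name is invented.

```python
import os
import tempfile


def unparsed_segments(segments_dir: str) -> list:
    """List segment directory names that have no parse_data subdirectory."""
    missing = []
    for name in sorted(os.listdir(segments_dir)):
        segment = os.path.join(segments_dir, name)
        # Assumption: parsing a segment creates <segment>/parse_data.
        if os.path.isdir(segment) and not os.path.isdir(
            os.path.join(segment, "parse_data")
        ):
            missing.append(name)
    return missing


# Demo on a temporary tree: one fetched-only segment, one parsed segment.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "20110712114256", "crawl_fetch"))
    os.makedirs(os.path.join(root, "20110712114300", "parse_data"))
    print("Segments still to parse:", unparsed_segments(root))
```

Pointing the function at a real segments directory would give the list of segments to parse before the indexing step is attempted.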

Re: A possible solution to my URL redirection and zero scores problem

2011-07-12 Thread lewis john mcgibbney
Well I think in order to address the problem directly it would be better to focus on getting something working with a distribution of Nutch you are most comfortable working with. For the time being I would avoid working with trunk 2.0 unless you can justify otherwise. I would also either make a

Re: Updating Tika in Nutch

2011-07-12 Thread Fernando Arreola
Hello, Thanks for the replies. I have started trying to use Nutch 1.3 after your suggestions, especially since I am using Tika 0.9, but I am not getting anywhere with it. I am able to build fine but whenever I try to run any command it gives the error stating that it cannot find C:\Program. For

Re: running tests from the command line

2011-07-12 Thread lewis john mcgibbney
What plugin are you hacking away on? Your own custom one or one already shipped with Nutch? Just so we are reading from the same page. This, along with some further documentation for running various classes from the command line, is definitely worth inclusion in the CommandLineOptions page of

Re: Nutch Novice help

2011-07-12 Thread lewis john mcgibbney
Have a good look at your hadoop.log, which should be created when you initiate a crawl with Nutch; this will be extremely valuable. In addition, there are various properties in nutch-site.xml which can be set to make logging more verbose at various levels, e.g. fetching. In order to root out

Re: Nutch Novice help

2011-07-12 Thread Markus Jelsma
No URLs to fetch - check your seed list and URL filters The error is quite clear. You injected URLs that did not pass your URL filters. Check your URL filters, likely crawl-urlfilter since you seem to use the crawl command. Thanks for updating the tutorial. I tried my setup, the crawl
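To see why a seed list can end up empty, it can help to replay the seeds against the filter rules outside Nutch. This is a rough sketch and not Nutch's implementation: it assumes rules of the form "+regex" or "-regex", that the first matching rule decides, and that a URL matching no rule is rejected. The rules, the example.org and example.com URLs, and the helper names are all illustrative.

```python
import re


def load_rules(text: str) -> list:
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules: list, url: str) -> bool:
    """First matching rule decides; no match means the URL is dropped (assumed)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Illustrative rules, not copied from any shipped filter file.
RULES = r"""
# skip some static resources
-\.(gif|jpg|png|css|js)$
# accept everything under the example domain
+^http://([a-z0-9]*\.)*example\.org/
"""

rules = load_rules(RULES)
for seed in ("http://www.example.org/", "http://other.example.com/"):
    print(seed, "->", "kept" if accepts(rules, seed) else "filtered out")
```

If every seed prints as filtered out, the "No URLs to fetch" message is the expected result, and the fix lies in the rule file rather than in the crawl command.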

Re: Updating Tika in Nutch

2011-07-12 Thread Fernando Arreola
Thanks, I really appreciate all the help. I used the ParserChecker and I could see the metadata my parser extracted! I have one more question though, I could only see the metadata my parser extracted if I used the -forceAs mimetype option. Otherwise it is detected as a text/plain file and my