Hello,
I have made some additions (a new parser) to the Apache Tika application and
I am trying to see if I can run my new changes using the crawl mechanism in
Nutch, but I am having some trouble updating Nutch with my modified Tika
application.
The Tika updates I made run fine if I run Tika as
Hi,
Please see [1] for the up-to-date 1.3 tutorial on the wiki.
Please try it out, and take on board Markus' points regarding Nutch trunk, as the
problems you are experiencing are usual with trunk as it stands.
[1] http://wiki.apache.org/nutch/RunningNutchAndSolr
I must admit, Markus, that I agree with you: for making ad-hoc changes to
your configuration it is usually more time-efficient to use a text editor.
Hi C.B.,
Is there any reason in particular you are interested in getting it up and working
with an IDE? I had contemplated getting a revised tutorial
Hi Fernando,
One point to mention which I did not pick up from your post: did you
rebuild your Nutch dist after making the changes, to include your new parser?
I know that this is a pretty simple suggestion but hopefully it might be the
right one (a rebuild sketch follows below).
Also, can you please provide more details
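For reference, a minimal sketch of the rebuild step meant above, assuming the
1.3 source layout and its Ant targets (verify against your build.xml):

  # regenerate runtime/local so the new parser plugin gets picked up
  cd $NUTCH_HOME
  ant clean runtime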
There seems to be no crawl-urlfilter file indeed. I don't know why it's gone
since the crawl command is still there. You can find the file in the 1.2
release: http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/
Crawl-urlfilter has been removed purposefully as it did not add anything
Hey,
I'm still stuck on this. Can anyone give me a hint where to start searching for
the built-in HTML link extractor so I can perhaps patch this error?
Or would it be easier to just swap the built-in parser for the Tika HTML parser
(do these two behave identically, or are there more changes to be made
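If you do want to try the swap, a sketch of the usual approach is to change
plugin.includes in nutch-site.xml so that parse-tika handles HTML instead of
parse-html. The value below is adapted from the 1.3 default and is an
assumption; check it against your nutch-default.xml:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Use parse-tika rather than parse-html for HTML pages.</description>
  </property>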
Ah, I seem to have found what I'm looking for: io.sort.factor=K is the lever
and Finished Spill=N is the indicator.
With large map output, such as that produced by the fetcher, we need to tune
the system to get N down as much as possible by increasing K. There's probably
a sweet spot for our
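For anyone following along, a sketch of the two standard Hadoop knobs involved,
set e.g. in mapred-site.xml on the cluster (values are illustrative, not
recommendations):

  <property>
    <name>io.sort.factor</name>
    <value>100</value>
    <description>Streams merged at once while sorting; default is 10.</description>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
    <description>Buffer for sorting map output; default is 100. A bigger
    buffer means fewer spills to merge.</description>
  </property>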
Thanks for the answers. I'm not sure if the 'http.agent.name' is the
problem since I set it:
This is the configuration I'm using from nutch-1.3/conf/nutch-default.xml:

  <!-- HTTP properties -->
  <property>
    <name>http.agent.name</name>
    <value>MyFirstNutchCrawler</value>
    <description>HTTP 'User-Agent'
Have just updated the tutorial: as of 1.3 the files should be changed in
$NUTCH_HOME/runtime/local/conf/ unless you rebuild with Ant.
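A sketch of the two workflows, assuming the 1.3 layout and its Ant targets:

  # either edit the runtime config directly:
  vi $NUTCH_HOME/runtime/local/conf/nutch-site.xml
  # or edit $NUTCH_HOME/conf/ and rebuild, which regenerates runtime/local:
  cd $NUTCH_HOME && ant runtime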
On 12 July 2011 10:43, Paul van Hoven paul.van.ho...@googlemail.com wrote:
Thanks for the answers. I'm not sure if the 'http.agent.name' is the
problem since I
On 12 July 2011 10:30, Julien Nioche lists.digitalpeb...@gmail.com wrote:
There seems to be no crawl-urlfilter file indeed. I don't know why it's
gone since
the crawl command is still there. You can find the file in the 1.2
release:
Hi
I have duly updated both the Nutch Gotchas [1] and the tutorial [2] to
incorporate these gotchas which have been highlighted. Thanks for pointing
these out.
[1] http://wiki.apache.org/nutch/NutchGotchas
[2] http://wiki.apache.org/nutch/RunningNutchAndSolr
I'm starting with Nutch and I ran a simple job as described in the
Nutch tutorial. After a while I get the following error:
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
LinkDb: starting at
Actually I'm not sure if I'm looking at the right log lines. Please
explain in more detail what exactly I should look for. Anyway, I
found the following line just before the error:
Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
failed(2,0): Can't retrieve Tika parser for
I'm not sure if I understood you correctly. Here is the complete output
of my crawl:
tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled
-dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
solrUrl is not set, indexing will be skipped...
crawl started in:
Hi,
I want to write my own parser and separate all the interesting URLs into a new
segment other than Nutch’s default segments. Can I do so? How?
Thanks.
I don't see this segment 20110712114256 being parsed.
On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote:
I'm not sure if I understood you correctly. Here is the complete output
of my crawl:
tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled
-dir
Okay, and what does that mean? How can I repair the error?
2011/7/12 Markus Jelsma markus.jel...@openindex.io:
I don't see this segment 20110712114256 being parsed.
On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote:
I'm not sure if I understood you correctly. Here is the complete output
of
Hi,
I'm a beginner using Nutch 1.3 on Windows 7 with Cygwin, and I followed the
tutorial:
http://wiki.apache.org/nutch/NutchTutorial
I'm trying to crawl wikipedia.org as a start, and I'm having a similar problem
with the segments/content path that does not exist. The path does indeed not
exist
Hello,
I got my development environment up with Eclipse and the provided Ant script.
Is there a 1.3 version of this document? Or are there any bare-minimum changes
that I should apply?
I will just test building the recommended plugin as written in the
document, and post back the results.
By the way,
This concerns the 1.3 distribution and I don't know if this is fixed in some
newer revision.
From nutch-default.xml:

  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>30</value>
    <description>
    If the Crawl-Delay in robots.txt is set to greater than this value (in
    seconds) then the fetcher will skip
Please open an issue on JIRA (https://issues.apache.org/jira/browse/NUTCH)
Thanks
On 12 July 2011 14:25, Nutch User - 1 nutch.use...@gmail.com wrote:
This concerns 1.3 distribution and I don't know if this is fixed in some
newer revision.
From nutch-default.xml:
  <property>
On 07/12/2011 04:34 PM, Julien Nioche wrote:
Please open an issue on JIRA (https://issues.apache.org/jira/browse/NUTCH)
Thanks
It's now here: https://issues.apache.org/jira/browse/NUTCH-1042
You have internet access now. Build it here and deploy it elsewhere.
I need the ability to build Nutch 1.3 with Ant without being connected to
the internet (it looks like Ivy is used to download dependent libs). Is this
possible? What do I have to modify to make this happen?
Thanks!!
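A sketch of one way to do this, assuming the default Ivy cache location
(~/.ivy2) and the 1.3 Ant targets:

  # on a machine with internet access, resolve all dependencies once:
  cd $NUTCH_HOME && ant runtime
  # copy the populated Ivy cache to the offline machine:
  tar czf ivy-cache.tar.gz -C $HOME .ivy2
  # on the offline machine, unpack to $HOME and build; Ivy should find
  # everything in its cache (if it still tries to go online, point it at
  # a filesystem resolver in an ivysettings.xml):
  tar xzf ivy-cache.tar.gz -C $HOME
  cd $NUTCH_HOME && ant runtime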
At the root of the Nutch 1.3 project, what is the magic ant incantation to run
only the tests for the plugin I'm currently hacking away on? I'm looking for
the command line syntax.
Blessings,
TwP
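Not a definitive answer, but a sketch of what has worked with the 1.3 layout,
where each plugin directory under src/plugin/ carries its own build.xml (the
exact targets are assumptions; check src/plugin/build.xml):

  # run the whole plugin test suite from the project root:
  ant test-plugins
  # or drive a single plugin's build file directly, e.g. parse-tika:
  ant -f src/plugin/parse-tika/build.xml test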
From the looks of it you need to parse all segments before
attempting to index them.
As Markus has pointed out, the specific segment hasn't been parsed. Try
parsing as per the following link:
http://wiki.apache.org/nutch/bin/nutch_parse
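A sketch of the command meant, with the segment name taken from earlier in the
thread and an illustrative crawl directory:

  # parse the segment that was fetched but never parsed:
  bin/nutch parse crawled/segments/20110712114256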
On Tue, Jul 12, 2011 at 1:50 PM, Paul van Hoven
Well, I think in order to address the problem directly it would be better to
focus on getting something working with a distribution of Nutch you are most
comfortable working with. For the time being I would avoid working with
trunk (2.0) unless you can justify otherwise. I would also either make a
Hello,
Thanks for the replies.
I have started trying to use Nutch 1.3 after your suggestions, especially
since I am using Tika 0.9, but I am not getting anywhere with it. I am able
to build fine, but whenever I try to run any command it gives an error
stating that it cannot find C:\Program. For
What plugin are you hacking away on? Your own custom one, or one already
shipped with Nutch? Just so we are reading from the same page.
This, along with some further documentation for running various classes from
the command line, is definitely worth inclusion in the CommandLineOptions
page of
Have a good look at your hadoop.log, which should be created when you
initiate a crawl with Nutch; this will be extremely valuable. In addition
there are various properties in nutch-site.xml which can be set to make
logging more verbose at various levels, e.g. fetching (a sketch follows
below). In order to root out
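A sketch of the kind of overrides meant, using two properties that exist in
nutch-default.xml (verify the names in your copy), placed in nutch-site.xml:

  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <description>If true, fetcher will log more verbosely.</description>
  </property>
  <property>
    <name>http.verbose</name>
    <value>true</value>
    <description>If true, HTTP will log more verbosely.</description>
  </property>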
No URLs to fetch - check your seed list and URL filters
The error is quite clear: you injected URLs that did not pass your URL
filters. Check your URL filters, likely crawl-urlfilter, since you seem to use
the crawl command.
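For reference, a sketch of the accept/reject pattern such a filter file
typically contains (the domain is illustrative, borrowed from the wikipedia.org
crawl mentioned earlier in the thread):

  # accept anything under the target domain:
  +^http://([a-z0-9]*\.)*wikipedia.org/
  # skip everything else (this catch-all usually ends the file):
  -.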
Thanks for updating the tutorial. I tried my setup, the crawl
Thanks, I really appreciate all the help. I used the ParserChecker and I
could see the metadata my parser extracted!
I have one more question though: I could only see the metadata my parser
extracted if I used the -forceAs mimetype option. Otherwise it is detected
as a text/plain file and my
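A sketch of the invocation meant, with a hypothetical URL and MIME type
(ParserChecker in 1.3 takes [-dumpText] [-forceAs mimeType] url):

  # force ParserChecker to use the parser registered for the given MIME
  # type instead of the auto-detected text/plain:
  bin/nutch parsechecker -dumpText -forceAs application/x-myformat http://example.com/sample.myf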