Re: skipping invalid segments nutch 1.3

2011-07-21 Thread lewis john mcgibbney
Hi Leo, From the times both the fetching and parsing took, I suspect that maybe Nutch didn't actually fetch the URL; however, this may not be the case as I have nothing to benchmark it against. Unfortunately, on this occasion the URL http://wiki.apache.org actually redirects to

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Sebastian Nagel
Hi Leo, hi Lewis, From the times both the fetching and parsing took, I suspect that maybe Nutch didn't actually fetch the URL. This may be the reason. Empty segments may break some of the crawler steps. But if I'm not wrong, it looks like the updatedb command is not quite correct:
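
A minimal sketch of the kind of check this points to, assuming the Nutch 1.3 readseg command and reusing the segment path from Leo's earlier log (bin/nutch stands in for the full /usr/share/nutch/runtime/local/bin/nutch path):

  bin/nutch readseg -list /home/llist/nutchData/crawl/segments/20110716105826

The -list output reports how many URLs were generated, fetched, and parsed for the segment, so an empty or unparsed segment is easy to spot before running updatedb.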

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Leo Subscriptions
Hi Lewis, Will try your suggestion shortly, but am still puzzled why the crawl command works. Isn't it using the same filter, etc.? Cheers, Leo On Thu, 2011-07-21 at 20:55 +0100, lewis john mcgibbney wrote: Hi Leo, From the times both the fetching and parsing took, I suspect that maybe

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Leo Subscriptions
Hi Lewis, Following are the things I tried and the relevant source/logs: 1. ran 'crawl' without a trailing / in the URL http://www.seek.com.au; result OK. 2. ran 'crawl' with a trailing / in the URL http://www.seek.com.au/; result OK. 3. Had a look at the regex-urlfilter.txt and the relevant entries
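
For reference, a sketch of what such entries typically look like, modelled on the defaults shipped with Nutch 1.x rather than on Leo's actual file:

  # skip file:, ftp:, and mailto: URLs
  -^(file|ftp|mailto):
  # skip URLs containing characters that usually indicate queries or sessions
  -[?*!@=]
  # accept anything else
  +.

With rules like these, http://www.seek.com.au passes on the final '+.' line with or without the trailing slash, which is consistent with both crawl runs above succeeding.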

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Leo Subscriptions
Hi Sebastian, I think the problem is with the fetch not returning any results. I checked your suggestion, but it did not work. Cheers, Leo On Thu, 2011-07-21 at 22:16 +0200, Sebastian Nagel wrote: Hi Leo, hi Lewis, From the times both the fetching and parsing took, I suspect that

Re: skipping invalid segments nutch 1.3

2011-07-20 Thread Julien Nioche
Haven't you forgotten to call parse? On 19 July 2011 23:40, Leo Subscriptions llsub...@zudiewiener.com wrote: Hi Lewis, You are correct about the last post not showing any errors. I just wanted to show that I don't get any errors if I use 'crawl' and to prove that I do not have any faults
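
For context, the step-by-step sequence Julien is alluding to, sketched with placeholder paths; the parse step has to run on the fetched segment before updatedb unless parsing during fetch (fetcher.parse) is enabled:

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  bin/nutch fetch crawl/segments/<segment>
  bin/nutch parse crawl/segments/<segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>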

Re: skipping invalid segments nutch 1.3

2011-07-20 Thread Cam Bazz
Hello, I think the documentation is misleading: it does not tell us that we have to parse. On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Haven't you forgotten to call parse? On 19 July 2011 23:40, Leo Subscriptions llsub...@zudiewiener.com wrote:

Re: skipping invalid segments nutch 1.3

2011-07-20 Thread lewis john mcgibbney
There is no documentation for the individual commands used to run a Nutch 1.3 crawl, so I'm not sure where it has been misleading. If such documentation were required, I would direct newer users to the legacy documentation for the time being. My comment to Leo was to understand whether he managed

Re: skipping invalid segments nutch 1.3

2011-07-16 Thread Leo Subscriptions
I've used crawl to ensure the config is correct and I don't get any errors, so I must be doing something wrong with the individual steps, but can't see what.

skipping invalid segments nutch 1.3

2011-07-15 Thread Leo Subscriptions
I'm running Nutch 1.3 on 64-bit Ubuntu; following are the commands and the relevant output. -- llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed Injector: starting at 2011-07-15 18:32:10
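
A sketch of the steps that typically follow the inject shown above, using the same crawldb path and a shell variable to hold whatever timestamped segment name the generate step creates (bin/nutch again stands in for the full path):

  bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments
  seg=$(ls -d /home/llist/nutchData/crawl/segments/2* | tail -1)  # newest segment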

Re: skipping invalid segments nutch 1.3

2011-07-15 Thread Markus Jelsma
fetch, then parse. I'm running Nutch 1.3 on 64-bit Ubuntu; following are the commands and the relevant output. -- llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed Injector:
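
Spelled out against Leo's directory layout, this advice amounts to something like the following, where <segment> is whichever timestamped directory the generate step created:

  bin/nutch fetch /home/llist/nutchData/crawl/segments/<segment>
  bin/nutch parse /home/llist/nutchData/crawl/segments/<segment>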

Re: skipping invalid segments nutch 1.3

2011-07-15 Thread Leo Subscriptions
Done, but now I get additional errors: --- llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110716105826 CrawlDb update: starting at 2011-07-16 11:03:56 CrawlDb update: db:
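
Assuming the usual Nutch 1.x updatedb usage, <crawldb> (-dir <segments> | <seg1> <seg2> ...), the -dir option expects the parent segments directory rather than a single segment. A sketch of the two forms that should be equivalent here; whether either clears the skipped-segment messages still depends on what fetch and parse actually produced:

  # pass the segment itself as a positional argument ...
  bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110716105826
  # ... or point -dir at the directory containing the segments
  bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments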

skipping invalid segments

2011-07-08 Thread Cam Bazz
Hello, I tried to crawl manually, with only a list of URLs. I have issued the following commands: bin/nutch inject /home/crawl/crawldb /home/urls bin/nutch generate /home/crawl/crawldb /home/crawl/segments bin/nutch fetch /home/crawl/segments/123456789 bin/nutch updatedb /home/crawl/crawldb

Re: skipping invalid segments

2011-07-08 Thread lewis john mcgibbney
Hi C.B., It looks like you may have simply missed the '-dir' when you were specifying your crawldb directory to be updated from the fetched segment. Have a look here [1]. Can you please try it and post your results? [1] http://wiki.apache.org/nutch/bin/nutch_updatedb On Fri, Jul 8, 2011 at 5:06
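
Applied to the paths from the earlier message, that suggestion would read roughly:

  bin/nutch updatedb /home/crawl/crawldb -dir /home/crawl/segments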

Re: skipping invalid segments

2011-07-08 Thread Cam Bazz
Hello, It appears that in my previous message I omitted to write -dir, but I had actually written -dir in my console. However, I have found out that I need to run nutch parse /home/crawl/segments/12345 before updating the db. By the way: what exactly is a segment, and how is data
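
As an illustration of why the parse step matters, a sketch assuming the usual Nutch 1.x segment layout: a segment is a timestamped directory under segments/ whose contents grow as the steps run, roughly as follows.

  ls /home/crawl/segments/123456789
  # after generate: crawl_generate
  # after fetch:    crawl_fetch  content
  # after parse:    crawl_parse  parse_data  parse_text

Since updatedb reads both the crawl_fetch and crawl_parse parts of a segment, the segment only becomes usable for the update once parse has run.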