Hi Leo,
From the times both the fetching and parsing took, I suspect that maybe
Nutch didn't actually fetch the URL; however, this may not be the case, as I
have nothing to benchmark it against. Unfortunately, on this occasion the URL
http://wiki.apache.org actually redirects to
Hi Leo, hi Lewis,
From the times both the fetching and parsing took, I suspect that maybe
Nutch didn't actually fetch the URL,
This may be the reason. Empty segments may break some of the crawler steps.
But if I'm not wrong, it looks like the updatedb command
is not quite correct:
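For comparison, a sketch of the two invocation forms the 1.x updatedb command is generally expected to accept; the paths are taken from Leo's log below, and the exact -dir semantics (parent segments directory vs. a single segment) are my assumption based on the command's usage text:

```shell
# Form 1 (assumption): pass one or more segment directories directly
bin/nutch updatedb /home/llist/nutchData/crawl/crawldb \
  /home/llist/nutchData/crawl/segments/20110716105826

# Form 2 (assumption): pass -dir with the *parent* segments directory,
# so every segment found under it is used for the update
bin/nutch updatedb /home/llist/nutchData/crawl/crawldb \
  -dir /home/llist/nutchData/crawl/segments
```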
Hi Lewis,
Will try your suggestion shortly, but am still puzzled why the crawl
command works. Isn't it using the same filters, etc.?
Cheers,
Leo
On Thu, 2011-07-21 at 20:55 +0100, lewis john mcgibbney wrote:
Hi Leo,
From the times both the fetching and parsing took, I suspect that
maybe
Hi Lewis,
Following are the things I tried and the relevant source/logs:
1. ran 'crawl' without ending / in the url http://www.seek.com.au ;
Result OK
2. ran 'crawl' with ending / in the url http://www.seek.com.au/ ;
Result OK
3. Had a look at the regex-urlfilter.txt and the relevant entries
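To double-check that a filter entry really accepts both URL forms, the pattern can be tried outside Nutch with grep; the pattern below is a hypothetical stand-in for the actual regex-urlfilter.txt entry, with the leading '+' stripped:

```shell
# Hypothetical accept pattern in the style of a regex-urlfilter.txt entry
pattern='^http://([a-z0-9-]+\.)*seek\.com\.au(/.*)?$'

# Both forms of the URL (with and without the trailing /) should pass
for url in 'http://www.seek.com.au' 'http://www.seek.com.au/'; do
  if echo "$url" | grep -Eq "$pattern"; then
    echo "accepted: $url"
  else
    echo "rejected: $url"
  fi
done
```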
Hi Sebastian,
I think the problem is with the fetch not returning any results. I
checked your suggestion, but it did not work.
Cheers,
Leo
On Thu, 2011-07-21 at 22:16 +0200, Sebastian Nagel wrote:
Hi Leo, hi Lewis,
From the times both the fetching and parsing took, I suspect that
Haven't you forgotten to call parse?
On 19 July 2011 23:40, Leo Subscriptions llsub...@zudiewiener.com wrote:
Hi Lewis,
You are correct about the last post not showing any errors. I just
wanted to show that I don't get any errors if I use 'crawl', and to prove
that I do not have any faults.
Hello,
I think the documentation is misleading: it does not tell us
that we have to parse.
On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
Haven't you forgotten to call parse?
On 19 July 2011 23:40, Leo Subscriptions llsub...@zudiewiener.com wrote:
There is no documentation for the individual commands used to run a Nutch 1.3
crawl, so I'm not sure where it has been misleading. If this were required, I
would direct newer users to the legacy documentation for
the time being.
My comment to Leo was to understand whether he managed to fetch, then parse.
I've used crawl to ensure config is correct and I don't get any errors,
so I must be doing something wrong with the individual steps, but can't
see what.
I'm running nutch 1.3 on 64-bit Ubuntu; following are the commands and
relevant output.
--
llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
Injector: starting at 2011-07-15 18:32:10
Done, but now get additional errors:
---
llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
updatedb /home/llist/nutchData/crawl/crawldb
-dir /home/llist/nutchData/crawl/segments/20110716105826
CrawlDb update: starting at 2011-07-16 11:03:56
CrawlDb update: db:
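(With hindsight from later in this thread, the missing step here was parse; a sketch of the order that should work, using the same paths as in the log above:)

```shell
# Parse the fetched segment first...
/usr/share/nutch/runtime/local/bin/nutch parse \
  /home/llist/nutchData/crawl/segments/20110716105826

# ...then update the crawldb from that segment
/usr/share/nutch/runtime/local/bin/nutch updatedb \
  /home/llist/nutchData/crawl/crawldb \
  /home/llist/nutchData/crawl/segments/20110716105826
```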
Hello,
I tried to crawl manually, with only a list of URLs. I have issued the
following commands:
bin/nutch inject /home/crawl/crawldb /home/urls
bin/nutch generate /home/crawl/crawldb /home/crawl/segments
bin/nutch fetch /home/crawl/segments/123456789
bin/nutch updatedb /home/crawl/crawldb
Hi C.B.,
It looks like you may have simply missed the '-dir' flag when specifying
your crawldb directory to be updated from the fetched segment. Have a look
here [1].
Can you please try it and post your results?
[1] http://wiki.apache.org/nutch/bin/nutch_updatedb
On Fri, Jul 8, 2011 at 5:06
Hello,
It appears that in my previous message I omitted to write -dir, but I had
actually written -dir in my console.
However, I have found out that I need to run nutch parse
/home/crawl/segments/12345 before updating the db.
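Putting it together, the full manual sequence with the parse step added looks like this; the paths and the placeholder segment name are the ones from my earlier message (a real run writes a timestamped segment directory, so substitute the actual name):

```shell
bin/nutch inject /home/crawl/crawldb /home/urls
bin/nutch generate /home/crawl/crawldb /home/crawl/segments
bin/nutch fetch /home/crawl/segments/123456789
bin/nutch parse /home/crawl/segments/123456789
bin/nutch updatedb /home/crawl/crawldb /home/crawl/segments/123456789
```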
By the way: what exactly is a segment, and how is data