OK, I didn't read that issue on JIRA.
Thank you very much, I'll use the crawl script!
Sent from iPhone
On 04 Mar 2013, at 18:35, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
If you look at the crawl script, IIRC there is no way to programmatically
obtain the
Hi all,
I am trying to better understand the counters and logging of the fetch
MapReduce job executed when crawling.
When looking at the job counters in the MapReduce web UI, I note the
following counters and values:
*Map input records 162,080*
moved
Hi,
I used the less command and checked: it shows the old content, not the
modified one. Is there any other cache to clear from the crawl db, or any
property to set in nutch-site.xml so that it re-fetches modified content?
- Cleared Tomcat cache
- settings:
<property>
<name>db.fetch.interval.default</name>
robots.txt is a global standard accepted by everyone; even Google and Bing
use it. I don't think there is any db file format maintained by web servers
for the robots information.
On Tue, Mar 5, 2013 at 1:29 AM, Raja Kulasekaran cull...@gmail.com wrote:
Hi
Instead of parsing robots.txt
Hi,
I meant to move the entire crawl process into the client environment, create
robots.db, and fetch only robots.db as indexed data.
Raja
On Tue, Mar 5, 2013 at 8:27 PM, Tejas Patil tejas.patil...@gmail.com wrote:
robots.txt is a global standard accepted by everyone; even Google and Bing use
Hi,
I already know that Nutch provides command-line tools for crawling and
indexing. I also read somewhere that it has a REST API. Do you have any
documentation around it? Its capabilities, limitations, etc.?
Regards,
Anand
Nutch internally caches the robots rules (it uses a hash map) in every
round. It will fetch the robots file for a particular host just once in a
given round. This model works out well. If you are creating a separate db
for it, then you have to ensure that it is updated frequently enough to take into
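The per-round caching described here can be sketched as follows. This is a hypothetical illustration, not Nutch's actual classes: one hash-map entry per host, filled at most once per round, and dropped between rounds (the `fetchCount` field exists only to make the once-per-round behaviour visible).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Nutch's actual classes): robots rules cached
// per host in a hash map, fetched at most once per round.
public class RobotsCache {
    private final Map<String, String> rulesByHost = new HashMap<>();
    public int fetchCount = 0; // illustrative only: counts real fetches

    public String getRules(String host) {
        // Fetch robots.txt at most once per host within a round.
        return rulesByHost.computeIfAbsent(host, this::fetchRobotsTxt);
    }

    private String fetchRobotsTxt(String host) {
        fetchCount++;
        // Stand-in for the real HTTP fetch of http://<host>/robots.txt
        return "User-agent: *\nDisallow:";
    }

    // Called between rounds: the cache is dropped, so rules are re-fetched
    // in the next round and stale entries never live longer than one round.
    public void clearForNextRound() {
        rulesByHost.clear();
    }
}
```

This is also why a separate robots db would need frequent updates: the in-memory cache gets a free refresh every round simply by being discarded.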
Thanks Tejas. Deleting the 'crawl_parse' directory worked for me today.
On Mon, Mar 4, 2013 at 11:15 PM, Tejas Patil tejas.patil...@gmail.com wrote:
Yes. After I deleted that directory, the parse operation ran successfully.
Even if it's an empty directory, parse won't proceed normally.
On Mon,
Hi,
I am new to Nutch. I am using Nutch with MySQL.
While trying to crawl http://piwik.org/xmlrpc.php,
Nutch throws an exception:
Parsing http://piwik.org/xmlrpc.php
Call completed
java.lang.RuntimeException: job failed: name=update-table, jobid=null
at
Tejas,
I have a total of 364k files fetched in my last crawl, and I used a topN of
2000 and 2 threads per queue. The gap I have noticed is between 5 and 8
minutes. I had a total of 180 rounds in my crawl (I had some big crawls at
the beginning with a topN of 10k, but after it crashed I changed topN to
Hi!
Looking at 'logs/hadoop.log' will give you more information on why the job
has failed.
To check if a single URL can be crawled, please use the parseChecker tool [0]
[0] - http://wiki.apache.org/nutch/bin/nutch%20parsechecker
I have checked using parseChecker and it worked for me.
On Tue,
This is the log:
2013-03-05 22:06:54,076 WARN parse.ParseUtil - The parsing plugins:
[org.apache.nutch.parse.tika.TikaParser] are enabled via the
plugin.includes system property, and all claim to support the content type
text/plain, but they are not mapped to it in the parse-plugins.xml file
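The warning points at a missing mapping in parse-plugins.xml. A sketch of the entry that would map text/plain to the Tika parser, assuming the standard parse-plugins.xml mimeType/plugin format:

```xml
<!-- Sketch: map text/plain to parse-tika in conf/parse-plugins.xml -->
<mimeType name="text/plain">
  <plugin id="parse-tika" />
</mimeType>
```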
Thanks Lewis. I will give this a try.
On Tue, Mar 5, 2013 at 12:59 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
There are a few things you can do Kiran.
My preference is to use custom counters for successfully and unsuccessfully
parsed docs within the ParserJob or equivalent.
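The custom-counter idea can be illustrated with a toy, self-contained version (hypothetical names; in a real Nutch ParserJob you would increment Hadoop counters via the task context so the values appear in the MapReduce web UI):

```java
import java.util.EnumMap;
import java.util.Map;

// Toy illustration of success/failure parse counters (hypothetical names).
// A real ParserJob would use Hadoop's counter mechanism instead of a map.
public class ParseCounters {
    public enum Status { PARSED, FAILED }

    private final Map<Status, Long> counts = new EnumMap<>(Status.class);

    // Increment the counter for one parsed (or failed) document.
    public void increment(Status status) {
        counts.merge(status, 1L, Long::sum);
    }

    // Read a counter; unseen statuses count as zero.
    public long get(Status status) {
        return counts.getOrDefault(status, 0L);
    }
}
```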
I am new to Nutch. I have already configured Nutch with MySQL. I have a few
questions:
1. Currently I am crawling all the domains from my SEED.TXT. If some
exception occurs, the crawling stops and some domains are not crawled, just
because of one domain/webpage. Is there a way to force Nutch to
Hi all,
When I do a full re-crawl, the old URLs that were modified should be
updated, correct? That is not happening.
Please correct me where I am wrong. Below is the list of steps:
- properties set: db.fetch.interval.default=600sec, db.injector.update=true
- crawl: bin/nutch crawl urls
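For reference, the properties from the steps above would be overridden in conf/nutch-site.xml; a sketch (values taken from the steps listed, comments are assumptions):

```xml
<!-- Sketch of nutch-site.xml overrides for the re-crawl test above -->
<property>
  <name>db.fetch.interval.default</name>
  <value>600</value> <!-- in seconds; this low a value is for testing only -->
</property>
<property>
  <name>db.injector.update</name>
  <value>true</value>
</property>
```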
Hi,
On Tue, Mar 5, 2013 at 7:22 AM, raviksingh ravisingh.air...@gmail.com wrote:
I am new to Nutch. I have already configured Nutch with MySQL. I have a few
questions:
I would like to start by saying that this is not a great idea. If you read
this list you will see why.
1.Currently I am
Documentation: no.
Prior art: yes,
http://www.mail-archive.com/user@nutch.apache.org/msg06927.html
JIRA issues: NUTCH-932
Please let us know how you get on. Getting some concrete documentation for
this would be excellent.
Thank you
Lewis
On Tue, Mar 5, 2013 at 7:33 AM, Anand Bhagwat
Hello,
I seem to remember a discussion about this in the past, but I can't find it
in the archives.
When using mergesegs -slice, is it possible to keep all the pages from a domain
in the same slice? I have just been messing around with this functionality
(Nutch 1.6), and it
Hi
You can't do this with -slice, but you can merge segments and filter them.
This would mean you'd have to merge the segments for each domain. But that's
far too much work. Why do you want to do this? There may be better ways of
achieving your goal.
-Original message-
From:Jason S
Hi
I used the less command and checked: it shows the old content, not the
modified one. Is there any other cache to clear from the crawl db, or any
property to set in nutch-site.xml so that it re-fetches modified content?
As far as I know, the crawl db does not use a cache. As Markus says, you
can simply
Hi
Maybe you can implement the SegmentMergeFilter interface to filter segments
during segment merging.
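The filtering idea can be sketched as a plain predicate over URLs that keeps only one domain. This is conceptual only: the real org.apache.nutch.segment.SegmentMergeFilter interface has its own, different signature, and DomainFilter is a made-up name.

```java
import java.net.URI;
import java.util.function.Predicate;

// Conceptual sketch only (not Nutch's SegmentMergeFilter signature):
// keep a URL if its host is the given domain or a subdomain of it.
public class DomainFilter implements Predicate<String> {
    private final String domain;

    public DomainFilter(String domain) {
        this.domain = domain;
    }

    @Override
    public boolean test(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null
                && (host.equals(domain) || host.endsWith("." + domain));
        } catch (IllegalArgumentException e) {
            return false; // malformed URLs are filtered out
        }
    }
}
```

Running one merge per domain with such a filter is what makes the per-domain approach so much work, as noted above.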
On Wed, Mar 6, 2013 at 6:02 AM, Markus Jelsma markus.jel...@openindex.io wrote:
Hi
You can't do this with -slice, but you can merge segments and filter them.
This would mean you'd have to merge
I have several Solr 3.6 instances that, for various reasons, I don't want to
upgrade to 4.0 yet. My index is too big to fit on one machine. I want to be
able to slice the crawl so that I can have one slice per Solr shard, but also
use the grouping feature in Solr. From what I understand, Solr
Hi Jason,
There is nothing I can see here which concerns Nutch.
Try solr lists please.
Thank you
Lewis
On Tuesday, March 5, 2013, Stubblefield Jason
mr.jason.stubblefi...@gmail.com wrote:
I have several Solr 3.6 instances that, for various reasons, I don't want
to upgrade to 4.0 yet. My index