Re: keep all pages from a domain in one slice

2013-03-05 Thread Lewis John Mcgibbney
Hi Jason,
There is nothing I can see here which concerns Nutch.
Try the Solr lists, please.
Thank you
Lewis

On Tuesday, March 5, 2013, Stubblefield Jason <
mr.jason.stubblefi...@gmail.com> wrote:
> I have several Solr 3.6 instances that for various reasons, I don't want
to upgrade to 4.0 yet.  My index is too big to fit on one machine.  I want
to be able to slice the crawl so that I can have 1 slice per solr shard,
but also use the grouping feature on solr.  From what I understand, solr
grouping doesn't work properly when pages from a domain are spread across
solr shards.
>
> Basically i'm after something like this:
>
> slice1 (apache.org, linux.org) -> solr1
>
> slice2 (stackoverflow.com, wikipedia.org) -> solr2
>
> etc...
>
> I could upgrade to Solrcloud, or possibly use elasticsearch, but it would
be a fair amount of re-coding.  I was just curious if I could manage the
sharding manually.
>
> Suggestions would certainly be appreciated, it seems like I am faced with
a massive upgrade or to break the grouping functionality.
>
> ~Jason
>
> On Mar 5, 2013, at 11:02 PM, Markus Jelsma 
wrote:
>
>> Hi
>>
>> You can't do this with -slice but you can merge segments and filter
them. This would mean you'd have to merge the segments for each domain. But
that's far too much work. Why do you want to do this? There may be better
ways in achieving you goal.
>>
>>
>>
>> -Original message-
>>> From:Jason S 
>>> Sent: Tue 05-Mar-2013 22:18
>>> To: user@nutch.apache.org
>>> Subject: keep all pages from a domain in one slice
>>>
>>> Hello,
>>>
>>> I seem to remember seeing a discussion about this in the past but I
can't seem to find it in the archives.
>>>
>>> When using mergesegs -slice, is it possible to keep all the pages from
a domain in the same slice?  I have just been messing around with this
functionality (Nutch 1.6), and it seems like the records are simply split
after the counter has reached the slice size specified, sometimes splitting
the records from a single domain over multiple slices.
>>>
>>> How can I segregate a domain to a single slice?
>>>
>>> Thanks in advance,
>>>
>>> ~Jason
>
>

-- 
*Lewis*


Re: keep all pages from a domain in one slice

2013-03-05 Thread Stubblefield Jason
I have several Solr 3.6 instances that for various reasons, I don't want to 
upgrade to 4.0 yet.  My index is too big to fit on one machine.  I want to be 
able to slice the crawl so that I can have 1 slice per solr shard, but also use 
the grouping feature on solr.  From what I understand, solr grouping doesn't 
work properly when pages from a domain are spread across solr shards.

Basically I'm after something like this:

slice1 (apache.org, linux.org) -> solr1

slice2 (stackoverflow.com, wikipedia.org) -> solr2

etc...

I could upgrade to Solrcloud, or possibly use elasticsearch, but it would be a 
fair amount of re-coding.  I was just curious if I could manage the sharding 
manually.
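
To make concrete what I mean by managing the sharding manually: the idea
would be to derive the target shard deterministically from the host, so that
every page of a domain always lands on the same Solr instance. A rough,
untested sketch (the shard count and the use of the raw host name are my own
assumptions, and grouping by registered domain instead of raw host would need
extra handling that I have left out):

import java.net.URI;

/** Minimal sketch: map every URL of a host to a fixed shard index. */
public class HostShardAssigner {

    private final int numShards;

    public HostShardAssigner(int numShards) {
        this.numShards = numShards;
    }

    /** Returns the shard index (0..numShards-1) for the given URL. */
    public int shardFor(String url) throws Exception {
        String host = new URI(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + url);
        }
        // String.hashCode() is defined by the language spec, so the mapping
        // is stable across JVMs and runs; floorMod keeps the index positive.
        return Math.floorMod(host.toLowerCase().hashCode(), numShards);
    }

    public static void main(String[] args) throws Exception {
        HostShardAssigner assigner = new HostShardAssigner(2);
        for (String u : new String[] {"http://apache.org/foo",
                                      "http://stackoverflow.com/questions/1"}) {
            System.out.println(u + " -> solr" + (assigner.shardFor(u) + 1));
        }
    }
}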

Suggestions would certainly be appreciated, it seems like I am faced with a 
massive upgrade or to break the grouping functionality.

~Jason

On Mar 5, 2013, at 11:02 PM, Markus Jelsma  wrote:

> Hi
> 
> You can't do this with -slice but you can merge segments and filter them. 
> This would mean you'd have to merge the segments for each domain. But that's 
> far too much work. Why do you want to do this? There may be better ways in 
> achieving you goal.
> 
> 
> 
> -Original message-
>> From:Jason S 
>> Sent: Tue 05-Mar-2013 22:18
>> To: user@nutch.apache.org
>> Subject: keep all pages from a domain in one slice
>> 
>> Hello,
>> 
>> I seem to remember seeing a discussion about this in the past but I can't 
>> seem to find it in the archives.
>> 
>> When using mergesegs -slice, is it possible to keep all the pages from a 
>> domain in the same slice?  I have just been messing around with this 
>> functionality (Nutch 1.6), and it seems like the records are simply split 
>> after the counter has reached the slice size specified, sometimes splitting 
>> the records from a single domain over multiple slices. 
>> 
>> How can I segregate a domain to a single slice?
>> 
>> Thanks in advance,
>> 
>> ~Jason



Re: keep all pages from a domain in one slice

2013-03-05 Thread feng lu
Hi

Maybe you can implement the SegmentMergeFilter interface to filter segments
during the segment merge.
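
If it helps, the core decision such a filter would make is just a
host-membership test. The sketch below is untested; the plugin wiring
(plugin.xml, registering it in plugin.includes, and implementing
org.apache.nutch.segment.SegmentMergeFilter, whose exact method signature you
should check in your Nutch version) is left out, and the filter() method of
the real plugin would simply delegate to something like keep():

import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of the decision a SegmentMergeFilter-style plugin could make:
 * keep a record only if its host belongs to the domains assigned to the
 * slice currently being merged.
 */
public class DomainSliceFilter {

    private final Set<String> hostsForThisSlice;

    public DomainSliceFilter(Set<String> hostsForThisSlice) {
        this.hostsForThisSlice = hostsForThisSlice;
    }

    /** Returns true if the record for this URL should stay in the merge output. */
    public boolean keep(String url) {
        try {
            String host = new URI(url).getHost();
            return host != null && hostsForThisSlice.contains(host.toLowerCase());
        } catch (Exception e) {
            return false; // drop records with unparsable URLs
        }
    }

    public static void main(String[] args) {
        DomainSliceFilter slice1 = new DomainSliceFilter(
                new HashSet<>(Arrays.asList("apache.org", "linux.org")));
        System.out.println(slice1.keep("http://apache.org/index.html"));    // true
        System.out.println(slice1.keep("http://wikipedia.org/wiki/Nutch")); // false
    }
}

You would run the merge once per slice with a differently configured filter,
which matches Markus' point below that this means one merge per group of
domains.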


On Wed, Mar 6, 2013 at 6:02 AM, Markus Jelsma wrote:

> Hi
>
> You can't do this with -slice but you can merge segments and filter them.
> This would mean you'd have to merge the segments for each domain. But
> that's far too much work. Why do you want to do this? There may be better
> ways in achieving you goal.
>
>
>
> -Original message-
> > From:Jason S 
> > Sent: Tue 05-Mar-2013 22:18
> > To: user@nutch.apache.org
> > Subject: keep all pages from a domain in one slice
> >
> > Hello,
> >
> > I seem to remember seeing a discussion about this in the past but I
> can't seem to find it in the archives.
> >
> > When using mergesegs -slice, is it possible to keep all the pages from a
> domain in the same slice?  I have just been messing around with this
> functionality (Nutch 1.6), and it seems like the records are simply split
> after the counter has reached the slice size specified, sometimes splitting
> the records from a single domain over multiple slices.
> >
> > How can I segregate a domain to a single slice?
> >
> > Thanks in advance,
> >
> > ~Jason
>



-- 
Don't Grow Old, Grow Up... :-)


Re: Nutch Incremental Crawl

2013-03-05 Thread feng lu
Hi

<<
  I used less command and checked, it shows the past content , not modified
one. Any other cache clearing from crawl db? or any property to set in
nutch-site so that it  does re-fetch modified content?
>>
As far as I know, the crawl db does not use a cache. As Markus said, you
can simply re-inject the records. Nutch does not know which web pages need
to be re-fetched; that is controlled only by fetchInterval in the nutch-site
configuration file.

Perhaps the only reason I can think of is that the modified URL's fetch
status is db_notmodified, so Nutch will not download that URL again. You can
check the status of the modified URL with: bin/nutch readdb crawldb/ -url
http://www.example.com/ . If its status is 6, the page is considered not
modified.
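
To illustrate what I mean by the re-fetch being controlled by fetchInterval:
conceptually the scheduler just compares the time since the last fetch with
the configured interval. The sketch below is only an illustration of that
idea, not the actual DefaultFetchSchedule code:

import java.util.concurrent.TimeUnit;

/** Illustration only: how a fetch interval gates re-fetching. */
public class FetchDueCheck {

    /** A URL is due for re-fetch once fetchIntervalSeconds have passed since the last fetch. */
    static boolean isDue(long lastFetchTimeMillis, long fetchIntervalSeconds, long nowMillis) {
        return nowMillis >= lastFetchTimeMillis + TimeUnit.SECONDS.toMillis(fetchIntervalSeconds);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long fetchedThirtyMinutesAgo = now - TimeUnit.MINUTES.toMillis(30);
        // With db.fetch.interval.default=600 (10 minutes), a page fetched
        // 30 minutes ago is due again.
        System.out.println(isDue(fetchedThirtyMinutesAgo, 600, now));              // true
        // With a 30-day interval it is not due yet.
        System.out.println(isDue(fetchedThirtyMinutesAgo, 30L * 24 * 3600, now));  // false
    }
}

Even when a page is due and re-fetched, a db_notmodified status means the
content was judged unchanged, which is why checking the status with readdb is
the first thing to do.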




On Tue, Mar 5, 2013 at 7:48 PM, David Philip wrote:

> Hi,
>   I used less command and checked, it shows the past content , not modified
> one. Any other cache clearing from crawl db? or any property to set in
> nutch-site so that it  does re-fetch modified content?
>
>
>- Cleared tomcat cache
>- settings:
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>600</value>
>   <description></description>
> </property>
>
> <property>
>   <name>db.injector.update</name>
>   <value>true</value>
>   <description></description>
> </property>
>
>
>
> Crawl command : bin/nutch crawl urls -solr
> http://localhost:8080/solrnutch -dir crawltest -depth 10
> This command I executed after 1 hour (modifying some sites content and
> title) but the title or content is still not fetched. The dump (redseg
> dump) shows old content only :(
>
>
> To separately update solr, I executed this command : bin/nutch solrindex
> http://localhost:8080/solrnutch/ crawltest/crawldb -linkdb
> crawltest/linkdb
> crawltest/segments/* -deleteGone
> but no sucess, nothing updated to solr.
>
> *trace :*
> SolrIndexer: starting at 2013-03-05 17:07:15
> SolrIndexer: deleting gone documents
> Indexing 16 documents
> Deleting 1 documents
> SolrIndexer: finished at 2013-03-05 17:09:38, elapsed: 00:02:22
>
> But after this , when  I check in solr (http://localhost:8080/solrnutch/)
> it still shows 16 docs, why it can be? I use nutch 1.5.1 version and
> solr3.6
>
>
> Thanks - David
>
> P.S
> I basically wanted to achieve on demand re-crawl so that all modified
> website get updated in solr, and so when user searches, he gets accurate
> results.
>
>
>
>
>
>
>
>
>
>
> On Tue, Mar 5, 2013 at 12:54 PM, feng lu  wrote:
>
> > Hi David
> >
> > yes, it's a tomcat web service cache.
> >
> > The dump file can use "less" command to open if you use linux OS. or you
> > can use
> > "bin/nutch readseg -get segments/20130121115214/ http://www.cnbeta.com/";
> > to
> > dump the information of specified url.
> >
> >
> >
> >
> > On Tue, Mar 5, 2013 at 3:02 PM, feng lu  wrote:
> >
> > >
> > >
> > >
> > > On Tue, Mar 5, 2013 at 2:49 PM, David Philip <
> > davidphilipshe...@gmail.com>wrote:
> > >
> > >> Hi,
> > >>
> > >> web server cache - you mean /tomcat/work/; where the solr is
> > running?
> > >> Did u mean that cache?
> > >>
> > >> I tried to use the below command {bin/nutch readseg -dump
> > >> crawltest/segments/20130304185844/ crawltest/test}and it gives dump
> > file,
> > >> format is GMC link (application/x-gmc-link)  - I am not able to open
> it.
> > >> How to open this file?
> > >>
> > >> How ever when I ran :  bin/nutch readseg -list
> > >> crawltest/segments/20130304185844/
> > >> NAME GENERATED FETCHER START FETCHER END FETCHED PARSED
> > >> 20130304185844 1 2013-03-04T18:58:53 2013-03-04T18:58:53 1 1
> > >>
> > >>
> > >> - David
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Tue, Mar 5, 2013 at 11:25 AM, feng lu 
> wrote:
> > >>
> > >> > Hi David
> > >> >
> > >> > Do you clear the web server cache. Maybe the refetch is also crawl
> the
> > >> old
> > >> > page.
> > >> >
> > >> > Maybe you can dump the url content to check the modification.
> > >> > using bin/nutch readseg command.
> > >> >
> > >> > Thanks
> > >> >
> > >> >
> > >> > On Tue, Mar 5, 2013 at 1:28 PM, David Philip <
> > >> davidphilipshe...@gmail.com
> > >> > >wrote:
> > >> >
> > >> > > Hi Markus,
> > >> > >
> > >> > >   So I was trying with the *db.injector.update *point that you
> > >> mentioned,
> > >> > > please see my observations below*. *
> > >> > > Settings: I did  *db.injector.update * to* true *and   *
> > >> > > db.fetch.interval.default *to* 1hour. *
> > >> > > *Observation:*
> > >> > >
> > >> > > On first time crawl[1],  14 urls were successfully crawled and
> > >> indexed to
> > >> > > solr.
> > >> > > case 1 :
> > >> > > In those 14 urls I modified the content and title of one url (say
> > >> Aurl)
> > >> > and
> > >> > > re executed the crawl after one hour.
> > >> > > I see that this(Aurl) url is re-fetched (it shows in log) but at
> > Solr
> > >> > level
> > >> > > : for that url (aurl): content field and title field didn't get
> > >> updated.
> > >> > > Why? should I do any configuration for this to make solr index get
> > >> > updated?
> > >> > >
> > >> > > case2:
> > >> > > Added new url to

RE: keep all pages from a domain in one slice

2013-03-05 Thread Markus Jelsma
Hi

You can't do this with -slice but you can merge segments and filter them. This 
would mean you'd have to merge the segments for each domain. But that's far too 
much work. Why do you want to do this? There may be better ways of achieving
your goal.

 
 
-Original message-
> From:Jason S 
> Sent: Tue 05-Mar-2013 22:18
> To: user@nutch.apache.org
> Subject: keep all pages from a domain in one slice
> 
> Hello,
> 
> I seem to remember seeing a discussion about this in the past but I can't 
> seem to find it in the archives.
> 
> When using mergesegs -slice, is it possible to keep all the pages from a 
> domain in the same slice?  I have just been messing around with this 
> functionality (Nutch 1.6), and it seems like the records are simply split 
> after the counter has reached the slice size specified, sometimes splitting 
> the records from a single domain over multiple slices. 
> 
> How can I segregate a domain to a single slice?
> 
> Thanks in advance,
> 
> ~Jason


keep all pages from a domain in one slice

2013-03-05 Thread Jason S
Hello,

I seem to remember seeing a discussion about this in the past but I can't seem 
to find it in the archives.

When using mergesegs -slice, is it possible to keep all the pages from a domain 
in the same slice?  I have just been messing around with this functionality 
(Nutch 1.6), and it seems like the records are simply split after the counter 
has reached the slice size specified, sometimes splitting the records from a 
single domain over multiple slices. 

How can I segregate a domain to a single slice?

Thanks in advance,

~Jason

Re: Rest API for Nutch 2.x

2013-03-05 Thread Lewis John Mcgibbney
Documentation - No
prior art - yes -
http://www.mail-archive.com/user@nutch.apache.org/msg06927.html
Jira issues - NUTCH-932
Please let us know how you get on. Getting some concrete documentation for
this would be excellent.
Thank you
Lewis

On Tue, Mar 5, 2013 at 7:33 AM, Anand Bhagwat  wrote:

> Hi,
> I already know that nutch provides command line tools for crawl and index.
> I also read somewhere that it has a REST API. Do you have any documentation
> around it? Its capabilities, limitations etc.
>
> Regards,
> Anand
>



-- 
*Lewis*


Re: Continue Nutch Crawling After Exception

2013-03-05 Thread Lewis John Mcgibbney
Hi,

On Tue, Mar 5, 2013 at 7:22 AM, raviksingh wrote:

> I am new to Nutch.I have already configured Nutch with MYSQL. I have few
> questions :
>

I would like to start by saying that this is not a great idea. If you read
this list you will see why.


>
> 1.Currently I am crawling all the domains from my SEED.TXT. If some
> exception occurs the crawling stops and some domains are not crawled, just
> because of one domain/webpage. Is there a way to force nutch to continue
> crawling after exception occurs ?
>

What are the exceptions?


>
> 2.I want domains/URLs to be crawled from DB. Currently I and reading from
> DB
> and writing to SEED.TXT before starting to crawl. Is there a better way?
>

Not yet, this has also been discussed pretty thoroughly.


>
> 3.Is there a way to provide URLFilter for scanning/restricting particular
> domain/Url programatically? I have checked org.apache.nutch.net.URLFilter.
> I
> was unable to make it work.
>
>
Please give an example of what you are trying to do here? Are you using the
de facto scripts provided with Nutch or something else to run your Nutch
server?
-- 
*Lewis*


Re: recrawl - will it re-fetch and parse all the URLS again?

2013-03-05 Thread David Philip
The Solr URL is http://localhost:8080/solrnutch.
Versions: Solr 3.6, Nutch 1.6. The fused commands and log below were a
copy-paste problem.



-David


On Wed, Mar 6, 2013 at 12:03 AM, David Philip
wrote:

> Hi all,
>
> When I am doing full re-crawl, the old urls that are modified should be
> updated correct?That is not happening.
>
>  Please correct me where I am wrong. Below are the list of steps:
>
>
>- property set db.fetch.interval.default=600sec db.injector.update=true
>- crawl : bin/nutch crawl urls -solr http://localhost:8080/solrnutch -dir
> crawltest -depth 10
>- after 600 sec
>- crawl : bin/nutch crawl urls -solr http://localhost:8080/solrnutch -dir
> crawltest -depth 10
>
>
>- Nothing updated.  data in solr indexes remain same. I checked the
>fetch segments(bin/nutch readseg), it is also old, But the fetch took
>place.. please see the brief steps of log.
>- I also deleted one URL and made it site not found so that it also
>delete from indexes (using -deleteGone) but this is also not deleted. The
>log shows it deleted but in indexes it is not deleted. I still this URL
>searchable.
>This Seems to be some cache problem (I cleared cache -webserver)or any
>setting that I have to do? Please let me know.]
>
>
> Please see :  This question is related to my old thread but different
> question about update nt successful: data is not re-fetched.
>
>
> Thanks very much - David
> *The brief log trace while second crawl:*
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering:
> 1
> Injector: Merging injected urls into crawl db.
> http://david.wordpress.in/ overwritten with injected record but update
> was specified.
> Injector: finished at 2013-03-05 23:25:49, elapsed: 00:00:03
> Generator: starting at 2013-03-05 23:25:49
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Fetcher: segment: crawltest/segments/20130305232551
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 5 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> fetching http://david.wordpress.in/2011_09_01_archive.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=4
> * queue: david.wordpress.in
>
> so on..
>
> Indexing 10 documents
> Deleting 1 documents
> SolrIndexer: finished at 2013-03-05 23:27:37, elapsed: 00:00:09
> SolrDeleteDuplicates: starting at 2013-03-05 23:27:37
> SolrDeleteDuplicates: Solr url:
> http://localhost:8080/nutch_solr4/collection1/
> SolrDeleteDuplicates: finished at 2013-03-05 23:27:38, elapsed: 00:00:01
> crawl finished: crawltest
>
>
>


recrawl - will it re-fetch and parse all the URLS again?

2013-03-05 Thread David Philip
Hi all,

When I do a full re-crawl, the old URLs that were modified should be
updated, correct? That is not happening.

 Please correct me where I am wrong. Below are the list of steps:


   - property set db.fetch.interval.default=600sec db.injector.update=true
   - crawl : bin/nutch crawl urls -solr
http://localhost:8080/solrnutch -dir crawltest -depth 10
   - after 600 sec
   - crawl : bin/nutch crawl urls -solr
http://localhost:8080/solrnutch -dir crawltest -depth 10


   - Nothing is updated; the data in the Solr index remains the same. I
   checked the fetched segments (bin/nutch readseg) and they are also old,
   but the fetch did take place. Please see the brief log steps below.
   - I also deleted one URL and made its site return "not found" so that it
   would also be deleted from the index (using -deleteGone), but it is not
   deleted. The log shows it was deleted, but it is still in the index; I
   can still find this URL in search results.
   This seems to be some cache problem (I cleared the webserver cache), or
   is there any setting that I have to change? Please let me know.


Please note: this question is related to my old thread, but it is a
different question; the update was not successful and data is not re-fetched.


Thanks very much - David
*The brief log trace while second crawl:*
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
http://david.wordpress.in/ overwritten with injected record but update was
specified.
Injector: finished at 2013-03-05 23:25:49, elapsed: 00:00:03
Generator: starting at 2013-03-05 23:25:49
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Fetcher: segment: crawltest/segments/20130305232551
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 5 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://david.wordpress.in/2011_09_01_archive.html
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=4
* queue: david.wordpress.in

so on..

Indexing 10 documents
Deleting 1 documents
SolrIndexer: finished at 2013-03-05 23:27:37, elapsed: 00:00:09
SolrDeleteDuplicates: starting at 2013-03-05 23:27:37
SolrDeleteDuplicates: Solr url:
http://localhost:8080/nutch_solr4/collection1/
SolrDeleteDuplicates: finished at 2013-03-05 23:27:38, elapsed: 00:00:01
crawl finished: crawltest


Continue Nutch Crawling After Exception

2013-03-05 Thread raviksingh
I am new to Nutch. I have already configured Nutch with MySQL. I have a few
questions:

1.Currently I am crawling all the domains from my SEED.TXT. If some
exception occurs the crawling stops and some domains are not crawled, just
because of one domain/webpage. Is there a way to force Nutch to continue
crawling after an exception occurs?

2.I want domains/URLs to be crawled from DB. Currently I am reading from DB
and writing to SEED.TXT before starting to crawl. Is there a better way?

3.Is there a way to provide a URLFilter for scanning/restricting a particular
domain/URL programmatically? I have checked org.apache.nutch.net.URLFilter. I
was unable to make it work.
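
For reference, what I tried for question 3 looks roughly like the sketch
below (reconstructed from memory, not my exact code; the plugin descriptor
and the plugin.includes registration are omitted, the restricted domain is
just an example, and interface details can differ between Nutch versions):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/**
 * Sketch of a custom URLFilter that only lets URLs from one domain through.
 * Returning the URL keeps it; returning null rejects it.
 */
public class RestrictDomainFilter implements URLFilter {

    private Configuration conf;

    // Example restriction; a real plugin would read this from configuration.
    private static final String ALLOWED_DOMAIN = "example.com";

    @Override
    public String filter(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            if (host != null
                    && (host.equals(ALLOWED_DOMAIN) || host.endsWith("." + ALLOWED_DOMAIN))) {
                return urlString; // keep
            }
        } catch (Exception e) {
            // fall through and reject malformed URLs
        }
        return null; // reject
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }
}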

Please ask any details if required.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Continue-Nutch-Crawling-After-Exception-tp4044888.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Parse statistics in Nutch

2013-03-05 Thread kiran chitturi
Thanks Lewis. I will give a try at this


On Tue, Mar 5, 2013 at 12:59 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> There are a few things you can do Kiran.
> My preference is to use custom counters for successfully and unsuccessfully
> parsed docs within the ParserJob or equivalent. I would be surprised if
> this is not already there however.
> It is not much trouble to add counters to something like this. We already
> do it in InjectorJob for instance to make explicit the number of filtered
> URLs and the number of URLs injected post filtering and normalization.
>
> On Tuesday, March 5, 2013, kiran chitturi 
> wrote:
> > Hi!
> >
> > We already get statistics for fetcher using (readdb -stats) but can we
> also
> > include parse Statistics in the statistics.
> >
> > It will be very helpful in knowing how many documents are successfully
> > parsed and we could use different methods to reparse if we see lot of
> > failing documents.
> >
> > Only way i know to get how many documents are parsed is to check Solr on
> > how many documents are indexed.
> >
> > What do you guys think of this ?
> >
> > --
> > Kiran Chitturi
> >
>
> --
> *Lewis*
>



-- 
Kiran Chitturi


Re: Parse statistics in Nutch

2013-03-05 Thread Lewis John Mcgibbney
There are a few things you can do Kiran.
My preference is to use custom counters for successfully and unsuccessfully
parsed docs within the ParserJob or equivalent. I would be surprised if
this is not already there however.
It is not much trouble to add counters to something like this. We already
do it in InjectorJob for instance to make explicit the number of filtered
URLs and the number of URLs injected post filtering and normalization.
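
To sketch the counter idea (this is illustrative Hadoop code, not the actual
ParserJob types, and the group and counter names are placeholders):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Illustrative mapper that counts successful and failed parse attempts. */
public class ParseCountingMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // ... parse the record here ...
            context.getCounter("ParserStatus", "success").increment(1);
            context.write(key, value);
        } catch (Exception e) {
            context.getCounter("ParserStatus", "failed").increment(1);
        }
    }
}

The counters then show up next to the built-in MapReduce counters in the job
output and web UI, which is exactly the kind of summary being asked for.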

On Tuesday, March 5, 2013, kiran chitturi  wrote:
> Hi!
>
> We already get statistics for fetcher using (readdb -stats) but can we
also
> include parse Statistics in the statistics.
>
> It will be very helpful in knowing how many documents are successfully
> parsed and we could use different methods to reparse if we see lot of
> failing documents.
>
> Only way i know to get how many documents are parsed is to check Solr on
> how many documents are indexed.
>
> What do you guys think of this ?
>
> --
> Kiran Chitturi
>

-- 
*Lewis*


Re: Find which URL created exception

2013-03-05 Thread raviksingh
This is the log : 

The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled
via the plugin.includes system property, and all claim to support the
content type text/plain, but they are not mapped to it  in the
parse-plugins.xml file
2013-03-05 22:06:54,076 WARN  parse.ParseUtil - Unable to successfully parse
content http://piwik.org/xmlrpc.php of type text/plain
2013-03-05 22:06:54,955 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 1
2013-03-05 22:06:55,706 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 1
2013-03-05 22:06:55,707 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-03-05 22:06:55,707 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2013-03-05 22:06:55,707 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2013-03-05 22:06:56,216 WARN  mapred.LocalJobRunner - job_local_0005
java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data
too long for column 'id' at row 1
at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
at
org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for
column 'id' at row 1
at
com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
at
com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
... 5 more
Caused by: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too
long for column 'id' at row 1
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3607)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
at
com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
at
com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
at
com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/job-failed-name-update-table-jobid-null-tp4044914p4044923.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Find which URL created exception

2013-03-05 Thread kiran chitturi
Hi!

Looking at 'logs/hadoop.log' will give you more information on why the job
has failed.

To check if a single URL can be crawled, please use parseChecker tool [0]

[0] - http://wiki.apache.org/nutch/bin/nutch%20parsechecker

I have checked using parseChecker and it worked for me.





On Tue, Mar 5, 2013 at 11:38 AM, raviksingh wrote:

> Hi,
>   I am new to nutch. I am using nutch with MySQL.
> While trying to crawl  http://piwik.org/xmlrpc.php
> 
> nutch throws exception :
>
> Parsing http://piwik.org/xmlrpc.php
> Call completed
> java.lang.RuntimeException: job failed: name=update-table, jobid=null
> at
> org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
> at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:98)
> at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
> at org.apache.nutch.crawl.Crawler.run(Crawler.java:181)
> at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at ravi.crawler.MyCrawl.crawl(MyCrawl.java:13)
> at ravi.crawler.Crawler.AttachCrawl(Crawler.java:88)
> at scheduler.MyTask.run(MyTask.java:15)
> at java.util.TimerThread.mainLoop(Unknown Source)
> at java.util.TimerThread.run(Unknown Source)
>
>
>
> Please check the link as it looks like a service.
>
> How can I either resolve this .
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Find-which-URL-created-exception-tp4044914.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Kiran Chitturi


Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-05 Thread kiran chitturi
Tejas,

I have a total of 364k files fetched in my last crawl, and I used a topN of
2000 and 2 threads per queue. The gap I have noticed is between 5-8
minutes. I had a total of 180 rounds in my crawl (I had some big crawls at
the beginning with a topN of 10k, but after it crashed I changed topN to 2k).


Due to my hardware limitations and local mode, I think using a smaller number
of rounds saved me quite some time. The downside might be having a lot more
segments to go through, but I am writing scripts to automate the index
and reparse tasks.





On Mon, Mar 4, 2013 at 11:18 PM, Tejas Patil wrote:

> Hi Kiran,
>
> Is the 6 mins consistent across those 5 rounds ? With 10k files is takes
> ~60 minutes for writing segments.
> With 2k file, it took 6 min gap. You will need 5 such small rounds to get
> total 10k, so total gap time would be (5 * 6) = 30 mins. Thats half of the
> time taken for the crawl with 10k !! So in a way, you saved 30 mins by
> running small crawls. Something does seem right with the math here.
>
> Thanks,
> Tejas Patil
>
> On Mon, Mar 4, 2013 at 12:45 PM, kiran chitturi
> wrote:
>
> > Thanks Sebastian for the details. This was the bottleneck i had when i am
> > fetching 10k files. Now i switched to 2k and i have a 6 mins gap now.  It
> > took me some time finding right configuration in the local node.
> >
> >
> >
> > On Mon, Mar 4, 2013 at 3:33 PM, Sebastian Nagel
> > wrote:
> >
> > > After all documents are fetched (and ev. parsed) the segment has to be
> > > written:
> > > finish sorting the data and copy it from local temp dir
> (hadoop.tmp.dir)
> > > to the
> > > segment directory. If IO is a bottleneck this may take a while. Also
> > looks
> > > like
> > > you have a lot of content!
> > >
> > > On 03/04/2013 06:03 AM, kiran chitturi wrote:
> > > > Thanks for your suggestion guys! The big crawl is fetching large
> amount
> > > of
> > > > big PDF files.
> > > >
> > > > For something like below, the fetcher took a lot of time to finish
> up,
> > > even
> > > > though the files are fetched. It shows more than one hour of time.
> > > >
> > > >>
> > > >> 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0,
> > > >> spinWaiting=0, fetchQueues.totalSize=0
> > > >> 2013-03-01* 19:45:43,217 *INFO  fetcher.Fetcher - -activeThreads=0
> > > >> 2013-03-01* 20:57:55,288* INFO  fetcher.Fetcher - Fetcher: finished
> at
> > > >> 2013-03-01 20:57:55, elapsed: 01:34:09
> > > >
> > > >
> > > > Does fetching a lot of files causes this issue ? Should i stick to
> one
> > > > thread per local mode or use pseudo distributed mode to improve
> > > performance
> > > > ?
> > > >
> > > > What is an acceptable time fetcher should finish up after fetching
> the
> > > > files ? What exactly happens in this step ?
> > > >
> > > > Thanks again!
> > > > Kiran.
> > > >
> > > >
> > > >
> > > > On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma <
> > > markus.jel...@openindex.io>wrote:
> > > >
> > > >> The default heap size of 1G is just enough for a parsing fetcher
> with
> > 10
> > > >> threads. The only problem that may rise is too large and complicated
> > PDF
> > > >> files or very large HTML files. If you generate fetch lists of a
> > > reasonable
> > > >> size there won't be a problem most of the time. And if you want to
> > > crawl a
> > > >> lot, then just generate more small segments.
> > > >>
> > > >> If there is a bug it's most likely to be the parser eating memory
> and
> > > not
> > > >> releasing it.
> > > >>
> > > >> -Original message-
> > > >>> From:Tejas Patil 
> > > >>> Sent: Sun 03-Mar-2013 22:19
> > > >>> To: user@nutch.apache.org
> > > >>> Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to
> create
> > > >> new native thread
> > > >>>
> > > >>> I agree with Sebastian. It was a crawl in local mode and not over a
> > > >>> cluster. The intended crawl volume is huge and if we dont override
> > the
> > > >>> default heap size to some decent value, there is high possibility
> of
> > > >> facing
> > > >>> an OOM.
> > > >>>
> > > >>>
> > > >>> On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi <
> > > >> chitturikira...@gmail.com>wrote:
> > > >>>
> > > > If you find the time you should trace the process.
> > > > Seems to be either a misconfiguration or even a bug.
> > > >
> > > > I will try to track this down soon with the previous
> configuration.
> > > >> Right
> > >  now, i am just trying to get data crawled by Monday.
> > > 
> > >  Kiran.
> > > 
> > > 
> > > >>> Luckily, you should be able to retry via "bin/nutch parse ..."
> > > >>> Then trace the system and the Java process to catch the reason.
> > > >>>
> > > >>> Sebastian
> > > >>>
> > > >>> On 03/02/2013 08:13 PM, kiran chitturi wrote:
> > >  Sorry, i am looking to crawl 400k documents with the crawl. I
> > > >> said
> > >  400
> > > > in
> > >  my last message.
> > > 
> > > 
> > >  On Sat, Mar 2, 2013 at 2:12 PM, kiran

Find which URL created exception

2013-03-05 Thread raviksingh
Hi, 
  I am new to nutch. I am using nutch with MySQL. 
While trying to crawl  http://piwik.org/xmlrpc.php
  
nutch throws exception :

Parsing http://piwik.org/xmlrpc.php
Call completed
java.lang.RuntimeException: job failed: name=update-table, jobid=null
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:98)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:181)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at ravi.crawler.MyCrawl.crawl(MyCrawl.java:13)
at ravi.crawler.Crawler.AttachCrawl(Crawler.java:88)
at scheduler.MyTask.run(MyTask.java:15)
at java.util.TimerThread.mainLoop(Unknown Source)
at java.util.TimerThread.run(Unknown Source)



Please check the link as it looks like a service.

How can I either resolve this or find which URL caused the exception?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Find-which-URL-created-exception-tp4044914.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Nutch 1.6 : How to reparse Nutch segments ?

2013-03-05 Thread kiran chitturi
Thanks Tejas. Deleting the 'crawl_parse' directory worked for me today.




On Mon, Mar 4, 2013 at 11:15 PM, Tejas Patil wrote:

> Yes. After I deleted that directory, parse operation ran successfully. Even
> if its an empty directory, parse wont proceed normally.
>
>
> On Mon, Mar 4, 2013 at 8:07 PM, kiran chitturi  >wrote:
>
> > Thanks Tejas for the information.
> >
> > Did you try deleting 'crawl_parse' directory ? Since, the code checks for
> > that directory, i will try deleting and reparsing.
> >
> >
> >
> > On Mon, Mar 4, 2013 at 10:49 PM, Tejas Patil  > >wrote:
> >
> > > The code [0] checks if there is already a "crawl_parse" directory in
> the
> > > segment [lines 88-89].
> > >
> > >  88 if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
> > >  89   throw new IOException("Segment already parsed!");
> > > I am not sure what you guys meant by deleting the subsection of the
> > > directories. Did you mean deletion of the contents inside the old
> > > crawl_parse directory ? I tried that locally and it didn't work.
> > >
> > > [0] :
> > >
> > >
> >
> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup
> > >
> > >
> > > On Mon, Mar 4, 2013 at 4:20 PM, kiran chitturi <
> > chitturikira...@gmail.com
> > > >wrote:
> > >
> > > > It took me close to 2 days to fetch 400k pages on my not so fast
> single
> > > > machine. I do not want to refetch unless it very crucial.
> > > >
> > > > I will check and see if deleting any sub-directories is helpful
> > > >
> > > > Thanks!
> > > >
> > > >
> > > > On Mon, Mar 4, 2013 at 5:54 PM, Lewis John Mcgibbney <
> > > > lewis.mcgibb...@gmail.com> wrote:
> > > >
> > > > > This makes perfect sense Kiran. It is something I've encountered in
> > the
> > > > > past and as my segments were not production critical I was easily
> > able
> > > to
> > > > > delete and re-fetch them then parse out the stuff I wanted to.
> > > > > As I said, I think this is the only way to get I'm afraid.
> > > > >
> > > > > On Mon, Mar 4, 2013 at 2:25 PM, kiran chitturi <
> > > > chitturikira...@gmail.com
> > > > > >wrote:
> > > > >
> > > > > > Yeah. I used parse-(tika|metatags) first in the configuration and
> > > now i
> > > > > > want to use parse-(html|tika|metatags). This is due to the
> > > > parse-metatags
> > > > > > new patch upgrade.
> > > > > >
> > > > > > Thanks for the suggestions. It would be very helpful for
> reparsing
> > > > > segments
> > > > > > for 1.x like 2.x has.
> > > > > >
> > > > > > Regards,
> > > > > > Kiran.
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 4, 2013 at 4:51 PM, Lewis John Mcgibbney <
> > > > > > lewis.mcgibb...@gmail.com> wrote:
> > > > > >
> > > > > > > Please don't go ahead and delete the parse directories just yet
> > > > before
> > > > > > you
> > > > > > > hear back from others.
> > > > > > > My suggestion would be to try and delete a subsection of the
> > > > > directories
> > > > > > > and see if this is possible.
> > > > > > > Have you changed some configuration and now want to parse out
> > some
> > > > more
> > > > > > > content/structure?
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Mar 4, 2013 at 1:33 PM, kiran chitturi <
> > > > > > chitturikira...@gmail.com
> > > > > > > >wrote:
> > > > > > >
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > I am trying to reparse Nutch segments and it says 'Segment
> > > already
> > > > > > > parsed'
> > > > > > > > when i try to parse.
> > > > > > > >
> > > > > > > > Is there any option of attribute as '-reparse' like 2.x
> series
> > > has
> > > > ?
> > > > > > > >
> > > > > > > > Should i delete some directories so that i can reparse ?
> > > > > > > >
> > > > > > > > Please give me suggestions on how to reparse segments that
> are
> > > > > already
> > > > > > > > parsed.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > --
> > > > > > > > Kiran Chitturi
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > *Lewis*
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Kiran Chitturi
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > *Lewis*
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Kiran Chitturi
> > > >
> > >
> >
> >
> >
> > --
> > Kiran Chitturi
> >
>



-- 
Kiran Chitturi


Re: Robots.db instead of robots.txt

2013-03-05 Thread Tejas Patil
Nutch internally caches the robots rules (it uses a hash map) in every
round. It will fetch the robots file for a particular host just once in a
given round. This model works out well. If you are creating a separate db
for it, then you have to ensure that it is updated frequently to take into
account the changes made on the server.
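
As a rough illustration of that per-round caching model (placeholder types,
not Nutch's actual robots handling):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative per-round robots cache keyed by "protocol://host". */
public class RobotsCache {

    /** Placeholder for a parsed robots.txt ruleset. */
    public interface RobotRules {
        boolean isAllowed(String url);
    }

    private final Map<String, RobotRules> cache = new ConcurrentHashMap<>();

    /** Fetch and parse robots.txt for one host; stubbed out to allow everything. */
    private RobotRules fetchAndParse(String hostKey) {
        // A real crawler would fetch hostKey + "/robots.txt" and parse it here.
        return url -> true;
    }

    /** The rules for a host are fetched at most once per lifetime of this cache. */
    public RobotRules rulesFor(String protocol, String host) {
        return cache.computeIfAbsent(protocol + "://" + host, this::fetchAndParse);
    }

    public static void main(String[] args) {
        RobotsCache cache = new RobotsCache();
        System.out.println(cache.rulesFor("http", "example.com")
                .isAllowed("http://example.com/page"));
        // A second lookup for the same host hits the cache; no second fetch.
        cache.rulesFor("http", "example.com");
    }
}

Because the cache only lives for one round, changes to a robots.txt are
picked up on the next round, which is the freshness problem a precomputed
robots.db would have to solve separately.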

On Tue, Mar 5, 2013 at 7:15 AM, Raja Kulasekaran  wrote:

> Hi,
>
> I meant to move the entire crawl process in the client environment , create
>  "robots.db"  and fetch only robots.db as a indexed data .
>
> Raja
>
> On Tue, Mar 5, 2013 at 8:27 PM, Tejas Patil  >wrote:
>
> > robots.txt is a global standard accepted by everyone. Even google, bing
> use
> > that. I dont think that there is any db file format maintained by web
> > servers for the robots information.
> >
> >
> > On Tue, Mar 5, 2013 at 1:29 AM, Raja Kulasekaran 
> > wrote:
> >
> > > Hi
> > >
> > > Instead of parsing robots.txt file, why don't ask the web hoster or web
> > > administrator to create the complete parsed text in the db file format
> at
> > > the robots.txt location itself ?
> > >
> > > Is there are any standard protocol ?  It would be a better idea to stop
> > > transferring data through crawlers .
> > >
> > > Please let me know your thoughts on the same .
> > >
> > > Raja
> > >
> >
>


Rest API for Nutch 2.x

2013-03-05 Thread Anand Bhagwat
Hi,
I already know that nutch provides command line tools for crawl and index.
I also read somewhere that it has a REST API. Do you have any documentation
around it? Its capabilities, limitations etc.

Regards,
Anand


Re: Robots.db instead of robots.txt

2013-03-05 Thread Raja Kulasekaran
Hi,

I meant to move the entire crawl process to the client environment, create
"robots.db", and fetch only robots.db as indexed data.

Raja

On Tue, Mar 5, 2013 at 8:27 PM, Tejas Patil wrote:

> robots.txt is a global standard accepted by everyone. Even google, bing use
> that. I dont think that there is any db file format maintained by web
> servers for the robots information.
>
>
> On Tue, Mar 5, 2013 at 1:29 AM, Raja Kulasekaran 
> wrote:
>
> > Hi
> >
> > Instead of parsing robots.txt file, why don't ask the web hoster or web
> > administrator to create the complete parsed text in the db file format at
> > the robots.txt location itself ?
> >
> > Is there are any standard protocol ?  It would be a better idea to stop
> > transferring data through crawlers .
> >
> > Please let me know your thoughts on the same .
> >
> > Raja
> >
>


Re: Robots.db instead of robots.txt

2013-03-05 Thread Tejas Patil
robots.txt is a global standard accepted by everyone. Even Google and Bing
use it. I don't think that there is any db file format maintained by web
servers for the robots information.


On Tue, Mar 5, 2013 at 1:29 AM, Raja Kulasekaran  wrote:

> Hi
>
> Instead of parsing robots.txt file, why don't ask the web hoster or web
> administrator to create the complete parsed text in the db file format at
> the robots.txt location itself ?
>
> Is there are any standard protocol ?  It would be a better idea to stop
> transferring data through crawlers .
>
> Please let me know your thoughts on the same .
>
> Raja
>


Re: Nutch Incremental Crawl

2013-03-05 Thread David Philip
Hi,
  I used the less command and checked: it shows the past content, not the
modified one. Is there any other cache to clear besides the crawl db, or any
property to set in nutch-site so that it does re-fetch modified content?


   - Cleared tomcat cache
   - settings:

<property>
  <name>db.fetch.interval.default</name>
  <value>600</value>
  <description></description>
</property>

<property>
  <name>db.injector.update</name>
  <value>true</value>
  <description></description>
</property>



Crawl command : bin/nutch crawl urls -solr
http://localhost:8080/solrnutch -dir crawltest -depth 10
I executed this command after 1 hour (after modifying some site content and
titles), but the title and content are still not re-fetched. The dump (readseg
dump) shows old content only :(


To separately update solr, I executed this command : bin/nutch solrindex
http://localhost:8080/solrnutch/ crawltest/crawldb -linkdb crawltest/linkdb
crawltest/segments/* -deleteGone
but no success; nothing was updated in Solr.

*trace :*
SolrIndexer: starting at 2013-03-05 17:07:15
SolrIndexer: deleting gone documents
Indexing 16 documents
Deleting 1 documents
SolrIndexer: finished at 2013-03-05 17:09:38, elapsed: 00:02:22

But after this, when I check in Solr (http://localhost:8080/solrnutch/),
it still shows 16 docs; why could that be? I use Nutch 1.5.1 and Solr 3.6.


Thanks - David

P.S
I basically wanted to achieve an on-demand re-crawl so that all modified
websites get updated in Solr, so that when a user searches, he gets accurate
results.










On Tue, Mar 5, 2013 at 12:54 PM, feng lu  wrote:

> Hi David
>
> yes, it's a tomcat web service cache.
>
> The dump file can use "less" command to open if you use linux OS. or you
> can use
> "bin/nutch readseg -get segments/20130121115214/ http://www.cnbeta.com/";
> to
> dump the information of specified url.
>
>
>
>
> On Tue, Mar 5, 2013 at 3:02 PM, feng lu  wrote:
>
> >
> >
> >
> > On Tue, Mar 5, 2013 at 2:49 PM, David Philip <
> davidphilipshe...@gmail.com>wrote:
> >
> >> Hi,
> >>
> >> web server cache - you mean /tomcat/work/; where the solr is
> running?
> >> Did u mean that cache?
> >>
> >> I tried to use the below command {bin/nutch readseg -dump
> >> crawltest/segments/20130304185844/ crawltest/test}and it gives dump
> file,
> >> format is GMC link (application/x-gmc-link)  - I am not able to open it.
> >> How to open this file?
> >>
> >> How ever when I ran :  bin/nutch readseg -list
> >> crawltest/segments/20130304185844/
> >> NAME GENERATED FETCHER START FETCHER END FETCHED PARSED
> >> 20130304185844 1 2013-03-04T18:58:53 2013-03-04T18:58:53 1 1
> >>
> >>
> >> - David
> >>
> >>
> >>
> >>
> >>
> >> On Tue, Mar 5, 2013 at 11:25 AM, feng lu  wrote:
> >>
> >> > Hi David
> >> >
> >> > Do you clear the web server cache. Maybe the refetch is also crawl the
> >> old
> >> > page.
> >> >
> >> > Maybe you can dump the url content to check the modification.
> >> > using bin/nutch readseg command.
> >> >
> >> > Thanks
> >> >
> >> >
> >> > On Tue, Mar 5, 2013 at 1:28 PM, David Philip <
> >> davidphilipshe...@gmail.com
> >> > >wrote:
> >> >
> >> > > Hi Markus,
> >> > >
> >> > >   So I was trying with the *db.injector.update *point that you
> >> mentioned,
> >> > > please see my observations below*. *
> >> > > Settings: I did  *db.injector.update * to* true *and   *
> >> > > db.fetch.interval.default *to* 1hour. *
> >> > > *Observation:*
> >> > >
> >> > > On first time crawl[1],  14 urls were successfully crawled and
> >> indexed to
> >> > > solr.
> >> > > case 1 :
> >> > > In those 14 urls I modified the content and title of one url (say
> >> Aurl)
> >> > and
> >> > > re executed the crawl after one hour.
> >> > > I see that this(Aurl) url is re-fetched (it shows in log) but at
> Solr
> >> > level
> >> > > : for that url (aurl): content field and title field didn't get
> >> updated.
> >> > > Why? should I do any configuration for this to make solr index get
> >> > updated?
> >> > >
> >> > > case2:
> >> > > Added new url to the crawling site
> >> > > The url got indexed - This is success. So interested to know why the
> >> > above
> >> > > case failed? What configuration need to be made?
> >> > >
> >> > >
> >> > > Thanks - David
> >> > >
> >> > >
> >> > > *PS:*
> >> > > Apologies that I am still asking questions on same topic. I am not
> >> able
> >> > to
> >> > > find good way for incremental crawl so trying different approaches.
> >> >  Once I
> >> > > am clear I will blog this and share it. Thanks lot for replies from
> >> > mailer.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Wed, Feb 27, 2013 at 4:06 PM, Markus Jelsma
> >> > > wrote:
> >> > >
> >> > > > You can simply reinject the records.  You can overwrite and/or
> >> update
> >> > the
> >> > > > current record. See the db.injector.update and overwrite settings.
> >> > > >
> >> > > > -Original message-
> >> > > > > From:David Philip 
> >> > > > > Sent: Wed 27-Feb-2013 11:23
> >> > > > > To: user@nutch.apache.org
> >> > > > > Subject: Re: Nutch Incremental Crawl
> >> > > > >
> >> > > > > HI Markus, I meant over riding  the injected inter

Understanding fetch MapReduce job counters and logs

2013-03-05 Thread Amit Sela
Hi all,

I am trying to better understand the counters and logging of the fetch
MapReduce executed when crawling.

When looking at the job counters in the MapReduce web UI, I note the
following counters and values:

*Map input records 162,080*
moved 345
robots_denied 4,441
robots_denied_maxcrawldelay 259
*hitByTimeLimit 7,493*
exception 3,801
notmodified 2
gone 48
access_denied 1
*success 93,583*
temp_moved 3,068
notfound 1,490

And summing all counters does not equal the total map input...

But, when I go to the map task logs, at the end of each log there is a line
stating:

QueueFeeder finished: total *36651* records + hit by time limit :*20975*
QueueFeeder finished: total *30248* records + hit by time limit :*25492*
QueueFeeder finished: total *44257* records + hit by time limit :*4460*
*
*
Summing all of these numbers does equal the total map input. I also note
that the total hit by time limit here is 50927 but the job counters show
7493.

Can anyone elaborate?

Thanks,
Amit.


Re: Nutch 2.1 crawling step by step and crawling command differences

2013-03-05 Thread Adriana Farina
Ok, I didn't read that issue on jira.

Thank you very much, I'll use the crawl script!

Sent from my iPhone

On Mar 4, 2013, at 18:35, Lewis John Mcgibbney wrote:

> Hi,
> If you look at the crawl script iirc there is no way to programmatically
> obtain the generated batchId(s) from the generator.
> This sounds like the source of the problem.
> As Kiran said though, the Nutch crawl script is the way forward ;)
> 
> On Monday, March 4, 2013, kiran chitturi  wrote:
>> Hi Adriana,
>> 
>> I do not know the solution for your problem but in general crawl command
> is
>> deprecated and using crawl script (step by step) is encouraged.
>> 
>> Please check [0] for more details
>> 
>> [0] - https://issues.apache.org/jira/browse/NUTCH-1087
>> 
>> 
>> On Mon, Mar 4, 2013 at 11:23 AM, Adriana Farina
>> wrote:
>> 
>>> Hello,
>>> 
>>> I'm using Nutch 2.1 in distributed mode with Hadoop 1.0.4 and HBase
> 0.90.4
>>> as database.
>>> 
>>> When I launch the crawling job step by step everything works fine, but
> when
>>> I launch the crawl command (either through the command "hadoop jar
>>> apache-nutch-2.1.job org.apache.nutch.crawl.Crawler urls /urls.txt
> -depth 3
>>> -topN 5" or through the command "bin/nutch crawl urls /urls.txt -depth 3
>>> -topN 5" inside the folder nutch/runtime/deploy) it doesn't fetch
> anything.
>>> The crawling job runs without problem until the end and it doesn't output
>>> any exception.
>>> However, if I look inside the webpage table created in HBase it is like
> the
>>> nutch job executes only the inject phase.
>>> 
>>> I've dug in the source code of nutch but I'm not able to figure out what
>>> can be the problem. At first I thought that it could be due to the batch
>>> id, since in the "step-by-step mode" I pass it explicitly to the fetcher
>>> and the parser, but this does not exeplain why when I run the crawl
> command
>>> it does not seem to run the generator.
>>> 
>>> Can somebody help me please?
>>> 
>>> Thank you!
>>> 
>>> 
>>> --
>>> Adriana Farina
>>> 
>> 
>> 
>> 
>> --
>> Kiran Chitturi
>> 
> 
> -- 
> *Lewis*