There was a discussion about this a few months back, but I am not aware of the exact root cause. See:
http://lucene.472066.n3.nabble.com/Nutch-2-1-different-batch-id-null-td4040592.html
http://lucene.472066.n3.nabble.com/Re-nutch-2-1-with-mysql-different-batch-id-null-td4058698.html

There is a Jira issue tracking this: https://issues.apache.org/jira/browse/NUTCH-1567

On Thu, Jun 13, 2013 at 2:11 PM, Weder Carlos Vieira <weder.vie...@gmail.com> wrote:

> mhmmm got it...
>
> Tejas can you please explain to me why I put some URL inside urls/seed.txt
> and many pages inside that urls aren't parsed?
>
> Example:
> Skipping http://wiki.creativecommons.org/Integrate; different batch id (null)
> Skipping http://wiki.creativecommons.org/LRMI; different batch id (null)
> Skipping http://wiki.creativecommons.org/Marking; different batch id (null)
>
> This pages are example of many others pages that aren't parsed.
> Like that, there are many other pages that I wanted to be read and recorded
> in the database.
>
> Thanks again.
>
> On Thu, Jun 13, 2013 at 6:04 PM, Tejas Patil <tejas.patil...@gmail.com> wrote:
>
> > Those are all images which wont get parsed by Nutch.
> >
> > On Thu, Jun 13, 2013 at 1:33 PM, Weder Carlos Vieira <weder.vie...@gmail.com> wrote:
> >
> > > I extracted 1 row of this urls returned...
> > >
> > > It attached in excel format.