Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr

2013-03-01 Thread Stefan Scheffler

Hi Amit,
As I answered before, there is a config parameter to activate the
crawling of redirections (your db_redir_temp 4,770 and db_redir_perm 56,810).
You have to set it in nutch-site.xml.

Please have a look at the nutch-default.xml to find out which one it is...
Only the pages with db_fetched will be indexed.
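
It is most likely http.redirect.max, but please double-check the exact name
and description in nutch-default.xml. A minimal nutch-site.xml entry would
then look roughly like this:

  <property>
    <name>http.redirect.max</name>
    <!-- 0 (the default) only records redirect targets for a later fetch
         round; a small positive value makes the fetcher follow redirects
         immediately. -->
    <value>3</value>
  </property>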

Regards
Stefan

On 02.03.2013 01:01, Amit Sela wrote:

I am using the crawl script that executes Solr indexing with:
   $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb
$CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
and then executes Solr dedup:
   $bin/nutch solrdedup $SOLRURL

I think it has something to do with the CrawlDB job. The job counters show:
db_redir_temp 4,770
db_redir_perm 56,810
db_notmodified 5,343
db_unfetched 27,385
db_gone  3,741
db_fetched 22,065



Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr

2013-03-01 Thread Amit Sela
I am using the crawl script that executes Solr indexing with:
  $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb
$CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
and then executes Solr dedup:
  $bin/nutch solrdedup $SOLRURL

I think it has something to do with the CrawlDB job. The job counters show:
db_redir_temp 4,770
db_redir_perm 56,810
db_notmodified 5,343
db_unfetched 27,385
db_gone  3,741
db_fetched 22,065
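
For reference, the same counters can be re-checked afterwards from the
crawldb with the readdb tool:
   $bin/nutch readdb $CRAWL_PATH/crawldb -stats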


On Thu, Feb 28, 2013 at 10:02 PM, kiran chitturi
wrote:

> This looks odd. From what I know, the successfully parsed documents are
> sent to Solr. Did you check the logs for any exceptions ?
>
> What command are you using to index ?
>
>
> On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela  wrote:
>
> > Hi everyone,
> >
> > I'm running with nutch 1.6 and Solr 3.6.2.
> > I'm trying to crawl only the seed list (depth 1) and it seems that the
> > process ends with only ~255 of the URLs indexed in Solr.
> >
> > Seed list is about 120K.
> > Fetcher map input is 117K where success is 62K and temp_moved 45K.
> > Parse shows success of 62K.
> > CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
> > and db_fetched=22K.
> >
> > And finally IndexerStatus shows 20K documents added.
> > What am I missing ?
> >
> > Thanks!
> >
> > my nutch-site.xml includes:
> > -
> > plugin.includes =
> > protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i
> > metatags.names = keywords;Keywords;description;Description
> > index.parse.md =
> > metatag.keywords,metatag.Keywords,metatag.description,metatag.Description
> > db.update.additions.allowed = false
> > generate.count.mode = domain
> > partition.url.mode = byDomain
> > file.content.limit = 262144
> > http.content.limit = 262144
> > parse.filter.urls = true
> > parse.normalize.urls = true
> >
>
>
>
> --
> Kiran Chitturi
>


Re: Problem compiling FeedParser plugin with Nutch 2.1 source

2013-03-01 Thread Lewis John Mcgibbney
Well, in addition to obtaining links from the feed content to continue your
crawl, the feed plugin also provides an indexing filter to index feed
documents with the following specific fields: author, tags, published,
updated and the actual feed.
Just to confirm, the feed plugin also uses ROME as the underlying parser
library.
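
For reference, the API change discussed further down the thread is roughly
the following (a sketch from memory; check org.apache.nutch.indexer.IndexingFilter
in each branch for the exact definitions):

  // Nutch 1.x: the indexing hook is built around CrawlDatum and Inlinks,
  // which no longer exist in 2.x (hence the "cannot find symbol: class
  // CrawlDatum" error quoted at the bottom of this thread).
  NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                       CrawlDatum datum, Inlinks inlinks) throws IndexingException;

  // Nutch 2.x (Gora-based): the same hook takes the URL as a String and the
  // stored WebPage row instead, which is why the plugin needs porting.
  NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException;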

On Thursday, February 28, 2013, Anand Bhagwat  wrote:
> Thanks for the quick reply.
>
> Actually I needed some plugin for ATOM feed parsing, so while searching in
> the source I found FeedParser, but it was giving compilation errors. Later
> I tried the Tika parser and was able to parse an ATOM feed. I am not sure
> if I am missing something. Basically the Tika parser extracted urls and
> created new entries in the database, and later when I ran the fetch job
> again I was able to fetch those urls.
>
> So the question is: does FeedParser provide some additional functionality
> which is missing in the Tika parser? As far as I know the Tika parser uses
> ROME, which is a well-known library for parsing feeds.
>
> Regards,
> Anand.
>
> On 1 March 2013 03:38, kiran chitturi  wrote:
>
>> Lewis,
>>
>> On the same note, the following plugins need to be ported; I noticed this
>> when I tried to build 2.x with Eclipse:
>>
>> i)   Feed
>> ii)  parse-swf
>> iii) parse-ext
>> iv)  parse-zip
>> v)   parse-metatags (I wrote a patch for this earlier, NUTCH-1478)
>>
>> The above plugins need to be ported to build 2.x successfully with
>> plugins.
>>
>>
>>
>> > On Thu, Feb 28, 2013 at 4:58 PM, Lewis John Mcgibbney
>> > <lewis.mcgibb...@gmail.com> wrote:
>>
>> > honestly, I think we should get this fixed.
>> > Can someone please explain to me why we don't build every plugin within
>> > Nutch 2.x?
>> > I think we should.
>> >
>> >
>> > On Thu, Feb 28, 2013 at 12:58 PM, kiran chitturi
>> > wrote:
>> >
>> > > This is a problem with the feed plugin. It is not yet ported to 2.x.
>> > >
>> > > The FeedIndexingFilter class extends the IndexingFilter, whose
>> > > interface and method signature changed from 1.x to 2.x.
>> > >
>> > > I fixed a similar one in parse-metatags, which extends the ParseFilter
>> > > interface.
>> > >
>> > > [NUTCH-874] was opened for these issues, but we still do not know
>> > > which plugins need to be ported due to the API changes.
>> > >
>> > >
>> > >
>> > > https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> > >
>> > >
>> > >
>> > > On Thu, Feb 28, 2013 at 3:26 PM, Lewis John Mcgibbney <
>> > > lewis.mcgibb...@gmail.com> wrote:
>> > >
>> > > > This shouldn't be happening, but we are aware (the Jira instance
>> > > > reflects this) that there are some existing compatibility issues
>> > > > with Nutch 2.x HEAD.
>> > > > IIRC Kiran had a patch integrated which dealt with some of these
>> > > > issues.
>> > > > What I have to ask is: what JDK are you using? I use 1.6.0_25 (I
>> > > > really need to upgrade) on my laptop, and we run the Apache Nutch
>> > > > nightly builds for both 1.x trunk and 2.x branch on the latest 1.7
>> > > > version of Java.
>> > > > Unless I have broken my code whilst writing some patches, my code
>> > > > compiles flawlessly locally, and as a project we do not have regular
>> > > > compiler issues with our development nightly builds.
>> > > >
>> > > > On Wed, Feb 27, 2013 at 10:15 PM, Anand Bhagwat
>> > > > <abbhagwa...@gmail.com> wrote:
>> > > >
>> > > > > Hi,
>> > > > > I want to use the FeedParser plugin which comes as part of the
>> > > > > Nutch 2.1 distribution. When I try to build it, it gives
>> > > > > compilation errors. I think it is using some classes from Nutch 1.6
>> > > > > which are not available. Any suggestions as to how I can resolve
>> > > > > this issue?
>> > > > >
>> > > > > [javac]
>> > > > > /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:28:
>> > > > > cannot find symbol
>> > > > > [javac] symbol  : class CrawlDatum
>> > > > > [javac] location: package org.apache.nutch.crawl
>> > > > > [javac] import org.apache.nutch.crawl.Cra

-- 
*Lewis*


Re: a lot of threads spinwaiting

2013-03-01 Thread jc
Thanks a lot for all your answers, this really is an active community.

Roland, I had that problem once, but it's not the case here. I'll try to look
into the crawldb, though HBase is not as friendly for filtering as I would
like it to be; I'm still a newbie there.

Regards,
JC



--
View this message in context: 
http://lucene.472066.n3.nabble.com/a-lot-of-threads-spinwaiting-tp4043801p4044084.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: a lot of threads spinwaiting

2013-03-01 Thread Roland

Hi JC,

I think Markus already answered about politeness :) But without a delay it
will be worse :)


Do these missing URLs match one of the filtering regexes?
Take a look at .../conf/regex-urlfilter.txt; I had a problem with this
regex:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
It will just silently drop all URLs with GET parameters.
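
A quick way to check whether a particular URL survives the configured filters
is the URLFilterChecker tool (assuming your version ships
org.apache.nutch.net.URLFilterChecker with the -allCombined option; it reads
URLs from stdin):

  echo "http://www.example.com/page?id=1" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined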

--Roland




RE: a lot of threads spinwaiting

2013-03-01 Thread Markus Jelsma
Hi,

Regarding politeness, 3 threads per queue is not really polite :)

Cheers

 
 


Re: a lot of threads spinwaiting

2013-03-01 Thread jc
Hi Roland and lufeng,

Thank you very much for your replies. I already tested lufeng's advice, with
results pretty much as expected.

By the way, my Nutch installation is based on version 2.1 with HBase as
crawldb storage.

Roland, maybe the fetcher.server.delay param has something to do with that as
well; I set it to 3 secs. Would setting it to 0 be impolite?

All the info you provided has helped me a lot. Only one issue remains
unfixed: there are more than 60 URLs from different hosts in my seed file,
but only 20 queues. It may look like the other 40 hosts simply have no more
URLs to generate, but I really haven't seen any URL coming from those hosts
since the creation of the crawldb.

Based on my (limited) experience, the following params should allow around 60
queues for my vertical crawl; am I missing something?

topN = 1 million
fetcher.threads.per.queue = 3
fetcher.threads.per.host = 3 (just in case, I remember you told me to use
per.queue instead)
fetcher.threads.fetch = 200
seed urls of different hosts = 60 or more (regex-urlfilter.txt allows only
urls from these hosts, they're all there, I checked)
crawldb record count > 1 million

Thanks again for all your help

Regards,
JC



--
View this message in context: 
http://lucene.472066.n3.nabble.com/a-lot-of-threads-spinwaiting-tp4043801p4043988.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: a lot of threads spinwaiting

2013-03-01 Thread Roland

Hi jc,

and one thing to add: check the robots.txt file of your crawled hosts, 
maybe they are limiting your fetches with delays:

Crawl-delay: 10
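
A quick way to check this per host (www.example.com is just a placeholder):

  curl -s http://www.example.com/robots.txt | grep -i crawl-delay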

--Roland


On 01.03.2013 03:32, feng lu wrote:

Hi jc

<<
I don't understand why there are 19 queues, is it maybe that only 19
websites are being fetched?

Because each queue handles FetchItems which share the same Queue ID (be it a
proto/hostname, proto/IP or proto/domain pair), and the Queue ID is created
based on the queueMode argument. For example, in byHost mode
http://www.example.com/a and http://www.example.com/b end up in the same
queue. So there may simply be only 19 different Queue IDs in the
FetchItemQueues.

<<
  Anyways, why is it that there are 194 spinwaiting out of 200 active
threads?

First of all, I see that the parameter "fetcher.threads.per.host" has been
replaced by "fetcher.threads.per.queue" in Nutch 1.6. There are 200 fetching
threads that can fetch items from any host, but all remaining items come from
the 19 different hosts, and the total remaining URL count shown is 1. Each
queue only holds items with the same Queue ID. So the logs indicate that only
6 threads are still fetching and another 13 have already finished fetching;
maybe those other 13 queues were too small to take much time.
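
The knobs involved here, as far as I remember from nutch-default.xml (worth
double-checking the exact names and defaults in your version), are roughly:

  fetcher.queue.mode         how the Queue ID is built (byHost / byDomain / byIP)
  fetcher.threads.per.queue  how many threads may work on a single queue at once
  fetcher.server.delay       politeness delay between requests to the same queue
  fetcher.threads.fetch      total number of fetcher threads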

Thanks
lufeng


On Fri, Mar 1, 2013 at 6:44 AM, jc  wrote:


Hi guys,

I'm sorry if this question has been answered before, I looked but didn't
find anything.

This is my scenario (only relevant settings I think):
seed urls: about 60 homepages from different domains
generate.max.count = 1
fetcher.threads.per.host = 3   I'm trying to be polite here :-)
partition.url.mode = byHost
fetcher.threads.fetch = 200
fetcher.threads.per.queue = 1
topN = 100
depth = 1

Since the very beginning I've got a lot of spinwaiting threads (I'm not sure
if those are threads because it doesn't really say in the log):

194/200 spinwaiting/active, 166 pages, 3 errors, 4.7 3.8 pages/s, 1471 1412
kb/s, 1 URLs in 19 queues

I don't understand why there are 19 queues, is it maybe that only 19
websites are being fetched? Anyways, why is it that there are 194
spinwaiting out of 200 active threads?

Thanks a lot in advance for your time.

Regards,
jc



--
View this message in context:
http://lucene.472066.n3.nabble.com/a-lot-of-threads-spinwaiting-tp4043801.html
Sent from the Nutch - User mailing list archive at Nabble.com.