This question was posted on the Solr list and went unanswered because it is Nutch
related...
The indexed contents of 100 sites were imported into Solr from Nutch using:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb
crawl/segments/*
Now, a Solr admin search for 'photography'
Hello Dinçer,
Somewhere during the crawling process I get an error that stops
everything:
file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/
segments/20110801090707
Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does
Hello Dinçer,
One more thing, will you share the stats as:
$ bin/nutch readdb crawl-dir/crawldb -stats
$ bin/nutch readseg -list crawl-dir/segments/*
When I got that error, the latter list showed that one (or more)
segments had not finished properly. But now you can see my segments seem
Hi,
I'm getting the following error:
$ bin/nutch readdb crawl-chat/crawldb -stats
CrawlDb statistics start: crawl-chat/crawldb
Statistics for CrawlDb: crawl-chat/crawldb
Exception in thread "main" java.lang.NullPointerException
at
Hi,
I'd like to crawl pages of chat logs that change whenever someone sends
a message in our chat rooms, which happens every couple of seconds.
The HTML log pages are updated instantly by the prosody jabber server
and thus have always current timestamps.
Nutch seems to reject them now because
Which version of Nutch are you using?
Is chat a plain text file with URLs listed one per line? If this is the case,
there is no need to add it to your crawl command. Additionally, there is no
point in trying to read what is happening in your crawldb if your generator
log output indicates that
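If it is indeed a seed list, the usual layout is a flat text file under a urls/ directory with one URL per line; a minimal sketch (the file name and hostnames are illustrative placeholders, not from the thread):

```
# urls/seed.txt - one URL per line
http://chat.example.com/logs/room1.html
http://chat.example.com/logs/room2.html
```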
Hi Kiks,
What kind of changes have you made to your schema when transferring to Solr
instance?
You ask about the stored parsed text content; well, the default Nutch schema
sets this to stored=false, as it is not always required for all
content to be stored. Generally speaking, terms that
Sorry
http://wiki.apache.org/nutch/RunNutchInEclipse
On Wed, Aug 3, 2011 at 2:12 PM, Dr.Ibrahim A Alkharashi
khara...@kacst.edu.sa wrote:
thanks for the info, would you please post a pointer to the page.
Regards
Ibrahim
On Aug 3, 2011, at 3:13 PM, lewis john mcgibbney
Hello Nutch Users,
I've googled for a while and still cannot find answers to the following:
1. After I crawl a web site, how can I extract tf-idf for it?
2. How can I access original web pages crawled?
3. Is it possible to get for each word id it corresponds to?
Thanks in advance!
-Zhanibek
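For question 2, the raw fetched pages typically live in the segments and can be dumped with Nutch's segment reader; a sketch, assuming a segment directory like the one seen elsewhere in this thread (the output directory name is a placeholder):

```
$ bin/nutch readseg -dump crawl/segments/20110801090707 dumpdir
```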
Potentially you need to make two changes:
1. As Lewis suggested, make sure to change the content field in
solr/conf/schema.xml as below:
<field name="content" type="text" stored="true" indexed="true"/>
2. Append the following as a part of search url:
hl=on&hl.fl=content,site,url,title
OR
Add the following to
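Putting option 2 together, a full request might look like the following (the host, port and handler follow the default Solr example setup; the highlighted field list is an assumption):

```
http://127.0.0.1:8983/solr/select?q=photography&hl=on&hl.fl=content,site,url,title
```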
That worked thanks to you and lewis.
One thing that came up was I first tried to delete the old
/apache-solr-3.3.0/example/solr/data/index
by renaming it and creating a new directory but solr wouldn't start.
After restoring the folder, changing solr schema.xml to
<field name="content" type="text"
You are welcome. Glad it worked. Have fun.
On Wed, Aug 3, 2011 at 4:16 PM, Kiks kikstern...@gmail.com wrote:
Also, you can use dismax in Solr to set up weights for different fields; for
example, give url/id more weight than title and content.
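A minimal sketch of such a dismax query, using the field names seen earlier in this thread (the boost values are illustrative assumptions, and the spaces in qf would need URL-encoding in practice):

```
defType=dismax&qf=url^4.0 id^4.0 title^2.0 content^1.0
```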
2011/7/31 Александр Кожевников b37hr3...@yandex.ru
It is called a navigational query.
You should implement a custom ranking function to boost the correct sites.
Hi Christian,
I have been busy with my problem for a couple of days now, and it turned
out to be something minor. But what I have learned from that is, in
conf/log4j.properties, I set
log4j.logger.org.apache.nutch.crawl.Crawl=DEBUG, cmdst
log4j.logger.org.apache.nutch.crawl.Injector=DEBUG, cmdst
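For those logger lines to produce output, the `cmdst` appender they reference must also be defined in conf/log4j.properties; a minimal sketch (the appender class and layout pattern are assumptions, not the stock Nutch configuration):

```
log4j.appender.cmdst=org.apache.log4j.ConsoleAppender
log4j.appender.cmdst.layout=org.apache.log4j.PatternLayout
log4j.appender.cmdst.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
```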
Hi again,
Maybe you could try getting differential logs of the chat server, if
possible. If you are managing the chat server, you could set log rotation to 10
minutes, for instance, and then add those logs as if they were different web pages.
Or, you should check the db.fetch.interval.* values and probably your
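The db.fetch.interval.* settings mentioned above can be overridden in conf/nutch-site.xml; a sketch lowering the default re-fetch interval (the 600-second value is an illustrative assumption):

```
<property>
  <name>db.fetch.interval.default</name>
  <value>600</value>
  <description>Re-fetch pages every 10 minutes instead of the 30-day default.</description>
</property>
```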
To whom it may concern,
I achieved my solution by overriding the protocol-http plugin to send an
additional header like:
Cookie: mycookie=1
Regards.
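As an illustration of what the patched plugin would emit on the wire, the raw request looks roughly like this (the path, host and cookie value are placeholders):

```
GET /chat/log.html HTTP/1.0
Host: chat.example.com
Cookie: mycookie=1
```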
2011/8/2 Dinçer Kavraal dkavr...@gmail.com
Hi,(USING: nutch 1.3 on Ubuntu 11.04 - 2.6.38-10-generic-pae)
I would like to crawl a