Re: imported to solr

2011-08-03 Thread Kiks
This question was posted on the Solr list and went unanswered because it is Nutch-related... The indexed contents of 100 sites were imported to Solr from Nutch using: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/* Now, a Solr admin search for 'photography'
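For reference, the poster's indexing step can be sketched as below; the Solr URL and the `crawl` directory are the values from the post and must match your own layout (this is a sketch, not a verified pipeline):

```shell
# Sketch only: assumes a finished Nutch 1.3 crawl in ./crawl and Solr on port 8983.
SOLR_URL="http://127.0.0.1:8983/solr/"
CRAWL="crawl"
CMD="bin/nutch solrindex $SOLR_URL $CRAWL/crawldb $CRAWL/linkdb $CRAWL/segments/*"
echo "$CMD"   # run this from the Nutch runtime directory once the paths exist
```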

Re: Error Input path does not exist when crawling

2011-08-03 Thread Christian Weiske
Hello Dinçer, Somewhere during the crawling process I get an error that stops everything: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707 Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does

Re: Error Input path does not exist when crawling

2011-08-03 Thread Christian Weiske
Hello Dinçer, One more thing, will you share the stats, as in: $ bin/nutch readdb crawl-dir/crawldb -stats $ bin/nutch readseg -list crawl-dir/segments/* When I got that error, the latter listing showed that one (or more) segments had not finished cleanly. But now you can see my segments seem
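The two diagnostics suggested above can be scripted as a quick post-crawl health check; `crawl-dir` is the hypothetical path used in the mail:

```shell
# Sketch: the two commands to run after a crawl to spot unfinished segments.
CRAWL="crawl-dir"
STATS_CMD="bin/nutch readdb $CRAWL/crawldb -stats"
LIST_CMD="bin/nutch readseg -list $CRAWL/segments/*"
echo "$STATS_CMD"   # overall crawldb statistics
echo "$LIST_CMD"    # per-segment listing; a missing/short entry hints at a bad segment
```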

NullPointerException when calling readdb on empty database

2011-08-03 Thread Christian Weiske
Hi, I'm getting the following error: $ bin/nutch readdb crawl-chat/crawldb -stats CrawlDb statistics start: crawl-chat/crawldb Statistics for CrawlDb: crawl-chat/crawldb Exception in thread "main" java.lang.NullPointerException at
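A defensive workaround is to check that the crawldb actually contains data before calling readdb; the `current` subdirectory is how Nutch 1.x lays the db out on disk, but treat this as a sketch:

```shell
# Guard sketch: only run readdb -stats when the crawldb has content.
DB="crawl-chat/crawldb"
if [ -d "$DB/current" ] && [ -n "$(ls -A "$DB/current" 2>/dev/null)" ]; then
  STATUS="ok"      # safe to run: bin/nutch readdb "$DB" -stats
else
  STATUS="empty"   # an empty or absent db is what triggers the NullPointerException
fi
echo "$STATUS"
```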

Fetching ever-changing URLs

2011-08-03 Thread Christian Weiske
Hi, I'd like to crawl pages of chat logs that change whenever someone sends a message in our chat rooms, which happens every couple of seconds. The HTML log pages are updated instantly by the Prosody Jabber server and thus always have current timestamps. Nutch seems to reject them now because

Re: NullPointerException when calling readdb on empty database

2011-08-03 Thread lewis john mcgibbney
Which version of Nutch are you using? Is chat a plain text file with URLs listed one per line? If this is the case, there is no need to add it to your crawl command. Additionally, there is no point in trying to read what is happening in your crawldb if your generator log output indicates that

Re: imported to solr

2011-08-03 Thread lewis john mcgibbney
Hi Kiks, What kind of changes have you made to your schema when transferring to your Solr instance? You ask about the stored parsed text content: the default Nutch schema sets this to stored="false", as it is not always required for all content to be stored. Generally speaking, terms that

Re: New wiki page for Running Nutch 1.3 in Eclipse

2011-08-03 Thread lewis john mcgibbney
Sorry, http://wiki.apache.org/nutch/RunNutchInEclipse On Wed, Aug 3, 2011 at 2:12 PM, Dr.Ibrahim A Alkharashi khara...@kacst.edu.sa wrote: Thanks for the info; would you please post a pointer to the page? Regards Ibrahim On Aug 3, 2011, at 3:13 PM, lewis john mcgibbney

how to extract tf-idf

2011-08-03 Thread Zhanibek Datbayev
Hello Nutch Users, I've googled for a while and still cannot find answers to the following: 1. After I crawl a web site, how can I extract tf-idf for it? 2. How can I access the original web pages crawled? 3. Is it possible to get, for each word, the id it corresponds to? Thanks in advance! -Zhanibek
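For question 2, Nutch's segment reader can dump the fetched pages to a plain-text directory for inspection; the segment name below is hypothetical, and tf-idf itself is not exposed by Nutch 1.3 directly (once the content is in Solr, term statistics can be queried there):

```shell
# Sketch: dump one segment's fetched content to ./dump_out for inspection.
SEG="crawl/segments/20110803000000"   # hypothetical segment path
CMD="bin/nutch readseg -dump $SEG dump_out"
echo "$CMD"
```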

Re: imported to solr

2011-08-03 Thread Way Cool
Potentially you need to make two changes: 1. As Lewis suggested, make sure to change the content field in solr/conf/schema.xml as below: <field name="content" type="text" stored="true" indexed="true"/> 2. Append the following as part of the search URL: hl=on&hl.fl=content site url title OR add the following to
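Putting the two changes together (the archive stripped the XML markup and the `&` separators from the original mail), a reconstruction might look like this; the field type and the query URL are examples and must match your own schema and core:

```shell
# Sketch: the schema field fragment (written to a temp file here, belongs in
# solr/conf/schema.xml) plus a highlighting query for the term from the thread.
TMP=$(mktemp)
cat > "$TMP" <<'EOF'
<field name="content" type="text" stored="true" indexed="true"/>
EOF
QUERY='http://127.0.0.1:8983/solr/select?q=photography&hl=on&hl.fl=content'
grep -c 'stored="true"' "$TMP"
echo "$QUERY"
```

After editing schema.xml you need to restart Solr and reindex for `stored="true"` to take effect, since stored content is written at index time.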

Re: imported to solr

2011-08-03 Thread Kiks
That worked, thanks to you and Lewis. One thing that came up: I first tried to delete the old /apache-solr-3.3.0/example/solr/data/index by renaming it and creating a new directory, but Solr wouldn't start. After restoring the folder, changing the Solr schema.xml to <field name="content" type="text"

Re: imported to solr

2011-08-03 Thread Way Cool
You are welcome. Glad it worked. Have fun. On Wed, Aug 3, 2011 at 4:16 PM, Kiks kikstern...@gmail.com wrote: That worked thanks to you and lewis. One thing that came up was I first tried to delete the old /apache-solr-3.3.0/example/solr/data/index by renaming it and creating a new directory

Re: ranking in nutch/solr results

2011-08-03 Thread Way Cool
Also, you can use dismax in Solr to set up weights for different fields, for example giving url/id more weight than title and content. 2011/7/31 Александр Кожевников b37hr3...@yandex.ru It is called a navigational query. You should implement a custom ranking function for boosting the correct sites.
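A dismax request-handler fragment along the lines suggested might look like this; the boost values are illustrative only, and the fragment (written to a temp file here) would go inside the handler's defaults in solrconfig.xml:

```shell
# Sketch: dismax field weights favouring url over title over content.
TMP=$(mktemp)
cat > "$TMP" <<'EOF'
<str name="defType">dismax</str>
<str name="qf">url^4.0 title^2.0 content^1.0</str>
EOF
grep -c 'qf' "$TMP"
```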

Re: Error Input path does not exist when crawling

2011-08-03 Thread Dinçer Kavraal
Hi Christian, I have been busy with my problems for a couple of days now and noticed it came down to something minor. But what I learned from it is: in conf/log4j.properties, I set log4j.logger.org.apache.nutch.crawl.Crawl=DEBUG, cmdst log4j.logger.org.apache.nutch.crawl.Injector=DEBUG, cmdst

Re: Fetching ever-changing URLs

2011-08-03 Thread Dinçer Kavraal
Hi again, Maybe you could try getting differential logs of the chat server, if possible. If you are running the chat server yourself, you could set log rotation to 10 minutes, for instance, and then add those as if they were different web pages. Or you should check the db.fetch.interval.* values and probably your
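For pages that change every few seconds, the `db.fetch.interval.*` properties mentioned above are the relevant knobs; a sketch of the fragment that would go in conf/nutch-site.xml (written to a temp file here; 600 seconds is an illustrative value):

```shell
# Sketch: shorten the default refetch interval (the stock default is 30 days).
TMP=$(mktemp)
cat > "$TMP" <<'EOF'
<property>
  <name>db.fetch.interval.default</name>
  <value>600</value> <!-- seconds between refetches; default is 2592000 (30 days) -->
</property>
EOF
grep -c 'fetch.interval' "$TMP"
```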

Re: redirect and cookie

2011-08-03 Thread Dinçer Kavraal
For whomever it may concern: I solved this by overriding the protocol-http plugin to send an additional header like: Cookie: mycookie=1 Regards. 2011/8/2 Dinçer Kavraal dkavr...@gmail.com Hi, (USING: Nutch 1.3 on Ubuntu 11.04 - 2.6.38-10-generic-pae) I would like to crawl a
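The effect of the plugin override can be checked from outside Nutch by sending the same extra header manually; the URL below is a placeholder:

```shell
# Sketch: verify the target server honours the cookie before patching the plugin.
HDR='Cookie: mycookie=1'
CMD="curl -s -H '$HDR' http://example.com/"
echo "$CMD"   # compare the response with and without the header
```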