Exclude html-content from index

2010-10-07 Thread Matthias Paul
Hi everyone, I'm using Nutch 1.2 to index an intranet site (with Solr as the indexer). I would like to exclude certain parts of the HTML pages, for example the footer. I found previous posts about this problem, but none with a clear solution. Can anyone point me to some relevant

Re: Custom Search

2010-10-07 Thread Yavuz Selim YILMAZ
I tried to read my defined metadata with parse.getData().getMeta(category), but it returns null. How can I get the value of my defined metadata? 2010/10/6 Yavuz Selim YILMAZ yvzslmyilm...@gmail.com I solved the problem. -- Yavuz Selim YILMAZ 2010/10/6 Yavuz Selim YILMAZ

Can't find org.gora.sql.store.SqlStore

2010-10-07 Thread Markus Jelsma
Hi, I've finally fetched the latest trunk and added Gora as described in NUTCH-873, but I'm getting the following exception: Exception in thread main java.lang.ClassNotFoundException: org.gora.sql.store.SqlStore. It can't find the class configured in storage.data.store.class. Is it perhaps the

Re: Ip filtering

2010-10-07 Thread Julien Nioche
Hi Jean-Francois, This is an interesting question. See my comment below. I'm looking for the best way to implement an IP-based check before fetching a URL. We have a large intranet and some URLs use dynamic subdomains (like http://2de3f7ac10.intranet/). I have a list of restricted IPs (either

Re: Exclude html-content from index

2010-10-07 Thread Israel
Hi Matthias, I don't have the answer to your question, but I wanted to ask how to integrate Solr with Nutch 1.2 and what benefits that brings.

Re: fetcher.store.content and fetcher.parse

2010-10-07 Thread Markus Jelsma
Storing content will take up about as much disk space as the content you are fetching. If you don't store, there is nothing to parse. On Thu, 7 Oct 2010 05:42:00 -0700 (PDT), webdev1977 webdev1...@gmail.com wrote: Could someone please clarify the relationship between these two properties? I

Re: Exclude html-content from index

2010-10-07 Thread Israel
Thanks Matthias, I regret not being able to help you with your problem. Regards

Re: Can't find org.gora.sql.store.SqlStore

2010-10-07 Thread Mattmann, Chris A (388J)
Hi Markus, Do you have Gora installed in your local Ivy repo? That should ensure that the class is found... Cheers, Chris On 10/7/10 3:31 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, I've finally fetched the latest trunk and added Gora as described in NUTCH-873, but I'm getting

Re: solrindex with a pseudo-cluster

2010-10-07 Thread Steve Cohen
Well, there are two tutorials that I found: http://thewiki4opentech.org/index.php/Nutch and http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ As for the benefits of Solr, I am not entirely sure. Solr is a search engine, but Nutch seems to have one of its own. You can either use

Re: Ip filtering

2010-10-07 Thread Jean-Francois Gingras
Hi Julien, Thank you for your quick response. See my comment below. Sent from my iPhone. On 2010-10-07 at 08:09, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Jean-Francois, This is an interesting question. See my comment below. I'm looking for the best way to implement an IP-based check

Re: Ip filtering

2010-10-07 Thread Markus Jelsma
I suppose you would create a URL filter. As I understand it, it filters URLs that are about to enter the CrawlDB (during UpdateDB) as well as those read from the CrawlDB (by the generator). The LinkDB just holds a list of anchors for URLs that are in the CrawlDB. Be sure to have a local DNS cache
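
The URL-filter idea in this reply could be sketched roughly as follows. This is only a standalone illustration, not a Nutch plugin: the class name IpCheckSketch, its constructor, and the in-memory host-to-IP map are hypothetical. The only part borrowed from Nutch is the URLFilter convention that filter(String) returns the URL to accept it and null to reject it. A real plugin would implement org.apache.nutch.net.URLFilter and would probably rely on a proper local DNS cache rather than this simple map.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical standalone sketch of an IP-based URL check (not part of Nutch).
final class IpCheckSketch {
    // IPs that must not be fetched, as literal address strings (assumed format).
    private final Set<String> restrictedIps;
    // Naive per-process host -> IP cache, standing in for a local DNS cache.
    private final Map<String, String> dnsCache = new ConcurrentHashMap<>();

    IpCheckSketch(Set<String> restrictedIps) {
        this.restrictedIps = restrictedIps;
    }

    // Returns the URL unchanged if it may be fetched, or null to reject it.
    String filter(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return null; // no host component, nothing to resolve
            }
            String ip = dnsCache.get(host);
            if (ip == null) {
                // Resolves the host; an IP literal is parsed without a DNS lookup.
                ip = InetAddress.getByName(host).getHostAddress();
                dnsCache.put(host, ip);
            }
            return restrictedIps.contains(ip) ? null : url;
        } catch (UnknownHostException | IllegalArgumentException e) {
            // Unresolvable host or malformed URL: reject rather than fetch blindly.
            return null;
        }
    }

    public static void main(String[] args) {
        IpCheckSketch sketch = new IpCheckSketch(Set.of("10.0.0.5"));
        System.out.println(sketch.filter("http://10.0.0.5/secret"));
        System.out.println(sketch.filter("http://127.0.0.1/index.html"));
    }
}
```

Rejecting unresolvable hosts is a design choice of this sketch; a crawler that prefers to retry later might treat that case differently.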

Re: fetcher.store.content and fetcher.parse

2010-10-07 Thread Markus Jelsma
On Thu, 7 Oct 2010 09:48:57 -0700 (PDT), webdev1977 webdev1...@gmail.com wrote: So how is it that one is able to crawl huge websites with the crawl script and not use the parse = false? You would have to have enormous amounts of disk space to run the parse later. You can run smaller batches