Hi everyone,
I'm using Nutch 1.2 to index an intranet site (with Solr as the indexer). I
would like to exclude certain parts of the HTML pages, such as the footer.
I found previous posts about this problem, but none with a clear
solution.
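One approach (a sketch on my part, not something the thread confirms) is to strip the unwanted markup before it reaches the indexer, which in Nutch would typically live in a custom HtmlParseFilter plugin. The core logic can be shown standalone; the element id `footer` and the class name are assumptions for illustration:

```java
// Sketch: remove a <div id="footer">...</div> block (including nested divs)
// from an HTML string before indexing. Hypothetical helper, not part of the
// Nutch API; in practice this logic would sit inside an HtmlParseFilter.
public class FooterStripper {

    public static String stripFooter(String html) {
        String marker = "<div id=\"footer\"";
        int start = html.indexOf(marker);
        if (start < 0) {
            return html; // no footer present, nothing to do
        }
        // Scan forward, balancing <div> against </div> so divs nested
        // inside the footer are removed along with it.
        int depth = 0;
        int i = start;
        while (i < html.length()) {
            int open = html.indexOf("<div", i);
            int close = html.indexOf("</div>", i);
            if (close < 0) {
                return html; // malformed HTML, leave it untouched
            }
            if (open >= 0 && open < close) {
                depth++;
                i = open + 4;
            } else {
                depth--;
                i = close + 6;
                if (depth == 0) {
                    return html.substring(0, start) + html.substring(i);
                }
            }
        }
        return html;
    }
}
```

A real page would more likely be handled with a proper HTML parser (Nutch's NekoHTML/TagSoup DOM), but the balanced-scan idea is the same: cut the footer subtree out of the content before it is indexed.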
Can anyone point me to some relevant
I tried to read my defined metadata with
parse.getData().getMeta(category),
but it returns null.
How can I get my defined metadata's value?
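The thread doesn't answer this, but the usual failure mode is that nothing ever stored a value under that key: `getMeta()` only returns what a parse filter put into the parse metadata at parse time, and the lookup is by exact key. A toy stand-in (NOT Nutch's Metadata class) to illustrate:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for parse metadata, illustrating why getMeta() returns null:
// a value must have been added under exactly that key during parsing,
// typically by a custom HtmlParseFilter.
public class MetaLookupSketch {

    private final Map<String, String> meta = new HashMap<>();

    public void addMeta(String name, String value) {
        meta.put(name, value);
    }

    public String getMeta(String name) {
        return meta.get(name); // null when nothing was stored under this key
    }
}
```

So if `getMeta(category)` is null, the things to check are that a parse filter actually writes that field into the ParseData metadata, and that the key string used at write time matches the one used at read time.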
2010/10/6 Yavuz Selim YILMAZ yvzslmyilm...@gmail.com
I solved the problem.
--
Yavuz Selim YILMAZ
2010/10/6 Yavuz Selim YILMAZ
Hi,
I've finally fetched the latest trunk and added Gora as described in
NUTCH-873, but I'm getting the following exception:
Exception in thread "main" java.lang.ClassNotFoundException:
org.gora.sql.store.SqlStore
It can't find the class configured in storage.data.store.class. Is it
perhaps the
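For reference, a nutch-site.xml fragment matching that setting (the class name is taken from the stack trace above). A ClassNotFoundException here usually means the gora-sql jar is simply missing from the job's classpath, independent of whether this property is set correctly:

```xml
<!-- nutch-site.xml: tell Gora which backing store to use.
     The named class must also be on the classpath (lib/ or the job jar). -->
<property>
  <name>storage.data.store.class</name>
  <value>org.gora.sql.store.SqlStore</value>
</property>
```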
Hi Jean-Francois
This is an interesting question. See comment below
I'm looking for the best way to implement an IP-based check before fetching a
URL. We have a large intranet and some URLs use dynamic subdomains (like
http://2de3f7ac10.intranet/). I have a list of restricted IPs (either
Hi Matthias, I don't have the answer to your question, but wanted to ask how
to integrate Solr with Nutch 1.2 and what benefits it brings.
Storing content will take up about as much disk space as the content
you are fetching. If you don't store, there is nothing to parse.
On Thu, 7 Oct 2010 05:42:00 -0700 (PDT), webdev1977
webdev1...@gmail.com wrote:
Could someone please clarify the relationship between these two
properties?
I
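The snippet cuts off before naming the two properties, but given the reply about storing and parsing, they are presumably `fetcher.store.content` and `fetcher.parse` (an assumption on my part, not stated in the thread). Under that assumption, a nutch-site.xml sketch of the fetch-now, parse-later setup:

```xml
<!-- Nutch 1.x, assuming these are the two properties in question:
     fetcher.store.content controls whether raw content is written to the
     segment; fetcher.parse controls whether the fetch pass also parses.
     Content that was stored but not parsed can be parsed in a later step. -->
<property>
  <name>fetcher.store.content</name>
  <value>true</value>
</property>
<property>
  <name>fetcher.parse</name>
  <value>false</value>
</property>
```

This matches the reply above: disk usage comes from storing content, and without stored content there is nothing left for a later parse step to work on.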
Thanks Matthias, I regret not being able to help you with your problem.
Regards
Hi Markus,
Do you have Gora installed to your local Ivy repo? That should ensure that the
class is found...
Cheers,
Chris
On 10/7/10 3:31 AM, Markus Jelsma markus.jel...@openindex.io wrote:
Hi,
I've finally fetched the latest trunk, added Gora as described in
NUTCH-873 but i'm getting
Well, there are two tutorials that I found.
http://thewiki4opentech.org/index.php/Nutch
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
As far as the benefits of Solr go, I am not entirely sure. Solr is a search
engine, but Nutch seems to have one of its own. You can either use
Hi Julien,
Thank you for your quick response. See my comment below.
Sent from my iPhone
On 2010-10-07, at 08:09, Julien Nioche lists.digitalpeb...@gmail.com
wrote:
Hi Jean-Francois
This is an interesting question. See comment below
I'm looking for the best way to implement an IP-based check
I suppose you would create a URL filter. As I understand it, it filters
URLs that are about to enter the CrawlDB (during UpdateDB) as well as
those read from the CrawlDB (by the generator). The LinkDB just holds a list
of anchors for URLs that are in the CrawlDB.
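A URL filter along those lines could resolve the host and reject restricted addresses. A standalone sketch (the null-means-rejected convention mirrors Nutch's URLFilter contract, but this class is hypothetical and not wired into the Nutch plugin system):

```java
import java.net.InetAddress;
import java.net.URL;
import java.util.Set;

// Sketch of an IP-based URL check in the spirit of a Nutch URLFilter:
// returning null rejects the URL, returning it unchanged accepts it.
public class IpUrlFilterSketch {

    private final Set<String> restrictedIps;

    public IpUrlFilterSketch(Set<String> restrictedIps) {
        this.restrictedIps = restrictedIps;
    }

    public String filter(String urlString) {
        try {
            URL url = new URL(urlString);
            // Resolve the (possibly dynamic) hostname to an address.
            String ip = InetAddress.getByName(url.getHost()).getHostAddress();
            return restrictedIps.contains(ip) ? null : urlString;
        } catch (Exception e) {
            return null; // unresolvable or malformed: safest to reject
        }
    }
}
```

A real version would want to match IP ranges rather than exact addresses, and to cache resolutions, since filtering runs over many URLs and each lookup otherwise hits DNS.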
Be sure to have a local DNS cache
On Thu, 7 Oct 2010 09:48:57 -0700 (PDT), webdev1977
webdev1...@gmail.com wrote:
So how is it that one is able to crawl huge websites with the crawl script
and not use parse = false? You would have to have enormous amounts of
disk space to run the parse later.
You can run smaller batches