I have a similar requirement: I need to move to a good crawler.
Currently I am using a custom crawler that I wrote myself to crawl some
public domains, and I use Lucene to index all the downloaded pages. After doing
a lot of research, I came across JSpider with Lucene.
I was also looking at Nutch for the crawling job, but I don't think that is
feasible.
- BR
"A. Banji Oyebisi" <[EMAIL PROTECTED]> wrote:
I am interested in this too. Any ideas?
A. Banji Oyebisi Choicegen, LLC. Email: [EMAIL PROTECTED] Web URL:
http://www.choicegen.com Choicegen... Helping you make better choices!
George Everitt wrote:
I'm looking for a web crawler to use with Solr. The
objective is to crawl about a dozen public web sites regarding a specific
topic.
After a lot of googling, I came across Heritrix, which seems to be the most
robust, well-supported open source crawler out there. Heritrix has an
integration with Nutch (NutchWAX), but not with Solr. I'm wondering if
anybody can share any experience using Heritrix with Solr.
It seems that there are three options for integration:
1. Write a custom Heritrix "Writer" class which submits documents to Solr for
indexing.
2. Write an ARC-to-Solr input XML format converter to import the ARC files.
3. Use the filesystem mirror writer and then another program to walk the
downloaded files.
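
For what it's worth, option 3 could be sketched roughly as below. This is only a minimal illustration in Python, not a tested integration: the field names "id" and "content", the update URL, and the function names build_add_xml and post_to_solr are all assumptions made for the example and would have to match your own Solr schema and install. It walks a mirror directory and builds one Solr <add> update message from the files it finds.

```python
import os
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(mirror_root):
    """Walk mirror_root and return a Solr <add> update message as bytes."""
    add = ET.Element("add")
    for dirpath, _dirnames, filenames in os.walk(mirror_root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Read leniently: crawled pages are not guaranteed to be UTF-8.
            with open(path, "r", encoding="utf-8", errors="replace") as fh:
                text = fh.read()
            # Use the path relative to the mirror root as the document key.
            rel = os.path.relpath(path, mirror_root).replace(os.sep, "/")
            doc = ET.SubElement(add, "doc")
            # "id" and "content" are assumed field names, not a given schema.
            ET.SubElement(doc, "field", name="id").text = rel
            ET.SubElement(doc, "field", name="content").text = text
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(xml_bytes, update_url="http://localhost:8983/solr/update"):
    """POST an update message; update_url is an assumed default location."""
    req = urllib.request.Request(
        update_url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A run would then look something like post_to_solr(build_add_xml("/path/to/mirror")), followed by posting b"<commit/>" to the same URL so the documents become searchable. Real pages would still need their HTML markup stripped and any characters that are illegal in XML filtered out before indexing.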
Has anybody looked into this, or does anybody have suggestions for an
alternative approach? The optimal answer would be "You dummy, just use XXX to crawl your
web sites - there's no 'integration' required at all. Can you believe the
temerity? What a poltroon."
Yours in Revolution,
George