Re: Website (crawler for) indexing

2012-09-10 Thread Bernd Fehling
Some months ago I tested YaCy, and it works pretty well. http://yacy.net/en/ You can install it stand-alone and set up your own crawler (single or cluster). It has a very nice admin and control interface. After installation, disable the internal database and enable the feed to Solr, that's it.

Re: Website (crawler for) indexing

2012-09-07 Thread Dominique Bejean
Maybe you can take a look at Crawl-Anywhere, which has an administration web interface, a Solr indexer, and a search web application. www.crawl-anywhere.com Regards. Dominique On 05/09/12 17:05, Lochschmied, Alexander wrote: This may be a bit off topic: How do you index an existing website and

AW: Website (crawler for) indexing

2012-09-06 Thread Lochschmied, Alexander
Thanks Rafał and Markus for your comments. I think Droids has a serious problem with URL parameters in the current version (0.2.0) from Maven Central: https://issues.apache.org/jira/browse/DROIDS-144 I knew about Nutch, but I haven't been able to implement a crawler with it. Have you done that or

Re: AW: Website (crawler for) indexing

2012-09-06 Thread Rafał Kuć
Hello! I think that really depends on what you want to achieve and what parts of your current system you would like to reuse. If it is only HTML processing I would let Nutch and Solr do that. Of course you can extend Nutch (it has a plugin API) and implement the custom logic you need as a Nutch

RE: Website (crawler for) indexing

2012-09-06 Thread Markus Jelsma
-Original message- From: Lochschmied, Alexander alexander.lochschm...@vishay.com Sent: Thu 06-Sep-2012 16:04 To: solr-user@lucene.apache.org Subject: AW: Website (crawler for) indexing Thanks Rafał and Markus for your comments. I think Droids has a serious problem with URL

Website (crawler for) indexing

2012-09-05 Thread Lochschmied, Alexander
This may be a bit off topic: How do you index an existing website and control the data going into the index? We already have Java code to process the HTML (or XHTML) and turn it into a SolrJ document (removing tags and other things we do not want in the index). We use SolrJ for indexing. So I
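
For illustration only, here is a minimal Python sketch of the "process the HTML and control what goes into the index" step described above. The poster's actual code is Java with SolrJ and is not shown in the thread, so everything here is an assumption: the helper `html_to_doc`, the class `TextExtractor`, and the field names `id`, `title`, and `content` are hypothetical, and a real Solr schema would dictate the fields.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and visible text, skipping script/style content."""

    SKIPPED = {"script", "style"}  # elements whose content should not be indexed

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def _collapse(parts):
    """Join text fragments and collapse runs of whitespace."""
    return " ".join(" ".join(parts).split())


def html_to_doc(url, html):
    """Turn one HTML page into a flat dict of (hypothetical) index fields."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": url,
        "title": _collapse(parser.title_parts),
        "content": _collapse(parser.text_parts),
    }
```

Fragments are joined with a space, so text split by inline tags in the middle of a word would gain a space; a production extractor would need to handle that and probably more elements than `script` and `style`.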

RE: Website (crawler for) indexing

2012-09-05 Thread Markus Jelsma
Please take a look at the Apache Nutch project. http://nutch.apache.org/ -Original message- From: Lochschmied, Alexander alexander.lochschm...@vishay.com Sent: Wed 05-Sep-2012 17:09 To: solr-user@lucene.apache.org Subject: Website (crawler for) indexing This may be a bit off

Re: Website (crawler for) indexing

2012-09-05 Thread Rafał Kuć
Hello! You can implement your own crawler using Droids (http://incubator.apache.org/droids/) or use Apache Nutch (http://nutch.apache.org/), which is very easy to integrate with Solr and is a very powerful crawler. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
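
To make "implement your own crawler" a little more concrete, here is a rough breadth-first sketch in Python using only the standard library. It is not how Droids or Nutch work internally; the names `crawl`, `LinkExtractor`, `fetch`, and `max_pages` are hypothetical, and the page-fetching function is passed in so the loop itself stays independent of any HTTP client.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl limited to the host of start_url.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the page HTML as a string, or None if the page is unavailable.
    Returns a dict mapping each fetched URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links, keep query parameters, drop #fragments.
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

A real `fetch` would typically wrap an HTTP client, respect robots.txt (for example via `urllib.robotparser`), and rate-limit requests; those parts are left out of this sketch.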