Please take a look at the Apache Nutch project.  
http://nutch.apache.org/
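
If you would rather keep everything inside your own Java code, the core of a small site crawler is short. The following is only a rough sketch, not how Nutch works: the class name `SiteCrawlSketch`, the `crawl` method, and the `fetch` function are hypothetical names made up for illustration, and link extraction uses a naive regex where a real crawler would use an HTML parser. It walks a site breadth-first, stays on the start URL's host, and returns the HTML of each page so that your existing HTML-to-SolrJ code can decide what goes into the index.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SiteCrawlSketch {
    // Naive href matcher (assumption: good enough for a sketch; skips fragments).
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /**
     * Breadth-first crawl restricted to the host of the start URL.
     * fetch is a hypothetical callback returning the HTML for a URL,
     * or null if the page cannot be retrieved.
     */
    public static Map<String, String> crawl(String start,
                                            Function<String, String> fetch,
                                            int maxPages) {
        String host = URI.create(start).getHost();
        Map<String, String> pages = new LinkedHashMap<>(); // URL -> HTML, in visit order
        Set<String> seen = new LinkedHashSet<>();          // URLs already queued
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);

        while (!queue.isEmpty() && pages.size() < maxPages) {
            String url = queue.poll();
            String html = fetch.apply(url);
            if (html == null) {
                continue; // unreachable page: skip it
            }
            pages.put(url, html);

            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link;
                try {
                    // Resolve relative links against the current page URL.
                    link = URI.create(url).resolve(m.group(1).trim()).toString();
                } catch (IllegalArgumentException e) {
                    continue; // malformed link: ignore it
                }
                // Follow only links on the same host, and each URL only once.
                if (host != null
                        && host.equals(URI.create(link).getHost())
                        && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return pages;
    }
}
```

In practice `fetch` would perform an HTTP GET, and each returned page would be passed to the code that builds the SolrJ document. Politeness delays, robots.txt handling, and retries are left out here, and those are the kind of things Nutch takes care of.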
 
-----Original message-----
> From:Lochschmied, Alexander <alexander.lochschm...@vishay.com>
> Sent: Wed 05-Sep-2012 17:09
> To: solr-user@lucene.apache.org
> Subject: Website (crawler for) indexing
> 
> This may be a bit off topic: How do you index an existing website and control 
> the data going into the index?
> 
> We already have Java code to process the HTML (or XHTML) and turn it into a 
> SolrJ Document (removing tags and other things we do not want in the index). 
> We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.
> 
> We used to use wget on the command line in our publishing process, but we no 
> longer want to do that.
> 
> Thanks,
> Alexander
> 
> 
