Hello! You can implement your own crawler using Droids (http://incubator.apache.org/droids/) or use Apache Nutch (http://nutch.apache.org/), which is very easy to integrate with Solr and is a very powerful crawler.
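
Whichever crawler you pick, the SolrJ side stays the same: strip the markup down to plain text and put it into a SolrInputDocument. A minimal sketch of that step (the class, the regex-based stripping, and the "content" field name are all illustrative; your existing HTML-processing code is presumably more robust, and the SolrJ calls are shown as comments since they need a running Solr instance):

```java
import java.util.regex.Pattern;

public class HtmlToText {
    // Crude tag stripping for illustration only; a real pipeline should
    // use a proper HTML parser rather than a regex.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    public static String strip(String html) {
        // Replace tags with spaces, then collapse runs of whitespace.
        String text = TAGS.matcher(html).replaceAll(" ");
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some body text.</p></body></html>";
        System.out.println(strip(html));
        // With SolrJ, the stripped text would then be indexed roughly like:
        //   SolrInputDocument doc = new SolrInputDocument();
        //   doc.addField("id", url);               // hypothetical field names
        //   doc.addField("content", strip(html));
        //   server.add(doc);
        //   server.commit();
    }
}
```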
--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> This may be a bit off topic: How do you index an existing website
> and control the data going into the index?
> We already have Java code to process the HTML (or XHTML) and turn
> it into a SolrJ Document (removing tags and other things we do not
> want in the index). We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.
> We used to use wget on the command line in our publishing process, but we
> no longer want to do that.
> Thanks,
> Alexander