Re: Website (crawler for) indexing

2012-09-10 Thread Bernd Fehling

Some months ago I tested YaCy; it works pretty well.
http://yacy.net/en/

You can install it stand-alone and set up your own crawler (single node or 
cluster).
It has a very nice admin and control interface.
After installation, disable the internal database and enable the feed to Solr; 
that's it.
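
From memory, the Solr hand-off is a matter of a few clicks in the admin UI. A 
rough walkthrough, assuming a YaCy 1.x install of that era on the default port 
(the exact page name may differ in your version):

  1. Open the admin interface at http://localhost:8090
  2. Go to the index sources & targets page (IndexFederated_p.html)
  3. Uncheck the embedded (internal) index
  4. Check the remote Solr target and enter your Solr URL, e.g.
     http://localhost:8983/solr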

Regards,
Bernd


On 05.09.2012 17:05, Lochschmied, Alexander wrote:
> This may be a bit off topic: How do you index an existing website and control 
> the data going into the index?
> 
> We already have Java code to process the HTML (or XHTML) and turn it into a 
> SolrJ Document (removing tags and other things we do not want in the index). 
> We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.
> 
> We used to use wget on the command line in our publishing process, but we no 
> longer want to do that.
> 
> Thanks,
> Alexander
> 
> 



Re: Website (crawler for) indexing

2012-09-07 Thread Dominique Bejean
Maybe you can take a look at Crawl-Anywhere, which has an administration 
web interface, a Solr indexer, and a search web application.


www.crawl-anywhere.com

Regards.

Dominique

On 05/09/12 17:05, Lochschmied, Alexander wrote:

This may be a bit off topic: How do you index an existing website and control 
the data going into the index?

We already have Java code to process the HTML (or XHTML) and turn it into a 
SolrJ Document (removing tags and other things we do not want in the index). We 
use SolrJ for indexing.
So I guess the question is essentially which Java crawler could be useful.

We used to use wget on the command line in our publishing process, but we no 
longer want to do that.

Thanks,
Alexander






RE: Website (crawler for) indexing

2012-09-06 Thread Markus Jelsma

-Original message-
> From: Lochschmied, Alexander 
> Sent: Thu 06-Sep-2012 16:04
> To: solr-user@lucene.apache.org
> Subject: AW: Website (crawler for) indexing
> 
> Thanks Rafał and Markus for your comments.
> 
> I think Droids has a serious problem with URL parameters in the current 
> version (0.2.0) from Maven Central:
> https://issues.apache.org/jira/browse/DROIDS-144
> 
> I knew about Nutch, but I haven't been able to implement a crawler with it. 
> Have you done that or seen an example application?

We've been using it for some years now for our site-search customers and are 
happy with it, but it can be quite a beast to begin with. The Nutch tutorial 
will walk you through the first steps: crawling and indexing to Solr.

> It's probably easy to call a Nutch jar and make it index a website, and maybe 
> I will have to do that.
> But as we already have a Java implementation to index other sources, it would 
> be nice if we could integrate the crawling part too.

You can control Nutch from within another application, but I'd not recommend 
it: it's batch-based and can take quite some time and resources to run. We 
usually prefer a custom shell script that controls the process and is called 
via cron.
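
For illustration, such a script is basically the Nutch 1.x crawl cycle run end 
to end. A rough sketch (install path, seed directory, -topN and the exact 
solrindex arguments are assumptions based on the Nutch 1.4/1.5-era commands, so 
check them against your version; running parse as a separate step also assumes 
fetcher.parse=false):

#!/bin/bash
# Hypothetical cron-driven crawl cycle for Nutch 1.x.
NUTCH=/opt/nutch/runtime/local/bin/nutch   # assumed install location
CRAWL=/data/crawl                          # holds crawldb, linkdb, segments
SOLR=http://localhost:8983/solr/

$NUTCH inject $CRAWL/crawldb /data/seeds                 # seed URL list
$NUTCH generate $CRAWL/crawldb $CRAWL/segments -topN 1000
SEGMENT=$CRAWL/segments/`ls $CRAWL/segments | tail -1`   # newest segment
$NUTCH fetch $SEGMENT
$NUTCH parse $SEGMENT
$NUTCH updatedb $CRAWL/crawldb $SEGMENT
$NUTCH invertlinks $CRAWL/linkdb -dir $CRAWL/segments
$NUTCH solrindex $SOLR $CRAWL/crawldb -linkdb $CRAWL/linkdb $SEGMENT

A crontab entry like "0 2 * * * /opt/nutch/crawl.sh" then runs it nightly.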

> 
> Regards,
> Alexander 
> 
> 
> 
> Hello!
> 
> You can implement your own crawler using Droids
> (http://incubator.apache.org/droids/) or use Apache Nutch 
> (http://nutch.apache.org/), which is very easy to integrate with Solr and is 
> a very powerful crawler.
> 
> --
> Regards,
>  Rafał Kuć
>  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
> 
> > This may be a bit off topic: How do you index an existing website and 
> > control the data going into the index?
> 
> > We already have Java code to process the HTML (or XHTML) and turn it 
> > into a SolrJ Document (removing tags and other things we do not want 
> > in the index). We use SolrJ for indexing.
> > So I guess the question is essentially which Java crawler could be useful.
> 
> > We used to use wget on the command line in our publishing process, but we no 
> > longer want to do that.
> 
> > Thanks,
> > Alexander
> 
> 


Re: Website (crawler for) indexing

2012-09-05 Thread Rafał Kuć
Hello!

You can implement your own crawler using Droids
(http://incubator.apache.org/droids/) or use Apache Nutch
(http://nutch.apache.org/), which is very easy to integrate with Solr
and is a very powerful crawler.
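
If you do end up rolling your own in plain Java, a very naive single-threaded 
sketch of the fetch/extract/index loop might look like the following. This is 
neither Droids nor Nutch: the tag-stripping one-liner stands in for whatever 
HTML processing you already have, it assumes SolrJ 3.6+ with its 
HttpSolrServer, and it skips robots.txt, politeness delays and relative-link 
resolution entirely.

import java.io.InputStream;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Scanner;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class NaiveCrawler {
    // Only follows absolute http links; a real crawler must do better.
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        Queue<String> frontier = new ArrayDeque<String>();
        Set<String> seen = new HashSet<String>();
        frontier.add("http://www.example.com/");   // hypothetical seed URL

        while (!frontier.isEmpty() && seen.size() < 100) {
            String url = frontier.poll();
            if (!seen.add(url)) continue;          // skip already-visited URLs

            String html = fetch(url);

            // Placeholder for your existing HTML-to-SolrInputDocument code.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", url);
            doc.addField("content", html.replaceAll("<[^>]+>", " "));
            solr.add(doc);

            // Enqueue outgoing links for the next iterations.
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                frontier.add(m.group(1));
            }
        }
        solr.commit();
    }

    private static String fetch(String url) throws Exception {
        InputStream in = new URL(url).openStream();
        try {
            // Scanner with the \A delimiter slurps the whole stream.
            Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A");
            return s.hasNext() ? s.next() : "";
        } finally {
            in.close();
        }
    }
}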

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> This may be a bit off topic: How do you index an existing website
> and control the data going into the index?

> We already have Java code to process the HTML (or XHTML) and turn
> it into a SolrJ Document (removing tags and other things we do not
> want in the index). We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.

> We used to use wget on the command line in our publishing process, but we no 
> longer want to do that.

> Thanks,
> Alexander



RE: Website (crawler for) indexing

2012-09-05 Thread Markus Jelsma
Please take a look at the Apache Nutch project.  
http://nutch.apache.org/
 
-Original message-
> From: Lochschmied, Alexander 
> Sent: Wed 05-Sep-2012 17:09
> To: solr-user@lucene.apache.org
> Subject: Website (crawler for) indexing
> 
> This may be a bit off topic: How do you index an existing website and control 
> the data going into the index?
> 
> We already have Java code to process the HTML (or XHTML) and turn it into a 
> SolrJ Document (removing tags and other things we do not want in the index). 
> We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.
> 
> We used to use wget on the command line in our publishing process, but we no 
> longer want to do that.
> 
> Thanks,
> Alexander
> 
>