Solr for Whole Web Search

2008-10-22 Thread John Martyniak


I am very new to Solr, but I have played with Nutch and Lucene.

Has anybody used Solr for a whole web indexing application?

Which Spider did you use?

How does it compare to Nutch?

Thanks in advance for all of the info.

-John

Re: Solr for Whole Web Search

2008-10-22 Thread Grant Ingersoll



On Oct 22, 2008, at 7:57 AM, John Martyniak wrote:

> I am very new to Solr, but I have played with Nutch and Lucene.
>
> Has anybody used Solr for a whole web indexing application?
>
> Which Spider did you use?
>
> How does it compare to Nutch?


There is a patch that combines Nutch + Solr.  Nutch is used for  
crawling, Solr for searching.  Can't say I've used it for whole web  
searching, but I believe some are trying it.


At the end of the day, I'm sure Solr could do it, but it will take
some work to set up the architecture (distributed, replicated) and deal
properly with fault tolerance and failover. There are also some
examples in Hadoop of Hadoop + Lucene integration.
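
As a rough sketch of the "distributed" part only: Solr's distributed search lets one node fan a query out to several index shards through the `shards` request parameter. The host names, the `build_distributed_query` helper, and the two-shard layout below are assumptions for illustration; replication and failover are not covered here.

```python
from urllib.parse import urlencode


def build_distributed_query(coordinator, shards, query, rows=10):
    """Return a Solr select URL that fans the query out to several shards."""
    params = {
        "q": query,
        "rows": rows,
        # "shards" is a comma-separated list of host:port/path entries,
        # written without the "http://" prefix.
        "shards": ",".join(shards),
    }
    return f"{coordinator}/select?{urlencode(params)}"


# Hypothetical hosts, for illustration only.
SHARDS = ["solr1.example.com:8983/solr", "solr2.example.com:8983/solr"]

url = build_distributed_query(
    "http://solr1.example.com:8983/solr", SHARDS, "nutch crawler"
)
print(url)
```

Any node can act as the coordinator for a request like this; which node receives the query and how the index is split across shards is left to the deployment.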


How big are you talking?







--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Solr for Whole Web Search

2008-10-22 Thread John Martyniak


Grant, thanks for the response.

A couple of other people have recommended trying the Nutch + Solr
approach, but I am not sure what the real benefit of doing that is,
since Nutch provides most of the same features as Solr, although Solr
does have some nice additional features (like spell checking and
incremental indexing).


I currently have a Nutch index of around 500,000+ URLs, but expect
it to get much bigger. I am generally pretty happy with it, but I
want to make sure that I am going down the correct path for the
best feature set. As far as the front end is concerned, I have been
using the Nutch search app as basically a web service that feeds the
main app (using RSS). The main app takes that output and manipulates
the results for display.
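
For illustration, the "search app as a web service" setup described above can be sketched as follows. The RSS 2.0 layout, the sample items, and the `parse_results` helper are assumptions; the actual feed produced by the search app may carry different or additional elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical search response, inlined so the sketch runs without a server.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Search results: solr</title>
    <item>
      <title>Apache Solr</title>
      <link>http://lucene.apache.org/solr/</link>
      <description>Solr is a search server built on Lucene.</description>
    </item>
    <item>
      <title>Apache Nutch</title>
      <link>http://lucene.apache.org/nutch/</link>
      <description>Nutch is a web crawler and search engine.</description>
    </item>
  </channel>
</rss>"""


def parse_results(rss_text):
    """Turn each <item> of the feed into a plain dict for the main app."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        results.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return results


results = parse_results(SAMPLE_RSS)
for hit in results:
    print(hit["title"], "->", hit["link"])
```

The main app can then reorder, filter, or decorate the dicts before display without depending on the search back end's own result page.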


As for the Hadoop + Lucene integration, I haven't used that
directly, just the Hadoop integration with Nutch, and of course Hadoop
on its own.


-John



If that is the case, you should look at the DataImportHandler examples,
as they can already index RSS; I'm doing it now for about a dozen feeds on
an hourly basis. (This also works for any XML-based feed: XHTML, XML,
etc.) I find Nutch more useful for plain vanilla HTML (something that
was built as static pages), since for dynamic pages you can instead bring
in the DB content that you would have put on the page to begin with. Nutch
is also useful for other types of documents, I think (PDF), and anything
that Tika (http://incubator.apache.org/tika/) can extract.
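
To make the feed-to-index step concrete, here is a minimal sketch of the same kind of mapping done by hand. It is not DataImportHandler itself, which is configured in XML inside Solr. It only shows RSS items being turned into a Solr XML `<add>` message. The field names (`id`, `title`, `description`), the sample feed, and the `feed_to_solr_add` helper are assumptions and would have to match the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-item feed, inlined so the sketch runs offline.
FEED = """<rss version="2.0"><channel>
  <item>
    <title>Release notes</title>
    <link>http://example.com/posts/1</link>
    <description>What changed in this release.</description>
  </item>
</channel></rss>"""

# Assumed mapping from Solr field name to RSS element name.
FIELD_MAP = (("id", "link"), ("title", "title"), ("description", "description"))


def feed_to_solr_add(rss_text):
    """Build a Solr XML <add> message with one <doc> per feed item."""
    add = ET.Element("add")
    for item in ET.fromstring(rss_text).iter("item"):
        doc = ET.SubElement(add, "doc")
        for solr_field, rss_tag in FIELD_MAP:
            field = ET.SubElement(doc, "field", name=solr_field)
            field.text = item.findtext(rss_tag, default="")
    return ET.tostring(add, encoding="unicode")


add_xml = feed_to_solr_add(FEED)
print(add_xml)
```

The resulting string is what would be posted to Solr's update handler, followed by a commit; running it on a schedule gives the hourly refresh described above.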

- Jon

