subject:"Solr and Nutch\/Droids \- to use or not to use\?"

Re: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-17 Thread MitchK


Otis,

And again I wish I were registered.

I will check JIRA, and once I feel comfortable with it, I will open the issue.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p904145.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-17 Thread Otis Gospodnetic

I didn't open the issue, Mitch, but feel free to do it.

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-17 Thread MitchK


Otis,

you are right, I wasn't aware of this. At least not with such a large
data list (think of an index with 4 million docs: that would mean an
external file with 4 million records). But from what I've read at
search-lucene.com, it seems to perform very well. Thanks for the idea!

Btw: Otis, did you open a JIRA issue for distributed indexing in
Solr?
I would like to follow the issue if it is open.

Regards
- Mitch


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p903148.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-17 Thread Otis Gospodnetic

Mitch,

Yes, one day.  But it sounds like you are not aware of ExternalFileField, which 
you can use today:

http://search-lucene.com/?q=ExternalFileField&fc_project=Solr
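
For readers who have not used ExternalFileField: as far as I understand it, the field's values live in a plain text file of `key=value` lines kept next to the index, so they can be replaced without reindexing any document. Below is a minimal Python sketch of producing such a file. The field name `opic_score`, the document IDs, and the output directory are assumptions made up for illustration; none of them come from this thread.

```python
import os
import tempfile

def write_external_file(data_dir, field_name, scores):
    """Write one `key=value` line per document for an ExternalFileField.

    `scores` maps each document's unique-key value to a float.
    The file is named external_<field_name>, which is the naming
    convention I understand Solr to expect in the index data directory.
    """
    path = os.path.join(data_dir, "external_" + field_name)
    with open(path, "w", encoding="utf-8") as f:
        # Sorting by key keeps the file stable and easy to diff.
        for doc_id in sorted(scores):
            f.write("%s=%s\n" % (doc_id, float(scores[doc_id])))
    return path

# Hypothetical OPIC scores keyed by document ID.
example_scores = {"doc2": 0.25, "doc1": 1.5}
example_path = write_external_file(tempfile.mkdtemp(), "opic_score", example_scores)
```

As far as I know, Solr picks up a changed file only when a new searcher is opened, so a commit is still needed before updated values take effect.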

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-17 Thread MitchK




> Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC
> score computed by Nutch into a Solr field and use it during scoring, if
> you want, say with a function query. 
> 
Oh! Yes, that makes more sense than using the OPIC score as a document boost. :-)
Somewhere on the Lucene mailing lists I read that in the future it will be
possible to change a field's contents without reindexing the whole document.
If one stores the OPIC score (which is independent of the page's content)
in a field and uses a function query to influence the score of a document, one
saves the effort of reindexing the whole doc if the content did not change.

Regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p902158.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread Otis Gospodnetic

Mitch,

If you use Nutch+Solr then you wouldn't *index* the fetched content with Nutch.
Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC 
score computed by Nutch into a Solr field and use it during scoring, if you 
want, say with a function query.
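
To make the function query idea concrete, here is a small Python sketch that builds a request URL using the DisMax `bf` (boost function) parameter. It assumes the OPIC score has been indexed into a numeric field called `opic_score` with values greater than zero, and that the text fields are named `title` and `content`; those names, the host, and the choice of `log` as the dampening function are illustrative assumptions only.

```python
from urllib.parse import urlencode

def build_boosted_query(base_url, user_query, score_field="opic_score"):
    """Return a select URL that adds log(score_field) to the DisMax score."""
    params = [
        ("q", user_query),
        ("defType", "dismax"),             # use the DisMax query parser
        ("qf", "title content"),           # hypothetical text fields to search
        ("bf", "log(%s)" % score_field),   # additive boost from the stored score
    ]
    return base_url + "/select?" + urlencode(params)

example_url = build_boosted_query("http://localhost:8983/solr", "apache nutch")
```

Because the boost is applied at query time, the function can be changed or removed without touching the index, which makes it easy to experiment with.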

Yes, ES has built-in support for sharding and replication.  It also makes it 
easy to implement custom scoring, which may work for OPIC here.


Yes, ask questions here. :)

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




RE: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread MitchK

Good morning!

Great feedback from you all. This really helped a lot to get an impression
of what is possible and what is not.

I have some detailed questions.

Let's assume Solr were able to handle distributed indexing on its own,
so that the client does not need to know anything about shards etc.

What interests me is:
I.
The scoring - Nutch uses special scoring implementations like the
OPIC algorithm. Can Solr use such improvements, or do I need to reimplement
them for Solr?

II.
The indexing.
At the moment it really sounds like Nutch would index everything and
afterwards Solr would do the job again.
Regarding indexing, it would make sense if Nutch computed things like the
document boost (I am not sure, but I think the results of the OPIC algorithm
are added to each document as a boost) and then sent an indexing request to
Solr.
However, if Nutch indexes the page's content and Solr does it, too, I would
waste some time, no?
Is this the case, or did I misunderstand something here?

III.
I am no Java expert.
However, in a few months I will start to study computer science at a
university. Maybe I will find some literature to learn more about
distributed software and how hashing needs to work to make
distributed indexing work.
Maybe then I can help to implement this feature in Solr.
On the other hand, not much is known about Solr's distributed search concept
and which classes are responsible for it - but such things one could ask
on the mailing list, no?

As far as I know, Elastic Search already supports distributed indexing.
Maybe one can reuse the responsible implementation for Solr.

Btw:
I think a great benefit of using Solr + Nutch would be to extend the search.
I could create several Solr cores for different kinds of search - one for
picture-search, one for video-search etc. *and* with the help of Nutch I can
index some of the needed content in special directories. So Solr does not
need to care about indexing a picture - Nutch already does the job.

Kind regards,
- Mitch
--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p901943.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread Markus Jelsma

You're right. Currently clients need to take care of this. In this case, Nutch 
would be the client, but it cannot be configured as such. It would, indeed, be 
more appropriate for Solr to take care of this. We can already query any server 
with a set of shard hosts specified, so it would make sense if Solr also 
supported some kind of consistent hashing and shard management configuration.
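
For anyone unfamiliar with consistent hashing, the following Python sketch shows the general technique. It is not how Solr or Nutch work today. The shard URLs, the use of MD5, and the number of virtual nodes per shard are all assumptions chosen for illustration.

```python
import bisect
import hashlib

def _hash(value):
    """Map a string to a large integer using MD5 (stable across runs)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """A basic consistent-hash ring with virtual nodes per shard."""

    def __init__(self, shards, replicas=100):
        # Each shard contributes `replicas` points to the ring.
        self._ring = sorted(
            (_hash("%s#%d" % (shard, i)), shard)
            for shard in shards
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, doc_id):
        """Return the shard owning the first ring point at or after the doc's hash."""
        index = bisect.bisect_left(self._points, _hash(doc_id))
        # Wrap around to the first point when the hash is past the last one.
        return self._ring[index % len(self._ring)][1]

# Hypothetical shard URLs.
shards = ["http://solr1:8983/solr", "http://solr2:8983/solr", "http://solr3:8983/solr"]
ring = HashRing(shards)
bigger_ring = HashRing(shards + ["http://solr4:8983/solr"])

# Compare the placement of some document IDs before and after adding a shard.
doc_ids = ["http://example.org/page%d" % n for n in range(1000)]
moved = [d for d in doc_ids if ring.shard_for(d) != bigger_ring.shard_for(d)]
```

The point of the ring is that adding a shard only reassigns documents to the new shard; everything else stays where it was. A real implementation would still have to delete the moved documents from their old shards.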

 

With CouchDB-Lounge we can easily create a shard map that supports redundant 
shards on different servers for fail-over. It would be marvelous if Solr would 
support it as well.
 

Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread Otis Gospodnetic

Well, it's not that Nutch doesn't support it.  Solr itself doesn't support it.  
Indexing applications need to know which shard they want to send documents to.  
This may be a good case for a new wish issue in Solr JIRA?
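
Until something like that exists, the indexing application has to pick the shard itself. A minimal Python sketch of the simplest approach, hashing the document ID modulo the number of shards, is shown here. The shard URLs and the `id` key are hypothetical, and CRC32 is just one convenient stable hash.

```python
import zlib
from collections import defaultdict

def route_documents(docs, shard_urls, key="id"):
    """Group documents by target shard using CRC32(id) modulo shard count."""
    batches = defaultdict(list)
    for doc in docs:
        checksum = zlib.crc32(doc[key].encode("utf-8"))
        batches[shard_urls[checksum % len(shard_urls)]].append(doc)
    return dict(batches)

# Hypothetical shards and documents.
shard_urls = ["http://solr1:8983/solr", "http://solr2:8983/solr"]
docs = [
    {"id": "http://example.org/a"},
    {"id": "http://example.org/b"},
    {"id": "http://example.org/c"},
]
batches = route_documents(docs, shard_urls)
```

Each batch would then be posted to its shard's update handler. The weakness of plain modulo routing is that changing the number of shards reassigns most documents, so it suits a fixed shard count best.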

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




RE: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread Markus Jelsma

Nutch does not, at this moment, support any form of consistent hashing to 
select an appropriate shard. It would be nice if someone could file an issue in 
Nutch's JIRA to add sharding support, perhaps someone with a better 
understanding of and more experience with Solr's distributed search than I have at 
the moment. I can't point Nutch's developers to the right piece of documentation 
on this one ;)
 

Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread Otis Gospodnetic

Hi Mitch,

Solr can do distributed search, so it can definitely handle indices that can't 
fit on a single server without sharding.  What I think *might* be the case is that 
the Nutch indexer that sends docs to Solr is not capable of sending 
documents to multiple Solr cores/shards.  If that is the case, I think you need 
to move this to the Nutch user/dev list and see how to feed multiple Solr 
indices/cores/shards with Nutch data.
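
On the query side, distributed search is driven by the `shards` request parameter. The Python sketch below builds such a request; the host names are placeholders, and any one of the shards can act as the coordinator that receives the request.

```python
from urllib.parse import urlencode

def build_distributed_query(coordinator_url, shard_hosts, user_query):
    """Return a select URL that fans the query out to every listed shard."""
    params = [
        ("q", user_query),
        # `shards` takes comma-separated host:port/path entries without a scheme.
        ("shards", ",".join(shard_hosts)),
    ]
    return coordinator_url + "/select?" + urlencode(params)

example_url = build_distributed_query(
    "http://solr1:8983/solr",
    ["solr1:8983/solr", "solr2:8983/solr"],
    "nutch crawler",
)
```

This only covers searching. Getting the documents into the right shard in the first place is still up to whatever does the indexing.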

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread MitchK


Thanks, that really helps to find the right beginning for such a journey. :-)



> * Use Solr, not Nutch's search webapp 
> 
As far as I have read, Solr can't scale if the index gets too large for one
server.



> The setup explained here has one significant caveat you also need to keep
> in mind: scale. You cannot use this kind of setup with vertical scale
> (collection size) that goes beyond one Solr box. The horizontal scaling
> (query throughput) is still possible with the standard Solr replication
> tools.
> 
...from Lucidimagination.com

Is this still the case?
Furthermore, as far as I understand this blog post
(Lucidimagination.com: Nutch and Solr,
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/), they index
everything with Nutch and reindex it in Solr - sounds like a lot of redundant work.

Lucid, Sematext and the Nutch wiki are the only sources of information where I
can find discussions of Nutch and Solr, but no one seems to talk about these
facts - except this one blog post.

If you say this is wrong or contingent on the shown setup, can you tell me
how to avoid these problems?

A lot of questions, but it's such an exciting topic...

Hopefully you can answer some of them.

Again, thank you for the feedback, Otis.

- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900604.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread Otis Gospodnetic

Mitch,

I think you really have 2 distinct questions there:

One question is Nutch vs. Droids.
The other one is Solr vs. Nutch for search.

My suggestions:
* Use Nutch, not Droids, if scaling is important
* Use Solr, not Nutch's search webapp

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread MitchK


Thank you for the feedback, Otis.
Yes, I thought that such an approach is useful if the number of pages to
crawl is relatively low.

However, what about using Solr + Nutch?
Does the problem still exist that this would not scale if the index becomes
too large?

What about extending Nutch with features such as the DisMaxRequestHandler:
is the amount of work larger than it would be in Solr?

The big pro of Solr is that I can enhance the whole thing in a few minutes
if I need extra information to improve the search.
That makes it very easy to experiment with boosts, filters etc.
As far as I know, Nutch does not offer such great features.
Do you know a little bit more about that?
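
To illustrate the kind of quick experiment meant here, the following is a minimal sketch of building a dismax request with per-field boosts. The parameters defType, q and qf are standard Solr request parameters; the host, the field names "title" and "content", and the boost values are assumptions made up for this example.

```python
from urllib.parse import urlencode


def build_dismax_url(base_url, user_query, field_boosts):
    """Return a Solr select URL that runs user_query through the dismax parser.

    field_boosts maps a field name to its boost, e.g. {"title": 2.0}.
    """
    # qf lists the fields to search, each with an optional ^boost suffix.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = [
        ("defType", "dismax"),
        ("q", user_query),
        ("qf", qf),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and fields, chosen only for illustration.
url = build_dismax_url(
    "http://localhost:8983/solr",
    "christopher lee",
    {"title": 2.0, "content": 1.0},
)
```

Trying a different weighting then only means changing the boost values and sending the request again; nothing has to be reindexed.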

I should probably ask such questions on the Nutch mailing list, but at the
moment I hope that I can achieve as much as I can with Solr, because I have
no experience with Hadoop, and Nutch seems to require it.

Thank you!
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900480.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread Otis Gospodnetic

My quick feedback would be:
Try using Nutch first, because it is a more complete "platform".  From what I 
know, Droids is just the crawler with an in-memory queue + link extractor.  We 
did use it for crawling Lucene project sites (for the index on 
http://search-lucene.com/ ), but that is because the data volume is low, the 
crawl very narrow, scaling requirements low, etc.

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread MitchK

Hello community,

from several discussions about Solr and Nutch, I have some questions about a
hypothetical web search engine.
I know I posted this message to the mailing list a few days ago, but the
thread got hijacked and I did not get any more replies on the
topic, so I am trying to reopen it; hopefully no one gets upset here :-).
Please bear with me. Thank you.

The requirements:
I. I need a scalable solution for a growing index that becomes larger than
one machine can handle. If I add more hardware, I want performance to
improve linearly.

II. I want to use technologies like the OPIC algorithm (the default algorithm
in Nutch) or PageRank or... whatever is out there to improve the ranking of
the web pages.

III. I want to be able to easily add more fields to my documents. Imagine
one retrieves information from a web page's content; then I want to make it
searchable.

IV. While fetching my data, I want to make special searches possible. For
example, I want to retrieve pictures from a web page and index
picture-related content into another search index, plus I want to save a
small thumbnail of the picture itself. Btw: this is (as far as I know) not
possible with Solr, because Solr was not intended to do such special
indexing logic.

V. I want to use filter queries (e.g. main query "christopher lee" returns
1.5 million results, subquery "action" -> the main query would become a
filter query and "action" would be the actual query, so a search within
search results would be easy to offer).

VI. I want to be able to use different logic for different pages. Maybe I
have a pool of 100 domains that I know better than others, and I have special
scripts that retrieve more specific information from those 100 domains. Then
I want to apply my special logic to those 100 domains, while every other
domain should use the default logic.
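
Requirement V can be sketched roughly as follows. Solr's fq parameter restricts the result set without affecting scoring, so every earlier query can be sent as an fq while only the newest query goes into q. The helper name search_within_results is made up for this example, and the sketch only builds the request parameters; it does not talk to a server.

```python
from urllib.parse import urlencode


def search_within_results(previous_queries, refinement):
    """Build Solr request parameters for a search within earlier results.

    Every earlier query becomes an fq (filter query) that narrows the
    result set; only the newest query is sent as q and gets scored.
    """
    params = [("q", refinement)]
    params.extend(("fq", earlier) for earlier in previous_queries)
    return params


# The first search was the phrase "christopher lee"; the user now refines
# those results with the query "action".
params = search_within_results(['"christopher lee"'], "action")
query_string = urlencode(params)
```

Because filter queries are cached separately from the main query, repeated refinements on the same base query can reuse the cached filter.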

The project is purely hypothetical. So why am I asking?
I want to learn more about web search and I would like to gain some new
experience.

What I know about Solr + Nutch:
According to lucidimagination.com, Solr + Nutch does not scale if the
index is too large.
The article is somewhat old, and I don't know whether this problem
has been fixed by the new distributed capabilities of Solr.

Furthermore, I don't want to index the pages with Nutch and reindex them with
Solr.
The only exception would be: if the content of a web page gets indexed by
Nutch, I want to use the already tokenized content of the body with some
Solr copyField operations to extend the search (e.g. making fuzzy search
possible). At the moment, I don't think this is possible.

I don't know much about the Droids project and how well it is documented.
But from what I have read in some of Otis's posts, it seems to be usable as a
crawler framework.

Pros for Nutch: it is very scalable! Thanks to Hadoop and MapReduce it
is a scaling monster (from what I've read).

Cons: the search is not as rich as what is possible with Solr. Extending
Nutch's search abilities *seems* to be more complicated than with Solr.
Furthermore, if I want to use Solr to search Nutch's index, then given my
requirements I would need to reindex the whole thing, without the benefits
of Hadoop.

What I don't know at the moment is how to use algorithms
like those mentioned in II. with Solr.
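
One possible route, which comes up in the replies to this thread, is Solr's ExternalFileField: link scores are computed outside Solr and supplied as a plain text file of key=value lines. The sketch below computes scores with a plain PageRank power iteration (not OPIC, which Nutch uses) on a tiny made-up link graph and formats them as such lines. The page names, the damping factor and the iteration count are assumptions for illustration only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the random-jump share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def to_external_file_lines(scores):
    """Format scores as key=value lines, one document key per line."""
    return [f"{key}={value:.6f}" for key, value in sorted(scores.items())]


# Hypothetical three-page link graph.
graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}
scores = pagerank(graph)
lines = to_external_file_lines(scores)
```

The resulting lines would be written to the external file for the score field, and the field could then be used in a function query to influence ranking without reindexing the documents.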

I hope you understand the problem here: to me, Solr *seems* not to be the
best solution for a web search engine, because of scaling limits in
indexing.

Where should I dive deeper?
Solr + Droids?
Solr + Nutch?
Nutch + howToExtendNutchToMakeSearchBetter?

Thanks for the discussion!
- Mitch
--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900069.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr and Nutch/Droids - to use or not to use?

2010-06-14 Thread MitchK


Just wanted to push the topic a little bit, because these questions come up
quite often and the topic is very interesting to me.

Thank you!

- Mitch


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p894391.html
Sent from the Solr - User mailing list archive at Nabble.com.

Solr and Nutch/Droids - to use or not to use?

2010-06-12 Thread MitchK

Hello community, and have a nice Saturday,

from several discussions about Solr and Nutch, I have some questions about a
hypothetical web search engine.