Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
Otis,

And again I wish I were registered. I will check the JIRA and, once I feel comfortable with it, I will open the issue.

Kind regards
- Mitch

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p904145.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
I didn't open the issue, Mitch, but feel free to do it.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: MitchK
> To: solr-user@lucene.apache.org
> Sent: Thu, June 17, 2010 12:07:13 PM
> Subject: Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
>
> Otis, you are right, I wasn't aware of this, at least not for such a large data set. With an index of 4 million docs, this would mean an external file with 4 million records. But from what I've read at search-lucene.com, it seems to perform very well. Thanks for the idea!
>
> Btw: Otis, did you open a JIRA issue for the distributed indexing ability of Solr? I would like to follow the issue if it is open.
>
> Regards
> - Mitch
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
Otis,

you are right, I wasn't aware of this, at least not for such a large data set. With an index of 4 million docs, this would mean an external file with 4 million records. But from what I've read at search-lucene.com, it seems to perform very well. Thanks for the idea!

Btw: Otis, did you open a JIRA issue for the distributed indexing ability of Solr? I would like to follow the issue if it is open.

Regards
- Mitch

Otis Gospodnetic-2 wrote:
> Mitch,
>
> Yes, one day. But it sounds like you are not aware of ExternalFileField, which you can use today:
>
> http://search-lucene.com/?q=ExternalFileField&fc_project=Solr
>
> Otis
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
Mitch,

Yes, one day. But it sounds like you are not aware of ExternalFileField, which you can use today:

http://search-lucene.com/?q=ExternalFileField&fc_project=Solr

Otis

----- Original Message ----
> From: MitchK
> To: solr-user@lucene.apache.org
> Sent: Thu, June 17, 2010 4:15:27 AM
> Subject: Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
>
>> Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC
>> score computed by Nutch into a Solr field and use it during scoring, if
>> you want, say with a function query.
>
> Oh! Yes, that makes more sense than using the OPIC score as a document boost value. :-)
>
> Somewhere on the Lucene mailing lists I read that in the future it will be possible to change a field's contents without reindexing the whole document. If one stores the OPIC score (which is independent of the page's content) in a field and uses a function query to influence the score of a document, one saves the effort of reindexing the whole doc if the content did not change.
>
> Regards
> - Mitch
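For readers who have not used ExternalFileField: as far as I know, Solr reads the values for such a field from a plain text file named external_<fieldname> in the index data directory, one key=value line per document, and picks up a new file when a new searcher is opened. The following is only a minimal sketch of producing that file from externally computed scores. The function name, the field name "opic" and the doc keys are made up for illustration, and keys are assumed not to contain "=".

```python
import os
import tempfile


def write_external_file(scores, data_dir, field_name):
    """Write {doc_key: score} as sorted 'key=value' lines to external_<field_name>.

    The file layout follows my understanding of ExternalFileField;
    the function itself is a hypothetical helper, not part of Solr.
    """
    path = os.path.join(data_dir, "external_" + field_name)
    # Write to a temporary file first and rename it afterwards,
    # so that a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=data_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for key in sorted(scores):
            out.write("%s=%s\n" % (key, float(scores[key])))
    os.replace(tmp_path, path)
    return path
```

Called as write_external_file({"doc1": 0.5, "doc2": 2}, "/path/to/solr/data", "opic"), this would rewrite the whole file in one go. With 4 million records that is still a single sequential write, and no document has to be reindexed.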
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
> Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC
> score computed by Nutch into a Solr field and use it during scoring, if
> you want, say with a function query.

Oh! Yes, that makes more sense than using the OPIC score as a document boost value. :-)

Somewhere on the Lucene mailing lists I read that in the future it will be possible to change a field's contents without reindexing the whole document. If one stores the OPIC score (which is independent of the page's content) in a field and uses a function query to influence the score of a document, one saves the effort of reindexing the whole doc if the content did not change.

Regards
- Mitch
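To make the function query idea a bit more concrete, here is a rough sketch that builds a dismax request whose bf (boost function) parameter adds a function of the stored score to the relevance score. The field names opic_score and content, the host, and the particular function log(sum(opic_score,1)) are assumptions chosen for illustration; nothing in this thread prescribes them.

```python
from urllib.parse import urlencode


def build_boosted_query(base_url, user_query, score_field="opic_score"):
    """Return a Solr select URL that boosts results by a numeric score field.

    Field names are hypothetical; adjust them to the actual schema.
    """
    params = {
        "defType": "dismax",   # dismax accepts the 'bf' boost function parameter
        "q": user_query,
        "qf": "content",       # assumed name of the searchable text field
        # sum(...,1) keeps the argument of log() at 1 or above for scores >= 0
        "bf": "log(sum(%s,1))" % score_field,
    }
    return base_url + "/select?" + urlencode(params)
```

For example, build_boosted_query("http://localhost:8983/solr", "apache nutch") yields a URL that can be tried in a browser. Since the score is only read at query time, changing it does not require touching the page content.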
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
Mitch,

If you use Nutch+Solr then you wouldn't *index* the fetched content with Nutch.

Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC score computed by Nutch into a Solr field and use it during scoring, if you want, say with a function query.

Yes, ElasticSearch (ES) has built-in support for sharding and replication. It also makes it easy to implement custom scoring, which may work for OPIC here.

Yes, ask questions here. :)

Otis

----- Original Message ----
> From: MitchK
> To: solr-user@lucene.apache.org
> Sent: Thu, June 17, 2010 1:52:32 AM
> Subject: RE: Re: Re: Solr and Nutch/Droids - to use or not to use?
>
> Good morning!
>
> Great feedback from you all. This really helped a lot to get an impression of what is possible and what is not.
>
> What interests me now are some detail questions. Let's assume Solr were able to do distributed indexing on its own, so that the client does not need to know anything about shards etc. Then my questions are:
>
> I. The scoring. Nutch uses special scoring implementations like the OPIC algorithm. Can Solr use such improvements, or do I need to reimplement them for Solr?
>
> II. The indexing. At the moment it really sounds like Nutch would index everything and afterwards Solr does the job again. Regarding indexing, it would make sense if Nutch computed things like the document boost (I am not sure, but I think the results of the OPIC algorithm are added to each document as a boost) and sent an indexing request to Solr afterwards. However, if Nutch indexes the page's content and Solr does it, too, I would waste some time, no? Is this the case, or did I misunderstand something here?
>
> III. I am no Java expert. However, in a few months I will start to study computer science at a university. Maybe I will find some literature to learn more about distributed software and how hashing needs to work to make distributed indexing possible. Maybe then I can help to implement this feature in Solr. On the other hand, not much is known about Solr's distributed search concept and which classes are responsible for it, but such things one could ask on the mailing list, no?
>
> As far as I know, Elastic Search already supports distributed indexing. Maybe one can reuse the responsible implementation for Solr.
>
> Btw: I think a great benefit of using Solr + Nutch would be to extend the search. I could create several Solr cores for different kinds of search (one for picture search, one for video search, etc.) *and* with the help of Nutch I can index some of the needed content in special directories. So Solr does not need to care about indexing a picture; Nutch already does the job.
>
> Kind regards,
> - Mitch
RE: Re: Re: Solr and Nutch/Droids - to use or not to use?
Good morning!

Great feedback from you all. This really helped a lot to get an impression of what is possible and what is not.

What interests me now are some detail questions. Let's assume Solr were able to do distributed indexing on its own, so that the client does not need to know anything about shards etc. Then my questions are:

I. The scoring. Nutch uses special scoring implementations like the OPIC algorithm. Can Solr use such improvements, or do I need to reimplement them for Solr?

II. The indexing. At the moment it really sounds like Nutch would index everything and afterwards Solr does the job again. Regarding indexing, it would make sense if Nutch computed things like the document boost (I am not sure, but I think the results of the OPIC algorithm are added to each document as a boost) and sent an indexing request to Solr afterwards. However, if Nutch indexes the page's content and Solr does it, too, I would waste some time, no? Is this the case, or did I misunderstand something here?

III. I am no Java expert. However, in a few months I will start to study computer science at a university. Maybe I will find some literature to learn more about distributed software and how hashing needs to work to make distributed indexing possible. Maybe then I can help to implement this feature in Solr. On the other hand, not much is known about Solr's distributed search concept and which classes are responsible for it, but such things one could ask on the mailing list, no?

As far as I know, Elastic Search already supports distributed indexing. Maybe one can reuse the responsible implementation for Solr.

Btw: I think a great benefit of using Solr + Nutch would be to extend the search. I could create several Solr cores for different kinds of search (one for picture search, one for video search, etc.) *and* with the help of Nutch I can index some of the needed content in special directories. So Solr does not need to care about indexing a picture; Nutch already does the job.

Kind regards,
- Mitch
RE: Re: Re: Solr and Nutch/Droids - to use or not to use?
You're right. Currently, clients need to take care of this. In this case Nutch would be the client, but it cannot be configured as such. It would, indeed, be more appropriate for Solr to take care of this. We can already query any server with a set of shard hosts specified, so it would make sense if Solr also supported some kind of consistent hashing and shard management configuration.

With CouchDB-Lounge we can easily create a shard map that supports redundant shards on different servers for fail-over. It would be marvelous if Solr supported this as well.

-----Original message-----
From: Otis Gospodnetic
Sent: Wed 16-06-2010 21:41
To: solr-user@lucene.apache.org;
Subject: Re: Re: Solr and Nutch/Droids - to use or not to use?

Well, it's not that Nutch doesn't support it. Solr itself doesn't support it. Indexing applications need to know which shard they want to send documents to.

This may be a good case for a new wish issue in Solr JIRA?

Otis
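Since consistent hashing comes up several times in this thread, here is a small sketch of what client-side shard selection along those lines could look like. It is not how Solr, Nutch or CouchDB-Lounge implement anything; the class name, the shard labels and the choice of MD5 with 100 virtual nodes per shard are all assumptions made for illustration.

```python
import bisect
import hashlib


class ShardRing:
    """Minimal consistent-hash ring that maps document ids to shard names."""

    def __init__(self, shards, vnodes=100):
        # Each shard is placed on the ring several times ("virtual nodes")
        # so that documents spread more evenly across shards.
        ring = []
        for shard in shards:
            for i in range(vnodes):
                ring.append((self._hash("%s#%d" % (shard, i)), shard))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [shard for _, shard in ring]

    @staticmethod
    def _hash(value):
        # MD5 is used only as a stable, well-mixed hash, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def shard_for(self, doc_id):
        # The owner is the first ring point clockwise from the document's hash;
        # the modulo wraps around to the start of the ring.
        idx = bisect.bisect_right(self._points, self._hash(doc_id)) % len(self._points)
        return self._owners[idx]
```

An indexing client would call ShardRing(["shard1", "shard2", "shard3"]).shard_for(doc_id) and send the document to that shard. The point of the ring, compared with a plain hash modulo the number of shards, is that adding a fourth shard only moves the documents that now belong to the new shard; everything else stays where it is.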
Re: Re: Solr and Nutch/Droids - to use or not to use?
Well, it's not that Nutch doesn't support it. Solr itself doesn't support it. Indexing applications need to know which shard they want to send documents to.

This may be a good case for a new wish issue in Solr JIRA?

Otis

----- Original Message ----
> From: Markus Jelsma
> To: solr-user@lucene.apache.org
> Sent: Wed, June 16, 2010 3:31:49 PM
> Subject: RE: Re: Solr and Nutch/Droids - to use or not to use?
>
> Nutch does not, at this moment, support any form of consistent hashing to select an appropriate shard. It would be nice if someone could file an issue in Nutch's JIRA to add sharding support to it, perhaps someone with a better understanding of and more experience with Solr's distributed search than I have at the moment. I can't point Nutch's developers to the right piece of documentation on this one ;)
RE: Re: Solr and Nutch/Droids - to use or not to use?
Nutch does not, at this moment, support any form of consistent hashing to select an appropriate shard. It would be nice if someone could file an issue in Nutch's JIRA to add sharding support to it, perhaps someone with a better understanding of and more experience with Solr's distributed search than I have at the moment. I can't point Nutch's developers to the right piece of documentation on this one ;)

-----Original message-----
From: Otis Gospodnetic
Sent: Wed 16-06-2010 21:03
To: solr-user@lucene.apache.org;
Subject: Re: Solr and Nutch/Droids - to use or not to use?

Hi Mitch,

Solr can do distributed search, so it can definitely handle indices that can't fit on a single server without sharding. What I think *might* be the case is that the Nutch indexer that sends docs to Solr is not capable of sending documents to multiple Solr cores/shards. If that is the case, I think you need to move this to the Nutch user/dev list and see how to feed multiple Solr indices/cores/shards with Nutch data.

Otis

----- Original Message ----
> From: MitchK
> To: solr-user@lucene.apache.org
> Sent: Wed, June 16, 2010 2:27:16 PM
> Subject: Re: Solr and Nutch/Droids - to use or not to use?
>
> Thanks, that really helps to find the right beginning for such a journey. :-)
>
>> * Use Solr, not Nutch's search webapp
>
> As far as I have read, Solr can't scale if the index gets too large for one server:
>
>> The setup explained here has one significant caveat you also need to keep
>> in mind: scale. You cannot use this kind of setup with vertical scale
>> (collection size) that goes beyond one Solr box. The horizontal scaling
>> (query throughput) is still possible with the standard Solr replication
>> tools.
>
> ...from Lucidimagination.com
>
> Is this still the case? Furthermore, as far as I have understood this blog post (Lucidimagination.com: Nutch and Solr):
>
> http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
>
> they index everything with Nutch and then reindex it into Solr, which sounds like a lot of redundant work.
>
> Lucid, Sematext and the Nutch wiki are the only information sources where I can find discussions of Nutch and Solr, but no one seems to talk about these facts, except for this one blog post. If you say this is wrong or contingent on the setup shown, can you tell me how to avoid these problems?
>
> A lot of questions, but it's such an exciting topic... Hopefully you can answer some of them.
>
> Again, thank you for the feedback, Otis.
>
> - Mitch
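For context on the distributed search mentioned above: a distributed query is sent to any one Solr server with a shards parameter listing the shard hosts, and that server merges the results. A minimal sketch of building such a request follows; the host names and core paths are placeholders, not real servers.

```python
from urllib.parse import urlencode


def build_distributed_query(entry_url, shard_hosts, user_query):
    """Return a select URL that asks one Solr server to search several shards.

    shard_hosts are 'host:port/path' strings without the 'http://' prefix.
    All hosts used here are hypothetical examples.
    """
    params = {
        "q": user_query,
        "shards": ",".join(shard_hosts),  # comma-separated list of shards
    }
    return entry_url + "/select?" + urlencode(params)


url = build_distributed_query(
    "http://host1:8983/solr",
    ["host1:8983/solr", "host2:8983/solr"],
    "nutch",
)
```

Note that this covers only the query side. Which shard a document is indexed into is still up to the indexing client, which is exactly the gap discussed in this thread.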