Re: solr+hadoop = next solr
On 6/7/07, Rafael Rossini [EMAIL PROTECTED] wrote:
> Hi, Jeff and Mike. Would you mind telling us about the architecture of your solutions a little bit? Mike, you said that you implemented a highly-distributed search engine using Solr as indexing nodes. What does that mean? You guys implemented a master, multi-slave solution for replication? Or the whole index shards for high availability and fail over?

Our solution doesn't use solr, but goes directly to lucene. It's built on windows, so the interop communication service is built on .net remoting (tcp based). Microsoft has deprecated ongoing development with .net remoting, in favor of other more standard mechanisms, i.e. http. So, we're looking to migrate our solution to a more community-supported model.

The underlying structure sounds similar to what others have done: index shards distributed to various servers, each responsible for a subset of the index. A merging server handles coordination of concurrent thread requests and synchronizes the results as they're returned. The thread coordination and search results interleaving process is functional but not really scalable. It works for our user model, where users tend not to page deeply through results. We want to change that so we can use solr as our primary data source read mechanism for our site.

-- j
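For readers following the thread, here is a minimal Python sketch of the kind of merging step described above: one query is sent to every shard concurrently and the sorted per-shard results are interleaved into a single ranked list. It is an illustration only, not Jeff's .NET implementation; the in-memory SHARDS data and the names search_shard and merged_search are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical in-memory "shards": each maps doc id -> score for one query.
SHARDS = [
    {"a1": 0.9, "a2": 0.4},
    {"b1": 0.7, "b2": 0.6},
    {"c1": 0.8},
]


def search_shard(shard, limit):
    """Return the shard's top `limit` hits as (score, doc_id), best first."""
    hits = [(score, doc_id) for doc_id, score in shard.items()]
    return heapq.nlargest(limit, hits)


def merged_search(shards, offset, rows):
    """Query all shards concurrently and interleave their sorted results."""
    # Any shard could hold all of the top hits, so each one has to
    # return offset + rows results, not just `rows`.
    limit = offset + rows
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, limit), shards))
    # Each per-shard list is sorted best first, so a k-way merge suffices.
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc_id for _, doc_id in islice(merged, offset, offset + rows)]
```

In this sketch every shard has to return offset + rows hits for each page, so the work per request grows with page depth. That may be one reason such a design is acceptable only when users rarely page deeply.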
Re: solr+hadoop = next solr
On 6-Jun-07, at 7:44 PM, Jeff Rodenburg wrote:
> I've been exploring distributed search, as of late. I don't know about the next solr but I could certainly see a distributed solr grow out of such an expansion.

I've implemented a highly-distributed search engine using Solr (200m docs and growing, 60+ servers). It is not a Solr-based solution in the vein of FederatedSearch--it is a higher-level architecture that uses Solr as indexing nodes. I'll note that it is a lot of work and would be even more work to develop in the generic extensible philosophy that Solr espouses. It is not really suitable for contribution, unfortunately (being written in python and proprietary).

> In terms of the FederatedSearch wiki entry (updated last year), has there been any progress made this year on this topic, at least something worthy of being added or updated to the wiki page? Not to splinter efforts here, but maybe a working group that was focused on that topic could help to move things forward a bit.

I don't believe that absence of organization has been the cause of lack of forward progress on this issue, but simply that there has been no-one sufficiently interested and committed to prioritizing this huge task to work on it. There is no need to form a working group (not when there are only a handful of active committers to begin with)--all interested people could just use solr-dev@ for discussion.

Solr is an open-source project, so huge features will get implemented when there is a person or group of people devoted to leading the charge on the issue. If you're interested in being that person, that's great!

-Mike
Re: solr+hadoop = next solr
Mike - thanks for the comments. Some responses added below.

On 6/7/07, Mike Klaas [EMAIL PROTECTED] wrote:
> I've implemented a highly-distributed search engine using Solr (200m docs and growing, 60+ servers). It is not a Solr-based solution in the vein of FederatedSearch--it is a higher-level architecture that uses Solr as indexing nodes. I'll note that it is a lot of work and would be even more work to develop in the generic extensible philosophy that Solr espouses.

Yeah, we've done the same thing in the .Net world, and it's a tough slog. We're in the same situation -- making our solution generically extensible is pretty much a non-starter.

> > In terms of the FederatedSearch wiki entry (updated last year), has there been any progress made this year on this topic, at least something worthy of being added or updated to the wiki page? Not to splinter efforts here, but maybe a working group that was focused on that topic could help to move things forward a bit.
>
> I don't believe that absence of organization has been the cause of lack of forward progress on this issue, but simply that there has been no-one sufficiently interested and committed to prioritizing this huge task to work on it. There is no need to form a working group (not when there are only a handful of active committers to begin with)--all interested people could just use solr-dev@ for discussion.

That makes sense, just didn't want to bombard the list with the subject if it was a detractor from the core project, i.e. keep lucene messages on lucene, solr messages on solr, etc. The good-community-participant approach, if you will.

> Solr is an open-source project, so huge features will get implemented when there is a person or group of people devoted to leading the charge on the issue. If you're interested in being that person, that's great!

Glad to jump in, not sure I qualify as such for that, but certainly a big cheerleader nonetheless.
Re: solr+hadoop = next solr
Hi, Jeff and Mike. Would you mind telling us about the architecture of your solutions a little bit? Mike, you said that you implemented a highly-distributed search engine using Solr as indexing nodes. What does that mean? You guys implemented a master, multi-slave solution for replication? Or the whole index shards for high availability and fail over?

On 6/7/07, Jeff Rodenburg [EMAIL PROTECTED] wrote:
> Mike - thanks for the comments. Some responses added below.
>
> On 6/7/07, Mike Klaas [EMAIL PROTECTED] wrote:
> > I've implemented a highly-distributed search engine using Solr (200m docs and growing, 60+ servers). It is not a Solr-based solution in the vein of FederatedSearch--it is a higher-level architecture that uses Solr as indexing nodes. I'll note that it is a lot of work and would be even more work to develop in the generic extensible philosophy that Solr espouses.
>
> Yeah, we've done the same thing in the .Net world, and it's a tough slog. We're in the same situation -- making our solution generically extensible is pretty much a non-starter.
>
> > > In terms of the FederatedSearch wiki entry (updated last year), has there been any progress made this year on this topic, at least something worthy of being added or updated to the wiki page? Not to splinter efforts here, but maybe a working group that was focused on that topic could help to move things forward a bit.
> >
> > I don't believe that absence of organization has been the cause of lack of forward progress on this issue, but simply that there has been no-one sufficiently interested and committed to prioritizing this huge task to work on it. There is no need to form a working group (not when there are only a handful of active committers to begin with)--all interested people could just use solr-dev@ for discussion.
>
> That makes sense, just didn't want to bombard the list with the subject if it was a detractor from the core project, i.e. keep lucene messages on lucene, solr messages on solr, etc. The good-community-participant approach, if you will.
>
> > Solr is an open-source project, so huge features will get implemented when there is a person or group of people devoted to leading the charge on the issue. If you're interested in being that person, that's great!
>
> Glad to jump in, not sure I qualify as such for that, but certainly a big cheerleader nonetheless.
Re: solr+hadoop = next solr
On 6/6/07, James liu [EMAIL PROTECTED] wrote:
> anyone agree?

No ;-) At least not if you mean using map-reduce for queries.

When I started looking at distributed search, I immediately went and read the map-reduce paper (easier concept than it first appeared), and realized it's really more for the indexing side of things (big batch jobs, making data from data, etc). Nutch uses map reduce for crawling/indexing, but not for querying.

-Yonik
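To illustrate why map-reduce fits the indexing side, here is a toy single-process Python sketch of building an inverted index in map and reduce steps. It uses neither Hadoop nor Nutch, and the function names and sample DOCS are hypothetical; a real job would run the map step across many machines and shuffle the pairs by term before reducing.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_postings(pairs):
    """Reduce step: group doc ids by term into sorted posting lists."""
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def build_index(docs):
    """Run the map step over every document, then reduce all pairs."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_doc(doc_id, text)]
    return reduce_postings(pairs)


# Hypothetical sample documents, keyed by doc id.
DOCS = {1: "solr on hadoop", 2: "hadoop map reduce", 3: "solr search"}
INDEX = build_index(DOCS)
```

This is a batch job that makes data from data: it reads every document once and writes a complete index. A query, by contrast, needs a low-latency lookup against an index that already exists.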
Re: solr+hadoop = next solr
I've been exploring distributed search, as of late. I don't know about the next solr but I could certainly see a distributed solr grow out of such an expansion.

In terms of the FederatedSearch wiki entry (updated last year), has there been any progress made this year on this topic, at least something worthy of being added or updated to the wiki page? Not to splinter efforts here, but maybe a working group that was focused on that topic could help to move things forward a bit.

- j

On 6/6/07, Yonik Seeley [EMAIL PROTECTED] wrote:
> On 6/6/07, James liu [EMAIL PROTECTED] wrote:
> > anyone agree?
>
> No ;-) At least not if you mean using map-reduce for queries.
>
> When I started looking at distributed search, I immediately went and read the map-reduce paper (easier concept than it first appeared), and realized it's really more for the indexing side of things (big batch jobs, making data from data, etc). Nutch uses map reduce for crawling/indexing, but not for querying.
>
> -Yonik
Re: solr+hadoop = next solr
2007/6/7, Yonik Seeley [EMAIL PROTECTED]:
> On 6/6/07, James liu [EMAIL PROTECTED] wrote:
> > anyone agree?
>
> No ;-) At least not if you mean using map-reduce for queries.
>
> When I started looking at distributed search, I immediately went and read the map-reduce paper (easier concept than it first appeared), and realized it's really more for the indexing side of things (big batch jobs, making data from data, etc). Nutch uses map reduce for crawling/indexing, but not for querying.

Yes, Nutch uses map-reduce only for crawling/indexing, not for querying.

http://www.nabble.com/something-i-think-important-and-should-be-added-tf3813838.html#a10796136

Map-reduce would be used just for indexing, to decrease the index size of each master Solr query instance and increase query speed. Indexing and merging will take a lot of time, but it will increase query accuracy.

The index and the data would not be on the same box, so we only need to make sure the master query server hardware is powerful; the slave query server hardware is not very important.

The master index server should support multiple indexes. If Solr supported that, I think Solr users could set up their search quickly.

It's just my thought. What do you think, Yonik, and what do you think the next Solr should be?

> -Yonik

--
regards
jl
Re: solr+hadoop = next solr
On 6/6/07, Jeff Rodenburg [EMAIL PROTECTED] wrote:
> In terms of the FederatedSearch wiki entry (updated last year), has there been any progress made this year on this topic, at least something worthy of being added or updated to the wiki page?

Priorities shifted, and I dropped it for a while. I recently started working with a CNET group that may need it, so I could start working on it again in the next few months. Don't wait for me if you have ideas though... I'll try to follow along and chime in.

-Yonik