Re: solr+hadoop = next solr

2007-06-08 Thread Jeff Rodenburg

On 6/7/07, Rafael Rossini [EMAIL PROTECTED] wrote:


Hi, Jeff and Mike.

   Would you mind telling us about the architecture of your solutions a
little bit? Mike, you said that you implemented a highly-distributed search
engine using Solr as indexing nodes. What does that mean? Did you guys
implement a master, multi-slave solution for replication? Or did you shard
the whole index for high availability and failover?



Our solution doesn't use solr, but goes directly to lucene.  It's built on
windows, so the interop communication service is built on .net remoting (tcp
based).  Microsoft has deprecated ongoing development with .net remoting, in
favor of other more standard mechanisms, i.e. http.  So, we're looking to
migrate our solution to a more community-supported model.

The underlying structure sounds similar to what others have done: index
shards distributed to various servers, each responsible for a subset of the
index.  A merging server handles coordination of concurrent thread requests
and synchronizes the results as they're returned.  The thread coordination
and search results interleaving process is functional but not really
scalable.  It works for our user model, where users tend not to page deeply
through results.  We want to change that so we can use solr as our primary
data source read mechanism for our site.

-- j
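
A minimal sketch of the shard-and-merge pattern described above, for readers who want something concrete. It is not Jeff's .NET implementation: the in-memory SHARDS data and the names search_shard and distributed_search are hypothetical, and a thread pool stands in for the remote calls to shard servers.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical in-memory "shards": each maps doc id -> relevance score.
# In a real deployment each shard would be a remote index server.
SHARDS = [
    {"a1": 0.91, "a2": 0.40, "a3": 0.15},
    {"b1": 0.87, "b2": 0.62},
    {"c1": 0.95, "c2": 0.30, "c3": 0.05},
]


def search_shard(shard, limit):
    """Return the shard's top `limit` hits as (score, doc_id), best first."""
    hits = [(score, doc_id) for doc_id, score in shard.items()]
    return heapq.nlargest(limit, hits)


def distributed_search(shards, offset, page_size):
    """Query every shard concurrently, then interleave the hits by score."""
    # Any single shard might hold all of the first offset + page_size hits,
    # so every shard has to return that many. This is why deep paging
    # gets more expensive as the offset grows.
    limit = offset + page_size
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, limit), shards))
    # Each per-shard list is already sorted best first, so a k-way merge
    # yields the globally ranked stream without re-sorting everything.
    merged = heapq.merge(*per_shard, reverse=True)
    ranked = list(islice(merged, limit))
    return [doc_id for _score, doc_id in ranked[offset:]]


if __name__ == "__main__":
    print(distributed_search(SHARDS, offset=0, page_size=3))
```

Note that the merging step must pull offset + page_size hits from every shard, which is one plausible reason a design like this works when users rarely page deeply and struggles when they do.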


Re: solr+hadoop = next solr

2007-06-07 Thread Mike Klaas

On 6-Jun-07, at 7:44 PM, Jeff Rodenburg wrote:

I've been exploring distributed search, as of late.  I don't know about the
next solr but I could certainly see a distributed solr grow out of such
an expansion.


I've implemented a highly-distributed search engine using Solr (200m  
docs and growing, 60+ servers).   It is not a Solr-based solution in  
the vein of FederatedSearch--it is a higher-level architecture that  
uses Solr as indexing nodes.  I'll note that it is a lot of work and  
would be even more work to develop in the generic extensible  
philosophy that Solr espouses.


It is not really suitable for contribution, unfortunately (being  
written in python and proprietary).


In terms of the FederatedSearch wiki entry (updated last year), has there
been any progress made this year on this topic, at least something worthy of
being added or updated to the wiki page?  Not to splinter efforts here, but
maybe a working group that was focused on that topic could help to move
things forward a bit.


I don't believe that absence of organization has been the cause of  
lack of forward progress on this issue, but simply that there has  
been no-one sufficiently interested and committed to prioritizing  
this huge task to work on it.  There is no need to form a working  
group (not when there are only a handful of active committers to  
begin with)--all interested people could just use solr-dev@ for  
discussion.


Solr is an open-source project, so huge features will get implemented  
when there is a person or group of people devoted to leading the  
charge on the issue.  If you're interested in being that person,  
that's great!


-Mike


Re: solr+hadoop = next solr

2007-06-07 Thread Jeff Rodenburg

Mike - thanks for the comments.  Some responses added below.

On 6/7/07, Mike Klaas [EMAIL PROTECTED] wrote:



I've implemented a highly-distributed search engine using Solr (200m
docs and growing, 60+ servers).   It is not a Solr-based solution in
the vein of FederatedSearch--it is a higher-level architecture that
uses Solr as indexing nodes.  I'll note that it is a lot of work and
would be even more work to develop in the generic extensible
philosophy that Solr espouses.



Yeah, we've done the same thing in the .Net world, and it's a tough slog.
We're in the same situation -- making our solution generically extensible is
pretty much a non-starter.


In terms of the FederatedSearch wiki entry (updated last year), has there
been any progress made this year on this topic, at least something worthy of
being added or updated to the wiki page?  Not to splinter efforts here, but
maybe a working group that was focused on that topic could help to move
things forward a bit.

I don't believe that absence of organization has been the cause of
lack of forward progress on this issue, but simply that there has
been no-one sufficiently interested and committed to prioritizing
this huge task to work on it.  There is no need to form a working
group (not when there are only a handful of active committers to
begin with)--all interested people could just use solr-dev@ for
discussion.



That makes sense, just didn't want to bombard the list with the subject if
it was a distraction from the core project, i.e. keep lucene messages on
lucene, solr messages on solr, etc.  The good-community-participant
approach, if you will.

Solr is an open-source project, so huge features will get implemented
when there is a person or group of people devoted to leading the
charge on the issue.  If you're interested in being that person,
that's great!



Glad to jump in, not sure I qualify as such for that, but certainly a big
cheerleader nonetheless.


Re: solr+hadoop = next solr

2007-06-07 Thread Rafael Rossini

Hi, Jeff and Mike.

  Would you mind telling us about the architecture of your solutions a
little bit? Mike, you said that you implemented a highly-distributed search
engine using Solr as indexing nodes. What does that mean? Did you guys
implement a master, multi-slave solution for replication? Or did you shard
the whole index for high availability and failover?


On 6/7/07, Jeff Rodenburg [EMAIL PROTECTED] wrote:


Mike - thanks for the comments.  Some responses added below.

On 6/7/07, Mike Klaas [EMAIL PROTECTED] wrote:


 I've implemented a highly-distributed search engine using Solr (200m
 docs and growing, 60+ servers).   It is not a Solr-based solution in
 the vein of FederatedSearch--it is a higher-level architecture that
 uses Solr as indexing nodes.  I'll note that it is a lot of work and
 would be even more work to develop in the generic extensible
 philosophy that Solr espouses.


Yeah, we've done the same thing in the .Net world, and it's a tough slog.
We're in the same situation -- making our solution generically extensible is
pretty much a non-starter.

 In terms of the FederatedSearch wiki entry (updated last year), has there
 been any progress made this year on this topic, at least something worthy of
 being added or updated to the wiki page?  Not to splinter efforts here, but
 maybe a working group that was focused on that topic could help to move
 things forward a bit.

 I don't believe that absence of organization has been the cause of
 lack of forward progress on this issue, but simply that there has
 been no-one sufficiently interested and committed to prioritizing
 this huge task to work on it.  There is no need to form a working
 group (not when there are only a handful of active committers to
 begin with)--all interested people could just use solr-dev@ for
 discussion.


That makes sense, just didn't want to bombard the list with the subject if
it was a distraction from the core project, i.e. keep lucene messages on
lucene, solr messages on solr, etc.  The good-community-participant
approach, if you will.

Solr is an open-source project, so huge features will get implemented
 when there is a person or group of people devoted to leading the
 charge on the issue.  If you're interested in being that person,
 that's great!


Glad to jump in, not sure I qualify as such for that, but certainly a big
cheerleader nonetheless.



Re: solr+hadoop = next solr

2007-06-06 Thread Yonik Seeley

On 6/6/07, James liu [EMAIL PROTECTED] wrote:

anyone agree?


No ;-)

At least not if you mean using map-reduce for queries.

When I started looking at distributed search, I immediately went and
read the map-reduce paper (easier concept than it first appeared), and
realized it's really more for the indexing side of things (big batch
jobs, making data from data, etc).  Nutch uses map reduce for
crawling/indexing, but not for querying.

-Yonik
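
To illustrate why map-reduce fits the indexing side ("making data from data"), here is a minimal single-process sketch that builds an inverted index in map, shuffle, and reduce steps. It is not Hadoop or Nutch code: the function names and the DOCS sample are hypothetical, and a real job would run the phases across many machines.

```python
from collections import defaultdict

# Hypothetical sample corpus: doc id -> text.
DOCS = {
    "d1": "solr uses lucene",
    "d2": "nutch uses hadoop",
    "d3": "Lucene index",
}


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce step: turn all doc ids for a term into a sorted posting list."""
    return term, sorted(doc_ids)


def build_index(docs):
    """Run map, shuffle, and reduce in sequence over the whole corpus."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


if __name__ == "__main__":
    print(build_index(DOCS)["lucene"])
```

The whole job is a batch pass over every document, which is a natural fit for building an index offline but a poor fit for answering a single query with low latency.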


Re: solr+hadoop = next solr

2007-06-06 Thread Jeff Rodenburg

I've been exploring distributed search, as of late.  I don't know about the
next solr but I could certainly see a distributed solr grow out of such
an expansion.

In terms of the FederatedSearch wiki entry (updated last year), has there
been any progress made this year on this topic, at least something worthy of
being added or updated to the wiki page?  Not to splinter efforts here, but
maybe a working group that was focused on that topic could help to move
things forward a bit.

- j

On 6/6/07, Yonik Seeley [EMAIL PROTECTED] wrote:


On 6/6/07, James liu [EMAIL PROTECTED] wrote:
 anyone agree?

No ;-)

At least not if you mean using map-reduce for queries.

When I started looking at distributed search, I immediately went and
read the map-reduce paper (easier concept than it first appeared), and
realized it's really more for the indexing side of things (big batch
jobs, making data from data, etc).  Nutch uses map reduce for
crawling/indexing, but not for querying.

-Yonik



Re: solr+hadoop = next solr

2007-06-06 Thread James liu

2007/6/7, Yonik Seeley [EMAIL PROTECTED]:


On 6/6/07, James liu [EMAIL PROTECTED] wrote:
 anyone agree?

No ;-)

At least not if you mean using map-reduce for queries.

When I started looking at distributed search, I immediately went and
read the map-reduce paper (easier concept than it first appeared), and
realized it's really more for the indexing side of things (big batch
jobs, making data from data, etc).  Nutch uses map reduce for
crawling/indexing, but not for querying.



Yes, Nutch uses map-reduce only for crawling/indexing, not for querying.


http://www.nabble.com/something-i-think-important-and-should-be-added-tf3813838.html#a10796136

I mean map-reduce just for indexing, to decrease the index size of each
master Solr query instance and increase query speed.

It will take more time to index and merge, but it will increase query
accuracy.

The index and the data are not on the same box, so we only need to make sure
the master query server's hardware is powerful; the slave query servers'
hardware is not very important.

The master index server should support multiple indexes.

If Solr supported this, I think users could set up their search quickly.


It's just my thought.

What do you think, Yonik? And what do you think the next Solr should be?


-Yonik






--
regards
jl


Re: solr+hadoop = next solr

2007-06-06 Thread Yonik Seeley

On 6/6/07, Jeff Rodenburg [EMAIL PROTECTED] wrote:

In terms of the FederatedSearch wiki entry (updated last year), has there
been any progress made this year on this topic, at least something worthy of
being added or updated to the wiki page?


Priorities shifted, and I dropped it for a while.
I recently started working with a CNET group that may need it, so I
could start working on it again in the next few months.  Don't wait
for me if you have ideas though... I'll try to follow along and chime
in.

-Yonik