Re: Searching Across Multiple Cores

Jonathan Rochkind Wed, 03 Nov 2010 08:21:59 -0700

Basically, Solr doesn't do that. It seems to be a frequent topic on the listserv, people wanting Solr to be able to do something like that. But, as far as I know, it doesn't -- and I don't have a good idea of alternate ways to solve that kind of problem either.


Try putting everything in the same core, is the general answer.

Solr shard distribution is designed for performance scaling, not for accomplishing "join"-like behavior across two different schemas; the distribution/shard thing isn't going to get you to that.


Lohrenz, Steven wrote:

Sorry about the late response to this, but was on holidays.

No, as of right now there is not the same schema in each shard.

I need to be able to search a set of data resources with manually defined data fields. All of those fields are searchable.

Any one of these resources can be added to an individual's favourites list with the possibility of them adding additional tags, which are also searchable. The favourites folder needs to be searchable on all the same fields as the main data set and on the additional user defined tags.

Search fields for the main data schema are:
resourceId
resourceType
resourceGradeLevel
resourceKeywords
resourceLength
resourceSubjectArea
and about 30 more fields

The searchable fields for the My Favourites schema are:
userId
userFolder
userDefinedGradeLevel
userDefinedTags
plus all of those in the main data set.
Search queries:
1. Search the main data set for all those resources with keyword 'foo'.
2. Search the main data set for all those resources with keyword 'foo' and are for grade 3.
3. Search the main data set for all those resources with subject area of 'grammar'.
4. Search My Favourites folder for all the resources I have moved there (userId = 12321) with the keyword 'foo'.
5. Search My Favourites folder for all the resources I have moved there (userId = 12321) with the keyword 'foo' and are for grade 3 and are in the folder 'testing'.
6. Search My Favourites folder for all the resources I have moved there (userId = 12321) with the subject area of 'grammar' and I have tagged with 'interesting'.
7. Various combinations of the above.

The simplest way I came up with to do this is to have 2 separate schemas. One for the main data set and one for My Favourites. When someone adds a resource from the main data set to their My Favourites folder all the data from the main data set is copied over the My Favourites schema and the userId, folder and other user specific information is added also.
But there could be 1 million copies of basically the same data in the My 
Favourites (if 1 million users add the same resource to their favourites). I 
thought that would waste a lot of space, so was looking for another way to do 
this (using a type of join - see below). Are there any other possibilities?
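As a rough illustration, queries 1 to 6 above could be written as Solr query strings along the following lines. This is only a sketch under stated assumptions: the core names 'main' and 'favorites', the host, the helper names lucene_and and select_url, and the use of resourceGradeLevel (rather than userDefinedGradeLevel) for "grade 3" are all hypothetical; only the field names come from the lists above. Values are assumed to be simple tokens that need no escaping.

```python
from urllib.parse import urlencode


def lucene_and(**fields):
    """Join field:value pairs into a single AND-ed query string."""
    return " AND ".join(f"{name}:{value}" for name, value in fields.items())


# Queries 1-3: run against the main data set.
main_queries = [
    lucene_and(resourceKeywords="foo"),
    lucene_and(resourceKeywords="foo", resourceGradeLevel="3"),
    lucene_and(resourceSubjectArea="grammar"),
]

# Queries 4-6: run against the favourites data, always restricted by userId.
favourites_queries = [
    lucene_and(userId="12321", resourceKeywords="foo"),
    lucene_and(userId="12321", resourceKeywords="foo",
               resourceGradeLevel="3", userFolder="testing"),
    lucene_and(userId="12321", resourceSubjectArea="grammar",
               userDefinedTags="interesting"),
]


def select_url(core, query, rows=10):
    """Build a select URL for one core (host and path are assumptions)."""
    return f"http://localhost/solr/{core}/select?" + urlencode({"q": query, "rows": rows})


for q in main_queries:
    print(select_url("main", q))
for q in favourites_queries:
    print(select_url("favorites", q))
```

Written this way, queries 4 to 6 only work if the favourites documents carry the main data set's fields as well, which is exactly the duplication being asked about.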

Cheers,
Steve

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: 14 October 2010 18:58
To: solr-user@lucene.apache.org
Subject: Re: Searching Across Multiple Cores

The point/use-case of sharding/distributed search is for performance, not for segregating different data in different places. Distributed search assumes the same schema in each shard -- do you have that?

I don't think distributed search means to support the kind of "joining" you describe, that's not really what Solr does.

But if you actually do have the same schema across your shards, and have distributed search set up properly -- then you don't need to do any special "joining", the shards end up forming one 'logical' index, that's the point of it.

I don't think you can do what you describe. Solr doesn't do "joins" like an rdbms, Solr works on a single set of documents, not multiple "tables" or "collections".

If you describe your data and the kind of queries you want to run, someone might be able to figure out a way to "de-normalize" the data to support what you want to do. Which won't really have anything to do with shards/distributed search -- you add in distributed search for performance or giant-size-of-index purposes, but it doesn't change your schema design or queries.
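To make the "de-normalize" idea concrete, here is a minimal sketch of flattening one resource and one user's favourite entry into a single document for a single core. It is an illustration only: the function name denormalize, the 'id' field, its userId_resourceId format, and the sample values are assumptions, not anything Solr prescribes.

```python
def denormalize(resource, favourite):
    """Return one flat document: the resource's fields plus the user's own."""
    doc = dict(resource)      # copy, so the source record is not mutated
    doc.update(favourite)     # add userId, userFolder, userDefinedTags, ...
    # The unique key has to differ per user, so combine both ids
    # (the 'id' field name and its format are hypothetical).
    doc["id"] = f"{favourite['userId']}_{resource['resourceId']}"
    return doc


resource = {
    "resourceId": "r42",
    "resourceKeywords": ["foo"],
    "resourceSubjectArea": "grammar",
}
favourite = {
    "userId": "12321",
    "userFolder": "testing",
    "userDefinedTags": ["interesting"],
}

doc = denormalize(resource, favourite)
print(doc)
```

The trade-off is the one already raised in this thread: every favourite repeats the resource's fields, in exchange for being searchable with ordinary single-index queries.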
Lohrenz, Steven wrote:
Ken,

Ok, I understand how the distributed search works, but I don't understand how to build my query appropriately so that the results returned from the two shards only return values that exist in both result sets.

In essence, I'm doing a join across the two shards on the resourceId.
So Core0 has:
resourceId (unique key)
title
tag1
tag2
tag3
And Core1 has:
resourceId + folder + userId + grade (concatenated - this is the uniqueId)
resourceId
folder
userId
grade
For example, I would want to find all the content with userId = 893489 and tag1 = 'contentTagX'.

My thought of how to do this is to search Core1 for all the items with userId = 893489. This would return a set of results for that user with resourceId. Then I would need to search Core0 for where tag1 = 'contentTagX' and where resourceId = those returned in the result set from Core1.

I can probably do this in a search handler (say Core3 with a mashup of the 2 schemas but just redirects to the other shards), but is there an easier way to do it?

Or am I missing something?
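The two-step lookup described above could be done on the client side roughly as follows. This is a hedged sketch, not a built-in Solr feature: the helper names ids_filter and second_query and the sample documents are hypothetical, and the Core1 results are stubbed as plain dicts instead of being fetched over HTTP.

```python
def ids_filter(resource_ids):
    """Build a query clause matching any of the given resourceIds."""
    return "resourceId:(" + " OR ".join(sorted(resource_ids)) + ")"


def second_query(core1_docs, tag):
    """Turn the Core1 result set into the query to run against Core0."""
    ids = {doc["resourceId"] for doc in core1_docs}  # a set drops duplicates
    if not ids:
        return None  # the user has no favourites, so there is nothing to look up
    return f"tag1:{tag} AND {ids_filter(ids)}"


# Stand-in for the documents returned by searching Core1 with userId:893489.
core1_docs = [
    {"resourceId": "r7", "userId": "893489", "folder": "a"},
    {"resourceId": "r3", "userId": "893489", "folder": "a"},
    {"resourceId": "r7", "userId": "893489", "folder": "b"},
]

q = second_query(core1_docs, "contentTagX")
print(q)
```

One caveat with this approach: the id list grows with the size of the user's favourites, and Solr caps the number of clauses in a boolean query (the maxBooleanClauses setting, 1024 by default), so very large folders would need batching or a different design.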

Thanks for your help,
Steve


-----Original Message-----
From: Ken Stanley [mailto:doh...@gmail.com]
Sent: 14 October 2010 18:19
To: solr-user@lucene.apache.org
Subject: Re: Searching Across Multiple Cores

Steve,

Using shards is actually quite simple; it's just a matter of setting up your
shards (via multiple cores, or multiple instances of SOLR) and then passing
the shards parameter in the query string. The shards parameter is a
comma-separated list of the servers/cores you wish to use together.

So, let's try this using a fictitious example. You have two cores, one
called main for your main data set of metadata and favorites for your user
favorites meta data. You set up each schema accordingly, and you've indexed
your data. When you want to do a query on both sets of data you would build
your query appropriately, and then use the following URL (the host is
assumed to be localhost for simplicity):

http://localhost/solr/main/select?q=id:[*+TO+*]&shards=localhost/solr/main,localhost/solr/favorites&rows=100&start=0
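For anyone assembling that URL in code, a small sketch with Python's standard library might look like this. The function name distributed_select_url is made up for illustration; the host, core names, and parameters mirror the example URL above. Note that urlencode percent-encodes the brackets, slashes, and commas, which decodes back to the same values on the server side.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def distributed_select_url(host, entry_core, shard_cores, query, rows=100, start=0):
    """Build a select URL whose 'shards' parameter lists every core to search."""
    # 'shards' is a comma-separated list of host/path entries, without "http://".
    shards = ",".join(f"{host}/solr/{core}" for core in shard_cores)
    params = {"q": query, "shards": shards, "rows": rows, "start": start}
    return f"http://{host}/solr/{entry_core}/select?" + urlencode(params)


url = distributed_select_url("localhost", "main", ["main", "favorites"], "id:[* TO *]")
print(url)

# Decode the query string again to check what the server would receive.
parsed = parse_qs(urlsplit(url).query)
print(parsed["shards"][0])
```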

I am personally investigating using this technique to tie together two cores
that utilize different schemas; one schema will contain news articles,
blogs, and similar types of data, while another schema will contain
company-specific information, such as addresses, etc. If you're still having
trouble after trying this, let me know and I'd be more than happy to share
any findings that I come across.

I hope that this helps to clear things up for you. :)

- Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
                -- Douglas Adams, "The Hitchhikers Guide to the Galaxy"


On Thu, Oct 14, 2010 at 4:25 AM, Lohrenz, Steven
<steven.lohr...@hmhpub.com> wrote:
Ken,

I have been through that page many times. I could use Distributed search
for what? The first scenario or the second?

The question is: can I merge a set of results from the two cores/shards and
only return results that exist in both (determined by the resourceId, which
exists on both)?
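The merge being asked about here, keeping only results whose resourceId appears in both result sets, can be sketched as a client-side intersection. This is an illustration with in-memory sample data, not something distributed search does by itself; the function name in_both and the documents are hypothetical.

```python
def in_both(main_docs, favourite_docs, key="resourceId"):
    """Keep the main_docs whose key also occurs in favourite_docs, in order."""
    wanted = {doc[key] for doc in favourite_docs}
    return [doc for doc in main_docs if doc[key] in wanted]


# Stand-ins for the results of one query per core.
main_docs = [
    {"resourceId": "r1", "title": "Nouns"},
    {"resourceId": "r2", "title": "Verbs"},
    {"resourceId": "r3", "title": "Adverbs"},
]
favourite_docs = [
    {"resourceId": "r3", "userId": "12321"},
    {"resourceId": "r1", "userId": "12321"},
]

merged = in_both(main_docs, favourite_docs)
print([doc["resourceId"] for doc in merged])
```

For the intersection to be correct, both result sets have to be complete rather than a single page of rows, which is part of why this gets expensive on large indexes.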

Cheers,
Steve

-----Original Message-----
From: Ken Stanley [mailto:doh...@gmail.com]
Sent: 13 October 2010 20:08
To: solr-user@lucene.apache.org
Subject: Re: Searching Across Multiple Cores

On Wed, Oct 13, 2010 at 2:11 PM, Lohrenz, Steven
<steven.lohr...@hmhpub.com> wrote:
Hi,

I am trying to figure out how I can accomplish the following:

I have a fairly static and large set of resources I need to have indexed
and searchable. Solr seems to be a perfect fit for that. In addition I need
to have the ability for my users to add resources from the main data set to
a 'Favourites' folder (which can include a few more tags added by them). The
Favourites needs to be searchable in the same manner as the main data set,
across all the same fields.

My first thought was to have two separate schemas
- the first for the main data set and its metadata
- the second for the Favourites folder with all of the metadata from the
main set copied over and then adding the additional fields.

Then I thought that would probably waste quite a bit of space (the number
of users is much larger than the number of main resources).

So then I thought I could have the main data set with its metadata. Then
there would be a second one for the Favourites folder with the unique id from
the first and the additional fields it needs (userId, grade, folder, tag).
In addition, I would create another schema/core with all the fields from the
other two and have a request handler defined on it that searches across the
other 2 cores and returns the results through this core.

This third core would have searches run against it where the results would
be expected to be returned only for a single user. For example, a user
searches their Favourites folder for all the items with Foo. The result is
only those items the user has added to their Favourites with Foo somewhere
in their main data set metadata.

Could this be made to work? What would the consequences be? Any alternative
suggestions?

Thanks,
Steve

Steve,

From your description, it really sounds like you could reap the benefits of
using Distributed Search in SOLR:

http://wiki.apache.org/solr/DistributedSearch

I hope that this helps.

- Ken
