RE: Searching Across Multiple Cores

Lohrenz, Steven Wed, 03 Nov 2010 03:46:45 -0700

Sorry about the late response to this; I was on holidays.

No, as of right now the shards do not all have the same schema.

I need to be able to search a set of data resources with manually defined data 
fields. All of those fields are searchable. 

Any one of these resources can be added to an individual's favourites list with 
the possibility of them adding additional tags, which are also searchable. The 
favourites folder needs to be searchable on all the same fields as the main 
data set and on the additional user defined tags. 

Search fields for the main data schema are:
resourceId
resourceType
resourceGradeLevel
resourceKeywords
resourceLength
resourceSubjectArea
and about 30 more fields

The searchable fields for the My Favourites schema are:
userId
userFolder
userDefinedGradeLevel
userDefinedTags
plus all of those in the main data set. 

Search queries:
1. Search the main data set for all those resources with keyword 'foo'.
2. Search the main data set for all those resources with keyword 'foo' and are 
for grade 3. 
3. Search the main data set for all those resources with subject area of 
'grammar'.
4. Search My Favourites folder for all the resources I have moved there (userId 
= 12321) with the keyword 'foo'. 
5. Search My Favourites folder for all the resources I have moved there (userId 
= 12321) with the keyword 'foo' and are for grade 3 and are in the folder 
'testing'. 
6. Search My Favourites folder for all the resources I have moved there (userId 
= 12321) with the subject area of 'grammar' and I have tagged with 
'interesting'. 
7. Various combinations of the above. 
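
As a rough sketch only, a few of the searches above could be written as Solr
`q`/`fq` parameters along these lines. The field names are the ones listed
above; the helper name `build_params`, the choice to put the restrictions in
`fq` filter queries, and the use of resourceGradeLevel for "grade 3" in the
favourites case are assumptions for illustration, not a tested configuration.

```python
from urllib.parse import urlencode


def build_params(query, filters=()):
    """Return a URL-encoded Solr query string.

    `query` goes into `q`; each entry in `filters` becomes its own `fq`.
    (Hypothetical helper, not part of any Solr client library.)
    """
    params = [("q", query)]
    params.extend(("fq", f) for f in filters)
    return urlencode(params)


# Query 2: main data set, keyword 'foo' and grade 3.
main_q2 = build_params("resourceKeywords:foo", ["resourceGradeLevel:3"])

# Query 5: My Favourites for userId 12321, keyword 'foo', grade 3,
# folder 'testing'. Using resourceGradeLevel here is a guess; it could
# equally be userDefinedGradeLevel.
fav_q5 = build_params(
    "resourceKeywords:foo",
    ["userId:12321", "resourceGradeLevel:3", "userFolder:testing"],
)

# Query 6: My Favourites, subject area 'grammar', tagged 'interesting'.
fav_q6 = build_params(
    "resourceSubjectArea:grammar",
    ["userId:12321", "userDefinedTags:interesting"],
)

print(main_q2)
print(fav_q5)
print(fav_q6)
```

The resulting strings would be appended to the select URL of whichever core
holds the data being searched.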

The simplest way I came up with to do this is to have 2 separate schemas: one 
for the main data set and one for My Favourites. When someone adds a resource 
from the main data set to their My Favourites folder, all the data from the main 
data set is copied over to the My Favourites schema, and the userId, folder and 
other user-specific information is added as well. 
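
To make the copy step concrete, here is a minimal sketch of building one My
Favourites document from a main data set document. The function name
`make_favourite_doc` and the `favouriteId` key are hypothetical; the other
field names are the ones listed above.

```python
def make_favourite_doc(resource, user_id, folder, grade=None, tags=()):
    """Copy a main-data-set document and add the user-specific fields."""
    doc = dict(resource)  # shallow copy of every main-schema field
    doc["userId"] = user_id
    doc["userFolder"] = folder
    if grade is not None:
        doc["userDefinedGradeLevel"] = grade
    doc["userDefinedTags"] = list(tags)
    # Hypothetical unique key: one entry per resource/user/folder.
    doc["favouriteId"] = f"{resource['resourceId']}_{user_id}_{folder}"
    return doc


resource = {
    "resourceId": "r1",
    "resourceKeywords": ["foo"],
    "resourceSubjectArea": "grammar",
}
favourite = make_favourite_doc(
    resource, user_id=12321, folder="testing", tags=["interesting"]
)
print(favourite)
```

Every favourite carries the full set of main fields, which is what makes it
searchable on its own and also what causes the duplication.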

But there could be 1 million copies of basically the same data in My 
Favourites (if 1 million users add the same resource to their favourites). I 
thought that would waste a lot of space, so I was looking for another way to do 
this (using a type of join - see below). Are there any other possibilities?
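
For reference, the join idea described further down the thread (look up the
user's resourceIds first, then search the main data only within those ids) can
be sketched in memory as follows. Plain lists stand in for the two cores, no
Solr calls are made, and the function names are hypothetical.

```python
def favourite_resource_ids(favourites, user_id):
    """Step 1: collect the resourceIds a user has added to favourites."""
    return {f["resourceId"] for f in favourites if f["userId"] == user_id}


def search_favourites(main_docs, favourites, user_id, field, value):
    """Step 2: match main documents on a field, limited to the user's ids."""
    ids = favourite_resource_ids(favourites, user_id)
    return [
        d for d in main_docs
        if d["resourceId"] in ids and value in d.get(field, [])
    ]


main_docs = [
    {"resourceId": "r1", "resourceKeywords": ["foo", "bar"]},
    {"resourceId": "r2", "resourceKeywords": ["foo"]},
    {"resourceId": "r3", "resourceKeywords": ["baz"]},
]
favourites = [
    {"resourceId": "r1", "userId": 12321, "userFolder": "testing"},
    {"resourceId": "r3", "userId": 12321, "userFolder": "testing"},
    {"resourceId": "r2", "userId": 99999, "userFolder": "other"},
]

hits = search_favourites(main_docs, favourites, 12321, "resourceKeywords", "foo")
print([d["resourceId"] for d in hits])
```

Against real cores the second step would presumably become a filter on the
returned resourceIds, which may get unwieldy for users with many favourites.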

Cheers,
Steve

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: 14 October 2010 18:58
To: solr-user@lucene.apache.org
Subject: Re: Searching Across Multiple Cores

The point/use-case of sharding/distributed search is for performance, 
not for segregating different data in different places. Distributed 
search assumes the same schema in each shard -- do you have that?

I don't think distributed search is meant to support the kind of "joining" 
you describe; that's not really what Solr does.

But if you actually do have the same schema across your shards, and 
have distributed search set up properly -- then you don't need to do any 
special "joining"; the shards end up forming one 'logical' index, that's 
the point of it. I don't think you can do what you describe. Solr 
doesn't do "joins" like an RDBMS; Solr works on a single set of 
documents, not multiple "tables" or "collections". 

If you describe your data and the kind of queries you want to run, 
someone might be able to figure out a way to "de-normalize" the data to 
support what you want to do.  Which won't really have anything to do 
with shards/distributed search -- you add in distributed search for 
performance or giant-size-of-index purposes, but it doesn't change your 
schema design or queries.

Lohrenz, Steven wrote:
> Ken, 
>
> Ok, I understand how the distributed search works, but I don't understand how 
> to build my query appropriately so that the results returned from the two 
> shards only return values that exist in both result sets. 
>
> In essence, I'm doing a join across the two shards on the resourceId. 
>
> So Core0 has:
> resourceId (unique key)
> title 
> tag1
> tag2 
> tag3
>
> And Core1 has:
> resourceId + folder + userId + grade (concatenated - this is the uniqueId)
> resourceId
> folder
> userId
> grade
>
> For example, I would want to find all the content with userId = 893489 and 
> tag1 = 'contentTagX'. 
>
> My thought of how to do this is to search Core1 for all the items with userId 
> = 893489. This would return a set of results for that user with resourceId. 
> Then I would need to search Core0 for where tag1 = 'contentTagX' and where 
> resourceId = those returned in the result set from Core1. 
>
> I can probably do this in a search handler (say Core3, with a mashup of the 2 
> schemas, that just redirects to the other shards), but is there an easier way 
> to do it?
>
> Or am I missing something?
>
> Thanks for your help,
> Steve
>
>
> -----Original Message-----
> From: Ken Stanley [mailto:doh...@gmail.com] 
> Sent: 14 October 2010 18:19
> To: solr-user@lucene.apache.org
> Subject: Re: Searching Across Multiple Cores
>
> Steve,
>
> Using shards is actually quite simple; it's just a matter of setting up your
> shards (via multiple cores, or multiple instances of SOLR) and then passing
> the shards parameter in the query string. The shards parameter is a
> comma-separated list of the servers/cores you wish to use together.
>
> So, let's try this using a fictitious example. You have two cores: one
> called main for your main data set of metadata, and one called favorites for
> your user favorites metadata. You set up each schema accordingly, and you've
> indexed your data. When you want to do a query on both sets of data you would
> build your query appropriately, and then use the following URL (the host is
> assumed to be localhost for simplicity):
>
> http://localhost/solr/main/select?q=id:[*+TO+*]&shards=localhost/solr/main,localhost/solr/favorites&rows=100&start=0
>
> I am personally investigating using this technique to tie together two cores
> that utilize different schemas; one schema will contain news articles,
> blogs, and similar types of data, while another schema will contain
> company-specific information, such as addresses, etc. If you're still having
> trouble after trying this, let me know and I'd be more than happy to share
> any findings that I come across.
>
> I hope that this helps to clear things up for you. :)
>
> - Ken
>
> It looked like something resembling white marble, which was
> probably what it was: something resembling white marble.
>                 -- Douglas Adams, "The Hitchhikers Guide to the Galaxy"
>
>
> On Thu, Oct 14, 2010 at 4:25 AM, Lohrenz, Steven
> <steven.lohr...@hmhpub.com> wrote:
>
>> Ken,
>>
>> I have been through that page many times. I could use Distributed search
>> for what? The first scenario or the second?
>>
>> The question is: can I merge a set of results from the two cores/shards and
>> only return results that exist in both (determined by the resourceId, which
>> exists on both)?
>>
>> Cheers,
>> Steve
>>
>> -----Original Message-----
>> From: Ken Stanley [mailto:doh...@gmail.com]
>> Sent: 13 October 2010 20:08
>> To: solr-user@lucene.apache.org
>> Subject: Re: Searching Across Multiple Cores
>>
>> On Wed, Oct 13, 2010 at 2:11 PM, Lohrenz, Steven
>> <steven.lohr...@hmhpub.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to figure out how I can accomplish the following:
>>>
>>> I have a fairly static and large set of resources I need to have indexed
>>> and searchable. Solr seems to be a perfect fit for that. In addition I need
>>> to have the ability for my users to add resources from the main data set to
>>> a 'Favourites' folder (which can include a few more tags added by them). The
>>> Favourites needs to be searchable in the same manner as the main data set,
>>> across all the same fields.
>>>
>>> My first thought was to have two separate schemas
>>> - the first for the main data set and its metadata
>>> - the second for the Favourites folder with all of the metadata from the
>>> main set copied over and then adding the additional fields.
>>>
>>> Then I thought that would probably waste quite a bit of space (the number
>>> of users is much larger than the number of main resources).
>>>
>>> So then I thought I could have the main data set with its metadata. Then
>>> there would be a second one for the Favourites folder with the unique id from
>>> the first and the additional fields it needs (userId, grade, folder, tag).
>>> In addition, I would create another schema/core with all the fields from the
>>> other two and have a request handler defined on it that searches across the
>>> other 2 cores and returns the results through this core.
>>>
>>> This third core would have searches run against it where the results would
>>> be expected to be returned only for a single user. For example, a user
>>> searches their Favourites folder for all the items with Foo. The result is
>>> only those items the user has added to their Favourites with Foo somewhere
>>> in their main data set metadata.
>>>
>>> Could this be made to work? What would the consequences be? Any alternative
>>> suggestions?
>>>
>>> Thanks,
>>> Steve
>>>
>> Steve,
>>
>> From your description, it really sounds like you could reap the benefits of
>> using Distributed Search in SOLR:
>>
>> http://wiki.apache.org/solr/DistributedSearch
>>
>> I hope that this helps.
>>
>> - Ken
>>
>>
