Walter,

Thanks!  You bring up a very important 'commit' problem which I had
not thought about.  So I am running a DIH that is wiping out part of
the index (i.e. all animals), then re-indexing/re-importing.  I have
another DIH that is wiping out part of the index (minerals), then
re-indexing/re-importing.

I see this problem (which I think you already realized):
1. Index is full and people are querying.
2. DIH for animals starts running and wipes out all animals.
3. DIH for minerals starts running and wipes out all minerals.
4. DIH for animals finishes, and commits.
5. User queries for minerals and might get 0 results or only a subset,
because the animals DIH 'committed' the changes made by the minerals
DIH (let's assume only the clear had happened in the minerals DIH
when the animals DIH committed).

To further complicate things, I have a third SolrJ application that
will be processing another dataset and updating/committing to the
index.  Is there a recommended way to handle multiple applications
that are wiping out and writing to parts of the index, so that the
commits do not happen at an inopportune time (i.e. a commit by one
application right after another application has wiped its part of the
index but has not yet repopulated it)?

I need to update the index every so often (~30 minutes).  I could
write an app that chains the other 'indexer' apps (DIH1, DIH2,
SolrJApp1) together such that they run serially and then do one commit
at the end.  Not too bad, but I am wondering if there is anything I can
take advantage of in Solr that would help with this problem.  I am
using Solr 4.0-BETA if that makes a difference.
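
Something like this is what I had in mind (a rough, untested SolrJ
sketch; the handler paths /dataimport-animals and /dataimport-minerals,
the core URL, and the assumption that each DIH config limits its delete
to its own type via a preImportDeleteQuery are just placeholders for
illustration):

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.QueryRequest;
  import org.apache.solr.common.params.ModifiableSolrParams;
  import org.apache.solr.common.util.NamedList;

  public class SerialIndexer {

      public static void main(String[] args) throws Exception {
          HttpSolrServer solr =
              new HttpSolrServer("http://localhost:8983/solr/mycore");

          // Run each DIH full-import with commit=false, one after the other.
          runDihAndWait(solr, "/dataimport-animals");
          runDihAndWait(solr, "/dataimport-minerals");

          // ... run the SolrJ indexer for the third dataset here,
          // adding documents but never calling commit() ...

          // A single commit at the very end exposes all changes at once.
          solr.commit();
          solr.shutdown();
      }

      private static void runDihAndWait(HttpSolrServer solr, String handlerPath)
              throws Exception {
          ModifiableSolrParams params = new ModifiableSolrParams();
          params.set("command", "full-import");
          params.set("clean", "true");    // clean only this type, per the handler's delete query
          params.set("commit", "false");  // defer the commit until the very end
          QueryRequest importReq = new QueryRequest(params);
          importReq.setPath(handlerPath);
          solr.request(importReq);

          // full-import runs asynchronously, so poll the handler's status
          // until it reports "idle" before moving on to the next one.
          ModifiableSolrParams statusParams = new ModifiableSolrParams();
          statusParams.set("command", "status");
          QueryRequest statusReq = new QueryRequest(statusParams);
          statusReq.setPath(handlerPath);
          while (true) {
              NamedList<Object> rsp = solr.request(statusReq);
              if ("idle".equals(rsp.get("status"))) {
                  break;
              }
              Thread.sleep(5000);
          }
      }
  }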

Thanks again!

Billy

On Sat, Oct 6, 2012 at 6:05 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> Right. You define three update handlers, something like /update-animal, 
> /update-mineral, and /update-vegetable. Each one has a separate DIH config. 
> Each config deletes documents of that type and loads documents of that type.
>
> You will not want to run them at the same time, because a commit in one will 
> commit all the pending changes from any other one. It would be much less 
> confusing to run them separately.
>
> wunder
>
> On Oct 6, 2012, at 2:30 PM, Erick Erickson wrote:
>
>> Sure, you need to define the appropriate delete query for each DIH entry.
>>
>> Best
>> Erick
>>
>> On Fri, Oct 5, 2012 at 5:40 PM, Billy Newman <newman...@gmail.com> wrote:
>>> Does DIH support only deleting/re-indexing docs of a certain type?
>>>
>>> I.E. can I have a DIH for type:vegetable and another for type:mineral
>>> and each only deletes/recreates the right types?
>>>
>>> Thanks.
>>>
>>> On Fri, Oct 5, 2012 at 1:04 PM, Walter Underwood <wun...@wunderwood.org> 
>>> wrote:
>>>> Using the same unique key doesn't handle documents which disappear from 
>>>> one indexing to the next.
>>>>
>>>> Instead, add a field for the type of item, like type:animal, 
>>>> type:vegetable, or type:mineral. Then the query used to clean up before 
>>>> indexing can delete all items of that type.
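>>>>
>>>> A minimal SolrJ sketch of that cleanup step (assuming the field is
>>>> named "type" and the core URL below; in a DIH config the equivalent
>>>> is a preImportDeleteQuery on the root entity):
>>>>
>>>>   HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
>>>>   solr.deleteByQuery("type:animal");  // removes only the animal documents
>>>>   // ... re-add the current animal documents, then commit ...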
>>>>
>>>> wunder
>>>>
>>>> On Oct 5, 2012, at 12:00 PM, Erick Erickson wrote:
>>>>
>>>>> DIH always gives me indigestion.....
>>>>>
>>>>> Couple of things:
>>>>> See the 'clean' parameter here for full import:
>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>> it defaults to true. I think if you set it to "false"
>>>>> _and_ assuming that your <uniqueKey> is
>>>>> defined, it should work OK.
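>>>>>
>>>>> For example (assuming DIH is registered at the default /dataimport path):
>>>>>
>>>>>   http://localhost:8983/solr/dataimport?command=full-import&clean=false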
>>>>>
>>>>> The other approach would be to control the
>>>>> indexing of your XML from, say, a SolrJ program
>>>>> combined with a cron job....
>>>>>
>>>>> Does that work?
>>>>> Erick
>>>>>
>>>>> On Fri, Oct 5, 2012 at 2:39 PM, Billy Newman <newman...@gmail.com> wrote:
>>>>>> Erick,
>>>>>>
>>>>>> I did mention using the DIH to index the first two datasets; that is
>>>>>> where the root of my problem lies.
>>>>>>
>>>>>> I do see the benefit of one index.  However the question still
>>>>>> remains, can I use the DIH to index xml from data set 1 and 2, every
>>>>>> 15 minutes or so (full index) without wiping out all the indexed data
>>>>>> in the index from data set 3.
>>>>>>
>>>>>> I.e. from a couple of quick tests, the DIH full-import destroys all
>>>>>> data in the index before it repopulates it.  Not sure I can just have
>>>>>> it destroy/re-index data of a certain type.  Basically DIH full-import
>>>>>> on my_index for type 'dataset1', and DIH full-import on my-index for
>>>>>> type 'dataset2'.  Both full-imports leaving alone the type 'dataset3'
>>>>>> data in the index.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thanks,
>>>>>> Billy
>>>>>>
>>>>>> On Fri, Oct 5, 2012 at 10:42 AM, Erick Erickson 
>>>>>> <erickerick...@gmail.com> wrote:
>>>>>>> The very first question is "what form are your XML docs in?"
>>>>>>> Solr does NOT index arbitrary XML, so I'm guessing
>>>>>>> you're using DIH and some of the xml stuff there. Do note
>>>>>>> that the XSLT support there is a subset of the full capabilities....
>>>>>>>
>>>>>>> Second, I'd recommend you just put it all in a single index, it'll be
>>>>>>> simpler. Index a field indicating which of your three sources
>>>>>>> the doc belongs to. Then you can group (aka Field Collapse) by
>>>>>>> source and your result sets will contain the top N docs from each
>>>>>>> type and you can do whatever you want with them at the app
>>>>>>> level. See: http://wiki.apache.org/solr/FieldCollapsing
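>>>>>>>
>>>>>>> A rough SolrJ sketch of such a grouped query (the field name "source"
>>>>>>> is just a placeholder for whatever you call that type/source field):
>>>>>>>
>>>>>>>   HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
>>>>>>>   SolrQuery q = new SolrQuery("*:*");
>>>>>>>   q.setParam("group", true);
>>>>>>>   q.setParam("group.field", "source");
>>>>>>>   q.setParam("group.limit", "5");   // top 5 docs from each source
>>>>>>>   QueryResponse rsp = solr.query(q);
>>>>>>>   GroupResponse bySource = rsp.getGroupResponse();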
>>>>>>>
>>>>>>> By including a type, you can also do nifty things like delete all the
>>>>>>> records for a particular type by query.
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 5, 2012 at 11:22 AM, Billy Newman <newman...@gmail.com> 
>>>>>>> wrote:
>>>>>>>> I am looking into Solr to index a few of my data sets, 3 to be exact.
>>>>>>>>
>>>>>>>> The first 2 are really small xml docs retrieved via url, ~300 records
>>>>>>>> each.  The data behind both of these changes very frequently (~every 5
>>>>>>>> minutes).  The data itself does not have timestamps, so a delta-import
>>>>>>>> using DIH would not work (at least I don't think it would work).  I am
>>>>>>>> thinking about just re-indexing these 2 data sources every 15 minutes
>>>>>>>> or so to keep the indexes up to date.
>>>>>>>>
>>>>>>>> The 3rd data set is a lot more complicated in which I will probably
>>>>>>>> have to use SolrJ and write some custom code to handle
>>>>>>>> inserts/updates/deletes.
>>>>>>>>
>>>>>>>> I need to be able to search all the data sets once they are indexed in
>>>>>>>> one search.
>>>>>>>>
>>>>>>>> A couple options:
>>>>>>>>
>>>>>>>> 1.  Store the data from all 3 datasets in different indexes, allowing
>>>>>>>> the DIH import handler to re-index datasets 1 and 2 without affecting
>>>>>>>> indexed data from data set 3.  Not sure this is advised, as I am not
>>>>>>>> sure it is a good idea, or even possible, to search across multiple cores.
>>>>>>>>
>>>>>>>> 2. Store all the data from all 3 datasets in the same index.  Yet this
>>>>>>>> brings the question of how to re-index datasets 1 and 2 using a DIH
>>>>>>>> full-import and not lose indexed data from data set 3.
>>>>>>>>
>>>>>>>> Just starting with Solr so please go easy ;).  Thanks in advance.
>>>>>>>>
>>>>>>>> Billy
>>>>
>>>> --
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>>
>>>>
>>>>
>
> --
> Walter Underwood
> wun...@wunderwood.org
>
>
>
