Re: Storing 2 dimension array in Solr

David Philip Mon, 14 Oct 2013 05:39:45 -0700

Hi,

  I will look into the pseudo-join.


Jack,
I have doubts about further de-normalization. I will take up the rest of the
points you raised. Thank you.
Basically, we have 2 different Solr indexes. One table is rarely updated, but
this group-disease table has frequent updates and new diseases are added
very often. So we maintain them separately. While querying, we need a join
operation on tables 1 and 2.
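
A minimal sketch of what such a pseudo-join request could look like, using
Solr's join query parser with its from, to and fromIndex local parameters.
The field name "group", the core name "group_disease" and the inner query
are hypothetical placeholders, not our real schema. Note that a join of this
kind returns only documents from the index being queried, not fields from
the joined index.

```python
from urllib.parse import urlencode


def build_join_params(from_field, to_field, from_index, inner_query, rows=10):
    """Build request parameters for a Solr join ("pseudo-join") query.

    The inner query runs against `from_index`; documents in the queried
    index match when their `to_field` equals a collected `from_field` value.
    All field and core names passed in here are illustrative assumptions.
    """
    local_params = "{{!join from={} to={} fromIndex={}}}".format(
        from_field, to_field, from_index
    )
    return {"q": local_params + inner_query, "rows": rows, "wt": "json"}


# Hypothetical names: both indexes share a "group" field.
params = build_join_params("group", "group", "group_disease", "disease:disease1")
query_string = urlencode(params)  # ready to append to a /select URL
```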

So far, I have been able to create a test Solr index with 100k dynamic fields
in each document. I have yet to test it further. It took almost 1.5 hours to
create the index for 1500 groups, each group having almost 90k dynamic fields.
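
As a rough back-of-the-envelope check on these numbers (treating the "almost"
figures as exact, which is an approximation), the total number of field
values and the indexing rate work out as below. The total is also the number
of documents a fully dense one-document-per-cell layout would need, which
falls inside the 50M to 300M range Erick mentions in the quoted reply.

```python
# Approximate figures from the test run described above.
groups = 1500
fields_per_group = 90_000
index_seconds = 1.5 * 60 * 60  # 1.5 hours

total_values = groups * fields_per_group          # cells in the matrix
values_per_second = total_values / index_seconds  # average indexing rate

print(total_values, values_per_second)
```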

I also added a doc_static field, which copies the full set of integers from
the *_disease fields (via copyField) into this field. While querying, I use
only this field to retrieve.
If there are any better approaches, please let me know.
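
For comparison, the one-document-per-cell layout suggested in the quoted
replies below could be sketched roughly like this. It is only an
illustration: the field names (id, group, disease, occurrence) and the set of
values treated as "absent" are assumptions, and the resulting dicts would
still have to be sent to Solr by whatever client is in use.

```python
def matrix_to_docs(matrix, absent=("not found", "na", "-", "", None)):
    """Flatten a {group: {disease: occurrence}} matrix into sparse documents.

    Cells whose value is in `absent` produce no document, so a mostly empty
    matrix yields few documents. The id is the group and disease joined by "-".
    """
    docs = []
    for group, row in matrix.items():
        for disease, occurrence in row.items():
            if occurrence in absent:
                continue  # sparse: skip diseases not present for this group
            docs.append({
                "id": "{}-{}".format(group, disease),
                "group": group,
                "disease": disease,
                "occurrence": occurrence,
            })
    return docs


def row_for_group(docs, group):
    """Rebuild one matrix row from the documents matching a group."""
    return {d["disease"]: d["occurrence"] for d in docs if d["group"] == group}


matrix = {
    "group1": {"disease1": "exist", "disease2": "slight", "disease3": "not found"},
    "group2": {"disease1": "slight", "disease2": "not found", "disease3": "exist"},
}
docs = matrix_to_docs(matrix)
row = row_for_group(docs, "group1")
```

With this layout, adding a new disease means adding new small documents
rather than rewriting every group document.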

Thanks - David






On Sun, Oct 13, 2013 at 6:37 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Yeah, something like that. The key or ID field would probably just be the
> composition of the group and disease fields.
>
> The other thing is if occurrence is simply a boolean, omit it and omit the
> document if that disease is not present for that group. If the majority of
> the diseases are not present for a specified group, that would eliminate a
> lot of documents. Or if occurrence is not a boolean, keep the field, but
> again not add a document if the disease is not present for that group.
>
> My usual, over-generalized rule for dynamic fields is that they are a
> powerful tool, but only if used in moderation. "Millions" would not be
> moderation.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Lee Carroll
> Sent: Sunday, October 13, 2013 8:35 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Storing 2 dimension array in Solr
>
> I think he means a doc for each element, so you have a disease occurrence
> index
>
> <doc>
> <group>1</group>
> <dis>1</dis>
> <occurrence>exist</occurrence>
> <uniqueField>1-1</uniqueField>
> </doc>
>
> assuming (and it's a pretty fair assumption?) most groups have only a subset
> of diseases, this will be a sparse matrix, so just don't index
> the occurrence value "does not exist"
>
> basically denormalize via adding fields which don't relate to the key.
>
> This will work fine on modest hardware and with no thought to performance
> for <5 million docs. It will work fine with some thought and hardware for
> very large numbers. It's worth a go anyway just to test. It should probably
> be your first method to try out.
>
>
>
>
> On 13 October 2013 12:10, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> This sounds like a denormalization issue. Don't be afraid <G>.
>>
>> Actually, I've seen from 50M to 300M small docs on a Solr node,
>> depending on query type, hardware, etc. So that gives you a
>> place to start being cautious about the number of docs in your
>> system. If the full expansion of your table numbers in that range,
>> you might be just fine denormalizing the data.
>>
>> Alternatively, there's the "pseudo join" capability to consider. I'm
>> usually hesitant to recommend that, but Joel is committing some
>> really interesting stuff in the join area which you might take a look
>> at if the existing pseudo-join isn't performant enough.
>>
>> But I'd consider denormalizing the data as the first approach.
>>
>> Best,
>> Erick
>>
>>
>> On Sun, Oct 13, 2013 at 8:07 AM, David Philip
>> <davidphilipshe...@gmail.com> wrote:
>>
>> > Hi Jack, for the point: "each element of the array as a solr document,
>> > with a group field and a disease field"
>> > Did you mean it this way:
>> >
>> > <doc>
>> >   "group1_grp": G1
>> >  "disease1_d": 2,
>> >  "disease2_d": 3,
>> > </doc>
>> > <doc>
>> >   "group1_grp": G2
>> >  "disease1_d": 2,
>> >  "disease2_d": 3,
>> > "disease3_d":  1,
>> > "disease4_d":  1,
>> > </doc>
>> > similar to the first case: having dynamic fields for diseases?
>> > Will it be a performance issue if the disease fields increase to millions?
>> >
>> >
>> > On Sun, Oct 13, 2013 at 9:00 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>> >
>> > > You may be better off indexing each element of the array as a solr
>> > > document, with a group field and a disease field. Then you can easily
>> > > and efficiently add new diseases. Then to query a row, you query for
>> > > the group field having the desired group.
>> > >
>> > > If possible, index the array as being sparse - no document for a
>> > > disease if it is not present for that group.
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > -----Original Message----- From: David Philip
>> > > Sent: Saturday, October 12, 2013 9:56 PM
>> > > To: solr-user@lucene.apache.org
>> > > Subject: Re: Storing 2 dimension array in Solr
>> > >
>> > >
>> > > Hi Erick, Yes it is. But the columns here are dynamically and very
>> > > frequently added. They can increase up to 1 million right now. So, 1
>> > > document with 1 million dynamic fields, is it fine? Or any other approach?
>> > >
>> > > While searching through the web, I found that docValues are column
>> > > oriented.
>> > > http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
>> > > However, I did not understand how to use docValues to add these
>> > > columns.
>> > >
>> > > What is the recommended approach?
>> > >
>> > > Thanks - David
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Sun, Oct 13, 2013 at 3:33 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> > >
>> > >> Isn't this just indexing each row as a separate document
>> > >> with a suitable ID "groupN" in your example?
>> > >>
>> > >>
>> > >> On Sat, Oct 12, 2013 at 2:43 PM, David Philip
>> > >> <davidphilipshe...@gmail.com> wrote:
>> > >>
>> > >> > Hi Erick,
>> > >> >
>> > >> >    We have a set of groups as represented below. New columns
>> > >> > (diseases, as in the matrix below) keep coming and we need to add
>> > >> > them as new columns. For each such column, we have values such as
>> > >> > 1 or 2 or 3 or 4 (exist, slight, na, notfound) for the respective
>> > >> > groups.
>> > >> >
>> > >> > While querying we need to get the entire row for group:"group1". We
>> > >> > will not be searching on the column (*_disease) values; index=false
>> > >> > but stored is true.
>> > >> >
>> > >> > For example: we get group:"group1" and we need to get the entire row -
>> > >> > exist, slight, not found. Hoping this explanation is clearer.
>> > >> >
>> > >> >           disease1   disease2    disease3
>> > >> > group1    exist      slight      not found
>> > >> > group2    slight     not found   exist
>> > >> > group3    slight     exist
>> > >> > groupK    -          na          exist
>> > >> >
>> > >> >
>> > >> >
>> > >> > Thanks - David
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Sat, Oct 12, 2013 at 11:39 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> > >> >
>> > >> > > David:
>> > >> > >
>> > >> > > This feels like it may be an XY problem. _Why_ do you
>> > >> > > want to store a 2-dimensional array and what
>> > >> > > do you want to do with it? Maybe there are better
>> > >> > > approaches.
>> > >> > >
>> > >> > > Best
>> > >> > > Erick
>> > >> > >
>> > >> > >
>> > >> > > On Sat, Oct 12, 2013 at 2:07 AM, David Philip
>> > >> > > <davidphilipshe...@gmail.com> wrote:
>> > >> > >
>> > >> > > > Hi,
>> > >> > > >
>> > >> > > >   I have a 2 dimension array and want it to be persisted in solr.
>> > >> > > > How can I do that?
>> > >> > > >
>> > >> > > > Sample case:
>> > >> > > >
>> > >> > > >           disease1   disease2    disease3
>> > >> > > > group1    exist      slight      not found
>> > >> > > > group2    slight     not found   exist
>> > >> > > > group3    slight     exist
>> > >> > > >
>> > >> > > > exist = 1, not found = 2, slight = 3 ... can be stored like this also.
>> > >> > > >
>> > >> > > > Note: This array has frequent updates. Every time a new disease gets
>> > >> > > > added, I have to add a description of that disease to all groups.
>> > >> > > > And at query time, I will do a get by row - get by group only, e.g.
>> > >> > > > the group = group2 row.
>> > >> > > >
>> > >> > > > Any suggestion on how I can achieve this? I am thankful to the
>> > >> > > > forum for replying with patience; on achieving this, I will blog
>> > >> > > > and will share it with all.
>> > >> > > >
>> > >> > > > Thanks - David
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >>
>> > >
>> >
>>
>>
>
