Hi, I will check for pseudo join.
Jack, I have doubts about further de-normalization. I will take up the rest
of the points you mentioned. Thank you.

Basically, we have 2 different Solr indexes. One table is rarely updated, but
this group-disease table is updated frequently and new diseases are added
very often, so we maintain them separately. While querying, we need a join
operation on table 1 and table 2.

So far, I have created a test Solr index with 100k dynamic fields in each
document; I am yet to test further. It took almost 1.5 hours to create the
index for 1500 groups, each group having almost 90k dynamic fields. I also
added a doc_static field, which copies all the integer values from the
disease fields (copy fields) into this one field. While querying, I use only
this field to retrieve.

If there are any better approaches, please let me know.

Thanks - David

On Sun, Oct 13, 2013 at 6:37 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Yeah, something like that. The key or ID field would probably just be the
> composition of the group and disease fields.
>
> The other thing is if occurrence is simply a boolean, omit it and omit the
> document if that disease is not present for that group. If the majority of
> the diseases are not present for a specified group, that would eliminate a
> lot of documents. Or if occurrence is not a boolean, keep the field, but
> again not add a document if the disease is not present for that group.
>
> My usual, over-generalized rule for dynamic fields is that they are a
> powerful tool, but only if used in moderation. "Millions" would not be
> moderation.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Lee Carroll
> Sent: Sunday, October 13, 2013 8:35 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Storing 2 dimension array in Solr
>
> I think he means a doc for each element. so you have a disease occurrence
> index
>
> <doc>
> <group>1</group>
> <dis>1</dis>
> <occurrence>exist</occurrence>
> <uniqueField>1-1</uniqueField>
> </doc>
>
> assuming (and it's a pretty fair assumption?)
> most groups have only a subset
> of diseases this will be a sparse matrix so just don't index
> the occurrence value "does not exist"
>
> basically denormalize via adding fields which don't relate to the key.
>
> This will work fine on modest hardware and no thought to performance for <5
> million docs. It will work fine with some thought and hardware for very
> large numbers. It's worth a go anyway just to test. It should probably be
> your first method to try out.
>
> On 13 October 2013 12:10, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> This sounds like a denormalization issue. Don't be afraid <G>.
>>
>> Actually, I've seen from 50M to 300M small docs on a Solr node,
>> depending on query type, hardware, etc. So that gives you a
>> place to start being cautious about the number of docs in your
>> system. If the full expansion of your table numbers in that range,
>> you might be just fine denormalizing the data.
>>
>> Alternatively, there's the "pseudo join" capability to consider. I'm
>> usually hesitant to recommend that, but Joel is committing some
>> really interesting stuff in the join area which you might take a look
>> at if the existing pseudo-join isn't performant enough.
>>
>> But I'd consider denormalizing the data as the first approach.
>>
>> Best,
>> Erick
>>
>> On Sun, Oct 13, 2013 at 8:07 AM, David Philip
>> <davidphilipshe...@gmail.com> wrote:
>>
>> > Hi Jack, for the point: "each element of the array as a solr document,
>> > with a group field and a disease field"
>> > Did you mean it this way:
>> >
>> > <doc>
>> > "group1_grp": G1
>> > "disease1_d": 2,
>> > "disease2_d": 3,
>> > </doc>
>> > <doc>
>> > "group1_grp": G2
>> > "disease1_d": 2,
>> > "disease2_d": 3,
>> > "disease3_d": 1,
>> > "disease4_d": 1,
>> > </doc>
>> > similar to first case: having dynamic fields for disease?
>> > Will it be a performance issue if the disease fields increase to millions?
>> >
>> > On Sun, Oct 13, 2013 at 9:00 AM, Jack Krupansky
>> > <j...@basetechnology.com> wrote:
>> >
>> > > You may be better off indexing each element of the array as a solr
>> > > document, with a group field and a disease field. Then you can easily
>> > > and efficiently add new diseases. Then to query a row, you query for
>> > > the group field having the desired group.
>> > >
>> > > If possible, index the array as being sparse - no document for a
>> > > disease if it is not present for that group.
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > -----Original Message----- From: David Philip
>> > > Sent: Saturday, October 12, 2013 9:56 PM
>> > > To: solr-user@lucene.apache.org
>> > > Subject: Re: Storing 2 dimension array in Solr
>> > >
>> > > Hi Erick, Yes it is. But the columns here are dynamically and very
>> > > frequently added. They can increase up to 1 million right now. So, 1
>> > > document with 1 million dynamic fields, is it fine? Or any other
>> > > approach?
>> > >
>> > > While searching through the web, I found that docValues are column
>> > > oriented.
>> > > http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
>> > > However, I did not understand how to use docValues to add these
>> > > columns.
>> > >
>> > > What is the recommended approach?
>> > >
>> > > Thanks - David
>> > >
>> > > On Sun, Oct 13, 2013 at 3:33 AM, Erick Erickson
>> > > <erickerick...@gmail.com> wrote:
>> > >
>> > >> Isn't this just indexing each row as a separate document
>> > >> with a suitable ID "groupN" in your example?
>> > >>
>> > >> On Sat, Oct 12, 2013 at 2:43 PM, David Philip
>> > >> <davidphilipshe...@gmail.com> wrote:
>> > >>
>> > >> > Hi Erick,
>> > >> >
>> > >> > We have set of groups as represented below. New columns (diseases as
>> > >> > in below matrix) keep coming and we need to add them as new column.
>> > >> > To that column, we have values such as 1 or 2 or 3 or 4 (exist,
>> > >> > slight, na, notfound) for respective groups.
>> > >> >
>> > >> > While querying we need to get the entire row for group:"group1". We
>> > >> > will not be searching on columns(*_disease) values, index=false but
>> > >> > stored is true.
>> > >> >
>> > >> > for ex: we use, get group:"group1" and we need to get the entire
>> > >> > row - exist, slight, not found. Hoping this explanation is clearer.
>> > >> >
>> > >> >          disease1  disease2   disease3
>> > >> > group1   exist     slight     not found
>> > >> > groups2  slight    not found  exist
>> > >> > group3   slight    exist
>> > >> > groupK   -         na         exist
>> > >> >
>> > >> > Thanks - David
>> > >> >
>> > >> > On Sat, Oct 12, 2013 at 11:39 PM, Erick Erickson
>> > >> > <erickerick...@gmail.com> wrote:
>> > >> >
>> > >> > > David:
>> > >> > >
>> > >> > > This feels like it may be an XY problem. _Why_ do you
>> > >> > > want to store a 2-dimensional array and what
>> > >> > > do you want to do with it? Maybe there are better
>> > >> > > approaches.
>> > >> > >
>> > >> > > Best
>> > >> > > Erick
>> > >> > >
>> > >> > > On Sat, Oct 12, 2013 at 2:07 AM, David Philip
>> > >> > > <davidphilipshe...@gmail.com> wrote:
>> > >> > >
>> > >> > > > Hi,
>> > >> > > >
>> > >> > > > I have a 2 dimension array and want it to be persisted in solr.
>> > >> > > > How can I do that?
>> > >> > > >
>> > >> > > > Sample case:
>> > >> > > >
>> > >> > > >          disease1  disease2   disease3
>> > >> > > > group1   exist     slight     not found
>> > >> > > > groups2  slight    not found  exist
>> > >> > > > group2   slight    exist
>> > >> > > >
>> > >> > > > exist - 1, not found - 2, slight - 3 .. can be stored like this
>> > >> > > > also.
>> > >> > > >
>> > >> > > > Note: This array has frequent updates. Every time a new disease
>> > >> > > > gets added, I have to add a description about that disease to
>> > >> > > > all groups. And at query time, I will do get by row - get by
>> > >> > > > group only, e.g. the group = group2 row.
>> > >> > > >
>> > >> > > > Any suggestion on how I can achieve this? I am thankful to the
>> > >> > > > forum for replying with patience. On achieving this, I will blog
>> > >> > > > and will share it with all.
>> > >> > > >
>> > >> > > > Thanks - David
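
For reference, a minimal sketch of the layout Jack and Lee describe in the thread: one document per (group, disease) pair, an ID composed of the two, and no document when the disease is absent for that group. It is plain Python with no Solr client. The field names `id`, `group`, `disease` and `occurrence`, the helper names, and the choice to treat only "not found" as absent are assumptions for illustration; the sample data follows the matrix in the thread, with the column placement of group3's two values guessed.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical in-memory form of the matrix: group -> {disease -> occurrence}.
Matrix = Dict[str, Dict[str, Optional[str]]]

SAMPLE: Matrix = {
    "group1": {"disease1": "exist", "disease2": "slight", "disease3": "not found"},
    "group2": {"disease1": "slight", "disease2": "not found", "disease3": "exist"},
    "group3": {"disease1": "slight", "disease2": "exist"},  # placement assumed
}


def to_sparse_docs(matrix: Matrix,
                   absent: Iterable[str] = ("not found",)) -> List[dict]:
    """Flatten the matrix into one dict per (group, disease) cell.

    Cells whose value is None or listed in `absent` produce no document,
    which keeps the index sparse.
    """
    absent_values = set(absent)
    docs = []
    for group, row in matrix.items():
        for disease, occurrence in row.items():
            if occurrence is None or occurrence in absent_values:
                continue  # sparse: skip diseases not present for this group
            docs.append({
                "id": f"{group}-{disease}",  # composite key of group and disease
                "group": group,
                "disease": disease,
                "occurrence": occurrence,
            })
    return docs


def row_for_group(docs: Iterable[dict], group: str) -> Dict[str, str]:
    """Rebuild one matrix row from the flattened documents."""
    return {d["disease"]: d["occurrence"] for d in docs if d["group"] == group}


if __name__ == "__main__":
    flattened = to_sparse_docs(SAMPLE)
    print(len(flattened), "documents")
    print(row_for_group(flattened, "group1"))
```

Each dict would map to one Solr document. Adding a new disease then means adding documents only for the groups where it occurs, and fetching a row becomes a query on the group field (the `group:"group1"` query from the thread) instead of reading one document with a very large number of dynamic fields.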