Thanks for the tip. I've re-posted on Lucene-Java Users. -Terry
Grant Ingersoll-6 wrote:
>
> I suggest you ask on the user mailing list (java-[EMAIL PROTECTED])
> as you are likely to get a lot more interest from others. java-dev is
> for discussing the internals of how Lucene works.
>
> Thanks,
> Grant
>
> On May 15, 2007, at 7:13 PM, dontspamterry wrote:
>
>> Hi all,
>>
>> I know this whole distinct query has been discussed a bunch of times for
>> various scenarios because I've been scouring the forums trying to find a
>> clue as to how I could solve my problem. I'm indexing a large set of
>> parent-child term relations (~1 million). The number of unique terms is
>> about ~570,000. Each relation is a document. Each term in a relation
>> contains all of the term's attributes. Effectively, a term's attributes
>> will be duplicated "x" number of times for the "x" number of relations it
>> participates in. For example, say I have the following term tree:
>>
>> A
>> |--B
>> |  |--E
>> |  |  |--H
>> |  |--F
>> |--C
>> |  |--G
>> |--D
>>
>> I would then have documents for:
>> A->B, A->C, A->D, B->E, (and so forth...)
>>
>> For all relations involving A, A's attributes will be duplicated in 3
>> separate documents.
>> For all relations involving B, B's attributes will be duplicated in 3
>> separate documents.
>> (you get the picture...)
>>
>> This index structure works great for queries which traverse up and down
>> the tree. However, I have a requirement where I would also like to do a
>> distinct query which returns the data for each unique term satisfying the
>> query. For example, say I have a query which returns all relations where
>> A or B is the parent (that would be 5 documents in total), but do a
>> distinct on the parent such that I get 2 documents back, one for A as the
>> parent (any 1 of the 3 matching docs) and the other where B is the parent
>> (any 1 of the 2 matching docs).
>> For this query, I don't care about the child information since I'm only
>> interested in retrieving the distinct parent terms. This query is
>> analogous to a 'select distinct <set of parent term attributes>'. I played
>> around with caching BitSets for the fields which I'd like to do a distinct
>> on, but given the amount of data, I run out of memory. I also took the
>> approach where I retrieve the bitset using a queryfilter and then process
>> each document id, hashing the field values on which I'm doing a distinct
>> to construct my distinct set. Problem with this is that I have tree
>> structures where a parent has over 100K children. Retrieving each doc for
>> this size is too time- and memory-consuming. Since I don't really want to
>> return that much data, I thought that I could use paging. The problem I
>> faced is that I do not know if a distinct value in the current query was
>> actually returned in some previous query for a previous page.
>>
>> Sorry for the long description, but wanted to make sure I explained it as
>> clearly as I could.
>>
>> -Terry
>> --
>> View this message in context:
>> http://www.nabble.com/Multi-field-distinct-query-tf3761682.html#a10633050
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
>
> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ

--
View this message in context: http://www.nabble.com/Multi-field-distinct-query-tf3761682.html#a10645430
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
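
For readers following the thread: the paging problem in the quoted message (not knowing whether a distinct value was already returned on an earlier page) can sometimes be sidestepped by paging over the distinct keys in sorted order and passing the last key of the previous page as a cursor. The following is only a minimal Java sketch of that idea under stated assumptions, not Lucene code and not something proposed in the thread. The names Relation, distinctPage, and afterKey are hypothetical, and the in-memory list stands in for whatever iteration over matching documents the index provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: paging over distinct parent keys with a sort-order cursor.
final class DistinctParents {

    /** One parent-child relation; in the thread's index each of these is a document. */
    public static final class Relation {
        public final String parent;
        public final String child;

        public Relation(String parent, String child) {
            this.parent = parent;
            this.child = child;
        }
    }

    /**
     * Returns up to pageSize distinct parent keys that sort strictly after
     * afterKey (pass null for the first page), in ascending order.
     * At most pageSize + 1 keys are held in memory at any time.
     */
    public static List<String> distinctPage(List<Relation> hits, String afterKey, int pageSize) {
        TreeSet<String> smallest = new TreeSet<>();
        for (Relation r : hits) {
            // Keys at or before the cursor belong to an earlier page, so skip them.
            if (afterKey != null && r.parent.compareTo(afterKey) <= 0) {
                continue;
            }
            smallest.add(r.parent); // duplicates collapse inside the set
            if (smallest.size() > pageSize) {
                smallest.pollLast(); // keep only the pageSize smallest keys
            }
        }
        return new ArrayList<>(smallest);
    }

    public static void main(String[] args) {
        // The five relations from the thread where A or B is the parent.
        List<Relation> hits = List.of(
                new Relation("A", "B"), new Relation("A", "C"), new Relation("A", "D"),
                new Relation("B", "E"), new Relation("B", "F"));

        List<String> page1 = distinctPage(hits, null, 1);
        System.out.println(page1); // prints [A]

        List<String> page2 = distinctPage(hits, page1.get(page1.size() - 1), 1);
        System.out.println(page2); // prints [B]
    }
}
```

Because each page only contains keys that sort after the cursor, no bookkeeping of previously returned values is needed. The trade-off in this sketch is that every page rescans all matching relations, so it bounds memory but not time.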