Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant
thanks ryan On Sat, Sep 25, 2010 at 5:42 PM, mark harwood <markharw...@yahoo.co.uk> wrote: >>>Both these on disk data structures and the ones in a B+ tree have seek >>>offsets >>>into files >>>that require disk seeks. And both could use document ids as key values. > > Yep. However my approach doesn't use a doc id as a key that is searched in any > B+ tree index (which involves disk seeks) - it is used as direct offset into a > file to get the pointer into a "links" data structure. > > > >>>But do these disk data structures support dynamic addition and deletion of >>>(larger >>>numbers of) document links? > > Yes, the slide deck I linked to shows how links (like documents) spend the > early > stages of life being merged frequently in the smaller, newer segments and over > time migrate into larger, more stable segments as part of Lucene transactions. > > That's the theory - I'm currently benchmarking an early prototype. > > > > ----- Original Message ---- > From: Paul Elschot <paul.elsc...@xs4all.nl> > To: dev@lucene.apache.org > Sent: Sat, 25 September, 2010 22:03:28 > Subject: Re: Document links > > Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood: >> My starting point in the solution I propose was to eliminate linking via any >>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph >>traversals have exponential numbers of links and so all these index disk seeks >>start to stack up. The solution I propose uses doc ids as more-or-less direct >>pointers into file structures avoiding any index lookup. >> I've started coding up some tests using the file structures I outlined and >> will >>compare that with a traditional key-based approach. > > Both these on disk data structures and the ones in a B+ tree have seek offsets > into files > that require disk seeks. And both could use document ids as key values. > > But do these disk data structures support dynamic addition and deletion of > (larger > numbers of) document links? > > B+ trees are a standard solution for problems like this one, and it would > probably > not be easy to outperform them. > It may be possible to improve performance of B+ trees somewhat by specializing > for the fairly simple keys that would be needed, and by encoding very short > lists of links > for a single document directly into a seek offset to avoid the actual seek, > but > that's > about it. > > Regards, > Paul Elschot > >> >> For reference - playing the "Kevin Bacon game" on a traditional Lucene index >> of >>IMDB data took 18 seconds to find a short path that Neo4j finds in 200 >>milliseconds on the same data (and this was a disk based graph of 3m nodes, >>10m >>edges). >> Going from actor->movies->actors->movies produces a lot of key lookups and >> the >>difference between key indexes and direct node pointers becomes clear. >> I know path finding analysis is perhaps not a typical Lucene application but >>other forms of link analysis e.g. recommendation engines require similar >>performance. >> >> Cheers >> Mark >> >> >> >> On 25 Sep 2010, at 11:41, Paul Elschot wrote: >> >> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood: >> >>>> While not exactly equivalent, it reminds me of our earlier discussion >>around >> >> >>>> "layered segments" for dealing with field updates >> >> >> >> Right. Fast discovery of document relations is a foundation on which lots >> >> of >> >> >> things like this can build. Relations can be given types to support a >> >> number >>of >> >> >> different use cases. >> > >> > How about using this (bsd licenced) tree as a starting point: >> > http://bplusdotnet.sourceforge.net/ >> > It has various keys: ao. byte array, String and long. >> > >> > A fixed size byte array as key seems to be just fine: two bytes for a field >>number, >> > four for the segment number and four for the in-segment document id. >> > The separate segment number would allow to minimize the updates >> > in the tree during merges. One could also use the normal doc id directly. >> > >> > The value could then be a similar to the key, but without >> > the field number, and with an indication of the direction of the link. >> > Or perhaps the direction of the link should be added to the key. >> > A link would be present twice, once for each direction. >> > Also both directions could have their own payloads. >> > >> > It could be put in its own file as a separate 'segment', or maybe >> > each segment could allow for allocation of a part of this tree. >> > >> > I like this somehow, in case it is done right one might never >> > need a relational database again. Well, almost... >> > >> > Regards, >> > Paul Elschot >> > >> > >> >> >> >> >> >> >> >> ----- Original Message ---- >> >> From: Grant Ingersoll <gsing...@apache.org> >> >> To: dev@lucene.apache.org >> >> Sent: Fri, 24 September, 2010 16:26:27 >> >> Subject: Re: Document links >> >> >> >> While not exactly equivalent, it reminds me of our earlier discussion >> >> around >> >> >> "layered segments" for dealing with field updates [1], [2], albeit this >> >> is a >>bit >> >> >> more generic since one could not only use the links for relating >> >> documents, >>but >> >> >> one could use "special" links underneath the covers in Lucene to >>maintain/mark >> >> >> which fields have been updated and then traverse to them. >> >> >> >> [1] >> >> >>http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384 >> >> >> >> >> [2] >> >> >>http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e >> >> >> >> >> >> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote: >> >> >> >>> This slideshow has a first-cut on the Lucene file format extensions >>required to >> >> >>> >> >>> support fast linking between documents: >> >>> >> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents >> >>> >> >>> >> >>> Interested in any of your thoughts. >> >>> >> >>> Cheers, >> >>> Mark >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> --------------------------------------------------------------------- >> >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> >>> For additional commands, e-mail: dev-h...@lucene.apache.org >> >>> >> >> >> >> -------------------------- >> >> Grant Ingersoll >> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8 >> >> >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> >> >> >> >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> >> >> >> >> > >> > --------------------------------------------------------------------- >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> > For additional commands, e-mail: dev-h...@lucene.apache.org >> > >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org