Hi Mark, Any update on this?
-Grant On Sep 25, 2010, at 5:42 PM, mark harwood wrote: >>> Both these on disk data structures and the ones in a B+ tree have seek >>> offsets >>> into files >>> that require disk seeks. And both could use document ids as key values. > > Yep. However my approach doesn't use a doc id as a key that is searched in > any > B+ tree index (which involves disk seeks) - it is used as direct offset into > a > file to get the pointer into a "links" data structure. > > > >>> But do these disk data structures support dynamic addition and deletion of >>> (larger >>> numbers of) document links? > > Yes, the slide deck I linked to shows how links (like documents) spend the > early > stages of life being merged frequently in the smaller, newer segments and > over > time migrate into larger, more stable segments as part of Lucene transactions. > > That's the theory - I'm currently benchmarking an early prototype. > > > > ----- Original Message ---- > From: Paul Elschot <paul.elsc...@xs4all.nl> > To: dev@lucene.apache.org > Sent: Sat, 25 September, 2010 22:03:28 > Subject: Re: Document links > > Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood: >> My starting point in the solution I propose was to eliminate linking via any >> type of key. Key lookups mean indexes and indexes mean disk seeks. Graph >> traversals have exponential numbers of links and so all these index disk >> seeks >> start to stack up. The solution I propose uses doc ids as more-or-less >> direct >> pointers into file structures avoiding any index lookup. >> I've started coding up some tests using the file structures I outlined and >> will >> compare that with a traditional key-based approach. > > Both these on disk data structures and the ones in a B+ tree have seek > offsets > into files > that require disk seeks. And both could use document ids as key values. > > But do these disk data structures support dynamic addition and deletion of > (larger > numbers of) document links? > > B+ trees are a standard solution for problems like this one, and it would > probably > not be easy to outperform them. > It may be possible to improve performance of B+ trees somewhat by specializing > for the fairly simple keys that would be needed, and by encoding very short > lists of links > for a single document directly into a seek offset to avoid the actual seek, > but > that's > about it. > > Regards, > Paul Elschot > >> >> For reference - playing the "Kevin Bacon game" on a traditional Lucene index >> of >> IMDB data took 18 seconds to find a short path that Neo4j finds in 200 >> milliseconds on the same data (and this was a disk based graph of 3m nodes, >> 10m >> edges). >> Going from actor->movies->actors->movies produces a lot of key lookups and >> the >> difference between key indexes and direct node pointers becomes clear. >> I know path finding analysis is perhaps not a typical Lucene application but >> other forms of link analysis e.g. recommendation engines require similar >> performance. >> >> Cheers >> Mark >> >> >> >> On 25 Sep 2010, at 11:41, Paul Elschot wrote: >> >>> Op vrijdag 24 september 2010 17:57:45 schreef mark harwood: >>>>>> While not exactly equivalent, it reminds me of our earlier discussion >> around >> >>>>>> "layered segments" for dealing with field updates >>>> >>>> Right. Fast discovery of document relations is a foundation on which lots >>>> of >> >>>> things like this can build. Relations can be given types to support a >>>> number >> of >> >>>> different use cases. >>> >>> How about using this (bsd licenced) tree as a starting point: >>> http://bplusdotnet.sourceforge.net/ >>> It has various keys: ao. byte array, String and long. >>> >>> A fixed size byte array as key seems to be just fine: two bytes for a field >> number, >>> four for the segment number and four for the in-segment document id. >>> The separate segment number would allow to minimize the updates >>> in the tree during merges. One could also use the normal doc id directly. >>> >>> The value could then be a similar to the key, but without >>> the field number, and with an indication of the direction of the link. >>> Or perhaps the direction of the link should be added to the key. >>> A link would be present twice, once for each direction. >>> Also both directions could have their own payloads. >>> >>> It could be put in its own file as a separate 'segment', or maybe >>> each segment could allow for allocation of a part of this tree. >>> >>> I like this somehow, in case it is done right one might never >>> need a relational database again. Well, almost... >>> >>> Regards, >>> Paul Elschot >>> >>> >>>> >>>> >>>> >>>> ----- Original Message ---- >>>> From: Grant Ingersoll <gsing...@apache.org> >>>> To: dev@lucene.apache.org >>>> Sent: Fri, 24 September, 2010 16:26:27 >>>> Subject: Re: Document links >>>> >>>> While not exactly equivalent, it reminds me of our earlier discussion >>>> around >> >>>> "layered segments" for dealing with field updates [1], [2], albeit this is >>>> a >> bit >> >>>> more generic since one could not only use the links for relating >>>> documents, >> but >> >>>> one could use "special" links underneath the covers in Lucene to >> maintain/mark >> >>>> which fields have been updated and then traverse to them. >>>> >>>> [1] >>>> >> http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384 >> >>>> >>>> [2] >>>> >> http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e >> >>>> >>>> >>>> On Sep 24, 2010, at 10:36 AM, mark harwood wrote: >>>> >>>>> This slideshow has a first-cut on the Lucene file format extensions >> required to >> >>>>> >>>>> support fast linking between documents: >>>>> >>>>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents >>>>> >>>>> >>>>> Interested in any of your thoughts. >>>>> >>>>> Cheers, >>>>> Mark >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>>>> For additional commands, e-mail: dev-h...@lucene.apache.org >>>>> >>>> >>>> -------------------------- >>>> Grant Ingersoll >>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8 >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: dev-h...@lucene.apache.org >>>> >>>> >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: dev-h...@lucene.apache.org >>>> >>>> >>>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: dev-h...@lucene.apache.org >>> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > -------------------------- Grant Ingersoll http://www.lucidimagination.com --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org