Re: Document links

Grant Ingersoll Wed, 20 Oct 2010 06:07:49 -0700

Hi Mark,

Any update on this?


-Grant

On Sep 25, 2010, at 5:42 PM, mark harwood wrote:

>>> Both these on disk data structures and the ones in a B+ tree have seek 
>>> offsets 
>>> into files
>>> that require disk seeks. And both could use document ids as key values.
> 
> Yep. However my approach doesn't use a doc id as a key that is searched in 
> any 
> B+ tree index (which involves disk seeks) - it is used as direct offset into 
> a 
> file to get the pointer into a "links" data structure. 
> 
> 
> 
>>> But do these disk data structures support dynamic addition and deletion of 
>>> (larger
>>> numbers of) document links?
> 
> Yes, the slide deck I linked to shows how links (like documents) spend the 
> early 
> stages of life being merged frequently in the smaller, newer segments and 
> over 
> time migrate into larger, more stable segments as part of Lucene transactions.
> 
> That's the theory - I'm currently benchmarking an early prototype.
> 
> 
> 
> ----- Original Message ----
> From: Paul Elschot <[email protected]>
> To: [email protected]
> Sent: Sat, 25 September, 2010 22:03:28
> Subject: Re: Document links
> 
> Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
>> My starting point in the solution I propose was to eliminate linking via any 
>> type of key. Key lookups mean indexes and indexes mean disk seeks. Graph 
>> traversals have exponential numbers of links and so all these index disk 
>> seeks 
>> start to stack up. The solution I propose uses doc ids as more-or-less 
>> direct 
>> pointers into file structures avoiding any index lookup.
>> I've started coding up some tests using the file structures I outlined and 
>> will 
>> compare that with a traditional key-based approach.
> 
> Both these on disk data structures and the ones in a B+ tree have seek 
> offsets 
> into files
> that require disk seeks. And both could use document ids as key values.
> 
> But do these disk data structures support dynamic addition and deletion of 
> (larger
> numbers of) document links?
> 
> B+ trees are a standard solution for problems like this one, and it would 
> probably
> not be easy to outperform them.
> It may be possible to improve performance of B+ trees somewhat by specializing
> for the fairly simple keys that would be needed, and by encoding very short 
> lists of links
> for a single document directly into a seek offset to avoid the actual seek, 
> but 
> that's
> about it.
> 
> Regards,
> Paul Elschot
> 
>> 
>> For reference - playing the "Kevin Bacon game" on a traditional Lucene index 
>> of 
>> IMDB data took 18 seconds to find a short path that Neo4j finds in 200 
>> milliseconds on the same data (and this was a disk based graph of 3m nodes, 
>> 10m 
>> edges).
>> Going from actor->movies->actors->movies produces a lot of key lookups and 
>> the 
>> difference between key indexes and direct node pointers becomes clear.
>> I know path finding analysis is perhaps not a typical Lucene application but 
>> other forms of link analysis e.g. recommendation engines require similar 
>> performance.
>> 
>> Cheers
>> Mark
>> 
>> 
>> 
>> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
>> 
>>> Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>>>>>> While not exactly equivalent, it reminds me of our earlier discussion 
>> around 
>> 
>>>>>> "layered segments" for dealing with field updates
>>>> 
>>>> Right. Fast discovery of document relations is a foundation on which lots 
>>>> of 
>> 
>>>> things like this can build. Relations can be given types to support a 
>>>> number 
>> of 
>> 
>>>> different use cases.
>>> 
>>> How about using this (bsd licenced) tree as a starting point:
>>> http://bplusdotnet.sourceforge.net/
>>> It has various keys: ao. byte array, String and long.
>>> 
>>> A fixed size byte array as key seems to be just fine: two bytes for a field 
>> number,
>>> four for the segment number and four for the in-segment document id.
>>> The separate segment number would allow to minimize the updates
>>> in the tree during merges. One could also use the normal doc id directly.
>>> 
>>> The value could then be a similar to the key, but without
>>> the field number, and with an indication of the direction of the link.
>>> Or perhaps the direction of the link should be added to the key.
>>> A link would be present twice, once for each direction.
>>> Also both directions could have their own payloads.
>>> 
>>> It could be put in its own file as a separate 'segment', or maybe
>>> each segment could allow for allocation of a part of this tree.
>>> 
>>> I like this somehow, in case it is done right one might never
>>> need a relational database again. Well, almost...
>>> 
>>> Regards,
>>> Paul Elschot
>>> 
>>> 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----
>>>> From: Grant Ingersoll <[email protected]>
>>>> To: [email protected]
>>>> Sent: Fri, 24 September, 2010 16:26:27
>>>> Subject: Re: Document links
>>>> 
>>>> While not exactly equivalent, it reminds me of our earlier discussion 
>>>> around 
>> 
>>>> "layered segments" for dealing with field updates [1], [2], albeit this is 
>>>> a 
>> bit 
>> 
>>>> more generic since one could not only use the links for relating 
>>>> documents, 
>> but 
>> 
>>>> one could use "special" links underneath the covers in Lucene to 
>> maintain/mark 
>> 
>>>> which fields have been updated and then traverse to them.
>>>> 
>>>> [1] 
>>>> 
>> http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>> 
>>>> 
>>>> [2] 
>>>> 
>> http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>> 
>>>> 
>>>> 
>>>> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>>>> 
>>>>> This slideshow has a first-cut on the Lucene file format extensions 
>> required to 
>> 
>>>>> 
>>>>> support fast linking between documents:
>>>>> 
>>>>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>>>>> 
>>>>> 
>>>>> Interested in any of your thoughts.
>>>>> 
>>>>> Cheers,
>>>>> Mark
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>> 
>>>> 
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>> 
>>>> 
>>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>> 
>> 
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Document links

Reply via email to