[jira] [Updated] (JENA-119) Eliminate memory bounds during query execution

Stephen Allen (Updated) (JIRA) Thu, 29 Sep 2011 15:58:09 -0700

     [ 
https://issues.apache.org/jira/browse/JENA-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Stephen Allen updated JENA-119:
-------------------------------

    Attachment: JENA-119-r1177090-Fuseki-Construct.patch
                JENA-119-r1177452-ARQ-Construct.patch


I've added support for non-memory bound CONSTRUCT queries.  This is needed 
because CONSTRUCT must remove duplicates from the set of triples it returns.  I 
did this by creating two GraphDataBags that extend the Graph interface.  The 
GraphDistinctDataBag is the one used for removing duplicates in the CONSTRUCT.

I've also attached a patch that updates Fuseki to use this new Graph.

There are a few limitations to DataBag backed Graphs:
  1) Cannot add any triples after you call find() unless you first call clear() 
or getBulkUpdateHandler().removeAll().
  2) Cannot remove any triples except by calling clear() or 
getBulkUpdateHandler().removeAll().
  3) There is no indexing, so find() will always scan all triples.
  4) The size() method is not guaranteed to be accurate, treat it as an 
estimate.
  5) You must call close() in order to release any resources (such as spill 
files).

But the big plus is that it can spill to disk.  Also it should be faster for 
situations that do all their writing before reading, and that need to retrieve 
all of the triples.

--

Other notes:

I am consolidating the previously separate "spillOnDiskSortingThreshold" and 
"spillOnDiskUpdateThreshold".  As we add more operators from JENA-119, it 
doesn't seem to make sense to have a separate symbol for each one.

Also, I changed the symbol name from "spillOnDiskThreshold" to "workCount".  I 
think it better captures the intent instead of the particular way the 
implementation behaves.  What we want to specify is the maximum number of 
bindings that can reside in memory, not the fact that the current 
implementations do that by spilling to disk after they pass a threshold.  
Indeed, a better implementation could put the excess bindings into a temporary 
disk buffer (which may mean they might not even touch the disk if there is free 
memory in the system) [1].

Additionally, in the future I think we may want to add "workMem" to specify the 
amount of working memory that can be used by the operators instead of a count.  
This more closely aligns it with the actual resource that we are try to 
conserve.  This requires the ability to generate a good estimate of the binding 
sizes in memory.


[1] If this isn't a good name, we should think of one and finalize it, since it 
is a user-facing configuration option that we don't want to have to change in 
the future.

                
> Eliminate memory bounds during query execution
> ----------------------------------------------
>
>                 Key: JENA-119
>                 URL: https://issues.apache.org/jira/browse/JENA-119
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Stephen Allen
>         Attachments: JENA-119-r1177090-Fuseki-Construct.patch, 
> JENA-119-r1177452-ARQ-Construct.patch
>
>
> It would be nice to eliminate all memory bounds on queries.  Similar to 
> JENA-44, it would involve modifying all of the QueryIterator objects that 
> maintain unbounded collections of Bindings.
> The ones I've identified (let me know if I've missed any):
> + QueryIterSort
>       Complete!
> + QueryIterGroup
>       Probably one of the more complicated implementations.  I think it can 
> be done with a DistinctDataBag.
> + QueryIterDistinct
>       Can be implemented trivially using DistinctDataBag, but would lose 
> streaming capability.  We could do streaming just until the first spill, 
> which would be a little more difficult but not bad.  If we wanted streaming 
> even after spilling, then we would need an on-disk hashtable or b-tree (which 
> could get expensive for maybe limited benefit, do you really need streaming 
> after 10,000 results?).
> + QueryIteratorCopy
>     Only appears to be used QueryIterService.  Simple implementation using 
> DefaultDataBag.
> + QueryIteratorCaching
>       Does not match DataBag's assumption of completing all writes before 
> iterating.  But it isn't used anywhere, so maybe we remove it?
> + QueryIterDiff
> + QueryIterMinus
>       Both of these materialize the RHS into a collection.  Can be 
> implemented with DefaultDataBag.  As an aside, is this necessary to do for 
> all queries?  What if the RHS is cheap (i.e. a single TriplePattern)?
> + QueryIterJoin
> + QueryIterLeftJoin
>      Both materialize RHS.  Are they used anywhere?  I was under the 
> impression that ARQ only considered left-deep plans with indexed joins on the 
> RHS TriplePatterns.
> + SubQueries
>      I'm not sure how this is handled.  Are these materialized somewhere?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-119) Eliminate memory bounds during query execution

Reply via email to