[jira] [Commented] (JENA-99) Spill to disk data bags

Stephen Allen (JIRA) Tue, 16 Aug 2011 09:07:50 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085797#comment-13085797
 ]


Stephen Allen commented on JENA-99:
-----------------------------------

1/ The background sort/write-to-disk could be added back to the SortedDataBag 
at the cost of additional complexity: cancellation becomes harder, and memory 
usage could grow by a factor of up to the number of worker threads.  Also, I 
suspect (but have no evidence) that the time spent sorting and sequentially 
writing a file to disk will be small compared with the time spent retrieving 
the bindings from the source iterator.

a) The bags should work with in-memory datasets; no files are created until the 
threshold is passed.  A few options: use Long.MAX_VALUE as the threshold count 
(no changes to code); designate -1 as "never spill"; or create a new policy 
object, ThresholdPolicyNever.  The -1 option might be the easiest for setting 
up the config file.
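
As a rough illustration of the -1 option, here is a minimal sketch of a 
count-based policy.  The interface shape, the class name CountThresholdPolicy, 
and the constant NEVER_SPILL are assumptions made for this example, not the 
API in the attached patch.

```java
// Hypothetical sketch: names and signatures are illustrative only.
interface ThresholdPolicy<T> {
    void increment(T item);          // called once per item added to the bag
    boolean isThresholdExceeded();   // true when the bag should spill
    void reset();                    // called after a spill
}

final class CountThresholdPolicy<T> implements ThresholdPolicy<T> {
    /** Assumed sentinel meaning "never spill". */
    static final long NEVER_SPILL = -1L;

    private final long threshold;
    private long count = 0;

    CountThresholdPolicy(long threshold) {
        if (threshold < NEVER_SPILL) {
            throw new IllegalArgumentException("threshold must be >= -1");
        }
        this.threshold = threshold;
    }

    @Override
    public void increment(T item) {
        count++;
    }

    @Override
    public boolean isThresholdExceeded() {
        // With the sentinel, the bag stays in memory regardless of size.
        return threshold != NEVER_SPILL && count > threshold;
    }

    @Override
    public void reset() {
        count = 0;
    }
}
```

With this shape a config file only needs to carry a single number, and no 
extra policy class has to be named in it.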

A/  You're right about BindingComparator.compareBindingsSyntactic().  Sorting 
each binding's variables for every comparison is going to be quite expensive.  
I think your suggestions make sense.
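
One way to avoid the per-comparison sort, sketched under assumptions: fix the 
variable order once when the comparator is built, then reuse it for every 
comparison.  Here a Map<String, String> stands in for a binding, and the class 
name FixedOrderComparator and the rule that an unbound variable sorts first 
are made up for this example; this is not necessarily what was suggested in 
the review.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map<String, String> stands in for an ARQ binding.
final class FixedOrderComparator implements Comparator<Map<String, String>> {
    private final List<String> vars;

    FixedOrderComparator(Collection<String> variableNames) {
        // Sort the variable names once, up front, instead of on each compare().
        List<String> sorted = new ArrayList<>(variableNames);
        Collections.sort(sorted);
        this.vars = sorted;
    }

    @Override
    public int compare(Map<String, String> a, Map<String, String> b) {
        for (String v : vars) {
            String x = a.get(v);
            String y = b.get(v);
            if (x == null && y == null) continue;
            if (x == null) return -1;   // assumption: unbound sorts first
            if (y == null) return 1;
            int c = x.compareTo(y);
            if (c != 0) return c;
        }
        return 0;
    }
}
```

The comparator covers every listed variable, so it gives a total order over 
the whole binding rather than only over the ORDER BY variables.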

As background, I had to modify it because by default DistinctDataBag does not 
use any SortConditions.  Also, we will need a stable sort on the entire binding 
set, not just the ORDER BY variables, if we are to do optimizations like 
JENA-90.

B/ SerializationFactoryFinder is used to build the actual SerializationFactory 
instances (JENA-44 uses a Binding factory, while JENA-45 uses Binding and 
Triple factories).

C/ Yes, it makes more sense there.

D/ Yeah, I noticed the other Tuple object and meant to change it, but forgot.  
It also needs to be changed in DataBag.java.

> Spill to disk data bags
> -----------------------
>
>                 Key: JENA-99
>                 URL: https://issues.apache.org/jira/browse/JENA-99
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Stephen Allen
>         Attachments: JENA-99-r1157891.patch
>
>
> For certain query operations, ARQ needs to store a large number of tuples 
> temporarily.  Currently these are stored in Java Collections; however, for 
> large result sets the system can exhaust the available memory.  There is a 
> need for a set of generic data structures that can hold these tuples and 
> spill to disk if they get too large.
> ==
> The design is inspired by Apache Pig's DataBag [1]:
> A DataBag is a collection of tuples. A DataBag may or may not fit into 
> memory. It proactively spills to disk when its size exceeds the threshold. 
> When it spills, it takes whatever it has in memory, opens a spill file, and 
> writes the contents out. This may happen multiple times. The bag tracks all 
> of the files it's spilled to. The spill behavior is controlled by a 
> ThresholdPolicy object.  The most basic policy spills based on the number of 
> tuples added.  A more advanced policy is to estimate the size of all the 
> tuples added to the DataBag and spill when it passes a byte threshold.
> A DataBag provides an Iterator interface that allows callers to read through 
> the contents. The iterators are aware of the data spilling. They have to be 
> able to handle reading from the spill files. 
> The DataBag interface assumes that all data is written before any is read. 
> That is, a DataBag cannot be used as a queue. If data is written after data 
> is read, the results are undefined.
> DataBags come in several types: default, sorted, and distinct. The type must 
> be chosen up front; there is no way to convert a bag on the fly. Default data 
> bags do not guarantee any particular order of retrieval for the tuples and 
> may contain duplicate tuples. Sorted data bags guarantee that tuples will be 
> retrieved in order, where "in order" is defined either by the default 
> comparator for the tuple or the comparator provided by the caller when the 
> bag was created. Sorted bags may contain duplicates. Distinct bags do not 
> guarantee any particular order of retrieval, but do guarantee that they will 
> not contain duplicate tuples. 
> The DataBags are generic containers and may store any item that can be 
> serialized and deserialized.  Each bag accepts a SerializationFactory that 
> handles this task.
> [1] http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/data/DataBag.html
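
The spill behaviour described in the quoted summary can be sketched roughly as 
follows.  This is a simplified illustration, not the code in the attached 
patch: the class name SpillingBag is made up, items are plain strings written 
with DataOutputStream rather than through a SerializationFactory, the 
threshold is a fixed count rather than a ThresholdPolicy, and readAll() loads 
everything eagerly where a real bag would stream through an iterator.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a default (unordered, duplicates allowed) bag.
final class SpillingBag implements AutoCloseable {
    private final int maxInMemory;
    private final List<String> memory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();

    SpillingBag(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    void add(String item) {
        memory.add(item);
        if (memory.size() > maxInMemory) spill();
    }

    // Write everything held in memory to a new spill file and track the file.
    private void spill() {
        try {
            Path file = Files.createTempFile("bag", ".spill");
            spillFiles.add(file);
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(file)))) {
                out.writeInt(memory.size());
                for (String s : memory) out.writeUTF(s);
            }
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int spillFileCount() {
        return spillFiles.size();
    }

    // Read spilled items first, then whatever is still in memory.
    List<String> readAll() {
        List<String> all = new ArrayList<>();
        for (Path file : spillFiles) {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(Files.newInputStream(file)))) {
                int n = in.readInt();
                for (int i = 0; i < n; i++) all.add(in.readUTF());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        all.addAll(memory);
        return all;
    }

    @Override
    public void close() {
        try {
            for (Path file : spillFiles) Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spillFiles.clear();
        memory.clear();
    }
}
```

As in the description, all add() calls are expected to happen before 
readAll(); the sketch makes no attempt to support interleaved writes and 
reads.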

