[jira] [Created] (JENA-99) Spill to disk data bags

Stephen Allen (JIRA) Mon, 15 Aug 2011 12:25:51 -0700

Spill to disk data bags
-----------------------

                 Key: JENA-99
                 URL: https://issues.apache.org/jira/browse/JENA-99
             Project: Jena
          Issue Type: New Feature
          Components: ARQ
            Reporter: Stephen Allen



For certain query operations, ARQ needs to store a large number of tuples 
temporarily.  Currently these are stored in Java Collections; for large 
result sets, however, the system can exhaust the available memory.  There is 
a need for a set of generic data structures that can hold these tuples and 
spill to disk if they get too large.

==

The design is inspired by Apache Pig's DataBag [1]:

A DataBag is a collection of tuples. A DataBag may or may not fit into memory. 
It proactively spills to disk when its size exceeds the threshold. When it 
spills, it takes whatever it has in memory, opens a spill file, and writes the 
contents out. This may happen multiple times. The bag tracks all of the files 
it's spilled to. The spill behavior is controlled by a ThresholdPolicy object.  
The most basic policy spills based on the number of tuples added.  A more 
advanced policy is to estimate the size of all the tuples added to the DataBag 
and spill when it passes a byte threshold.
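
As a rough illustration of the count-based policy described above, a minimal 
sketch might look like the following. Only the name ThresholdPolicy comes from 
this proposal; the method names and the CountThresholdPolicy class are 
assumptions made for the example, not a settled API.

```java
// Hypothetical interface: only the name ThresholdPolicy comes from the proposal.
interface ThresholdPolicy<T> {
    void increment(T item);         // record that one more item was added
    boolean isThresholdExceeded();  // true when the bag should spill
    void reset();                   // called after the bag has spilled
}

// Hypothetical basic policy: spill once more than `threshold` tuples are held.
final class CountThresholdPolicy<T> implements ThresholdPolicy<T> {
    private final long threshold;
    private long count = 0;

    CountThresholdPolicy(long threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.threshold = threshold;
    }

    @Override public void increment(T item) { count++; }

    @Override public boolean isThresholdExceeded() { return count > threshold; }

    @Override public void reset() { count = 0; }
}
```

A byte-estimating policy could implement the same interface by adding an 
estimated size per item in increment() instead of counting.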

A DataBag provides an Iterator interface that allows callers to read through 
the contents. The iterators are aware of the data spilling and must be able 
to read from the spill files. 

The DataBag interface assumes that all data is written before any is read. That 
is, a DataBag cannot be used as a queue. If data is written after data is read, 
the results are undefined.
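
The spill and read-back behaviour described above can be sketched as follows. 
This is an illustration under stated assumptions, not the proposed 
implementation: the class name SpillingBag is hypothetical, a fixed item count 
stands in for a ThresholdPolicy, and plain Java serialization stands in for a 
SerializationFactory. Where the description leaves write-after-read undefined, 
this sketch chooses to reject it.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical default bag: unordered, duplicates allowed, spills to temp files.
final class SpillingBag<T extends Serializable> implements Closeable {
    private final int maxInMemory;
    private final List<T> memory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();
    private boolean readStarted = false;

    SpillingBag(int maxInMemory) { this.maxInMemory = maxInMemory; }

    void add(T item) {
        if (readStarted) throw new IllegalStateException("add after read");
        memory.add(item);
        if (memory.size() > maxInMemory) spill();
    }

    // Writes everything currently in memory to a new spill file.
    private void spill() {
        try {
            Path file = Files.createTempFile("databag", ".spill");
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(file)))) {
                out.writeInt(memory.size());
                for (T item : memory) out.writeObject(item);
            }
            spillFiles.add(file);
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int spillCount() { return spillFiles.size(); }

    // Reads spill files one at a time, then the items still in memory.
    Iterator<T> iterator() {
        readStarted = true;
        return new Iterator<T>() {
            private int nextFile = 0;
            private boolean memoryDone = false;
            private Iterator<T> current = Collections.emptyIterator();

            @Override public boolean hasNext() {
                while (!current.hasNext()) {
                    if (nextFile < spillFiles.size()) {
                        current = load(spillFiles.get(nextFile++)).iterator();
                    } else if (!memoryDone) {
                        memoryDone = true;
                        current = memory.iterator();
                    } else {
                        return false;
                    }
                }
                return true;
            }

            @Override public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }

    private List<T> load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int n = in.readInt();
            List<T> items = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                @SuppressWarnings("unchecked")
                T item = (T) in.readObject();
                items.add(item);
            }
            return items;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("cannot read spill file " + file, e);
        }
    }

    // Deletes the spill files this bag created.
    @Override public void close() {
        try {
            for (Path f : spillFiles) Files.deleteIfExists(f);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spillFiles.clear();
        memory.clear();
    }
}
```

Each spill file holds at most maxInMemory + 1 items, so reading one file back 
at a time keeps memory use bounded during iteration.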

DataBags come in several types: default, sorted, and distinct. The type must be 
chosen up front; there is no way to convert a bag on the fly. Default data bags 
do not guarantee any particular order of retrieval for the tuples and may 
contain duplicate tuples. Sorted data bags guarantee that tuples will be 
retrieved in order, where "in order" is defined either by the default 
comparator for the tuple or the comparator provided by the caller when the bag 
was created. Sorted bags may contain duplicates. Distinct bags do not guarantee 
any particular order of retrieval, but do guarantee that they will not contain 
duplicate tuples. 
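
One common way to provide the sorted guarantee across several spill files is to 
sort each batch before it is written and then merge the sorted runs on read. 
The description above does not say how sorted bags are implemented, so the 
following is only a sketch of that technique: the class name SortedRuns is 
hypothetical, and in-memory lists stand in for spill files.

```java
import java.util.*;

// Hypothetical helper: k-way merge of runs that are each already sorted.
final class SortedRuns {
    static <T> List<T> merge(List<List<T>> runs, Comparator<? super T> cmp) {
        // Each heap entry is {run index, position within that run}.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> cmp.compare(runs.get(a[0]).get(a[1]),
                                  runs.get(b[0]).get(b[1])));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) heap.add(new int[] {r, 0});
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<T> run = runs.get(head[0]);
            out.add(run.get(head[1]));          // duplicates are kept
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return out;
    }
}
```

The comparator argument plays the role of the caller-supplied comparator 
mentioned above. A distinct bag could plausibly reuse the same merge and skip 
any item equal to the one just emitted, though that is an assumption rather 
than part of this proposal.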

The DataBags are generic containers and may store any item that can be 
serialized and deserialized.  Each bag accepts a SerializationFactory that 
handles this task.
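
To make the role of the SerializationFactory more concrete, here is one 
possible shape for it. Only the name SerializationFactory comes from this 
proposal; ItemSerializer, ItemDeserializer, StringSerializationFactory, 
RoundTrip, and all method names are assumptions made for the example.

```java
import java.io.*;

// Hypothetical helper interfaces for writing and reading a single item.
interface ItemSerializer<T> {
    void write(T item, DataOutput out) throws IOException;
}

interface ItemDeserializer<T> {
    T read(DataInput in) throws IOException;
}

// Hypothetical shape: the bag asks the factory for a matching pair.
interface SerializationFactory<T> {
    ItemSerializer<T> createSerializer();
    ItemDeserializer<T> createDeserializer();
}

// Example factory for String items using the DataOutput UTF encoding.
final class StringSerializationFactory implements SerializationFactory<String> {
    @Override public ItemSerializer<String> createSerializer() {
        return (item, out) -> out.writeUTF(item);
    }

    @Override public ItemDeserializer<String> createDeserializer() {
        return in -> in.readUTF();
    }
}

// Writes an item to a byte buffer and reads it back, as a spill would.
final class RoundTrip {
    static <T> T through(SerializationFactory<T> factory, T item) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            factory.createSerializer().write(item, new DataOutputStream(bytes));
            DataInput in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
            return factory.createDeserializer().read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping serialization behind a factory lets the bag stay generic: the bag 
decides when to spill, and the factory decides how an item becomes bytes.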


[1] http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/data/DataBag.html


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
