[jira] [Resolved] (JENA-99) Spill to disk data bags

Andy Seaborne (JIRA) Tue, 06 Sep 2011 12:47:33 -0700

     [ https://issues.apache.org/jira/browse/JENA-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andy Seaborne resolved JENA-99.
-------------------------------

    Resolution: Fixed

(didn't re-resolve it after fixing a comment)

> Spill to disk data bags
> -----------------------
>
>                 Key: JENA-99
>                 URL: https://issues.apache.org/jira/browse/JENA-99
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Stephen Allen
>            Assignee: Andy Seaborne
>         Attachments: JENA-99-r1157891.patch
>
>
> For certain query operations, ARQ needs to store a large number of tuples 
> temporarily.  Currently these are stored in Java Collections, but for 
> large result sets this can exhaust the available memory.  There is a 
> need for a set of generic data structures that can hold these tuples and 
> spill to disk if they get too large.
> ==
> The design is inspired by Apache Pig's DataBag [1]:
> A DataBag is a collection of tuples. A DataBag may or may not fit into 
> memory. It proactively spills to disk when its size exceeds the threshold. 
> When it spills, it takes whatever it has in memory, opens a spill file, and 
> writes the contents out. This may happen multiple times. The bag tracks all 
> of the files it's spilled to. The spill behavior is controlled by a 
> ThresholdPolicy object.  The most basic policy spills based on the number of 
> tuples added.  A more advanced policy is to estimate the size of all the 
> tuples added to the DataBag and spill when it passes a byte threshold.
> A DataBag provides an Iterator interface that allows callers to read through 
> the contents. The iterators are aware of the data spilling. They have to be 
> able to handle reading from the spill files. 
> The DataBag interface assumes that all data is written before any is read. 
> That is, a DataBag cannot be used as a queue. If data is written after data 
> is read, the results are undefined.
> DataBags come in three types: default, sorted, and distinct. The type must 
> be chosen up front; there is no way to convert a bag on the fly. Default data 
> bags do not guarantee any particular order of retrieval for the tuples and 
> may contain duplicate tuples. Sorted data bags guarantee that tuples will be 
> retrieved in order, where "in order" is defined either by the default 
> comparator for the tuple or the comparator provided by the caller when the 
> bag was created. Sorted bags may contain duplicates. Distinct bags do not 
> guarantee any particular order of retrieval, but do guarantee that they will 
> not contain duplicate tuples. 
> DataBags are generic containers and may store any item that can be 
> serialized and deserialized.  Each bag accepts a SerializationFactory that 
> handles this task.
> [1] http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/data/DataBag.html
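
To make the spill behavior described above concrete, here is a minimal sketch of a default (unsorted, duplicates allowed) bag with a count-based threshold. It is not the code from the attached patch and not Pig's API: the class name SpillingBag, its methods, and the use of Java serialization in place of a SerializationFactory are all assumptions made for illustration.

```java
import java.io.*;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch only: names and design are illustrative, not the ARQ or Pig API.
class SpillingBag<T extends Serializable> {
    private final int threshold;                       // spill once this many tuples are buffered
    private final List<T> memory = new ArrayList<>();  // tuples not yet spilled
    private final List<File> spillFiles = new ArrayList<>(); // every file spilled to so far

    SpillingBag(int threshold) { this.threshold = threshold; }

    // All adds must happen before the first read, as in the description above.
    void add(T item) {
        memory.add(item);
        if (memory.size() >= threshold) spill();
    }

    int spillCount() { return spillFiles.size(); }

    // Write the in-memory tuples to a new temp file, prefixed by their count.
    private void spill() {
        try {
            File f = Files.createTempFile("bag", ".spill").toFile();
            f.deleteOnExit();
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream(f)))) {
                out.writeInt(memory.size());
                for (T item : memory) out.writeObject(item);
            }
            spillFiles.add(f);
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Streams the spill files one at a time, then the in-memory remainder.
    Iterator<T> iterator() {
        return new Iterator<T>() {
            private int fileIndex = 0;
            private ObjectInputStream in = null;
            private int remaining = 0;                 // tuples left in the open file
            private final Iterator<T> mem = memory.iterator();

            // Open the next spill file when the current one is used up.
            private void advance() {
                try {
                    while (remaining == 0 && fileIndex < spillFiles.size()) {
                        if (in != null) in.close();
                        in = new ObjectInputStream(new BufferedInputStream(
                                new FileInputStream(spillFiles.get(fileIndex++))));
                        remaining = in.readInt();
                    }
                    if (remaining == 0 && in != null) { in.close(); in = null; }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override public boolean hasNext() {
                advance();
                return remaining > 0 || mem.hasNext();
            }

            @Override public T next() {
                advance();
                if (remaining > 0) {
                    try {
                        remaining--;
                        @SuppressWarnings("unchecked") T item = (T) in.readObject();
                        return item;
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    } catch (ClassNotFoundException e) {
                        throw new IllegalStateException(e);
                    }
                }
                return mem.next(); // throws NoSuchElementException when exhausted
            }
        };
    }
}
```

With a threshold of 3, adding eight items produces two spill files and leaves two items in memory; the iterator then returns all eight. A sorted or distinct bag would need more than this (for example, sorting each run before spilling and merging the runs on read), which the sketch does not attempt.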

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
