[
https://issues.apache.org/jira/browse/JENA-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andy Seaborne resolved JENA-99.
-------------------------------
Resolution: Fixed
(didn't re-resolve it after fixing a comment)
> Spill to disk data bags
> -----------------------
>
> Key: JENA-99
> URL: https://issues.apache.org/jira/browse/JENA-99
> Project: Jena
> Issue Type: New Feature
> Components: ARQ
> Reporter: Stephen Allen
> Assignee: Andy Seaborne
> Attachments: JENA-99-r1157891.patch
>
>
> For certain query operations, ARQ needs to store a large number of tuples
> temporarily. Currently these are stored in Java Collections; however, for
> large result sets the system can exhaust the available memory. There is a
> need for a set of generic data structures that can hold these tuples and
> spill to disk if they get too large.
> ==
> The design is inspired by Apache Pig's DataBag [1]:
> A DataBag is a collection of tuples. A DataBag may or may not fit into
> memory. It proactively spills to disk when its size exceeds the threshold.
> When it spills, it takes whatever it has in memory, opens a spill file, and
> writes the contents out. This may happen multiple times. The bag tracks all
> of the files it's spilled to. The spill behavior is controlled by a
> ThresholdPolicy object. The most basic policy spills based on the number of
> tuples added. A more advanced policy is to estimate the size of all the
> tuples added to the DataBag and spill when it passes a byte threshold.
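
A minimal sketch of the two policies described above, one counting tuples and one estimating bytes. The issue names ThresholdPolicy but not its methods, so the method names, the CountPolicy and ByteSizePolicy classes, and the UTF-8 size estimate are all assumptions made for illustration, not the API in the attached patch.

```java
import java.nio.charset.StandardCharsets;

/** Decides when a bag should spill. Method names are illustrative assumptions. */
interface ThresholdPolicy<T> {
    void increment(T item);         // record that one item was added
    boolean isThresholdExceeded();  // true once the bag should spill
    void reset();                   // called after a spill has emptied memory
}

/** Basic policy: spill once a fixed number of items has accumulated. */
final class CountPolicy<T> implements ThresholdPolicy<T> {
    private final long maxItems;
    private long count;

    CountPolicy(long maxItems) { this.maxItems = maxItems; }

    public void increment(T item) { count++; }
    public boolean isThresholdExceeded() { return count >= maxItems; }
    public void reset() { count = 0; }
}

/** Size-based policy: estimates bytes held, here as the UTF-8 length of each string. */
final class ByteSizePolicy implements ThresholdPolicy<String> {
    private final long maxBytes;
    private long bytes;

    ByteSizePolicy(long maxBytes) { this.maxBytes = maxBytes; }

    public void increment(String item) {
        bytes += item.getBytes(StandardCharsets.UTF_8).length;
    }
    public boolean isThresholdExceeded() { return bytes >= maxBytes; }
    public void reset() { bytes = 0; }
}
```

A bag would call increment on every add, check isThresholdExceeded, and call reset after writing a spill file.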
> A DataBag provides an Iterator interface that allows callers to read through
> the contents. The iterators are aware of the data spilling. They have to be
> able to handle reading from the spill files.
> The DataBag interface assumes that all data is written before any is read.
> That is, a DataBag cannot be used as a queue. If data is written after data
> is read, the results are undefined.
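
The spill, iterate, and write-before-read behavior can be sketched as follows. This is a simplified illustration for strings only, with a plain item-count threshold. The class name SpillingStringBag, the file format, and the choice to throw on a late add (the description only says the results are undefined) are assumptions, not the design in the patch.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class SpillingStringBag implements AutoCloseable {
    private final int maxInMemory;
    private final List<String> memory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>(); // every file spilled so far
    private boolean readStarted;

    SpillingStringBag(int maxInMemory) { this.maxInMemory = maxInMemory; }

    void add(String item) {
        // The source leaves add-after-read undefined; this sketch rejects it.
        if (readStarted) throw new IllegalStateException("add after iteration started");
        memory.add(item);
        if (memory.size() >= maxInMemory) spill();
    }

    int spillFileCount() { return spillFiles.size(); }

    /** Writes the in-memory items to a new temp file, then clears memory. */
    private void spill() {
        try {
            Path file = Files.createTempFile("bag-spill", ".bin");
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(file)))) {
                out.writeInt(memory.size());
                for (String s : memory) out.writeUTF(s); // writeUTF caps items at 65535 bytes
            }
            spillFiles.add(file);
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static List<String> readFile(Path file) {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int n = in.readInt();
            List<String> items = new ArrayList<>(n);
            for (int i = 0; i < n; i++) items.add(in.readUTF());
            return items;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads each spill file in turn (one file in memory at a time), then the unspilled items. */
    Iterator<String> iterator() {
        readStarted = true;
        return new Iterator<String>() {
            private int nextFile = 0;
            private boolean memoryServed = false;
            private Iterator<String> current = Collections.emptyIterator();

            public boolean hasNext() {
                while (!current.hasNext()) {
                    if (nextFile < spillFiles.size()) {
                        current = readFile(spillFiles.get(nextFile++)).iterator();
                    } else if (!memoryServed) {
                        memoryServed = true;
                        current = memory.iterator();
                    } else {
                        return false;
                    }
                }
                return true;
            }

            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }

    /** Deletes the spill files. */
    @Override
    public void close() {
        for (Path f : spillFiles) {
            try { Files.deleteIfExists(f); } catch (IOException e) { throw new UncheckedIOException(e); }
        }
        spillFiles.clear();
        memory.clear();
    }
}
```

Used in a try-with-resources block, the bag removes its temporary files when the caller is done with it.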
> DataBags come in several types: default, sorted, and distinct. The type must
> be chosen up front; there is no way to convert a bag on the fly. Default data
> bags do not guarantee any particular order of retrieval for the tuples and
> may contain duplicate tuples. Sorted data bags guarantee that tuples will be
> retrieved in order, where "in order" is defined either by the default
> comparator for the tuple or the comparator provided by the caller when the
> bag was created. Sorted bags may contain duplicates. Distinct bags do not
> guarantee any particular order of retrieval, but do guarantee that they will
> not contain duplicate tuples.
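
The description does not say how a sorted bag returns tuples in order once it has spilled. One common approach, assumed here purely for illustration, is to sort each batch before spilling it and then do a k-way merge of the sorted runs on retrieval. The sketch below shows only the merge step, on in-memory lists standing in for spill files; SortedRunMerger and Cursor are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class SortedRunMerger {
    /** One run's remaining items plus the item currently at its head. */
    private static final class Cursor<T> {
        T head;
        final Iterator<T> rest;

        Cursor(Iterator<T> it) {
            this.head = it.next();
            this.rest = it;
        }
    }

    /** Merges already-sorted runs into one sorted list; duplicates are kept. */
    static <T> List<T> merge(List<List<T>> sortedRuns, Comparator<? super T> order) {
        // The heap always exposes the run whose head item sorts first.
        PriorityQueue<Cursor<T>> heap =
                new PriorityQueue<>((a, b) -> order.compare(a.head, b.head));
        for (List<T> run : sortedRuns) {
            if (!run.isEmpty()) heap.add(new Cursor<>(run.iterator()));
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor<T> c = heap.poll();
            out.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next(); // advance this run and put it back on the heap
                heap.add(c);
            }
        }
        return out;
    }
}
```

A real sorted bag would stream the merged items through an iterator rather than collect them into a list; the list keeps the sketch short.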
> DataBags are generic containers and may store any item that can be
> serialized and deserialized. Each bag accepts a SerializationFactory that
> handles this task.
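
As a rough illustration of how a serialization factory could keep the bag generic, the sketch below pairs a writer and a reader for one item type. Only the name SerializationFactory comes from the description. The method names, the ItemWriter and ItemReader interfaces, and LongSerializationFactory are hypothetical.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical shape: creates matching writers and readers for items of type T. */
interface SerializationFactory<T> {
    ItemWriter<T> createWriter(OutputStream out);
    ItemReader<T> createReader(InputStream in);
}

interface ItemWriter<T> {
    void write(T item) throws IOException;
}

interface ItemReader<T> {
    T read() throws IOException;
}

/** Example factory for Long items, using an 8-byte big-endian encoding. */
final class LongSerializationFactory implements SerializationFactory<Long> {
    public ItemWriter<Long> createWriter(OutputStream out) {
        DataOutputStream data = new DataOutputStream(out);
        return item -> data.writeLong(item);
    }

    public ItemReader<Long> createReader(InputStream in) {
        DataInputStream data = new DataInputStream(in);
        return () -> data.readLong();
    }
}
```

With this split, the bag only deals in streams and files, and everything specific to the item type lives in the factory.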
> [1] http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/data/DataBag.html