[ https://issues.apache.org/jira/browse/JENA-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne closed JENA-99.
-----------------------------

> Spill to disk data bags
> -----------------------
>
>                 Key: JENA-99
>                 URL: https://issues.apache.org/jira/browse/JENA-99
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Stephen Allen
>            Assignee: Andy Seaborne
>         Attachments: JENA-99-r1157891.patch
>
>
> For certain query operations, ARQ needs to store a large number of tuples
> temporarily. Currently these are stored in Java Collections; however, for
> large result sets the system can exhaust the available memory. There is a
> need for a set of generic data structures that can hold these tuples and
> spill to disk if they get too large.
> ==
> The design is inspired by Apache Pig's DataBag [1]:
> A DataBag is a collection of tuples. A DataBag may or may not fit into
> memory. It proactively spills to disk when its size exceeds the threshold.
> When it spills, it takes whatever it has in memory, opens a spill file, and
> writes the contents out. This may happen multiple times. The bag tracks all
> of the files it has spilled to. The spill behavior is controlled by a
> ThresholdPolicy object. The most basic policy spills based on the number of
> tuples added. A more advanced policy estimates the size of all the
> tuples added to the DataBag and spills when it passes a byte threshold.
> A DataBag provides an Iterator interface that allows callers to read through
> the contents. The iterators are aware of the data spilling: they have to be
> able to handle reading from the spill files.
> The DataBag interface assumes that all data is written before any is read.
> That is, a DataBag cannot be used as a queue. If data is written after data
> is read, the results are undefined.
> DataBags come in several types: default, sorted, and distinct. The type must
> be chosen up front; there is no way to convert a bag on the fly. Default data
> bags do not guarantee any particular order of retrieval for the tuples and
> may contain duplicate tuples. Sorted data bags guarantee that tuples will be
> retrieved in order, where "in order" is defined either by the default
> comparator for the tuple or by the comparator provided by the caller when the
> bag was created. Sorted bags may contain duplicates. Distinct bags do not
> guarantee any particular order of retrieval, but do guarantee that they will
> not contain duplicate tuples.
> The DataBags are generic containers and may store any item that can be
> serialized and deserialized. They accept a SerializationFactory that handles
> this task.
> [1] http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/data/DataBag.html
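
For illustration only, the behavior described above for a default bag might be sketched as follows. This is not the code in the attached patch. The class name SpillingBag, its constructor parameters, and the spillFileCount() accessor are hypothetical. A plain item count stands in for the ThresholdPolicy, and a pair of functions stands in for the SerializationFactory. The sketch assumes each serialized item fits on one line of text, and it throws on a write after a read, where the description only says the results are undefined.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical sketch, not ARQ code: a default bag (no ordering, duplicates kept).
final class SpillingBag<T> implements Iterable<T>, AutoCloseable {
    private final int threshold;                   // count-based policy (assumption)
    private final Function<T, String> serializer;  // stand-in for SerializationFactory
    private final Function<String, T> deserializer;
    private final List<T> memory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();
    private boolean readStarted = false;

    SpillingBag(int threshold, Function<T, String> serializer, Function<String, T> deserializer) {
        this.threshold = threshold;
        this.serializer = serializer;
        this.deserializer = deserializer;
    }

    void add(T item) {
        if (readStarted) throw new IllegalStateException("write after read is not supported");
        memory.add(item);
        if (memory.size() > threshold) spill();    // spill once the size exceeds the threshold
    }

    // Write everything held in memory to a new spill file and remember that file.
    private void spill() {
        try {
            Path file = Files.createTempFile("bag-spill-", ".tmp");
            spillFiles.add(file);
            try (BufferedWriter out = Files.newBufferedWriter(file)) {
                for (T item : memory) {
                    out.write(serializer.apply(item));
                    out.newLine();
                }
            }
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int spillFileCount() { return spillFiles.size(); }

    // Yields the in-memory items first, then streams each spill file line by line.
    // A reader is left open if the caller abandons iteration part-way through a file.
    @Override
    public Iterator<T> iterator() {
        readStarted = true;
        return new Iterator<T>() {
            private final Iterator<T> inMemory = memory.iterator();
            private final Iterator<Path> files = spillFiles.iterator();
            private BufferedReader reader;
            private String nextLine;

            private boolean advance() {
                try {
                    while (nextLine == null) {
                        if (reader == null) {
                            if (!files.hasNext()) return false;
                            reader = Files.newBufferedReader(files.next());
                        }
                        nextLine = reader.readLine();
                        if (nextLine == null) { reader.close(); reader = null; }
                    }
                    return true;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override public boolean hasNext() { return inMemory.hasNext() || advance(); }

            @Override public T next() {
                if (inMemory.hasNext()) return inMemory.next();
                if (!advance()) throw new NoSuchElementException();
                String line = nextLine;
                nextLine = null;
                return deserializer.apply(line);
            }
        };
    }

    // Delete the spill files; the bag is empty afterwards.
    @Override
    public void close() {
        try {
            for (Path f : spillFiles) Files.deleteIfExists(f);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spillFiles.clear();
        memory.clear();
    }
}
```

A sorted or distinct bag would need more than this. For example, a sorted bag could sort each batch before spilling and merge the spill files while iterating. That is a design suggestion, not something the issue text specifies.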