[
https://issues.apache.org/jira/browse/JENA-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085797#comment-13085797
]
Stephen Allen commented on JENA-99:
-----------------------------------
1/ The background sort/write-to-disk could be added back to the SortedDataBag
at the cost of additional complexity: cancellation becomes harder, and memory
usage could increase by a factor equal to the number of worker threads. I also
suspect (but have no evidence) that the time spent sorting and sequentially
writing a file to disk will be dominated by the time spent retrieving the
bindings from the source iterator.
a) The bags should work with in-memory datasets; no files are created until the
threshold is passed. A few options: use Long.MAX_VALUE as the threshold count
(no changes to code); designate -1 to mean "never spill"; or create a new policy
object, ThresholdPolicyNever. The -1 option might be the easiest for setting up
the config file.
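The options above can be sketched roughly as follows. This is only an illustration: ThresholdPolicy and ThresholdPolicyNever are named in this thread, but the method names, the ThresholdPolicyCount class, and the ThresholdPolicies.fromConfig helper are assumptions made for the example, not the API in the attached patch.

```java
/** Assumed shape of the policy interface; method names are hypothetical. */
interface ThresholdPolicy<T> {
    void increment(T item);          // called once for each item added to the bag
    boolean isThresholdExceeded();   // true when the bag should spill
    void reset();                    // called after the bag has spilled
}

/** Count-based policy: spill once more than `threshold` items have been added. */
final class ThresholdPolicyCount<T> implements ThresholdPolicy<T> {
    private final long threshold;
    private long count;

    ThresholdPolicyCount(long threshold) { this.threshold = threshold; }

    public void increment(T item) { count++; }
    public boolean isThresholdExceeded() { return count > threshold; }
    public void reset() { count = 0; }
}

/** Policy that never asks the bag to spill, for purely in-memory use. */
final class ThresholdPolicyNever<T> implements ThresholdPolicy<T> {
    public void increment(T item) { /* nothing to track */ }
    public boolean isThresholdExceeded() { return false; }
    public void reset() { /* nothing to reset */ }
}

/** Hypothetical helper that maps a configured value to a policy. */
final class ThresholdPolicies {
    private ThresholdPolicies() {}

    static <T> ThresholdPolicy<T> fromConfig(long configured) {
        if (configured == -1) {
            return new ThresholdPolicyNever<>();   // -1 means "never spill"
        }
        if (configured < 0) {
            throw new IllegalArgumentException("threshold must be -1 or >= 0: " + configured);
        }
        return new ThresholdPolicyCount<>(configured);
    }
}
```

With this shape, passing Long.MAX_VALUE to the count policy and passing -1 behave the same in practice, so the choice is mostly about how readable the config file is.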
A/ You're right about BindingComparator.compareBindingsSyntactic(). Sorting
each binding's variables for every comparison is going to be quite expensive.
I think your suggestions make sense.
As background, I had to modify it because by default DistinctDataBag does not
use any SortConditions. Also, we will need a stable sort on the entire binding
set, not just the ORDER BY variables, if we are to do optimizations like
JENA-90.
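One way to avoid sorting each binding's variables on every comparison is to fix the variable order once, when the comparator is built. The sketch below is only an illustration and not necessarily what was suggested earlier in this issue: it models a binding as a Map<String, String> instead of ARQ's Binding, the class name is hypothetical, and it assumes the full set of variables is known up front (variables outside that set are ignored).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical comparator giving a total order over bindings for a known variable set. */
final class SyntacticBindingComparator implements Comparator<Map<String, String>> {
    private final List<String> vars;   // sorted once, reused for every comparison

    SyntacticBindingComparator(Collection<String> vars) {
        List<String> sorted = new ArrayList<>(vars);
        Collections.sort(sorted);
        this.vars = Collections.unmodifiableList(sorted);
    }

    @Override
    public int compare(Map<String, String> a, Map<String, String> b) {
        for (String v : vars) {
            String x = a.get(v);
            String y = b.get(v);
            if (x == null && y == null) continue;   // unbound in both
            if (x == null) return -1;               // unbound sorts first
            if (y == null) return 1;
            int c = x.compareTo(y);                 // plain string order as a stand-in
            if (c != 0) return c;
        }
        return 0;   // equal on every known variable
    }
}
```

Because the order covers every variable in the set, and not only the ORDER BY variables, two bindings compare as equal only when they agree on all of them.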
B/ SerializationFactoryFinder is used to build the actual SerializationFactory
instances (JENA-44 uses a Binding factory, while JENA-45 uses Binding and
Triple factories).
C/ Yes, it makes more sense there.
D/ Yeah, I noticed the other Tuple object and meant to change it, but forgot.
It also needs to be changed in DataBag.java.
> Spill to disk data bags
> -----------------------
>
> Key: JENA-99
> URL: https://issues.apache.org/jira/browse/JENA-99
> Project: Jena
> Issue Type: New Feature
> Components: ARQ
> Reporter: Stephen Allen
> Attachments: JENA-99-r1157891.patch
>
>
> For certain query operations, ARQ needs to store a large number of tuples
> temporarily. Currently these are stored in Java Collections; however, for
> large result sets the system can exhaust the available memory. There is a
> need for a set of generic data structures that can hold these tuples and
> spill to disk if they get too large.
> ==
> The design is inspired by Apache Pig's DataBag [1]:
> A DataBag is a collection of tuples. A DataBag may or may not fit into
> memory. It proactively spills to disk when its size exceeds the threshold.
> When it spills, it takes whatever it has in memory, opens a spill file, and
> writes the contents out. This may happen multiple times. The bag tracks all
> of the files it's spilled to. The spill behavior is controlled by a
> ThresholdPolicy object. The most basic policy spills based on the number of
> tuples added. A more advanced policy is to estimate the size of all the
> tuples added to the DataBag and spill when it passes a byte threshold.
> A DataBag provides an Iterator interface that allows callers to read through
> the contents. The iterators are aware of the data spilling and must be
> able to handle reading from the spill files.
> The DataBag interface assumes that all data is written before any is read.
> That is, a DataBag cannot be used as a queue. If data is written after data
> is read, the results are undefined.
> DataBags come in three types: default, sorted, and distinct. The type must
> be chosen up front; there is no way to convert a bag on the fly. Default data
> bags do not guarantee any particular order of retrieval for the tuples and
> may contain duplicate tuples. Sorted data bags guarantee that tuples will be
> retrieved in order, where "in order" is defined either by the default
> comparator for the tuple or the comparator provided by the caller when the
> bag was created. Sorted bags may contain duplicates. Distinct bags do not
> guarantee any particular order of retrieval, but do guarantee that they will
> not contain duplicate tuples.
> DataBags are generic containers and may store any item that can be
> serialized and deserialized. Each bag accepts a SerializationFactory that
> handles this task.
> [1] http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/data/DataBag.html
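The behaviour of a default (unordered) bag as described in the issue text can be sketched as below. This is a minimal illustration under stated assumptions, not the code in the attached patch: ItemSerializer stands in for the SerializationFactory, a plain item count stands in for the ThresholdPolicy, and every class and method name here is hypothetical. Each spill file starts with an item count so the reader knows where it ends.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

/** Hypothetical stand-in for the issue's SerializationFactory. */
interface ItemSerializer<T> {
    void write(DataOutputStream out, T item) throws IOException;
    T read(DataInputStream in) throws IOException;
}

/** Example serializer for strings. */
final class StringSerializer implements ItemSerializer<String> {
    public void write(DataOutputStream out, String item) throws IOException { out.writeUTF(item); }
    public String read(DataInputStream in) throws IOException { return in.readUTF(); }
}

/** Unordered bag that spills to temp files once more than `threshold` items are in memory. */
final class SpillingBag<T> implements Closeable {
    private final int threshold;
    private final ItemSerializer<T> serializer;
    private final List<T> memory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();   // every file spilled so far

    SpillingBag(int threshold, ItemSerializer<T> serializer) {
        this.threshold = threshold;
        this.serializer = serializer;
    }

    /** All adds must happen before iteration starts. */
    void add(T item) {
        memory.add(item);
        if (memory.size() > threshold) spill();
    }

    int spillFileCount() { return spillFiles.size(); }

    /** Writes everything currently in memory to a new spill file, then clears memory. */
    private void spill() {
        try {
            Path file = Files.createTempFile("bag", ".spill");
            spillFiles.add(file);
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(file)))) {
                out.writeInt(memory.size());
                for (T item : memory) serializer.write(out, item);
            }
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one spill file back; at most threshold + 1 items are held at a time. */
    private List<T> load(Path file) {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int n = in.readInt();
            List<T> items = new ArrayList<>(n);
            for (int i = 0; i < n; i++) items.add(serializer.read(in));
            return items;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Iterates over each spill file in turn, then over the items still in memory. */
    Iterator<T> iterator() {
        return new Iterator<T>() {
            private int nextFile = 0;
            private boolean memoryDone = false;
            private Iterator<T> current = Collections.emptyIterator();

            public boolean hasNext() {
                while (!current.hasNext()) {
                    if (nextFile < spillFiles.size()) {
                        current = load(spillFiles.get(nextFile++)).iterator();
                    } else if (!memoryDone) {
                        memoryDone = true;
                        current = memory.iterator();
                    } else {
                        return false;
                    }
                }
                return true;
            }

            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }

    /** Deletes the spill files and drops the in-memory items. */
    @Override
    public void close() {
        try {
            for (Path f : spillFiles) Files.deleteIfExists(f);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            spillFiles.clear();
            memory.clear();
        }
    }
}
```

A sorted bag would additionally sort the in-memory items before each spill and merge the spill files on read, and a distinct bag would drop duplicates during that merge; neither is shown here.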