[
https://issues.apache.org/jira/browse/APEXMALHAR-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228887#comment-15228887
]
Timothy Farkas commented on APEXMALHAR-2026:
--------------------------------------------
The byte[] optimization will be included. Off-heap memory will take more
investigation. Chandni has done some research on the topic, so we can work
together to incorporate it after the first iteration is complete.
> Spooled Datastructures
> ----------------------
>
> Key: APEXMALHAR-2026
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2026
> Project: Apache Apex Malhar
> Issue Type: New Feature
> Reporter: Timothy Farkas
> Assignee: Timothy Farkas
> Labels: roadmap
>
> Add libraries for spooling datastructures to a key value store. There are
> several customer use cases which require spooled data structures.
> 1 - Some operators like AbstractFileInputOperator have ever growing state.
> This is an issue because eventually the state of the operator will grow
> larger than the memory allocated to the operator, which will cause the
> operator to perpetually fail. However if the operator's datastructures are
> spooled then the operator will never run out of memory.
> 2 - Some users have requested the ability to maintain a map as well as a
> list of keys over which to iterate. Most key value stores don't provide this
> functionality. However, with spooled datastructures this functionality can be
> provided by maintaining a spooled map and an iterable set of keys.
> 3 - Some users have requested building graph databases within Apex. This
> would require implementing a spooled graph data structure.
> 4 - Another use case for spooled data structures is database operators.
> Database operators need to write data to a database, but sometimes the
> database is down. In this case most of the database operators repeatedly fail
> until the database comes back up. In order to avoid constant failures the
> database operator needs to write data to a queue when the database is down,
> then when the database is up the operator needs to take data from the queue
> and write it to the database. In the case of a database failure this queue
> can grow larger than the total amount of memory available to the operator,
> so the queue should be spooled in order to prevent the operator from failing.
> 5 - Any operator which needs to maintain a large data structure in memory
> currently needs to have that data serialized and written out to HDFS with
> every checkpoint. This is costly when the data structure is large. If the
> data structure is spooled, then only the changes to the data structure are
> written out to HDFS instead of the entire data structure.
> 6 - Also, building an Apex-native database for aggregations requires indices.
> These indices need to take the form of spooled data structures.
> 7 - In the future any operator which needs to maintain a data structure
> larger than the memory available to it will need to spool the data structure.
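
To make use cases 1 and 2 above more concrete, here is a minimal sketch of
what a spooled map with an iterable key set might look like. It is only an
illustration under stated assumptions, not the design proposed in this issue:
KeyValueStore, InMemoryStore and SpooledMap are hypothetical names rather than
Malhar APIs, the backing store is a plain HashMap standing in for a real
key-value store, null values are not supported, and the key set is kept on
the heap for simplicity (a real implementation would presumably spool it too).

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the backing key-value store; not a Malhar API.
interface KeyValueStore<K, V> {
    void put(K key, V value);
    V get(K key);
}

// Toy store backed by a HashMap so the sketch runs without external services.
class InMemoryStore<K, V> implements KeyValueStore<K, V> {
    final Map<K, V> data = new HashMap<>();

    public void put(K key, V value) { data.put(key, value); }

    public V get(K key) { return data.get(key); }
}

// Map that keeps at most maxInMemory entries on the heap and spools the rest.
class SpooledMap<K, V> {
    private final int maxInMemory;
    private final KeyValueStore<K, V> store;
    // Access-ordered LinkedHashMap: iteration starts at the least recently used entry.
    private final LinkedHashMap<K, V> cache = new LinkedHashMap<>(16, 0.75f, true);
    // Every key ever inserted, in insertion order, so callers can iterate (use case 2).
    // Assumption: held in memory here; a real version would likely spool this as well.
    private final Set<K> keys = new LinkedHashSet<>();

    SpooledMap(int maxInMemory, KeyValueStore<K, V> store) {
        this.maxInMemory = maxInMemory;
        this.store = store;
    }

    void put(K key, V value) {
        keys.add(key);
        cache.put(key, value);
        evictIfNeeded();
    }

    V get(K key) {
        V value = cache.get(key);
        if (value == null && keys.contains(key)) {
            // Cache miss: load the spooled value and keep it in memory again.
            value = store.get(key);
            if (value != null) {
                cache.put(key, value);
                evictIfNeeded();
            }
        }
        return value;
    }

    // Write least recently used entries to the store until the cache fits the bound.
    private void evictIfNeeded() {
        Iterator<Map.Entry<K, V>> it = cache.entrySet().iterator();
        while (cache.size() > maxInMemory && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            store.put(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    Iterable<K> keys() { return keys; }

    int inMemorySize() { return cache.size(); }
}
```

With a bound of 2, putting three entries leaves two on the heap and writes the
least recently used one to the store, while get() and keys() still see all
three. The LRU eviction policy is just one possible choice for this sketch.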
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)