[jira] [Commented] (APEXMALHAR-2026) Spooled Datastructures

Timothy Farkas (JIRA) Wed, 06 Apr 2016 11:54:52 -0700

    [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228887#comment-15228887
 ]


Timothy Farkas commented on APEXMALHAR-2026:
--------------------------------------------

The byte[] optimization will be included. Off-heap memory will take more 
investigation. Chandni has done some research on the topic; we can work 
together to incorporate it after the first iteration is complete.

> Spooled Datastructures
> ----------------------
>
>                 Key: APEXMALHAR-2026
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2026
>             Project: Apache Apex Malhar
>          Issue Type: New Feature
>            Reporter: Timothy Farkas
>            Assignee: Timothy Farkas
>              Labels: roadmap
>
> Add libraries for spooling data structures to a key value store. There are 
> several customer use cases which require spooled data structures.
> 1 - Some operators like AbstractFileInputOperator have ever-growing state. 
> This is an issue because eventually the state of the operator will grow 
> larger than the memory allocated to the operator, which will cause the 
> operator to perpetually fail. However, if the operator's data structures are 
> spooled then the operator will never run out of memory.
> 2 - Some users have requested the ability to maintain a map as well as a 
> list of keys over which to iterate. Most key value stores don't provide this 
> functionality. However, with spooled datastructures this functionality can be 
> provided by maintaining a spooled map and an iterable set of keys.
> 3 - Some users have requested building graph databases within Apex. This 
> would require implementing a spooled graph data structure.
> 4 - Another use case for spooled data structures is database operators. 
> Database operators need to write data to a database, but sometimes the 
> database is down. In this case most of the database operators repeatedly fail 
> until the database comes back up. In order to avoid constant failures, the 
> database operator needs to write data to a queue when the database is down, 
> then when the database is up the operator needs to take data from the queue 
> and write it to the database. During a long database outage this queue 
> can grow larger than the total amount of memory available to the operator, 
> so the queue should be spooled in order to prevent the operator from failing.
> 5 - Any operator which needs to maintain a large data structure in memory 
> currently needs to have that data serialized and written out to HDFS with 
> every checkpoint. This is costly when the data structure is large. If the 
> data structure is spooled, then only the changes to the data structure are 
> written out to HDFS instead of the entire data structure.
> 6 - Building an Apex-native database for aggregations also requires indices. 
> These indices need to take the form of spooled data structures.
> 7 - In the future any operator which needs to maintain a data structure 
> larger than the memory available to it will need to spool the data structure.
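
To make the use cases above more concrete, here is a minimal sketch of what a
spooled map could look like. It is not the design proposed in this issue and
uses no Apex or Malhar API: the names SpooledMapSketch, KeyValueStore,
InMemoryStore and SpooledMap are all hypothetical. The sketch keeps a bounded
number of entries in memory, spills the least recently used ones to a key
value store (use case 1), keeps an iterable set of keys (use case 2), and at
checkpoint time writes only the entries that changed (use case 5).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class SpooledMapSketch {

  /** Hypothetical stand-in for the key value store; not an Apex or Malhar API. */
  interface KeyValueStore<K, V> {
    void put(K key, V value);
    V get(K key);
  }

  /** Toy store backed by a HashMap so the sketch runs without external services. */
  static class InMemoryStore<K, V> implements KeyValueStore<K, V> {
    final Map<K, V> data = new HashMap<>();
    @Override public void put(K key, V value) { data.put(key, value); }
    @Override public V get(K key) { return data.get(key); }
  }

  /** Map that holds at most maxInMemory entries on heap. Null values are not supported. */
  static class SpooledMap<K, V> {
    private final int maxInMemory;
    private final KeyValueStore<K, V> store;
    // Access-ordered, so iteration starts at the least recently used entry.
    private final LinkedHashMap<K, V> cache = new LinkedHashMap<>(16, 0.75f, true);
    // Keys whose in-memory value has not been written to the store yet.
    private final Set<K> dirty = new HashSet<>();
    // Assumption for brevity: the key set stays in memory; a real version would spool it too.
    private final Set<K> keys = new LinkedHashSet<>();

    SpooledMap(int maxInMemory, KeyValueStore<K, V> store) {
      this.maxInMemory = maxInMemory;
      this.store = store;
    }

    void put(K key, V value) {
      cache.put(key, value);
      dirty.add(key);
      keys.add(key);
      evictIfNeeded();
    }

    V get(K key) {
      V value = cache.get(key);
      if (value == null) {
        value = store.get(key); // cache miss: fall back to the store
        if (value != null) {
          cache.put(key, value);
          evictIfNeeded();
        }
      }
      return value;
    }

    /** Keys in insertion order, covering both in-memory and spilled entries. */
    Iterable<K> keys() { return keys; }

    int inMemorySize() { return cache.size(); }

    /** Writes only the changed entries to the store and returns how many were written. */
    int checkpoint() {
      int written = 0;
      for (K key : dirty) {
        store.put(key, cache.get(key));
        written++;
      }
      dirty.clear();
      return written;
    }

    /** Spills least recently used entries, persisting them first if they are dirty. */
    private void evictIfNeeded() {
      Iterator<Map.Entry<K, V>> it = cache.entrySet().iterator();
      while (cache.size() > maxInMemory && it.hasNext()) {
        Map.Entry<K, V> eldest = it.next();
        if (dirty.remove(eldest.getKey())) {
          store.put(eldest.getKey(), eldest.getValue());
        }
        it.remove();
      }
    }
  }

  public static void main(String[] args) {
    SpooledMap<String, Integer> map = new SpooledMap<>(2, new InMemoryStore<>());
    map.put("a", 1);
    map.put("b", 2);
    map.put("c", 3); // exceeds the limit, so "a" is spilled to the store
    System.out.println("a=" + map.get("a") + ", in memory: " + map.inMemorySize());
    System.out.println("entries written at checkpoint: " + map.checkpoint());
  }
}
```

A spooled queue for the database operators in use case 4 could follow the
same pattern, with the head and tail of the queue held in memory and the
middle spilled to the store.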



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
