Re: [DISCUSS] Add RocksDB StateStore

2021-02-07 Thread Liang-Chi Hsieh
Thank you for the input, Yikun! Let's take it into account when we are ready to
add a RocksDB state store to Spark SS.


Yikun Jiang wrote
> I have worked on RocksDB multi-arch support and version upgrades in
> Kafka/Storm/Flink [1][2][3]. To avoid the same issues happening again in
> Spark, I want to give some input here on RocksDB version selection from
> the multi-arch support point of view. Hope it helps.
> 
> RocksDB has supported Arm64 [4] since version 6.4.6, and all Arm64-related
> commits were also backported to 5.18.4, which was released as a version
> supporting all platforms.
>
> So, from the multi-arch support point of view, the better RocksDB version
> is v6.4.6 or later, or v5.18.4 on the 5.x line.
> 
> [1] https://issues.apache.org/jira/browse/STORM-3599
> [2] https://github.com/apache/kafka/pull/8284
> [3] https://issues.apache.org/jira/browse/FLINK-13598
> [4] https://github.com/facebook/rocksdb/pull/6250
> 
> Regards,
> Yikun





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Add RocksDB StateStore

2021-02-07 Thread Yikun Jiang
I have worked on RocksDB multi-arch support and version upgrades in
Kafka/Storm/Flink [1][2][3]. To avoid the same issues happening again in
Spark, I want to give some input here on RocksDB version selection from
the multi-arch support point of view. Hope it helps.

RocksDB has supported Arm64 [4] since version 6.4.6, and all Arm64-related
commits were also backported to 5.18.4, which was released as a version
supporting all platforms.

So, from the multi-arch support point of view, the better RocksDB version
is v6.4.6 or later, or v5.18.4 on the 5.x line.
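
The version rule above can be sketched as a small check. This is only an
illustration: `parse_version` and `supports_arm64` are hypothetical helper
names, not part of RocksDB or Spark, and the sketch assumes that on the 5.x
line only 5.18.4 carries the backports, since that is the only 5.x release
named here.

```python
# Illustrative sketch of the version rule stated above.
# Both helpers are hypothetical; they are not part of RocksDB or Spark.

def parse_version(version: str) -> tuple:
    """Turn 'v6.4.6' or '5.18.4' into a tuple of ints such as (6, 4, 6)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def supports_arm64(version: str) -> bool:
    """Return True if the release should carry Arm64 support per the rule above."""
    parsed = parse_version(version)
    if parsed[0] == 5:
        # Assumption: on the 5.x line, only 5.18.4 has the Arm64 backports.
        return parsed == (5, 18, 4)
    # Otherwise require 6.4.6 or anything newer.
    return parsed >= (6, 4, 6)
```

A build could use such a check to reject a pinned RocksDB version that would
not work on Arm64 before it reaches a release.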

[1] https://issues.apache.org/jira/browse/STORM-3599
[2] https://github.com/apache/kafka/pull/8284
[3] https://issues.apache.org/jira/browse/FLINK-13598
[4] https://github.com/facebook/rocksdb/pull/6250

Regards,
Yikun

Liang-Chi Hsieh wrote on Tue, Feb 2, 2021 at 4:32 PM:

> Hi devs,
>
> In Spark Structured Streaming, we need a state store for state management
> in stateful operators such as streaming aggregates, joins, etc. We have one
> and only one state store implementation now. It is an in-memory hash map
> that is backed up to an HDFS-compatible file system at the end of every
> micro-batch.
>
> As it basically uses an in-memory map to store state, memory consumption is
> a serious issue and the state store size is limited by the size of the
> executor memory. Moreover, a state store that uses more memory may hurt the
> performance of task execution, which requires memory too.
>
> Internally, we see more streaming applications that require large state in
> stateful operations. For such requirements, we need a StateStore that does
> not rely on memory to store state.
>
> This seems to be true externally as well, since several other major
> streaming frameworks already use RocksDB for state management. RocksDB is an
> embedded DB, and streaming engines can use it to store state instead of
> keeping it in memory.
>
> So it seems to me that RocksDB is proven to be a good choice for large state
> usage. But Spark SS still lacks a built-in state store for this requirement.
>
> Previously there was one attempt, SPARK-28120, to add a RocksDB StateStore to
> Spark SS. IIUC, it was pushed back due to two concerns: the extra code
> maintenance cost and the RocksDB dependency it introduces.
>
> For the first concern, as more users need the feature, it should become
> heavily used code in SS and more developers will look at it. For the second
> one, we propose (SPARK-34198) to add it as an external module to relieve
> the dependency concern.
>
> Because it was pushed back previously, I'm raising this discussion to learn
> what people think about it now, in advance of submitting any code.
>
> I think the possible opinions are:
>
> 1. okay to add the RocksDB StateStore to the sql core module
> 2. not okay with 1, but okay to add the RocksDB StateStore as an external
> module
> 3. either 1 or 2 is okay
> 4. not okay to add the RocksDB StateStore, whether in sql core or as an
> external module
>
> Please let us know if you have some thoughts.
>
> Thank you.
>
> Liang-Chi Hsieh