Read the linked page carefully.

This solution is available (*only*) in Databricks Runtime.

You can enable RocksDB-based state management by setting the following
configuration on the SparkSession before starting the streaming query.

spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")


On Sun, 10 Mar 2019 at 11:54, Lian Jiang <jiangok2...@gmail.com> wrote:

> Hi,
>
> I have a very simple SSS pipeline which does:
>
> val query = df
>   .dropDuplicates(Array("Id", "receivedAt"))
>   .withColumn(timePartitionCol, timestamp_udfnc(col("receivedAt")))
>   .writeStream
>   .format("parquet")
>   .partitionBy("availabilityDomain", timePartitionCol)
>   .trigger(Trigger.ProcessingTime(5, TimeUnit.MINUTES))
>   .option("path", "/data")
>   .option("checkpointLocation", "/data_checkpoint")
>   .start()
>
> After ingesting 2T records, the state under the checkpoint folder on HDFS
> (replication factor 2) has grown to 2TB.
> My cluster has only 2TB of storage, which means it can barely handle
> further data growth.
>
> The online Spark documentation
> (https://docs.databricks.com/spark/latest/structured-streaming/production.html)
> says that using RocksDB helps SSS jobs reduce JVM memory overhead. But I
> cannot find any documentation on how to set up RocksDB for SSS. The Spark
> class CheckpointReader seems to only handle HDFS.
>
> Any suggestions? Thanks!
>
