Use https://github.com/chermenin/spark-states instead
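
A minimal sketch of how that could look, using the same configuration key as in the quoted message below. It assumes the spark-states jar is already on the driver and executor classpath. The provider class name is written from my recollection of that project's README, so treat it as an assumption and verify it there before use. The app name is purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: the spark-states jar is on the driver and executor classpath.
val spark = SparkSession.builder()
  .appName("rocksdb-state-store-example") // hypothetical app name
  .getOrCreate()

// Assumption: provider class name as recalled from the spark-states README;
// check the README for the exact name matching your Spark version.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
```

The provider has to be set before the streaming query is started, as the quoted instructions also say.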

On Sun, 10 Mar 2019 at 20:51, Arun Mahadevan <ar...@apache.org> wrote:

>
> Read the link carefully,
>
> This solution is available (*only*) in Databricks Runtime.
>
> You can enable RocksDB-based state management by setting the following
> configuration in the SparkSession before starting the streaming query.
>
> spark.conf.set(
>   "spark.sql.streaming.stateStore.providerClass",
>   "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")
>
>
> On Sun, 10 Mar 2019 at 11:54, Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a very simple Spark Structured Streaming (SSS) pipeline which does:
>>
>> val query = df
>>   .dropDuplicates(Array("Id", "receivedAt"))
>>   .withColumn(timePartitionCol, timestamp_udfnc(col("receivedAt")))
>>   .writeStream
>>   .format("parquet")
>>   .partitionBy("availabilityDomain", timePartitionCol)
>>   .trigger(Trigger.ProcessingTime(5, TimeUnit.MINUTES))
>>   .option("path", "/data")
>>   .option("checkpointLocation", "/data_checkpoint")
>>   .start()
>>
>> After ingesting 2T records, the state under the checkpoint folder on HDFS
>> (replication factor 2) has grown to 2 TB.
>> My cluster has only 2 TB of storage, so it can barely handle
>> further data growth.
>>
>> The online Spark documentation
>> (https://docs.databricks.com/spark/latest/structured-streaming/production.html)
>> says that using RocksDB helps an SSS job reduce JVM memory overhead, but I
>> cannot find any documentation on how to set up RocksDB for SSS. The Spark
>> class CheckpointReader seems to handle only HDFS.
>>
>> Any suggestions? Thanks!
>>