[ 
https://issues.apache.org/jira/browse/SPARK-30946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim closed SPARK-30946.
--------------------------------

> FileStreamSourceLog/FileStreamSinkLog: leverage UnsafeRow type to 
> serialize/deserialize entry
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30946
>                 URL: https://issues.apache.org/jira/browse/SPARK-30946
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> HDFSMetadataLog and its descendants are normally using JSON serialization to 
> serialize/deserialize entries.
> While it's good to support backward compatibility (like field addition and 
> field deletion), it tends to take bunch of overhead as it adds field names, 
> and should store all data types to string (at least when it's being written 
> to file), works badly for some kind of fields - e.g. timestamp.
> The major overhead is heavily affecting to CompactibleFileStreamLog, as 
> "compact" operation requires to load all entities and do the 
> transformation/filtering (I haven't seen any effective operation being 
> implemented though), and store them altogether into one file. This is the one 
> of major reason why the metadata file is too huge and it brings unacceptable 
> latency on "compact" operation.
> Fortunately, the entity class for both FileStreamSourceLog (FileEntry) and 
> FileStreamSinkLog (SinkFileStatus) haven't been modified for over 3 years. 
> The latest modification of both classes were year 2016. We can call it 
> "stable" - and then we have more option to optimize serde.
> One easy but pretty effective approach on optimizing serde is converting to 
> UnsafeRow and storing it on the same way we do in 
> HDFSBackedStateStoreProvider, and vice versa. It has being running for 2.x 
> versions, so the approach is considered as safe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to