[ https://issues.apache.org/jira/browse/SPARK-30946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jungtaek Lim closed SPARK-30946. -------------------------------- > FileStreamSourceLog/FileStreamSinkLog: leverage UnsafeRow type to > serialize/deserialize entry > --------------------------------------------------------------------------------------------- > > Key: SPARK-30946 > URL: https://issues.apache.org/jira/browse/SPARK-30946 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 3.1.0 > Reporter: Jungtaek Lim > Priority: Major > > HDFSMetadataLog and its descendants are normally using JSON serialization to > serialize/deserialize entries. > While it's good to support backward compatibility (like field addition and > field deletion), it tends to take bunch of overhead as it adds field names, > and should store all data types to string (at least when it's being written > to file), works badly for some kind of fields - e.g. timestamp. > The major overhead is heavily affecting to CompactibleFileStreamLog, as > "compact" operation requires to load all entities and do the > transformation/filtering (I haven't seen any effective operation being > implemented though), and store them altogether into one file. This is the one > of major reason why the metadata file is too huge and it brings unacceptable > latency on "compact" operation. > Fortunately, the entity class for both FileStreamSourceLog (FileEntry) and > FileStreamSinkLog (SinkFileStatus) haven't been modified for over 3 years. > The latest modification of both classes were year 2016. We can call it > "stable" - and then we have more option to optimize serde. > One easy but pretty effective approach on optimizing serde is converting to > UnsafeRow and storing it on the same way we do in > HDFSBackedStateStoreProvider, and vice versa. It has being running for 2.x > versions, so the approach is considered as safe. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org