HeartSaVioR commented on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-635603504


   > One lesson I learned from the past is UnsafeRow is not designed to be 
   > persisted across Spark versions.
   
   This sounds like a blocker for SS, as we leverage UnsafeRow to store state, 
   which should be usable across Spark versions and must not be corrupted or 
   lost: replaying from scratch to reconstruct the same state is time-consuming, 
   and sometimes simply impossible due to retention policy. Do you have 
   references to the discussion/gotchas around that issue? I think we should 
   really fix it for storing state - probably state must not be stored in the 
   UnsafeRow format then.
   
   > Do you know what are the performance numbers if we just compress the text 
   > files?
   
   I haven't experimented, but we can easily imagine that the file size would be 
   smaller, whereas processing time may be affected both positively (less to 
   read from remote storage) and negatively (we still have to 
   serialize/deserialize JSON, plus compress/decompress).
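   To make the tradeoff concrete, here is a minimal plain-Scala sketch that 
   gzips a repeated, made-up event-log JSON line in memory. The line content 
   and repetition count are just illustrative, not taken from a real event log:

   ```scala
   import java.io.ByteArrayOutputStream
   import java.util.zip.GZIPOutputStream

   object GzipSizeSketch {
     // Gzip a byte array fully in memory.
     def gzip(data: Array[Byte]): Array[Byte] = {
       val bos = new ByteArrayOutputStream()
       val gz  = new GZIPOutputStream(bos)
       gz.write(data)
       gz.close()
       bos.toByteArray
     }

     def main(args: Array[String]): Unit = {
       // Hypothetical event-log line, repeated to mimic a text event log file.
       val line = """{"Event":"SparkListenerTaskEnd","Stage ID":1,"Task Info":{"Task ID":42}}"""
       val raw  = ((line + "\n") * 10000).getBytes("UTF-8")
       val zipped = gzip(raw)
       println(s"raw=${raw.length} bytes, gzip=${zipped.length} bytes")
       // Highly repetitive text compresses well; the CPU cost of the
       // compress/decompress step is the flip side of that saving.
       assert(zipped.length < raw.length)
     }
   }
   ```

   The size win is largest for repetitive text like event logs; the read-path 
   cost still includes JSON parsing on top of decompression.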
   
   Here the point of the PR is that we know the schema (and even the versioning) 
   of the file in advance, hence we don't (and shouldn't) need to pay a huge 
   cost to make the file backward-compatible by itself. We don't do versioning 
   for the data structures used by the event log, so we pay a huge cost to keep 
   them backward/forward compatible. If we are not sure about the UnsafeRow 
   format for storing state, we could just try traditional approaches instead.
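   As a minimal sketch of that idea, here is a plain-Scala toy encoder assuming 
   a hypothetical fixed record of (stageId, durationMs) - not Spark's actual 
   state format. Because reader and writer agree on the schema up front, the 
   file needs only a version byte plus raw values, while a self-describing JSON 
   file repeats the field names in every record:

   ```scala
   import java.nio.ByteBuffer
   import java.nio.charset.StandardCharsets

   object KnownSchemaSketch {
     // Single version byte written up front; bump it when the layout changes.
     val Version: Byte = 1

     // Fixed layout per record: 4-byte Int + 8-byte Long, no field names.
     def encode(records: Seq[(Int, Long)]): Array[Byte] = {
       val buf = ByteBuffer.allocate(1 + records.size * 12)
       buf.put(Version)
       records.foreach { case (id, dur) => buf.putInt(id); buf.putLong(dur) }
       buf.array()
     }

     def decode(bytes: Array[Byte]): Seq[(Int, Long)] = {
       val buf = ByteBuffer.wrap(bytes)
       require(buf.get() == Version, "unsupported version")
       val out = Seq.newBuilder[(Int, Long)]
       while (buf.hasRemaining) out += ((buf.getInt(), buf.getLong()))
       out.result()
     }

     def main(args: Array[String]): Unit = {
       val records = (1 to 100).map(i => (i, i * 1000L))
       val binary  = encode(records)
       // Self-describing equivalent: field names repeated in every record.
       val json = records.map { case (id, dur) =>
         s"""{"stageId":$id,"durationMs":$dur}"""
       }.mkString("\n").getBytes(StandardCharsets.UTF_8)
       println(s"binary=${binary.length} bytes, json=${json.length} bytes")
       assert(decode(binary) == records)
       assert(binary.length < json.length)
     }
   }
   ```

   The version byte is what keeps a known-schema format evolvable: the reader 
   dispatches on it instead of relying on the file describing itself.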


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


