WweiL commented on PR #45971:
URL: https://github.com/apache/spark/pull/45971#issuecomment-2052183431

   Thanks for the effort! This really requires some deep understanding of Spark 
internals...
   
   There is still one important concern: the golden files are too big. 
I looked a bit, and it seems the largest golden file is ~7MB. We should find a 
way to limit each file's size to < 10MB.
   
   One improvement I can see: here you are storing both the rows and the 
partition ids.
   I think we don't need to store the rows. Instead, we can store the random 
seed and regenerate the random rows during the check.
   By doing this we only need to store the seed, the schemas, and for each 
schema:
   1. the partition ids, and 
   2. numRows
   
   With this change, the golden file size should be much smaller.
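
   The idea above can be sketched as follows (a minimal illustration, not the 
PR's actual code; the helper names and the integer-row generator are 
hypothetical). The key point is that a seeded RNG is deterministic, so the 
check can regenerate the exact same rows from the stored seed instead of 
reading them from the golden file:

   ```python
   import random

   # Hypothetical sketch: the golden file stores only the seed and numRows
   # per schema, not the generated rows themselves.
   def write_golden(seed: int, num_rows: int) -> dict:
       return {"seed": seed, "numRows": num_rows}

   # During the check, regenerate the rows deterministically from the seed.
   def regenerate_rows(seed: int, num_rows: int) -> list:
       rng = random.Random(seed)  # seeded RNG -> reproducible sequence
       return [rng.randint(0, 1_000_000) for _ in range(num_rows)]

   golden = write_golden(seed=42, num_rows=5)
   rows_a = regenerate_rows(golden["seed"], golden["numRows"])
   rows_b = regenerate_rows(golden["seed"], golden["numRows"])
   assert rows_a == rows_b  # same seed, identical rows on every run
   ```

   The same pattern applies with Spark's row generators: as long as the 
generator is driven by the stored seed, only the seed, schemas, partition ids, 
and numRows need to live in the golden file.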


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

