[GitHub] [spark] HeartSaVioR commented on pull request #29715: [WIP][SPARK-32847][SS] Add DataStreamWriterV2 API

GitBox Mon, 14 Sep 2020 05:11:10 -0700


HeartSaVioR commented on pull request #29715:
URL: https://github.com/apache/spark/pull/29715#issuecomment-692011667



   Thanks for the input.
   
   My initial goal was to enable reading a catalog table in an SS query, so I 
didn't touch the rest of what DataStreamWriter offers. I borrowed the concept 
of representing the "save mode" as the final call in the method chain, but I'm 
also fine with keeping `start()` as DataStreamWriter does. If you have an idea 
in mind about the problem, please feel free to share. 
   
   I think there are some points to consider while designing:
   
   1. The output mode for the sink doesn't exactly match the output mode for 
the result table.
   
   We already know about the case of "update as append" (the output mode for 
the result table is update, but the sink does an append) for DSv2. In reality, 
most sinks (at least the built-in ones) append in any mode (even complete 
mode), simply because that is what we did in Spark 2.x. DSv1 is even more 
problematic: the interface is designed for append only, yet there is no 
restriction on the output mode for a DSv1 sink.
   
   I think we won't support DSv1 in DataStreamWriterV2, but the mismatch still 
remains in DSv2. Do we want to keep the mismatch forever, or fix it at least in 
DSv2? (Kafka is one example - the Kafka sink shouldn't allow update and 
complete mode. I think we made the right fix, but compatibility got messed up.)
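
   To make the mismatch in point 1 concrete, here is a minimal sketch in plain Python. It is not Spark code: `OutputMode`, `SinkSpec`, `validate`, and the `strict` flag are all hypothetical names invented for illustration. It only shows the difference between silently accepting a mode the sink cannot honor and rejecting it up front.

```python
from dataclasses import dataclass
from enum import Enum


class OutputMode(Enum):
    APPEND = "append"
    UPDATE = "update"
    COMPLETE = "complete"


@dataclass(frozen=True)
class SinkSpec:
    """Hypothetical description of a sink; not a Spark class."""
    name: str
    # Output modes the sink can honor without changing their meaning.
    supported_modes: frozenset


def validate(sink: SinkSpec, mode: OutputMode, strict: bool) -> str:
    """Return 'ok' or 'mismatch'; in strict mode, raise on a mismatch."""
    if mode in sink.supported_modes:
        return "ok"
    if strict:
        raise ValueError(f"{sink.name} sink does not support {mode.value} mode")
    return "mismatch"


# Assumption for illustration: a sink that can only append.
append_only_sink = SinkSpec("append-only", frozenset({OutputMode.APPEND}))

# Lenient check: the query is accepted even though the sink only appends.
print(validate(append_only_sink, OutputMode.UPDATE, strict=False))

# Strict check: the mismatch is rejected before the query starts.
try:
    validate(append_only_sink, OutputMode.COMPLETE, strict=True)
except ValueError as err:
    print(err)
```

   The lenient path corresponds to keeping the mismatch, and the strict path to fixing it at the cost of rejecting queries that used to be accepted.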
   
   2. The continuous mode hasn't been actively developed.
   
   Given the current status of SS development, I don't think continuous mode 
will leverage the output mode in the near future. (In other words, output mode 
is not needed there.) I'm not sure that will stay valid - if it does, we may be 
able to split the builders for micro-batch and continuous mode and remove the 
output mode from the continuous one.
   
   (TBH, I'm wondering whether continuous mode is being used in production - 
the mode was introduced in Spark 2.3, and no one has proposed graduating it 
from experimental. No contributor has been taking care of it. Is it something 
we might consider retiring to reduce complexity?)
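
   The idea of splitting the builders in point 2 could look roughly like the sketch below. This is a plain-Python toy, not a proposal for the actual API: `QueryPlan`, `MicroBatchWriter`, and `ContinuousWriter` are hypothetical names, and the table name is made up. The micro-batch builder ends the chain with the output mode, while the continuous builder exposes no output mode at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueryPlan:
    """Hypothetical result of a builder; stands in for a started query."""
    table: str
    execution: str               # "micro-batch" or "continuous"
    output_mode: Optional[str]   # None when the mode has no meaning


class MicroBatchWriter:
    """Hypothetical builder: the output mode is the terminal call."""

    def __init__(self, table: str) -> None:
        self._table = table

    def append(self) -> QueryPlan:
        return QueryPlan(self._table, "micro-batch", "append")

    def update(self) -> QueryPlan:
        return QueryPlan(self._table, "micro-batch", "update")

    def complete(self) -> QueryPlan:
        return QueryPlan(self._table, "micro-batch", "complete")


class ContinuousWriter:
    """Hypothetical builder: no output mode is exposed."""

    def __init__(self, table: str) -> None:
        self._table = table

    def start(self) -> QueryPlan:
        return QueryPlan(self._table, "continuous", None)


# Hypothetical table name, used only for illustration.
print(MicroBatchWriter("catalog.db.events").update())
print(ContinuousWriter("catalog.db.events").start())
```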
   
   3. More things to consider?
   
   Without clear answers to these considerations, it would be hard to construct 
a good API.


