There might be some help from the staging table catalog as well.
-Matt Cheah

From: Wenchen Fan <cloud0...@gmail.com>
Date: Monday, August 5, 2019 at 7:40 PM
To: Shiv Prashant Sood <shivprash...@gmail.com>
Cc: Ryan Blue <rb...@netflix.com>, Jungtaek Lim <kabh...@gmail.com>, Spark Dev List <dev@spark.apache.org>
Subject: Re: DataSourceV2 : Transactional Write support

I agree with the temp table approach. One idea: maybe we only need one temp table, and each task writes to this temp table. At the end we read the data from the temp table and write it to the target table. AFAIK JDBC handles concurrent table writes very well, and that is better than creating thousands of temp tables for one write job (assuming the input RDD has thousands of partitions).

On Tue, Aug 6, 2019 at 7:57 AM Shiv Prashant Sood <shivprash...@gmail.com> wrote:

Thanks all for the clarification.

Regards,
Shiv

On Sat, Aug 3, 2019 at 12:49 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> What you could try instead is intermediate output: inserting into a
> temporary table in executors, and moving inserted records to the final
> table in the driver (must be atomic)

I think this is the approach that other systems (maybe Sqoop?) have taken: insert into independent temporary tables, which can be done quickly, then for the final commit operation, union them and insert into the final table. In a lot of cases, JDBC databases can do that quickly as well, because the data is already on disk and just needs to be added to the final table.

On Fri, Aug 2, 2019 at 7:25 PM Jungtaek Lim <kabh...@gmail.com> wrote:

I asked a similar question about end-to-end exactly-once with Kafka, and you're correct that distributed transactions are not supported. Introducing a distributed transaction protocol like two-phase commit would require huge changes to the Spark codebase, and the feedback was not positive.

What you could try instead is intermediate output: inserting into a temporary table in executors, and moving inserted records to the final table in the driver (must be atomic).

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sat, Aug 3, 2019 at 4:56 AM Shiv Prashant Sood <shivprash...@gmail.com> wrote:

All,

I understood that DataSourceV2 supports transactional writes and wanted to implement that in the JDBC DataSourceV2 connector (PR#25211).

I don't see how this is feasible for a JDBC-based connector. The framework suggests that each EXECUTOR sends a commit message to the DRIVER, and that the actual commit should only be done by the DRIVER after receiving all commit confirmations. This will not work for JDBC, because commits have to happen on the JDBC Connection, which is maintained by the EXECUTORS, and a JDBC Connection is not serializable, so it cannot be sent to the DRIVER.

Am I right in thinking that this cannot be supported for JDBC? My goal is to either fully write or fully roll back the dataframe write operation.

Thanks in advance for your help.

Regards,
Shiv

--
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

--
Ryan Blue
Software Engineer
Netflix
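
To make the proposal concrete, here is a minimal Scala sketch of the "independent temp tables, atomic final insert" pattern discussed above, assuming a PostgreSQL-style database. It is an illustration rather than the DataSourceV2 commit API itself: the jdbcUrl and targetTable parameters, the _tmp_ naming scheme, and the two-column schema are all hypothetical.

    // Sketch only: temp-table names, schema, and SQL dialect are assumptions.
    import java.sql.DriverManager
    import org.apache.spark.sql.DataFrame

    def writeAllOrNothing(df: DataFrame, jdbcUrl: String, targetTable: String): Unit = {
      val numParts = df.rdd.getNumPartitions

      // Phase 1 (executors): each partition commits into its own temp table.
      // These commits are independent; a failure here leaves the target untouched.
      df.rdd.mapPartitionsWithIndex { (idx, rows) =>
        val conn = DriverManager.getConnection(jdbcUrl)
        try {
          conn.createStatement().execute(
            s"CREATE TABLE ${targetTable}_tmp_$idx AS SELECT * FROM $targetTable WHERE 1 = 0")
          // Two-column schema assumed for brevity; derive from df.schema in practice.
          val stmt = conn.prepareStatement(s"INSERT INTO ${targetTable}_tmp_$idx VALUES (?, ?)")
          rows.foreach { row =>
            stmt.setObject(1, row.get(0))
            stmt.setObject(2, row.get(1))
            stmt.addBatch()
          }
          stmt.executeBatch()
        } finally conn.close()
        Iterator.single(idx)
      }.count() // force execution of the write phase

      // Phase 2 (driver): one atomic INSERT ... SELECT moves everything into
      // the target table, so readers see either all of the data or none of it.
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        val union = (0 until numParts)
          .map(i => s"SELECT * FROM ${targetTable}_tmp_$i")
          .mkString(" UNION ALL ")
        conn.setAutoCommit(false)
        conn.createStatement().execute(s"INSERT INTO $targetTable $union")
        conn.commit()
        conn.setAutoCommit(true)
        (0 until numParts).foreach { i =>
          conn.createStatement().execute(s"DROP TABLE ${targetTable}_tmp_$i")
        }
      } finally conn.close()
    }

Wenchen's single-temp-table variant would change only Phase 1 (all tasks insert into one shared staging table); in either case the atomicity comes from the single driver-side INSERT ... SELECT in Phase 2.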