Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Jungtaek Lim
IMHO that's up to how strict we would like to be about "exactly-once". Some say it is exactly-once when the output is eventually exactly-once, whereas others say there should be no side effect, i.e. a consumer shouldn't see a partial write. I guess 2PC is the former, since some partitions

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Russell Spitzer
Why are atomic operations a requirement? I feel like doubling the amount of writes (with staging tables) is probably a tradeoff that the end user should make. On Mon, Sep 10, 2018, 10:43 PM Wenchen Fan wrote: > Regardless of the API, to use Spark to write data atomically, it requires > 1. Write

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Wenchen Fan
Regardless of the API, to use Spark to write data atomically, it requires: 1. Write data distributedly, with a central coordinator at the Spark driver. 2. The distributed writers are not guaranteed to run together at the same time. (This can be relaxed if we can extend the barrier scheduling feature.) 3.
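As a rough illustration of the coordinator pattern described above, here is a minimal Scala sketch of a two-level write contract. The trait and method names are simplified stand-ins, not the actual DSv2 interfaces (whose signatures changed between 2.3 and 2.4):

```scala
// Illustrative stand-ins only -- not the real Spark DSv2 interfaces.

// Runs in each task: stages its output so that it is durable but not yet visible.
trait TaskWriter[T] {
  def write(record: T): Unit
  def commit(): TaskCommitMessage   // finish staging and report back to the driver
  def abort(): Unit                 // clean up this task's staged output on failure
}

// Opaque message a task sends back to the driver, e.g. where it staged its data.
case class TaskCommitMessage(stagedLocation: String)

// Runs on the driver: the central coordinator that makes the single job-level decision.
trait JobWriter[T] {
  def createTaskWriter(partitionId: Int, attemptNumber: Int): TaskWriter[T]
  def commit(messages: Seq[TaskCommitMessage]): Unit   // publish all staged output at once
  def abort(messages: Seq[TaskCommitMessage]): Unit    // discard everything that was staged
}
```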

Re: Branch 2.4 is cut

2018-09-10 Thread Wenchen Fan
Since it's not a clean revert, I've sent a PR to revert it from 2.4, please take a look, thanks! https://github.com/apache/spark/pull/22388 On Tue, Sep 11, 2018 at 1:16 AM Ryan Blue wrote: > SPARK-24882 was committed in order to make some progress, with a note > about following up with

[DISCUSS][K8S] consistent application id across driver and executors

2018-09-10 Thread onursatici
Hello, Spark on Kubernetes adds a label to group pods belonging to the same application. This value is obtained from SchedulerBackend::applicationId for executors, but for the driver this is set by spark-submit, in KubernetesClientApplication, to a different value. Should we change this? A fix

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Jungtaek Lim
> And regarding the issue that Jungtaek brought up, 2PC doesn't require tasks to be running at the same time; we need a mechanism to take down tasks after they have prepared and bring up the tasks during the commit phase. I guess we already got into too much detail here, but if it is based on

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
I'm not sure if it was Sqoop or another similar system, but I've heard about at least one Hadoop-to-RDBMS system working by writing to temporary staging tables in the database. The commit then runs a big INSERT INTO that selects from all the temporary tables and cleans up afterwards by deleting them.
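A hedged sketch of that staging-table approach over plain JDBC (the URL, table names, and schema are made up for illustration): each task loads its rows into its own staging table, and a driver-side commit folds them into the final table and cleans up.

```scala
import java.sql.{Connection, DriverManager}

// Assumed placeholders: a final table "events" and per-task tables "events_stage_<partition>".
def taskWrite(jdbcUrl: String, partitionId: Int, rows: Seq[(Long, String)]): Unit = {
  val conn: Connection = DriverManager.getConnection(jdbcUrl)
  try {
    conn.createStatement().execute(
      s"CREATE TABLE IF NOT EXISTS events_stage_$partitionId (id BIGINT, payload VARCHAR(255))")
    val ps = conn.prepareStatement(s"INSERT INTO events_stage_$partitionId VALUES (?, ?)")
    rows.foreach { case (id, payload) =>
      ps.setLong(1, id); ps.setString(2, payload); ps.addBatch()
    }
    ps.executeBatch()
  } finally conn.close()
}

// Driver-side commit: one transaction that moves every staged table into the final
// table, so readers never see a partially written job.
def driverCommit(jdbcUrl: String, partitionIds: Seq[Int]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  conn.setAutoCommit(false)
  try {
    val stmt = conn.createStatement()
    partitionIds.foreach { p =>
      stmt.execute(s"INSERT INTO events SELECT * FROM events_stage_$p")
      stmt.execute(s"DROP TABLE events_stage_$p")
    }
    conn.commit()
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally conn.close()
}
```

Note that some databases auto-commit DDL, so the DROPs may need to run as a separate cleanup step after the commit rather than inside the same transaction.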

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
Sorry, I accidentally sent a reply to just Jungtaek. The current commit structure works for any table where you can stage data in place and commit in a combined operation. Iceberg does this by writing data files in place and committing them in an atomic update to the table metadata. You can also

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Arun Mahadevan
>Well, in almost all relational databases you can move data in a transactional way. That's what transactions are for. It would work, but I suspect in most cases it would involve moving data from temporary tables to the final tables. Right now there's no mechanism to let the individual tasks commit in

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Jungtaek Lim
Ah yes. I have been thinking about NoSQL things, since output for a Spark workload may not be suitable for an RDBMS (in terms of scale and performance). For an RDBMS it would essentially work (via INSERT ... SELECT). I agree the window of potential failure is pretty short for the HDFS case. I just thought about it

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Reynold Xin
Well, in almost all relational databases you can move data in a transactional way. That's what transactions are for. For just straight HDFS, the move is a pretty fast operation, so while it is not completely transactional, the window of potential failure is pretty short for appends. For writers at the
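For the HDFS case, the "move" is essentially a rename from a staging directory into the final location; a minimal sketch with the Hadoop FileSystem API (the paths are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// Tasks write their files under the staging directory; nothing under the final
// path is visible to readers until the rename below.
val staging = new Path("/data/output/_temporary/job-0001")
val target  = new Path("/data/output/job-0001")

// The "commit" is a single directory rename: fast, and atomic for one rename on
// HDFS, which is why the window of potential failure for appends stays small.
if (!fs.rename(staging, target)) {
  throw new RuntimeException(s"Commit failed: could not rename $staging to $target")
}
```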

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Jungtaek Lim
When network partitioning happens it is pretty OK for me to see 2PC not working, since we are dealing with a global transaction. Recovery would be a hard thing to get right, though. I completely agree it would require massive changes to Spark. What I couldn't find for underlying storages is moving

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Dilip Biswal
Thanks a lot Reynold and Jungtaek Lim. It definitely helped me understand this better. Regards, Dilip Biswal

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Reynold Xin
I don't think two-phase commit would work here at all. 1. It'd require massive changes to Spark. 2. Unless the underlying data source can provide an API to coordinate commits (and few data sources I know of provide something like that), 2PC wouldn't work in the presence of network partitioning.

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Jungtaek Lim
I guess we are all aware of the limitation of the contract on the DSv2 writer. Actually it can be achieved only with the HDFS sink (or other filesystem-based sinks); other external storages are normally not feasible to implement it with, because there's no way to couple a transaction with multiple clients, as well as

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Arun Mahadevan
In some cases the implementations may be OK with eventual consistency (and do not care if the output is written out atomically). XA can be one option for data sources that support it and require atomicity, but I am not sure how one would implement it with the current API. Maybe we need to
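To make the XA option concrete, here is a hedged sketch of the two phases using the standard javax.transaction.xa API; how (or whether) these calls could be wired into the current DSv2 writer callbacks is exactly the open question, and the Xid class below is only an illustration.

```scala
import javax.sql.XAConnection
import javax.transaction.xa.{XAResource, Xid}

// Minimal Xid for illustration; a real setup needs a proper global transaction id scheme.
class SimpleXid(gtrid: Array[Byte], bqual: Array[Byte]) extends Xid {
  override def getFormatId: Int = 1
  override def getGlobalTransactionId: Array[Byte] = gtrid
  override def getBranchQualifier: Array[Byte] = bqual
}

// Phase 1, in each task: perform the writes under an XA branch, then prepare it so
// the work is durable but not yet visible.
def prepareBranch(xaConn: XAConnection, xid: Xid)(writes: java.sql.Connection => Unit): Int = {
  val xaRes = xaConn.getXAResource
  xaRes.start(xid, XAResource.TMNOFLAGS)
  writes(xaConn.getConnection)        // the actual INSERTs/UPDATEs for this partition
  xaRes.end(xid, XAResource.TMSUCCESS)
  xaRes.prepare(xid)
}

// Phase 2, coordinated from the driver once every branch has prepared successfully.
def commitBranch(xaConn: XAConnection, xid: Xid): Unit =
  xaConn.getXAResource.commit(xid, false)   // onePhase = false: commit a prepared branch
```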

custom sink & model transformation

2018-09-10 Thread Stavros Kontopoulos
Hi, Just copying from users, since I got no response. Is it unsafe to do model prediction within a custom sink, e.g. model.transform(df)? I see that the only transformation done is adding a prediction column AFAIK; does that change the execution plan? Thanks, Stavros
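For context, a minimal hedged sketch of what that would look like in a custom structured streaming sink (the class, output path, and write target are made up): transform() only appends the prediction column(s) to the micro-batch DataFrame handed to the sink, it does not rewrite the streaming query's plan.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Hypothetical sink that scores each micro-batch before writing it out.
class ScoringSink(model: PipelineModel, outputPath: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Adds the prediction column(s) to this batch only.
    val scored = model.transform(data)
    // Depending on the Spark version, the batch DataFrame may need to be re-created
    // before handing it to a batch writer; error handling and idempotence are omitted.
    scored.write.mode("append").parquet(s"$outputPath/batch-$batchId")
  }
}
```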

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Dilip Biswal
This is a pretty big challenge in general for data sources -- for the vast majority of data stores, the boundary of a transaction is per client. That is, you can't have two clients doing writes and coordinating a single transaction. That's certainly the case for almost all relational databases.

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Russell Spitzer
I think I mentioned on the Design Doc that with the Cassandra connector we have similar issues. There is no "transaction" or "staging table" capable of really doing what the API requires. On Mon, Sep 10, 2018 at 12:26 PM Reynold Xin wrote: > I don't think the problem is just whether we have a

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Reynold Xin
I don't think the problem is just whether we have a starting point for write. As a matter of fact there's always a starting point for write, whether it is explicit or implicit. This is a pretty big challenge in general for data sources -- for the vast majority of data stores, the boundary of a
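To illustrate "the boundary of a transaction is per client": with plain JDBC the transaction is bound to a single Connection object in a single JVM, so there is no handle that could be shipped to executors so that several of them write inside the same transaction (sketch; the URL and table are placeholders).

```scala
import java.sql.DriverManager

val jdbcUrl = "jdbc:postgresql://host:5432/db"   // placeholder
val conn = DriverManager.getConnection(jdbcUrl)
conn.setAutoCommit(false)                        // the transaction lives on this connection
try {
  val ps = conn.prepareStatement("INSERT INTO target_table VALUES (?, ?)")
  // Every write that must be atomic has to flow through this one connection.
  ps.setInt(1, 1); ps.setString(2, "a"); ps.executeUpdate()
  conn.commit()
} catch {
  case e: Exception => conn.rollback(); throw e
} finally {
  conn.close()
}
```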

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
Ross, I think the intent is to create a single transaction on the driver, write as part of it in each task, and then commit the transaction once the tasks complete. Is that possible in your implementation? I think that part of this is made more difficult by not having a clear starting point for a

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Reynold Xin
Typically people do it via transactions, or staging tables. On Mon, Sep 10, 2018 at 2:07 AM Ross Lawley wrote: > Hi all, > > I've been prototyping an implementation of the DataSource V2 writer for > the MongoDB Spark Connector and I have a couple of questions about how its > intended to be

Re: Branch 2.4 is cut

2018-09-10 Thread Felix Cheung
I'm a bit concerned about what Arun is summarizing. We are building on DSv2 and already have to rewrite for a bunch of changes in master/2.4, increasing the cost of dev work and release management. If we are saying more changes are coming in 3.0, do we have more info on what value the current

DataSourceWriter V2 Api questions

2018-09-10 Thread Ross Lawley
Hi all, I've been prototyping an implementation of the DataSource V2 writer for the MongoDB Spark Connector and I have a couple of questions about how it's intended to be used with database systems. According to the Javadoc for DataWriter.commit(): *"this method should still "hide" the written

Re: Branch 2.4 is cut

2018-09-10 Thread Wenchen Fan
There are a lot of "breaking" changes we made in 2.4 for data source v2, though I agree SPARK-24882 is the most "breaking". I don't agree SPARK-24882 is half-baked. But I'm willing to revert it if we have a bunch of data source v2 users and they are not willing to update their implementation intensely

Re: Branch 2.4 is cut

2018-09-10 Thread Arun Mahadevan
Ryan's proposal makes a lot of sense. It's better not to release half-baked changes in 2.4, which not only break a lot of the APIs released in 2.3 but are also expected to change further due to redesigns before 3.0, so I don't see much value in releasing them in 2.4. On Sun, 9 Sep 2018 at 22:42, Wenchen Fan