Re: Implementation for exactly-once streaming sink

2018-12-06 Thread Eric Wohlstadter
Thanks Arun. In our case, we only commit sink task output to the datastore by coordinating with the driver. Sink tasks write output to a "staging" area, and the driver only commits the staging data to a datastore once all tasks for a micro-batch have reported success back to the driver. In the
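
To make the coordination concrete, here is a minimal sketch of that staging-and-commit flow (hypothetical helper names, local filesystem only, not the actual sink code): each task writes its partition's output under a batch-scoped staging directory and reports the staged path back to the driver, and the driver promotes the staged files only after every task has reported success.

    import java.nio.file.{Files, Path, StandardCopyOption}

    object StagingCommitSketch {

      // Runs in each sink task: write this partition's output under a batch-scoped
      // staging directory and return the staged path as the task's "success report".
      def writeStaged(stagingRoot: Path, batchId: Long, partitionId: Int,
                      rows: Iterator[String]): Path = {
        val staged = stagingRoot.resolve(s"batch=$batchId").resolve(s"part-$partitionId")
        Files.createDirectories(staged.getParent)
        Files.write(staged, rows.mkString("\n").getBytes("UTF-8"))
        staged
      }

      // Runs on the driver, only once all tasks of the micro-batch have succeeded:
      // promote every staged file into the final location.
      def commitBatch(stagedFiles: Seq[Path], finalDir: Path): Unit = {
        Files.createDirectories(finalDir)
        stagedFiles.foreach { f =>
          Files.move(f, finalDir.resolve(f.getFileName), StandardCopyOption.ATOMIC_MOVE)
        }
      }
    }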

Re: Implementation for exactly-once streaming sink

2018-12-06 Thread Arun M
Hi Eric, I think it will depend on how you implement the sink and when the data in the sink partitions is committed. I think the batch can be repeated during task retries, as well as if the driver fails before the batch id is committed in Spark's checkpoint. In the first case maybe the sink had
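
One common guard against both failure modes (a hedged sketch, not from this thread): have the sink itself record the last committed batch id, atomically with the data, and make the driver-side commit a no-op when it sees a batch id it has already committed, so task retries and post-failure replays cannot double-apply a batch.

    // Hypothetical sketch: an idempotent driver-side commit keyed on the batch id.
    class IdempotentCommitter(loadLastCommitted: () => Option[Long],
                              doCommit: Long => Unit) {

      def commitIfNew(batchId: Long): Unit =
        if (loadLastCommitted().exists(_ >= batchId)) {
          // Already committed: e.g. the driver failed after the sink commit but before
          // the batch id reached Spark's checkpoint, so the batch is being replayed.
          ()
        } else {
          doCommit(batchId) // must persist batchId together with the promoted data
        }
    }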

Re: Implementation for exactly-once streaming sink

2018-12-06 Thread Eric Wohlstadter
Hi Arun, Gabor, Thanks for the feedback. We are using the "Exactly-once using transactional writes" approach, so we don't rely on message keys for idempotent writes. I should clarify that my question is specific to that approach. We are following the

Re: Run a specific PySpark test or group of tests

2018-12-06 Thread Xiao Li
Yes! This is very helpful! On Wed, Dec 5, 2018 at 9:21 PM Wenchen Fan wrote:
> great job! thanks a lot!
>
> On Thu, Dec 6, 2018 at 9:39 AM Hyukjin Kwon wrote:
>> It's merged now and in developer tools page -
>> http://spark.apache.org/developer-tools.html#individual-tests
>> Have some func

proposal for expanded & consistent timestamp types

2018-12-06 Thread Imran Rashid
Hi, I'd like to discuss the future of timestamp support in Spark, in particular with respect to handling timezones in different SQL types. In a nutshell:
* There are at least 3 different ways of handling the timestamp type across timezone changes
* We'd like Spark to clearly distinguish the 3
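
To illustrate why the distinction matters, a small spark-shell sketch (illustrative only, not taken from the proposal): Spark's existing TimestampType stores an instant and renders it in the session time zone, so the same stored value displays differently when the session time zone changes.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("tz-demo").getOrCreate()
    val df = spark.sql("SELECT CAST(0 AS TIMESTAMP) AS ts")  // the epoch instant

    spark.conf.set("spark.sql.session.timeZone", "UTC")
    df.show()  // renders as 1970-01-01 00:00:00

    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    df.show()  // the same instant, rendered as 1969-12-31 16:00:00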

Re: Implementation for exactly-once streaming sink

2018-12-06 Thread Gabor Somogyi
Hi Eric, In order to have exactly-once one needs a replayable source and an idempotent sink. The cases you've mentioned cover the 2 main groups of issues. Practically any kind of programming problem can end up producing duplicated data (even in the code which feeds Kafka). Don't know why you have
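
For contrast with the transactional-write approach discussed above, a minimal sketch of what a key-based idempotent sink looks like (an illustrative in-memory example, not from this thread): replaying the same keyed records after a failure overwrites instead of duplicating.

    import scala.collection.concurrent.TrieMap

    // Illustrative only: writes are keyed upserts, so re-delivered records are harmless.
    class KeyedIdempotentSink[K, V] {
      private val store = TrieMap.empty[K, V]

      def write(key: K, value: V): Unit = { store.put(key, value) }  // replay-safe

      def snapshot: Map[K, V] = store.toMap
    }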