Re: [pyspark 2.3+] CountDistinct

2019-06-28 Thread Rishi Shah
Hi All, Just wanted to check in to see if anyone has any insight about this behavior. Any pointers would help. Thanks, Rishi On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah wrote: > Hi All, > > Recently we noticed that countDistinct on a larger dataframe doesn't > always return the same value. Any

Re: Spark Structured Streaming Custom Sources confusion

2019-06-28 Thread Lars Francke
Hi Gabor, sure, the DSv2 seems to be undergoing backward-incompatible changes from Spark 2 -> 3 though, right? That combined with the fact that the API is pretty new still doesn't instill confidence in its stability (API wise I mean). Cheers, Lars On Fri, Jun 28, 2019 at 4:10 PM Gabor Somogyi

Re: Spark Structured Streaming Custom Sources confusion

2019-06-28 Thread Gabor Somogyi
Hi Lars, DSv2 is already used in production. As for documentation, since Spark is evolving fast I would take a look at how the built-in connectors are implemented. BR, G On Fri, Jun 28, 2019 at 3:52 PM Lars Francke wrote: > Gabor, > > thank you. That is immensely helpful. DataSource v1 it is then. Does

Re: Spark Structured Streaming Custom Sources confusion

2019-06-28 Thread Lars Francke
Gabor, thank you. That is immensely helpful. DataSource v1 it is then. Does that mean DSv2 is not really for production use yet? Any idea what the best documentation would be? I'd probably start by looking at existing code. Cheers, Lars On Fri, Jun 28, 2019 at 1:06 PM Gabor Somogyi wrote: >

Re: Spark Structured Streaming Custom Sources confusion

2019-06-28 Thread Gabor Somogyi
Hi Lars, Structured Streaming doesn't support receivers at all, so that source/sink can't be used. Data source v2 is under development and is therefore a moving target, so I suggest implementing it with v1 (unless special features are required from v2). Additionally since I've just