Re: Sorting on a streaming dataframe

2018-05-01 Thread Hemant Bhanawat
Opened an issue: https://issues.apache.org/jira/browse/SPARK-24144 Since it is a major issue for us, I have marked its priority as Major. Feel free to change it if that is not the case from Spark's perspective. On Tue, May 1, 2018 at 4:34 AM, Michael Armbrust wrote: > Please

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Thakrar, Jayesh
Just wondering: given that V2 is currently less performant because of its use of Row vs. InternalRow (and other things?), is still evolving, and is missing some of the features of V1, it might help to focus on remediating those gaps and then look at porting the file sources over. As for

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Joseph Torres
I agree that Spark should fully handle state serialization and recovery for most sources. This is how it works in V1, and we definitely wouldn't want or need to change that in V2.* The question is just whether we should have an escape hatch for the sources that don't want Spark to do that, and if

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Ryan Blue
I think there's a difference. You're right that we wanted to clean up the API in V2 to avoid file sources using side channels. But there's a big difference between adding, for example, a way to report partitioning and designing for sources that need unbounded state. It's a judgment call, but I

Re: [build system] jenkins master unreachable, build system currently down

2018-05-01 Thread Joseph Bradley
Thank you Shane!! On Tue, May 1, 2018 at 8:58 AM, Xiao Li wrote: > Thank you very much, Shane! Yeah, it works now! > > Xiao > > > 2018-05-01 8:40 GMT-07:00 shane knapp : > >> and we're back! there was apparently a firewall migration yesterday that >>

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Ryan Blue
This is usually caused by skew. Sometimes you can work around it by increasing the number of partitions like you tried, but when that doesn’t work you need to change the partitioning that you’re using. If you’re aggregating, try adding an intermediate aggregation. For example, if your query is
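
A minimal sketch of the intermediate-aggregation idea in plain Python (not Spark code, and not the example query from the message, which is cut off in this listing). It assumes a simple sum per key; the `salt` column, the `num_salts` parameter, and the function name are hypothetical, introduced only for illustration. Stage 1 aggregates by (key, salt) so that a hot key is spread across several groups; stage 2 combines those partial results by key.

```python
from collections import defaultdict


def two_stage_sum(rows, num_salts=4):
    """Sum values per key in two stages to spread out a skewed key.

    rows: iterable of (key, value) pairs.
    num_salts: hypothetical number of sub-groups per key (an assumption).
    Returns (final totals per key, partial totals per (key, salt)).
    """
    # Stage 1: intermediate aggregation on (key, salt).
    # The salt is assigned round-robin here so the sketch is deterministic;
    # a random or hash-based salt would serve the same purpose.
    partial = defaultdict(int)
    for i, (key, value) in enumerate(rows):
        salt = i % num_salts
        partial[(key, salt)] += value

    # Stage 2: final aggregation on key alone, over the much smaller
    # set of partial results.
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal

    return dict(final), dict(partial)


# One heavily skewed key ("hot") and one small key ("cold").
rows = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals, partial = two_stage_sum(rows)
```

In this sketch the 1000 "hot" rows are split into four intermediate groups of 250 before the final sum, so no single group has to hold all of the skewed key's rows, while the final totals match a direct sum per key.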

Re: [build system] jenkins master unreachable, build system currently down

2018-05-01 Thread Xiao Li
Thank you very much, Shane! Yeah, it works now! Xiao 2018-05-01 8:40 GMT-07:00 shane knapp : > and we're back! there was apparently a firewall migration yesterday that > went sideways. > > shane > > On Mon, Apr 30, 2018 at 8:27 PM, shane knapp wrote:

Re: [build system] jenkins master unreachable, build system currently down

2018-05-01 Thread shane knapp
and we're back! there was apparently a firewall migration yesterday that went sideways. shane On Mon, Apr 30, 2018 at 8:27 PM, shane knapp wrote: > we just noticed that we're unable to connect to jenkins, and have reached > out to our NOC support staff at our colo. until

ApacheCon North America 2018 schedule is now live.

2018-05-01 Thread Rich Bowen
Dear Apache Enthusiast, We are pleased to announce our schedule for ApacheCon North America 2018. ApacheCon will be held September 23-27 at the Montreal Marriott Chateau Champlain in Montreal, Canada. Registration is open! The early bird rate of $575 lasts until July 21, at which time it

org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Pralabh Kumar
Hi, I am getting the above error in Spark SQL. I have increased the number of partitions (to 5000) but am still getting the same error. My data is most probably skewed. org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829 at

PySpark.sql.filter not performing as it should

2018-05-01 Thread 880f0464
Hi Everyone, I wonder if someone could be so kind as to shed some light on this problem: [PySpark.sql.filter not performing as it should](https://stackoverflow.com/q/49995538) Cheers, A. Sent with [ProtonMail](https://protonmail.com) Secure Email.

spark.python.worker.reuse not working as expected

2018-05-01 Thread 880f0464
Hi Everyone, I wonder if someone could be so kind as to shed some light on this problem: [spark.python.worker.reuse not working as expected](https://stackoverflow.com/q/50043684) Cheers, A. Sent with [ProtonMail](https://protonmail.com) Secure Email.

UnresolvedException: Invalid call to dataType on unresolved object

2018-05-01 Thread 880f0464
Hi Everyone, I wonder if someone could be so kind as to shed some light on this problem: [UnresolvedException: Invalid call to dataType on unresolved object when using DataSet constructed from Seq.empty (since Spark 2.3.0)](https://stackoverflow.com/q/49757487) Cheers, A. Sent with