Re: Sorting in Spark on multiple partitions

2018-06-03 Thread Jörn Franke
You partition by userid, why do you then sort again by userid in the partition? Can you try to remove userid from the sort? How do you check if the sort is correct or not? What is the underlying objective of the sort? Do you have more information on schema and data? > On 4. Jun 2018, at

Sorting in Spark on multiple partitions

2018-06-03 Thread Sing, Jasbir
Hi Team, We are currently using Spark 2.2.0 and facing some challenges in sorting data across multiple partitions. We have tried the approaches below: 1. Spark SQL approach: * var query = "select * from data distribute by " + userid + " sort by " + userid + ", " + time This query
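
The semantics of `distribute by ... sort by ...` in the query above can be modelled in plain Python, without Spark: rows are routed to a partition by a hash of the key, and each partition is then sorted on its own. This is only a sketch under stated assumptions. The sample rows, `num_partitions`, and the helper name `distribute_and_sort` are hypothetical, and Spark's actual hash partitioner differs from Python's `hash`.

```python
from collections import defaultdict


def distribute_and_sort(rows, num_partitions):
    """Plain-Python model of DISTRIBUTE BY userid SORT BY userid, time (not Spark)."""
    partitions = defaultdict(list)
    for row in rows:
        # Every row with the same userid is routed to the same partition.
        partitions[hash(row["userid"]) % num_partitions].append(row)
    # Each partition is sorted independently; no global order is implied.
    return {
        pid: sorted(part, key=lambda r: (r["userid"], r["time"]))
        for pid, part in partitions.items()
    }


# Hypothetical sample data, for illustration only.
rows = [
    {"userid": 2, "time": 30},
    {"userid": 1, "time": 20},
    {"userid": 2, "time": 10},
    {"userid": 1, "time": 5},
    {"userid": 3, "time": 7},
]
result = distribute_and_sort(rows, num_partitions=2)
```

In this model a given userid lives in exactly one partition and its rows are ordered by time there, but reading the partitions back to back does not give a globally sorted result. That is one thing to keep in mind when checking whether a sort "is correct".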

Re: [Spark SQL] Is it possible to do stream to stream inner join without event time?

2018-06-03 Thread Becket Qin
Bump. Any direction would be helpful. Thanks. On Fri, Jun 1, 2018 at 6:10 PM, Becket Qin wrote: > Hi, > > I am new to Spark and I'm trying to run a few queries from TPC-H using > Spark SQL. > > According to the documentation here >

Spark task default timeout

2018-06-03 Thread Shushant Arora
Hi, I have a Spark application where the driver starts a few tasks, and in each task, which is a VoidFunction, I have a long-running infinite loop. I have set speculative execution to false. Will Spark kill my task after some time (timeout), or will the tasks run indefinitely? If tasks will be killed after

Re: Append In-Place to S3

2018-06-03 Thread Tayler Lawrence Jones
Sorry, actually my last message is not true for anti join; I was thinking of semi join. -TJ On Sun, Jun 3, 2018 at 14:57 Tayler Lawrence Jones wrote: > A left join with null filter is only the same as a left anti join if the > join keys can be guaranteed unique in the existing data. Since Hive
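
The distinction in this correction can be checked with a small plain-Python model of the joins (not Spark itself). It is a sketch under assumptions: the `new` and `existing` rows and the helper names are hypothetical, and NULL join keys are ignored. With duplicate keys on the right side, left join plus null filter still matches left anti join, while keeping only the matched rows of the left join does not match left semi join.

```python
def left_join(left, right, key):
    """Return (left_row, right_row) pairs; right_row is None when nothing matches."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend((l, r) for r in matches)  # one output row per match
        else:
            out.append((l, None))
    return out


def left_anti_join(left, right, key):
    """Left rows whose key has no match on the right."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] not in right_keys]


def left_semi_join(left, right, key):
    """Left rows whose key has at least one match on the right, each emitted once."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] in right_keys]


# Hypothetical data: id 1 is duplicated in the existing table.
new = [{"id": 1}, {"id": 2}, {"id": 3}]
existing = [{"id": 1}, {"id": 1}, {"id": 2}]

joined = left_join(new, existing, "id")
null_filtered = [l for l, r in joined if r is None]   # left join + null filter
matched = [l for l, r in joined if r is not None]     # matched rows of the join
anti = left_anti_join(new, existing, "id")
semi = left_semi_join(new, existing, "id")
```

Here `null_filtered` and `anti` agree despite the duplicate, whereas `matched` repeats the row for id 1 and `semi` does not.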

Re: Append In-Place to S3

2018-06-03 Thread Tayler Lawrence Jones
A left join with null filter is only the same as a left anti join if the join keys can be guaranteed unique in the existing data. Since Hive tables on S3 offer no uniqueness guarantees outside of your processing code, I recommend using left anti join over left join + null filter. -TJ On Sun, Jun 3,

Re: Append In-Place to S3

2018-06-03 Thread ayan guha
I do not use anti join semantics, but you can use a left outer join and then filter out nulls from the right side. Your data may have dups on the columns separately, but it should not have dups on the composite key, i.e. all columns put together. On Mon, 4 Jun 2018 at 6:42 am, Tayler Lawrence Jones wrote:

Re: Append In-Place to S3

2018-06-03 Thread Tayler Lawrence Jones
The issue is not append vs overwrite - perhaps those responders do not know anti join semantics. Further, overwrite on S3 is a bad pattern due to S3 eventual consistency issues. First, your SQL query is wrong, as you don’t close the parenthesis of the CTE (“with” part). In fact, it looks like

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-03 Thread Alessandro Solimando
Hi Pranav, I don't have an answer to your issue, but what I generally do in these cases is first simplify it to a point where it is easier to check what's going on, and then add back "pieces" one by one until I spot the error. In your case I can suggest to: 1) project the dataset to