[Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-02 Thread Pranav Agrawal
can't get around this error when performing union of two datasets (ds1.union(ds2)) having complex data types (struct, list): *18/06/02 15:12:00 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.sql.AnalysisException: Union can only be performed
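
Not part of the original message, but a minimal Scala sketch of one common fix: this AnalysisException usually means the two schemas differ in column order or in the nested struct/array field types, so casting both sides to a single reference schema before the union often resolves it. ds1 and ds2 are the datasets from the message; everything else here is an assumption.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.functions.col

    // Cast every column of df to the corresponding type in the target schema.
    // This assumes the nested struct/array types are actually cast-compatible.
    def alignTo(df: DataFrame, target: StructType): DataFrame =
      df.select(target.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

    val targetSchema = ds1.schema                      // pick one side as the reference
    val unioned = alignTo(ds1.toDF(), targetSchema)
      .union(alignTo(ds2.toDF(), targetSchema))

On Spark 2.3+, Dataset.unionByName can also help when only the top-level column order differs, but in the 2.x releases it does not reorder fields inside nested structs.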

Re: Append In-Place to S3

2018-06-02 Thread Aakash Basu
As Jay correctly suggested, if you're joining then overwrite, otherwise only append, as that removes the dups. I think in this scenario you can just change it to write.mode('overwrite'), because you're already reading the old data, so your job would be done. On Sat, 2 Jun 2018, 10:27 PM Benjamin Kim wrote:

Re: Append In-Place to S3

2018-06-02 Thread Benjamin Kim
Hi Jay, Thanks for your response. Are you saying to append the new data and then remove the duplicates from the whole data set afterwards, overwriting the existing data set with the new data set with the appended values? I will give that a try. Cheers, Ben On Fri, Jun 1, 2018 at 11:49 PM Jay wrote: >
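
Not from the thread itself, but a rough Scala sketch of the flow being discussed: read the existing data, union in the new records, drop duplicates, and write everything back with overwrite. The path, key column, and newData name are assumptions; overwriting the very path you just read from is only safe if the combined result is fully materialized first (or written to a staging location).

    val s3Path  = "s3a://bucket/table"        // assumed location of the existing data
    val keyCols = Seq("id")                   // assumed unique-key column(s)

    val existing = spark.read.parquet(s3Path)
    val combined = existing.union(newData).dropDuplicates(keyCols)

    // Overwrite replaces the old files with the deduplicated result.
    combined.write.mode("overwrite").parquet(s3Path)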

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-06-02 Thread Timur Shenkao
Did you use RDDs or DataFrames? What is the Spark version? On Mon, May 28, 2018 at 10:32 PM, Saulo Sobreiro wrote: > Hi, > I ran a few more tests and found that even with a lot more operations on > the scala side, python is outperformed... > > Dataset Stream duration: ~3 minutes (csv formatted
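
To make the RDD-vs-DataFrame question concrete, here is a hedged Scala sketch of the two write paths, assuming the DataStax spark-cassandra-connector is on the classpath and that the keyspace "ks" and table "events" (made-up names) already exist. someRdd and someDf stand in for whatever the streaming job produces.

    import com.datastax.spark.connector._   // adds saveToCassandra to RDDs

    // RDD path (classic Spark Streaming / DStream style):
    someRdd.saveToCassandra("ks", "events")

    // DataFrame path via the connector's data source:
    someDf.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .mode("append")
      .save()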

Re: Append In-Place to S3

2018-06-02 Thread vincent gromakowski
Structured Streaming can provide idempotent and exactly-once writes in parquet, but I don't know how it does that under the hood. Without this you need to load your entire dataset, then dedup, then write back the entire dataset. This overhead can be minimized by partitioning the output files. On Fri, 1
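
A small Scala sketch of what this would look like with the Structured Streaming parquet sink (the paths and partition column are assumptions): the sink's checkpoint and file-sink log are what give the idempotent, exactly-once behaviour, and partitioning the output keeps any later dedup or rewrite limited to a few partitions rather than the whole dataset.

    val query = streamingDf.writeStream
      .format("parquet")
      .option("path", "s3a://bucket/output")             // assumed output path
      .option("checkpointLocation", "s3a://bucket/chk")  // assumed checkpoint path
      .partitionBy("dt")                                 // assumed partition column
      .outputMode("append")
      .start()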

Re: Append In-Place to S3

2018-06-02 Thread Jay
Benjamin, The append will append the "new" data to the existing data without removing the duplicates. You would need to overwrite the file every time if you need unique values. Thanks, Jayadeep On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim wrote: > I have a situation where I'm trying to add only new