[Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-02 Thread Pranav Agrawal
can't get around this error when performing union of two datasets (ds1.union(ds2)) having complex data types (struct, list): *18/06/02 15:12:00 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.sql.AnalysisException: Union can only be performed
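
Not part of the original message, but a minimal Scala sketch of one common fix: this AnalysisException usually means the two schemas differ in column order or in the nested struct/array field types, so casting both sides to a single reference schema before the union often resolves it. ds1 and ds2 are the datasets from the message; everything else here is an assumption.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.functions.col

    // Cast every column of df to the corresponding type in the target schema.
    // This assumes the nested struct/array types are actually cast-compatible.
    def alignTo(df: DataFrame, target: StructType): DataFrame =
      df.select(target.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

    val targetSchema = ds1.schema                      // pick one side as the reference
    val unioned = alignTo(ds1.toDF(), targetSchema)
      .union(alignTo(ds2.toDF(), targetSchema))

On Spark 2.3+, Dataset.unionByName can also help when only the top-level column order differs, but in the 2.x releases it does not reorder fields inside nested structs.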

Re: Append In-Place to S3

2018-06-02 Thread Aakash Basu
As Jay correctly suggested, if you're joining then overwrite, otherwise only append, as that removes the dups. I think in this scenario you can just change it to write.mode('overwrite'), because you're already reading the old data, so your job would be done. On Sat, 2 Jun 2018, 10:27 PM Benjamin Kim wrote:

Re: Append In-Place to S3

2018-06-02 Thread Benjamin Kim
Hi Jay, Thanks for your response. Are you saying to append the new data and then remove the duplicates from the whole data set afterwards, overwriting the existing data set with the new data set with the appended values? I will give that a try. Cheers, Ben On Fri, Jun 1, 2018 at 11:49 PM Jay wrote: >
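
Not from the thread itself, but a rough Scala sketch of the flow being discussed: read the existing data, union in the new records, drop duplicates, and write everything back with overwrite. The path, key column, and newData name are assumptions; overwriting the very path you just read from is only safe if the combined result is fully materialized first (or written to a staging location).

    val s3Path  = "s3a://bucket/table"        // assumed location of the existing data
    val keyCols = Seq("id")                   // assumed unique-key column(s)

    val existing = spark.read.parquet(s3Path)
    val combined = existing.union(newData).dropDuplicates(keyCols)

    // Overwrite replaces the old files with the deduplicated result.
    combined.write.mode("overwrite").parquet(s3Path)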

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-06-02 Thread Timur Shenkao
Did you use RDDs or DataFrames? What is the Spark version? On Mon, May 28, 2018 at 10:32 PM, Saulo Sobreiro wrote: > Hi, > I ran a few more tests and found that even with a lot more operations on > the scala side, python is outperformed... > > Dataset Stream duration: ~3 minutes (csv formatted
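
To make the RDD-vs-DataFrame question concrete, here is a hedged Scala sketch of the two write paths, assuming the DataStax spark-cassandra-connector is on the classpath and that the keyspace "ks" and table "events" (made-up names) already exist. someRdd and someDf stand in for whatever the streaming job produces.

    import com.datastax.spark.connector._   // adds saveToCassandra to RDDs

    // RDD path (classic Spark Streaming / DStream style):
    someRdd.saveToCassandra("ks", "events")

    // DataFrame path via the connector's data source:
    someDf.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .mode("append")
      .save()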

Re: Append In-Place to S3

2018-06-02 Thread vincent gromakowski
Structured Streaming can provide idempotent and exactly-once writes in parquet, but I don't know how it does that under the hood. Without this you need to load your entire dataset, then dedup, then write back the entire dataset. This overhead can be minimized by partitioning the output files. On Fri, 1
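
A small Scala sketch of what this would look like with the Structured Streaming parquet sink (the paths and partition column are assumptions): the sink's checkpoint and file-sink log are what give the idempotent, exactly-once behaviour, and partitioning the output keeps any later dedup or rewrite limited to a few partitions rather than the whole dataset.

    val query = streamingDf.writeStream
      .format("parquet")
      .option("path", "s3a://bucket/output")             // assumed output path
      .option("checkpointLocation", "s3a://bucket/chk")  // assumed checkpoint path
      .partitionBy("dt")                                 // assumed partition column
      .outputMode("append")
      .start()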

Re: Append In-Place to S3

2018-06-02 Thread Jay
Benjamin, The append will append the "new" data to the existing data without removing the duplicates. You would need to overwrite the file every time if you need unique values. Thanks, Jayadeep On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim wrote: > I have a situation where I'm trying to add only new