Re: Should python-2 be supported in Spark 3.0?

2019-05-29 Thread shane knapp
> I don't have a good sense of the overhead of continuing to support
> Python 2; is it large enough to consider dropping it in Spark 3.0?

from the build/test side, it will actually be pretty easy to continue support for python2.7 for spark 2.x as the feature sets won't be expanding. that

Re: Should python-2 be supported in Spark 3.0?

2019-05-29 Thread Jules Damji
Here’s the tweet from the horse’s mouth: https://twitter.com/gvanrossum/status/1133496146700058626?s=21

Cheers,
Jules

Sent from my iPhone. Pardon the dumb thumb typos :)

> On May 29, 2019, at 10:12 PM, Sean Owen wrote:
>
> Deprecated -- certainly and sooner than later.
> I don't have a

Re: Should python-2 be supported in Spark 3.0?

2019-05-29 Thread Sean Owen
Deprecated -- certainly and sooner than later. I don't have a good sense of the overhead of continuing to support Python 2; is it large enough to consider dropping it in Spark 3.0?

On Wed, May 29, 2019 at 11:47 PM Xiangrui Meng wrote:
>
> Hi all,
>
> I want to revive this old thread since no

Re: Should python-2 be supported in Spark 3.0?

2019-05-29 Thread Xiangrui Meng
Hi all, I want to revive this old thread since no action has been taken so far. If we plan to mark Python 2 as deprecated in Spark 3.0, we should do it as early as possible and let users know ahead of time. PySpark depends on Python, numpy, pandas, and pyarrow, all of which are sunsetting Python 2 support by

Re: Upsert for hive tables

2019-05-29 Thread Aakash Basu
Don't you have a date/timestamp to handle updates? So you're talking about CDC? If you have a datestamp, you can check whether that/those key(s) exist; if they do, check whether the timestamp matches; if it matches, ignore the record, and if it doesn't, update it.

On Thu 30 May, 2019, 7:11 AM Genieliu, wrote: >
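A minimal PySpark sketch of the key-plus-timestamp check described above; the table names (incoming_changes, target_table) and columns (id, updated_at) are hypothetical, not from the thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    incoming = spark.table("incoming_changes")                 # CDC delta
    current = spark.table("target_table").select("id", "updated_at")

    # Left join on the key: a null right-hand side means the key is new.
    joined = incoming.alias("i").join(current.alias("c"), on="id", how="left")

    # Keep new keys, plus existing keys whose incoming timestamp is newer;
    # rows whose timestamp matches are ignored, as suggested above.
    to_apply = joined.filter(
        "c.updated_at IS NULL OR i.updated_at > c.updated_at"
    ).select("i.*")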

Re: Upsert for hive tables

2019-05-29 Thread Genieliu
Aren't step 1 and step 2 producing a copy of Table A?

adding a column to a groupBy (dataframe)

2019-05-29 Thread Marcelo Valle
Hi all, I am new to Spark and I am trying to write an application using dataframes that normalizes data. I have a dataframe `denormalized_cities` with 3 columns: COUNTRY, CITY, CITY_NICKNAME. Here is what I want to do:

1. Map by country, then for each country generate a new ID and write
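The message is truncated, but the stated goal (a new ID per country) fits a common pattern: number the distinct countries, then join the IDs back. A minimal sketch under that assumption; COUNTRY_ID is a hypothetical column name:

    from pyspark.sql import functions as F, Window

    # Number the distinct countries. Ordering a window without a partition
    # pulls the rows to one partition, which is acceptable for a small
    # dimension such as countries.
    w = Window.orderBy("COUNTRY")
    countries = (denormalized_cities
                 .select("COUNTRY").distinct()
                 .withColumn("COUNTRY_ID", F.dense_rank().over(w)))

    # Attach the generated IDs to the original rows.
    normalized = denormalized_cities.join(countries, on="COUNTRY")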

[Spark Streaming]: Spark Checkpointing: Content, Recovery and Idempotency

2019-05-29 Thread Sheel Pancholi
Hello, I am trying to understand the *content* of a checkpoint and the corresponding recovery; understanding the process of checkpointing is obviously the natural way of going about it, and so I went over the following list:
- medium post

Re: Upsert for hive tables

2019-05-29 Thread Aakash Basu
Why don't you simply copy the whole of the delta data (Table A) into a stage table (a temp table in your case) and insert based on a *WHERE NOT EXISTS* check on the primary key/composite key that already exists in Table B? That's faster and does the reconciliation job smoothly enough. Others, any
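A minimal Spark SQL sketch of that staging pattern, assuming Spark 2.0+ (which supports correlated NOT EXISTS subqueries); the names stage_a, table_b and the key column id are placeholders:

    # Load the delta (Table A) into a temporary stage view first...
    spark.table("table_a").createOrReplaceTempView("stage_a")

    # ...then append only the keys that Table B does not already hold.
    spark.sql("""
        INSERT INTO table_b
        SELECT * FROM stage_a s
        WHERE NOT EXISTS (
            SELECT 1 FROM table_b b WHERE b.id = s.id
        )
    """)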

Spark Streaming: Checkpoint, Recovery and Idempotency

2019-05-29 Thread sheelstera
Hello, I am trying to understand the content of a checkpoint and the corresponding recovery.

*My understanding of Spark Checkpointing:* If you have really long DAGs and your Spark cluster fails, checkpointing helps by persisting intermediate state, e.g. to HDFS. So, a DAG of 50 transformations can
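A minimal sketch of the mechanism being described: set a checkpoint directory, mark an RDD for checkpointing, and let the first action write the checkpoint and truncate the lineage. The HDFS path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
    sc = spark.sparkContext
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # where checkpoint content lands

    rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
    rdd.checkpoint()  # marked for checkpointing; lineage is cut once written
    rdd.count()       # the first action materializes and writes the checkpoint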

Re: Upsert for hive tables

2019-05-29 Thread Tomasz Krol
Hey Aakash,

That will work for records which don't exist yet in the target table. What about records which have to be updated? As I mentioned, I want to do an upsert: add the records that don't exist yet and update those which already exist.

Thanks
Tom

On Wed 29 May 2019 at 18:39,
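Plain parquet tables (without Hive ACID or Delta) have no row-level UPDATE, so upserts are usually hand-rolled by rewriting the target: keep the target rows whose keys are not in the delta, union in the delta, and write the result out. A minimal sketch, assuming a single key column id; the table names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    delta = spark.table("table_a")    # small delta, e.g. 50GB
    target = spark.table("table_b")   # large target, e.g. 3TB

    # Target rows whose keys are NOT being upserted...
    unchanged = target.join(delta.select("id"), on="id", how="left_anti")

    # ...plus every delta row: new keys get inserted, existing keys replaced.
    upserted = unchanged.unionByName(delta)

    # Write to a staging table and swap afterwards, so the job never reads
    # and overwrites table_b at the same time.
    upserted.write.mode("overwrite").saveAsTable("table_b_staged")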

Upsert for hive tables

2019-05-29 Thread Tomasz Krol
Hey guys, I am wondering what your approach to the following scenario would be: I have two tables. One (Table A) is relatively small (e.g. 50GB) and the second one (Table B) is much bigger (e.g. 3TB). Both are parquet tables. I want to ADD all records from Table A to Table B which don't exist in Table B

Re: Does Spark SQL has match_recognize?

2019-05-29 Thread kant kodali
Nope. Not at all.

On Sun, May 26, 2019 at 8:15 AM yeikel valdes wrote:
> Isn't match_recognize just a filter?
>
> df.filter(predicate)?
>
> On Sat, 25 May 2019 12:55:47 -0700, kanth...@gmail.com wrote:
> > Hi All,
> > Does Spark SQL have match_recognize? I am not sure why CEP

Re: Executors idle, driver heap exploding and maxing only 1 cpu core

2019-05-29 Thread Akshay Bhardwaj
Hi,

A few thoughts to add to Nicholas' apt reply. We were loading multiple files from AWS S3 in our Spark application. When the Spark step that loads the files is called, the driver spends significant time fetching the exact paths of the files from AWS S3, especially because we specified S3 paths like regex
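The preview is truncated, but the issue it describes (driver-side listing when input paths are given as patterns) can be illustrated: a glob forces the driver to enumerate matching S3 objects, while explicit prefixes avoid most of that listing. A sketch; the bucket and prefixes are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Glob pattern: the driver must list the bucket to expand the wildcard,
    # which can be slow on large S3 prefixes.
    glob_df = spark.read.parquet("s3a://my-bucket/events/2019-05-*/")

    # Explicit paths: no wildcard expansion, far less driver-side listing.
    explicit_df = spark.read.parquet(
        "s3a://my-bucket/events/2019-05-28/",
        "s3a://my-bucket/events/2019-05-29/",
    )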