Re: Arrow type issue with Pandas UDF

2018-07-19 Thread Bryan Cutler
Hi Patrick, It looks like it's failing in Scala before it even gets to Python to execute your UDF, which is why it doesn't seem to matter what's in your UDF. Since you are doing a grouped map UDF, maybe your group sizes are too big or skewed? Could you try to reduce the size of your groups by
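[Editor's note: a minimal sketch of the grouped-map pattern Bryan refers to, under the Spark 2.3 API; the DataFrame name, columns, and logic are illustrative assumptions, not the original poster's code.]

    # Spark 2.3 grouped-map pandas UDF sketch; "df", "group_key" and "value"
    # are assumed names, not the original poster's schema.
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    out_schema = StructType([
        StructField("group_key", StringType()),
        StructField("result", DoubleType()),
    ])

    @pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
    def summarize(pdf):
        # Each group arrives as one pandas DataFrame serialized through Arrow,
        # so a very large or skewed group can fail before the Python body runs.
        return pdf.assign(result=pdf["value"].astype("float64"))[["group_key", "result"]]

    # Grouping on a finer-grained (e.g. composite) key keeps each group small.
    result = df.groupBy("group_key").apply(summarize)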

Re: Multiple joins with same Dataframe throws AnalysisException: resolved attribute(s)

2018-07-19 Thread Joel D
One workaround is to rename the fid column for each df before joining.

On Thu, Jul 19, 2018 at 9:50 PM wrote:
> Spark 2.3.0 has this problem; upgrade to 2.3.1.
>
> Sent from my iPhone
>
> On Jul 19, 2018, at 2:13 PM, Nirav Patel wrote:
>
> corrected subject line. It's missing attribute error
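[Editor's note: a PySpark sketch of that rename workaround; "fid" comes from the error in the thread, while the other column names and join conditions are assumptions.]

    # Rename the conflicting column before the second join so the analyzer
    # resolves two distinct attributes instead of reusing fid.
    df2_b = df2.withColumnRenamed("fid", "fid_b")

    result = (df1
              .join(df2, df1["df1Id"] == df2["fid"])
              .join(df2_b, df1["mgrId"] == df2_b["fid_b"]))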

Re: Multiple joins with same Dataframe throws AnalysisException: resolved attribute(s)

2018-07-19 Thread kanth909
Spark 2.3.0 has this problem; upgrade to 2.3.1.

Sent from my iPhone

> On Jul 19, 2018, at 2:13 PM, Nirav Patel wrote:
>
> corrected subject line. It's a missing-attribute error, not an ambiguous-reference error.
>
>> On Thu, Jul 19, 2018 at 2:11 PM, Nirav Patel wrote:
>> I am getting

Parquet

2018-07-19 Thread amin mohebbi
We have two big tables, each with 5 billion rows, so my question here is: should we partition/sort the data and convert it to Parquet before doing any join? Best Regards ... Amin Mohebbi, PhD candidate in Software Engineering at
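[Editor's note: one common approach for a join between two 5-billion-row tables is to pre-organize both sides by the join key. A hedged PySpark sketch follows; table names, the key, and the bucket count are assumptions, not a tested recommendation.]

    # Write both tables as bucketed, sorted Parquet-backed tables so rows with
    # the same join key are co-located; the join can then avoid a full shuffle.
    for name, df in [("table_a", big_table_a), ("table_b", big_table_b)]:
        (df.write
           .format("parquet")
           .bucketBy(512, "join_key")
           .sortBy("join_key")
           .mode("overwrite")
           .saveAsTable(f"{name}_bucketed"))

    joined = spark.table("table_a_bucketed").join(
        spark.table("table_b_bucketed"), "join_key")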

Re: Multiple joins with same Dataframe throws AnalysisException: resolved attribute(s)

2018-07-19 Thread Nirav Patel
Corrected subject line. It's a missing-attribute error, not an ambiguous-reference error.

On Thu, Jul 19, 2018 at 2:11 PM, Nirav Patel wrote:
> I am getting an attribute-missing error after joining dataframe 'df2' twice.
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> resolved

Multiple joins with same Dataframe throws Ambiguous reference error

2018-07-19 Thread Nirav Patel
I am getting an attribute-missing error after joining dataframe 'df2' twice.

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) fid#49 missing from value#14,value#126,mgrId#15,name#16,d31#109,df2Id#125,df2Id#47,d4#130,d3#129,df1Id#13,name#128,fId#127 in
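[Editor's note: judging from the attribute ids in the message, df2 is joined twice so that fid resolves to two different attributes (fid#49 vs fId#127). A minimal reproduction sketch; the schemas are guesses from the exception text, and per the replies above the error is reported fixed in 2.3.1.]

    # Hypothetical schemas inferred from the exception text.
    df1 = spark.createDataFrame([(1, "v", 2, "n")], ["df1Id", "value", "mgrId", "name"])
    df2 = spark.createDataFrame([(10, 1, "v2", "n2")], ["df2Id", "fId", "value", "name"])

    step1 = df1.join(df2, df1["df1Id"] == df2["fId"])
    # Reusing df2 makes the analyzer re-resolve fId to a new attribute id;
    # on Spark 2.3.0 this pattern can raise "resolved attribute(s) ... missing".
    step2 = step1.join(df2, step1["mgrId"] == df2["fId"])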

Re: [STRUCTURED STREAM] Join static dataset in state function (flatMapGroupsWithState)

2018-07-19 Thread Christiaan Ras
Hi Gerard, First, I'd like to thank you for your fast reply and for directing my question to the proper mailing list! I established the JDBC connection in the context of the state function (flatMapGroupsWithState). The JDBC connection is made using the read on the SparkSession, like below:
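[Editor's note: the snippet referenced by "like below" is cut off in the archive. A minimal sketch of a JDBC read through the SparkSession; the url, table, and credentials are placeholders, not the poster's actual values.]

    # Hypothetical reconstruction of a SparkSession-based JDBC read; all
    # connection details below are placeholders.
    static_df = (spark.read
                 .format("jdbc")
                 .option("url", "jdbc:postgresql://dbhost:5432/refdb")
                 .option("dbtable", "reference_table")
                 .option("user", "spark_user")
                 .option("password", "secret")
                 .load())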

Compute/Storage Calculation

2018-07-19 Thread Deepu Raj
Hi Team - Is there any good calculator/Excel sheet to estimate compute and storage requirements for new Spark jobs to be developed? Capacity planning based on: job, data type, etc. Thanks, Deepu Raj

Arrow type issue with Pandas UDF

2018-07-19 Thread Patrick McCarthy
PySpark 2.3.1 on YARN, Python 3.6, PyArrow 0.8. I'm trying to run a pandas UDF, but I seem to get nonsensical exceptions in the last stage of the job regardless of my output type. The problem I'm trying to solve: I have a column of scalar values, and each value on the same row has a sorted
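[Editor's note: for context, a minimal scalar pandas UDF with an explicit return type under the Spark 2.3 / PyArrow 0.8 API; the function body is an assumption, not Patrick's actual UDF, and per Bryan's reply above the reported failure happens before the Python code runs at all.]

    # Minimal scalar pandas UDF sketch; the logic is illustrative only.
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import DoubleType

    @pandas_udf(DoubleType(), PandasUDFType.SCALAR)
    def plus_one(v):
        # v is a pandas Series; the returned Series must match DoubleType.
        return (v + 1).astype("float64")

    df2 = df.withColumn("value_plus_one", plus_one(df["value"]))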

[Spark Structured Streaming on K8S]: Debug - File handles/descriptor (unix pipe) leaking

2018-07-19 Thread Abhishek Tripathi
Hello All! I am using Spark 2.3.1 on Kubernetes to run a structured streaming job which reads a stream from Kafka, performs some window aggregation, and writes the output sink to Kafka. After the job has been running a few hours (5-6 hours), the executor pods crash, caused by "Too many open files in
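[Editor's note: one way to confirm a descriptor leak is to sample the open-fd count of the executor process from inside the pod. This is an editor's diagnostic sketch, not from the thread; TARGET_PID is an assumed way to pass the pid in, and /proc is Linux-only.]

    # Diagnostic sketch: poll the number of open file descriptors of a process.
    import os
    import time

    def fd_count(pid: int) -> int:
        return len(os.listdir(f"/proc/{pid}/fd"))

    if __name__ == "__main__":
        pid = int(os.environ.get("TARGET_PID", str(os.getpid())))
        while True:
            print(f"pid={pid} open_fds={fd_count(pid)}", flush=True)
            time.sleep(60)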

Re: [STRUCTURED STREAM] Join static dataset in state function (flatMapGroupsWithState)

2018-07-19 Thread Gerard Maas
Hi Chris, Could you show the code you are using? When you mention "I like to use a static datasource (JDBC) in the state function", do you refer to a DataFrame from a JDBC source or an independent JDBC connection? The key point to consider is that the flatMapGroupsWithState function must be
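[Editor's note: Gerard's reply is truncated here, but the usual constraint is that the function passed to flatMapGroupsWithState executes on the executors, where the SparkSession is not available. If the goal is only to enrich the stream with reference data, a stream-static join is one alternative. A hedged PySpark sketch follows; flatMapGroupsWithState itself is Scala/Java-only in Spark 2.3, and all source names and connection details below are assumptions.]

    # Stream-static join sketch: the static JDBC DataFrame is defined on the
    # driver and joined to the stream; all details are placeholders.
    static_df = (spark.read
                 .format("jdbc")
                 .option("url", "jdbc:postgresql://dbhost:5432/refdb")
                 .option("dbtable", "reference_table")
                 .option("user", "spark_user")
                 .option("password", "secret")
                 .load())

    stream_df = (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "broker:9092")
                 .option("subscribe", "events")
                 .load()
                 .selectExpr("CAST(key AS STRING) AS key",
                             "CAST(value AS STRING) AS value"))

    enriched = stream_df.join(static_df, "key")  # stream-static inner join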