Re: Spark, read from Kafka stream failing AnalysisException

2020-04-05 Thread Tathagata Das
Have you looked at the suggestion in the error message? Search for "Structured Streaming + Kafka Integration Guide" on Google; it should be the first result. The last section in the "Structured Streaming
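That guide's "Deploying" section notes the Kafka source ships as a separate artifact that must be passed to spark-submit via --packages. A minimal sketch of building that invocation, assuming the Spark 2.4.5 / Scala 2.11 versions mentioned in this thread (the helper function and app filename are hypothetical):

```python
def kafka_submit_cmd(app: str,
                     spark_version: str = "2.4.5",
                     scala_version: str = "2.11") -> list:
    # Maven coordinate of the Kafka source for Structured Streaming;
    # the Scala and Spark versions must match your installation.
    pkg = "org.apache.spark:spark-sql-kafka-0-10_{}:{}".format(
        scala_version, spark_version)
    return ["spark-submit", "--packages", pkg, app]

print(" ".join(kafka_submit_cmd("my_stream.py")))
# → spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 my_stream.py
```

Without that package on the classpath, spark.readStream.format("kafka") fails with the "Failed to find data source: kafka" AnalysisException described in this thread.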

Re: spark-submit exit status on k8s

2020-04-05 Thread Yinan Li
Not sure if you are aware of this new feature in Airflow: https://issues.apache.org/jira/browse/AIRFLOW-6542. It is a way to use Airflow to orchestrate Spark applications run via the Spark K8s Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). On Sun, Apr 5, 2020 at 8:25

Re: pandas_udf is very slow

2020-04-05 Thread Lian Jiang
Thanks Silvio. I need a grouped map Pandas UDF, which takes a Spark DataFrame as input and outputs a Spark DataFrame with a different shape from the input. Grouped map is somewhat unique to Pandas UDFs, and I have trouble finding an equivalent non-Pandas UDF for an apples-to-apples comparison. Let me
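The shape-changing grouped-map pattern can be sketched at the pandas level alone (toy data and function names are hypothetical; in PySpark, df.groupby(...).apply(...) with a GROUPED_MAP Pandas UDF applies this kind of function to each key's rows):

```python
import pandas as pd

# Toy input: several rows per key.
df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"],
                   "value": [1.0, 2.0, 3.0, 4.0, 5.0]})

def summarize(group: pd.DataFrame) -> pd.DataFrame:
    # Collapses each group to a single row, so the output frame has a
    # different shape and different columns than the input group:
    # the grouped-map case that has no direct plain-UDF equivalent.
    return pd.DataFrame({"key": [group["key"].iloc[0]],
                         "n": [len(group)],
                         "mean": [group["value"].mean()]})

out = pd.concat((summarize(g) for _, g in df.groupby("key")),
                ignore_index=True)
print(out)
```

The input has 5 rows and 2 columns; the output has 2 rows and 3 columns, which is exactly the shape change being described.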

Re: spark-submit exit status on k8s

2020-04-05 Thread Masood Krohy
Another, simpler solution that I just thought of: add an operation at the end of your Spark program that writes an empty file somewhere, named SUCCESS for example. Then add a stage to your Airflow graph that checks for the existence of this file after running spark-submit. If the file is absent,
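A minimal sketch of this sentinel-file idea (the paths and function names are hypothetical; in practice the marker file would live on shared storage such as HDFS or S3 so the Airflow worker can see it):

```python
import tempfile
from pathlib import Path

# Hypothetical shared location visible to both the Spark job and Airflow.
out_dir = Path(tempfile.mkdtemp())
sentinel = out_dir / "SUCCESS"

def mark_success():
    # Last operation of the Spark program: write an empty marker file.
    sentinel.touch()

def job_succeeded() -> bool:
    # Airflow-side check, run after spark-submit returns.
    return sentinel.exists()

mark_success()
print(job_succeeded())  # → True
```

If the job dies before its final step, the file is never written and the Airflow check fails, independent of whatever exit status spark-submit reported.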

Spark, read from Kafka stream failing AnalysisException

2020-04-05 Thread Sumit Agrawal
Hello, I am using Spark 2.4.5 and Kafka 2.3.1 on my local machine. I am able to produce and consume messages on Kafka with the bootstrap server config "localhost:9092". While trying to set up a reader with the Spark streaming API, I am getting an error. Exception message: Py4JJavaError: An error occurred

Re: pandas_udf is very slow

2020-04-05 Thread Silvio Fiorito
Your two examples are doing different things. The Pandas UDF is doing a grouped map, whereas your Python UDF is doing an aggregate. I think you want your Pandas UDF to be PandasUDFType.GROUPED_AGG? Is your result the same? From: Lian Jiang Date: Sunday, April 5, 2020 at 3:28 AM To: user
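The distinction can be illustrated in plain pandas (toy data; the function names are hypothetical). GROUPED_AGG corresponds to one scalar per group, GROUPED_MAP to one DataFrame per group:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"],
                   "value": [1.0, 2.0, 3.0]})

# Grouped aggregate: each group reduces to a single scalar,
# analogous to PandasUDFType.GROUPED_AGG.
agg = df.groupby("key")["value"].agg("mean")

# Grouped map: each group maps to a whole DataFrame, analogous to
# PandasUDFType.GROUPED_MAP; here every row survives and gains a column.
def demean(g: pd.DataFrame) -> pd.DataFrame:
    g = g.copy()
    g["demeaned"] = g["value"] - g["value"].mean()
    return g

mapped = pd.concat(demean(g) for _, g in df.groupby("key"))
print(agg.to_dict())   # → {'a': 1.5, 'b': 3.0}
print(mapped.shape)    # → (3, 3)
```

Comparing a grouped-map Pandas UDF against a plain aggregating UDF measures two different computations, so the timings are not directly comparable.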

pandas_udf is very slow

2020-04-05 Thread Lian Jiang
Hi, I am using a Pandas UDF in PySpark 2.4.3 on EMR 5.21.0. Pandas UDFs are favored over plain Python UDFs per https://www.twosigma.com/wp-content/uploads/Jin_-_Improving_Python__Spark_Performance_-_Spark_Summit_West.pdf. My data has about 250M records and the Pandas UDF code is like: def

Re: Serialization or internal functions?

2020-04-05 Thread Som Lima
If you want to measure optimisation in terms of time taken, then here is an idea :)

public class MyClass {
    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // replace with your add column code
        // enough data
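The same wall-clock measurement in Python, with a hypothetical stand-in workload (substitute the add-column code under comparison for the `sum` line):

```python
import time

start = time.monotonic()  # monotonic clock is safe for elapsed-time measurement
# Hypothetical stand-in for the work being timed; replace this line
# with the actual add-column code you want to compare.
total = sum(i * i for i in range(1_000_000))
elapsed = time.monotonic() - start
print("elapsed: {:.4f} s".format(elapsed))
```

One caveat for Spark specifically: transformations are lazy, so the timed region must include an action (e.g. `count()` or `write`) or the measurement captures only plan construction, not execution.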