Usage of PyArrow in Spark

2019-07-16 Thread Abdeali Kothari
Hi, in Spark 2.3+ I saw that pyarrow was being used in a bunch of places in Spark, and I was trying to understand the benefit it provides in terms of serialization / deserialization. I understand that the new pandas-udf works only if pyarrow is installed. But what about the plain old PythonUDF

Re: Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-16 Thread Dongjoon Hyun
Thank you for volunteering as the 2.3.4 release manager, Kazuaki! It's great to see a new release manager in advance. :D Thank you for the reply, Stavros. In addition to that issue, I'm also monitoring some other K8s issues and PRs. But I'm not sure we can have that, because some PRs seem to fail at

Re: [PySpark] [SparkR] Is it possible to invoke a PySpark function with a SparkR DataFrame?

2019-07-16 Thread Felix Cheung
Not currently in Spark. However, there are systems out there that can share a DataFrame between languages on top of Spark - it’s not calling the Python UDF directly, but you can pass the DataFrame to Python and then .map(UDF) that way.

Re: Sorting tuples with byte key and byte value

2019-07-16 Thread Supun Kamburugamuve
Thanks, Keith. We have set SPARK_WORKER_INSTANCES=8. So does that mean we are running 8 workers on a single machine with 1 thread each, and this gives the 8 threads? Is there a preference for running 1 worker with 8 threads inside it? These are dual-CPU machines, so I believe we need at least 2 worker

Re: spark python script importError problem

2019-07-16 Thread Patrick McCarthy
Your module 'feature' isn't available to the YARN workers, so you'll need to either install it on them if you have access, or else upload it to the workers at runtime using --py-files or similar.
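The --py-files suggestion above can be sketched as follows. This is a minimal sketch, not the poster's actual setup: it assumes 'feature' is a package directory, and the script name job.py, the helper names zip_package and spark_submit_command, and the temporary demo package are all hypothetical. It only builds the archive and the command line; it does not run spark-submit.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_package(package_dir: Path, zip_path: Path) -> Path:
    """Zip a Python package so the package folder sits at the archive root."""
    root = package_dir.parent
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*.py")):
            # Store entries as e.g. "feature/__init__.py" so that
            # "import feature" resolves once the zip is on sys.path.
            zf.write(path, path.relative_to(root).as_posix())
    return zip_path


def spark_submit_command(script: str, py_files: Path, master: str = "yarn") -> list:
    """Build (but do not run) a spark-submit command that ships the zip."""
    return ["spark-submit", "--master", master, "--py-files", str(py_files), script]


# Demo with a throwaway package; "feature" and "job.py" are placeholder names.
workdir = Path(tempfile.mkdtemp())
pkg = workdir / "feature"
pkg.mkdir()
(pkg / "__init__.py").write_text("VALUE = 1\n")

archive = zip_package(pkg, workdir / "feature.zip")
archive_names = zipfile.ZipFile(archive).namelist()
command = spark_submit_command("job.py", archive)

print(archive_names)
print(" ".join(command))
```

Keeping the package folder at the root of the archive matters: if the files were stored without the leading feature/ component, the import would still fail on the workers.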

spark python script importError problem

2019-07-16 Thread zenglong chen
Hi all, When I run a Python script with spark-submit, it works well in local[*] mode, but not in standalone mode or YARN mode. The error is like below: Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File

Re: Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-16 Thread Kazuaki Ishizaki
Thank you, Dongjoon, for being a release manager. If the assumed dates are OK, I would like to volunteer as the 2.3.4 release manager. Best Regards, Kazuaki Ishizaki

Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-16 Thread Stavros Kontopoulos
Hi Dongjoon, Should we also consider fixing https://issues.apache.org/jira/browse/SPARK-27812 before the cut? Best, Stavros On Mon, Jul 15, 2019 at 7:04 PM Dongjoon Hyun wrote: > Hi, Apache Spark PMC members. > Can we cut Apache Spark 2.4.4 next Monday (22nd July)? > Bests, > Dongjoon.

event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2019-07-16 Thread raman gugnani
Hi, I have long-running Spark Streaming jobs. Event log directories are getting filled with .inprogress files. Is there a fix or workaround for Spark Streaming? There is also a JIRA raised for the same issue by one reporter: https://issues.apache.org/jira/browse/SPARK-22783 -- Raman Gugnani
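As a diagnostic aid only (not a fix for the issue described above), the following minimal sketch lists .inprogress files above a size threshold, largest first. It assumes the event log directory is reachable as a local path; logs kept on HDFS or another remote store would need different tooling. The function name large_inprogress_logs, the threshold, and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def large_inprogress_logs(log_dir: Path, min_bytes: int) -> list:
    """Return (file name, size in bytes) for .inprogress files of at
    least min_bytes, sorted from largest to smallest."""
    found = [
        (path.name, path.stat().st_size)
        for path in log_dir.glob("*.inprogress")
        if path.is_file() and path.stat().st_size >= min_bytes
    ]
    return sorted(found, key=lambda item: item[1], reverse=True)


# Demo on a throwaway directory; the file names are made up for illustration.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "app-0001.inprogress").write_bytes(b"x" * 2048)
(demo_dir / "app-0002.inprogress").write_bytes(b"x" * 16)
(demo_dir / "app-0003").write_bytes(b"x" * 4096)  # finished log, ignored

report = large_inprogress_logs(demo_dir, min_bytes=1024)
print(report)
```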