Usage of PyArrow in Spark

2019-07-16 Thread Abdeali Kothari
Hi,
In Spark 2.3+ I saw that pyarrow is being used in a bunch of places in
Spark, and I was trying to understand the benefit it provides in terms of
serialization / deserialization.

I understand that the new pandas_udf works only if pyarrow is installed.
But what about the plain old Python UDFs that can be used in map()-style
operations?
Do they also use pyarrow under the hood to reduce the cost of serde, or do
they remain as before, with no performance gain to be expected there?

If I'm not mistaken, plain old Python UDFs could also benefit from Arrow,
since the cost of serializing/deserializing data between Java and Python
and back still exists and could potentially be reduced by using Arrow.
Is my understanding correct? Are there any plans to implement this?

Pointers to any notes or Jira about this would be appreciated.
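For context, my understanding as of Spark 2.3/2.4 is that only the pandas UDF
path (and toPandas()/createDataFrame() when spark.sql.execution.arrow.enabled
is set) goes through Arrow, while plain Python UDFs still use the pickle-based
serializer. A minimal sketch of the two, assuming pandas and pyarrow are
installed alongside pyspark:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, pandas_udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)

    # Plain Python UDF: rows are pickled in batches and sent to the Python
    # worker; Arrow is not involved.
    plus_one = udf(lambda x: x + 1, LongType())

    # Pandas UDF: column batches are shipped as Arrow record batches, which
    # is where the serde savings come from.
    @pandas_udf(LongType())
    def pandas_plus_one(s):
        return s + 1

    df.select(plus_one("id"), pandas_plus_one("id")).show()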


Re: Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-16 Thread Dongjoon Hyun
Thank you for volunteering as the 2.3.4 release manager, Kazuaki!
It's great to see a new release manager in advance. :D

Thank you for the reply, Stavros.
In addition to that issue, I'm also monitoring some other K8s issues and
PRs, but I'm not sure we can include them, since some of those PRs seem to
be failing to build consensus (even for 3.0.0).
In any case, could you ping the reviewers once more on the PRs you have
concerns about?
Anything merged into `branch-2.4` will of course go into Apache Spark 2.4.4.

Bests,
Dongjoon.


On Tue, Jul 16, 2019 at 4:00 AM Kazuaki Ishizaki wrote:

> Thank you Dongjoon for being a release manager.
>
> If the assumed dates are OK, I would like to volunteer as the 2.3.4
> release manager.
>
> Best Regards,
> Kazuaki Ishizaki,
>
>
>
> From: Dongjoon Hyun
> To: dev, "user @spark" <user@spark.apache.org>, Apache Spark PMC
> Date: 2019/07/13 07:18
> Subject: [EXTERNAL] Re: Release Apache Spark 2.4.4 before 3.0.0
>
>
>
> Thank you, Jacek.
>
> BTW, I added `@private` since we need PMC's help to make an Apache Spark
> release.
>
> Can I get more feedback from the other PMC members?
>
> Please let me know if you have any concerns (e.g. release date or release
> manager).
>
> As one of the community members, I assumed the following (if we are on
> schedule):
>
> - 2.4.4 at the end of July
> - 2.3.4 at the end of August (since 2.3.0 was released at the end of
> February 2018)
> - 3.0.0 (possibly September?)
> - 3.1.0 (January 2020?)
>
> Bests,
> Dongjoon.
>
>
> On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski <ja...@japila.pl> wrote:
> Hi,
>
> Thanks Dongjoon Hyun for stepping up as a release manager!
> Much appreciated.
>
> If there's a volunteer to cut a release, I'm always happy to support it.
>
> In addition, the more frequent the releases, the better for end users: they
> get to choose whether to upgrade and pick up all the latest fixes, or to
> wait. It's their call, not ours (if we kept them waiting).
>
> My big 2 yes'es for the release!
>
> Jacek
>
>
> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun, <dongjoon.h...@gmail.com> wrote:
> Hi, All.
>
> Spark 2.4.3 was released two months ago (8th May).
>
> As of today (9th July), there are 45 fixes in `branch-2.4`, including the
> following correctness or blocker issues.
>
> - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
> decimals not fitting in long
> - SPARK-26045 Error in the spark 2.4 release package with the
> spark-avro_2.11 dependency
> - SPARK-27798 from_avro can modify variables in other rows in local
> mode
> - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
> - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries
> - SPARK-28308 CalendarInterval sub-second part should be padded before
> parsing
>
> It would be great if we could get Spark 2.4.4 out before we all get busier
> with 3.0.0.
> If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll
> it next Monday (15th July).
> What do you think?
>
> Bests,
> Dongjoon.
>
>


Re: [PySpark] [SparkR] Is it possible to invoke a PySpark function with a SparkR DataFrame?

2019-07-16 Thread Felix Cheung
Not currently in Spark.

However, there are systems out there that can share DataFrames between languages 
on top of Spark - they don't call the Python UDF directly, but you can pass the 
DataFrame to Python and then .map(UDF) that way.
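One low-tech version of that hand-off, assuming both the SparkR and PySpark
sessions can reach the same storage path (the paths and function name below
are made up for illustration): have SparkR write the DataFrame out, e.g.
write.parquet(df, "/shared/input"), run the PySpark function against that
path, and write the result back for SparkR to read again.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def my_pyspark_function(df):
        # stand-in for the PySpark function that expects a DataFrame
        return df.filter("value IS NOT NULL")

    # Read what the SparkR side wrote with write.parquet(df, "/shared/input")
    sdf = spark.read.parquet("/shared/input")
    result = my_pyspark_function(sdf)

    # SparkR can then pick this up with read.parquet("/shared/output")
    result.write.mode("overwrite").parquet("/shared/output")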



From: Fiske, Danny 
Sent: Monday, July 15, 2019 6:58:32 AM
To: user@spark.apache.org
Subject: [PySpark] [SparkR] Is it possible to invoke a PySpark function with a 
SparkR DataFrame?

Hi all,

Forgive this naïveté; I'm looking for reassurance from some experts!

In the past we created a tailored Spark library for our organisation, 
implementing Spark functions in Scala with Python and R “wrappers” on top, but 
the focus on Scala has alienated our analysts/statisticians/data scientists, and 
collaboration is important to us (yeah… we’re aware that your SDKs are very 
similar across languages… :/ ). We’d like to see if we could forgo the Scala 
facet in order to present the source code in a language more familiar to users 
and internal contributors.

We’d ideally write our functions with PySpark and potentially create a SparkR 
“wrapper” over the top, leading to the question:

Given a function written with PySpark that accepts a DataFrame parameter, is 
there a way to invoke this function using a SparkR DataFrame?

Is there any reason to pursue this? Is it even possible?

Many thanks,

Danny



Re: Sorting tuples with byte key and byte value

2019-07-16 Thread Supun Kamburugamuve
Thanks, Keith. We have set SPARK_WORKER_INSTANCES=8. So that means we
are running 8 workers on a single machine, each with 1 thread, and that
gives us the 8 threads?

Is there a preference for running 1 worker with 8 threads inside it? These
are dual-CPU machines, so I believe we need at least 2 worker instances per
machine. If that is the case, I can use 2 worker instances, each with 4
threads.
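For reference, a 2-workers-per-machine layout along those lines (keeping the
same 128-way parallelism across the 16 machines) would look roughly like the
sketch below; the memory numbers are placeholders, not tuned values.

    # conf/spark-env.sh on each of the 16 machines
    SPARK_WORKER_INSTANCES=2
    SPARK_WORKER_CORES=4
    SPARK_WORKER_MEMORY=64G

    # spark-submit against the standalone master:
    # 16 machines x 2 workers x 4 cores = 128 concurrent tasks
    --executor-cores 4 --executor-memory 64G --total-executor-cores 128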

Another question: how do we avoid hitting the disk during the shuffle?

Best,
Supun..





On Mon, Jul 15, 2019 at 8:49 PM Keith Chapman wrote:

> Hi Supun,
>
> A couple of things with regard to your question.
>
> --executor-cores means the number of worker threads per VM. According to
> your requirement, this should be set to 8.
>
> repartitionAndSortWithinPartitions is an RDD operation, and RDD operations
> in Spark are not performant in terms of either execution or memory. I would
> rather use the DataFrame sort operation if performance is key.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
>
> On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <supun.kamburugam...@gmail.com> wrote:
>
>> Hi all,
>>
>> We are trying to measure the sorting performance of Spark. We have a
>> 16-node cluster with 48 cores and 256GB of RAM in each machine, and a
>> 10Gbps network.
>>
>> Let's say we are running with 128 parallel tasks and each partition
>> generates about 1GB of data (total 128GB).
>>
>> We are using the method repartitionAndSortWithinPartitions.
>>
>> A standalone cluster is used with the following configuration.
>>
>> SPARK_WORKER_CORES=1
>> SPARK_WORKER_MEMORY=16G
>> SPARK_WORKER_INSTANCES=8
>>
>> --executor-memory 16G --executor-cores 1 --num-executors 128
>>
>> I believe this sets up 128 executors to run the job, each with 16GB of
>> memory, spread across 16 nodes with 8 threads per node. This
>> configuration runs very slowly. The program doesn't use disks to read or
>> write data (the data is generated in memory and we don't write to a file
>> after sorting).
>>
>> It seems that even though the data size is small, the shuffle uses the
>> disk. We are not sure our configuration is optimal for the best
>> performance.
>>
>> Best,
>> Supun..
>>
>>
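As a rough illustration of Keith's suggestion above to stay in the DataFrame
API, the sort could be expressed as follows (a sketch, assuming the data
already sits in a DataFrame with key and value columns and that 128
partitions are still wanted; the input path is only an example):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/generated/data")  # key, value columns assumed

    # DataFrame-side equivalent of repartitionAndSortWithinPartitions:
    # hash-partition by key, then sort inside each partition (no global order).
    partition_sorted = df.repartition(128, col("key")).sortWithinPartitions("key")

    # Or a full global sort, which goes through the Tungsten sort machinery.
    globally_sorted = df.orderBy("key")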


Re: spark python script importError problem

2019-07-16 Thread Patrick McCarthy
Your module 'feature' isn't available on the YARN workers, so you'll need
either to install it on them if you have access, or else to ship it to the
workers at runtime using --py-files or similar.
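For example, if the module lives in a package like feature/user/user_feature.py
(the paths and script name below are illustrative), one way to ship it is to
zip the package and pass it to spark-submit:

    # run from the directory that contains the 'feature' package
    zip -r feature.zip feature/

    spark-submit --master yarn \
      --py-files feature.zip \
      your_script.py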

On Tue, Jul 16, 2019 at 7:16 AM zenglong chen wrote:

> Hi all,
>   When I run a Python script with spark-submit, it works well in
> local[*] mode, but not in standalone or YARN mode. The error looks like this:
>
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most
> recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/worker.py", line
> 364, in main
> func, profiler, deserializer, serializer = read_command(pickleSer,
> infile)
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/worker.py", line
> 69, in read_command
> command = serializer._read_with_length(file)
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/serializers.py",
> line 172, in _read_with_length
> return self.loads(obj)
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/serializers.py",
> line 583, in loads
> return pickle.loads(obj)
> ImportError: No module named feature.user.user_feature
>
> The script also runs well with "sbin/start-master.sh sbin/start-slave.sh", but
> it has the same ImportError problem with "sbin/start-master.sh
> sbin/start-slaves.sh". The conf/slaves file contains 'localhost'.
>
> What should I do to solve this import problem? Thanks!!!
>


-- 


Patrick McCarthy

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


spark python script importError problem

2019-07-16 Thread zenglong chen
Hi all,
  When I run a Python script with spark-submit, it works well in
local[*] mode, but not in standalone or YARN mode. The error looks like this:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most
recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pyspark/worker.py", line
364, in main
func, profiler, deserializer, serializer = read_command(pickleSer,
infile)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/worker.py", line 69,
in read_command
command = serializer._read_with_length(file)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/serializers.py",
line 172, in _read_with_length
return self.loads(obj)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/serializers.py",
line 583, in loads
return pickle.loads(obj)
ImportError: No module named feature.user.user_feature

The script also runs well with "sbin/start-master.sh sbin/start-slave.sh", but
it has the same ImportError problem with "sbin/start-master.sh
sbin/start-slaves.sh". The conf/slaves file contains 'localhost'.

What should I do to solve this import problem? Thanks!!!
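A hedged sketch of the usual fix (the module has to be shipped to the
executors; the archive path below is only an example): zip the 'feature'
package and register it from the driver before any UDFs run, or pass the
same zip with --py-files on spark-submit.

    from pyspark import SparkContext

    sc = SparkContext(appName="feature-job")

    # Distribute the zipped package so the workers can resolve
    # `import feature.user.user_feature` when unpickling the function.
    sc.addPyFile("/path/to/feature.zip")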


Re: Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-16 Thread Kazuaki Ishizaki
Thank you Dongjoon for being a release manager.

If the assumed dates are OK, I would like to volunteer as the 2.3.4
release manager.

Best Regards,
Kazuaki Ishizaki,



From:   Dongjoon Hyun 
To: dev , "user @spark" , 
Apache Spark PMC 
Date:   2019/07/13 07:18
Subject:[EXTERNAL] Re: Release Apache Spark 2.4.4 before 3.0.0



Thank you, Jacek.

BTW, I added `@private` since we need PMC's help to make an Apache Spark 
release.

Can I get more feedback from the other PMC members?

Please let me know if you have any concerns (e.g. release date or release
manager).

As one of the community members, I assumed the following (if we are on
schedule):

- 2.4.4 at the end of July
- 2.3.4 at the end of August (since 2.3.0 was released at the end of 
February 2018)
- 3.0.0 (possibly September?)
- 3.1.0 (January 2020?)

Bests,
Dongjoon.


On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski  wrote:
Hi,

Thanks Dongjoon Hyun for stepping up as a release manager! 
Much appreciated. 

If there's a volunteer to cut a release, I'm always happy to support it.

In addition, the more frequent the releases, the better for end users: they
get to choose whether to upgrade and pick up all the latest fixes, or to
wait. It's their call, not ours (if we kept them waiting).

My big 2 yes'es for the release!

Jacek


On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun,  wrote:
Hi, All.

Spark 2.4.3 was released two months ago (8th May).

As of today (9th July), there are 45 fixes in `branch-2.4`, including the
following correctness or blocker issues.

- SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for 
decimals not fitting in long
- SPARK-26045 Error in the spark 2.4 release package with the 
spark-avro_2.11 dependency
- SPARK-27798 from_avro can modify variables in other rows in local 
mode
- SPARK-27907 HiveUDAF should return NULL in case of 0 rows
- SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries
- SPARK-28308 CalendarInterval sub-second part should be padded before 
parsing

It would be great if we could get Spark 2.4.4 out before we all get busier
with 3.0.0.
If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll
it next Monday (15th July).
What do you think?

Bests,
Dongjoon.




Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-16 Thread Stavros Kontopoulos
Hi Dongjoon,

Should we also consider fixing
https://issues.apache.org/jira/browse/SPARK-27812 before the cut?

Best,
Stavros

On Mon, Jul 15, 2019 at 7:04 PM Dongjoon Hyun wrote:

> Hi, Apache Spark PMC members.
>
> Can we cut Apache Spark 2.4.4 next Monday (22nd July)?
>
> Bests,
> Dongjoon.
>
>
> On Fri, Jul 12, 2019 at 3:18 PM Dongjoon Hyun wrote:
>
>> Thank you, Jacek.
>>
>> BTW, I added `@private` since we need PMC's help to make an Apache Spark
>> release.
>>
>> Can I get more feedback from the other PMC members?
>>
>> Please let me know if you have any concerns (e.g. release date or release
>> manager).
>>
>> As one of the community members, I assumed the following (if we are on
>> schedule):
>>
>> - 2.4.4 at the end of July
>> - 2.3.4 at the end of August (since 2.3.0 was released at the end of
>> February 2018)
>> - 3.0.0 (possibly September?)
>> - 3.1.0 (January 2020?)
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> Thanks Dongjoon Hyun for stepping up as a release manager!
>>> Much appreciated.
>>>
>>> If there's a volunteer to cut a release, I'm always happy to support it.
>>>
>>> In addition, the more frequent the releases, the better for end users: they
>>> get to choose whether to upgrade and pick up all the latest fixes, or to
>>> wait. It's their call, not ours (if we kept them waiting).
>>>
>>> My big 2 yes'es for the release!
>>>
>>> Jacek
>>>
>>>
>>> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun, 
>>> wrote:
>>>
 Hi, All.

 Spark 2.4.3 was released two months ago (8th May).

 As of today (9th July), there are 45 fixes in `branch-2.4`, including
 the following correctness or blocker issues.

 - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
 decimals not fitting in long
 - SPARK-26045 Error in the spark 2.4 release package with the
 spark-avro_2.11 dependency
 - SPARK-27798 from_avro can modify variables in other rows in local
 mode
 - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
 - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist
 entries
 - SPARK-28308 CalendarInterval sub-second part should be padded
 before parsing

 It would be great if we could get Spark 2.4.4 out before we all get busier
 with 3.0.0.
 If it's okay, I'd like to volunteer as the 2.4.4 release manager and
 roll it next Monday (15th July).
 What do you think?

 Bests,
 Dongjoon.

>>>


event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2019-07-16 Thread raman gugnani
Hi,

I have long-running Spark Streaming jobs, and the event log directories are
getting filled with .inprogress files.
Is there a fix or workaround for Spark Streaming?

There is also a JIRA raised for this by another reporter:

https://issues.apache.org/jira/browse/SPARK-22783
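Until that JIRA is resolved, a couple of hedged workarounds (config names as
of Spark 2.x; please verify against your version): disable event logging for
the streaming jobs that never finish, or enable the history server's cleaner
for the logs you do keep. Note that the cleaner may not touch the .inprogress
files of still-running applications, which is exactly what the JIRA is about,
so turning event logging off is the more reliable option. In
spark-defaults.conf, for example:

    # Don't write event logs for long-running streaming jobs
    spark.eventLog.enabled             false

    # Or keep event logs but clean up old ones in the history server
    spark.history.fs.cleaner.enabled   true
    spark.history.fs.cleaner.interval  1d
    spark.history.fs.cleaner.maxAge    7d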

-- 
Raman Gugnani

8588892293
Principal Engineer
ixigo.com