Hi,

On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy
<pmccar...@dstillery.com.invalid> wrote:
>
> Thanks Gourav.
>
> Incidentally, since the regular UDF is row-wise, we could optimize that a bit 
> by taking the convert() closure and simply making that the UDF.
>
> Since there's that MGRS object that we have to create too, we could probably 
> optimize it further by applying the conversion via rdd.mapPartitions, which 
> would allow us to instantiate objects once per partition instead of per row 
> and then iterate element-wise through the rows of the partition.
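>
> A rough sketch of that mapPartitions approach (untested; it assumes a df 
> with the same 'lat_lon' string column as the pandas UDF below, and a fixed 
> precision):
>
>     def convert_partition(rows, precision_level=2):
>         # Instantiate the MGRS converter once per partition, not per row.
>         import mgrs
>         m = mgrs.MGRS()
>         for row in rows:
>             lat, lon = row['lat_lon'].split('_')
>             yield m.toMGRS(lat, lon, MGRSPrecision=precision_level)
>
>     mgrs_rdd = df.rdd.mapPartitions(convert_partition)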
>
> All that said, having done the above on prior projects I find the pandas 
> abstractions to be very elegant and friendly to the end-user so I haven't 
> looked back :)
>
> (The common memory model via Arrow is a nice boost too!)

And some tentative SPIPs that propose using columnar representations
internally in Spark should bring further performance gains in the
future.

Cheers
Andrew

>
> On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <gourav.sengu...@gmail.com> 
> wrote:
>>
>> The proof is in the pudding
>>
>> :)
>>
>>
>>
>> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <gourav.sengu...@gmail.com> 
>> wrote:
>>>
>>> Hi Patrick,
>>>
>>> super duper, thanks a ton for sharing the code. Can you please confirm that 
>>> this runs faster than the regular UDFs?
>>>
>>> Interestingly, I am also running the same transformations using another 
>>> geospatial library in Python, where I am passing two fields and getting 
>>> back an array.
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> 
>>> wrote:
>>>>
>>>> Human time is considerably more expensive than computer time, so in that 
>>>> regard, yes :)
>>>>
>>>> This took me one minute to write and ran fast enough for my needs. If 
>>>> you're willing to provide a comparable scala implementation I'd be happy 
>>>> to compare them.
>>>>
>>>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>>>> def generate_mgrs_series(lat_lon_str, level):
>>>>     import mgrs
>>>>     m = mgrs.MGRS()  # instantiated once per Arrow batch, not per row
>>>>
>>>>     # 'level' arrives as a pandas Series; it is constant per call here,
>>>>     # so peek at the first element to choose the MGRS precision.
>>>>     precision_level = 0
>>>>     levelval = level[0]
>>>>     if levelval == 1000:
>>>>         precision_level = 2
>>>>     if levelval == 100:
>>>>         precision_level = 3
>>>>
>>>>     def convert(ll_str):
>>>>         lat, lon = ll_str.split('_')
>>>>         return m.toMGRS(lat, lon, MGRSPrecision=precision_level)
>>>>
>>>>     return lat_lon_str.apply(convert)
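>>>>
>>>> It gets called along these lines (illustrative sketch; the 'lat_lon' 
>>>> column name is made up, and F / T are the usual pyspark.sql imports):
>>>>
>>>>     from pyspark.sql import functions as F, types as T
>>>>
>>>>     df = df.withColumn('mgrs',
>>>>                        generate_mgrs_series(F.col('lat_lon'), F.lit(1000)))
>>>>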
>>>>
>>>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com> 
>>>> wrote:
>>>>>
>>>>> And you found the Pandas UDF more performant? Can you share your code 
>>>>> and prove it?
>>>>>
>>>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> 
>>>>> wrote:
>>>>>>
>>>>>> I disagree that it's hype. Perhaps not 1:1 with pure Scala 
>>>>>> performance-wise, but for Python-based data scientists or others with a 
>>>>>> lot of Python expertise it allows one to do things that would otherwise 
>>>>>> be infeasible at scale.
>>>>>>
>>>>>> For instance, I recently had to convert latitude / longitude pairs to 
>>>>>> MGRS strings 
>>>>>> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing 
>>>>>> a pandas UDF (and putting the mgrs Python package into a conda 
>>>>>> environment) was _significantly_ easier than any alternative I found.
>>>>>>
>>>>>> @Rishi - depending on how your network is constructed, some lag could 
>>>>>> come just from uploading the conda environment. Does it improve if you 
>>>>>> load it from HDFS with --archives?
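>>>>>>
>>>>>> Something along these lines (a sketch; the path, env name and entry 
>>>>>> script are placeholders for whatever you packaged):
>>>>>>
>>>>>>     spark-submit \
>>>>>>       --master yarn \
>>>>>>       --archives hdfs:///path/to/my_env.tar.gz#env \
>>>>>>       --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./env/bin/python \
>>>>>>       --conf spark.executorEnv.PYSPARK_PYTHON=./env/bin/python \
>>>>>>       my_job.py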
>>>>>>
>>>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta 
>>>>>> <gourav.sengu...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Pandas UDFs are a bit of hype. One of their blogs shows the use case of 
>>>>>>> adding 1 to a field using a Pandas UDF, which is pretty much pointless. 
>>>>>>> So you go beyond the blog, realise that your actual use case is more 
>>>>>>> than adding one :), and the reality hits you.
>>>>>>>
>>>>>>> Pandas UDFs are actually slow in certain scenarios: try using apply 
>>>>>>> with a custom or pandas function. In fact, in certain scenarios I have 
>>>>>>> found that general UDFs work much faster and use much less memory. 
>>>>>>> Therefore, test out your use case (with at least 30 million records) 
>>>>>>> before committing to the Pandas UDF option.
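>>>>>>>
>>>>>>> As a baseline for such a test, a plain row-wise UDF version of the 
>>>>>>> MGRS conversion shown earlier in this thread would look roughly like 
>>>>>>> this (an untested sketch):
>>>>>>>
>>>>>>>     from pyspark.sql import functions as F, types as T
>>>>>>>
>>>>>>>     @F.udf(T.StringType())
>>>>>>>     def generate_mgrs(ll_str):
>>>>>>>         import mgrs
>>>>>>>         m = mgrs.MGRS()  # created per row: the cost the pandas and
>>>>>>>                          # mapPartitions versions avoid
>>>>>>>         lat, lon = ll_str.split('_')
>>>>>>>         return m.toMGRS(lat, lon, MGRSPrecision=2)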
>>>>>>>
>>>>>>> And when you start using grouped map UDFs, you realise after reading 
>>>>>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>>>>>>> that "Oh!! now I can run into random OOM errors and the 
>>>>>>> maxRecordsPerBatch option does not help at all".
>>>>>>>
>>>>>>> Excerpt from the above link:
>>>>>>> Note that all data for a group will be loaded into memory before the 
>>>>>>> function is applied. This can lead to out of memory exceptions, 
>>>>>>> especially if the group sizes are skewed. The configuration for 
>>>>>>> maxRecordsPerBatch is not applied on groups and it is up to the user to 
>>>>>>> ensure that the grouped data will fit into the available memory.
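>>>>>>>
>>>>>>> For concreteness, this is the flavour of UDF the warning applies to (a 
>>>>>>> toy sketch along the lines of the example in the official docs):
>>>>>>>
>>>>>>>     @F.pandas_udf('id long, v double', F.PandasUDFType.GROUPED_MAP)
>>>>>>>     def subtract_mean(pdf):
>>>>>>>         # pdf is the ENTIRE group as one in-memory pandas DataFrame,
>>>>>>>         # regardless of spark.sql.execution.arrow.maxRecordsPerBatch
>>>>>>>         return pdf.assign(v=pdf.v - pdf.v.mean())
>>>>>>>
>>>>>>>     df.groupby('id').apply(subtract_mean)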
>>>>>>>
>>>>>>> Let me know about your use case if possible.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gourav
>>>>>>>
>>>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thanks Patrick! I tried to package it according to these 
>>>>>>>> instructions, and it got distributed on the cluster; however, the same 
>>>>>>>> Spark program that takes 5 mins without the pandas UDF has started to 
>>>>>>>> take 25 mins...
>>>>>>>>
>>>>>>>> Have you experienced anything like this? Also, is PyArrow 0.12 
>>>>>>>> supported with Spark 2.3 (according to the documentation, it should be 
>>>>>>>> fine)?
>>>>>>>>
>>>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy 
>>>>>>>> <pmccar...@dstillery.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Rishi,
>>>>>>>>>
>>>>>>>>> I've had success using the approach outlined here: 
>>>>>>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
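>>>>>>>>>
>>>>>>>>> In short, the recipe there is roughly (a sketch; the env name, 
>>>>>>>>> Python version and paths are placeholders):
>>>>>>>>>
>>>>>>>>>     conda create -y -n udf_env python=3.6 pandas pyarrow
>>>>>>>>>     source activate udf_env && pip install mgrs
>>>>>>>>>     cd ~/miniconda3/envs && zip -r udf_env.zip udf_env
>>>>>>>>>     # then ship udf_env.zip to the cluster via --archives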
>>>>>>>>>
>>>>>>>>> Does this work for you?
>>>>>>>>>
>>>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah 
>>>>>>>>> <rishishah.s...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Modified the subject & would like to clarify that I am looking to 
>>>>>>>>>> create an Anaconda parcel with PyArrow and other libraries, so that 
>>>>>>>>>> I can distribute it on the Cloudera cluster.
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah 
>>>>>>>>>> <rishishah.s...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I have been trying to figure out a way to build an Anaconda parcel 
>>>>>>>>>>> with PyArrow included, for distribution on my Cloudera-managed 
>>>>>>>>>>> cluster, but this doesn't seem to work right. Could someone 
>>>>>>>>>>> please help?
>>>>>>>>>>>
>>>>>>>>>>> I have tried installing Anaconda on one of the management nodes on 
>>>>>>>>>>> the Cloudera cluster and tarring the directory, but this directory 
>>>>>>>>>>> doesn't include all the packages needed to form a proper parcel 
>>>>>>>>>>> for distribution.
>>>>>>>>>>>
>>>>>>>>>>> Any help is much appreciated!
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Rishi Shah
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Patrick McCarthy
>>>>>>>>>
>>>>>>>>> Senior Data Scientist, Machine Learning Engineering
>>>>>>>>>
>>>>>>>>> Dstillery
>>>>>>>>>
>>>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
