Thanks Gourav.

Incidentally, since the regular UDF is row-wise, we could optimize that a
bit by taking the convert() closure and simply making that the UDF.
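
Roughly something like this (an untested sketch; the column name and the mgrs
package usage are lifted from the pandas UDF further down the thread, so treat
them as assumptions):

from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.StringType())
def to_mgrs(ll_str):
    import mgrs
    m = mgrs.MGRS()  # still built on every call here -- see the next point
    lat, lon = ll_str.split('_')
    # precision hard-coded to 2 purely for illustration
    return m.toMGRS(lat, lon, MGRSPrecision=2)

# df.withColumn('mgrs', to_mgrs('lat_lon_str'))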

Since there's also that MGRS object we have to create, we could probably
optimize it further by applying the logic via rdd.mapPartitions, which would
let us instantiate objects once per partition instead of once per row and
then iterate element-wise through the rows of the partition.
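
An untested sketch of that variant (again assuming the same 'lat_lon_str'
column):

def convert_partition(rows):
    import mgrs  # imported on the executor; must be in the shipped env
    m = mgrs.MGRS()  # one MGRS instance per partition, not per row
    for row in rows:
        lat, lon = row['lat_lon_str'].split('_')
        yield (row['lat_lon_str'], m.toMGRS(lat, lon, MGRSPrecision=2))

# mgrs_rdd = df.rdd.mapPartitions(convert_partition)
# mgrs_df = mgrs_rdd.toDF(['lat_lon_str', 'mgrs'])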

All that said, having done the above on prior projects, I find the pandas
abstractions to be very elegant and friendly to the end user, so I haven't
looked back :)

(The common memory model via Arrow is a nice boost too!)

On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> The proof is in the pudding
>
> :)
>
>
>
> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi Patrick,
>>
>> Super duper, thanks a ton for sharing the code. Can you please confirm
>> that this runs faster than the regular UDFs?
>>
>> Interestingly, I am also running the same transformations using another
>> geospatial library in Python, where I am passing two fields and getting
>> back an array.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com>
>> wrote:
>>
>>> Human time is considerably more expensive than computer time, so in that
>>> regard, yes :)
>>>
>>> This took me one minute to write and ran fast enough for my needs. If
>>> you're willing to provide a comparable Scala implementation, I'd be happy
>>> to compare them.
>>>
>>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>>> def generate_mgrs_series(lat_lon_str, level):
>>>     import mgrs
>>>     m = mgrs.MGRS()
>>>
>>>     precision_level = 0
>>>     levelval = level[0]
>>>     if levelval == 1000:
>>>         precision_level = 2
>>>     if levelval == 100:
>>>         precision_level = 3
>>>
>>>     def convert(ll_str):
>>>         lat, lon = ll_str.split('_')
>>>         return m.toMGRS(lat, lon, MGRSPrecision=precision_level)
>>>
>>>     return lat_lon_str.apply(lambda x: convert(x))
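>>>
>>> For reference, a hypothetical invocation (the column names here are
>>> assumptions, not taken from the real job) would look something like:
>>>
>>> df = df.withColumn('mgrs',
>>>                    generate_mgrs_series(F.col('lat_lon_str'), F.col('level')))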
>>>
>>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> And you found the pandas UDF more performant? Can you share your code
>>>> and prove it?
>>>>
>>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <
>>>> pmccar...@dstillery.com> wrote:
>>>>
>>>>> I disagree that it's hype. Perhaps it's not 1:1 with pure Scala
>>>>> performance-wise, but for Python-based data scientists, or others with a
>>>>> lot of Python expertise, it allows one to do things that would otherwise
>>>>> be infeasible at scale.
>>>>>
>>>>> For instance, I recently had to convert latitude/longitude pairs to MGRS
>>>>> strings (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
>>>>> Writing a pandas UDF (and putting the mgrs Python package into a conda
>>>>> environment) was _significantly_ easier than any alternative I found.
>>>>>
>>>>> @Rishi - depending on how your network is constructed, some lag could
>>>>> come from just uploading the conda environment. If you load it from HDFS
>>>>> with --archives, does it improve?
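>>>>>
>>>>> As a rough sketch only (the HDFS path and env alias below are made up,
>>>>> and the details differ between client and cluster mode), the programmatic
>>>>> equivalent of passing --archives to spark-submit looks roughly like:
>>>>>
>>>>> from pyspark.sql import SparkSession
>>>>>
>>>>> spark = (
>>>>>     SparkSession.builder
>>>>>     # ship the conda env from HDFS; '#environment' is the unpacked alias
>>>>>     .config("spark.yarn.dist.archives",
>>>>>             "hdfs:///user/me/conda_env.tar.gz#environment")
>>>>>     # point YARN containers at the shipped interpreter (cluster mode)
>>>>>     .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON",
>>>>>             "./environment/bin/python")
>>>>>     .getOrCreate()
>>>>> )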
>>>>>
>>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <
>>>>> gourav.sengu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Pandas UDF is a bit of hype. One of their blogs shows the use case of
>>>>>> adding 1 to a field using a Pandas UDF, which is pretty much pointless.
>>>>>> So you go beyond the blog, realise that your actual use case is more
>>>>>> than adding one :), and the reality hits you.
>>>>>>
>>>>>> Pandas UDFs are actually slow in certain scenarios; try using apply with
>>>>>> a custom or pandas function. In fact, in certain scenarios I have found
>>>>>> that general UDFs work much faster and use much less memory. Therefore,
>>>>>> test out your use case (with at least 30 million records) before trying
>>>>>> to use the Pandas UDF option.
>>>>>>
>>>>>> And when you start using the grouped map variant, then after reading
>>>>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>>>>>> you realise: "Oh!! Now I can run into random OOM errors, and the
>>>>>> maxRecordsPerBatch option does not help at all."
>>>>>>
>>>>>> Excerpt from the above link:
>>>>>> Note that all data for a group will be loaded into memory before the
>>>>>> function is applied. This can lead to out of memory exceptions, especially
>>>>>> if the group sizes are skewed. The configuration for maxRecordsPerBatch
>>>>>> <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size>
>>>>>> is not applied on groups and it is up to the user to ensure that the
>>>>>> grouped data will fit into the available memory.
>>>>>>
>>>>>> Let me know about your use case if possible.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Gourav
>>>>>>
>>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Patrick! I tried to package it according to these instructions,
>>>>>>> and it got distributed on the cluster; however, the same Spark program
>>>>>>> that takes 5 minutes without the pandas UDF has started to take 25
>>>>>>> minutes...
>>>>>>>
>>>>>>> Have you experienced anything like this? Also, is PyArrow 0.12
>>>>>>> supported with Spark 2.3 (according to the documentation, it should be
>>>>>>> fine)?
>>>>>>>
>>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <
>>>>>>> pmccar...@dstillery.com> wrote:
>>>>>>>
>>>>>>>> Hi Rishi,
>>>>>>>>
>>>>>>>> I've had success using the approach outlined here:
>>>>>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>>>>>>
>>>>>>>> Does this work for you?
>>>>>>>>
>>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <
>>>>>>>> rishishah.s...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Modified the subject, and would like to clarify that I am looking to
>>>>>>>>> create an Anaconda parcel with pyarrow and other libraries, so that I
>>>>>>>>> can distribute it on the Cloudera cluster...
>>>>>>>>>
>>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <
>>>>>>>>> rishishah.s...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I have been trying to figure out a way to build an Anaconda parcel
>>>>>>>>>> with pyarrow included for distribution on my Cloudera-managed
>>>>>>>>>> cluster, but this doesn't seem to work right. Could someone please
>>>>>>>>>> help?
>>>>>>>>>>
>>>>>>>>>> I have tried installing Anaconda on one of the management nodes of
>>>>>>>>>> the Cloudera cluster and tarring the directory, but that directory
>>>>>>>>>> doesn't include all the packages needed to form a proper parcel for
>>>>>>>>>> distribution.
>>>>>>>>>>
>>>>>>>>>> Any help is much appreciated!
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Rishi Shah
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

-- 


Patrick McCarthy

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016
