The proof is in the pudding :)

On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Patrick,
>
> Super duper, thanks a ton for sharing the code. Can you please confirm
> that this runs faster than the regular UDFs?
>
> Interestingly, I am also running the same transformations using another
> geospatial library in Python, where I am passing two fields and getting
> back an array.
>
> Regards,
> Gourav Sengupta
>
> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>
>> Human time is considerably more expensive than computer time, so in
>> that regard, yes :)
>>
>> This took me one minute to write and it ran fast enough for my needs.
>> If you're willing to provide a comparable Scala implementation, I'd be
>> happy to compare them.
>>
>> import pyspark.sql.functions as F
>> import pyspark.sql.types as T
>>
>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>> def generate_mgrs_series(lat_lon_str, level):
>>     import mgrs
>>     m = mgrs.MGRS()
>>
>>     # Map the grid level to an MGRS precision; assumes 'level' is
>>     # constant within each batch.
>>     precision_level = 0
>>     levelval = level[0]
>>     if levelval == 1000:
>>         precision_level = 2
>>     if levelval == 100:
>>         precision_level = 3
>>
>>     def convert(ll_str):
>>         lat, lon = ll_str.split('_')
>>         return m.toMGRS(lat, lon, MGRSPrecision=precision_level)
>>
>>     return lat_lon_str.apply(convert)
>>
>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>>> And you found the pandas UDF more performant? Can you share your code
>>> and prove it?
>>>
>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>>>
>>>> I disagree that it's hype. Perhaps not 1:1 with pure Scala
>>>> performance-wise, but for Python-based data scientists, or others
>>>> with a lot of Python expertise, it allows one to do things that
>>>> would otherwise be infeasible at scale.
>>>>
>>>> For instance, I recently had to convert latitude/longitude pairs to
>>>> MGRS strings
>>>> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
>>>> Writing a pandas UDF (and putting the mgrs Python package into a
>>>> conda environment) was _significantly_ easier than any alternative
>>>> I found.
>>>>
>>>> @Rishi - depending on how your network is constructed, some lag
>>>> could come from just uploading the conda environment. If you load it
>>>> from HDFS with --archives, does it improve?
>>>>
>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Pandas UDFs are a bit of hype. One of the blogs shows the use case
>>>>> of adding 1 to a field using a pandas UDF, which is pretty much
>>>>> pointless. So you go beyond the blog, realise that your actual use
>>>>> case is more than adding one :), and the reality hits you.
>>>>>
>>>>> Pandas UDFs are actually slow in certain scenarios; try using
>>>>> apply with a custom or pandas function. In fact, in certain
>>>>> scenarios I have found general UDFs to work much faster and use
>>>>> much less memory. Therefore, test out your use case (with at least
>>>>> 30 million records) before settling on the pandas UDF option.
>>>>>
>>>>> And when you start using the grouped map type, then after reading
>>>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>>>>> you realise: "Oh! Now I can run into random OOM errors, and the
>>>>> maxRecordsPerBatch option does not help at all."
>>>>>
>>>>> Excerpt from the above link: "Note that all data for a group will
>>>>> be loaded into memory before the function is applied. This can
>>>>> lead to out of memory exceptions, especially if the group sizes
>>>>> are skewed. The configuration for maxRecordsPerBatch
>>>>> (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size)
>>>>> is not applied on groups and it is up to the user to ensure that
>>>>> the grouped data will fit into the available memory."
>>>>>
>>>>> Let me know about your use case if possible.
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Patrick! I tried to package it according to those
>>>>>> instructions, and it got distributed on the cluster; however, the
>>>>>> same Spark program that takes 5 minutes without the pandas UDF
>>>>>> has started to take 25 minutes...
>>>>>>
>>>>>> Have you experienced anything like this? Also, is PyArrow 0.12
>>>>>> supported with Spark 2.3 (according to the documentation, it
>>>>>> should be fine)?
>>>>>>
>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>>>>>>
>>>>>>> Hi Rishi,
>>>>>>>
>>>>>>> I've had success using the approach outlined here:
>>>>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>>>>>
>>>>>>> Does this work for you?
>>>>>>>
>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I modified the subject and would like to clarify that I am
>>>>>>>> looking to create an Anaconda parcel with PyArrow and other
>>>>>>>> libraries, so that I can distribute it on the Cloudera cluster.
>>>>>>>>
>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I have been trying to figure out a way to build an Anaconda
>>>>>>>>> parcel with PyArrow included for my Cloudera-managed cluster,
>>>>>>>>> but this doesn't seem to work right. Could someone please
>>>>>>>>> help?
>>>>>>>>>
>>>>>>>>> I have tried installing Anaconda on one of the management
>>>>>>>>> nodes of the Cloudera cluster and tarring the directory, but
>>>>>>>>> that directory doesn't include all the packages needed to form
>>>>>>>>> a proper parcel for distribution.
>>>>>>>>>
>>>>>>>>> Any help is much appreciated!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Rishi Shah
>>>>>>>
>>>>>>> --
>>>>>>> Patrick McCarthy
>>>>>>> Senior Data Scientist, Machine Learning Engineering
>>>>>>> Dstillery
>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
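For anyone following along, a minimal sketch of how the UDF above might be invoked; the DataFrame df and the column names 'lat_lon' and 'level' are hypothetical:

    import pyspark.sql.functions as F

    # Hypothetical input: 'lat_lon' holds strings like "40.7580_-73.9855",
    # 'level' holds the integer grid level (1000 or 100).
    df = df.withColumn('mgrs',
                       generate_mgrs_series(F.col('lat_lon'), F.col('level')))
    df.select('lat_lon', 'level', 'mgrs').show(5, truncate=False)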
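To make the quoted caveat concrete, a minimal grouped-map sketch against the Spark 2.3/2.4 API (the schema, key, and function names are hypothetical). All rows for a key are materialized as a single pandas DataFrame on one executor, so one heavily skewed key can exhaust executor memory regardless of spark.sql.execution.arrow.maxRecordsPerBatch:

    import pyspark.sql.functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    schema = StructType([
        StructField('key', StringType()),
        StructField('value', DoubleType()),
    ])

    @F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        # pdf contains *all* rows for one key, loaded into memory at once
        return pdf.assign(value=pdf['value'] - pdf['value'].mean())

    result = df.groupBy('key').apply(demean)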
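On the packaging question, a sketch of shipping a conda environment from HDFS via --archives, along the lines of the conda-pack pattern; the archive name, HDFS path, and script name are all hypothetical, and this assumes the archive has bin/python at its root:

    # Pack the environment once (e.g. with conda-pack) and upload it to HDFS.
    # YARN unpacks the archive into a directory named by the '#' alias in
    # each container's working directory.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --archives hdfs:///user/me/envs/environment.tar.gz#environment \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
      --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
      my_job.py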
[Attachment: TheProof.ipynb (binary data)]