Human time is considerably more expensive than computer time, so in that regard, yes :)
This took me one minute to write and ran fast enough for my needs. If
you're willing to provide a comparable Scala implementation, I'd be happy
to compare them.

# Assumes the usual PySpark aliases:
#   import pyspark.sql.functions as F
#   import pyspark.sql.types as T
@F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
def generate_mgrs_series(lat_lon_str, level):
    # Imported inside the UDF so the executors resolve mgrs from the
    # shipped conda environment.
    import mgrs
    m = mgrs.MGRS()

    # Map the grid size in meters to an MGRS precision level. 'level' is
    # a pandas Series (constant per batch), so peek at the first value.
    levelval = level.iloc[0]
    if levelval == 1000:
        precision_level = 2
    elif levelval == 100:
        precision_level = 3
    else:
        precision_level = 0

    def convert(ll_str):
        lat, lon = ll_str.split('_')
        return m.toMGRS(float(lat), float(lon), MGRSPrecision=precision_level)

    return lat_lon_str.apply(convert)
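For reference, the same conversion as a plain row-at-a-time Python UDF
would look roughly like the sketch below (untested; the function name is
illustrative, and it assumes the same '<lat>_<lon>' input format and mgrs
package):

# Row-at-a-time equivalent, for comparison. Note that mgrs.MGRS() is
# constructed once per row here, versus once per Arrow batch above.
@F.udf(T.StringType())
def generate_mgrs(lat_lon_str, level):
    import mgrs
    lat, lon = lat_lon_str.split('_')
    precision_level = {1000: 2, 100: 3}.get(level, 0)
    return mgrs.MGRS().toMGRS(float(lat), float(lon),
                              MGRSPrecision=precision_level)

The pandas version amortizes the MGRS() construction and the
Python-to-JVM serialization over a whole Arrow batch, which is typically
where its speedup comes from.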
On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> And you found the Pandas UDF more performant? Can you share your code
> and prove it?
>
> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com>
> wrote:
>
>> I disagree that it's hype. It's perhaps not 1:1 with pure Scala
>> performance-wise, but for Python-based data scientists or others with
>> a lot of Python expertise it allows one to do things that would
>> otherwise be infeasible at scale.
>>
>> For instance, I recently had to convert latitude/longitude pairs to
>> MGRS strings
>> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
>> Writing a pandas UDF (and putting the mgrs Python package into a conda
>> environment) was _significantly_ easier than any alternative I found.
>>
>> @Rishi - depending on how your network is constructed, some lag could
>> come from just uploading the conda environment. If you load it from
>> HDFS with --archives, does it improve?
>>
>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta
>> <gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Pandas UDFs are a bit of hype. One of the blogs shows the use case of
>>> adding 1 to a field using a Pandas UDF, which is pretty much
>>> pointless. So you go beyond the blog, realise that your actual use
>>> case is more than adding one :), and the reality hits you.
>>>
>>> Pandas UDFs are actually slow in certain scenarios; try using apply
>>> with a custom or pandas function instead. In fact, in certain
>>> scenarios I have found that general UDFs work much faster and use
>>> much less memory. Therefore, test out your use case (with at least
>>> 30 million records) before committing to the Pandas UDF option.
>>>
>>> And when you start using GroupMap, after reading
>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>>> you realise: "Oh! Now I can run into random OOM errors, and the
>>> maxRecordsPerBatch option does not help at all."
>>>
>>> Excerpt from the above link:
>>> "Note that all data for a group will be loaded into memory before the
>>> function is applied. This can lead to out of memory exceptions,
>>> especially if the group sizes are skewed. The configuration for
>>> maxRecordsPerBatch
>>> <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size>
>>> is not applied on groups and it is up to the user to ensure that the
>>> grouped data will fit into the available memory."
>>>
>>> Let me know about your use case if possible.
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Patrick! I tried to package it according to those
>>>> instructions; it got distributed on the cluster, however the same
>>>> Spark program that takes 5 mins without the pandas UDF has started
>>>> to take 25 mins...
>>>>
>>>> Have you experienced anything like this? Also, is PyArrow 0.12
>>>> supported with Spark 2.3 (according to the documentation, it should
>>>> be fine)?
>>>>
>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy
>>>> <pmccar...@dstillery.com> wrote:
>>>>
>>>>> Hi Rishi,
>>>>>
>>>>> I've had success using the approach outlined here:
>>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>>>
>>>>> Does this work for you?
>>>>>
>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah
>>>>> <rishishah.s...@gmail.com> wrote:
>>>>>
>>>>>> Modified the subject & would like to clarify that I am looking to
>>>>>> create an Anaconda parcel with PyArrow and other libraries, so
>>>>>> that I can distribute it on the Cloudera cluster.
>>>>>>
>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah
>>>>>> <rishishah.s...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have been trying to figure out a way to build an Anaconda
>>>>>>> parcel with PyArrow included for my Cloudera-managed cluster,
>>>>>>> but this doesn't seem to work right. Could someone please help?
>>>>>>>
>>>>>>> I have tried installing Anaconda on one of the management nodes
>>>>>>> of the Cloudera cluster and tarred the directory, but this
>>>>>>> directory doesn't include all the packages needed to form a
>>>>>>> proper parcel for distribution.
>>>>>>>
>>>>>>> Any help is much appreciated!
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>>
>>>>>>> Rishi Shah
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>>
>>>>>> Rishi Shah
>>>>>
>>>>> --
>>>>> *Patrick McCarthy*
>>>>> Senior Data Scientist, Machine Learning Engineering
>>>>> Dstillery
>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Rishi Shah
>>
>> --
>> *Patrick McCarthy*
>> Senior Data Scientist, Machine Learning Engineering
>> Dstillery
>> 470 Park Ave South, 17th Floor, NYC 10016

-- 
*Patrick McCarthy*
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016
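P.S. In case it helps anyone following along: a minimal sketch of the
--archives approach mentioned above. The archive name, paths, and env
layout here are hypothetical (an environment packed with conda-pack or
similar), and exact flags vary by cluster:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives hdfs:///envs/mgrs_env.zip#MGRS_ENV \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MGRS_ENV/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./MGRS_ENV/bin/python \
  my_job.py

The #MGRS_ENV suffix sets the directory name the archive is unpacked
under in each YARN container's working directory, which is why the
relative ./MGRS_ENV/bin/python path resolves on the workers.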