The proof is in the pudding :)

On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Patrick,
>
> Super duper, thanks a ton for sharing the code. Can you please confirm
> that this runs faster than the regular UDFs?
>
> Interestingly, I am also running the same transformations using another
> geospatial library in Python, where I am passing two fields and getting
> back an array.
>
> Regards,
> Gourav Sengupta
>
> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>
>> Human time is considerably more expensive than computer time, so in
>> that regard, yes :)
>>
>> This took me one minute to write and it ran fast enough for my needs.
>> If you're willing to provide a comparable Scala implementation, I'd be
>> happy to compare them.
>>
>> import pyspark.sql.functions as F
>> import pyspark.sql.types as T
>>
>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>> def generate_mgrs_series(lat_lon_str, level):
>>     import mgrs
>>     m = mgrs.MGRS()
>>
>>     # Map the grid level to an MGRS precision; assumes 'level' is
>>     # constant within each batch.
>>     precision_level = 0
>>     levelval = level[0]
>>     if levelval == 1000:
>>         precision_level = 2
>>     if levelval == 100:
>>         precision_level = 3
>>
>>     def convert(ll_str):
>>         lat, lon = ll_str.split('_')
>>         return m.toMGRS(lat, lon, MGRSPrecision=precision_level)
>>
>>     return lat_lon_str.apply(convert)
>>
>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>>> And you found the pandas UDF more performant? Can you share your code
>>> and prove it?
>>>
>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>>>
>>>> I disagree that it's hype. Perhaps not 1:1 with pure Scala
>>>> performance-wise, but for Python-based data scientists, or others
>>>> with a lot of Python expertise, it allows one to do things that
>>>> would otherwise be infeasible at scale.
>>>>
>>>> For instance, I recently had to convert latitude/longitude pairs to
>>>> MGRS strings
>>>> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
>>>> Writing a pandas UDF (and putting the mgrs Python package into a
>>>> conda environment) was _significantly_ easier than any alternative
>>>> I found.
>>>>
>>>> @Rishi - depending on how your network is constructed, some lag
>>>> could come from just uploading the conda environment. If you load it
>>>> from HDFS with --archives, does it improve?
>>>>
>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Pandas UDFs are a bit of hype. One of the blogs shows the use case
>>>>> of adding 1 to a field using a pandas UDF, which is pretty much
>>>>> pointless. So you go beyond the blog, realise that your actual use
>>>>> case is more than adding one :), and the reality hits you.
>>>>>
>>>>> Pandas UDFs are actually slow in certain scenarios; try using
>>>>> apply with a custom or pandas function. In fact, in certain
>>>>> scenarios I have found general UDFs to work much faster and use
>>>>> much less memory. Therefore, test out your use case (with at least
>>>>> 30 million records) before settling on the pandas UDF option.
>>>>>
>>>>> And when you start using the grouped map type, then after reading
>>>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>>>>> you realise: "Oh! Now I can run into random OOM errors, and the
>>>>> maxRecordsPerBatch option does not help at all."
>>>>>
>>>>> Excerpt from the above link: "Note that all data for a group will
>>>>> be loaded into memory before the function is applied. This can
>>>>> lead to out of memory exceptions, especially if the group sizes
>>>>> are skewed. The configuration for maxRecordsPerBatch
>>>>> (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size)
>>>>> is not applied on groups and it is up to the user to ensure that
>>>>> the grouped data will fit into the available memory."
>>>>>
>>>>> Let me know about your use case if possible.
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Patrick! I tried to package it according to those
>>>>>> instructions, and it got distributed on the cluster; however, the
>>>>>> same Spark program that takes 5 minutes without the pandas UDF
>>>>>> has started to take 25 minutes...
>>>>>>
>>>>>> Have you experienced anything like this? Also, is PyArrow 0.12
>>>>>> supported with Spark 2.3 (according to the documentation, it
>>>>>> should be fine)?
>>>>>>
>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>>>>>>
>>>>>>> Hi Rishi,
>>>>>>>
>>>>>>> I've had success using the approach outlined here:
>>>>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>>>>>
>>>>>>> Does this work for you?
>>>>>>>
>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I modified the subject and would like to clarify that I am
>>>>>>>> looking to create an Anaconda parcel with PyArrow and other
>>>>>>>> libraries, so that I can distribute it on the Cloudera cluster.
>>>>>>>>
>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I have been trying to figure out a way to build an Anaconda
>>>>>>>>> parcel with PyArrow included for my Cloudera-managed cluster,
>>>>>>>>> but this doesn't seem to work right. Could someone please
>>>>>>>>> help?
>>>>>>>>>
>>>>>>>>> I have tried installing Anaconda on one of the management
>>>>>>>>> nodes of the Cloudera cluster and tarring the directory, but
>>>>>>>>> that directory doesn't include all the packages needed to form
>>>>>>>>> a proper parcel for distribution.
>>>>>>>>>
>>>>>>>>> Any help is much appreciated!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Rishi Shah
>>>>>>>
>>>>>>> --
>>>>>>> Patrick McCarthy
>>>>>>> Senior Data Scientist, Machine Learning Engineering
>>>>>>> Dstillery
>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
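For anyone following along, a minimal sketch of how the UDF above might be invoked; the DataFrame df and the column names 'lat_lon' and 'level' are hypothetical:

    import pyspark.sql.functions as F

    # Hypothetical input: 'lat_lon' holds strings like "40.7580_-73.9855",
    # 'level' holds the integer grid level (1000 or 100).
    df = df.withColumn('mgrs',
                       generate_mgrs_series(F.col('lat_lon'), F.col('level')))
    df.select('lat_lon', 'level', 'mgrs').show(5, truncate=False)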
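To make the quoted caveat concrete, a minimal grouped-map sketch against the Spark 2.3/2.4 API (the schema, key, and function names are hypothetical). All rows for a key are materialized as a single pandas DataFrame on one executor, so one heavily skewed key can exhaust executor memory regardless of spark.sql.execution.arrow.maxRecordsPerBatch:

    import pyspark.sql.functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    schema = StructType([
        StructField('key', StringType()),
        StructField('value', DoubleType()),
    ])

    @F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        # pdf contains *all* rows for one key, loaded into memory at once
        return pdf.assign(value=pdf['value'] - pdf['value'].mean())

    result = df.groupBy('key').apply(demean)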
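On the packaging question, a sketch of shipping a conda environment from HDFS via --archives, along the lines of the conda-pack pattern; the archive name, HDFS path, and script name are all hypothetical, and this assumes the archive has bin/python at its root:

    # Pack the environment once (e.g. with conda-pack) and upload it to HDFS.
    # YARN unpacks the archive into a directory named by the '#' alias in
    # each container's working directory.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --archives hdfs:///user/me/envs/environment.tar.gz#environment \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
      --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
      my_job.py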
[Attachment: TheProof.ipynb (binary data)]