Human time is considerably more expensive than computer time, so in that regard, yes :)
This took me one minute to write and ran fast enough for my needs. If
you're willing to provide a comparable Scala implementation, I'd be happy
to compare them.

# Assumes the usual PySpark aliases:
#   import pyspark.sql.functions as F
#   import pyspark.sql.types as T
@F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
def generate_mgrs_series(lat_lon_str, level):
    # Imported inside the UDF so the executors resolve mgrs from the
    # shipped conda environment.
    import mgrs
    m = mgrs.MGRS()

    # Map the grid size in meters to an MGRS precision level. 'level' is
    # a pandas Series (constant per batch), so peek at the first value.
    levelval = level.iloc[0]
    if levelval == 1000:
        precision_level = 2
    elif levelval == 100:
        precision_level = 3
    else:
        precision_level = 0

    def convert(ll_str):
        lat, lon = ll_str.split('_')
        return m.toMGRS(float(lat), float(lon), MGRSPrecision=precision_level)

    return lat_lon_str.apply(convert)
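For reference, the same conversion as a plain row-at-a-time Python UDF
would look roughly like the sketch below (untested; the function name is
illustrative, and it assumes the same '<lat>_<lon>' input format and mgrs
package):

# Row-at-a-time equivalent, for comparison. Note that mgrs.MGRS() is
# constructed once per row here, versus once per Arrow batch above.
@F.udf(T.StringType())
def generate_mgrs(lat_lon_str, level):
    import mgrs
    lat, lon = lat_lon_str.split('_')
    precision_level = {1000: 2, 100: 3}.get(level, 0)
    return mgrs.MGRS().toMGRS(float(lat), float(lon),
                              MGRSPrecision=precision_level)

The pandas version amortizes the MGRS() construction and the
Python-to-JVM serialization over a whole Arrow batch, which is typically
where its speedup comes from.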
On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> And you found the Pandas UDF more performant? Can you share your code
> and prove it?
>
> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com>
> wrote:
>
>> I disagree that it's hype. It's perhaps not 1:1 with pure Scala
>> performance-wise, but for Python-based data scientists or others with
>> a lot of Python expertise it allows one to do things that would
>> otherwise be infeasible at scale.
>>
>> For instance, I recently had to convert latitude/longitude pairs to
>> MGRS strings
>> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
>> Writing a pandas UDF (and putting the mgrs Python package into a conda
>> environment) was _significantly_ easier than any alternative I found.
>>
>> @Rishi - depending on how your network is constructed, some lag could
>> come from just uploading the conda environment. If you load it from
>> HDFS with --archives, does it improve?
>>
>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta
>> <gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Pandas UDFs are a bit of hype. One of the blogs shows the use case of
>>> adding 1 to a field using a Pandas UDF, which is pretty much
>>> pointless. So you go beyond the blog, realise that your actual use
>>> case is more than adding one :), and the reality hits you.
>>>
>>> Pandas UDFs are actually slow in certain scenarios; try using apply
>>> with a custom or pandas function instead. In fact, in certain
>>> scenarios I have found that general UDFs work much faster and use
>>> much less memory. Therefore, test out your use case (with at least
>>> 30 million records) before committing to the Pandas UDF option.
>>>
>>> And when you start using GroupMap, after reading
>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>>> you realise: "Oh! Now I can run into random OOM errors, and the
>>> maxRecordsPerBatch option does not help at all."
>>>
>>> Excerpt from the above link:
>>> "Note that all data for a group will be loaded into memory before the
>>> function is applied. This can lead to out of memory exceptions,
>>> especially if the group sizes are skewed. The configuration for
>>> maxRecordsPerBatch
>>> <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size>
>>> is not applied on groups and it is up to the user to ensure that the
>>> grouped data will fit into the available memory."
>>>
>>> Let me know about your use case if possible.
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Patrick! I tried to package it according to those
>>>> instructions; it got distributed on the cluster, however the same
>>>> Spark program that takes 5 mins without the pandas UDF has started
>>>> to take 25 mins...
>>>>
>>>> Have you experienced anything like this? Also, is PyArrow 0.12
>>>> supported with Spark 2.3 (according to the documentation, it should
>>>> be fine)?
>>>>
>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy
>>>> <pmccar...@dstillery.com> wrote:
>>>>
>>>>> Hi Rishi,
>>>>>
>>>>> I've had success using the approach outlined here:
>>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>>>
>>>>> Does this work for you?
>>>>>
>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah
>>>>> <rishishah.s...@gmail.com> wrote:
>>>>>
>>>>>> Modified the subject & would like to clarify that I am looking to
>>>>>> create an Anaconda parcel with PyArrow and other libraries, so
>>>>>> that I can distribute it on the Cloudera cluster.
>>>>>>
>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah
>>>>>> <rishishah.s...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have been trying to figure out a way to build an Anaconda
>>>>>>> parcel with PyArrow included for my Cloudera-managed cluster,
>>>>>>> but this doesn't seem to work right. Could someone please help?
>>>>>>>
>>>>>>> I have tried installing Anaconda on one of the management nodes
>>>>>>> of the Cloudera cluster and tarred the directory, but this
>>>>>>> directory doesn't include all the packages needed to form a
>>>>>>> proper parcel for distribution.
>>>>>>>
>>>>>>> Any help is much appreciated!
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>>
>>>>>>> Rishi Shah
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>>
>>>>>> Rishi Shah
>>>>>
>>>>> --
>>>>> *Patrick McCarthy*
>>>>> Senior Data Scientist, Machine Learning Engineering
>>>>> Dstillery
>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Rishi Shah
>>
>> --
>> *Patrick McCarthy*
>> Senior Data Scientist, Machine Learning Engineering
>> Dstillery
>> 470 Park Ave South, 17th Floor, NYC 10016

-- 
*Patrick McCarthy*
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016
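P.S. In case it helps anyone following along: a minimal sketch of the
--archives approach mentioned above. The archive name, paths, and env
layout here are hypothetical (an environment packed with conda-pack or
similar), and exact flags vary by cluster:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives hdfs:///envs/mgrs_env.zip#MGRS_ENV \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MGRS_ENV/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./MGRS_ENV/bin/python \
  my_job.py

The #MGRS_ENV suffix sets the directory name the archive is unpacked
under in each YARN container's working directory, which is why the
relative ./MGRS_ENV/bin/python path resolves on the workers.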