Hi,

On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote:
>
> Thanks Gourav.
>
> Incidentally, since the regular UDF is row-wise, we could optimize that a
> bit by taking the convert() closure and simply making that the UDF.
>
> Since there's that MGRS object that we have to create too, we could
> probably optimize it further by applying the UDF via rdd.mapPartitions,
> which would allow the UDF to instantiate objects once per partition
> instead of per row and then iterate element-wise through the rows of the
> partition.
>
> All that said, having done the above on prior projects, I find the pandas
> abstractions to be very elegant and friendly to the end user, so I
> haven't looked back :)
>
> (The common memory model via Arrow is a nice boost too!)
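A rough sketch of the rdd.mapPartitions pattern Patrick describes above, assuming the mgrs package that appears later in the thread and a hypothetical 'lat_lon' string column:

    import mgrs

    def convert_partition(rows):
        # Instantiate the MGRS converter once per partition, not once per row.
        m = mgrs.MGRS()
        for row in rows:
            lat, lon = row['lat_lon'].split('_')
            yield m.toMGRS(float(lat), float(lon), MGRSPrecision=2)

    # Hypothetical usage against a DataFrame with a 'lat_lon' column:
    # mgrs_rdd = df.select('lat_lon').rdd.mapPartitions(convert_partition)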
And some tentative SPIPs that aim to use columnar representations internally in Spark should bring further performance gains in the future.

Cheers
Andrew

> On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>> The proof is in the pudding :)
>>
>> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>> Hi Patrick,
>>>
>>> Super duper, thanks a ton for sharing the code. Can you please confirm
>>> that this runs faster than the regular UDFs?
>>>
>>> Interestingly, I am also running the same transformations using another
>>> geospatial library in Python, where I am passing two fields and getting
>>> back an array.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>>>>
>>>> Human time is considerably more expensive than computer time, so in
>>>> that regard, yes :)
>>>>
>>>> This took me one minute to write and ran fast enough for my needs. If
>>>> you're willing to provide a comparable scala implementation I'd be
>>>> happy to compare them.
>>>>
>>>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>>>> def generate_mgrs_series(lat_lon_str, level):
>>>>     import mgrs
>>>>     m = mgrs.MGRS()
>>>>
>>>>     precision_level = 0
>>>>     levelval = level[0]
>>>>
>>>>     if levelval == 1000:
>>>>         precision_level = 2
>>>>     if levelval == 100:
>>>>         precision_level = 3
>>>>
>>>>     def convert(ll_str):
>>>>         lat, lon = ll_str.split('_')
>>>>         return m.toMGRS(lat, lon, MGRSPrecision=precision_level)
>>>>
>>>>     return lat_lon_str.apply(convert)
>>>>
>>>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>
>>>>> And you found the PANDAS UDF more performant? Can you share your code
>>>>> and prove it?
>>>>>
>>>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>>>>>>
>>>>>> I disagree that it's hype. Perhaps not 1:1 with pure scala
>>>>>> performance-wise, but for python-based data scientists or others
>>>>>> with a lot of python expertise it allows one to do things that would
>>>>>> otherwise be infeasible at scale.
>>>>>>
>>>>>> For instance, I recently had to convert latitude/longitude pairs to
>>>>>> MGRS strings
>>>>>> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
>>>>>> Writing a pandas UDF (and putting the mgrs python package into a
>>>>>> conda environment) was _significantly_ easier than any alternative
>>>>>> I found.
>>>>>>
>>>>>> @Rishi - depending on how your network is constructed, some lag
>>>>>> could come from just uploading the conda environment. If you load
>>>>>> it from hdfs with --archives, does it improve?
>>>>>>
>>>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Pandas UDF is a bit of hype. One of their blogs shows the use case
>>>>>>> of adding 1 to a field using a Pandas UDF, which is pretty much
>>>>>>> pointless. So you go beyond the blog, realise that your actual use
>>>>>>> case is more than adding one :), and the reality hits you.
>>>>>>>
>>>>>>> Pandas UDF in certain scenarios is actually slow, for example when
>>>>>>> using apply with a custom or pandas function. In fact, in certain
>>>>>>> scenarios I have found general UDFs work much faster and use much
>>>>>>> less memory.
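For comparison, a row-wise version of the same conversion as a plain Python UDF might look like the sketch below, assuming the same mgrs package and the same 'lat_lon'/'level' inputs as the pandas UDF above:

    import mgrs
    from pyspark.sql import functions as F, types as T

    @F.udf(T.StringType())
    def generate_mgrs(lat_lon_str, level):
        # A fresh MGRS object per row -- one reason row-wise UDFs can lag.
        m = mgrs.MGRS()
        precision_level = 2 if level == 1000 else 3 if level == 100 else 0
        lat, lon = lat_lon_str.split('_')
        return m.toMGRS(float(lat), float(lon), MGRSPrecision=precision_level)

    # Hypothetical usage:
    # df = df.withColumn('mgrs', generate_mgrs('lat_lon', 'level'))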
>>>>>>> Therefore, test out your use case (with at least 30 million
>>>>>>> records) before committing to the Pandas UDF option.
>>>>>>>
>>>>>>> And when you start using GroupMap, you realise after reading
>>>>>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>>>>>>> that "Oh!! Now I can run into random OOM errors, and the
>>>>>>> maxRecordsPerBatch option does not help at all."
>>>>>>>
>>>>>>> Excerpt from the above link:
>>>>>>> "Note that all data for a group will be loaded into memory before
>>>>>>> the function is applied. This can lead to out of memory exceptions,
>>>>>>> especially if the group sizes are skewed. The configuration for
>>>>>>> maxRecordsPerBatch is not applied on groups and it is up to the
>>>>>>> user to ensure that the grouped data will fit into the available
>>>>>>> memory."
>>>>>>>
>>>>>>> Let me know about your use case if possible.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gourav
>>>>>>>
>>>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thanks Patrick! I tried to package it according to these
>>>>>>>> instructions, and it got distributed on the cluster; however, the
>>>>>>>> same spark program that takes 5 minutes without the pandas UDF
>>>>>>>> now takes 25 minutes...
>>>>>>>>
>>>>>>>> Have you experienced anything like this? Also, is PyArrow 0.12
>>>>>>>> supported with Spark 2.3 (according to the documentation, it
>>>>>>>> should be fine)?
>>>>>>>>
>>>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Rishi,
>>>>>>>>>
>>>>>>>>> I've had success using the approach outlined here:
>>>>>>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>>>>>>>
>>>>>>>>> Does this work for you?
>>>>>>>>>
>>>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Modified the subject, and I would like to clarify that I am
>>>>>>>>>> looking to create an anaconda parcel with pyarrow and other
>>>>>>>>>> libraries, so that I can distribute it on the cloudera cluster.
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I have been trying to figure out a way to build an anaconda
>>>>>>>>>>> parcel with pyarrow included for my cloudera-managed server
>>>>>>>>>>> for distribution, but this doesn't seem to work right. Could
>>>>>>>>>>> someone please help?
>>>>>>>>>>>
>>>>>>>>>>> I have tried to install anaconda on one of the management
>>>>>>>>>>> nodes on the cloudera cluster and tarred the directory, but
>>>>>>>>>>> this directory doesn't include all the packages needed to form
>>>>>>>>>>> a proper parcel for distribution.
>>>>>>>>>>>
>>>>>>>>>>> Any help is much appreciated!
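To make the GroupMap caveat Gourav quotes above concrete: in a grouped-map pandas UDF, each group arrives as a single pandas DataFrame held entirely in memory, regardless of maxRecordsPerBatch. A minimal sketch (Spark 2.3-style API; the column names are hypothetical):

    from pyspark.sql import functions as F, types as T

    schema = T.StructType([
        T.StructField('group_id', T.StringType()),
        T.StructField('value', T.DoubleType()),
    ])

    @F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        # 'pdf' holds every row of one group at once -- a heavily skewed
        # group can exceed executor memory here.
        pdf['value'] = pdf['value'] - pdf['value'].mean()
        return pdf

    # Hypothetical usage:
    # df = df.groupBy('group_id').apply(demean)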