Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
Hi Andrew, Do not misrepresent my statements. I said it depends on the use case; I NEVER (note the word "never") said that Pandas UDF is ALWAYS (note the word "always") slow. Regards, Gourav Sengupta On Mon, May 6, 2019 at 6:00 PM Andrew Melo wrote: > Hi, > > On Mon, May 6, 2019 at

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
Hence, what I mentioned initially does sound correct? On Mon, May 6, 2019 at 5:43 PM Andrew Melo wrote: > Hi, > > On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy > wrote: > > > > Thanks Gourav. > > > > Incidentally, since the regular UDF is row-wise, we could optimize that > a bit by taking

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi, On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta wrote: > > Hence, what I mentioned initially does sound correct? I don't agree at all - we've had a significant boost from moving from regular UDFs to pandas UDFs. YMMV, of course. > > On Mon, May 6, 2019 at 5:43 PM Andrew Melo wrote: >> >>

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi, On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy wrote: > > Thanks Gourav. > > Incidentally, since the regular UDF is row-wise, we could optimize that a bit > by taking the convert() closure and simply making that the UDF. > > Since there's that MGRS object that we have to create too, we

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
Thanks Gourav. Incidentally, since the regular UDF is row-wise, we could optimize that a bit by taking the convert() closure and simply making that the UDF. Since there's that MGRS object that we have to create too, we could probably optimize it further by applying the UDF via rdd.mapPartitions,

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
The proof is in the pudding :) On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta wrote: > Hi Patrick, > > super duper, thanks a ton for sharing the code. Can you please confirm > that this runs faster than the regular UDFs? > > Interestingly, I am also running the same transformations using another

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
Hi Patrick, super duper, thanks a ton for sharing the code. Can you please confirm that this runs faster than the regular UDFs? Interestingly, I am also running the same transformations using another geospatial library in Python, where I am passing two fields and getting back an array. Regards,

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
Human time is considerably more expensive than computer time, so in that regard, yes :) This took me one minute to write and ran fast enough for my needs. If you're willing to provide a comparable Scala implementation I'd be happy to compare them. @F.pandas_udf(T.StringType(),

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
And you found the Pandas UDF more performant? Can you share your code and prove it? On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy wrote: > I disagree that it's hype. Perhaps not 1:1 with pure Scala > performance-wise, but for Python-based data scientists or others with a lot > of Python

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-05 Thread Patrick McCarthy
I disagree that it's hype. Perhaps not 1:1 with pure Scala performance-wise, but for Python-based data scientists or others with a lot of Python expertise it allows one to do things that would otherwise be infeasible at scale. For instance, I recently had to convert latitude / longitude pairs to

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-05 Thread Gourav Sengupta
Hi, Pandas UDF is a bit of hype. One of their blogs shows the use case of adding 1 to a field using Pandas UDF, which is pretty much pointless. So you go beyond the blog and realise that your actual use case is more than adding one :) and reality hits you: Pandas UDF in certain scenarios is

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-04 Thread Rishi Shah
Thanks Patrick! I tried to package it according to these instructions and it got distributed on the cluster; however, the same Spark program that takes 5 mins without pandas UDF has started to take 25 mins... Have you experienced anything like this? Also, is PyArrow 0.12 supported with Spark 2.3

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-04-30 Thread Patrick McCarthy
Hi Rishi, I've had success using the approach outlined here: https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html Does this work for you? On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah wrote: > modified the subject & would like to clarify that I am looking to

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-04-29 Thread Rishi Shah
Modified the subject & would like to clarify that I am looking to create an Anaconda parcel with pyarrow and other libraries, so that I can distribute it on the Cloudera cluster. On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah wrote: > Hi All, > > I have been trying to figure out a way to build