Re: On adding applyInArrow to groupBy and cogroup

2023-11-03 Thread Abdeali Kothari
Seeing more support for Arrow-based functions would be great. It gives more control to application developers, and pandas just becomes one of the available options. On Fri, 3 Nov 2023, 21:23 Luca Canali wrote: > Hi Enrico, > +1 on supporting Arrow on par with Pandas. Besides the frameworks…
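For context, a minimal sketch of what a grouped Arrow UDF could look like, assuming the eventual applyInArrow signature mirrors applyInPandas (the API was still a proposal at the time of this thread, so the call below is hypothetical):

    import pyarrow as pa
    import pyarrow.compute as pc
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10.0), (1, 20.0), (2, 5.0)], ["id", "v"])

    def center(table: pa.Table) -> pa.Table:
        # Operate on the whole group as an Arrow table; pandas never enters the picture.
        idx = table.schema.get_field_index("v")
        centered = pc.subtract(table.column("v"), pc.mean(table.column("v")))
        return table.set_column(idx, "v", centered)

    # Hypothetical call, assuming parity with groupBy(...).applyInPandas(...)
    df.groupBy("id").applyInArrow(center, schema="id long, v double").show()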

Re: Making spark plan UI interactive

2023-09-06 Thread Abdeali Kothari
I feel this pain frequently. Something more interactive would be great. On Wed, 6 Sep 2023 at 4:34 PM, Santosh Pingale wrote: > Hey community, > Spark UI with the plan visualisation is an excellent resource for finding out crucial information about how your application is doing and what parts…

Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Abdeali Kothari
I would definitely use it - if it's available :) On Mon, 19 Jun 2023, 21:56 Jacek Laskowski wrote: > Hi Allison and devs, > Although I was against this idea at first sight (probably because I'm a Scala dev), I think it could work as long as there are people who'd be interested in such an…
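For context, a rough sketch of the kind of user-defined source the SPIP describes; the pyspark.sql.datasource module, class names, and method signatures below are assumptions based on the proposal, not a settled API:

    from pyspark.sql import SparkSession
    from pyspark.sql.datasource import DataSource, DataSourceReader

    class FibonacciSource(DataSource):
        @classmethod
        def name(cls):
            return "fibonacci"

        def schema(self):
            return "n int"

        def reader(self, schema):
            return FibonacciReader()

    class FibonacciReader(DataSourceReader):
        def read(self, partition):
            # Yield rows as plain tuples matching the declared schema.
            a, b = 0, 1
            for _ in range(10):
                yield (a,)
                a, b = b, a + b

    spark = SparkSession.builder.getOrCreate()
    spark.dataSource.register(FibonacciSource)
    spark.read.format("fibonacci").load().show()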

Re: Spark 3.2 - ReusedExchange not present in join execution plan

2022-01-06 Thread Abdeali Kothari
… because the upgraded AQE. > Not sure whether this is expected though. > On Thu, Jan 6, 2022 at 12:11 AM Abdeali Kothari wrote: >> Just thought I'd do a quick bump and add the dev mailing list - in case there is some insight there. >> Feels like th…
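One way to test whether adaptive query execution (on by default since Spark 3.2) is behind the plan change is to rerun the join with AQE off and compare plans; df1 and df2 here are stand-ins for the DataFrames in the original report:

    # Standard Spark config key; toggle it and compare the two physical plans.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    df1.join(df2, "key").explain()  # look for ReusedExchange nodes

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    df1.join(df2, "key").explain()  # compare against the AQE plan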

Re: Spark 3.2 - ReusedExchange not present in join execution plan

2022-01-05 Thread Abdeali Kothari
Just thought I'd do a quick bump and add the dev mailing list - in case there is some insight there. Feels like this should be categorized as a bug for Spark 3.2.0. On Wed, Dec 29, 2021 at 5:25 PM Abdeali Kothari wrote: > Hi, > I am using pyspark for some projects, and one of the thi…

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Abdeali Kothari
…> jupyter from the org git repo as it was shared, so I do not know how the venv > was created, or even how the python for the venv was created. > The OS is CentOS release 6.9 (Final). > Regards, Dhrubajyoti Hati…

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Abdeali Kothari
import base64
import zlib

def decompress(data):
    bytecode = base64.b64decode(data)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode)
    …

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Abdeali Kothari
Maybe you can try running it in a python shell or jupyter-console/ipython instead of via spark-submit and check how much time it takes there too. Compare the env variables to check that no additional env configuration is present in either environment. Also, is the python environment for both the exact…
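A quick way to do that comparison is to dump the interpreter and the Spark/Python-related environment variables in both environments and diff the output; a minimal sketch:

    import os
    import sys

    print(sys.executable, sys.version)
    for k in sorted(os.environ):
        if k.startswith(("PYTHON", "PYSPARK", "SPARK")):
            print(k, "=", os.environ[k])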

Re: Resolving all JIRAs affecting EOL releases

2019-05-15 Thread Abdeali Kothari
I was thinking that an estimate of the number of issues that would be closed if this is done would help:
Open issues: 3882 (project = SPARK AND status in (Open, "In Progress", Reopened))
Open + Does not affect 3.0+ = 2795
Open + Does not affect 2.4+ = 2373
Open + Does not affect…
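For reference, counts like these can be reproduced against the ASF JIRA REST search endpoint; the affectedVersion clause below is an assumption about how "does not affect 3.0+" was computed, not the exact filter from the thread:

    import json
    import urllib.parse
    import urllib.request

    # Assumed JQL; only the first clause is quoted verbatim in the thread.
    jql = ('project = SPARK AND status in (Open, "In Progress", Reopened) '
           'AND NOT (affectedVersion >= 3.0.0)')
    url = ("https://issues.apache.org/jira/rest/api/2/search?maxResults=0&jql="
           + urllib.parse.quote(jql))
    with urllib.request.urlopen(url) as resp:
        print(json.load(resp)["total"])  # total matches, no issue bodies fetched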

Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Abdeali Kothari
…On Tue, Mar 26, 2019 at 3:34 PM, Reynold Xin wrote: >> We have some early stuff there but not quite ready to talk about it in public yet (I hope soon though). Will shoot you a separate email on it. >> On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari…

Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Abdeali Kothari
…out of some users. > We are considering building a shim layer as a separate project on top of Spark (so we can make rapid releases based on feedback) just to test this out and see how well it could work in practice. > On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari wrote…

PySpark syntax vs Pandas syntax

2019-03-26 Thread Abdeali Kothari
Hi, I was doing some spark to pandas (and vice versa) conversion because some of the pandas code we have doesn't work on huge data, and some spark code runs very slowly on small data. It was nice to see that pyspark has similar syntax for the common pandas operations that the python community…
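To make the comparison concrete, a small illustration of where the two APIs look alike and where they diverge; the DataFrames here are hypothetical:

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df_pd = pd.DataFrame({"id": [1, 1, 2], "v": [10.0, 20.0, 5.0]})
    df_sp = spark.createDataFrame(df_pd)

    # Filtering looks almost identical...
    df_pd[df_pd["v"] > 6.0]
    df_sp[df_sp["v"] > 6.0]

    # ...but aggregation syntax diverges:
    df_pd.groupby("id")["v"].mean()
    df_sp.groupBy("id").agg(F.mean("v"))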

Accumulator issues in PySpark

2018-09-25 Thread Abdeali Kothari
I was trying out accumulators to see if I could use them for anything. I made a demo program and could not figure out how to get them to add up. I found that I need to do a shuffle between all the python UDFs I am running for the accumulators to be run. Basically, if I do 5 withColumn()…
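A minimal sketch of the behaviour being described - an accumulator bumped inside a Python UDF only moves once an action forces the UDF to run (names here are illustrative):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    acc = spark.sparkContext.accumulator(0)

    @F.udf(LongType())
    def bump(x):
        acc.add(1)
        return x

    df = spark.range(100).withColumn("id2", bump("id"))
    print(acc.value)  # still 0: the plan is lazy, no UDF has executed yet
    df.collect()      # an action that materializes id2, running the UDF
    print(acc.value)  # now reflects the rows processed (modulo task retries)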

Accessing the SQL parser

2018-01-11 Thread Abdeali Kothari
I was writing some code to try to automatically find the list of tables and databases being used in a SparkSQL query. Mainly I was looking to auto-check the permissions and owners of all the tables a query will try to access. I was wondering whether PySpark has some method for me to directly use the…
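There is no public parser API in PySpark, but the JVM parser is reachable through py4j internals; a hedged sketch (these are private interfaces and can change between Spark versions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    query = "SELECT a.x FROM db1.tbl_a a JOIN db2.tbl_b b ON a.id = b.id"

    # parsePlan returns an unresolved logical plan; UnresolvedRelation
    # nodes in its string form carry the table (and database) names.
    plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    for line in plan.toString().splitlines():
        if "UnresolvedRelation" in line:
            print(line.strip())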