Re: Pandas UDF cogroup.applyInPandas with multiple dataframes

2023-02-07 Thread Li Jin
I am not a Spark committer and haven't been working on Spark for a while. However, I was heavily involved in the original cogroup work and we are using cogroup functionality pretty heavily and I want to give my two cents here. I think this is a nice improvement and I hope someone from the PySpark

Re: How does PySpark send "import" to the worker when executing Python UDFs?

2022-07-19 Thread Li Jin
Aha I see. Thanks Hyukjin! On Tue, Jul 19, 2022 at 9:09 PM Hyukjin Kwon wrote: > This is done by cloudpickle. They pickle global variables referred within > the func together, and register it to the global imported modules. > > On Wed, 20 Jul 2022 at 00:55, Li Jin wrote: > >

How does PySpark send "import" to the worker when executing Python UDFs?

2022-07-19 Thread Li Jin
Hi, I have a question about how "imports" get sent to the python worker. For example, I have def foo(x): return np.abs(x) If I run this code directly, it obviously fails (because np is undefined on the driver process): sc.parallelize([1, 2, 3]).map(foo).collect() However, if I add
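For context, a minimal sketch of the working variant per the cloudpickle explanation in the reply above (assuming numpy is installed on both the driver and the workers): with a module-level import, np is a global referenced by foo, so cloudpickle ships the reference with the function and the worker re-imports numpy before calling it.

```python
# Minimal sketch (assumes numpy is installed on driver and workers):
# `np` is a global referenced by `foo`, so cloudpickle registers the
# module reference and the Python worker re-imports numpy.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def foo(x):
    return np.abs(x)

print(sc.parallelize([1, -2, 3]).map(foo).collect())  # [1, 2, 3]
```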

Checkpoint and recomputation

2020-01-03 Thread Li Jin
Hi dear devs, I recently came across the checkpoint functionality in Spark and found (a little surprisingly) that checkpoint causes the DataFrame to be computed twice unless cache is called before checkpoint. My guess is that this is probably hard to fix and/or maybe the checkpoint feature is not very
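A small sketch of the cache-before-checkpoint pattern described above (the checkpoint directory path is hypothetical): persisting first means the checkpoint write can read the cached partitions instead of recomputing the lineage a second time.

```python
# Sketch: cache before checkpoint so the plan is not computed twice.
# "/tmp/spark-checkpoints" is a hypothetical local path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df = spark.range(0, 1_000_000)
df = df.cache()        # materialize the result once
df = df.checkpoint()   # the checkpoint write then reads from the cache
print(df.count())
```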

Re: Revisiting Python / pandas UDF (new proposal)

2020-01-02 Thread Li Jin
I am going to review this carefully today. Thanks for the work! Li On Wed, Jan 1, 2020 at 10:34 PM Hyukjin Kwon wrote: > Thanks for comments Maciej - I am addressing them. > adding Li Jin too. > > I plan to proceed this late this week or early next week to make it on > time bef

Debug "Java gateway process exited before sending the driver its port number"

2019-11-27 Thread Li Jin
Dear Spark devs, I am debugging a weird "Java gateway process exited before sending the driver its port number" error when creating a SparkSession with pyspark. I am running the following simple code with pytest: " from pyspark.sql import SparkSession def test_spark(): spark =

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Li Jin
Thanks for the summary! I have a semi-related question - what's the process to propose a feature to be included in the final Spark 3.0 release? In particular, I am interested in https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the work so I want to make sure I don't miss

Re: Welcoming some new committers and PMC members

2019-09-17 Thread Li Jin
Congrats to all! On Tue, Sep 17, 2019 at 6:51 PM Bryan Cutler wrote: > Congratulations, all well deserved! > > On Thu, Sep 12, 2019, 3:32 AM Jacek Laskowski wrote: > >> Hi, >> >> What a great news! Congrats to all awarded and the community for voting >> them in! >> >> p.s. I think it should go

Re: Thoughts on dataframe cogroup?

2019-04-17 Thread Li Jin
try > again now? > > thanks, > > Chris > > > > On Mon, Apr 15, 2019 at 9:49 PM Li Jin wrote: > >> Hi Chris, >> >> Thanks! The permission to the google doc is maybe not set up properly. I >> cannot view the doc by default. >> >> Li >

Re: Thoughts on dataframe cogroup?

2019-04-15 Thread Li Jin
to have a document to comment on. >> >> The main thing that stands out right now is that this is only for PySpark >> and states that it will not be added to the Scala API. Why not make this >> available since most of the work would be done? >> >> On Mon, Apr 1

Re: Thoughts on dataframe cogroup?

2019-04-15 Thread Li Jin
I don't >> think it's necessary to include details of the Python worker, we can hash >> that out after the SPIP is approved. >> >> Bryan >> >> On Mon, Apr 8, 2019 at 10:43 AM Li Jin wrote: >> >>> Thanks Chris, look forward to it. >>>

Re: Thoughts on dataframe cogroup?

2019-04-08 Thread Li Jin
htforward. If anyone has any ideas as to how > this might be achieved in an elegant manner I’d be happy to hear them! > > Thanks, > > Chris > > On 26 Feb 2019, at 14:55, Li Jin wrote: > > Thank you both for the reply. Chris and I have very similar use cases for > cogro

Re: [pyspark] dataframe map_partition

2019-03-08 Thread Li Jin
Hi, Pandas UDF supports struct type as input. However, note that it will be turned into a Python dict because pandas itself does not have a native struct type. On Fri, Mar 8, 2019 at 2:55 PM peng yu wrote: > Yeah, that seems most likely i have wanted, does the scalar Pandas UDF > support input is a

Re: Thoughts on dataframe cogroup?

2019-02-26 Thread Li Jin
[row] based or union > all collect list based). > > I might be biased, but find the approach very useful in project to > simplify and speed up transformations, and remove a lot of intermediate > stages (distinct + join => just cogroup). > > Plus spark 2.4 introduced a lot of

Re: Thoughts on dataframe cogroup?

2019-02-21 Thread Li Jin
I am wondering do other people have opinion/use case on cogroup? On Wed, Feb 20, 2019 at 5:03 PM Li Jin wrote: > Alessandro, > > Thanks for the reply. I assume by "equi-join", you mean "equality full > outer join" . > > Two issues I see with equity

Re: Thoughts on dataframe cogroup?

2019-02-20 Thread Li Jin
"groupby" on such key > 3) finally apply a udaf (you can have a look here if you are not familiar > with them > https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), that > will process each group "in isolation". > > HTH, > Alessandro > > On

Thoughts on dataframe cogroup?

2019-02-19 Thread Li Jin
Hi, We have been using Pyspark's groupby().apply() quite a bit and it has been very helpful in integrating Spark with our existing pandas-heavy libraries. Recently, we have found more and more cases where groupby().apply() is not sufficient - In some cases, we want to group two dataframes by the
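For reference, a hedged sketch of the shape this proposal eventually took in Spark 3.0 (the groupby(...).cogroup(...).applyInPandas API referenced in the 2023 thread at the top of this page); the column names and the summarize function are made up for illustration.

```python
# Hedged sketch (Spark 3.0+ API; columns and logic are illustrative):
# for each key, all rows from both DataFrames are handed to one Python
# function as a pair of pandas DataFrames.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v1"])
df2 = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "v2"])

def summarize(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    key = left["id"].iloc[0] if len(left) else right["id"].iloc[0]
    return pd.DataFrame({"id": [key],
                         "v1_sum": [left["v1"].sum()],
                         "v2_sum": [right["v2"].sum()]})

(df1.groupby("id")
    .cogroup(df2.groupby("id"))
    .applyInPandas(summarize, schema="id long, v1_sum double, v2_sum double")
    .show())
```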

Re: proposal for expanded & consistent timestamp types

2018-12-11 Thread Li Jin
ython? > > thanks, > Imran > > On Fri, Dec 7, 2018 at 1:33 PM Li Jin wrote: > >> Imran, >> >> Thanks for sharing this. When working on interop between Spark and >> Pandas/Arrow in the past, we also faced some issues due to the different >> definition

Re: proposal for expanded & consistent timestamp types

2018-12-07 Thread Li Jin
Imran, Thanks for sharing this. When working on interop between Spark and Pandas/Arrow in the past, we also faced some issues due to the different definitions of timestamp in Spark and Pandas/Arrow, because Spark timestamp has Instant semantics and Pandas/Arrow timestamp has either LocalDateTime

Re: Helper methods for PySpark discussion

2018-10-26 Thread Li Jin
> (2) If the method forces evaluation and this matches the most obvious way it would be implemented, then we should add it with a note in the docstring I am not sure about this because forcing evaluation could be something that has side effects. For example, df.count() can realize a cache and if we implement

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Li Jin
Although I am not specifically involved in DSv2, I think having this kind of meeting is definitely helpful to discuss, move certain effort forward and keep people on the same page. Glad to see this kind of working group happening. On Thu, Oct 25, 2018 at 5:58 PM John Zhuge wrote: > Great idea!

Re: [DISCUSS] PySpark Window UDF

2018-09-05 Thread Li Jin
people think? I'd love to hear the community's feedback. Links: You can reproduce the benchmark with the numpy variant by using the branch: https://github.com/icexelloss/spark/tree/window-udf-numpy PR link: https://github.com/apache/spark/pull/22305 On Wed, May 16, 2018 at 3:34 PM Li Jin wrote: > Hi

Re: [DISCUSS] move away from python doctests

2018-08-29 Thread Li Jin
Hi Imran, My understanding is that doctests and unittests are orthogonal - doctests are used to make sure docstring examples are correct and are not meant to replace unittests. Functionalities are covered by unit tests to ensure correctness and doctests are used to test the docstring, not the

Re: [discuss][minor] impending python 3.x jenkins upgrade... 3.5.x? 3.6.x?

2018-08-20 Thread Li Jin
Thanks for looking into this Shane. If we can only have a single python 3 version, I agree 3.6 would be better than 3.5. Otherwise, ideally I think it would be nice to test all supported 3.x versions (latest micros should be fine). On Mon, Aug 20, 2018 at 7:07 PM shane knapp wrote: > initially,

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Li Jin
I agree with Bryan. If it's acceptable to have another job to test with Python 3.5 and pyarrow 0.10.0, I am leaning towards upgrading Arrow. Arrow 0.10.0 has tons of bug fixes and improvements over 0.8.0, including important memory leak fixes such as https://issues.apache.org/jira/browse/ARROW-1973.

Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Li Jin
Hi Linar, This seems useful. But perhaps reusing the same function name is better? http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame Currently, createDataFrame takes an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean,
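As a point of comparison, a small sketch of what this use case looks like with the existing API only: converting each pandas DataFrame on the driver and unioning the results. The proposal in this thread is about doing that conversion in a distributed way with Arrow; the data below is illustrative.

```python
# Baseline sketch with the existing API only: convert each pandas
# DataFrame on the driver, then union the resulting Spark DataFrames.
from functools import reduce
import pandas as pd
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

pdfs = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})]
sdf = reduce(DataFrame.union, (spark.createDataFrame(p) for p in pdfs))
sdf.show()
```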

Re: [SPARK-24579] SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-06-29 Thread Li Jin
Hi Xiangrui, Thanks for sending this out. I have left some comments on the google doc: https://docs.google.com/document/d/1dFOFV3LOu6deSNd8Ndp87-wsdxQqA9cnkuj35jUlrmQ/edit#heading=h.84jotgsrp6bj Look forward to your response. Li On Mon, Jun 18, 2018 at 11:33 AM, Xiangrui Meng wrote: > Hi

Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Li Jin
uld do the > code review more carefully. > > Xiao > > 2018-06-14 9:18 GMT-07:00 Li Jin : > >> Are there objection to restore the behavior for PySpark users? I am happy >> to submit a patch. >> >> On Thu, Jun 14, 2018 at 12:15 PM Reynold Xin wrote: >>

Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Li Jin
Are there objection to restore the behavior for PySpark users? I am happy to submit a patch. On Thu, Jun 14, 2018 at 12:15 PM Reynold Xin wrote: > The behavior change is not good... > > On Thu, Jun 14, 2018 at 9:05 AM Li Jin wrote: > >> Ah, looks like it's this change: >&g

Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Li Jin
> would have been true for a while now. > > > On Thu, Jun 14, 2018 at 10:38 AM Li Jin wrote: > >> Hey all, >> >> I just did a clean checkout of github.com/apache/spark but failed to >> start PySpark, this is what I did: >> >> git clone g...@github.com:a

Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Li Jin
I can work around it by using: bin/pyspark --conf spark.sql.catalogImplementation=in-memory for now, but I still wonder what's going on with HiveConf... On Thu, Jun 14, 2018 at 11:37 AM, Li Jin wrote: > Hey all, > > I just did a clean checkout of github.com/apache/spark but failed to > s

Missing HiveConf when starting PySpark from head

2018-06-14 Thread Li Jin
Hey all, I just did a clean checkout of github.com/apache/spark but failed to start PySpark, this is what I did: git clone g...@github.com:apache/spark.git; cd spark; build/sbt package; bin/pyspark And got this exception: (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark Python 3.6.3

Re: Optimizer rule ConvertToLocalRelation causes expressions to be eager-evaluated in Planning phase

2018-06-08 Thread Li Jin
n Fri, Jun 8, 2018 at 3:57 PM, Herman van Hövell tot Westerflier < her...@databricks.com> wrote: > But that is still cheaper than executing that expensive UDF for each row > in your dataset right? > > On Fri, Jun 8, 2018 at 9:51 PM Li Jin wrote: > >> I see. Thanks fo

Re: Optimizer rule ConvertToLocalRelation causes expressions to be eager-evaluated in Planning phase

2018-06-08 Thread Li Jin
run right? So it is > still lazy. > > > On Fri, Jun 8, 2018 at 12:35 PM Li Jin wrote: > >> Hi All, >> >> Sorry for the long email title. I am a bit surprised to find that the >> current optimizer rule "ConvertToLocalRelation" causes expressions

Optimizer rule ConvertToLocalRelation causes expressions to be eager-evaluated in Planning phase

2018-06-08 Thread Li Jin
Hi All, Sorry for the long email title. I am a bit surprised to find that the current optimizer rule "ConvertToLocalRelation" causes expressions to be eagerly evaluated in the planning phase; this can be demonstrated with the following code: scala> val myUDF = udf((x: String) => { println("UDF

Re: MatrixUDT and VectorUDT in Spark ML

2018-05-31 Thread Li Jin
@aexp.com.invalid> wrote: > >> I agree >> Thanks >> Himanshu >> *From:* Li Jin [mailto:ice.xell...@gmail.com] >> *Sent:* Friday, March 23, 2018 8:24 PM >>

Re: [VOTE] Spark 2.3.1 (RC2)

2018-05-24 Thread Li Jin
I'd like to bring https://issues.apache.org/jira/browse/SPARK-24373 to people's attention cause it could be a regression from 2.2. I will leave it to more experienced people to decide whether this should be a blocker or not. On Wed, May 23, 2018 at 12:54 PM, Marcelo Vanzin

[DISCUSS] PySpark Window UDF

2018-05-16 Thread Li Jin
Hi All, I have been looking into leveraging the Arrow and Pandas UDF work we have done so far for Window UDF in PySpark. I have done some investigation and believe there is a way to do PySpark window UDFs efficiently. The basic idea is that instead of passing each window to Python separately, we can
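For orientation, a hedged sketch of the user-facing shape of a Pandas UDF applied over a window (a grouped-aggregate pandas UDF used with .over(...), in Spark 3.x-style syntax); this shows usage only, not the worker-side mechanism the proposal is about, and the data and names are illustrative.

```python
# Usage-side sketch only (Spark 3.x-style): a grouped-aggregate pandas UDF
# applied over a window; this thread is about how windows are shipped to
# the Python worker, not this surface API.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])

@pandas_udf("double")
def pandas_mean(v: pd.Series) -> float:   # Series -> scalar: grouped-aggregate UDF
    return v.mean()

w = Window.partitionBy("id")
df.withColumn("mean_v", pandas_mean("v").over(w)).show()
```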

Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Li Jin
ROW. >>>> >>>> When both ORDER BY and WINDOW clauses are missing, the WINDOW >>>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND >>>>> UNBOUNDED FOLLOWING. >>>> >>>> >>>> It sort of makes sense

Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Li Jin
PM, Reynold Xin <r...@databricks.com> wrote: > Seems like a bug. > > > > On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xell...@gmail.com> wrote: > >> Hi Devs, >> >> I am seeing some behavior with window functions that is a bit unintuitive >> and

Clarify window behavior in Spark SQL

2018-04-03 Thread Li Jin
Hi Devs, I am seeing some behavior with window functions that is a bit unintuitive and would like to get some clarification. When using an aggregation function with a window, the frame boundary seems to change depending on the ordering of the window. Example: (1) df = spark.createDataFrame([[0, 1],
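A short sketch of the behavior being asked about, following the defaults quoted in the reply above: the same aggregate gives a per-partition total when the window has no ORDER BY and a running total when it does, because the default frame changes.

```python
# Sketch of the observation: the default frame is unbounded-to-unbounded
# without ORDER BY, and unbounded-preceding-to-current-row with ORDER BY,
# so sum() behaves as a total vs. a running sum.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]], ["id", "v"])

w_plain = Window.partitionBy("id")
w_ordered = Window.partitionBy("id").orderBy("v")

df.select("id", "v",
          F.sum("v").over(w_plain).alias("total"),      # 6, 6, 6
          F.sum("v").over(w_ordered).alias("running")   # 1, 3, 6
          ).show()
```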

MatrixUDT and VectorUDT in Spark ML

2018-03-23 Thread Li Jin
Hi All, I came across these two types MatrixUDT and VectorUDT in Spark ML when doing feature extraction and preprocessing with PySpark. However, when trying to do some basic operations, such as vector multiplication and matrix multiplication, I had to go down to a Python UDF. It seems to be it
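A hedged sketch of the kind of Python-UDF fallback described above: a vector operation (here a dot product) done by unpacking MLlib vectors in a plain Python UDF, since there are no built-in column expressions for these types. The data and function name are illustrative.

```python
# Sketch of the Python-UDF fallback: vector multiplication (dot product)
# in a plain Python UDF, because there is no built-in expression for
# arithmetic on VectorUDT columns.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]), Vectors.dense([3.0, 4.0]))], ["u", "v"])

dot = udf(lambda u, v: float(u.dot(v)), DoubleType())
df.select(dot("u", "v").alias("u_dot_v")).show()   # 11.0
```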

Re: Time Series Functionality with Spark

2018-03-12 Thread Li Jin
Thanks for the pointer! On Mon, Mar 12, 2018 at 1:40 PM, Sean Owen <sro...@gmail.com> wrote: > (There was also https://github.com/sryza/spark-timeseries -- might be > another point of reference for you.) > > On Mon, Mar 12, 2018 at 10:33 AM Li Jin <ice.xell...@gmail.co

Time Series Functionality with Spark

2018-03-12 Thread Li Jin
Hi All, This is Li Jin. We (my colleagues at Two Sigma and I) have been using Spark for time series analysis for the past two years, and it has been a success in scaling up our time series analysis. Recently, we started a conversation with Reynold about potential opportunities to collaborate

Re: Welcoming some new committers

2018-03-02 Thread Li Jin
Congrats! On Fri, Mar 2, 2018 at 5:49 PM Holden Karau wrote: > Congratulations and welcome everyone! So excited to see the project grow > our committer base. > > On Mar 2, 2018 2:42 PM, "Reynold Xin" wrote: > >> Congrats and welcome! >> >> >> On Fri,

[SPIP] as-of join in Spark SQL

2018-01-03 Thread Li Jin
Hi community, Following instruction on https://spark.apache.org/improvement-proposals.html, I'd like to propose a SPIP: as-of join in Spark SQL. Here is the Jira: https://issues.apache.org/jira/browse/SPARK-22947 If you are interested, please take a look and let me know what you think. I am

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Li Jin
Sorry, s/ordered distributed/ordered distribution/g On Mon, Dec 4, 2017 at 10:37 AM, Li Jin <ice.xell...@gmail.com> wrote: > Just to give another data point: most of the data we use with Spark are > sorted on disk, having a way to allow data source to pass ordered > distribute

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Li Jin
Just to give another data point: most of the data we use with Spark are sorted on disk, having a way to allow data source to pass ordered distributed to DataFrames is really useful for us. On Mon, Dec 4, 2017 at 9:12 AM, Николай Ижиков wrote: > Hello, guys. > > Thank you

Re: [discuss][PySpark] Can we drop support old Pandas (<0.19.2) or what version should we support?

2017-11-14 Thread Li Jin
I think this makes sense. The PySpark/Pandas interops in 2.3 are new anyway; I don't think we need to support the new functionality with older versions of pandas (Takuya's reason 3). One thing I am not sure about is how complicated it is to support pandas < 0.19.2 with the old non-Arrow interops and require