Re: Revisiting Python / pandas UDF (new proposal)
Hi all,

I made a PR: https://github.com/apache/spark/pull/27165 Please have a look when you find some time. I addressed another point (by Maciej), "A couple of less-intuitive pandas UDF types", together with it, because the more I looked, the more I felt it should be dealt with together with the proposal.

On Mon, Jan 6, 2020 at 10:52 PM, Hyukjin Kwon wrote:
> I happened to propose a somewhat big refactoring PR as a preparation for this.
> Basically, it groups all related code into one sub-package, since currently all the
> pandas and PyArrow related code is scattered here and there.
> I would appreciate it if you could review and give some feedback.
>
> https://github.com/apache/spark/pull/27109
>
> Thanks!
>
> On Sat, Jan 4, 2020 at 5:11 AM, Li Jin wrote:
>> Hyukjin,
>>
>> Thanks for putting this together. I took a look at the proposal and left
>> some comments. At a high level I like using type hints to specify
>> input/output types, but I am not so sure about using type hints for cardinality.
>> I have commented on more details in the doc.
>>
>> Li
>>
>> On Thu, Jan 2, 2020 at 9:42 AM Li Jin wrote:
>>
>>> I am going to review this carefully today. Thanks for the work!
>>>
>>> Li
>>>
>>> On Wed, Jan 1, 2020 at 10:34 PM Hyukjin Kwon wrote:
>>>
>>>> Thanks for the comments, Maciej; I am addressing them. Adding Li Jin too.
>>>> I plan to proceed with this late this week or early next week to make it
>>>> in time before the code freeze. I am going to respond pretty actively, so
>>>> please give feedback if there's any :-).
>>>>
>>>> On Mon, Dec 30, 2019 at 6:45 PM, Hyukjin Kwon wrote:
>>>>> Hi all,
>>>>>
>>>>> I happened to come up with another idea about the pandas UDF redesign.
>>>>> Thanks Reynold, Bryan, Xiangrui, Takuya and Tim for the offline
>>>>> discussions and for helping me write this proposal.
>>>>>
>>>>> Please take a look and let me know what you think.
>>>>>
>>>>> - https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing
>>>>> - https://issues.apache.org/jira/browse/SPARK-28264
>>>>>
>>>>> I know it's the holiday season, but please take some time to have a look
>>>>> so we can make it in time before the code freeze (Jan 31st).
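For readers following along without the design doc open: the core idea discussed above is inferring the pandas UDF variant from Python type hints rather than from an explicit function-type argument. A minimal, stdlib-only sketch of that inference is below; the class names, variant names, and mapping are illustrative stand-ins, not Spark's actual API.

```python
import typing
import collections.abc

# Toy stand-ins for pandas Series/DataFrame so this sketch is self-contained;
# real code would use pandas.Series / pandas.DataFrame.
class Series: ...
class DataFrame: ...

def infer_udf_kind(func) -> str:
    """Infer a pandas-UDF variant from the function's type hints
    (hypothetical mapping mirroring the proposal's idea)."""
    hints = typing.get_type_hints(func)
    ret = hints.pop("return", None)
    args = list(hints.values())
    if args == [Series] and ret is Series:
        return "SERIES_TO_SERIES"        # roughly the old SCALAR
    if len(args) == 1 and typing.get_origin(args[0]) is collections.abc.Iterator:
        return "ITERATOR_OF_SERIES"      # roughly the old SCALAR_ITER
    if args == [DataFrame] and ret is DataFrame:
        return "DATAFRAME_TO_DATAFRAME"  # roughly the old GROUPED_MAP
    return "UNKNOWN"

def plus_one(s: Series) -> Series:
    return s  # placeholder body

def batched(it: typing.Iterator[Series]) -> typing.Iterator[Series]:
    return it  # placeholder body

assert infer_udf_kind(plus_one) == "SERIES_TO_SERIES"
assert infer_udf_kind(batched) == "ITERATOR_OF_SERIES"
```

The point of the hint-based approach is that the signature itself documents the cardinality and shape of the exchange, instead of a separate enum the user must look up.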
[DISCUSS] Support year-month and day-time Intervals
Hi, Devs

I’d like to propose adding two new interval types, year-month and day-time intervals, for better ANSI support and future improvements. We will keep the current CalendarIntervalType but mark it as deprecated until we find the right time to remove it completely. Backward compatibility with the old interval type usage in 2.4 will be guaranteed.

Here is the design doc:

[SPIP] Support Year-Month and Day-Time Intervals - https://docs.google.com/document/d/1JNRzcBk4hcm7k2cOXSG1A9U9QM2iNGQzBSXZzScUwAU/edit?usp=sharing

All comments are welcome!

Thanks,

Kent Yao

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
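As a rough illustration of why the ANSI split helps (this is a sketch, not Spark's actual implementation, and the names are hypothetical): a year-month interval reduces to a single month count, and a day-time interval to a single microsecond count, so each type is totally ordered and comparable, unlike a mixed months/days/microseconds calendar interval.

```python
from dataclasses import dataclass

# Hypothetical model: each ANSI interval type collapses to one integer,
# so ordinary comparison and a canonical form come for free.
@dataclass(frozen=True, order=True)
class YearMonthInterval:
    months: int  # years * 12 + months

@dataclass(frozen=True, order=True)
class DayTimeInterval:
    microseconds: int  # days/hours/minutes/seconds collapsed to micros

def make_year_month(years: int, months: int) -> YearMonthInterval:
    return YearMonthInterval(years * 12 + months)

def make_day_time(days: int, hours: int = 0, minutes: int = 0,
                  seconds: int = 0) -> DayTimeInterval:
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return DayTimeInterval(total_seconds * 1_000_000)

assert make_year_month(1, 2) == YearMonthInterval(14)
assert make_year_month(0, 14) == make_year_month(1, 2)  # canonical form
assert make_day_time(1) < make_day_time(1, 1)           # totally ordered
```

A mixed calendar interval such as "1 month 30 days" cannot be ordered against "31 days" without knowing a reference date, which is exactly what the two-type split avoids.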
Re: [DISCUSS] Support year-month and day-time Intervals
Hi, Kent.

Thank you for the proposal.

Does your proposal need to revert something from the master branch?
I'm just asking because it's not clear in the proposal document.

Bests,
Dongjoon.

On Fri, Jan 10, 2020 at 5:31 AM Dr. Kent Yao wrote:
> Hi, Devs
>
> I’d like to propose adding two new interval types, year-month and day-time
> intervals, for better ANSI support and future improvements. We will keep the
> current CalendarIntervalType but mark it as deprecated until we find the
> right time to remove it completely.
> [...]
Re: [DISCUSS] Support year-month and day-time Intervals
Hi Dongjoon,

Yes. As we want to make CalendarIntervalType deprecated, so far we have found:
1. The make_interval function, which produces legacy CalendarIntervalType values
2. The `interval` -> CalendarIntervalType support in the parser

Thanks

Kent Yao
Data Science Center, Hangzhou Research Institute, Netease Corp.
PHONE: (86) 186-5715-3499
EMAIL: hzyao...@corp.netease.com

On 01/11/2020 01:57, Dongjoon Hyun wrote:
> Hi, Kent.
>
> Thank you for the proposal.
>
> Does your proposal need to revert something from the master branch?
> I'm just asking because it's not clear in the proposal document.
>
> Bests,
> Dongjoon.
> [...]
Re: [DISCUSS] Support year-month and day-time Intervals
Thank you for the clarification.

Bests,
Dongjoon.

On Fri, Jan 10, 2020 at 10:07 AM Kent Yao wrote:
> Hi Dongjoon,
>
> Yes. As we want to make CalendarIntervalType deprecated, so far we have found:
> 1. The make_interval function, which produces legacy CalendarIntervalType values
> 2. The `interval` -> CalendarIntervalType support in the parser
>
> Thanks
>
> Kent Yao
> [...]
Re: [DISCUSS] Support year-month and day-time Intervals
Introducing a new data type has high overhead, both in terms of internal complexity and users' cognitive load. Introducing two data types would have even higher overhead.

I looked quickly, and it seems both Redshift and Snowflake, two of the most recent SQL analytics successes, have only one interval type and don't support storing it. That gets me thinking that, in reality, storing an interval type is not that useful. Do we really need to do this?

One of the worst things we can do as a community is to introduce features that are almost never used but at the same time have high internal complexity for maintenance.

On Fri, Jan 10, 2020 at 10:45 AM, Dongjoon Hyun wrote:
> Thank you for the clarification.
>
> Bests,
> Dongjoon.
>
> On Fri, Jan 10, 2020 at 10:07 AM Kent Yao wrote:
>> Hi Dongjoon,
>>
>> Yes. As we want to make CalendarIntervalType deprecated, so far we have found:
>> 1. The make_interval function, which produces legacy CalendarIntervalType values
>> 2. The `interval` -> CalendarIntervalType support in the parser
>>
>> Thanks
>>
>> Kent Yao
>> [...]
Build error: python/lib/pyspark.zip is not a ZIP archive
Greetings,

I'm getting an error when building on latest master (2bd873181 as of this writing). The full build command I'm running is: ./build/mvn -DskipTests clean package

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run (create-tmp-dir) on project spark-assembly_2.12: An Ant BuildException has occured: Problem reading /Users/jeff/dev/spark/python/lib/pyspark.zip
[ERROR] around Ant part .. @ 6:76 in /Users/jeff/dev/spark/assembly/target/antrun/build-main.xml: archive is not a ZIP archive
[ERROR] -> [Help 1]

Trying to run unzip -l python/lib/pyspark.zip does seem to suggest it's not a valid zip file. Any ideas what might be wrong? I tried searching the archives and didn't see anything relevant. Thanks.

- macOS Catalina 10.15.2
- OpenJDK 1.8.0_212
- Maven 3.6.3
- Python 3.8.1 (via pyenv)
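For anyone hitting the same thing: the integrity property the Ant task is complaining about can be confirmed from Python's stdlib, without unzip. A small sketch (the demo runs against a throwaway file; in a Spark checkout you would point it at python/lib/pyspark.zip):

```python
import os
import tempfile
import zipfile

def check_zip(path: str) -> bool:
    """Return True if `path` looks like a readable ZIP archive --
    the same property the Ant task's "not a ZIP archive" error tests."""
    return zipfile.is_zipfile(path)

# Demo on a throwaway directory: stray bytes are not a valid archive,
# while a freshly written archive is.
with tempfile.TemporaryDirectory() as d:
    bad = os.path.join(d, "pyspark.zip")
    with open(bad, "wb") as f:
        f.write(b"not a zip")
    assert not check_zip(bad)

    good = os.path.join(d, "ok.zip")
    with zipfile.ZipFile(good, "w") as z:
        z.writestr("hello.txt", "hi")
    assert check_zip(good)
```

If check_zip returns False for python/lib/pyspark.zip, the file is corrupt rather than merely stale; since the build regenerates it, deleting it and rebuilding is a safe recovery.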
Re: Build error: python/lib/pyspark.zip is not a ZIP archive
Sounds like you might have some corrupted file locally. I don't see any of the automated test builders failing. Nuke your local assembly build and try again?

On Fri, Jan 10, 2020 at 3:49 PM Jeff Evans wrote:
>
> Greetings,
>
> I'm getting an error when building on latest master (2bd873181 as of this
> writing). The full build command I'm running is: ./build/mvn -DskipTests clean package
>
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run
> (create-tmp-dir) on project spark-assembly_2.12: An Ant BuildException has occured:
> Problem reading /Users/jeff/dev/spark/python/lib/pyspark.zip
> [ERROR] -> [Help 1]
>
> Trying to run unzip -l python/lib/pyspark.zip does seem to suggest it's not a
> valid zip file. Any ideas what might be wrong? I tried searching the
> archives and didn't see anything relevant. Thanks.
> [...]
Re: Build error: python/lib/pyspark.zip is not a ZIP archive
Thanks for the tip. Fixed by simply removing python/lib/pyspark.zip (since it's apparently generated) and rebuilding. I guess clean does not remove it.

On Fri, Jan 10, 2020 at 3:50 PM Sean Owen wrote:
> Sounds like you might have some corrupted file locally. I don't see
> any of the automated test builders failing. Nuke your local assembly
> build and try again?
> [...]
Re: Build error: python/lib/pyspark.zip is not a ZIP archive
Actually, there is a really trivial fix for that (an existing file not being deleted when packaging). Opened SPARK-30489 for it.

On Fri, Jan 10, 2020 at 3:52 PM Jeff Evans wrote:
> Thanks for the tip. Fixed by simply removing python/lib/pyspark.zip
> (since it's apparently generated) and rebuilding. I guess clean does
> not remove it.
> [...]