Re: [VOTE] Release Spark 3.4.1 (RC1)

2023-06-22 Thread Driesprong, Fokko
Thank you for running the release, Dongjoon. +1 Tested against Iceberg and it looks good. On Thu, 22 Jun 2023 at 18:03, yangjie01 wrote: > +1 > > > > *From:* Dongjoon Hyun > *Date:* Thursday, 22 June 2023, 23:35 > *To:* Chao Sun > *Cc:* Yuming Wang, Jacek Laskowski, > dev > *Subject:* Re: [VOTE]

Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Driesprong, Fokko
Well deserved all! Welcome! On Fri, 26 Mar 2021 at 21:21, Matei Zaharia wrote: > Hi all, > > The Spark PMC recently voted to add several new committers. Please join me > in welcoming them to their new role! Our new committers are: > > - Maciej Szymkiewicz (contributor to PySpark) > - Max Gekk

Re: I'm going to be out starting Nov 5th

2020-11-01 Thread Driesprong, Fokko
Hope everything goes well, and see you soon Holden! Take care and stay strong! Cheers, Fokko On Sun, 1 Nov 2020 at 18:09, rahul kumar wrote: > > Come back strong and healthy Holden! > On Sun, Nov 1, 2020 at 9:01 AM Holden Karau wrote: > >> Thanks everyone, these kind words mean a lot :) I

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Driesprong, Fokko
; at least rudimentary relationships between inputs and outputs (i.e. udfs). > > Just saying... > > > > On 8/27/20 6:09 PM, Driesprong, Fokko wrote: > > Also, it is very cumbersome to add everything to the pyi file. In > practice, this means copying the method definition from the py

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Driesprong, Fokko
tests and/or > documentation works. So I filed an umbrella JIRA for that SPARK-32681 > <https://issues.apache.org/jira/browse/SPARK-32681>. > If there's no objections in this direction, I think hopefully we can > start. Let me know if you guys have thoughts on this. > > Thanks! > >

Re: [PySpark] Revisiting PySpark type annotations

2020-08-20 Thread Driesprong, Fokko
No worries, thanks for the update! On Thu, 20 Aug 2020 at 12:50, Hyukjin Kwon wrote: > Yeah, we had a short meeting. I had to check a few other things so some > delays happened. I will share soon. > > On Thu, 20 Aug 2020 at 7:14 PM, Driesprong, Fokko wrote: > >> Hi Maciej, Hyukji

Re: [PySpark] Revisiting PySpark type annotations

2020-08-20 Thread Driesprong, Fokko
Hi Maciej, Hyukjin, Did you find any time to discuss adding the types to the Python repository? Would love to know what came out of it. Cheers, Fokko On Wed, 5 Aug 2020 at 10:14, Driesprong, Fokko wrote: > Mostly echoing stuff that we've discussed in > https://github.com/apache/spar

Allow average out of a Date

2020-08-19 Thread Driesprong, Fokko
Hi all, Personally, I'm a big fan of the .summary() function to compute statistics of a dataframe. I often use this for debugging pipelines, and check what the impact of the RDD is after changing code. I've noticed that not all datatypes are in this summary. Currently, there is a list

Re: [PySpark] Revisiting PySpark type annotations

2020-08-05 Thread Driesprong, Fokko
ix Cheung wrote: >> >> So IMO maintaining outside in a separate repo is going to be harder. That >> was why I asked. >> >> >> >> -- >> *From:* Maciej Szymkiewicz >> >> *Sent:* Tuesday, August 4, 2020 12:59 PM >>

Re: [PySpark] Revisiting PySpark type annotations

2020-08-03 Thread Driesprong, Fokko
ink about porting the files as are. > I will try to have a short sync with the author Maciej, and share what we > discussed offline. > > > On Wed, 22 Jul 2020 at 10:43 PM, Maciej Szymkiewicz > wrote: > >> >> >> On Wednesday, 22 July 2020, Driesprong, Fokko >> n

Re: Python xmlrunner being used?

2020-07-24 Thread Driesprong, Fokko
I found this ticket: https://issues.apache.org/jira/browse/SPARK-7021 Is anybody actually using this? Cheers, Fokko On Fri, 24 Jul 2020 at 16:27, Driesprong, Fokko wrote: > Hi all, > > Does anyone know if the xmlrunner package is still being used? > > We're working on enforc

Python xmlrunner being used?

2020-07-24 Thread Driesprong, Fokko
Hi all, Does anyone know if the xmlrunner package is still being used? We're working on enforcing some static code analysis checks on the Python codebase, and the xmlrunner imports generate quite some noise: https://github.com/apache/spark/pull/29121 It looks like the entry point for a

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Driesprong, Fokko
That's probably one-time overhead so it is not a big issue. In my opinion, a bigger one is possible complexity. Annotations tend to introduce a lot of cyclic dependencies in the Spark codebase. This can be addressed, but doesn't look great. This is not true (anymore). With Python 3.6 you can add

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Driesprong, Fokko
020 at 12:33 PM Driesprong, Fokko > wrote: > >> Since we've recently dropped support for Python <=3.5 >> <https://github.com/apache/spark/pull/28957>, I think it would be nice >> to add support for type annotations. Having this in the main repository >>

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Driesprong, Fokko
Since we've recently dropped support for Python <=3.5, I think it would be nice to add support for type annotations. Having this in the main repository allows us to do type checking using MyPy in the CI itself.
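
For readers unfamiliar with the idea, here is a minimal sketch of what inline type annotations look like and what a checker such as mypy can catch. The function names (`count_words`, `most_common`) are hypothetical and are not taken from the PySpark codebase; the sketch only assumes Python 3.6 or newer.

```python
from typing import Dict, List, Optional


def count_words(lines: List[str], min_length: int = 1) -> Dict[str, int]:
    """Count how often each word of at least `min_length` characters occurs."""
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            if len(word) >= min_length:
                counts[word] = counts.get(word, 0) + 1
    return counts


def most_common(counts: Dict[str, int]) -> Optional[str]:
    """Return the most frequent word, or None for an empty mapping."""
    if not counts:
        return None
    return max(counts, key=lambda word: counts[word])


# A type checker such as mypy would flag the call below without running it,
# because a str is passed where List[str] is expected:
#     count_words("not a list")
```

Running a checker over such a file in CI reports mismatched argument and return types before any test executes, which is the benefit the message refers to.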

Re: Use /usr/bin/env python3 in scripts?

2020-07-17 Thread Driesprong, Fokko
+1 I'm in favor of using python3 Cheers, Fokko On Fri, 17 Jul 2020 at 19:49, Sean Owen wrote: > Yeah I figured it's a best practice, so I'll raise a PR unless > somebody tells me not to. This is about build scripts, not Pyspark > itself, and half the scripts already specify python3. > > On

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Driesprong, Fokko
Welcome! On Tue, 14 Jul 2020 at 19:53, shane knapp ☠ wrote: > welcome, all! > > On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia > wrote: > >> Hi all, >> >> The Spark PMC recently voted to add several new committers. Please join >> me in welcoming them to their new roles! The new committers are:

Re: [VOTE] Decommissioning SPIP

2020-07-03 Thread Driesprong, Fokko
+1 (non-binding) Cheers, Fokko On Fri, 3 Jul 2020 at 09:16, Xin Jinhan <18183124...@163.com> wrote: > +1 > this really makes sense!! > > Regards, > Jinhan > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > >

Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Driesprong, Fokko
Hi Bing, Good question, and the answer is: it depends on what your use case is. If you really just want to write raw bytes, then you could create a .foreach where you open an OutputStream and write it to some file. But this is probably not what you want, and in practice not very handy since you

Re:

2019-12-27 Thread Driesprong, Fokko
Does anyone have an opinion on this? A link to the PR: https://github.com/apache/spark/pull/26644 Cheers, Fokko On Fri, 20 Dec 2019 at 16:00, Driesprong, Fokko wrote: > Folks, > > I opened a PR a while ago that adds the possibility to merge > a custom data type into a nati

[no subject]

2019-12-20 Thread Driesprong, Fokko
Folks, I opened a PR a while ago that adds the possibility to merge a custom data type into a native data type. This is something new because of the introduction of Delta. For some background: I have a DataSet that has fields of the type XMLGregorianCalendarType. I don't

Re: [DISCUSS] PostgreSQL dialect

2019-12-01 Thread Driesprong, Fokko
+1 (non-binding) Cheers, Fokko On Thu, 28 Nov 2019 at 03:47, Dongjoon Hyun wrote: > +1 > > Bests, > Dongjoon. > > On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro > wrote: > >> Yea, +1, that looks pretty reasonable to me. >> > Here I'm proposing to hold off the PostgreSQL dialect. Let's

Re: override collect_list

2019-12-01 Thread Driesprong, Fokko
Hi Abhnav, this sounds to me like a bad design, since it isn't scalable. Would it be possible to store all the data in a database like HBase/Bigtable/Cassandra? This would allow you to write the data from all the workers in parallel to the database. Cheers, Fokko On Wed, 27 Nov 2019 at 06:58

Re: Thoughts on Spark 3 release, or a preview release

2019-09-13 Thread Driesprong, Fokko
Michael Heuer, that's an interesting issue. 1.8.2 to 1.9.0 is almost binary compatible (94%): http://people.apache.org/~busbey/avro/1.9.0-RC4/1.8.2_to_1.9.0RC4_compat_report.html. Most of the stuff is removing the Jackson and Netty API from Avro's public API and deprecating the Joda library. I

Re: Welcoming some new committers and PMC members

2019-09-10 Thread Driesprong, Fokko
Congrats all, well deserved! Cheers, Fokko On Tue, 10 Sep 2019 at 10:21, Gabor Somogyi wrote: > Congrats Guys! > > G > > > On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia > wrote: > >> Hi all, >> >> The Spark PMC recently voted to add several new committers and one PMC >> member. Join me in

Re: [DISCUSS] Migrate development scripts under dev/ from Python2 to Python 3

2019-08-15 Thread Driesprong, Fokko
Sorry for the late reply, was a bit busy lately, but I still would like to share my thoughts on this. For Apache Airflow we're dropping support for Python 2 in the next major release. We're now supporting Python 3.5+. Mostly because: - Easier to maintain and test, and less if/else

Re: Jackson version updation

2019-06-28 Thread Driesprong, Fokko
The PR that bumps Jackson to 2.9.6 gives some examples of the behavioral changes that Sean is referring to: https://github.com/apache/spark/pull/21596 Cheers, Fokko Driesprong On Fri, 28 Jun 2019 at 14:13, Sean Owen wrote: > https://github.com/apache/spark/blob/branch-2.4/pom.xml#L161 >

Re: Spark 2.4.2

2019-04-19 Thread Driesprong, Fokko
For me a +1 on upgrading Jackson as well. This has been long overdue. There are some behavioural changes regarding handling null/None. This is also described in the PR: https://github.com/apache/spark/pull/21596 Also it has a positive impact on the performance. Cheers, Fokko On Fri, 19 Apr 2019

Re: [Events] Events not fired for SaveAsTextFile (?)

2018-10-15 Thread Driesprong, Fokko
Hi Bolke, I would argue that Spark is not the right level of abstraction for doing this. I would create a wrapper around the particular filesystem: http://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html Therefore you can write a wrapper around the LocalFileSystem if data

Re: welcome a new batch of committers

2018-10-04 Thread Driesprong, Fokko
Congratulations all! On Wed, 3 Oct 2018 at 23:03, Bryan Cutler wrote: > Congratulations everyone! Very well deserved!! > > On Wed, Oct 3, 2018, 1:59 AM Reynold Xin wrote: > >> Hi all, >> >> The Apache Spark PMC has recently voted to add several new committers to >> the project, for their

Re: Spark data quality bug when reading parquet files from hive metastore

2018-08-24 Thread Driesprong, Fokko
Hi Andrew, This blog post gives an idea of how the schema is resolved: https://blog.godatadriven.com/multiformat-spark-partition There is some optimisation going on when reading Parquet using Spark. Hope this helps. Cheers, Fokko On Wed, 22 Aug 2018 at 23:59, t4 wrote: >

Re: New to dev community | Contribution to Mlib

2017-09-22 Thread Driesprong, Fokko
Hi Venna, Sounds like a very interesting algorithm. I have to agree with Seth: in the end you don't want to add a lot of algorithms to Spark itself, as it will blow up the codebase and the tests will run forever. You can also consider publishing it to the Spark Packages website. I've also

Re: Scala 2.11 default build

2016-01-30 Thread Driesprong, Fokko
Nice, good work! I've been using a Docker container to compile against 2.11: https://github.com/fokko/docker-spark Cheers, Fokko 2016-01-30 9:22 GMT+01:00 Reynold Xin : > FYI - I just merged Josh's pull request to switch to Scala 2.11 as the > default build. > >

Re: How Spark utilize low-level architecture features?

2016-01-21 Thread Driesprong, Fokko
Hi Boric, The Spark MLlib package is built on top of Breeze, which in turn uses netlib-java. This netlib-java library can be optimized for each system by compiling it for the specific architecture: *To get optimal

Optimized toIndexedRowMatrix

2016-01-20 Thread Driesprong, Fokko
Hi guys, I've been working on an optimized implementation of the toIndexedRowMatrix of the BlockMatrix. I already created a ticket