Re: Data Engineering Track at ApacheCon (October 3-6, New Orleans) - CFP ends 23/05

2022-05-17 Thread Ismaël Mejía
Hello Pasha,

This is not only for Apache project maintainers. If you contribute to or
maintain another tool that integrates with an existing Apache project to
do or improve common data engineering tasks, it can definitely fit.

Regards,
Ismaël

On Tue, May 17, 2022 at 11:23 PM Pasha Finkelshtein wrote:
>
> Hi Ismaël,
>
> Thank you, it's interesting. Is this message relevant only to
> maintainers/contributors of top-level Apache projects, or does it apply to other
> maintainers of Apache-licensed software too?
>
> Regards,
> Pasha
>
> Wed, May 18, 2022, 00:05 Ismaël Mejía :
>>
>> Hello,
>>
>> ApacheCon North America is back in person this year in October.
>> https://apachecon.com/acna2022/
>>
>> Together with Jarek Potiuk, we are organizing for the first time a Data
>> Engineering Track as part of ApacheCon.
>>
>> You might be wondering why a different track if we already have the Big Data
>> track. Simple: this new track covers the ‘other’ open-source projects we use to
>> clean data, orchestrate workloads, do observability, visualization, governance,
>> data lineage, and many other tasks that are part of data engineering and that
>> are usually not covered by the data processing / database tracks.
>>
>> If you are curious you can find more details here:
>> https://s.apache.org/apacheconna-2022-dataeng-track
>>
>> So why are you getting this message? Well it could be that (1) you are
>> already a contributor to a project in the data engineering space and you
>> might be interested in sending your proposal, or (2) you are interested in
>> integrations of these tools with your existing data tools/projects.
>>
>> If you are interested you can submit a proposal using the CfP link below.
>> Don’t forget to choose the Data Engineering Track.
>> https://apachecon.com/acna2022/cfp.html
>>
>> The Call for Presentations (CfP) closes in less than one week, on May 23rd,
>> 2022.
>>
>> We are looking forward to receiving your submissions and hopefully seeing
>> you in New Orleans in October.
>>
>> Thanks,
>> Ismaël and Jarek
>>
>> ps. Apologies if you already received this email via a different channel/ML.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>




Data Engineering Track at ApacheCon (October 3-6, New Orleans) - CFP ends 23/05

2022-05-17 Thread Ismaël Mejía
Hello,

ApacheCon North America is back in person this year in October.
https://apachecon.com/acna2022/

Together with Jarek Potiuk, we are organizing for the first time a Data
Engineering Track as part of ApacheCon.

You might be wondering why a different track if we already have the Big Data
track. Simple: this new track covers the ‘other’ open-source projects we use to
clean data, orchestrate workloads, do observability, visualization, governance,
data lineage, and many other tasks that are part of data engineering and that are
usually not covered by the data processing / database tracks.

If you are curious you can find more details here:
https://s.apache.org/apacheconna-2022-dataeng-track

So why are you getting this message? Well it could be that (1) you are
already a contributor to a project in the data engineering space and you
might be interested in sending your proposal, or (2) you are interested in
integrations of these tools with your existing data tools/projects.

If you are interested you can submit a proposal using the CfP link below.
Don’t forget to choose the Data Engineering Track.
https://apachecon.com/acna2022/cfp.html

The Call for Presentations (CfP) closes in less than one week, on May 23rd,
2022.

We are looking forward to receiving your submissions and hopefully seeing you in
New Orleans in October.

Thanks,
Ismaël and Jarek

ps. Apologies if you already received this email via a different channel/ML.




Re: [ANNOUNCE] Apache Spark 3.1.3 released + Docker images

2022-02-25 Thread Ismaël Mejía
The ready-to-use Docker images are great news. I have been waiting for
this for so long! Extra kudos for including ARM64 versions too!

I am curious: what are the non-ASF artifacts included in them (or do you
refer to the OS-specific elements with other licenses?), and what might
the consequences be for end users because of that?

Thanks and kudos to everyone who helped to make this happen!
Ismaël

ps. Any plans to make these images official Docker images at some point
(for the extra security/validation)? [1]
[1] https://docs.docker.com/docker-hub/official_images/

On Mon, Feb 21, 2022 at 10:09 PM Holden Karau  wrote:
>
> We are happy to announce the availability of Spark 3.1.3!
>
> Spark 3.1.3 is a maintenance release containing stability fixes. This
> release is based on the branch-3.1 maintenance branch of Spark. We strongly
> recommend all 3.1 users to upgrade to this stable release.
>
> To download Spark 3.1.3, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-1-3.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> New Dockerhub magic in this release:
>
> We've also started publishing docker containers to the Apache Dockerhub,
> these contain non-ASF artifacts that are subject to different license terms
> than the Spark release. The docker containers are built for Linux x86 and
> ARM64 since that's what I have access to (thanks to NV for the ARM64 machines).
>
> You can get them from https://hub.docker.com/apache/spark (and spark-r and 
> spark-py) :)
> (And version 3.2.1 is also now published on Dockerhub).
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau




Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-14 Thread Ismaël Mejía
+1 (non-binding)

Tested on a downstream project without further issues. Ship-it time!


On Tue, May 11, 2021 at 9:37 PM Liang-Chi Hsieh  wrote:

> The staging repository for this release can be accessed now too:
> https://repository.apache.org/content/repositories/orgapachespark-1383/
>
> Thanks for the guidance.
>
>
> Liang-Chi Hsieh wrote
> > Seems it is closed now after clicking close button in the UI.
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-29 Thread Ismaël Mejía
+1 (non-binding)

On Mon, Mar 29, 2021 at 7:54 AM Wenchen Fan  wrote:
>
> +1
>
> On Mon, Mar 29, 2021 at 1:45 PM Holden Karau  wrote:
>>
>> +1
>>
>> On Sun, Mar 28, 2021 at 10:25 PM sarutak  wrote:
>>>
>>> +1 (non-binding)
>>>
>>> - Kousuke
>>>
>>> > +1 (non-binding)
>>> >
>>> > On Sun, Mar 28, 2021 at 9:06 PM 郑瑞峰 
>>> > wrote:
>>> >
>>> >> +1 (non-binding)
>>> >>
>>> >> -- Original Message --
>>> >>
>>> >> From: "Maxim Gekk" ;
>>> >> Sent: Monday, March 29, 2021, 2:08 AM
>>> >> To: "Matei Zaharia";
>>> >> Cc: "Gengliang Wang";"Mridul
>>> >> Muralidharan";"Xiao
>>> >> Li";"Spark dev
>>> >> list";"Takeshi
>>> >> Yamamuro";
>>> >> Subject: Re: [VOTE] SPIP: Support pandas API layer on PySpark
>>> >>
>>> >> +1 (non-binding)
>>> >>
>>> >> On Sun, Mar 28, 2021 at 8:53 PM Matei Zaharia
>>> >>  wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> Matei
>>> >>
>>> >> On Mar 28, 2021, at 1:45 AM, Gengliang Wang 
>>> >> wrote:
>>> >>
>>> >> +1 (non-binding)
>>> >>
>>> >> On Sun, Mar 28, 2021 at 11:12 AM Mridul Muralidharan
>>> >>  wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> Regards,
>>> >> Mridul
>>> >>
>>> >> On Sat, Mar 27, 2021 at 6:09 PM Xiao Li 
>>> >> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> Xiao
>>> >>
>>> >> Takeshi Yamamuro wrote on Friday, March 26, 2021 at 4:14 PM:
>>> >>
>>> >> +1 (non-binding)
>>> >>
>>> >> On Sat, Mar 27, 2021 at 4:53 AM Liang-Chi Hsieh 
>>> >> wrote:
>>> >> +1 (non-binding)
>>> >>
>>> >> rxin wrote
>>> >>> +1. Would open up a huge persona for Spark.
>>> >>>
>>> >>> On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler <
>>> >>
>>> >>> cutlerb@
>>> >>
>>>  wrote:
>>> >>>
>>> 
>>>  +1 (non-binding)
>>> 
>>> 
>>>  On Fri, Mar 26, 2021 at 9:49 AM Maciej <
>>> >>
>>> >>> mszymkiewicz@
>>> >>
>>>  wrote:
>>> 
>>> 
>>> > +1 (nonbinding)
>>> >>
>>> >> --
>>> >> Sent from:
>>> >> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>> >>
>>> >>
>>> > -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>> >> --
>>> >>
>>> >> ---
>>> >> Takeshi Yamamuro
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau




Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Ismaël Mejía
+1

Bringing a pandas API for PySpark into upstream Spark will only bring
benefits for everyone (more eyes to use/see/fix/improve the API) as
well as better alignment with core Spark improvements; the extra
weight looks manageable.
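
For readers unfamiliar with the proposal, here is a minimal, illustrative sketch
of the kind of workflow the SPIP targets, using the Koalas package the proposal
builds on (the import path is an assumption here, not the final upstream API):

    # pandas-style code whose operations are executed by Spark.
    # The import follows the existing Koalas project; the final upstream
    # module name may differ once the API layer is merged.
    import databricks.koalas as ks

    df = ks.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
    summary = df.groupby("group")["value"].mean()
    print(summary.to_pandas())  # collect the small result locally as real pandas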

On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas wrote:
>
> On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin  wrote:
>>
>> I don't think we should deprecate existing APIs.
>
>
> +1
>
> I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could 
> be wrong, but I wager most people who have worked with both Spark and Pandas 
> feel the same way.
>
> For the large community of current PySpark users, or users switching to 
> PySpark from another Spark language API, it doesn't make sense to deprecate 
> the current API, even by convention.




Re: Apache Spark Docker image repository

2021-03-03 Thread Ismaël Mejía
Since Spark 3.1.1 is out now, I was wondering if it would make sense to
try to get some consensus about starting to release Docker images as
part of Spark 3.2.
Having ready-to-use images would definitely benefit adoption, in
particular now that containerized runs via k8s have become GA.

WDYT? Are there still some issues/blockers or reasons to not move forward?

On Tue, Feb 18, 2020 at 2:29 PM Ismaël Mejía  wrote:
>
> +1 to have Spark docker images for Dongjoon's arguments: having a container-based
> distribution is definitely something in the benefit of users and the project too.
> Having this in the Apache Spark repo matters because of the multiple eyes to
> fix/improve the images for the benefit of everyone.
>
> What still needs to be tested is the best distribution approach. I have been
> involved in both Flink's and Beam's docker image processes (and passed the whole
> 'docker official image' validation), and one of the lessons learnt is that the
> less you put in an image the better it is for everyone. So I wonder whether the
> include-everything-in-the-world approach (Python, R, etc.) would scale, or if
> those should be overlays on top of a more core minimal image, but well, those
> are details to fix once consensus on this is agreed.
>
> On the Apache INFRA side there is some stuff to deal with at the beginning, but
> things become smoother once they are in place. In any case it is a fantastic
> idea and if I can help around I would be glad to.
>
> Regards,
> Ismaël
>
> On Tue, Feb 11, 2020 at 10:56 PM Dongjoon Hyun  
> wrote:
>>
>> Hi, Sean.
>>
>> Yes. We should keep this minimal.
>>
>> BTW, for the following questions,
>>
>> > But how much value does that add?
>>
>> How much value do you think we have at our binary distribution in the 
>> following link?
>>
>> - 
>> https://www.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>
>> Docker image can have a similar value with the above for the users who are 
>> using Dockerized environment.
>>
>> If you are assuming the users who build from the source code or lives on 
>> vendor distributions, both the above existing binary distribution link and 
>> Docker image have no value.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Feb 11, 2020 at 8:51 AM Sean Owen  wrote:
>>>
>>> To be clear this is a convenience 'binary' for end users, not just an
>>> internal packaging to aid the testing framework?
>>>
>>> There's nothing wrong with providing an additional official packaging
>>> if we vote on it and it follows all the rules. There is an open
>>> question about how much value it adds vs that maintenance. I see we do
>>> already have some Dockerfiles, sure. Is it possible to reuse or
>>> repurpose these so that we don't have more to maintain? or: what is
>>> different from the existing Dockerfiles here? (dumb question, never
>>> paid much attention to them)
>>>
>>> We definitely can't release GPL bits or anything, yes. Just releasing
>>> a Dockerfile referring to GPL bits is a gray area - no bits are being
>>> redistributed, but, does it constitute a derived work where the GPL
>>> stuff is a non-optional dependency? Would any publishing of these
>>> images cause us to put a copy of third party GPL code anywhere?
>>>
>>> At the least, we should keep this minimal. One image if possible, that
>>> you overlay on top of your preferred OS/Java/Python image. But how
>>> much value does that add? I have no info either way that people want
>>> or don't need such a thing.
>>>
>>> On Tue, Feb 11, 2020 at 10:13 AM Erik Erlandson  wrote:
>>> >
>>> > My takeaway from the last time we discussed this was:
>>> > 1) To be ASF compliant, we needed to only publish images at official 
>>> > releases
>>> > 2) There was some ambiguity about whether or not a container image that 
>>> > included GPL'ed packages (spark images do) might trip over the GPL "viral 
>>> > propagation" due to integrating ASL and GPL in a "binary release".  The 
>>> > "air gap" GPL provision may apply - the GPL software interacts only at 
>>> > command-line boundaries.
>>> >
>>> > On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun  
>>> > wrote:
>>> >>
>>> >> Hi, All.
>>> >>
>>> >> From 2020, shall we have an official Docker image repository as an 
>>> >> additional distribution channel?
>>> >>
>>> >> I'm considering the following images.
>>> >>
>>> >> - Public binary release (no snapshot image)
>>> >> - Public non-Spark base image (OS + R + Python)
>>> >>   (This can be used in GitHub Action Jobs and Jenkins K8s 
>>> >> Integration Tests to speed up jobs and to have more stabler environments)
>>> >>
>>> >> Bests,
>>> >> Dongjoon.




Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-25 Thread Ismaël Mejía
Since the TPC-DS performance tests are one of the main validation sources
for regressions in Spark releases, maybe it is time to automate the
validation of query outputs to find correctness issues eagerly (it would
also be nice to validate performance regressions, but correctness >>>
performance).

This has been a long-standing open issue [1] that is probably worth
addressing, and it seems that automating this via GitHub Actions could be
relatively straightforward.

[1] https://github.com/databricks/spark-sql-perf/issues/184
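
As a rough illustration of the kind of automation meant here, a minimal PySpark
sketch that diffs each query's output against a previously recorded baseline;
the query files, query list, and baseline paths are hypothetical placeholders
for whatever harness produces them:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tpcds-correctness-check").getOrCreate()

    # Hypothetical layout: one SQL file per TPC-DS query and one recorded
    # baseline (e.g. produced by a previous, trusted Spark release) per query.
    queries = {"q17": "tpcds/q17.sql", "q18": "tpcds/q18.sql", "q39a": "tpcds/q39a.sql"}

    for name, sql_path in queries.items():
        with open(sql_path) as f:
            result = spark.sql(f.read())
        baseline = spark.read.parquet(f"baselines/{name}")
        # exceptAll in both directions is an order-insensitive, duplicate-aware
        # equality check between the two result sets.
        if result.exceptAll(baseline).count() > 0 or baseline.exceptAll(result).count() > 0:
            raise AssertionError(f"{name}: output differs from the recorded baseline")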


On Wed, Feb 24, 2021 at 8:15 PM Reynold Xin  wrote:

> +1 Correctness issues are serious!
>
>
> On Wed, Feb 24, 2021 at 11:08 AM, Mridul Muralidharan 
> wrote:
>
>> That is indeed cause for concern.
>> +1 on extending the voting deadline until we finish investigation of this.
>>
>> Regards,
>> Mridul
>>
>>
>> On Wed, Feb 24, 2021 at 12:55 PM Xiao Li  wrote:
>>
>>> -1 Could we extend the voting deadline?
>>>
>>> A few TPC-DS queries (q17, q18, q39a, q39b) are returning different
>>> results between Spark 3.0 and Spark 3.1. We need a few more days to
>>> understand whether these changes are expected.
>>>
>>> Xiao
>>>
>>>
>>> Mridul Muralidharan wrote on Wednesday, February 24, 2021 at 10:41 AM:
>>>

 Sounds good, thanks for clarifying Hyukjin !
 +1 on release.

 Regards,
 Mridul


 On Wed, Feb 24, 2021 at 2:46 AM Hyukjin Kwon 
 wrote:

> I remember HiveExternalCatalogVersionsSuite was flaky for a while
> which is fixed in
> https://github.com/apache/spark/commit/0d5d248bdc4cdc71627162a3d20c42ad19f24ef4
> and .. KafkaDelegationTokenSuite is flaky (
> https://issues.apache.org/jira/browse/SPARK-31250).
>
> On Wednesday, February 24, 2021 at 5:19 PM, Mridul Muralidharan wrote:
>
>>
>> Signatures, digests, etc check out fine.
>> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Phive
>> -Phive-thriftserver -Pmesos -Pkubernetes
>>
>> I keep getting test failures with
>> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
>> * org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.
>> (Note: I remove $HOME/.m2 and $HOME/.iv2 paths before build)
>>
>> Removing these suites gets the build through though - does anyone
>> have suggestions on how to fix it ? I did not face this with RC1.
>>
>> Regards,
>> Mridul
>>
>>
>> On Mon, Feb 22, 2021 at 12:57 AM Hyukjin Kwon 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 3.1.1.
>>>
>>> The vote is open until February 24th 11PM PST and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.1.1-rc3 (commit
>>> 1d550c4e90275ab418b9161925049239227f3dc9):
>>> https://github.com/apache/spark/tree/v3.1.1-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> 
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1367
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/
>>>
>>> The list of bug fixes going into 3.1.1 can be found at the following
>>> URL:
>>> https://s.apache.org/41kf2
>>>
>>> This release is using the release script of the tag v3.1.1-rc3.
>>>
>>> FAQ
>>>
>>> ===
>>> What happened to 3.1.0?
>>> ===
>>>
>>> There was a technical issue during Apache Spark 3.1.0 preparation,
>>> and it was discussed and decided to skip 3.1.0.
>>> Please see
>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>>> more details.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate,
>>> then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC via "pip install
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/pyspark-3.1.1.tar.gz
>>> "
>>> and see if anything important breaks.
>>> In the Java/Scala, you can add the 

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-15 Thread Ismaël Mejía
Any chance that SPARK-29536 ("PySpark does not work with Python 3.8.0")
can be backported to 2.4.7?
This was not done for Spark 2.4.6 because it was too late in the vote
process, but it makes perfect sense to have it in 2.4.7.

On Wed, Jul 15, 2020 at 9:07 AM Wenchen Fan  wrote:
>
> Yea I think 2.4.7 is good to go. Let's start!
>
> On Wed, Jul 15, 2020 at 1:50 PM Prashant Sharma  wrote:
>>
>> Hi Folks,
>>
>> So, I am back, and searched the JIRAS with target version as "2.4.7" and 
>> Resolved, found only 2 jiras. So, are we good to go, with just a couple of 
>> jiras fixed ? Shall I proceed with making a RC?
>>
>> Thanks,
>> Prashant
>>
>> On Thu, Jul 2, 2020 at 5:23 PM Prashant Sharma  wrote:
>>>
>>> Thank you, Holden.
>>>
>>> Folks, My health has gone down a bit. So, I will start working on this in a 
>>> few days. If this needs to be published sooner, then maybe someone else has 
>>> to help out.
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jul 2, 2020 at 10:11 AM Holden Karau  wrote:

 I’m happy to have Prashant do 2.4.7 :)

 On Wed, Jul 1, 2020 at 9:40 PM Xiao Li  wrote:
>
> +1 on releasing both 3.0.1 and 2.4.7
>
> Great! Three committers volunteer to be a release manager. Ruifeng, 
> Prashant and Holden. Holden just helped release Spark 2.4.6. This time, 
> maybe, Ruifeng and Prashant can be the release manager of 3.0.1 and 2.4.7 
> respectively.
>
> Xiao
>
> On Wed, Jul 1, 2020 at 2:24 PM Jungtaek Lim 
>  wrote:
>>
>> https://issues.apache.org/jira/browse/SPARK-32148 was reported 
>> yesterday, and if the report is valid it looks to be a blocker. I'll try 
>> to take a look sooner.
>>
>> On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman 
>>  wrote:
>>>
>>> Thanks Holden -- it would be great to also get 2.4.7 started
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Jun 30, 2020 at 10:31 PM Holden Karau  
>>> wrote:
>>> >
>>> > I can take care of 2.4.7 unless someone else wants to do it.
>>> >
>>> > On Tue, Jun 30, 2020 at 8:29 PM Jason Moore 
>>> >  wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >>
>>> >>
>>> >> Could I get some input on the severity of this one that I found 
>>> >> yesterday?  If that’s a correctness issue, should it block this 
>>> >> patch?  Let me know under the ticket if there’s more info that I can 
>>> >> provide to help.
>>> >>
>>> >>
>>> >>
>>> >> https://issues.apache.org/jira/browse/SPARK-32136
>>> >>
>>> >>
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Jason.
>>> >>
>>> >>
>>> >>
>>> >> From: Jungtaek Lim 
>>> >> Date: Wednesday, 1 July 2020 at 10:20 am
>>> >> To: Shivaram Venkataraman 
>>> >> Cc: Prashant Sharma , 郑瑞峰 
>>> >> , Gengliang Wang 
>>> >> , gurwls223 , 
>>> >> Dongjoon Hyun , Jules Damji 
>>> >> , Holden Karau , Reynold 
>>> >> Xin , Yuanjian Li , 
>>> >> "dev@spark.apache.org" , Takeshi Yamamuro 
>>> >> 
>>> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>>> >>
>>> >>
>>> >>
>>> >> SPARK-32130 [1] looks to be a performance regression introduced in 
>>> >> Spark 3.0.0, which is ideal to look into before releasing another 
>>> >> bugfix version.
>>> >>
>>> >>
>>> >>
>>> >> 1. https://issues.apache.org/jira/browse/SPARK-32130
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman 
>>> >>  wrote:
>>> >>
>>> >> Hi all
>>> >>
>>> >>
>>> >>
>>> >> I just wanted to ping this thread to see if all the outstanding 
>>> >> blockers for 3.0.1 have been fixed. If so, it would be great if we 
>>> >> can get the release going. The CRAN team sent us a note that the 
>>> >> version SparkR available on CRAN for the current R version (4.0.2) 
>>> >> is broken and hence we need to update the package soon --  it will 
>>> >> be great to do it with 3.0.1.
>>> >>
>>> >>
>>> >>
>>> >> Thanks
>>> >>
>>> >> Shivaram
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma 
>>> >>  wrote:
>>> >>
>>> >> +1 for 3.0.1 release.
>>> >>
>>> >> I too can help out as release manager.
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:
>>> >>
>>> >> I volunteer to be a release manager of 3.0.1, if nobody is working 
>>> >> on this.
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> -- Original Message --
>>> >>
>>> >> From: "Gengliang Wang";
>>> >>
>>> >> Sent: Wednesday, June 24, 2020, 4:15 PM
>>> >>
>>> >> To: "Hyukjin Kwon";
>>> >>
>>> >> Cc: "Dongjoon Hyun";"Jungtaek 
>>> >> Lim";"Jules 
>>> >> Damji";"Holden 
>>> >> Karau";"Reynold 
>>> >> Xin";"Shivaram 
>>> >> Venkataraman";"Yuanjian 
>>> >> Li";"Spark dev 

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-30 Thread Ismaël Mejía
I was wondering if there is any chance that "SPARK-29536 PySpark does not work
with Python 3.8.0" could eventually get backported for a future RC. It seems
important enough considering that Python 3.8 is now the default version for
people working on the latest Ubuntu LTS. I understand, however, that there are
probably extra consequences to making this happen (CI and friends), but it would
be a really nice one to have.

On Mon, May 18, 2020 at 11:59 PM Holden Karau  wrote:
>
> That seems like an important concern. I'm going to go ahead and vote -1 on 
> this RC and I'll roll a new RC once the IndyLambda support is backported into 
> the 2.4 branch.
>
> On Mon, May 18, 2020 at 2:58 PM DB Tsai  wrote:
>>
>> I am changing my vote from +1 to +0.
>>
>> Since Spark 3.0 is Scala 2.12 only, having a transitional 2.4.x
>> release with great support of Scala 2.12 is very important. I would
>> like to have [SPARK-31399][CORE] Support indylambda Scala closure in
>> ClosureCleaner backported. Without it, it might break users' code when
>> upgrading from Scala 2.11 to Scala 2.12.
>>
>> Thanks,
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>>
>> On Mon, May 18, 2020 at 2:47 PM Holden Karau  wrote:
>> >
>> > Another two candidates for backporting that have come up since this RC are 
>> > SPARK-31692 & SPARK-31399. What are folks thoughts, should we roll an RC4?
>> >
>> > On Mon, May 18, 2020 at 2:13 PM Sean Owen  wrote:
>> >>
>> >> Ah OK, I assumed from the timing that this was cut to include that 
>> >> commit. I should have looked.
>> >> Yes, it is not strictly a regression so does not have to block the 
>> >> release and this can pass. We can release 2.4.7 in a few months, too.
>> >> How important is the fix? If it's pretty important, it may still be 
>> >> useful to run one more RC, if it's not too much trouble.
>> >>
>> >> On Mon, May 18, 2020 at 11:25 AM Holden Karau  
>> >> wrote:
>> >>>
>> >>> That is correct. I asked on the PR if that was ok with folks before I 
>> >>> moved forward with the RC and was told that it was ok. I believe that 
>> >>> particular bug is not a regression and is a long standing issue so we 
>> >>> wouldn’t normally block the release on it.
>> >>>
>> >>> On Mon, May 18, 2020 at 7:40 AM Xiao Li  wrote:
>> 
>>  This RC does not include the correctness bug fix 
>>  https://github.com/apache/spark/commit/a4885f3654899bcb852183af70cc0a82e7dd81d0
>>   which is just after RC3 cut.
>> 
>>  On Mon, May 18, 2020 at 7:21 AM Tom Graves 
>>   wrote:
>> >
>> > +1.
>> >
>> > Tom
>> >
>> > On Monday, May 18, 2020, 08:05:24 AM CDT, Wenchen Fan 
>> >  wrote:
>> >
>> >
>> > +1, no known blockers.
>> >
>> > On Mon, May 18, 2020 at 12:49 AM DB Tsai  wrote:
>> >
>> > +1 as well. Thanks.
>> >
>> > On Sun, May 17, 2020 at 7:39 AM Sean Owen  wrote:
>> >
>> > +1 , same response as to the last RC.
>> > This looks like it includes the fix discussed last time, as well as a
>> > few more small good fixes.
>> >
>> > On Sat, May 16, 2020 at 12:08 AM Holden Karau  
>> > wrote:
>> > >
>> > > Please vote on releasing the following candidate as Apache Spark 
>> > > version 2.4.6.
>> > >
>> > > The vote is open until May 22nd at 9AM PST and passes if a majority 
>> > > +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> > >
>> > > [ ] +1 Release this package as Apache Spark 2.4.6
>> > > [ ] -1 Do not release this package because ...
>> > >
>> > > To learn more about Apache Spark, please see http://spark.apache.org/
>> > >
>> > > There are currently no issues targeting 2.4.6 (try project = SPARK 
>> > > AND "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In 
>> > > Progress"))
>> > >
>> > > The tag to be voted on is v2.4.6-rc3 (commit 
>> > > 570848da7c48ba0cb827ada997e51677ff672a39):
>> > > https://github.com/apache/spark/tree/v2.4.6-rc3
>> > >
>> > > The release files, including signatures, digests, etc. can be found 
>> > > at:
>> > > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc3-bin/
>> > >
>> > > Signatures used for Spark RCs can be found in this file:
>> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > >
>> > > The staging repository for this release can be found at:
>> > > https://repository.apache.org/content/repositories/orgapachespark-1344/
>> > >
>> > > The documentation corresponding to this release can be found at:
>> > > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc3-docs/
>> > >
>> > > The list of bug fixes going into 2.4.6 can be found at the 

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-08 Thread Ismaël Mejía
+1 (non-binding)

Michael's section on the trade-offs of maintaining / removing an API is one of
the best reads I have seen on this mailing list. Enthusiastic +1.

On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun  wrote:
>
> This new policy has a good intention, but can we narrow down on the migration 
> from Apache Spark 2.4.5 to Apache Spark 3.0+?
>
> I saw that there already exists a reverting PR to bring back Spark 1.4 and 
> 1.5 APIs based on this AS-IS suggestion.
>
> The AS-IS policy is clearly mentioning that JVM/Scala-level difficulty, and 
> it's nice.
>
> However, for the other cases, it sounds like `recommending older APIs as much 
> as possible` due to the following.
>
>  > How long has the API been in Spark?
>
> We had better be more careful when we add a new policy and should aim not to 
> mislead the users and 3rd party library developers to say "older is better".
>
> Technically, I'm wondering who will use new APIs in their examples (of books 
> and StackOverflow) if they need to write an additional warning like `this 
> only works at 2.4.0+` always .
>
> Bests,
> Dongjoon.
>
> On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan  wrote:
>>
>> I am in broad agreement with the prposal, as any developer, I prefer
>> stable well designed API's :-)
>>
>> Can we tie the proposal to stability guarantees given by spark and
>> reasonable expectation from users ?
>> In my opinion, an unstable or evolving could change - while an
>> experimental api which has been around for ages should be more
>> conservatively handled.
>> Which brings in question what are the stability guarantees as
>> specified by annotations interacting with the proposal.
>>
>> Also, can we expand on 'when' an API change can occur ?  Since we are
>> proposing to diverge from semver.
>> Patch release ? Minor release ? Only major release ? Based on 'impact'
>> of API ? Stability guarantees ?
>>
>> Regards,
>> Mridul
>>
>>
>>
>> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust  
>> wrote:
>> >
>> > I'll start off the vote with a strong +1 (binding).
>> >
>> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust  
>> > wrote:
>> >>
>> >> I propose to add the following text to Spark's Semantic Versioning policy 
>> >> and adopt it as the rubric that should be used when deciding to break 
>> >> APIs (even at major versions such as 3.0).
>> >>
>> >>
>> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a 
>> >> procedural vote, the measure will pass if there are more favourable votes 
>> >> than unfavourable ones. PMC votes are binding, but the community is 
>> >> encouraged to add their voice to the discussion.
>> >>
>> >>
>> >> [ ] +1 - Spark should adopt this policy.
>> >>
>> >> [ ] -1  - Spark should not adopt this policy.
>> >>
>> >>
>> >> 
>> >>
>> >>
>> >> Considerations When Breaking APIs
>> >>
>> >> The Spark project strives to avoid breaking APIs or silently changing 
>> >> behavior, even at major versions. While this is not always possible, the 
>> >> balance of the following factors should be considered before choosing to 
>> >> break an API.
>> >>
>> >>
>> >> Cost of Breaking an API
>> >>
>> >> Breaking an API almost always has a non-trivial cost to the users of 
>> >> Spark. A broken API means that Spark programs need to be rewritten before 
>> >> they can be upgraded. However, there are a few considerations when 
>> >> thinking about what the cost will be:
>> >>
>> >> Usage - an API that is actively used in many different places, is always 
>> >> very costly to break. While it is hard to know usage for sure, there are 
>> >> a bunch of ways that we can estimate:
>> >>
>> >> How long has the API been in Spark?
>> >>
>> >> Is the API common even for basic programs?
>> >>
>> >> How often do we see recent questions in JIRA or mailing lists?
>> >>
>> >> How often does it appear in StackOverflow or blogs?
>> >>
>> >> Behavior after the break - How will a program that works today, work 
>> >> after the break? The following are listed roughly in order of increasing 
>> >> severity:
>> >>
>> >> Will there be a compiler or linker error?
>> >>
>> >> Will there be a runtime exception?
>> >>
>> >> Will that exception happen after significant processing has been done?
>> >>
>> >> Will we silently return different answers? (very hard to debug, might not 
>> >> even notice!)
>> >>
>> >>
>> >> Cost of Maintaining an API
>> >>
>> >> Of course, the above does not mean that we will never break any APIs. We 
>> >> must also consider the cost both to the project and to our users of 
>> >> keeping the API in question.
>> >>
>> >> Project Costs - Every API we have needs to be tested and needs to keep 
>> >> working as other parts of the project changes. These costs are 
>> >> significantly exacerbated when external dependencies change (the JVM, 
>> >> Scala, etc). In some cases, while not completely technically infeasible, 
>> >> the cost of maintaining a particular API can become too high.
>> >>
>> >> User Costs - APIs also have 

Re: [DISCUSS] Shall we mark spark streaming component as deprecated.

2020-03-02 Thread Ismaël Mejía
Is it really ready to be deprecated? The fact that we cannot do multiple
aggregations with Structured Streaming [1] is a serious runtime limitation that
the DStream API does not have.
Is it worth deprecating it without having an equivalent set of features?

[1] https://issues.apache.org/jira/browse/SPARK-26655
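
To make the limitation concrete, here is a minimal PySpark sketch (the rate
source is used purely for illustration): the second aggregation below is
rejected with an AnalysisException ("Multiple streaming aggregations are not
supported...") once the query is started, while the equivalent multi-stage
aggregation is straightforward to express with DStreams.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("multi-agg-limitation").getOrCreate()

    stream = spark.readStream.format("rate").load()  # toy streaming source

    # First streaming aggregation: counts per 10-second window.
    windowed = stream.groupBy(F.window("timestamp", "10 seconds")).count()

    # Second aggregation on top of the first: this is the unsupported pattern.
    total = windowed.agg(F.sum("count"))

    # The unsupported-operation check fires when the query starts.
    query = total.writeStream.outputMode("complete").format("console").start()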


On Mon, Mar 2, 2020 at 11:52 AM Prashant Sharma  wrote:
>
> Hi All,
>
> It is noticed that some of the users of Spark Streaming do not immediately
> realise that it is a deprecated component, and it would be scary if they end
> up with it in production. Now that we are in a position to release
> Spark 3.0.0, maybe we should discuss: should Spark Streaming carry an
> explicit notice that it is deprecated and not under active development?
>
> I have opened an issue already, but I think a mailing list discussion would 
> be more appropriate. https://issues.apache.org/jira/browse/SPARK-31006
>
> Thanks,
> Prashant.
>




Re: Apache Spark Docker image repository

2020-02-18 Thread Ismaël Mejía
+1 to have Spark docker images for Dongjoon's arguments: having a container-based
distribution is definitely something in the benefit of users and the project too.
Having this in the Apache Spark repo matters because of the multiple eyes to
fix/improve the images for the benefit of everyone.

What still needs to be tested is the best distribution approach. I have been
involved in both Flink's and Beam's docker image processes (and passed the whole
'docker official image' validation), and one of the lessons learnt is that the
less you put in an image the better it is for everyone. So I wonder whether the
include-everything-in-the-world approach (Python, R, etc.) would scale, or if
those should be overlays on top of a more core minimal image, but well, those
are details to fix once consensus on this is agreed.

On the Apache INFRA side there is some stuff to deal with at the beginning, but
things become smoother once they are in place. In any case it is a fantastic
idea and if I can help around I would be glad to.

Regards,
Ismaël

On Tue, Feb 11, 2020 at 10:56 PM Dongjoon Hyun wrote:

> Hi, Sean.
>
> Yes. We should keep this minimal.
>
> BTW, for the following questions,
>
> > But how much value does that add?
>
> How much value do you think we have at our binary distribution in the
> following link?
>
> -
> https://www.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>
> Docker image can have a similar value with the above for the users who are
> using Dockerized environment.
>
> If you are assuming the users who build from the source code or lives on
> vendor distributions, both the above existing binary distribution link
> and Docker image have no value.
>
> Bests,
> Dongjoon.
>
>
> On Tue, Feb 11, 2020 at 8:51 AM Sean Owen  wrote:
>
>> To be clear this is a convenience 'binary' for end users, not just an
>> internal packaging to aid the testing framework?
>>
>> There's nothing wrong with providing an additional official packaging
>> if we vote on it and it follows all the rules. There is an open
>> question about how much value it adds vs that maintenance. I see we do
>> already have some Dockerfiles, sure. Is it possible to reuse or
>> repurpose these so that we don't have more to maintain? or: what is
>> different from the existing Dockerfiles here? (dumb question, never
>> paid much attention to them)
>>
>> We definitely can't release GPL bits or anything, yes. Just releasing
>> a Dockerfile referring to GPL bits is a gray area - no bits are being
>> redistributed, but, does it constitute a derived work where the GPL
>> stuff is a non-optional dependency? Would any publishing of these
>> images cause us to put a copy of third party GPL code anywhere?
>>
>> At the least, we should keep this minimal. One image if possible, that
>> you overlay on top of your preferred OS/Java/Python image. But how
>> much value does that add? I have no info either way that people want
>> or don't need such a thing.
>>
>> On Tue, Feb 11, 2020 at 10:13 AM Erik Erlandson 
>> wrote:
>> >
>> > My takeaway from the last time we discussed this was:
>> > 1) To be ASF compliant, we needed to only publish images at official
>> releases
>> > 2) There was some ambiguity about whether or not a container image that
>> included GPL'ed packages (spark images do) might trip over the GPL "viral
>> propagation" due to integrating ASL and GPL in a "binary release".  The
>> "air gap" GPL provision may apply - the GPL software interacts only at
>> command-line boundaries.
>> >
>> > On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun 
>> wrote:
>> >>
>> >> Hi, All.
>> >>
>> >> From 2020, shall we have an official Docker image repository as an
>> additional distribution channel?
>> >>
>> >> I'm considering the following images.
>> >>
>> >> - Public binary release (no snapshot image)
>> >> - Public non-Spark base image (OS + R + Python)
>> >>   (This can be used in GitHub Action Jobs and Jenkins K8s
>> Integration Tests to speed up jobs and to have more stabler environments)
>> >>
>> >> Bests,
>> >> Dongjoon.
>>
>


Re: SPIP: Spark on Kubernetes

2017-08-16 Thread Ismaël Mejía
+1 (non-binding)

This is something really great to have. More schedulers and runtime
environments are a HUGE win for the Spark ecosystem.
Amazing work, and big kudos to the people who created this and continue working on it.

On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com
 wrote:
> From our perspective, we have invested heavily in Kubernetes as our cluster
> manager of choice.
>
> We also make quite heavy use of spark.  We've been experimenting with using
> these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
> already 'paid the price' to operate Kubernetes in AWS it seems rational to
> move our jobs over to spark on k8s.  Having this project merged into the
> master will significantly ease keeping our Data Munging toolchain primarily
> on Spark.
>
>
> Gary Lucas
> Data Ops Team Lead
> Unbounce
>
> On 15 August 2017 at 15:52, Andrew Ash  wrote:
>>
>> +1 (non-binding)
>>
>> We're moving large amounts of infrastructure from a combination of open
>> source and homegrown cluster management systems to unify on Kubernetes and
>> want to bring Spark workloads along with us.
>>
>> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926  wrote:
>>>
>>> +1 (non-binding)
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
>




Re: spark-packages with maven

2016-07-15 Thread Ismaël Mejía
Thanks for the info, Burak. I will check the repo you mention. Do you know
concretely what 'magic' a spark-package needs, or if there is any
document with info about it?

On Fri, Jul 15, 2016 at 10:12 PM, Luciano Resende <luckbr1...@gmail.com>
wrote:

>
> On Fri, Jul 15, 2016 at 10:48 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> +1000
>>
>> Thanks Ismael for bringing this up! I meant to have sent it earlier too
>> since I've been struggling with a sbt-based Scala project for a Spark
>> package myself this week and haven't yet found out how to do local
>> publishing.
>>
>> If such a guide existed for Maven I could use it for sbt easily too :-)
>>
>> Ping me Ismael if you don't hear back from the group so I feel invited
>> for digging into the plugin's sources.
>>
>> Best,
>> Jacek
>>
>> On 15 Jul 2016 2:29 p.m., "Ismaël Mejía" <ieme...@gmail.com> wrote:
>>
>> Hello, I would like to know if there is an easy way to package a new
>> spark-package
>> with maven, I just found this repo, but I am not an sbt user.
>>
>> https://github.com/databricks/sbt-spark-package
>>
>> One more question, is there a formal specification or documentation of
>> what do
>> you need to include in a spark-package (any special file, manifest, etc)
>> ? I
>> have not found any doc in the website.
>>
>> Thanks,
>> Ismael
>>
>>
>>
>
> I was under the impression that spark-packages was more like a place for
> one to list/advertise their extensions,  but when you do spark submit with
> --packages, it will use maven to resolve your package
> and as long as it succeeds, it will use it (e.g. you can do mvn clean
> install for your local packages, and use --packages with a spark server
> running on that same machine).
>
> From sbt, I think you can just use publishTo and define a local
> repository, something like
>
> publishTo := Some("Local Maven Repository" at 
> "file://"+Path.userHome.absolutePath+"/.m2/repository")
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
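
To complement the explanation above with a concrete (illustrative) example: once
a package has been installed into the local Maven repository (via mvn clean
install, or the sbt publishTo shown above), it can be pulled into a PySpark
session through the same Maven-coordinate resolution that --packages uses. The
coordinate below is a placeholder, not a real published artifact.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("local-package-example")
        # Hypothetical coordinate; Spark's resolver checks the local ~/.m2
        # repository before falling back to Maven Central / other repositories.
        .config("spark.jars.packages", "com.example:my-spark-package:0.1.0")
        .getOrCreate()
    )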


spark-packages with maven

2016-07-15 Thread Ismaël Mejía
Hello, I would like to know if there is an easy way to package a new
spark-package with Maven. I just found this repo, but I am not an sbt user.

https://github.com/databricks/sbt-spark-package

One more question: is there a formal specification or documentation of what
you need to include in a spark-package (any special file, manifest, etc.)? I
have not found any doc on the website.

Thanks,
Ismael