Re: [DISCUSS] Build error message guideline

2021-04-15 Thread Karen
I've created a PR to add the error message guidelines to the Spark
contributing guide. Would appreciate some eyes on it!
https://github.com/apache/spark-website/pull/332

On Wed, Apr 14, 2021 at 5:34 PM Yuming Wang  wrote:

> +1 LGTM.
>
> On Thu, Apr 15, 2021 at 1:50 AM Karen  wrote:
>
>> That makes sense to me: given that an assert failure throws an
>> AssertionError, I would say that the same guidelines should apply to
>> asserts.
>>
>> On Tue, Apr 13, 2021 at 7:41 PM Yuming Wang  wrote:
>>
>>> Do we have plans to apply these guidelines to assert? For example:
>>>
>>>
>>> https://github.com/apache/spark/blob/5b478416f8e3fe2f015af1b6c8faa7fe9f15c05d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L136-L138
>>>
>>> https://github.com/apache/spark/blob/053dd858d38e6107bc71e0aa3a4954291b74f8c8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourcePartitioning.scala#L41
>>>
>>> On Wed, Apr 14, 2021 at 9:27 AM Hyukjin Kwon 
>>> wrote:
>>>
 I would just go ahead and create a PR for that. Nothing written there
 looks unreasonable.
 But it's probably best to wait a couple of days to make sure people are
 happy with it.

 On Wed, Apr 14, 2021 at 6:38 AM, Karen wrote:

> If the proposed guidelines look good, it would be useful to share
> these guidelines with the wider community. A good landing page for
> contributors could be https://spark.apache.org/contributing.html.
> What do you think?
>
> Thank you,
>
> Karen Feng
>
> On Wed, Apr 7, 2021 at 8:19 PM Hyukjin Kwon 
> wrote:
>
>> LGTM (I took a look, and had some offline discussions w/ some
>> corrections before it came out)
>>
>> On Thu, Apr 8, 2021 at 5:28 AM, Karen wrote:
>>
>>> Hi all,
>>>
>>> As discussed in SPIP: Standardize Exception Messages in Spark (
>>> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing),
>>> improving error message quality in Apache Spark involves establishing an
>>> error message guideline for developers. Error message style guidelines 
>>> are
>>> common practice across open-source projects, for example PostgreSQL (
>>> https://www.postgresql.org/docs/current/error-style-guide.html).
>>>
>>> To move towards the goal of improving error message quality, we
>>> would like to start building an error message guideline. We have 
>>> attached a
>>> rough draft to kick off this discussion:
>>> https://docs.google.com/document/d/12k4zmaKmmdm6Pk63HS0N1zN1QT-6TihkWaa5CkLmsn8/edit?usp=sharing
>>> .
>>>
>>> Please let us know what you think should be in the guideline! We
>>> look forward to building this as a community.
>>>
>>> Thank you,
>>>
>>> Karen Feng
>>>
>>


Re: [SPARK-34738] issues w/k8s+minikube and PV tests

2021-04-15 Thread shane knapp ☠
i'm all for that...  and once they're turned off, we can finish the
minikube/k8s/move-to-docker project in a couple of hours max.

On Thu, Apr 15, 2021 at 3:00 PM Holden Karau  wrote:

> What about if we just turn off the PV tests for now?
> I'd be happy to help with the debugging/upgrading.
>
> On Thu, Apr 15, 2021 at 2:28 AM Rob Vesse  wrote:
> >
> > There’s at least one test (the persistent volumes one) that relies on
> > Minikube-specific functionality. We run integration tests for our $dayjob
> > Spark image builds using Docker for Desktop instead, and that one test
> > fails because of that dependency. The test could be refactored, because I
> > think it’s just adding a minimal Ceph cluster to the K8S cluster, which in
> > principle can be done on any K8S cluster.
> >
> >
> >
> > Rob
> >
> >
> >
> > From: shane knapp ☠ 
> > Date: Wednesday, 14 April 2021 at 18:56
> > To: Frank Luo 
> > Cc: dev , Brian K Shiratsuki 
> > Subject: Re: [SPARK-34738] issues w/k8s+minikube and PV tests
> >
> >
> >
> > On Wed, Apr 14, 2021 at 10:32 AM Frank Luo  wrote:
> >
> > Is there any hard dependency on Minikube (e.g. GPU settings)? kind (
> > https://kind.sigs.k8s.io/) is a more stable and simpler single-machine
> > k8s cluster environment (it only requires Docker), and it's been widely
> > used for testing k8s projects.
> >
> >
> >
> > there are no hard deps on minikube...  it installs happily and
> successfully runs every integration test except for persistent volumes.
> >
> >
> >
> > i haven't tried kind yet, but my time is super limited on this and i'd
> rather not venture down another rabbit hole unless we absolutely have to.
> >
> >
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [SPARK-34738] issues w/k8s+minikube and PV tests

2021-04-15 Thread Holden Karau
What about if we just turn off the PV tests for now?
I'd be happy to help with the debugging/upgrading.

On Thu, Apr 15, 2021 at 2:28 AM Rob Vesse  wrote:
>
> There’s at least one test (the persistent volumes one) that relies on
> Minikube-specific functionality. We run integration tests for our $dayjob
> Spark image builds using Docker for Desktop instead, and that one test
> fails because of that dependency. The test could be refactored, because I
> think it’s just adding a minimal Ceph cluster to the K8S cluster, which in
> principle can be done on any K8S cluster.
>
>
>
> Rob
>
>
>
> From: shane knapp ☠ 
> Date: Wednesday, 14 April 2021 at 18:56
> To: Frank Luo 
> Cc: dev , Brian K Shiratsuki 
> Subject: Re: [SPARK-34738] issues w/k8s+minikube and PV tests
>
>
>
> On Wed, Apr 14, 2021 at 10:32 AM Frank Luo  wrote:
>
> Is there any hard dependency on Minikube (e.g. GPU settings)? kind (
> https://kind.sigs.k8s.io/) is a more stable and simpler single-machine
> k8s cluster environment (it only requires Docker), and it's been widely
> used for testing k8s projects.
>
>
>
> there are no hard deps on minikube...  it installs happily and successfully 
> runs every integration test except for persistent volumes.
>
>
>
> i haven't tried kind yet, but my time is super limited on this and i'd rather 
> not venture down another rabbit hole unless we absolutely have to.
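(For anyone who does want to evaluate kind later, it takes very little setup. A minimal cluster config is sketched below, assuming kind's v1alpha4 config schema; the node layout is illustrative only.)

```yaml
# kind.yaml: minimal one-control-plane, one-worker cluster
# (a sketch; assumes kind's v1alpha4 config schema)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
```

Bringing it up would then be `kind create cluster --config kind.yaml`, with only Docker as a prerequisite.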
>
>



-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau




Re: [DISCUSS] Add error IDs

2021-04-15 Thread Karen
We could leave space in the numbering system, but a more flexible method
may be to have the severity as a field associated with the error class -
the same way we would associate error ID with SQLSTATE, or with whether an
error is user-facing or internal. As you noted, I don't believe there is a
standard framework for hints/warnings in Spark today. I propose that we
leave out severity as a field until there is sufficient demand. We will
leave room in the format for other fields.
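As a concrete sketch of that idea (the class names, fields, and values below are invented for illustration, not Spark's actual format), severity would simply ride alongside the other per-class fields:

```python
# Hypothetical error-class registry: severity is just another optional
# field, like SQLSTATE or the user-facing flag, rather than something
# encoded in the error ID's numbering.
ERROR_CLASSES = {
    "TABLE_OR_VIEW_NOT_FOUND": {
        "sqlstate": "42704",
        "user_facing": True,
        # "severity" omitted here; the lookup falls back to a default.
    },
    "INTERNAL_SHUFFLE_CORRUPTION": {
        "sqlstate": "XX000",
        "user_facing": False,
        "severity": "FATAL",
    },
}

def severity_of(error_class: str) -> str:
    """Return the class's severity, defaulting to ERROR when unset."""
    return ERROR_CLASSES[error_class].get("severity", "ERROR")

print(severity_of("TABLE_OR_VIEW_NOT_FOUND"))      # falls back to default
print(severity_of("INTERNAL_SHUFFLE_CORRUPTION"))  # explicit field
```

The point of the field-based approach is visible in the lookup: adding a severity level later changes data, not the ID format.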

On Thu, Apr 15, 2021 at 3:18 AM Steve Loughran 
wrote:

>
> Machine readable logs are always good, especially if you can read the
> entire logs into an SQL query.
>
> It might be good to use some specific differentiation between
> hint/warn/fatal errors in the numbering, so that any automated analysis of
> the logs can identify the class of an error even if it's an error that
> isn't actually recognised. See the VMS docs for an example of this; error
> handling in Windows is apparently based on that work:
> https://www.stsci.edu/ftp/documents/system-docs/vms-guide/html/VUG_19.html
> Even if things are only errors for now, leaving room in the format for
> other levels is wise.
>
> The trend in cloud infras is always to have some string like "NoSuchBucket"
> which is (a) guaranteed to be maintained over time and (b) searchable in
> Google.
>
> (That said, AWS has every service not just making up its own values but
> not even giving consistent responses for the same problem. S3 throttling:
> 503. DynamoDB: 500 plus one of two different messages; see
> com.amazonaws.retry.RetryUtils for the details.)
>
> On Wed, 14 Apr 2021 at 20:04, Karen  wrote:
>
>> Hi all,
>>
>> We would like to kick off a discussion on adding error IDs to Spark.
>>
>> Proposal:
>>
>> Add error IDs to provide a language-agnostic, locale-agnostic, specific,
>> and succinct answer for which class the problem falls under. When partnered
>> with a text-based error class (eg. 12345 TABLE_OR_VIEW_NOT_FOUND), error
>> IDs can provide meaningful categorization. They are useful for all Spark
>> personas: from users, to support engineers, to developers.
>>
>> Add SQLSTATEs. As discussed in #32013, SQLSTATEs are portable error codes
>> that are part of the ANSI/ISO SQL-99 standard, and are especially useful
>> for JDBC/ODBC users. They are not mutually exclusive with adding
>> product-specific error IDs, which can be more specific; for example, MySQL
>> uses an N-1 mapping from error IDs to SQLSTATEs:
>> https://dev.mysql.com/doc/refman/8.0/en/error-message-elements.html.
>>
>> Uniquely link error IDs to error messages (1-1). This simplifies the
>> auditing process and ensures that we uphold quality standards, as outlined
>> in SPIP: Standardize Error Message in Spark (
>> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit
>> ).
>>
>> Requirements:
>>
>> Changes are backwards compatible; developers should still be able to
>> throw exceptions in the existing style (eg. throw new
>> AnalysisException(“Arbitrary error message.”)). Adding error IDs will be a
>> gradual process, as there are thousands of exceptions thrown across the
>> code base.
>>
>> Optional:
>>
>> Label errors as user-facing or internal. Internal errors should be
>> logged, and end-users should be aware that they likely cannot fix the error
>> themselves.
>>
>> End result:
>>
>> Before:
>>
>> AnalysisException: Cannot find column ‘fakeColumn’; line 1 pos 14;
>>
>> After:
>>
>> AnalysisException: SPK-12345 COLUMN_NOT_FOUND: Cannot find column
>> ‘fakeColumn’; line 1 pos 14; (SQLSTATE 42704)
>>
>> Please let us know what you think about this proposal! We’d love to hear
>> what you think.
>>
>> Best,
>>
>> Karen Feng
>>
>
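To make the error-ID proposal above concrete, here is a minimal sketch of a 1-1 error-ID-to-message registry with an N-1 mapping from IDs to SQLSTATEs. The IDs, class names, and message templates are invented for illustration; this is not Spark's actual implementation.

```python
# Hypothetical registry: each error ID maps 1-1 to a message template,
# while several IDs may share one SQLSTATE (N-1), as in MySQL.
ERROR_CLASSES = {
    "SPK-12345": {
        "name": "COLUMN_NOT_FOUND",
        "template": "Cannot find column '{column}'; line {line} pos {pos};",
        "sqlstate": "42704",
    },
    "SPK-12346": {
        "name": "TABLE_OR_VIEW_NOT_FOUND",
        "template": "Table or view '{name}' not found.",
        "sqlstate": "42704",  # shares a SQLSTATE with SPK-12345
    },
}

def format_error(error_id: str, **params) -> str:
    """Render the full message: ID, class name, text, and SQLSTATE."""
    cls = ERROR_CLASSES[error_id]
    text = cls["template"].format(**params)
    return f"{error_id} {cls['name']}: {text} (SQLSTATE {cls['sqlstate']})"

print(format_error("SPK-12345", column="fakeColumn", line=1, pos=14))
# prints: SPK-12345 COLUMN_NOT_FOUND: Cannot find column 'fakeColumn';
#         line 1 pos 14; (SQLSTATE 42704)
```

Because each ID resolves to exactly one template, auditing message quality reduces to reviewing the registry, which is the 1-1 property the proposal asks for.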


Production results of push-based shuffle after rolling out to 100% of Spark workloads at LinkedIn

2021-04-15 Thread mshen
Hi,

We previously raised the SPIP for push-based shuffle in SPARK-30602.
Thanks to the reviews from the community, a significant portion of the code
has already been merged.

In the meantime, we have been continuing to improve the solution to scale
it to cover 100% of LinkedIn's offline Spark workloads, and we reached that
milestone last month.
We have observed a significant improvement in shuffle operation efficiency
as well as job runtime across the clusters, and the results are shared in
the following blog post:
https://www.linkedin.com/pulse/bringing-next-gen-shuffle-architecture-data-linkedin-scale-min-shen/

We would like to get feedback from the community on the content covered in
the blog post.
In addition, since the release timeline for Spark 3.2 is now postponed
until September, we believe it would be reasonable to include push-based
shuffle in the Spark 3.2 release itself, given that this feature has
already been validated in production at scale.
We also want to bring attention to the various patches currently under or
pending review under SPARK-30602, so we can get more eyes on the remaining
patches.



-
Min Shen
Sr. Staff Software Engineer
LinkedIn
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/




Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-15 Thread Liang-Chi Hsieh
Thanks all for voting. Unfortunately, we found a long-standing correctness
bug, SPARK-35080, which affects 2.4 as well. That means we need to drop RC2
in favor of RC3.

The fix is ready for merging at https://github.com/apache/spark/pull/32179.






--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/




Re: [DISCUSS] Add error IDs

2021-04-15 Thread Steve Loughran
Machine readable logs are always good, especially if you can read the
entire logs into an SQL query.

It might be good to use some specific differentiation between
hint/warn/fatal errors in the numbering, so that any automated analysis of
the logs can identify the class of an error even if it's an error that
isn't actually recognised. See the VMS docs for an example of this; error
handling in Windows is apparently based on that work:
https://www.stsci.edu/ftp/documents/system-docs/vms-guide/html/VUG_19.html
Even if things are only errors for now, leaving room in the format for
other levels is wise.

The trend in cloud infras is always to have some string like "NoSuchBucket"
which is (a) guaranteed to be maintained over time and (b) searchable in
Google.

(That said, AWS has every service not just making up its own values but
not even giving consistent responses for the same problem. S3 throttling:
503. DynamoDB: 500 plus one of two different messages; see
com.amazonaws.retry.RetryUtils for the details.)

On Wed, 14 Apr 2021 at 20:04, Karen  wrote:

> Hi all,
>
> We would like to kick off a discussion on adding error IDs to Spark.
>
> Proposal:
>
> Add error IDs to provide a language-agnostic, locale-agnostic, specific,
> and succinct answer for which class the problem falls under. When partnered
> with a text-based error class (eg. 12345 TABLE_OR_VIEW_NOT_FOUND), error
> IDs can provide meaningful categorization. They are useful for all Spark
> personas: from users, to support engineers, to developers.
>
> Add SQLSTATEs. As discussed in #32013, SQLSTATEs are portable error codes
> that are part of the ANSI/ISO SQL-99 standard, and are especially useful
> for JDBC/ODBC users. They are not mutually exclusive with adding
> product-specific error IDs, which can be more specific; for example, MySQL
> uses an N-1 mapping from error IDs to SQLSTATEs:
> https://dev.mysql.com/doc/refman/8.0/en/error-message-elements.html.
>
> Uniquely link error IDs to error messages (1-1). This simplifies the
> auditing process and ensures that we uphold quality standards, as outlined
> in SPIP: Standardize Error Message in Spark (
> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit
> ).
>
> Requirements:
>
> Changes are backwards compatible; developers should still be able to throw
> exceptions in the existing style (eg. throw new
> AnalysisException(“Arbitrary error message.”)). Adding error IDs will be a
> gradual process, as there are thousands of exceptions thrown across the
> code base.
>
> Optional:
>
> Label errors as user-facing or internal. Internal errors should be logged,
> and end-users should be aware that they likely cannot fix the error
> themselves.
>
> End result:
>
> Before:
>
> AnalysisException: Cannot find column ‘fakeColumn’; line 1 pos 14;
>
> After:
>
> AnalysisException: SPK-12345 COLUMN_NOT_FOUND: Cannot find column
> ‘fakeColumn’; line 1 pos 14; (SQLSTATE 42704)
>
> Please let us know what you think about this proposal! We’d love to hear
> what you think.
>
> Best,
>
> Karen Feng
>


Re: UserGroupInformation.doAS is working well in Spark Executors?

2021-04-15 Thread Steve Loughran
If are using kerberized HDFS the spark principal (or whoever is running the
cluster) has to be declared as a proxy user.

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html
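For reference, the corresponding proxy-user entries go in the NameNode's core-site.xml. The principal name "spark" and the wildcard values below are illustrative; real deployments should scope the hosts and groups down:

```xml
<!-- core-site.xml: allow the "spark" user to impersonate other users.
     Wildcards shown for brevity; restrict these in production. -->
<property>
  <name>hadoop.proxyuser.spark.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.spark.groups</name>
  <value>*</value>
</property>
```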

Once done, you create the proxy user:

val ugi = UserGroupInformation.createProxyUser("joe",
  UserGroupInformation.getLoginUser())

That user is then used to create the FS:

val proxyFS = ugi.doAs(new PrivilegedExceptionAction[FileSystem]() {
  override def run(): FileSystem =
    FileSystem.newInstance(new URI("hdfs://nn1/home/user/"), conf)
})


The proxyFS will then do all its IO as the given user, even when done
outside a doAs clause, e.g.

proxyFS.mkdirs(new Path("/home/user/alice/"))

FileSystem.get() also works on a UGI basis, so calling
FileSystem.get(new URI("hdfs://nn1"), conf) inside ugi.doAs returns a
different FS instance than the same call outside of the clause.

Once you are done with the FS, close it. If you know you are completely
done with the user across all threads, you can release them all

FileSystem.closeAllForUGI(ugi)

This closes all filesystems for that user, which is critical in long-lived
processes, as otherwise you'll run out of memory/threads.

On Mon, 12 Apr 2021 at 16:20, Kwangsun Noh  wrote:

> Hi, Spark users.
>
>
> I wanted to make unknown users create HDFS files, not the OS user who
> executes the spark application.
>
>
> And I thought it would be possible using
> UserGroupInformation.createRemoteUser(“other”).doAS(…)
>
>
> However, the files are created by the OS user who launched the spark
> application in Spark Executors.
>
>
> Although I’ve tested it on Spark Standalone and Yarn, I got the same
> results.
>
>
> Is it impossible to impersonate a Spark job user using the
> UserGroupInformation.doAS?
>
>
> PS. In fact, I posted a similar question on the Spark user mailing list,
>
>But I didn’t get the answer I wanted.
>
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-enable-to-use-Multiple-UGIs-in-One-Spark-Context-td39859.html
>


Re: [SPARK-34738] issues w/k8s+minikube and PV tests

2021-04-15 Thread Rob Vesse
There’s at least one test (the persistent volumes one) that relies on
Minikube-specific functionality. We run integration tests for our $dayjob
Spark image builds using Docker for Desktop instead, and that one test
fails because of that dependency. The test could be refactored, because I
think it’s just adding a minimal Ceph cluster to the K8S cluster, which in
principle can be done on any K8S cluster.

 

Rob

 

From: shane knapp ☠ 
Date: Wednesday, 14 April 2021 at 18:56
To: Frank Luo 
Cc: dev , Brian K Shiratsuki 
Subject: Re: [SPARK-34738] issues w/k8s+minikube and PV tests

 

On Wed, Apr 14, 2021 at 10:32 AM Frank Luo  wrote:

Is there any hard dependency on Minikube (e.g. GPU settings)? kind
(https://kind.sigs.k8s.io/) is a more stable and simpler single-machine
k8s cluster environment (it only requires Docker), and it's been widely
used for testing k8s projects.

 

there are no hard deps on minikube...  it installs happily and successfully 
runs every integration test except for persistent volumes.

 

i haven't tried kind yet, but my time is super limited on this and i'd rather 
not venture down another rabbit hole unless we absolutely have to.

 



Re: please read: current state and the future of the apache spark build system

2021-04-15 Thread Yikun Jiang
Many thanks for your work on the infra, @Shane. In particular, we (I and
@huangtianhua) got a lot of help from you when making the Arm CI work. [1]

> prepare jenkins worker ansible configs and stick in the spark repo

I took a quick glance at https://github.com/apache/spark/pull/32178, and
it doesn't seem to contain any Arm node setup or config related code.

*Do you have any plan to update the existing code to cover the Arm node
setup and configuration?* Or pointing us at some existing script would
also be okay.

*Do you have any special plan for the Arm node migration?* If needed, I
will help with some of the Arm-related node setup and config in the new
infra to make sure the Spark Arm CI keeps working.

BTW, we are also considering moving the Arm build from Jenkins to GitHub
Actions (using self-hosted runners or cloud deployment,
https://github.com/actions/starter-workflows/tree/main/ci); some pre-work
is being done by our team, see the PoC in [2] (cc @mgrigorov). Maybe it
can bring some ideas for the future infrastructure.

[1] https://amplab.cs.berkeley.edu/jenkins/label/spark-arm/
[2]
https://martin-grigorov.medium.com/githubactions-build-and-test-on-huaweicloud-arm64-af9d5c97b766
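For context, a GitHub Actions job pinned to a self-hosted arm64 runner is only a few lines of workflow config. The labels and build command below are illustrative, not the actual PoC workflow:

```yaml
# .github/workflows/arm64.yml: sketch of a self-hosted arm64 build job.
name: arm64-build
on: [push]
jobs:
  build:
    # Routes to any registered self-hosted runner carrying these labels.
    runs-on: [self-hosted, linux, ARM64]
    steps:
      - uses: actions/checkout@v2
      - name: Build Spark without tests
        run: ./build/mvn -B -DskipTests package
```

The same workflow file works whether the runner is on-prem hardware or a cloud arm64 VM; only the runner registration differs.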

Regards,
Yikun


On Thu, Apr 15, 2021 at 8:29 AM, Holden Karau wrote:

> Thanks Shane for keeping the build infra structure running for all of
> these years :)
>
> I've got some Kubernetes infra on AS399306 down in HE in Fremont; it's
> perhaps not of the newest variety, but so far no disk failures or
> anything like that (knock on wood, of course). The catch is it's on a 15
> amp circuit, and frankly I'm still learning how BGP works.
>
> Maybe we could experiment with
> https://github.com/lazybit-ch/actions-runner/tree/master/actions-runner
> and try nested Minikube (which I know is... not great, but might make
> things more portable)?
>
> Would the community (and or some of our corporate contributors) be
> open to contributing some hardware + power money or cloud credits?
>
> On Wed, Apr 14, 2021 at 5:13 PM Hyukjin Kwon  wrote:
> >
> > Thanks Shane!!
> >
> > On Thu, 15 Apr 2021, 09:03 shane knapp ☠,  wrote:
> >>>
> >>> medium term (in 6 months):
> >>> * prepare jenkins worker ansible configs and stick in the spark repo
> >>>   - nothing fancy, but enough to config ubuntu workers
> >>>   - could be used to create docker containers for testing in
> THE CLOUD
> >>>
> >> fwiw, i just decided to bang this out today:
> >> https://github.com/apache/spark/pull/32178
> >>
> >> shane
> >> --
> >> Shane Knapp
> >> Computer Guy / Voice of Reason
> >> UC Berkeley EECS Research / RISELab Staff Technical Lead
> >> https://rise.cs.berkeley.edu
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>