[Discuss] Metrics Support for DS V2

2020-01-16 Thread Sandeep Katta
Hi Devs, Currently DS V2 does not update any input metrics. SPARK-30362 aims at solving this problem. We can take the approach below: have a marker interface, let's say "ReportMetrics". If the DataSource implements this interface, then it will be easy to collect the metrics. For example:
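
A minimal sketch of what such a marker interface might look like, assuming a trait named ReportMetrics (the name from this proposal) and a hypothetical currentMetrics() method; the actual shape would be settled in SPARK-30362:

    trait ReportMetrics {
      // e.g. rows and bytes read so far by this source
      def currentMetrics(): Map[String, Long]
    }

    // A DS v2 source/reader could opt in simply by mixing the trait in, and
    // Spark could check for ReportMetrics to pull the values into the task's
    // input metrics.
    class MyPartitionReader extends ReportMetrics {
      private var rowsRead = 0L
      private var bytesRead = 0L
      override def currentMetrics(): Map[String, Long] =
        Map("recordsRead" -> rowsRead, "bytesRead" -> bytesRead)
    }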

Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-16 Thread Wenchen Fan
The proposal makes sense to me. If we are not going to make the interval type ANSI-compliant in this release, we should not expose it widely. Thanks for driving it, Kent! On Fri, Jan 17, 2020 at 10:52 AM Dr. Kent Yao wrote: > Following ANSI might be a good option but also a serious user behavior >

Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Hyukjin Kwon
Each configuration already has its own documentation. What we need to do is just list them up. On Fri, Jan 17, 2020 at 12:25 PM, Jules Damji wrote: > It’s one thing to get the names/values of the configurations, via the > Spark.sql(“set -v”), but another thing to understand what each achieves and >

Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Jules Damji
It’s one thing to get the names/values of the configurations, via the Spark.sql(“set -v”), but another thing to understand what each achieves and when and why you’ll want to use it. A webpage with a table and description of each would be a huge benefit. Cheers Jules Sent from my iPhone Pardon the

Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Hyukjin Kwon
Nicholas, are you interested in taking a stab at this? You could refer to https://github.com/apache/spark/commit/60472dbfd97acfd6c4420a13f9b32bc9d84219f3 On Fri, Jan 17, 2020 at 8:48 AM, Takeshi Yamamuro wrote: > The idea looks nice. I think web documents always help end users. > > Bests, > Takeshi > > On

Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-16 Thread Dr. Kent Yao
Following ANSI might be a good option but also a serious user behavior change to introduce two different interval types, so I also agree with Reynold to follow what we have done since version 1.5.0, just like Snowflake and Redshift. Perhaps, we can make some efforts for the current interval type

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Hyukjin Kwon
Thanks for giving me some context and clarification, Ryan. I think I was rather proposing to revert because I don't see an explicit plan here and it was just left half-done for a long while. From reading the PR description and code, I could not guess in which way we should fix this API

Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Takeshi Yamamuro
The idea looks nice. I think web documents always help end users. Bests, Takeshi On Fri, Jan 17, 2020 at 4:04 AM Shixiong(Ryan) Zhu wrote: > "spark.sql("set -v")" returns a Dataset that has all non-internal SQL > configurations. Should be pretty easy to automatically generate a SQL >

Re: [FYI] SBT Build Failure

2020-01-16 Thread Sean Owen
Ah. The Maven build has long since pointed at https:// for resolution, for security. I tried just overriding the resolver for the SBT build, but it doesn't seem to work. I don't understand the SBT build well enough to debug it right now. I think it's possible to override resolvers with local config
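
For reference, sbt's standard knob for this kind of override is a global ~/.sbt/repositories file plus -Dsbt.override.build.repos=true; whether that is actually sufficient for Spark's sbt build is exactly what is unclear here, so treat this as an untested sketch:

    [repositories]
      local
      maven-central: https://repo1.maven.org/maven2/

and then run the build with overriding enabled, e.g. build/sbt -Dsbt.override.build.repos=true clean compile.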

[FYI] SBT Build Failure

2020-01-16 Thread Dongjoon Hyun
Hi, All. As of now, the Apache Spark sbt build is broken by the Maven Central repository policy. - https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an > Effective January 15, 2020, The Central Maven Repository no longer supports

Re: PR lint-scala jobs failing with http error

2020-01-16 Thread Dongjoon Hyun
Hi, Tom and Shane. It looks like an old `sbt` bug. Maven Central seems to have started rejecting `http` access recently. If you use Maven, it's okay because it goes to `https`. $ build/sbt clean [error] org.apache.maven.model.building.ModelBuildingException: 1 problem was encountered while building the

Re: PR lint-scala jobs failing with http error

2020-01-16 Thread Shane Knapp
ah ok... looks like these were set up by dongjoon a while back. i've added him to this thread as i can't see the settings in the spark github repo. On Thu, Jan 16, 2020 at 1:58 PM Tom Graves wrote: > > Sorry should have included the link. It shows up in the pre checks failures, > but the

Re: PR lint-scala jobs failing with http error

2020-01-16 Thread Tom Graves
Sorry should have included the link. It shows up in the pre checks failures, but the tests still run and pass. For instance: https://github.com/apache/spark/pull/26682 and one more: https://github.com/apache/spark/pull/27240/checks?check_run_id=393888081

Re: PR lint-scala jobs failing with http error

2020-01-16 Thread Shane Knapp
i'm seeing a lot of green builds currently... if you think this is still happening, please include links to the failed jobs. thanks! shane (at a conference) On Thu, Jan 16, 2020 at 11:16 AM Tom Graves wrote: > > I'm seeing the scala-lint jobs fail on the pull request builds with: > > [error]

Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Maxim Gekk
Hi Bing, You can try the Text datasource. It shouldn't modify strings: scala> Seq(""""20192_1",1,24,0,2,"S66.000x001"""").toDS.write.text("tmp/text.txt") $ cat tmp/text.txt/part-0-256d960f-9f85-47fe-8edd-8428276eb3c6-c000.txt "20192_1",1,24,0,2,"S66.000x001" Maxim Gekk Software Engineer

PR lint-scala jobs failing with http error

2020-01-16 Thread Tom Graves
I'm seeing the scala-lint jobs fail on the pull request builds with: [error] [FATAL] Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:18 from/to central (http://repo.maven.apache.org/maven2): Error transferring file: Server returned HTTP response code: 501 for URL:

Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Shixiong(Ryan) Zhu
"spark.sql("set -v")" returns a Dataset that has all non-internal SQL configurations. Should be pretty easy to automatically generate a SQL configuration page. Best Regards, Ryan On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon wrote: > I think automatically creating a configuration page isn't a

Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Long, Andrew
Hey Bing, There are a couple of different approaches you could take. The quickest and easiest would be to use the existing APIs: val bytes = spark.range(1000) bytes.foreachPartition(bytes => { // WARNING: anything used in here will need to be serializable. // There's some magic to serializing the

Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Felix Cheung
I think it’s a good idea. From: Hyukjin Kwon Sent: Wednesday, January 15, 2020 5:49:12 AM To: dev Cc: Sean Owen ; Nicholas Chammas Subject: Re: More publicly documenting the options under spark.sql.* Resending to the dev list for archive purpose: I think

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Ryan Blue
Hi everyone, Let me recap some of the discussions that got us to where we are with this today. Hopefully that will provide some clarity. The purpose of partition transforms is to allow source implementations to internally handle partitioning. Right now, users are responsible for this. For
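
For context, a sketch of what this looks like on the write path with the DataFrameWriterV2 syntax from the PR in question (assuming Spark 3.0-style APIs; the catalog, table, and column names are made up): the user states intent and the source applies the transform itself, instead of users maintaining derived partition columns by hand.

    import org.apache.spark.sql.functions.{bucket, col, days}

    // df is the DataFrame being written
    df.writeTo("catalog.db.events")
      .partitionedBy(days(col("event_ts")), bucket(16, col("user_id")))
      .create()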

Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-16 Thread Xiao Li
-1 Let us include the correctness fix: https://github.com/apache/spark/pull/27229 Thanks, Xiao On Thu, Jan 16, 2020 at 8:46 AM Dongjoon Hyun wrote: > Thank you, Jungtaek! > > Bests, > Dongjoon. > > > On Wed, Jan 15, 2020 at 8:57 PM Jungtaek Lim > wrote: > >> Once we decided to cancel the

Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-16 Thread Dongjoon Hyun
Thank you, Jungtaek! Bests, Dongjoon. On Wed, Jan 15, 2020 at 8:57 PM Jungtaek Lim wrote: > Once we decided to cancel the RC1, what about including SPARK-29450 ( > https://github.com/apache/spark/pull/27209) into RC2? > > SPARK-29450 was merged into master, and Xiao figured out it fixed a >

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Hyukjin Kwon
I think the problem here is whether there is an explicit plan or not. The PR was merged one year ago and not many changes have been made to this API to address the main concerns mentioned. Also, the followup JIRA requested seems to still be open: https://issues.apache.org/jira/browse/SPARK-27386 I heard this

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Wenchen Fan
The DS v2 project is still evolving, so half-baked features are sometimes inevitable. This feature is definitely in the right direction to allow more flexible partition implementations, but there are a few problems we can discuss. About expression duplication. This is an existing design choice. We don't

[DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Hyukjin Kwon
Hi all, I would like to suggest taking one step back at https://github.com/apache/spark/pull/24117 and rethinking it. I am writing this email as I raised the issue a few times but could not get enough responses promptly, and the code freeze is close. In particular, please refer to the

Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Driesprong, Fokko
Hi Bing, Good question, and the answer is: it depends on what your use case is. If you really just want to write raw bytes, then you could create a .foreach where you open an OutputStream and write it to some file, as sketched below. But this is probably not what you want, and in practice not very handy since you
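
A bare-bones sketch of that "open an OutputStream yourself" approach (the output path is made up, the directory is assumed to exist, and note the files land on each executor's local disk with none of the committer guarantees a real sink gives you):

    import java.io.{BufferedOutputStream, FileOutputStream}
    import java.util.UUID

    // assuming `rdd` is the RDD[Array[Byte]] in question
    rdd.foreachPartition { iter =>
      val out = new BufferedOutputStream(
        new FileOutputStream(s"/tmp/binary-out/part-${UUID.randomUUID()}.bin"))
      try iter.foreach(bytes => out.write(bytes)) finally out.close()
    }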

How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Duan,Bing
Hi all: I read binary data(protobuf format) from filesystem by binaryFiles function to a RDD[Array[Byte]] it works fine. But when I save the it to filesystem by saveAsTextFile, the quotation mark was be escaped like this: "\"20192_1\"",1,24,0,2,"\"S66.000x001\””,which should be