[Discuss] Metrics Support for DS V2

2020-01-16 Thread Sandeep Katta
Hi Devs,

Currently DS V2 does not update any input metrics. SPARK-30362 aims at
solving this problem.

We can take the approach below: have a marker interface, let's say
"ReportMetrics".

If the data source implements this interface, then it will be easy to
collect the metrics.

For example, FilePartitionReaderFactory can support metrics.

So it will be easy to collect the metrics if FilePartitionReaderFactory
implements ReportMetrics.
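
For illustration, here is a minimal sketch of what such a marker interface
could look like. The trait name comes from this proposal, but the method and
the metric fields are assumptions for discussion, not an existing Spark API:

// Hypothetical marker interface for DSv2 readers that can report input metrics.
trait ReportMetrics {
  // Spark would call this while (or after) a partition is read.
  def currentMetrics(): InputMetricsSnapshot
}

// Simple immutable snapshot of what a reader has consumed so far.
case class InputMetricsSnapshot(bytesRead: Long, recordsRead: Long)

A reader created by FilePartitionReaderFactory could then mix in ReportMetrics,
and Spark could check reader.isInstanceOf[ReportMetrics] to decide whether to
poll it and update the task's input metrics.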


Please let me know your views, or whether we want to have a new solution or
design.


Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-16 Thread Wenchen Fan
The proposal makes sense to me. If we are not going to make the interval type
ANSI-compliant in this release, we should not expose it widely.

Thanks for driving it, Kent!

On Fri, Jan 17, 2020 at 10:52 AM Dr. Kent Yao  wrote:

> Following ANSI might be a good option, but it is also a serious user
> behavior change to introduce two different interval types, so I also agree
> with Reynold to follow what we have done since version 1.5.0, just like
> Snowflake and Redshift.
>
> Perhaps we can make some efforts for the current interval type to make it
> more future-proof, e.g.:
> 1. Add an unstable annotation to the CalendarInterval class. People
> already use it as UDF input, so it’s better to make it clear it’s unstable.
> 2. Add a schema checker to prohibit creating v2 custom catalog tables with
> intervals, the same as what we do for the built-in catalog.
> 3. Add a schema checker for DataFrameWriterV2 too.
> 4. Make the interval type incomparable, as in version 2.4, to avoid
> ambiguity when comparing year-month and day-time fields.
> 5. The newly added to_csv in 3.0 should not support outputting intervals,
> the same as the CSV file format.
> 6. The function to_json should not allow using interval as a key field,
> the same as the value field and the JSON datasource, with a legacy config
> to restore the old behavior.
> 7. Revert the interval ISO/ANSI SQL Standard output since we decided not
> to follow ANSI, so there is no round trip.
>
> Bests,
>
> Kent
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>


Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Hyukjin Kwon
Each configuration already has its own documentation. What we need to do is
just to list them in one place.

On Fri, Jan 17, 2020 at 12:25 PM, Jules Damji wrote:

> It’s one thing to get the names/values of the configurations via
> spark.sql(“set -v”), but another thing to understand what each achieves and
> when and why you’ll want to use it.
>
> A webpage with a table and a description of each would be a huge benefit.
>
> Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jan 16, 2020, at 11:04 AM, Shixiong(Ryan) Zhu 
> wrote:
>
> 
> "spark.sql("set -v")" returns a Dataset that has all non-internal SQL
> configurations. Should be pretty easy to automatically generate a SQL
> configuration page.
>
> Best Regards,
> Ryan
>
>
> On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon  wrote:
>
>> I think automatically creating a configuration page isn't a bad idea
>> because I think we deprecate and remove configurations which are not
>> created via .internal() in SQLConf anyway.
>>
>> I already tried this automatic generation from the codes at SQL built-in
>> functions and I'm pretty sure we can do the similar thing for
>> configurations as well.
>>
>> We could perhaps mimic what hadoop does
>> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>>
>> On Wed, 15 Jan 2020, 10:46 Sean Owen,  wrote:
>>
>>> Some of it is intentionally undocumented, as far as I know, as an
>>> experimental option that may change, or legacy, or safety valve flag.
>>> Certainly anything that's marked an internal conf. (That does raise
>>> the question of who it's for, if you have to read source to find it.)
>>>
>>> I don't know if we need to overhaul the conf system, but there may
>>> indeed be some confs that could legitimately be documented. I don't
>>> know which.
>>>
>>> On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas
>>>  wrote:
>>> >
>>> > I filed SPARK-30510 thinking that we had forgotten to document an
>>> option, but it turns out that there's a whole bunch of stuff under
>>> SQLConf.scala that has no public documentation under
>>> http://spark.apache.org/docs.
>>> >
>>> > Would it be appropriate to somehow automatically generate a
>>> documentation page from SQLConf.scala, as Hyukjin suggested on that ticket?
>>> >
>>> > Another thought that comes to mind is moving the config definitions
>>> out of Scala and into a data format like YAML or JSON, and then sourcing
>>> that both for SQLConf as well as for whatever documentation page we want to
>>> generate. What do you think of that idea?
>>> >
>>> > Nick
>>> >
>>>
>>>
>>>


Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Jules Damji
It’s one thing to get the names/values of the configurations via
spark.sql(“set -v”), but another thing to understand what each achieves and
when and why you’ll want to use it.

A webpage with a table and a description of each would be a huge benefit.

Cheers 
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Jan 16, 2020, at 11:04 AM, Shixiong(Ryan) Zhu  
> wrote:
> 
> 
> "spark.sql("set -v")" returns a Dataset that has all non-internal SQL 
> configurations. Should be pretty easy to automatically generate a SQL 
> configuration page.
> Best Regards,
> 
> Ryan
> 
> 
>> On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon  wrote:
>> I think automatically creating a configuration page isn't a bad idea because 
>> I think we deprecate and remove configurations which are not created via 
>> .internal() in SQLConf anyway.
>> 
>> I already tried this automatic generation from the codes at SQL built-in 
>> functions and I'm pretty sure we can do the similar thing for configurations 
>> as well.
>> 
>> We could perhaps mimic what hadoop does 
>> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>> 
>>> On Wed, 15 Jan 2020, 10:46 Sean Owen,  wrote:
>>> Some of it is intentionally undocumented, as far as I know, as an
>>> experimental option that may change, or legacy, or safety valve flag.
>>> Certainly anything that's marked an internal conf. (That does raise
>>> the question of who it's for, if you have to read source to find it.)
>>> 
>>> I don't know if we need to overhaul the conf system, but there may
>>> indeed be some confs that could legitimately be documented. I don't
>>> know which.
>>> 
>>> On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas
>>>  wrote:
>>> >
>>> > I filed SPARK-30510 thinking that we had forgotten to document an option, 
>>> > but it turns out that there's a whole bunch of stuff under SQLConf.scala 
>>> > that has no public documentation under http://spark.apache.org/docs.
>>> >
>>> > Would it be appropriate to somehow automatically generate a documentation 
>>> > page from SQLConf.scala, as Hyukjin suggested on that ticket?
>>> >
>>> > Another thought that comes to mind is moving the config definitions out 
>>> > of Scala and into a data format like YAML or JSON, and then sourcing that 
>>> > both for SQLConf as well as for whatever documentation page we want to 
>>> > generate. What do you think of that idea?
>>> >
>>> > Nick
>>> >
>>> 
>>> 


Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Hyukjin Kwon
Nicholas, are you interested in taking a stab at this? You could refer to
https://github.com/apache/spark/commit/60472dbfd97acfd6c4420a13f9b32bc9d84219f3

On Fri, Jan 17, 2020 at 8:48 AM, Takeshi Yamamuro wrote:

> The idea looks nice. I think web documents always help end users.
>
> Bests,
> Takeshi
>
> On Fri, Jan 17, 2020 at 4:04 AM Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> "spark.sql("set -v")" returns a Dataset that has all non-internal SQL
>> configurations. Should be pretty easy to automatically generate a SQL
>> configuration page.
>>
>> Best Regards,
>> Ryan
>>
>>
>> On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon  wrote:
>>
>>> I think automatically creating a configuration page isn't a bad idea
>>> because I think we deprecate and remove configurations which are not
>>> created via .internal() in SQLConf anyway.
>>>
>>> I already tried this automatic generation from the codes at SQL built-in
>>> functions and I'm pretty sure we can do the similar thing for
>>> configurations as well.
>>>
>>> We could perhaps mimic what hadoop does
>>> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>>>
>>> On Wed, 15 Jan 2020, 10:46 Sean Owen,  wrote:
>>>
 Some of it is intentionally undocumented, as far as I know, as an
 experimental option that may change, or legacy, or safety valve flag.
 Certainly anything that's marked an internal conf. (That does raise
 the question of who it's for, if you have to read source to find it.)

 I don't know if we need to overhaul the conf system, but there may
 indeed be some confs that could legitimately be documented. I don't
 know which.

 On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas
  wrote:
 >
 > I filed SPARK-30510 thinking that we had forgotten to document an
 option, but it turns out that there's a whole bunch of stuff under
 SQLConf.scala that has no public documentation under
 http://spark.apache.org/docs.
 >
 > Would it be appropriate to somehow automatically generate a
 documentation page from SQLConf.scala, as Hyukjin suggested on that ticket?
 >
 > Another thought that comes to mind is moving the config definitions
 out of Scala and into a data format like YAML or JSON, and then sourcing
 that both for SQLConf as well as for whatever documentation page we want to
 generate. What do you think of that idea?
 >
 > Nick
 >



>
> --
> ---
> Takeshi Yamamuro
>


Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-16 Thread Dr. Kent Yao
Following ANSI might be a good option, but it is also a serious user behavior
change to introduce two different interval types, so I also agree with Reynold
to follow what we have done since version 1.5.0, just like Snowflake and
Redshift.

Perhaps we can make some efforts for the current interval type to make it
more future-proof, e.g.:
1. Add an unstable annotation to the CalendarInterval class. People already
use it as UDF input, so it’s better to make it clear it’s unstable.
2. Add a schema checker to prohibit creating v2 custom catalog tables with
intervals, the same as what we do for the built-in catalog (a rough sketch of
such a checker follows below).
3. Add a schema checker for DataFrameWriterV2 too.
4. Make the interval type incomparable, as in version 2.4, to avoid ambiguity
when comparing year-month and day-time fields.
5. The newly added to_csv in 3.0 should not support outputting intervals, the
same as the CSV file format.
6. The function to_json should not allow using interval as a key field, the
same as the value field and the JSON datasource, with a legacy config to
restore the old behavior.
7. Revert the interval ISO/ANSI SQL Standard output since we decided not to
follow ANSI, so there is no round trip.
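
As an illustration of item 2, here is a minimal sketch of such a schema
checker. The helper name and error message are assumptions for discussion,
not existing Spark code:

import org.apache.spark.sql.types._

// Hypothetical helper: reject any schema that contains an interval type,
// including intervals nested inside arrays, maps, or structs.
def assertNoIntervalType(dt: DataType): Unit = dt match {
  case CalendarIntervalType =>
    throw new UnsupportedOperationException(
      "Cannot create a table with a column of interval type")
  case s: StructType => s.fields.foreach(f => assertNoIntervalType(f.dataType))
  case a: ArrayType => assertNoIntervalType(a.elementType)
  case m: MapType =>
    assertNoIntervalType(m.keyType)
    assertNoIntervalType(m.valueType)
  case _ => // all other types are fine
}

A v2 catalog (or DataFrameWriterV2) could call this on the table schema before
creating the table, mirroring what the built-in catalog already does.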

Bests,

Kent




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/




Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Hyukjin Kwon
Thanks for giving me some context and clarification, Ryan.

I think I was rather proposing to revert because I don't see an explicit
plan here and it was just left half-done for a long while.
From reading the PR description and code, I could not guess in which way
we should fix this API (e.g., is this expression only for partitioning, or a
replacement for all expressions?). Also, if you take a look at the commit
log, it has not been touched for 10 months except for code moves or minor
fixes.

Do you mind if I ask how we plan to extend this feature? For example:
- whether we want to replace the existing expressions in the end
- whether we want to add a copy of the expressions for some reason
- how we will handle ambiguity of supported expressions between other
datasource implementations and Spark
- if we can't tell which other expressions are supported here, why we don't
just use a different syntax to clarify

If we have this plan or doc, and people can fix accordingly with
incremental improvements, I am good to keep it.


Here are some of followup questions and answers:

> I don't think there is reason to revert this simply because of some of
the early choices, like deciding to start a public expression API. If you'd
like to extend this to "fix" areas where you find it confusing, then please
do.

If it's clear that we should redesign the API, or there is no further plan
for that API at this moment, I think reverting can be an option, in
particular considering that the code freeze is close. For example, why
did we take a UDF-like route, exposing only a function interface?


> The idea was that Spark needs a public expression API anyway for other
uses

I was wondering why we need a public expression API in DSv2. Are there
places that UDFs can't cover?
As said above, it requires duplicating the existing expressions, and I
wonder if this is worthwhile.
With the stub of the Transform API, it looks like we want this, but I don't
know why.


> None of this has been confusing or misleading for our users, who caught
on quickly.

Maybe simple cases wouldn't bring so much confusion if users were already
told about it.
However, if we think about the differences and subtleties, I doubt that
users already know the answers:

CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*

  - It looks like an expression and seems to allow other expressions /
combinations.
  - Since the expressions are handled by DSv2, the behaviours depend on the
DSv2 implementation, e.g., using *transform* against datasource
implementations A and B can differ.
  - Likewise, if Spark supports *transform* here, the behaviour will be
different.


On Fri, Jan 17, 2020 at 2:36 AM, Ryan Blue wrote:

> Hi everyone,
>
> Let me recap some of the discussions that got us to where we are with this
> today. Hopefully that will provide some clarity.
>
> The purpose of partition transforms is to allow source implementations to
> internally handle partitioning. Right now, users are responsible for this.
> For example, users will transform timestamps into date strings when writing
> and other people will provide a filter on those date strings when scanning.
> This is error-prone: users commonly forget to add partition filters in
> addition to data filters; if anyone uses the wrong format or transformation,
> queries will silently return incorrect results; etc. But sources can (and
> should) automatically handle storing and retrieving data internally because
> it is much easier for users.
>
> When we first proposed transforms, I wanted to use Expression. But Reynold
> rightly pointed out that Expression is an internal API that should not be
> exposed. So we decided to compromise by building a public expressions API
> like the public Filter API for the initial purpose of passing transform
> expressions to sources. The idea was that Spark needs a public expression
> API anyway for other uses, like requesting a distribution and ordering for
> a writer. To keep things simple, we chose to build a minimal public
> expression API and expand it incrementally as we need more features.
>
> We also considered whether to parse all expressions and convert only
> transformations to the public API, or to parse just transformations. We
> went with just parsing transformations because it was easier and we can
> expand it to improve error messages later.
>
> I don't think there is reason to revert this simply because of some of the
> early choices, like deciding to start a public expression API. If you'd
> like to extend this to "fix" areas where you find it confusing, then please
> do. We know that by parsing more expressions we could improve error
> messages. But that's not to say that we need to revert it.
>
> None of this has been confusing or misleading for our users, who caught on
> quickly.
>
> On Thu, Jan 16, 2020 at 5:14 AM Hyukjin Kwon  wrote:
>
>> I think the problem here is if there is an explicit plan or not.
>> The PR was merged one year ago and not many changes have been made to
>> t

Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Takeshi Yamamuro
The idea looks nice. I think web documents always help end users.

Bests,
Takeshi

On Fri, Jan 17, 2020 at 4:04 AM Shixiong(Ryan) Zhu 
wrote:

> "spark.sql("set -v")" returns a Dataset that has all non-internal SQL
> configurations. Should be pretty easy to automatically generate a SQL
> configuration page.
>
> Best Regards,
> Ryan
>
>
> On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon  wrote:
>
>> I think automatically creating a configuration page isn't a bad idea
>> because I think we deprecate and remove configurations which are not
>> created via .internal() in SQLConf anyway.
>>
>> I already tried this automatic generation from the codes at SQL built-in
>> functions and I'm pretty sure we can do the similar thing for
>> configurations as well.
>>
>> We could perhaps mimic what hadoop does
>> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>>
>> On Wed, 15 Jan 2020, 10:46 Sean Owen,  wrote:
>>
>>> Some of it is intentionally undocumented, as far as I know, as an
>>> experimental option that may change, or legacy, or safety valve flag.
>>> Certainly anything that's marked an internal conf. (That does raise
>>> the question of who it's for, if you have to read source to find it.)
>>>
>>> I don't know if we need to overhaul the conf system, but there may
>>> indeed be some confs that could legitimately be documented. I don't
>>> know which.
>>>
>>> On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas
>>>  wrote:
>>> >
>>> > I filed SPARK-30510 thinking that we had forgotten to document an
>>> option, but it turns out that there's a whole bunch of stuff under
>>> SQLConf.scala that has no public documentation under
>>> http://spark.apache.org/docs.
>>> >
>>> > Would it be appropriate to somehow automatically generate a
>>> documentation page from SQLConf.scala, as Hyukjin suggested on that ticket?
>>> >
>>> > Another thought that comes to mind is moving the config definitions
>>> out of Scala and into a data format like YAML or JSON, and then sourcing
>>> that both for SQLConf as well as for whatever documentation page we want to
>>> generate. What do you think of that idea?
>>> >
>>> > Nick
>>> >
>>>
>>>
>>>

-- 
---
Takeshi Yamamuro


Re: [FYI] SBT Build Failure

2020-01-16 Thread Sean Owen
Ah. The Maven build has long since pointed at https:// for
resolution, for security. I tried just overriding the resolver for the
SBT build, but it doesn't seem to work. I don't understand the SBT
build well enough to debug right now. I think it's possible to
override resolvers with local config files as a workaround (a sketch
follows below).
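
For anyone who wants to try that local-config workaround, here is a rough
sketch. The file location and the override property are standard sbt
mechanisms, but treat the exact entries as assumptions to adapt rather than a
verified fix for this build:

# ~/.sbt/repositories -- resolve dependencies through HTTPS mirrors
[repositories]
  local
  maven-central: https://repo1.maven.org/maven2/

# Then ask sbt to prefer these resolvers over the ones defined in the build:
$ build/sbt -Dsbt.override.build.repos=true clean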

On Thu, Jan 16, 2020 at 4:33 PM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> As of now, Apache Spark sbt build is broken by the Maven Central repository 
> policy.
>
> - 
> https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an
>
> > Effective January 15, 2020, The Central Maven Repository no longer supports 
> > insecure
> > communication over plain HTTP and requires that all requests to the 
> > repository are
> > encrypted over HTTPS.
>
> You can reproduce this locally by the following.
>
> $ rm -rf ~/.m2/repository/org/apache/apache/18/
> $ build/sbt clean
>
> The usual suspects are the following two.
>
> - SBT: sbt 0.13.18
> - Plugin: addSbtPlugin("org.spark-project" % "sbt-pom-reader" % 
> "1.0.0-spark")
>
> Bests,
> Dongjoon.




[FYI] SBT Build Failure

2020-01-16 Thread Dongjoon Hyun
Hi, All.

As of now, Apache Spark sbt build is broken by the Maven Central repository
policy.

-
https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an

> Effective January 15, 2020, The Central Maven Repository no longer
supports insecure
> communication over plain HTTP and requires that all requests to the
repository are
> encrypted over HTTPS.

You can reproduce this locally by the following.

$ rm -rf ~/.m2/repository/org/apache/apache/18/
$ build/sbt clean

The usual suspects are the following two.

- SBT: sbt 0.13.18
- Plugin: addSbtPlugin("org.spark-project" % "sbt-pom-reader" %
"1.0.0-spark")

Bests,
Dongjoon.


Re: PR lint-scala jobs failing with http error

2020-01-16 Thread Dongjoon Hyun
Hi, Tom and Shane.

It looks like an old `sbt` bug. Maven Central seems to have recently started
banning plain `http` access.

If you use Maven, it's okay because it goes to `https`.

$ build/sbt clean
[error] org.apache.maven.model.building.ModelBuildingException: 1 problem
was encountered while building the effective model for
org.apache.spark:spark-parent_2.12:3.0.0-SNAPSHOT
[error] [FATAL] Non-resolvable parent POM: Could not transfer artifact
org.apache:apache:pom:18 from/to central (
http://repo.maven.apache.org/maven2): Error transferring file: Server
returned HTTP response code: 501 for URL:
http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom from
 http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom and
'parent.relativePath' points at wrong local POM @ line 22, column 11

GitHub Actions uses Maven for building, but `dev/lint-scala` ->
`dev/scalastyle` -> `build/sbt`, so that step fails.

I guess the Jenkins SBT job will fail if we clean up the local Maven cache.

Bests,
Dongjoon.




On Thu, Jan 16, 2020 at 2:08 PM Shane Knapp  wrote:

> ah ok...  looks like these were set up by dongjoon a while back.  i've
> added him to this thread as i can't see the settings in the spark
> github repo.
>
>
> On Thu, Jan 16, 2020 at 1:58 PM Tom Graves  wrote:
> >
> > Sorry should have included the link. It shows up in the pre checks
> failures, but the tests still run and pass. For instance:
> > https://github.com/apache/spark/pull/26682
> >
> > more:
> > https://github.com/apache/spark/pull/27240/checks?check_run_id=393888081
> > https://github.com/apache/spark/pull/27233/checks?check_run_id=393123209
> > https://github.com/apache/spark/pull/27239/checks?check_run_id=393884643
> >
> > Tom
> > On Thursday, January 16, 2020, 03:17:03 PM CST, Shane Knapp <
> skn...@berkeley.edu> wrote:
> >
> >
> > i'm seeing a lot of green builds currently...  if you think this is
> > still happening, please include links to the failed jobs.  thanks!
> >
> > shane (at a conference)
> >
> > On Thu, Jan 16, 2020 at 11:16 AM Tom Graves 
> wrote:
> > >
> > > I'm seeing the scala-lint jobs fail on the pull request builds with:
> > >
> > > [error] [FATAL] Non-resolvable parent POM: Could not transfer artifact
> org.apache:apache:pom:18 from/to central (
> http://repo.maven.apache.org/maven2): Error transferring file: Server
> returned HTTP response code: 501 for URL:
> http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom
> from
> http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom
> and 'parent.relativePath' points at wrong local POM @ line 22, column 11
> > >
> > > It seems we are hitting the http endpoint vs the https one. Our pom
> file already has the repo as the https version though.
> > >
> > > Anyone know why its trying to go to http version?
> > >
> > >
> > > Tom
> >
> >
> >
> >
> > --
> > Shane Knapp
> > Computer Guy / Voice of Reason
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu
> >
> >
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: PR lint-scala jobs failing with http error

2020-01-16 Thread Shane Knapp
ah ok...  looks like these were set up by dongjoon a while back.  i've
added him to this thread as i can't see the settings in the spark
github repo.


On Thu, Jan 16, 2020 at 1:58 PM Tom Graves  wrote:
>
> Sorry should have included the link. It shows up in the pre checks failures, 
> but the tests still run and pass. For instance:
> https://github.com/apache/spark/pull/26682
>
> more:
> https://github.com/apache/spark/pull/27240/checks?check_run_id=393888081
> https://github.com/apache/spark/pull/27233/checks?check_run_id=393123209
> https://github.com/apache/spark/pull/27239/checks?check_run_id=393884643
>
> Tom
> On Thursday, January 16, 2020, 03:17:03 PM CST, Shane Knapp 
>  wrote:
>
>
> i'm seeing a lot of green builds currently...  if you think this is
> still happening, please include links to the failed jobs.  thanks!
>
> shane (at a conference)
>
> On Thu, Jan 16, 2020 at 11:16 AM Tom Graves  wrote:
> >
> > I'm seeing the scala-lint jobs fail on the pull request builds with:
> >
> > [error] [FATAL] Non-resolvable parent POM: Could not transfer artifact 
> > org.apache:apache:pom:18 from/to central ( 
> > http://repo.maven.apache.org/maven2): Error transferring file: Server 
> > returned HTTP response code: 501 for URL: 
> > http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom from 
> > http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom and 
> > 'parent.relativePath' points at wrong local POM @ line 22, column 11
> >
> > It seems we are hitting the http endpoint vs the https one. Our pom file 
> > already has the repo as the https version though.
> >
> > Anyone know why its trying to go to http version?
> >
> >
> > Tom
>
>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
>


--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu




Re: PR lint-scala jobs failing with http error

2020-01-16 Thread Tom Graves
Sorry, I should have included the links. It shows up in the pre-check failures,
but the tests still run and pass. For instance:
https://github.com/apache/spark/pull/26682

More:
https://github.com/apache/spark/pull/27240/checks?check_run_id=393888081
https://github.com/apache/spark/pull/27233/checks?check_run_id=393123209
https://github.com/apache/spark/pull/27239/checks?check_run_id=393884643

Tom

On Thursday, January 16, 2020, 03:17:03 PM CST, Shane Knapp wrote:
 
 i'm seeing a lot of green builds currently...  if you think this is
still happening, please include links to the failed jobs.  thanks!

shane (at a conference)

On Thu, Jan 16, 2020 at 11:16 AM Tom Graves  wrote:
>
> I'm seeing the scala-lint jobs fail on the pull request builds with:
>
> [error] [FATAL] Non-resolvable parent POM: Could not transfer artifact 
> org.apache:apache:pom:18 from/to central ( 
> http://repo.maven.apache.org/maven2): Error transferring file: Server 
> returned HTTP response code: 501 for URL: 
> http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom from 
> http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom and 
> 'parent.relativePath' points at wrong local POM @ line 22, column 11
>
> It seems we are hitting the http endpoint vs the https one. Our pom file 
> already has the repo as the https version though.
>
> Anyone know why its trying to go to http version?
>
>
> Tom



-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

  

Re: PR lint-scala jobs failing with http error

2020-01-16 Thread Shane Knapp
i'm seeing a lot of green builds currently...  if you think this is
still happening, please include links to the failed jobs.  thanks!

shane (at a conference)

On Thu, Jan 16, 2020 at 11:16 AM Tom Graves  wrote:
>
> I'm seeing the scala-lint jobs fail on the pull request builds with:
>
> [error] [FATAL] Non-resolvable parent POM: Could not transfer artifact 
> org.apache:apache:pom:18 from/to central ( 
> http://repo.maven.apache.org/maven2): Error transferring file: Server 
> returned HTTP response code: 501 for URL: 
> http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom from 
> http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom and 
> 'parent.relativePath' points at wrong local POM @ line 22, column 11
>
> It seems we are hitting the http endpoint vs the https one. Our pom file 
> already has the repo as the https version though.
>
> Anyone know why its trying to go to http version?
>
>
> Tom



-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu




Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Maxim Gekk
Hi Bing,

You can try the Text datasource. It shouldn't modify strings:

scala> Seq(""""20192_1",1,24,0,2,”S66.000x001”""").toDS.write.text("tmp/text.txt")
$ cat tmp/text.txt/part-0-256d960f-9f85-47fe-8edd-8428276eb3c6-c000.txt
"20192_1",1,24,0,2,”S66.000x001”

Maxim Gekk

Software Engineer

Databricks B. V.  


On Thu, Jan 16, 2020 at 10:02 PM Long, Andrew 
wrote:

> Hey Bing,
>
>
>
> There’s a couple different approaches you could take.  The quickest and
> easiest would be to use the existing APIs
>
>
>
> val bytes = spark.range(1000)
>
> bytes.foreachPartition(bytes => {
>   // WARNING: anything used in here will need to be serializable.
>   // There's some magic to serializing the hadoop conf. See the hadoop
>   // wrapper class in the source.
>   val writer = FileSystem.get(null).create(new Path("s3://..."))
>   bytes.foreach(b => writer.write(b))
>   writer.close()
> })
>
>
>
> The more complicated but pretty approach would be to either implement a
> custom datasource.
>
>
>
> From: "Duan,Bing" 
> Date: Thursday, January 16, 2020 at 12:35 AM
> To: "dev@spark.apache.org" 
> Subject: How to implement a "saveAsBinaryFile" function?
>
>
>
> Hi all:
>
>
>
> I read binary data (protobuf format) from the filesystem with the
> binaryFiles function into an RDD[Array[Byte]], and it works fine. But when
> I save it back to the filesystem with saveAsTextFile, the quotation marks
> get escaped like this:
>
> "\"20192_1\"",1,24,0,2,"\"S66.000x001\””,which  should
> be "20192_1",1,24,0,2,”S66.000x001”.
>
>
>
> Could anyone give me some tips on implementing a function
> like saveAsBinaryFile to persist the RDD[Array[Byte]]?
>
>
>
> Bests!
>
>
>
> Bing
>


PR lint-scala jobs failing with http error

2020-01-16 Thread Tom Graves
I'm seeing the scala-lint jobs fail on the pull request builds with:
[error] [FATAL] Non-resolvable parent POM: Could not transfer artifact 
org.apache:apache:pom:18 from/to central ( 
http://repo.maven.apache.org/maven2): Error transferring file: Server returned 
HTTP response code: 501 for URL: 
http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom from 
http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom and 
'parent.relativePath' points at wrong local POM @ line 22, column 11

It seems we are hitting the http endpoint vs the https one. Our pom file
already has the repo as the https version though.
Anyone know why it's trying to go to the http version?

Tom

Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Shixiong(Ryan) Zhu
"spark.sql("set -v")" returns a Dataset that has all non-internal SQL
configurations. Should be pretty easy to automatically generate a SQL
configuration page.
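
For reference, a rough sketch of what such a generator could look like. This
is illustrative only: it assumes the SET -v output has the columns key, value,
and meaning, and the output file name is made up:

import org.apache.spark.sql.SparkSession

object GenerateSQLConfigDoc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gen-sql-config-doc").master("local[1]").getOrCreate()

    // SET -v lists the non-internal SQL configurations.
    val rows = spark.sql("SET -v").collect()

    val table = new StringBuilder("Property Name | Default | Meaning\n--- | --- | ---\n")
    rows.sortBy(_.getString(0)).foreach { r =>
      table.append(s"${r.getString(0)} | ${r.getString(1)} | ${r.getString(2)}\n")
    }

    // Write a markdown-ish table that a docs page could include.
    java.nio.file.Files.write(
      java.nio.file.Paths.get("sql-configs-generated.md"),
      table.toString.getBytes("UTF-8"))
    spark.stop()
  }
}

Hooking something like this into the docs build would keep the page in sync
with SQLConf automatically, similar to what was done for the SQL built-in
function docs.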

Best Regards,
Ryan


On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon  wrote:

> I think automatically creating a configuration page isn't a bad idea
> because I think we deprecate and remove configurations which are not
> created via .internal() in SQLConf anyway.
>
> I already tried this automatic generation from the codes at SQL built-in
> functions and I'm pretty sure we can do the similar thing for
> configurations as well.
>
> We could perhaps mimic what hadoop does
> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>
> On Wed, 15 Jan 2020, 10:46 Sean Owen,  wrote:
>
>> Some of it is intentionally undocumented, as far as I know, as an
>> experimental option that may change, or legacy, or safety valve flag.
>> Certainly anything that's marked an internal conf. (That does raise
>> the question of who it's for, if you have to read source to find it.)
>>
>> I don't know if we need to overhaul the conf system, but there may
>> indeed be some confs that could legitimately be documented. I don't
>> know which.
>>
>> On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas
>>  wrote:
>> >
>> > I filed SPARK-30510 thinking that we had forgotten to document an
>> option, but it turns out that there's a whole bunch of stuff under
>> SQLConf.scala that has no public documentation under
>> http://spark.apache.org/docs.
>> >
>> > Would it be appropriate to somehow automatically generate a
>> documentation page from SQLConf.scala, as Hyukjin suggested on that ticket?
>> >
>> > Another thought that comes to mind is moving the config definitions out
>> of Scala and into a data format like YAML or JSON, and then sourcing that
>> both for SQLConf as well as for whatever documentation page we want to
>> generate. What do you think of that idea?
>> >
>> > Nick
>> >
>>
>>
>>


Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Long, Andrew
Hey Bing,

There are a couple of different approaches you could take. The quickest and
easiest would be to use the existing APIs:

val bytes = spark.range(1000)

bytes.foreachPartition(bytes => {
  // WARNING: anything used in here will need to be serializable.
  // There's some magic to serializing the hadoop conf. See the hadoop wrapper
  // class in the source.
  val writer = FileSystem.get(null).create(new Path("s3://..."))
  bytes.foreach(b => writer.write(b))
  writer.close()
})
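
Building on that snippet, here is a slightly more complete sketch. The paths
are placeholders, sc is assumed to be the SparkContext, and building the Hadoop
Configuration on the executors is just one way around the serialization issue
mentioned in the comment above:

import java.util.UUID
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val outputDir = "hdfs:///tmp/binary-out"  // placeholder output location

// e.g. an RDD[Array[Byte]] read from binary files
val data = sc.binaryFiles("hdfs:///tmp/binary-in").map(_._2.toArray)

data.foreachPartition { iter =>
  // Build the Hadoop conf on the executor so nothing non-serializable is captured.
  val conf = new Configuration()
  val path = new Path(s"$outputDir/part-${UUID.randomUUID()}")
  val out = FileSystem.get(path.toUri, conf).create(path)
  try {
    // Each record's raw bytes are written unmodified; records within a
    // partition are concatenated with no delimiter.
    iter.foreach(bytes => out.write(bytes))
  } finally {
    out.close()
  }
}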

The more complicated but prettier approach would be to implement a custom
datasource.

From: "Duan,Bing" 
Date: Thursday, January 16, 2020 at 12:35 AM
To: "dev@spark.apache.org" 
Subject: How to implement a "saveAsBinaryFile" function?

Hi all:

I read binary data (protobuf format) from the filesystem with the binaryFiles
function into an RDD[Array[Byte]], and it works fine. But when I save it back
to the filesystem with saveAsTextFile, the quotation marks get escaped like this:
"\"20192_1\"",1,24,0,2,"\"S66.000x001\””,which  should be 
"20192_1",1,24,0,2,”S66.000x001”.

Could anyone give me some tips on implementing a function like saveAsBinaryFile
to persist the RDD[Array[Byte]]?

Bests!

Bing


Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Felix Cheung
I think it’s a good idea


From: Hyukjin Kwon 
Sent: Wednesday, January 15, 2020 5:49:12 AM
To: dev 
Cc: Sean Owen ; Nicholas Chammas 
Subject: Re: More publicly documenting the options under spark.sql.*

Resending to the dev list for archive purpose:

I think automatically creating a configuration page isn't a bad idea because I 
think we deprecate and remove configurations which are not created via 
.internal() in SQLConf anyway.

I already tried this automatic generation from the codes at SQL built-in 
functions and I'm pretty sure we can do the similar thing for configurations as 
well.

We could perhaps mimic what hadoop does 
https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml

On Wed, 15 Jan 2020, 22:46 Hyukjin Kwon  wrote:
I think automatically creating a configuration page isn't a bad idea because I 
think we deprecate and remove configurations which are not created via 
.internal() in SQLConf anyway.

I already tried this automatic generation from the codes at SQL built-in 
functions and I'm pretty sure we can do the similar thing for configurations as 
well.

We could perhaps mimic what hadoop does 
https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml

On Wed, 15 Jan 2020, 10:46 Sean Owen  wrote:
Some of it is intentionally undocumented, as far as I know, as an
experimental option that may change, or legacy, or safety valve flag.
Certainly anything that's marked an internal conf. (That does raise
the question of who it's for, if you have to read source to find it.)

I don't know if we need to overhaul the conf system, but there may
indeed be some confs that could legitimately be documented. I don't
know which.

On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas  wrote:
>
> I filed SPARK-30510 thinking that we had forgotten to document an option, but 
> it turns out that there's a whole bunch of stuff under SQLConf.scala that has 
> no public documentation under http://spark.apache.org/docs.
>
> Would it be appropriate to somehow automatically generate a documentation 
> page from SQLConf.scala, as Hyukjin suggested on that ticket?
>
> Another thought that comes to mind is moving the config definitions out of 
> Scala and into a data format like YAML or JSON, and then sourcing that both 
> for SQLConf as well as for whatever documentation page we want to generate. 
> What do you think of that idea?
>
> Nick
>




Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Ryan Blue
Hi everyone,

Let me recap some of the discussions that got us to where we are with this
today. Hopefully that will provide some clarity.

The purpose of partition transforms is to allow source implementations to
internally handle partitioning. Right now, users are responsible for this.
For example, users will transform timestamps into date strings when writing
and other people will provide a filter on those date strings when scanning.
This is error-prone: users commonly forget to add partition filters in
addition to data filters; if anyone uses the wrong format or transformation,
queries will silently return incorrect results; etc. But sources can (and
should) automatically handle storing and retrieving data internally because
it is much easier for users.
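
As a concrete illustration of that manual pattern, here is a small sketch; the
paths and column names are made up, and it only shows the status quo described
above, not any new API:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

val spark = SparkSession.builder()
  .appName("partition-by-hand").master("local[1]").getOrCreate()
import spark.implicits._

// A toy events table with a timestamp column.
val events = Seq(("2020-01-16 12:34:56", 1L), ("2020-01-17 01:02:03", 2L))
  .toDF("ts", "id")
  .withColumn("ts", col("ts").cast("timestamp"))

// Today the user derives a partition column by hand when writing...
events.withColumn("dt", date_format(col("ts"), "yyyy-MM-dd"))
  .write.mode("overwrite").partitionBy("dt").parquet("/tmp/events")

// ...and everyone reading must remember to add the matching partition filter
// in exactly the same format; forgetting it silently scans every partition.
spark.read.parquet("/tmp/events")
  .where(col("dt") === "2020-01-16" && col("ts") > "2020-01-16 06:00:00")
  .show()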

When we first proposed transforms, I wanted to use Expression. But Reynold
rightly pointed out that Expression is an internal API that should not be
exposed. So we decided to compromise by building a public expressions API
like the public Filter API for the initial purpose of passing transform
expressions to sources. The idea was that Spark needs a public expression
API anyway for other uses, like requesting a distribution and ordering for
a writer. To keep things simple, we chose to build a minimal public
expression API and expand it incrementally as we need more features.

We also considered whether to parse all expressions and convert only
transformations to the public API, or to parse just transformations. We
went with just parsing transformations because it was easier and we can
expand it to improve error messages later.

I don't think there is reason to revert this simply because of some of the
early choices, like deciding to start a public expression API. If you'd
like to extend this to "fix" areas where you find it confusing, then please
do. We know that by parsing more expressions we could improve error
messages. But that's not to say that we need to revert it.

None of this has been confusing or misleading for our users, who caught on
quickly.

On Thu, Jan 16, 2020 at 5:14 AM Hyukjin Kwon  wrote:

> I think the problem here is whether there is an explicit plan or not.
> The PR was merged one year ago, and not many changes have been made to this
> API to address the main concerns mentioned.
> Also, the followup JIRA requested seems still open
> https://issues.apache.org/jira/browse/SPARK-27386
> I heard this was already discussed but I cannot find the summary of the
> meeting or any documentation.
>
> I would like to make sure how we plan to extend. I had a couple of
> questions such as:
>   - Why can't we use a UDF-like interface, for example?
>   - Is this expression only for partition or do we plan to expose this to
> replace other existing expressions?
>
> > About extensibility, it's similar to DS V1 Filter again. We don't cover
> all the expressions at the beginning, but we can add more in future
> versions when needed. The data source implementations should be defensive
> and fail when seeing unrecognized Filter/Transform.
>
> I think there are differences in that:
> - A DSv1 filter works whether the filters are pushed or not. However, this
> case does not work.
> - There are too many expressions whereas the number of predicates are
> relatively limited. If we plan to push all expressions eventually, I doubt
> if this is a good idea.
>
>
> On Thu, Jan 16, 2020 at 9:22 PM, Wenchen Fan wrote:
>
>> The DS v2 project is still evolving, so half-baked is inevitable
>> sometimes. This feature is definitely in the right direction to allow more
>> flexible partition implementations, but there are a few problems we can
>> discuss.
>>
>> About expression duplication. This is an existing design choice. We don't
>> want to expose the Expression class directly but we do need to expose some
>> Expression-like stuff in the developer APIs. So we pick some basic
>> expressions, make a copy and create a public version of them. This is what
>> we did for DS V1 Filter, and I think we can continue to do this for DS v2
>> Transform.
>>
>> About extensibility, it's similar to DS V1 Filter again. We don't cover
>> all the expressions at the beginning, but we can add more in future
>> versions when needed. The data source implementations should be defensive
>> and fail when seeing unrecognized Filter/Transform.
>>
>> About compatibility. This is the place that I have a concern as well. For
>> DS V1 Filter, we just expose all the Filter classes, like `EqualTo`,
>> `GreaterThan`, etc. These classes have well-defined semantic. For DS V2
>> Transform, we only expose the Transform interface, and data sources need to
>> look at `Transform#name` and search the document to see the semantic.
>> What's worse, the parser/analyzer allows arbitrary string as Transform
>> name, so it's impossible to have well-defined semantic, and also different
>> sources may have different semantic for the same Transform name.
>>
>> I'd suggest we forbid arbitrary string as Transform (the ApplyTransform
>> cla

Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-16 Thread Xiao Li
-1

Let us include the correctness fix:
https://github.com/apache/spark/pull/27229

Thanks,

Xiao

On Thu, Jan 16, 2020 at 8:46 AM Dongjoon Hyun 
wrote:

> Thank you, Jungtaek!
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jan 15, 2020 at 8:57 PM Jungtaek Lim 
> wrote:
>
>> Once we decided to cancel the RC1, what about including SPARK-29450 (
>> https://github.com/apache/spark/pull/27209) into RC2?
>>
>> SPARK-29450 was merged into master, and Xiao figured out it fixed a
>> regression, long lasting one (broken at 2.3.0). The link refers the PR for
>> 2.4 branch.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Thu, Jan 16, 2020 at 12:56 PM Dongjoon Hyun 
>> wrote:
>>
>>> Sure. Wenchen and Hyukjin.
>>>
>>> I observed all of the above reported issues and have been waiting to
>>> collect more information before cancelling RC1 vote.
>>>
>>> The other stuff I've observed is that Marcelo and Sean also requested
>>> reverting the existing commit.
>>> - https://github.com/apache/spark/pull/24732 (spark.shuffle.io.backLog
>>> change)
>>>
>>> To All.
>>> We want your explicit feedbacks. Please reply on this thread.
>>>
>>> Although we get enough positive feedbacks here, I'll cancel this RC1.
>>> I want to address at least the above negative feedbacks and roll RC2
>>> next Monday.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Jan 15, 2020 at 7:47 PM Hyukjin Kwon 
>>> wrote:
>>>
 If we go for RC2, we should include both:

 https://github.com/apache/spark/pull/27210
 https://github.com/apache/spark/pull/27184

 just for the sake of being complete and making the maintenance simple.


 On Thu, Jan 16, 2020 at 12:38 PM, Wenchen Fan wrote:

> Recently we merged several fixes to 2.4:
> https://issues.apache.org/jira/browse/SPARK-30325   a driver hang
> issue
> https://issues.apache.org/jira/browse/SPARK-30246   a memory leak
> issue
> https://issues.apache.org/jira/browse/SPARK-29708   a correctness
> issue(for a rarely used feature, so not merged to 2.4 yet)
>
> Shall we include them?
>
>
> On Wed, Jan 15, 2020 at 9:51 PM Hyukjin Kwon 
> wrote:
>
>> +1
>>
>> On Wed, 15 Jan 2020, 08:24 Takeshi Yamamuro, 
>> wrote:
>>
>>> +1;
>>>
>>> I checked the links and materials, then I run the tests with
>>> `-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes
>>> -Psparkr`
>>> on macOS (Java 8).
>>> All the things look fine and I didn't see the error on my env
>>> that Sean said above.
>>>
>>> Thanks, Dongjoon!
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Wed, Jan 15, 2020 at 4:09 AM DB Tsai  wrote:
>>>
 +1 Thanks.

 Sincerely,

 DB Tsai
 --
 Web: https://www.dbtsai.com
 PGP Key ID: 42E5B25A8F7A82C1

 On Tue, Jan 14, 2020 at 11:08 AM Sean Owen 
 wrote:
 >
 > Yeah it's something about the env I spun up, but I don't know
 what. It
 > happens frequently when I test, but not on Jenkins.
 > The Kafka error comes up every now and then and a clean rebuild
 fixes
 > it, but not in my case. I don't know why.
 > But if nobody else sees it, I'm pretty sure it's just an artifact
 of
 > the local VM.
 >
 > On Tue, Jan 14, 2020 at 12:57 PM Dongjoon Hyun <
 dongjoon.h...@gmail.com> wrote:
 > >
 > > Thank you, Sean.
 > >
 > > First of all, the `Ubuntu` job on Amplab Jenkins farm is green.
 > >
 > >
 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.4-test-sbt-hadoop-2.7-ubuntu-testing/
 > >
 > > For the failures,
 > >1. Yes, the `HiveExternalCatalogVersionsSuite` flakiness is
 a known one.
 > >2. For `HDFSMetadataLogSuite` failure, I also observed a few
 time before in CentOS too.
 > >3. Kafka build error is new to me. Does it happen on `Maven`
 clean build?
 > >
 > > Bests,
 > > Dongjoon.
 > >
 > >
 > > On Tue, Jan 14, 2020 at 6:40 AM Sean Owen 
 wrote:
 > >>
 > >> +1 from me. I checked sigs/licenses, and built/tested from
 source on
 > >> Java 8 + Ubuntu 18.04 with " -Pyarn -Phive -Phive-thriftserver
 > >> -Phadoop-2.7 -Pmesos -Pkubernetes -Psparkr -Pkinesis-asl". I
 do get
 > >> test failures, but, these are some I have always seen on
 Ubuntu, and I
 > >> do not know why they happen. They don't seem to affect others,
 but,
 > >> let me know if anyone else sees these?
 > >>
 > >>
 > >> Always happens for me:
 > >>
 > >> - HDFSMetadataLog: metadata directory collision *** FAILED ***
>>

Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-16 Thread Dongjoon Hyun
Thank you, Jungtaek!

Bests,
Dongjoon.


On Wed, Jan 15, 2020 at 8:57 PM Jungtaek Lim 
wrote:

> Once we decided to cancel the RC1, what about including SPARK-29450 (
> https://github.com/apache/spark/pull/27209) into RC2?
>
> SPARK-29450 was merged into master, and Xiao figured out it fixed a
> regression, long lasting one (broken at 2.3.0). The link refers the PR for
> 2.4 branch.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Thu, Jan 16, 2020 at 12:56 PM Dongjoon Hyun 
> wrote:
>
>> Sure. Wenchen and Hyukjin.
>>
>> I observed all of the above reported issues and have been waiting to
>> collect more information before cancelling RC1 vote.
>>
>> The other stuff I've observed is that Marcelo and Sean also requested
>> reverting the existing commit.
>> - https://github.com/apache/spark/pull/24732 (spark.shuffle.io.backLog
>> change)
>>
>> To All.
>> We want your explicit feedbacks. Please reply on this thread.
>>
>> Although we get enough positive feedbacks here, I'll cancel this RC1.
>> I want to address at least the above negative feedbacks and roll RC2 next
>> Monday.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Jan 15, 2020 at 7:47 PM Hyukjin Kwon  wrote:
>>
>>> If we go for RC2, we should include both:
>>>
>>> https://github.com/apache/spark/pull/27210
>>> https://github.com/apache/spark/pull/27184
>>>
>>> just for the sake of being complete and making the maintenance simple.
>>>
>>>
>>> On Thu, Jan 16, 2020 at 12:38 PM, Wenchen Fan wrote:
>>>
 Recently we merged several fixes to 2.4:
 https://issues.apache.org/jira/browse/SPARK-30325   a driver hang issue
 https://issues.apache.org/jira/browse/SPARK-30246   a memory leak issue
 https://issues.apache.org/jira/browse/SPARK-29708   a correctness
 issue(for a rarely used feature, so not merged to 2.4 yet)

 Shall we include them?


 On Wed, Jan 15, 2020 at 9:51 PM Hyukjin Kwon 
 wrote:

> +1
>
> On Wed, 15 Jan 2020, 08:24 Takeshi Yamamuro, 
> wrote:
>
>> +1;
>>
>> I checked the links and materials, then I run the tests with
>> `-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes
>> -Psparkr`
>> on macOS (Java 8).
>> All the things look fine and I didn't see the error on my env
>> that Sean said above.
>>
>> Thanks, Dongjoon!
>>
>> Bests,
>> Takeshi
>>
>> On Wed, Jan 15, 2020 at 4:09 AM DB Tsai  wrote:
>>
>>> +1 Thanks.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 42E5B25A8F7A82C1
>>>
>>> On Tue, Jan 14, 2020 at 11:08 AM Sean Owen 
>>> wrote:
>>> >
>>> > Yeah it's something about the env I spun up, but I don't know
>>> what. It
>>> > happens frequently when I test, but not on Jenkins.
>>> > The Kafka error comes up every now and then and a clean rebuild
>>> fixes
>>> > it, but not in my case. I don't know why.
>>> > But if nobody else sees it, I'm pretty sure it's just an artifact
>>> of
>>> > the local VM.
>>> >
>>> > On Tue, Jan 14, 2020 at 12:57 PM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>> > >
>>> > > Thank you, Sean.
>>> > >
>>> > > First of all, the `Ubuntu` job on Amplab Jenkins farm is green.
>>> > >
>>> > >
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.4-test-sbt-hadoop-2.7-ubuntu-testing/
>>> > >
>>> > > For the failures,
>>> > >1. Yes, the `HiveExternalCatalogVersionsSuite` flakiness is a
>>> known one.
>>> > >2. For `HDFSMetadataLogSuite` failure, I also observed a few
>>> time before in CentOS too.
>>> > >3. Kafka build error is new to me. Does it happen on `Maven`
>>> clean build?
>>> > >
>>> > > Bests,
>>> > > Dongjoon.
>>> > >
>>> > >
>>> > > On Tue, Jan 14, 2020 at 6:40 AM Sean Owen 
>>> wrote:
>>> > >>
>>> > >> +1 from me. I checked sigs/licenses, and built/tested from
>>> source on
>>> > >> Java 8 + Ubuntu 18.04 with " -Pyarn -Phive -Phive-thriftserver
>>> > >> -Phadoop-2.7 -Pmesos -Pkubernetes -Psparkr -Pkinesis-asl". I do
>>> get
>>> > >> test failures, but, these are some I have always seen on
>>> Ubuntu, and I
>>> > >> do not know why they happen. They don't seem to affect others,
>>> but,
>>> > >> let me know if anyone else sees these?
>>> > >>
>>> > >>
>>> > >> Always happens for me:
>>> > >>
>>> > >> - HDFSMetadataLog: metadata directory collision *** FAILED ***
>>> > >>   The await method on Waiter timed out.
>>> (HDFSMetadataLogSuite.scala:178)
>>> > >>
>>> > >> This one has been flaky at times due to external dependencies:
>>> > >>
>>> > >> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite ***
>>> ABORTED ***
>>> > >>   Exception encountered wh

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Hyukjin Kwon
I think the problem here is whether there is an explicit plan or not.
The PR was merged one year ago, and not many changes have been made to this
API to address the main concerns mentioned.
Also, the requested follow-up JIRA still seems to be open:
https://issues.apache.org/jira/browse/SPARK-27386
I heard this was already discussed, but I cannot find a summary of the
meeting or any documentation.

I would like to understand how we plan to extend this. I had a couple of
questions, such as:
  - Why can't we use a UDF-like interface, for example?
  - Is this expression only for partitioning, or do we plan to expose it to
replace other existing expressions?

> About extensibility, it's similar to DS V1 Filter again. We don't cover
all the expressions at the beginning, but we can add more in future
versions when needed. The data source implementations should be defensive
and fail when seeing unrecognized Filter/Transform.

I think there are differences in that:
- A DSv1 filter works whether the filters are pushed or not. However, this
case does not work.
- There are too many expressions, whereas the number of predicates is
relatively limited. If we plan to push all expressions eventually, I doubt
this is a good idea.


On Thu, Jan 16, 2020 at 9:22 PM, Wenchen Fan wrote:

> The DS v2 project is still evolving, so half-baked is inevitable
> sometimes. This feature is definitely in the right direction to allow more
> flexible partition implementations, but there are a few problems we can
> discuss.
>
> About expression duplication. This is an existing design choice. We don't
> want to expose the Expression class directly but we do need to expose some
> Expression-like stuff in the developer APIs. So we pick some basic
> expressions, make a copy and create a public version of them. This is what
> we did for DS V1 Filter, and I think we can continue to do this for DS v2
> Transform.
>
> About extensibility, it's similar to DS V1 Filter again. We don't cover
> all the expressions at the beginning, but we can add more in future
> versions when needed. The data source implementations should be defensive
> and fail when seeing unrecognized Filter/Transform.
>
> About compatibility. This is the place that I have a concern as well. For
> DS V1 Filter, we just expose all the Filter classes, like `EqualTo`,
> `GreaterThan`, etc. These classes have well-defined semantic. For DS V2
> Transform, we only expose the Transform interface, and data sources need to
> look at `Transform#name` and search the document to see the semantic.
> What's worse, the parser/analyzer allows arbitrary string as Transform
> name, so it's impossible to have well-defined semantic, and also different
> sources may have different semantic for the same Transform name.
>
> I'd suggest we forbid arbitrary string as Transform (the ApplyTransform
> class). We can even follow DS  V1 Filter and expose the classes directly.
>
> On Thu, Jan 16, 2020 at 6:56 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I would like to suggest to take one step back at
>> https://github.com/apache/spark/pull/24117 and rethink about it.
>> I am writing this email as I raised the issue few times but could not
>> have enough responses promptly, and
>> the code freeze is being close.
>>
>> In particular, please refer the below comments for the full context:
>> - https://github.com/apache/spark/pull/24117#issuecomment-568891483
>> - https://github.com/apache/spark/pull/24117#issuecomment-568614961
>> - https://github.com/apache/spark/pull/24117#issuecomment-568891483
>>
>>
>> In short, this PR added an API in DSv2:
>>
>> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>>
>>
>> So people can write some classes for *transform(col)* for partitioned
>> column specifically.
>>
>> However, there are some design concerns which looked not addressed
>> properly.
>>
>> Note that one of the main point is to avoid half-baked or
>> just-work-for-now APIs. However, this looks
>> definitely like half-completed. Therefore, I would like to propose to
>> take one step back and revert it for now.
>> Please see below the concerns listed.
>>
>> *Duplication of existing expressions*
>> Seems like existing expressions are going to be duplicated. See below new
>> APIs added:
>>
>> def years(column: String): YearsTransform = YearsTransform(reference(column))
>> def months(column: String): MonthsTransform = 
>> MonthsTransform(reference(column))
>> def days(column: String): DaysTransform = DaysTransform(reference(column))
>> def hours(column: String): HoursTransform = HoursTransform(reference(column))
>> ...
>>
>> It looks like it requires to add a copy of our existing expressions, in
>> the future.
>>
>>
>> *Limited Extensibility*
>> It has a clear limitation. It looks like other expressions are going to be
>> allowed together (e.g., `concat(years(col) + days(col))`);
>> however, it looks impossible to extend with the current design. It just
>> directly maps transformName to an implementation class,
>> and just passes the arguments.

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Wenchen Fan
The DS v2 project is still evolving so half-baked is inevitable sometimes.
This feature is definitely in the right direction to allow more flexible
partition implementations, but there are a few problems we can discuss.

About expression duplication. This is an existing design choice. We don't
want to expose the Expression class directly but we do need to expose some
Expression-like stuff in the developer APIs. So we pick some basic
expressions, make a copy and create a public version of them. This is what
we did for DS V1 Filter, and I think we can continue to do this for DS v2
Transform.

About extensibility, it's similar to DS V1 Filter again. We don't cover all
the expressions at the beginning, but we can add more in future versions
when needed. The data source implementations should be defensive and fail
when seeing unrecognized Filter/Transform.
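
As a rough sketch of what "defensive" could look like on the source side (this
only assumes the public Transform interface in
org.apache.spark.sql.connector.expressions; the transform names used below are
the ones on current master and may still change):

import org.apache.spark.sql.connector.expressions.Transform

// Minimal sketch: a source that only understands identity and years
// partitioning, and fails loudly instead of silently mis-partitioning data.
def validatePartitioning(transforms: Array[Transform]): Unit = {
  transforms.foreach { t =>
    t.name match {
      case "identity" | "years" => // supported by this source
      case other => throw new UnsupportedOperationException(
        s"Partition transform '$other' is not supported by this data source")
    }
  }
}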

About compatibility. This is the place where I have a concern as well. For
DS V1 Filter, we just expose all the Filter classes, like `EqualTo`,
`GreaterThan`, etc. These classes have well-defined semantics. For DS V2
Transform, we only expose the Transform interface, and data sources need to
look at `Transform#name` and search the documentation to see the semantics.
What's worse, the parser/analyzer allows an arbitrary string as the Transform
name, so it's impossible to have well-defined semantics, and also different
sources may have different semantics for the same Transform name.

I'd suggest we forbid arbitrary strings as Transform names (the ApplyTransform
class). We can even follow DS V1 Filter and expose the classes directly.
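
For reference, this is roughly how the DS V1 Filter model works for a source
today: because EqualTo, GreaterThan, etc. are public classes with fixed,
documented semantics, a source can pattern match on them and hand anything it
does not understand back to Spark (a sketch, not tied to any particular source):

import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}

// Decide which filters can be pushed down purely by matching on the public
// classes; everything else is returned so Spark re-evaluates it after the scan.
def pushFilters(filters: Array[Filter]): Array[Filter] = {
  val (supported, unsupported) = filters.partition {
    case _: EqualTo | _: GreaterThan => true
    case _                           => false
  }
  // ... remember `supported` for the scan here ...
  unsupported
}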

On Thu, Jan 16, 2020 at 6:56 PM Hyukjin Kwon  wrote:

> Hi all,
>
> I would like to suggest taking one step back at
> https://github.com/apache/spark/pull/24117 and rethinking it.
> I am writing this email as I have raised the issue a few times but could not
> get enough responses promptly, and
> the code freeze is close.
>
> In particular, please refer to the comments below for the full context:
> - https://github.com/apache/spark/pull/24117#issuecomment-568891483
> - https://github.com/apache/spark/pull/24117#issuecomment-568614961
> - https://github.com/apache/spark/pull/24117#issuecomment-568891483
>
>
> In short, this PR added an API in DSv2:
>
> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>
>
> So people can write classes for *transform(col)* specifically for partitioned
> columns.
>
> However, there are some design concerns that do not look properly
> addressed.
>
> Note that one of the main points is to avoid half-baked or
> just-work-for-now APIs. However, this looks
> definitely half-completed. Therefore, I would like to propose to take
> one step back and revert it for now.
> Please see the concerns listed below.
>
> *Duplication of existing expressions*
> It seems like existing expressions are going to be duplicated. See the new
> APIs added below:
>
> def years(column: String): YearsTransform = YearsTransform(reference(column))
> def months(column: String): MonthsTransform = 
> MonthsTransform(reference(column))
> def days(column: String): DaysTransform = DaysTransform(reference(column))
> def hours(column: String): HoursTransform = HoursTransform(reference(column))
> ...
>
> It looks like this will require adding copies of our existing expressions in
> the future.
>
>
> *Limited Extensibility*
> It has a clear limitation. It looks like other expressions are going to be
> allowed together (e.g., `concat(years(col) + days(col))`);
> however, it looks impossible to extend with the current design. It just
> directly maps transformName to an implementation class,
> and just passes the arguments:
>
> transform
> ...
> | transformName=identifier
>   '(' argument+=transformArgument (',' argument+=transformArgument)* ')'  
> #applyTransform
> ;
>
> It looks as though general expressions are supported; however, they are not.
> - If we should support them, the design had to consider that.
> - If we should not support them, a different syntax might have to be used instead.
>
> *Limited Compatibility Management*
> The name can be arbitrary. For instance, if a "transform" name is supported on
> the Spark side, the name is preempted by Spark.
> If a data source ever supported such a name, it becomes incompatible.
>
>
>
>


[DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Hyukjin Kwon
Hi all,

I would like to suggest taking one step back at
https://github.com/apache/spark/pull/24117 and rethinking it.
I am writing this email as I have raised the issue a few times but could not
get enough responses promptly, and
the code freeze is close.

In particular, please refer to the comments below for the full context:
- https://github.com/apache/spark/pull/24117#issuecomment-568891483
- https://github.com/apache/spark/pull/24117#issuecomment-568614961
- https://github.com/apache/spark/pull/24117#issuecomment-568891483


In short, this PR added an API in DSv2:

CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*


So people can write classes for *transform(col)* specifically for partitioned
columns.

However, there are some design concerns that do not look properly addressed.

Note that one of the main points is to avoid half-baked or just-work-for-now
APIs. However, this looks
definitely half-completed. Therefore, I would like to propose to take
one step back and revert it for now.
Please see the concerns listed below.

*Duplication of existing expressions*
It seems like existing expressions are going to be duplicated. See the new
APIs added below:

def years(column: String): YearsTransform = YearsTransform(reference(column))
def months(column: String): MonthsTransform = MonthsTransform(reference(column))
def days(column: String): DaysTransform = DaysTransform(reference(column))
def hours(column: String): HoursTransform = HoursTransform(reference(column))
...

It looks like this will require adding copies of our existing expressions in the
future.


*Limited Extensibility*
It has a clear limitation. It looks like other expressions are going to be
allowed together (e.g., `concat(years(col) + days(col))`);
however, it looks impossible to extend with the current design. It just
directly maps transformName to an implementation class,
and just passes the arguments:

transform
...
| transformName=identifier
  '(' argument+=transformArgument (','
argument+=transformArgument)* ')'  #applyTransform
;

It looks as though general expressions are supported; however, they are not.
- If we should support them, the design had to consider that.
- If we should not support them, a different syntax might have to be used instead.
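
To make the limitation above concrete, here is a rough illustration using the
public factory in org.apache.spark.sql.connector.expressions as it currently
exists on master (the "zorder" name is made up, and the API may still change):

import org.apache.spark.sql.connector.expressions.{Expressions, Transform}

// A built-in transform has a dedicated, well-known factory ...
val builtin: Transform = Expressions.years("ts")
// ... while anything else is only an opaque name plus a flat argument list.
// The grammar above only accepts column references or constants as arguments,
// so a nested call such as concat(years(ts), days(ts)) cannot be written in
// PARTITIONED BY at all, even though the syntax looks like a general expression.
val opaque: Transform = Expressions.apply("zorder", Expressions.column("ts"))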

*Limited Compatibility Management*
The name can be arbitrary. For instance, if a "transform" name is supported on
the Spark side, the name is preempted by Spark.
If a data source ever supported such a name, it becomes incompatible.
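
As a small illustration of why this is fragile (a sketch, again with a made-up
"zorder" name): all a source ever receives is the opaque name plus column
references, so two sources can give the same name different meanings, and if
Spark later claims that name with its own semantics, existing tables silently
change meaning.

import org.apache.spark.sql.connector.expressions.Transform

// Everything a source can inspect for, say, PARTITIONED BY (zorder(ts, id)):
// the string "zorder" plus the referenced columns. Which semantics "zorder"
// carries is entirely up to each individual source.
def describe(t: Transform): String =
  s"${t.name}(${t.references.map(_.fieldNames.mkString(".")).mkString(", ")})"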


Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Driesprong, Fokko
Hi Bing,

Good question and the answer is; it depends on what your use-case is.

If you really just want to write raw bytes, then you could use a
.foreach where you open an OutputStream and write the bytes to some file. But this
is probably not what you want, and in practice it is not very handy since you
want to keep the records intact.
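
For completeness, given an rdd: RDD[Array[Byte]], such a raw write could look
roughly like this (a minimal sketch that writes one local file per partition;
the output directory is made up, and on a real cluster you would typically go
through the Hadoop FileSystem API instead):

import java.io.{BufferedOutputStream, File, FileOutputStream}
import org.apache.spark.TaskContext

// Dump the raw bytes of each partition into its own file. Record boundaries
// are lost, which is exactly why a container format such as Parquet or Avro
// (below) is usually the better choice.
rdd.foreachPartition { records =>
  val dir = new File("/tmp/raw-bytes")
  dir.mkdirs()
  val out = new BufferedOutputStream(
    new FileOutputStream(new File(dir, s"part-${TaskContext.getPartitionId()}.bin")))
  try records.foreach(bytes => out.write(bytes)) finally out.close()
}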

My suggestion would be to write it as Parquet or Avro, putting the data in a
binary field. With Avro you have the bytes primitive, which converts in
Spark to Array[Byte]: https://avro.apache.org/docs/1.9.1/spec.html Similarly,
Parquet has the BYTE_ARRAY type:
https://github.com/apache/parquet-format/blob/master/Encodings.md#plain-plain--0

In the words of Linus Torvalds; *Talk is cheap, show me the code*:

MacBook-Pro-van-Fokko:~ fokkodriesprong$ spark-shell
20/01/16 10:58:44 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Spark context Web UI available at http://172.20.10.3:4040
Spark context available as 'sc' (master = local[*], app id =
local-1579168731763).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val data: Array[Array[Byte]] = Array(
 |   Array(0x19.toByte, 0x25.toByte)
 | )
data: Array[Array[Byte]] = Array(Array(25, 37))

scala> val rdd = sc.parallelize(data, 1);
rdd: org.apache.spark.rdd.RDD[Array[Byte]] = ParallelCollectionRDD[0] at
parallelize at :26

scala> rdd.toDF("byte")
res1: org.apache.spark.sql.DataFrame = [byte: binary]

scala> val df = rdd.toDF("byte")
df: org.apache.spark.sql.DataFrame = [byte: binary]

scala> df.write.parquet("/tmp/bytes/")



MacBook-Pro-van-Fokko:~ fokkodriesprong$ ls -lah /tmp/bytes/
total 24
drwxr-xr-x   6 fokkodriesprong  wheel   192B 16 jan 11:01 .
drwxrwxrwt  16 root wheel   512B 16 jan 11:01 ..
-rw-r--r--   1 fokkodriesprong  wheel 8B 16 jan 11:01 ._SUCCESS.crc
-rw-r--r--   1 fokkodriesprong  wheel12B 16 jan 11:01
.part-0-d0d684bb-2371-4947-b2f3-6fca4ead69a7-c000.snappy.parquet.crc
-rw-r--r--   1 fokkodriesprong  wheel 0B 16 jan 11:01 _SUCCESS
-rw-r--r--   1 fokkodriesprong  wheel   384B 16 jan 11:01
part-0-d0d684bb-2371-4947-b2f3-6fca4ead69a7-c000.snappy.parquet

MacBook-Pro-van-Fokko:~ fokkodriesprong$ parquet-tools schema
/tmp/bytes/part-0-d0d684bb-2371-4947-b2f3-6fca4ead69a7-c000.snappy.parquet

message spark_schema {
  optional binary byte;
}
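
And reading it back should give you the original Array[Byte] values again, for
example (a quick sketch, not taken from the session above):

// Round trip: the binary column comes back as Array[Byte] (here Array(25, 37)).
val bytesBack = spark.read.parquet("/tmp/bytes/").head().getAs[Array[Byte]]("byte")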

Hope this helps.

Cheers, Fokko


Op do 16 jan. 2020 om 09:34 schreef Duan,Bing :

> Hi all:
>
> I read binary data (protobuf format) from the filesystem with the binaryFiles
> function into an RDD[Array[Byte]] and it works fine. But when I save it back to
> the filesystem with saveAsTextFile, the quotation marks are escaped like this:
> "\"20192_1\"",1,24,0,2,"\"S66.000x001\"", which should
> be "20192_1",1,24,0,2,"S66.000x001".
>
> Could anyone give me a tip on how to implement a function
> like saveAsBinaryFile to persist the RDD[Array[Byte]]?
>
> Bests!
>
> Bing
>


How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Duan,Bing
Hi all:

I read binary data (protobuf format) from the filesystem with the binaryFiles function into
an RDD[Array[Byte]] and it works fine. But when I save it back to the filesystem with
saveAsTextFile, the quotation marks are escaped like this:
"\"20192_1\"",1,24,0,2,"\"S66.000x001\"", which should be
"20192_1",1,24,0,2,"S66.000x001".

Could anyone give me a tip on how to implement a function like saveAsBinaryFile to
persist the RDD[Array[Byte]]?

Bests!

Bing