Support structured plan logging

2018-10-11 Thread bo yang
Hi All,

Is anyone interested in adding structured plan logging to Spark?
Currently the logical/physical plan can only be logged as plain text via
the explain() method, which has some issues: for example, strings get
truncated and the output is difficult for tools and programs to consume.

This PR  fixes the truncation
issue. A further step is to log the plan as structured content (e.g. JSON).
Do other people feel a similar need?
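
For illustration (an aside, not a proposal for the final format): Catalyst's TreeNode API already exposes a JSON form of a plan, so a structured dump can be produced today roughly as in the sketch below, assuming a local SparkSession and the toJSON helper available in recent Spark versions.

// Minimal sketch: dumping logical and physical plans as JSON via TreeNode.toJSON.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("plan-json").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name").filter($"id" > 1)

// Plain-text plan: this is what explain() prints and where truncation can occur.
df.explain(true)

// Structured (JSON) dumps of the analyzed and physical plans.
val analyzedJson = df.queryExecution.analyzed.toJSON
val physicalJson = df.queryExecution.executedPlan.toJSON
println(analyzedJson)
println(physicalJson)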

Thanks,
Bo


Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-11 Thread Xiao Li
-1. We have two correctness bugs:
https://issues.apache.org/jira/browse/SPARK-25714 and
https://issues.apache.org/jira/browse/SPARK-25708.

Let us fix all three issues in ScalaUDF, as mentioned by Sean.

Xiao


Sean Owen wrote on Thursday, October 11, 2018 at 9:04 AM:

> This is a legitimate question about the behavior of ScalaUDF after the
> change to support 2.12:
> https://github.com/apache/spark/pull/22259#discussion_r224295469
> Not quite a blocker I think, but a potential gotcha we definitely need
> to highlight in release notes. There may be an argument for changing
> ScalaUDF again before the release. Have a look, anyone familiar with
> Catalyst.
> On Wed, Oct 10, 2018 at 3:00 PM Sean Owen  wrote:
> >
> > +1. I tested the source build against Scala 2.12 and common build
> > profiles. License and sigs look OK.
> >
> > No blockers; one critical:
> >
> > SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
> >
> > I think this one is "won't fix" though? not trying to restore the
> behavior?
> >
> > Other items open for 2.4.0:
> >
> > SPARK-25347 Document image data source in doc site
> > SPARK-25584 Document libsvm data source in doc site
> > SPARK-25179 Document the features that require Pyarrow 0.10
> > SPARK-25507 Update documents for the new features in 2.4 release
> > SPARK-25346 Document Spark builtin data sources
> > SPARK-24464 Unit tests for MLlib's Instrumentation
> > SPARK-23197 Flaky test:
> spark.streaming.ReceiverSuite."receiver_life_cycle"
> > SPARK-22809 pyspark is sensitive to imports with dots
> > SPARK-21030 extend hint syntax to support any expression for Python and R
> >
> > Does anyone know enough to close or retarget them? They don't look critical
> > for 2.4; SPARK-25507 has no content itself. SPARK-25179 "Document the
> > features that require Pyarrow 0.10", however, sounds like it could have
> > been important for 2.4, if not a blocker.
> >
> > PS I don't think that SPARK-25150 is an issue; see JIRA. At least
> > there is some ongoing discussion there.
> >
> > I am evaluating
> > https://github.com/apache/spark/pull/22259#discussion_r224252642 right
> > now.
> >
> >
> > On Wed, Oct 10, 2018 at 9:47 AM Wenchen Fan  wrote:
> > >
> > > Please vote on releasing the following candidate as Apache Spark
> version 2.4.0.
> > >
> > > The vote is open until October 1 PST and passes if a majority +1 PMC
> votes are cast, with
> > > a minimum of 3 +1 votes.
> > >
> > > [ ] +1 Release this package as Apache Spark 2.4.0
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see http://spark.apache.org/
> > >
> > > The tag to be voted on is v2.4.0-rc3 (commit
> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
> > > https://github.com/apache/spark/tree/v2.4.0-rc3
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
> > >
> > > Signatures used for Spark RCs can be found in this file:
> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >
> > > The staging repository for this release can be found at:
> > > https://repository.apache.org/content/repositories/orgapachespark-1289
> > >
> > > The documentation corresponding to this release can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
> > >
> > > The list of bug fixes going into 2.4.0 can be found at the following
> URL:
> > > https://issues.apache.org/jira/projects/SPARK/versions/12342385
> > >
> > > FAQ
> > >
> > > =
> > > How can I help test this release?
> > > =
> > >
> > > If you are a Spark user, you can help us test this release by taking
> > > an existing Spark workload and running on this release candidate, then
> > > reporting any regressions.
> > >
> > > If you're working in PySpark you can set up a virtual env and install
> > > the current RC and see if anything important breaks; in Java/Scala
> > > you can add the staging repository to your project's resolvers and test
> > > with the RC (make sure to clean up the artifact cache before/after so
> > > you don't end up building with an out-of-date RC going forward).
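
As an illustration of the Java/Scala step above (a sketch only: the staging URL is the one listed in this email, while the sbt setup, artifact coordinates, and the plain "2.4.0" version under staging are assumptions):

// build.sbt sketch for testing a project against the RC staging repository.
resolvers += "Apache Spark 2.4.0 RC3 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1289/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided"
)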
> > >
> > > ===
> > > What should happen to JIRA tickets still targeting 2.4.0?
> > > ===
> > >
> > > The current list of open tickets targeted at 2.4.0 can be found at:
> > > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
> > >
> > > Committers should look at those and triage. Extremely important bug
> > > fixes, documentation, and API tweaks that impact compatibility should
> > > be worked on immediately. Everything else please retarget to an
> > > appropriate release.
> > >
> > > ==
> > > But my bug isn't fixed?
> > > ==
> > >
> > > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.

Re: Remove Flume support in 3.0.0?

2018-10-11 Thread Wenchen Fan
Note that it was already deprecated in 2.3.0:
https://spark.apache.org/docs/2.3.0/streaming-flume-integration.html

On Fri, Oct 12, 2018 at 12:46 AM Reynold Xin  wrote:

> Sounds like a good idea...
>
> > On Oct 11, 2018, at 6:40 PM, Sean Owen  wrote:
> >
> > Yep, that already exists as Bahir.
> > Also, would anyone object to declaring Flume support at least
> > deprecated in 2.4.0?
> >> On Wed, Oct 10, 2018 at 2:29 PM Jörn Franke 
> wrote:
> >>
> >> I think it makes sense to remove it.
> >> If it is not too much effort, and the architecture of the Flume source
> >> is not considered too strange, one could extract it as a separate project
> >> and put it on GitHub in a dedicated, unsupported repository. This would
> >> enable distributors and other companies to continue to use it with minor
> >> adaptations in case their architecture depends on it. Furthermore, if there
> >> is growing interest, someone could pick it up and create a clean connector
> >> based on the current Spark architecture, to be made available as a
> >> dedicated connector or again in later Spark versions.
> >>
> >> That being said, there are also „indirect“ ways to use Flume with Spark
> >> (e.g. via Kafka), so I believe people would not be affected much by a
> >> removal.
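
As a hedged illustration of the "indirect via Kafka" route mentioned above (the broker address and topic name are hypothetical), a Structured Streaming read from a Kafka topic fed by Flume might look roughly like this:

// Minimal sketch, assuming Flume forwards events to a Kafka topic named "flume-events"
// and the spark-sql-kafka-0-10 connector is on the classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("flume-via-kafka").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // hypothetical broker address
  .option("subscribe", "flume-events")                // hypothetical topic name
  .load()
  .selectExpr("CAST(value AS STRING) AS body")

// Console sink used purely for illustration; a real job would write somewhere durable.
val query = events.writeStream
  .format("console")
  .outputMode("append")
  .start()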
> >>
> >> (Non-Voting just my opinion)
> >>
> >>> On 10.10.2018 at 22:31, Sean Owen wrote:
> >>>
> >>> Marcelo makes an argument that Flume support should be removed in
> >>> 3.0.0 at https://issues.apache.org/jira/browse/SPARK-25598
> >>>
> >>> I tend to agree. Is there an argument that it needs to be supported,
> >>> and can this move to Bahir if so?
> >>>
>
>


Re: Remove Flume support in 3.0.0?

2018-10-11 Thread Reynold Xin
Sounds like a good idea...

> On Oct 11, 2018, at 6:40 PM, Sean Owen  wrote:
> 
> Yep, that already exists as Bahir.
> Also, would anyone object to declaring Flume support at least
> deprecated in 2.4.0?
>> On Wed, Oct 10, 2018 at 2:29 PM Jörn Franke  wrote:
>> 
>> I think it makes sense to remove it.
>> If it is not too much effort, and the architecture of the Flume source is not
>> considered too strange, one could extract it as a separate project and put
>> it on GitHub in a dedicated, unsupported repository. This would enable
>> distributors and other companies to continue to use it with minor adaptations
>> in case their architecture depends on it. Furthermore, if there is growing
>> interest, someone could pick it up and create a clean connector based on the
>> current Spark architecture, to be made available as a dedicated connector or
>> again in later Spark versions.
>> 
>> That being said, there are also „indirect“ ways to use Flume with Spark (e.g.
>> via Kafka), so I believe people would not be affected much by a removal.
>> 
>> (Non-Voting just my opinion)
>> 
>>> On 10.10.2018 at 22:31, Sean Owen wrote:
>>> 
>>> Marcelo makes an argument that Flume support should be removed in
>>> 3.0.0 at https://issues.apache.org/jira/browse/SPARK-25598
>>> 
>>> I tend to agree. Is there an argument that it needs to be supported,
>>> and can this move to Bahir if so?
>>> 
> 




Re: Remove Flume support in 3.0.0?

2018-10-11 Thread Sean Owen
Yep, that already exists as Bahir.
Also, would anyone object to declaring Flume support at least
deprecated in 2.4.0?
On Wed, Oct 10, 2018 at 2:29 PM Jörn Franke  wrote:
>
> I think it makes sense to remove it.
> If it is not too much effort, and the architecture of the Flume source is not
> considered too strange, one could extract it as a separate project and put it
> on GitHub in a dedicated, unsupported repository. This would enable
> distributors and other companies to continue to use it with minor adaptations
> in case their architecture depends on it. Furthermore, if there is growing
> interest, someone could pick it up and create a clean connector based on the
> current Spark architecture, to be made available as a dedicated connector or
> again in later Spark versions.
>
> That being said, there are also „indirect“ ways to use Flume with Spark (e.g.
> via Kafka), so I believe people would not be affected much by a removal.
>
> (Non-Voting just my opinion)
>
> > On 10.10.2018 at 22:31, Sean Owen wrote:
> >
> > Marcelo makes an argument that Flume support should be removed in
> > 3.0.0 at https://issues.apache.org/jira/browse/SPARK-25598
> >
> > I tend to agree. Is there an argument that it needs to be supported,
> > and can this move to Bahir if so?
> >
> >




Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-11 Thread Sean Owen
This is a legitimate question about the behavior of ScalaUDF after the
change to support 2.12:
https://github.com/apache/spark/pull/22259#discussion_r224295469
Not quite a blocker I think, but a potential gotcha we definitely need
to highlight in release notes. There may be an argument for changing
ScalaUDF again before the release. Have a look, anyone familiar with
Catalyst.
On Wed, Oct 10, 2018 at 3:00 PM Sean Owen  wrote:
>
> +1. I tested the source build against Scala 2.12 and common build
> profiles. License and sigs look OK.
>
> No blockers; one critical:
>
> SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
>
> I think this one is "won't fix" though? not trying to restore the behavior?
>
> Other items open for 2.4.0:
>
> SPARK-25347 Document image data source in doc site
> SPARK-25584 Document libsvm data source in doc site
> SPARK-25179 Document the features that require Pyarrow 0.10
> SPARK-25507 Update documents for the new features in 2.4 release
> SPARK-25346 Document Spark builtin data sources
> SPARK-24464 Unit tests for MLlib's Instrumentation
> SPARK-23197 Flaky test: spark.streaming.ReceiverSuite."receiver_life_cycle"
> SPARK-22809 pyspark is sensitive to imports with dots
> SPARK-21030 extend hint syntax to support any expression for Python and R
>
> Does anyone know enough to close or retarget them? They don't look critical
> for 2.4; SPARK-25507 has no content itself. SPARK-25179 "Document the
> features that require Pyarrow 0.10", however, sounds like it could have
> been important for 2.4, if not a blocker.
>
> PS I don't think that SPARK-25150 is an issue; see JIRA. At least
> there is some ongoing discussion there.
>
> I am evaluating
> https://github.com/apache/spark/pull/22259#discussion_r224252642 right
> now.
>
>
> On Wed, Oct 10, 2018 at 9:47 AM Wenchen Fan  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version 
> > 2.4.0.
> >
> > The vote is open until October 1 PST and passes if a majority +1 PMC votes 
> > are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.0-rc3 (commit 
> > 8e4a99bd201b9204fec52580f19ae70a229ed94e):
> > https://github.com/apache/spark/tree/v2.4.0-rc3
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1289
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
> >
> > The list of bug fixes going into 2.4.0 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12342385
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks; in Java/Scala
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.0?
> > ===
> >
> > The current list of open tickets targeted at 2.4.0 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target 
> > Version/s" = 2.4.0
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.




Re: Possible bug in DatasourceV2

2018-10-11 Thread Wenchen Fan
Hi Hyukjin, can you open a PR to revert it from 2.4? Now I'm kind of
convinced this is too breaking a change and we need more discussion.

+ Ryan Blue
Hi Ryan,
I think we need to look back at the new write API design and consider data
sources that don't have a table concept. We should make the schema validation
of the append operator opt-in.

On Thu, Oct 11, 2018 at 8:12 PM Hyukjin Kwon  wrote:

> That's why I initially suggested reverting this part from Spark 2.4 and
> having more discussion for 3.0, since one of the design goals of Data Source V2
> is no behaviour changes for end users.
>
> On Thursday, October 11, 2018 at 7:11 PM, Mendelson, Assaf wrote:
>
>> Actually, it is not just a question of a write-only data source. The
>> issue is that in my case (and I imagine this is true for others), the
>> schema is not read from the database but is understood from the options.
>> This means that I have no way of understanding the schema without supplying
>> the read options. On the other hand, when writing, I have the schema from
>> the dataframe.
>>
>>
>>
>> I know the Data Source V2 API is considered experimental and I have
>> no problem with that; however, this change will require a change in how
>> the end user works with it (they suddenly need to add schema information
>> which they did not need before), not to mention this being a regression.
>>
>>
>>
>> As to the pull request, it only handles cases where the save mode is
>> not Append; the original example (a non-existent path with Append mode)
>> will still fail, even though according to the documentation of Append,
>> if the path does not exist it should be created.
>>
>>
>>
>> I am currently having problems compiling everything, so I can’t test it
>> myself, but wouldn’t changing the relation definition in “save”:
>>
>>
>>
>> val relation = DataSourceV2Relation.create(source, options, None,
>> Option(df.schema))
>>
>>
>>
>> and changing create to look like this:
>>
>>
>>
>> def create(source: DataSourceV2, options: Map[String, String],
>> tableIdent: Option[TableIdentifier] = None, userSpecifiedSchema:
>> Option[StructType] = None): DataSourceV2Relation = {
>>
>> val schema =
>> userSpecifiedSchema.getOrElse(source.createReader(options,
>> userSpecifiedSchema).readSchema())
>>
>> val ident = tableIdent.orElse(tableFromOptions(options))
>>
>> DataSourceV2Relation(
>>
>>   source, schema.toAttributes, options, ident, userSpecifiedSchema)
>>
>>   }
>>
>>
>>
>> Correct this?
>>
>>
>>
>> Or even creating a new create which simply gets the schema as non
>> optional?
>>
>>
>>
>> Thanks,
>>
>> Assaf
>>
>>
>>
>> *From:* Hyukjin Kwon [mailto:gurwls...@gmail.com]
>> *Sent:* Thursday, October 11, 2018 10:24 AM
>> *To:* Mendelson, Assaf; Wenchen Fan
>> *Cc:* dev
>> *Subject:* Re: Possible bug in DatasourceV2
>>
>>
>>
>>
>> See https://github.com/apache/spark/pull/22688
>>
>>
>>
>> +Wenchen, here is where the problem was raised. This might have to be
>> considered a blocker ...
>>
>> On Thu, 11 Oct 2018, 2:48 pm assaf.mendelson, 
>> wrote:
>>
>> Hi,
>>
>> I created a datasource writer WITHOUT a reader. When I do, I get an
>> exception: org.apache.spark.sql.AnalysisException: Data source is not
>> readable: DefaultSource
>>
>> The reason for this is that when save is called, inside the source match
>> to
>> WriterSupport we have the following code:
>>
>> val source = cls.newInstance().asInstanceOf[DataSourceV2]
>>   source match {
>> case ws: WriteSupport =>
>>   val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
>> source,
>> df.sparkSession.sessionState.conf)
>>   val options = sessionOptions ++ extraOptions
>> -->  val relation = DataSourceV2Relation.create(source, options)
>>
>>   if (mode == SaveMode.Append) {
>> runCommand(df.sparkSession, "save") {
>>   AppendData.byName(relation, df.logicalPlan)
>> }
>>
>>   } else {
>> val writer = ws.createWriter(
>>   UUID.randomUUID.toString,
>> df.logicalPlan.output.toStructType,
>> mode,
>>   new DataSourceOptions(options.asJava))
>>
>> if (writer.isPresent) {
>>   runCommand(df.sparkSession, "save") {
>> WriteToDataSourceV2(writer.get, df.logicalPlan)
>>   }
>> }
>>   }
>>
>> but DataSourceV2Relation.create actively creates a reader
>> (source.createReader) to extract the schema:
>>
>> def create(
>>   source: DataSourceV2,
>>   options: Map[String, String],
>>   tableIdent: Option[TableIdentifier] = None,
>>   userSpecifiedSchema: Option[StructType] = None):
>> DataSourceV2Relation
>> = {
>> val reader = source.createReader(options, userSpecifiedSchema)
>> val ident = tableIdent.orElse(tableFromOptions(options))
>> DataSourceV2Relation(
>>   source, reader.readSchema().toAttributes, options, ident,
>> userSpecifiedSchema)
>>   }

Re: Possible bug in DatasourceV2

2018-10-11 Thread Hyukjin Kwon
That's why I initially suggested reverting this part from Spark 2.4 and
having more discussion for 3.0, since one of the design goals of Data Source V2
is no behaviour changes for end users.

On Thursday, October 11, 2018 at 7:11 PM, Mendelson, Assaf wrote:

> Actually, it is not just a question of a write-only data source. The issue
> is that in my case (and I imagine this is true for others), the schema is
> not read from the database but is understood from the options. This means
> that I have no way of understanding the schema without supplying the read
> options. On the other hand, when writing, I have the schema from the
> dataframe.
>
>
>
> I know the Data Source V2 API is considered experimental and I have no
> problem with that; however, this change will require a change in how the
> end user works with it (they suddenly need to add schema information
> which they did not need before), not to mention this being a regression.
>
>
>
> As to the pull request, it only handles cases where the save mode is not
> Append; the original example (a non-existent path with Append mode) will
> still fail, even though according to the documentation of Append, if the
> path does not exist it should be created.
>
>
>
> I am currently having problems compiling everything, so I can’t test it
> myself, but wouldn’t changing the relation definition in “save”:
>
>
>
> val relation = DataSourceV2Relation.create(source, options, None,
> Option(df.schema))
>
>
>
> and changing create to look like this:
>
>
>
> def create(source: DataSourceV2, options: Map[String, String], tableIdent:
> Option[TableIdentifier] = None, userSpecifiedSchema: Option[StructType] =
> None): DataSourceV2Relation = {
>
> val schema =
> userSpecifiedSchema.getOrElse(source.createReader(options,
> userSpecifiedSchema).readSchema())
>
> val ident = tableIdent.orElse(tableFromOptions(options))
>
> DataSourceV2Relation(
>
>   source, schema.toAttributes, options, ident, userSpecifiedSchema)
>
>   }
>
>
>
> Correct this?
>
>
>
> Or even creating a new create which simply gets the schema as non optional?
>
>
>
> Thanks,
>
> Assaf
>
>
>
> *From:* Hyukjin Kwon [mailto:gurwls...@gmail.com]
> *Sent:* Thursday, October 11, 2018 10:24 AM
> *To:* Mendelson, Assaf; Wenchen Fan
> *Cc:* dev
> *Subject:* Re: Possible bug in DatasourceV2
>
>
>
>
> See https://github.com/apache/spark/pull/22688
>
>
>
> +Wenchen, here is where the problem was raised. This might have to be
> considered a blocker ...
>
> On Thu, 11 Oct 2018, 2:48 pm assaf.mendelson, 
> wrote:
>
> Hi,
>
> I created a datasource writer WITHOUT a reader. When I do, I get an
> exception: org.apache.spark.sql.AnalysisException: Data source is not
> readable: DefaultSource
>
> The reason for this is that when save is called, inside the source match to
> WriterSupport we have the following code:
>
> val source = cls.newInstance().asInstanceOf[DataSourceV2]
>   source match {
> case ws: WriteSupport =>
>   val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
> source,
> df.sparkSession.sessionState.conf)
>   val options = sessionOptions ++ extraOptions
> -->  val relation = DataSourceV2Relation.create(source, options)
>
>   if (mode == SaveMode.Append) {
> runCommand(df.sparkSession, "save") {
>   AppendData.byName(relation, df.logicalPlan)
> }
>
>   } else {
> val writer = ws.createWriter(
>   UUID.randomUUID.toString, df.logicalPlan.output.toStructType,
> mode,
>   new DataSourceOptions(options.asJava))
>
> if (writer.isPresent) {
>   runCommand(df.sparkSession, "save") {
> WriteToDataSourceV2(writer.get, df.logicalPlan)
>   }
> }
>   }
>
> but DataSourceV2Relation.create actively creates a reader
> (source.createReader) to extract the schema:
>
> def create(
>   source: DataSourceV2,
>   options: Map[String, String],
>   tableIdent: Option[TableIdentifier] = None,
>   userSpecifiedSchema: Option[StructType] = None): DataSourceV2Relation
> = {
> val reader = source.createReader(options, userSpecifiedSchema)
> val ident = tableIdent.orElse(tableFromOptions(options))
> DataSourceV2Relation(
>   source, reader.readSchema().toAttributes, options, ident,
> userSpecifiedSchema)
>   }
>
>
> This makes me a little confused.
>
> First, the schema is defined by the dataframe itself, not by the data
> source, i.e. it should be extracted from df.schema and not by
> source.createReader
>
> Second, I see that the relation is actually only used if the mode is
> SaveMode.Append (by the way, this means that if it is needed it should be
> defined inside the "if"). I am not sure I understand the AppendData portion,
> but why would reading from the source be included?

Fwd: [VOTE] SPARK 2.4.0 (RC3)

2018-10-11 Thread Wenchen Fan
Forgot to cc dev-list

-- Forwarded message -
From: Wenchen Fan 
Date: Thu, Oct 11, 2018 at 10:14 AM
Subject: Re: [VOTE] SPARK 2.4.0 (RC3)
To: 
Cc: Sean Owen 


Ah sorry guys, I just copy-pasted the voting email from the last RC and
forgot to update the date :P

The voting should be open until October 13 PST.

According to the discussion in the previous RC, I'm resolving SPARK-25378
as Won't Fix. It's OK to wait one or two weeks for the TensorFlow release.

SPARK-25150 is a long-standing and known issue, I believe: the DataFrame join
API may have confusing behavior for indirect self-joins, and it is relatively
hard to fix if breaking changes are not allowed. I've seen many tickets
complaining about it, and we should definitely fix it in 3.0, which accepts
necessary breaking changes.

SPARK-25588 does look like a potential issue, but there is not much we can
do if this problem is not reproducible.



On Thu, Oct 11, 2018 at 7:28 AM Michael Heuer  wrote:

> Hello Sean, Wenchen
>
> I could use triage on
>
> https://issues.apache.org/jira/browse/SPARK-25588
>
> I’ve struggled reporting Parquet+Avro dependency issues against Spark in
> the past; I can’t seem to get any notice.
>
>michael
>
>
> On Oct 10, 2018, at 5:00 PM, Sean Owen  wrote:
>
> +1. I tested the source build against Scala 2.12 and common build
> profiles. License and sigs look OK.
>
> No blockers; one critical:
>
> SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
>
> I think this one is "won't fix" though? not trying to restore the behavior?
>
> Other items open for 2.4.0:
>
> SPARK-25347 Document image data source in doc site
> SPARK-25584 Document libsvm data source in doc site
> SPARK-25179 Document the features that require Pyarrow 0.10
> SPARK-25507 Update documents for the new features in 2.4 release
> SPARK-25346 Document Spark builtin data sources
> SPARK-24464 Unit tests for MLlib's Instrumentation
> SPARK-23197 Flaky test: spark.streaming.ReceiverSuite."receiver_life_cycle"
> SPARK-22809 pyspark is sensitive to imports with dots
> SPARK-21030 extend hint syntax to support any expression for Python and R
>
> Does anyone know enough to close or retarget them? They don't look critical
> for 2.4; SPARK-25507 has no content itself. SPARK-25179 "Document the
> features that require Pyarrow 0.10", however, sounds like it could have
> been important for 2.4, if not a blocker.
>
> PS I don't think that SPARK-25150 is an issue; see JIRA. At least
> there is some ongoing discussion there.
>
> I am evaluating
> https://github.com/apache/spark/pull/22259#discussion_r224252642 right
> now.
>
>
> On Wed, Oct 10, 2018 at 9:47 AM Wenchen Fan  wrote:
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 1 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc3 (commit
> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
> https://github.com/apache/spark/tree/v2.4.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1289
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.

RE: Possible bug in DatasourceV2

2018-10-11 Thread Mendelson, Assaf
Actually, it is not just a question of a write-only data source. The issue is
that in my case (and I imagine this is true for others), the schema is not read 
from the database but is understood from the options. This means that I have no 
way of understanding the schema without supplying the read options. On the 
other hand, when writing, I have the schema from the dataframe.

I know the Data Source V2 API is considered experimental and I have no problem
with that; however, this change will require a change in how the end user works
with it (they suddenly need to add schema information which they did not need
before), not to mention this being a regression.

As to the pull request, it only handles cases where the save mode is not
Append; the original example (a non-existent path with Append mode) will still
fail, even though according to the documentation of Append, if the path does
not exist it should be created.

I am currently having problems compiling everything, so I can’t test it myself,
but wouldn’t changing the relation definition in “save”:

val relation = DataSourceV2Relation.create(source, options, None, 
Option(df.schema))

and changing create to look like this:

def create(
    source: DataSourceV2,
    options: Map[String, String],
    tableIdent: Option[TableIdentifier] = None,
    userSpecifiedSchema: Option[StructType] = None): DataSourceV2Relation = {
  val schema = userSpecifiedSchema.getOrElse(
    source.createReader(options, userSpecifiedSchema).readSchema())
  val ident = tableIdent.orElse(tableFromOptions(options))
  DataSourceV2Relation(
    source, schema.toAttributes, options, ident, userSpecifiedSchema)
}

Correct this?

Or even creating a new create method which simply takes the schema as a
non-optional parameter?
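
For what it's worth, a rough, untested sketch of that last alternative (the name createForWrite is hypothetical; it is meant to sit next to the existing create() in DataSourceV2Relation's companion object and reuses the same tableFromOptions helper and constructor shown above):

// Hypothetical overload that takes the schema directly, so write-only sources
// never need createReader. Untested; mirrors the create() quoted above.
def createForWrite(
    source: DataSourceV2,
    schema: StructType,
    options: Map[String, String],
    tableIdent: Option[TableIdentifier] = None): DataSourceV2Relation = {
  val ident = tableIdent.orElse(tableFromOptions(options))
  DataSourceV2Relation(
    source, schema.toAttributes, options, ident, Some(schema))
}

DataFrameWriter.save could then call createForWrite(source, df.schema, options) in the write path.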

Thanks,
Assaf

From: Hyukjin Kwon [mailto:gurwls...@gmail.com]
Sent: Thursday, October 11, 2018 10:24 AM
To: Mendelson, Assaf; Wenchen Fan
Cc: dev
Subject: Re: Possible bug in DatasourceV2


See https://github.com/apache/spark/pull/22688

+Wenchen, here is where the problem was raised. This might have to be considered
a blocker ...

On Thu, 11 Oct 2018, 2:48 pm assaf.mendelson <assaf.mendel...@rsa.com> wrote:
Hi,

I created a datasource writer WITHOUT a reader. When I do, I get an
exception: org.apache.spark.sql.AnalysisException: Data source is not
readable: DefaultSource

The reason for this is that when save is called, inside the source match to
WriterSupport we have the following code:

val source = cls.newInstance().asInstanceOf[DataSourceV2]
  source match {
case ws: WriteSupport =>
  val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
source,
df.sparkSession.sessionState.conf)
  val options = sessionOptions ++ extraOptions
-->  val relation = DataSourceV2Relation.create(source, options)

  if (mode == SaveMode.Append) {
runCommand(df.sparkSession, "save") {
  AppendData.byName(relation, df.logicalPlan)
}

  } else {
val writer = ws.createWriter(
  UUID.randomUUID.toString, df.logicalPlan.output.toStructType,
mode,
  new DataSourceOptions(options.asJava))

if (writer.isPresent) {
  runCommand(df.sparkSession, "save") {
WriteToDataSourceV2(writer.get, df.logicalPlan)
  }
}
  }

but DataSourceV2Relation.create actively creates a reader
(source.createReader) to extract the schema:

def create(
  source: DataSourceV2,
  options: Map[String, String],
  tableIdent: Option[TableIdentifier] = None,
  userSpecifiedSchema: Option[StructType] = None): DataSourceV2Relation
= {
val reader = source.createReader(options, userSpecifiedSchema)
val ident = tableIdent.orElse(tableFromOptions(options))
DataSourceV2Relation(
  source, reader.readSchema().toAttributes, options, ident,
userSpecifiedSchema)
  }


This makes me a little confused.

First, the schema is defined by the dataframe itself, not by the data
source, i.e. it should be extracted from df.schema and not by
source.createReader

Second, I see that the relation is actually only used if the mode is
SaveMode.Append (by the way, this means that if it is needed it should be
defined inside the "if"). I am not sure I understand the AppendData portion,
but why would reading from the source be included?

Am I missing something here?

Thanks,
   Assaf



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/



Re: Possible bug in DatasourceV2

2018-10-11 Thread Hyukjin Kwon
See https://github.com/apache/spark/pull/22688

+Wenchen, here is where the problem was raised. This might have to be considered
a blocker ...


On Thu, 11 Oct 2018, 2:48 pm assaf.mendelson, 
wrote:

> Hi,
>
> I created a datasource writer WITHOUT a reader. When I do, I get an
> exception: org.apache.spark.sql.AnalysisException: Data source is not
> readable: DefaultSource
>
> The reason for this is that when save is called, inside the source match to
> WriterSupport we have the following code:
>
> val source = cls.newInstance().asInstanceOf[DataSourceV2]
>   source match {
> case ws: WriteSupport =>
>   val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
> source,
> df.sparkSession.sessionState.conf)
>   val options = sessionOptions ++ extraOptions
> -->  val relation = DataSourceV2Relation.create(source, options)
>
>   if (mode == SaveMode.Append) {
> runCommand(df.sparkSession, "save") {
>   AppendData.byName(relation, df.logicalPlan)
> }
>
>   } else {
> val writer = ws.createWriter(
>   UUID.randomUUID.toString, df.logicalPlan.output.toStructType,
> mode,
>   new DataSourceOptions(options.asJava))
>
> if (writer.isPresent) {
>   runCommand(df.sparkSession, "save") {
> WriteToDataSourceV2(writer.get, df.logicalPlan)
>   }
> }
>   }
>
> but DataSourceV2Relation.create actively creates a reader
> (source.createReader) to extract the schema:
>
> def create(
>   source: DataSourceV2,
>   options: Map[String, String],
>   tableIdent: Option[TableIdentifier] = None,
>   userSpecifiedSchema: Option[StructType] = None): DataSourceV2Relation
> = {
> val reader = source.createReader(options, userSpecifiedSchema)
> val ident = tableIdent.orElse(tableFromOptions(options))
> DataSourceV2Relation(
>   source, reader.readSchema().toAttributes, options, ident,
> userSpecifiedSchema)
>   }
>
>
> This makes me a little confused.
>
> First, the schema is defined by the dataframe itself, not by the data
> source, i.e. it should be extracted from df.schema and not by
> source.createReader
>
> Second, I see that the relation is actually only used if the mode is
> SaveMode.Append (by the way, this means that if it is needed it should be
> defined inside the "if"). I am not sure I understand the AppendData portion,
> but why would reading from the source be included?
>
> Am I missing something here?
>
> Thanks,
>Assaf
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>


Possible bug in DatasourceV2

2018-10-11 Thread assaf.mendelson
Hi,

I created a datasource writer WITHOUT a reader. When I do, I get an
exception: org.apache.spark.sql.AnalysisException: Data source is not
readable: DefaultSource
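
For context, a minimal sketch of the kind of user code that hits this (the format name and option below are hypothetical; only the resulting exception is from the report):

// Hypothetical usage of a V2 source that implements WriteSupport but not ReadSupport.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("write-only-repro").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

df.write
  .format("com.example.WriteOnlySource")  // hypothetical write-only DataSourceV2
  .option("table", "events")              // schema implied by options, not readable back
  .mode(SaveMode.Append)
  .save()
// Fails with:
// org.apache.spark.sql.AnalysisException: Data source is not readable: DefaultSource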

The reason for this is that when save is called, inside the source match to
WriterSupport we have the following code:

val source = cls.newInstance().asInstanceOf[DataSourceV2]
  source match {
case ws: WriteSupport =>
  val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
source,
df.sparkSession.sessionState.conf)
  val options = sessionOptions ++ extraOptions
-->  val relation = DataSourceV2Relation.create(source, options)

  if (mode == SaveMode.Append) {
runCommand(df.sparkSession, "save") {
  AppendData.byName(relation, df.logicalPlan)
}

  } else {
val writer = ws.createWriter(
  UUID.randomUUID.toString, df.logicalPlan.output.toStructType,
mode,
  new DataSourceOptions(options.asJava))

if (writer.isPresent) {
  runCommand(df.sparkSession, "save") {
WriteToDataSourceV2(writer.get, df.logicalPlan)
  }
}
  }

but DataSourceV2Relation.create actively creates a reader
(source.createReader) to extract the schema: 

def create(
  source: DataSourceV2,
  options: Map[String, String],
  tableIdent: Option[TableIdentifier] = None,
  userSpecifiedSchema: Option[StructType] = None): DataSourceV2Relation
= {
val reader = source.createReader(options, userSpecifiedSchema)
val ident = tableIdent.orElse(tableFromOptions(options))
DataSourceV2Relation(
  source, reader.readSchema().toAttributes, options, ident,
userSpecifiedSchema)
  }


This makes me a little confused.

First, the schema is defined by the dataframe itself, not by the data
source, i.e. it should be extracted from df.schema and not by
source.createReader

Second, I see that the relation is actually only used if the mode is
SaveMode.Append (by the way, this means that if it is needed it should be
defined inside the "if"). I am not sure I understand the AppendData portion,
but why would reading from the source be included?

Am I missing something here?

Thanks,
   Assaf



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/




Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-11 Thread Hyukjin Kwon
So, which date is it?

On Thursday, October 11, 2018 at 1:48 AM, Garlapati, Suryanarayana (Nokia - IN/Bangalore)
<suryanarayana.garlap...@nokia.com> wrote:

> You might need to change the date (Oct 1 has already passed).
>
>
>
> >> The vote is open until October 1 PST and passes if a majority +1 PMC
> votes are cast, with
>
> >> a minimum of 3 +1 votes.
>
>
>
> Regards
>
> Surya
>
>
>
> *From:* Wenchen Fan 
> *Sent:* Wednesday, October 10, 2018 10:20 PM
> *To:* Spark dev list 
> *Subject:* Re: [VOTE] SPARK 2.4.0 (RC3)
>
>
>
> I'm adding my own +1, since there are no known blocker issues. The
> correctness issue has been fixed, the streaming Java API problem has been
> resolved, and we have upgraded to Scala 2.12.7.
>
>
>
> On Thu, Oct 11, 2018 at 12:46 AM Wenchen Fan  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
>
>
> The vote is open until October 1 PST and passes if a majority +1 PMC votes
> are cast, with
>
> a minimum of 3 +1 votes.
>
>
>
> [ ] +1 Release this package as Apache Spark 2.4.0
>
> [ ] -1 Do not release this package because ...
>
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
>
> The tag to be voted on is v2.4.0-rc3 (commit
> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
>
> https://github.com/apache/spark/tree/v2.4.0-rc3
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>
>
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1289
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>
>
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
>
>
> FAQ
>
>
>
> =
>
> How can I help test this release?
>
> =
>
>
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
>
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks; in Java/Scala
>
> you can add the staging repository to your project's resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out-of-date RC going forward).
>
>
>
> ===
>
> What should happen to JIRA tickets still targeting 2.4.0?
>
> ===
>
>
>
> The current list of open tickets targeted at 2.4.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
>
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
>
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
>
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
>