Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Xiangrui Meng
+1 from myself.

The vote passed with the following +1s:

* Susham kumar reddy Yerabolu
* Xingbo Jiang
* Xiao Li*
* Weichen Xu
* Joseph Bradley*
* Henry Robinson
* Xiangrui Meng*
* Wenchen Fan*

Henry, you can find a design sketch at
https://issues.apache.org/jira/browse/SPARK-24375. To help discuss the
design, Xingbo submitted a prototype PR today at
https://github.com/apache/spark/pull/21494.
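
For a concrete feel for the direction before a fuller design doc lands, the
API in the prototype centers on something like this (a sketch only; names
such as barrier() and BarrierTaskContext follow the prototype PR and may
change before anything is merged):

    // spark-shell style; all tasks of a barrier stage are launched together
    val rdd = sc.parallelize(1 to 100, 4)
    val doubled = rdd.barrier().mapPartitions { iter =>
      val ctx = org.apache.spark.BarrierTaskContext.get()
      // Global sync point: blocks until every task in the stage reaches it.
      // This is the hook external DL/AI frameworks need for all-reduce-style
      // coordination.
      ctx.barrier()
      iter.map(_ * 2)
    }
    doubled.collect()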

Best,
Xiangrui

On Mon, Jun 4, 2018 at 12:41 PM Wenchen Fan  wrote:

> +1
>
> On Tue, Jun 5, 2018 at 1:20 AM, Henry Robinson  wrote:
>
>> +1
>>
>> (I hope there will be a fuller design document to review, since the SPIP
>> is really light on details).
>>
>> On 4 June 2018 at 10:17, Joseph Bradley  wrote:
>>
>>> +1
>>>
>>> On Sun, Jun 3, 2018 at 9:59 AM, Weichen Xu 
>>> wrote:
>>>
 +1

 On Fri, Jun 1, 2018 at 3:41 PM, Xiao Li  wrote:

> +1
>
> 2018-06-01 15:41 GMT-07:00 Xingbo Jiang :
>
>> +1
>>
>> 2018-06-01 9:21 GMT-07:00 Xiangrui Meng :
>>
>>> Hi all,
>>>
>>> I want to call for a vote of SPARK-24374. It introduces
>>> a new execution mode to Spark, which would help both integration with
>>> external DL/AI frameworks and MLlib algorithm performance. This is one of
>>> the follow-ups from a previous discussion on dev@.
>>>
>>> The vote will be up for the next 72 hours. Please reply with your
>>> vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following
>>> technical reasons.
>>>
>>> Best,
>>> Xiangrui
>>> --
>>>
>>> Xiangrui Meng
>>>
>>> Software Engineer
>>>
>>> Databricks Inc.
>>>
>>
>>
>

>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>>
>>
>>
> --

Xiangrui Meng

Software Engineer

Databricks Inc.


Re: TextSocketMicroBatchReader no longer supports nc utility

2018-06-04 Thread Jungtaek Lim
Yeah, that's why I initiated this thread: the socket source is expected to
be used in examples from the official documentation and in quick
experiments, where we tend to simply use netcat.

I'll file an issue and provide the fix.
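
Regarding the "maybe our own implementation?" option, here is a minimal
sketch of a netcat stand-in (illustrative only, not the actual fix). Unlike
plain nc, it keeps accepting connections, so the temporary schema-inference
connect/disconnect just falls through to the next accept(); it pushes one
line per second to whichever client is connected:

    import java.io.PrintStream
    import java.net.ServerSocket

    object TolerantSocketServer {
      def main(args: Array[String]): Unit = {
        val server = new ServerSocket(9999)   // port is illustrative
        while (true) {
          val socket = server.accept()        // keep serving new connections
          val out = new PrintStream(socket.getOutputStream, true)
          try {
            while (!out.checkError()) {       // stops once the client hangs up
              out.println(s"event at ${System.currentTimeMillis()}")
              Thread.sleep(1000)
            }
          } finally socket.close()
        }
      }
    }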

On Tue, Jun 5, 2018 at 1:48 AM, Joseph Torres  wrote:

> I tend to agree that this is a bug. It's kinda silly that nc does this,
> but a socket connector that doesn't work with netcat will surely seem
> broken to users. It wouldn't be a huge change to defer opening the socket
> until a read is actually required.
>
> On Sun, Jun 3, 2018 at 9:55 PM, Jungtaek Lim  wrote:
>
>> Hi devs,
>>
>> Not sure I'll hear back soon since Spark Summit is just around the
>> corner, but I just want to post this and wait.
>>
>> While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits
>> before reading actual data, so the query also exits with an error.
>>
>> The reason is that a temporary reader is launched to read the schema, then
>> closed, and a new reader is re-opened. While a reliable socket server
>> should handle this without any issue, the nc command normally can't handle
>> multiple connections and simply exits when the temporary reader is closed.
>>
>> I would like to file an issue and contribute a fix if we think this is a
>> bug (otherwise we would need to replace the nc utility with another one,
>> maybe our own implementation?), but I'm not sure we would be happy to
>> apply a workaround for one specific source.
>>
>> I would like to hear opinions before giving it a shot.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>
>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Xiao Li
+1

On Mon, Jun 4, 2018 at 12:44 PM Henry Robinson  wrote:

> +1 (non-binding)
>
> On 4 June 2018 at 11:15, Bryan Cutler  wrote:
>
>> +1
>>
>> On Mon, Jun 4, 2018 at 10:18 AM, Joseph Bradley 
>> wrote:
>>
>>> +1
>>>
>>> On Mon, Jun 4, 2018 at 10:16 AM, Mark Hamstra 
>>> wrote:
>>>
 +1

 On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin 
 wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.3.1.
>
> Given that I expect at least a few people to be busy with Spark Summit
> next week, I'm taking the liberty of setting an extended voting period.
> The vote will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
>
> It passes with a majority of +1 votes, which must include at least 3
> +1 votes from the PMC.
>
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
> https://github.com/apache/spark/tree/v2.3.1-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1272/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>
> The list of bug fixes going into 2.3.1 can be found at the following
> URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload, running it on this release candidate, and
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
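>
> For the Java/Scala route, a minimal build.sbt sketch (the resolver URL is
> the staging repository above; the Scala and artifact versions here are
> assumptions):
>
>   // RC artifacts in the staging repo carry the final version string
>   resolvers += "Apache Spark Staging (2.3.1 RC4)" at
>     "https://repository.apache.org/content/repositories/orgapachespark-1272/"
>   scalaVersion := "2.11.12"
>   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"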
>
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
>
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>>
>>
>>
>


[SS] Invalid call to dataType on unresolved object

2018-06-04 Thread Lalwani, Jayesh
This is possibly a bug introduced in 2.3.0.

I have this code that I run in spark-shell


spark.readStream.format("socket").option("host", "localhost").option("port", ).load().toDF("employeeId").
     | withColumn("swipeTime", expr("current_timestamp()")).
     | createTempView("employeeSwipe")

spark.table("employeeSwipe").writeStream.outputMode("append").format("console").start


I run it in 2.2.1 and I get this:

scala> spark.readStream.format("socket").option("host", "localhost").option("port", ).load().toDF("employeeId").
     | withColumn("swipeTime", expr("current_timestamp()")).
     | createTempView("employeeSwipe")

18/06/04 18:20:27 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

18/06/04 18:20:28 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.

scala> spark.table("employeeSwipe").writeStream.outputMode("append").format("console").start

res1: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@24489d12

scala> -------------------------------------------
Batch: 0
-------------------------------------------
+----------+--------------------+
|employeeId|           swipeTime|
+----------+--------------------+
|         1|2018-06-04 18:21:...|
+----------+--------------------+

I run the same in 2.3.0, and I get this:

scala> spark.readStream.format("socket").option("host", "localhost").option("port", ).load().toDF("employeeId").
     | withColumn("swipeTime", expr("current_timestamp()")).
     | createTempView("employeeSwipe")

2018-06-04 18:37:12 WARN  TextSocketSourceProvider:66 - The socket source should not be used for production applications! It does not support recovery.

2018-06-04 18:37:15 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException

scala>

scala> spark.table("employeeSwipe").writeStream.outputMode("append").format("console").start

res1: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5a503cf0

scala> 2018-06-04 18:37:28 ERROR MicroBatchExecution:91 - Query [id = 10f792b2-89fd-4d98-b2e7-f7531da385e4, runId = 3b131ecf-a622-4b54-9295-ff06e9fc2566] terminated with error

org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'swipeTime
    at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
    at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
    at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.immutable.List.map(List.scala:285)
    at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:157)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:447)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
    at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon
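
One apparent workaround, sketched below on the assumption that only the
spark.table round trip through the temp view hits the unresolved attribute
(the port is a placeholder, since the archive stripped the real value): keep
a reference to the DataFrame and write the stream from it directly.

    val df = spark.readStream.format("socket").
      option("host", "localhost").option("port", 9999).  // placeholder port
      load().toDF("employeeId").
      withColumn("swipeTime", expr("current_timestamp()"))

    df.writeStream.outputMode("append").format("console").start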

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Henry Robinson
+1 (non-binding)

On 4 June 2018 at 11:15, Bryan Cutler  wrote:

> +1
>
> On Mon, Jun 4, 2018 at 10:18 AM, Joseph Bradley 
> wrote:
>
>> +1
>>
>> On Mon, Jun 4, 2018 at 10:16 AM, Mark Hamstra 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.3.1.

 Given that I expect at least a few people to be busy with Spark Summit
 next week, I'm taking the liberty of setting an extended voting period.
 The vote will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).

 It passes with a majority of +1 votes, which must include at least 3 +1
 votes from the PMC.

 [ ] +1 Release this package as Apache Spark 2.3.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
 https://github.com/apache/spark/tree/v2.3.1-rc4

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1272/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/

 The list of bug fixes going into 2.3.1 can be found at the following
 URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12342432

 FAQ

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload, running it on this release candidate, and
 reporting any regressions.

 If you're working in PySpark, you can set up a virtual env, install
 the current RC, and see if anything important breaks. In Java/Scala,
 you can add the staging repository to your project's resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out-of-date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 2.3.1?
 ===

 The current list of open tickets targeted at 2.3.1 can be found at:
 https://s.apache.org/Q3Uo

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==
 But my bug isn't fixed?
 ==

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not been correctly targeted please ping me or a committer to
 help target the issue.


 --
 Marcelo

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>>
>
>


Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Wenchen Fan
+1

On Tue, Jun 5, 2018 at 1:20 AM, Henry Robinson  wrote:

> +1
>
> (I hope there will be a fuller design document to review, since the SPIP
> is really light on details).
>
> On 4 June 2018 at 10:17, Joseph Bradley  wrote:
>
>> +1
>>
>> On Sun, Jun 3, 2018 at 9:59 AM, Weichen Xu 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Jun 1, 2018 at 3:41 PM, Xiao Li  wrote:
>>>
 +1

 2018-06-01 15:41 GMT-07:00 Xingbo Jiang :

> +1
>
> 2018-06-01 9:21 GMT-07:00 Xiangrui Meng :
>
>> Hi all,
>>
>> I want to call for a vote of SPARK-24374. It introduces a
>> new execution mode to Spark, which would help both integration with
>> external DL/AI frameworks and MLlib algorithm performance. This is one of
>> the follow-ups from a previous discussion on dev@.
>>
>> The vote will be up for the next 72 hours. Please reply with your
>> vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following
>> technical reasons.
>>
>> Best,
>> Xiangrui
>> --
>>
>> Xiangrui Meng
>>
>> Software Engineer
>>
>> Databricks Inc.
>>
>
>

>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>>
>
>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Bryan Cutler
+1

On Mon, Jun 4, 2018 at 10:18 AM, Joseph Bradley 
wrote:

> +1
>
> On Mon, Jun 4, 2018 at 10:16 AM, Mark Hamstra 
> wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.3.1.
>>>
>>> Given that I expect at least a few people to be busy with Spark Summit
>>> next week, I'm taking the liberty of setting an extended voting period.
>>> The vote will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
>>>
>>> It passes with a majority of +1 votes, which must include at least 3 +1
>>> votes from the PMC.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.3.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
>>> https://github.com/apache/spark/tree/v2.3.1-rc4
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1272/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>>>
>>> The list of bug fixes going into 2.3.1 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload, running it on this release candidate, and
>>> reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env, install
>>> the current RC, and see if anything important breaks. In Java/Scala,
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.3.1?
>>> ===
>>>
>>> The current list of open tickets targeted at 2.3.1 can be found at:
>>> https://s.apache.org/Q3Uo
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>


Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Henry Robinson
+1

(I hope there will be a fuller design document to review, since the SPIP is
really light on details).

On 4 June 2018 at 10:17, Joseph Bradley  wrote:

> +1
>
> On Sun, Jun 3, 2018 at 9:59 AM, Weichen Xu 
> wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 3:41 PM, Xiao Li  wrote:
>>
>>> +1
>>>
>>> 2018-06-01 15:41 GMT-07:00 Xingbo Jiang :
>>>
 +1

 2018-06-01 9:21 GMT-07:00 Xiangrui Meng :

> Hi all,
>
> I want to call for a vote of SPARK-24374. It introduces a
> new execution mode to Spark, which would help both integration with
> external DL/AI frameworks and MLlib algorithm performance. This is one of
> the follow-ups from a previous discussion on dev@.
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following
> technical reasons.
>
> Best,
> Xiangrui
> --
>
> Xiangrui Meng
>
> Software Engineer
>
> Databricks Inc.
>


>>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>


Unsubscribe

2018-06-04 Thread Al Pivonka


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Joseph Bradley
+1

On Mon, Jun 4, 2018 at 10:16 AM, Mark Hamstra 
wrote:

> +1
>
> On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.3.1.
>>
>> Given that I expect at least a few people to be busy with Spark Summit
>> next week, I'm taking the liberty of setting an extended voting period.
>> The vote will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
>>
>> It passes with a majority of +1 votes, which must include at least 3 +1
>> votes from the PMC.
>>
>> [ ] +1 Release this package as Apache Spark 2.3.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
>> https://github.com/apache/spark/tree/v2.3.1-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1272/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>>
>> The list of bug fixes going into 2.3.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload, running it on this release candidate, and
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env, install
>> the current RC, and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.3.1?
>> ===
>>
>> The current list of open tickets targeted at 2.3.1 can be found at:
>> https://s.apache.org/Q3Uo
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.



Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Joseph Bradley
+1

On Sun, Jun 3, 2018 at 9:59 AM, Weichen Xu 
wrote:

> +1
>
> On Fri, Jun 1, 2018 at 3:41 PM, Xiao Li  wrote:
>
>> +1
>>
>> 2018-06-01 15:41 GMT-07:00 Xingbo Jiang :
>>
>>> +1
>>>
>>> 2018-06-01 9:21 GMT-07:00 Xiangrui Meng :
>>>
 Hi all,

 I want to call for a vote of SPARK-24374. It introduces a
 new execution mode to Spark, which would help both integration with
 external DL/AI frameworks and MLlib algorithm performance. This is one of
 the follow-ups from a previous discussion on dev@.

 The vote will be up for the next 72 hours. Please reply with your vote:

 +1: Yeah, let's go forward and implement the SPIP.
 +0: Don't really care.
 -1: I don't think this is a good idea because of the following
 technical reasons.

 Best,
 Xiangrui
 --

 Xiangrui Meng

 Software Engineer

 Databricks Inc.

>>>
>>>
>>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.



Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Mark Hamstra
+1

On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.3.1.
>
> Given that I expect at least a few people to be busy with Spark Summit next
> week, I'm taking the liberty of setting an extended voting period. The vote
> will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
>
> It passes with a majority of +1 votes, which must include at least 3 +1
> votes from the PMC.
>
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
> https://github.com/apache/spark/tree/v2.3.1-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1272/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>
> The list of bug fixes going into 2.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload, running it on this release candidate, and
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
>
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: TextSocketMicroBatchReader no longer supports nc utility

2018-06-04 Thread Joseph Torres
I tend to agree that this is a bug. It's kinda silly that nc does this, but
a socket connector that doesn't work with netcat will surely seem broken to
users. It wouldn't be a huge change to defer opening the socket until a
read is actually required.
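
For what it's worth, the deferred-open idea could look roughly like this (a
sketch of the pattern only, not TextSocketMicroBatchReader's actual code):

    import java.io.{BufferedReader, InputStreamReader}
    import java.net.Socket

    // Connect lazily: a short-lived reader that only reports the schema and
    // never reads data will never open a socket, so nc is left undisturbed.
    class DeferredSocketReader(host: String, port: Int) {
      private lazy val reader = new BufferedReader(
        new InputStreamReader(new Socket(host, port).getInputStream))
      def readLine(): String = reader.readLine()  // first call connects
    }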

On Sun, Jun 3, 2018 at 9:55 PM, Jungtaek Lim  wrote:

> Hi devs,
>
> Not sure I'll hear back soon since Spark Summit is just around the
> corner, but I just want to post this and wait.
>
> While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits
> before reading actual data, so the query also exits with an error.
>
> The reason is that a temporary reader is launched to read the schema, then
> closed, and a new reader is re-opened. While a reliable socket server
> should handle this without any issue, the nc command normally can't handle
> multiple connections and simply exits when the temporary reader is closed.
>
> I would like to file an issue and contribute a fix if we think this is a
> bug (otherwise we would need to replace the nc utility with another one,
> maybe our own implementation?), but I'm not sure we would be happy to
> apply a workaround for one specific source.
>
> I would like to hear opinions before giving it a shot.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread John Zhuge
+1

On Sun, Jun 3, 2018 at 6:12 PM, Hyukjin Kwon  wrote:

> +1
>
> On Sun, Jun 3, 2018 at 9:25 PM, Ricardo Almeida  wrote:
>
>> +1 (non-binding)
>>
>> On 3 June 2018 at 09:23, Dongjoon Hyun  wrote:
>>
>>> +1
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Jun 2, 2018 at 8:09 PM, Denny Lee  wrote:
>>>
 +1

 On Sat, Jun 2, 2018 at 4:53 PM Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> I'll give that a try, but I'll still have to figure out what to do if
> none of the release builds work with hadoop-aws, since Flintrock deploys
> Spark release builds to set up a cluster. Building Spark is slow, so we
> only do it if the user specifically requests a Spark version by git hash.
> (This is basically how spark-ec2 did things, too.)
>
>
> On Sat, Jun 2, 2018 at 6:54 PM Marcelo Vanzin 
> wrote:
>
>> If you're building your own Spark, definitely try the hadoop-cloud
>> profile. Then you don't even need to pull anything at runtime,
>> everything is already packaged with Spark.
>>
>> On Fri, Jun 1, 2018 at 6:51 PM, Nicholas Chammas
>>  wrote:
>> > pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work for me
>> > either (even building with -Phadoop-2.7). I guess I’ve been relying on an
>> > unsupported pattern and will need to figure something else out going
>> > forward in order to use s3a://.
>> >
>> >
>> > On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin 
>> wrote:
>> >>
>> >> I have personally never tried to include hadoop-aws that way. But at
>> >> the very least, I'd try to use the same version of Hadoop as the Spark
>> >> build (2.7.3 IIRC). I don't really expect a different version to work,
>> >> and if it did in the past it definitely was not by design.
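>> >>
>> >> For example, a build.sbt sketch that pins hadoop-aws to the Hadoop
>> >> version bundled with the Spark build (2.7.3 is the unverified "IIRC"
>> >> figure above):
>> >>
>> >>   // keep hadoop-aws in lockstep with Spark's bundled hadoop-* jars
>> >>   libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"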
>> >>
>> >> On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas
>> >>  wrote:
>> >> > Building with -Phadoop-2.7 didn’t help, and if I remember correctly,
>> >> > building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0 release,
>> >> > so it appears something has changed since then.
>> >> >
>> >> > I wasn’t familiar with -Phadoop-cloud, but I can try that.
>> >> >
>> >> > My goal here is simply to confirm that this release of Spark works
>> >> > with hadoop-aws like past releases did, particularly for Flintrock
>> >> > users who use Spark with S3A.
>> >> >
>> >> > We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop
>> >> > builds with every Spark release. If the -hadoop2.7 release build won’t
>> >> > work with hadoop-aws anymore, are there plans to provide a new build
>> >> > type that will?
>> >> >
>> >> > Apologies if the question is poorly formed. I’m batting a bit outside
>> >> > my league here. Again, my goal is simply to confirm that I/my users
>> >> > still have a way to use s3a://. In the past, that way was simply to
>> >> > call pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4 or
>> >> > something very similar. If that will no longer work, I’m trying to
>> >> > confirm that the change of behavior is intentional or acceptable (as
>> >> > a review for the Spark project) and figure out what I need to change
>> >> > (as due diligence for Flintrock’s users).
>> >> >
>> >> > Nick
>> >> >
>> >> >
>> >> > On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin 
>> >> > wrote:
>> >> >>
>> >> >> Using the hadoop-aws package is probably going to be a little more
>> >> >> complicated than that. The best bet is to use a custom build of Spark
>> >> >> that includes it (use -Phadoop-cloud). Otherwise you're probably
>> >> >> looking at some nasty dependency issues, especially if you end up
>> >> >> mixing different versions of Hadoop.
>> >> >>
>> >> >> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas
>> >> >>  wrote:
>> >> >> > I was able to successfully launch a Spark cluster on EC2 at 2.3.1
>> >> >> > RC4 using Flintrock. However, trying to load the hadoop-aws package
>> >> >> > gave me some errors.
>> >> >> >
>> >> >> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
>> >> >> >
>> >> >> > 
>> >> >> >
>> >> >> > :: problems summary ::
>> >> >> >  WARNINGS
>> >> >> > [NOT FOUND  ] com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
>> >> >> >  local-m2-cache: tried
>> >> >> > file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar