[GitHub] spark pull request #21055: [SPARK-23693][SQL] Functions to generate UUIDs

2018-04-12 Thread tashoyan
GitHub user tashoyan opened a pull request:

https://github.com/apache/spark/pull/21055

[SPARK-23693][SQL] Functions to generate UUIDs

## What changes were proposed in this pull request?

The following functions are implemented and made available in the `functions` 
object of the SQL API:

- time_based_uuid()
- random_based_uuid()

UUIDs are generated with the help of 
[java-uuid-generator](https://github.com/cowtowncoder/java-uuid-generator).

This PR replaces a custom random-based UUID generator that was previously used 
in some parts of the code. In addition, it provides a new function for 
time-based UUIDs.
For backward compatibility, the new `random_based_uuid()` function produces 
the same UUID values on retries over the same data set. Thus, the new 
function is consistent with the legacy `uuid()` function.
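
A minimal usage sketch (assuming the new functions are exposed in 
`org.apache.spark.sql.functions` as described above; the DataFrame and column 
names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{random_based_uuid, time_based_uuid}

val spark = SparkSession.builder().appName("uuid-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("alice", "bob").toDF("name")

// Attach a random-based and a time-based UUID column to each row.
val withIds = df
  .withColumn("random_id", random_based_uuid())
  .withColumn("time_id", time_based_uuid())

withIds.show(truncate = false)
```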

## How was this patch tested?

Unit tests on the new functions as well as on SQL expressions implementing 
these functions.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tashoyan/spark SPARK-23693

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21055.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21055


commit 37fb2f5730fdf52987873367e057fce48810a6c3
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-10T20:32:22Z

[SPARK-23693][SQL] Implement time_based_uuid() function

commit df5124aa1edc047fe2977bcf8eb2e1f6a1e842a1
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-11T05:57:22Z

Follow updates in the contract

commit 71481e55f4d5ca0547280c958bef625d2e9373b1
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-11T20:50:28Z

[SPARK-23693][SQL] Implement random_based_uuid() function

commit 9e45a87a90d81197e677790b7b94c8526d29f635
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-11T21:08:54Z

[SPARK-23693][SQL] Refactor: Extract common functions for UUID SQL 
expressions to an abstract superclass

commit 931a737be7c0ee8c0de1da8fcb546e8d87056efc
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-11T21:14:59Z

[SPARK-23693][SQL] Annotate UUID expressions with @ExpressionDescription

commit 1ff072cd0506bd28c0b303d14f4fd7f6c30a69fb
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-11T22:18:41Z

[SPARK-23693][SQL] Fix random-based UUID: must give same values for retries 
on the same data frame

commit 19c91c5ade5330c4127112dc902da3996e42fffd
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-11T22:21:43Z

[SPARK-23693][SQL] Switch to new implementation of random-based UUIDs

commit 117538be5c87017917c9e9c1ea25432142393925
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-11T22:25:08Z

[SPARK-23693][SQL] Fix code style violation

commit ee850d9244ba4bcc8e8a21fd669a98082a9be08e
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-11T22:28:06Z

[SPARK-23693][SQL] Cleanup code

commit 1f3b700e90c512afb31f28eec2d07462c633cbd7
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-11T22:33:37Z

[SPARK-23693][SQL] For UUID functions, document behavior on retries

commit bcb086f9940da2b26a34ae73553fc6924d587ec2
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-11T22:40:02Z

[SPARK-23693][SQL] Remove unused code related to old UUID implementation

commit 878c0070cbbdce8ef6c2690605dab0362eb8027e
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-12T08:21:19Z

Merge remote-tracking branch 'upstream/master' into SPARK-23693

commit 542dbdefb4ff99bbb1c7caedab2f23a6914de0f2
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-12T11:14:44Z

[SPARK-23693][SQL] Switch to new implementation of random-based UUIDs

commit d9c560c26cb256693f6e8879f875382d41959a3e
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-12T16:09:33Z

[SPARK-23693][SQL] Javadoc conventions

commit 77e65d8cfa07a5eca7890ffc12bbed917312a35f
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-04-12T16:11:29Z

Merge remote-tracking branch 'upstream/master' into SPARK-23693




---




[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

2018-02-11 Thread tashoyan
Github user tashoyan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20578#discussion_r167442376
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -158,18 +159,30 @@ class FPGrowth @Since("2.2.0") (
   }
 
   private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
+    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
+
     val data = dataset.select($(itemsCol))
-    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[T](0).toArray)
+    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
--- End diff --

An interesting curiosity for me: why does the FPGrowth contract require an 
`Array` of items rather than a `Seq`? First, it is strange for the contract to 
require a specific implementation rather than an interface. Second, this leads 
to redundant `toArray` and back-to-`Seq` transformations. `Seq` would be more 
convenient, as the `Row` class has a `getSeq` method but no `getArray`.
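
For illustration, a small sketch of the conversion in question (variable names 
are hypothetical; `Row` offers `getSeq` but no array accessor, so callers must copy):

```scala
import org.apache.spark.sql.Row

val row: Row = Row(Seq("a", "b", "c"))

val items: Seq[String] = row.getSeq[String](0)  // what Row provides directly
val asArray: Array[String] = items.toArray      // extra copy forced by the Array-based contract
```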


---




[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

2018-02-11 Thread tashoyan
Github user tashoyan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20578#discussion_r167441224
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -158,18 +159,30 @@ class FPGrowth @Since("2.2.0") (
   }
 
   private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
+    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
+
     val data = dataset.select($(itemsCol))
-    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[T](0).toArray)
+    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
--- End diff --

It is not only about caching. The same `ArrayStoreException` occurs if one 
tries to call `collect()` on the items RDD. There is no exception when a 
concrete type like `String` is used instead of `T`. The latter probably 
explains how it worked before: people invoked `dataset.cache()` in their own 
code, where the type parameter of the Dataset is known.


---




[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

2018-02-11 Thread tashoyan
Github user tashoyan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20578#discussion_r167439260
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -158,18 +159,30 @@ class FPGrowth @Since("2.2.0") (
   }
 
   private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
+    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
+
     val data = dataset.select($(itemsCol))
-    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[T](0).toArray)
+    val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
--- End diff --

Yes, it is necessary. Otherwise, calling `cache()` on the items RDD leads to an 
`ArrayStoreException`. It seems that, due to type erasure, instances of 
`Array[Nothing]` are created, but `toArray` attempts to add instances of 
`java.lang.Object`.
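
A minimal, standalone sketch of this kind of erasure problem (a hypothetical 
helper mirroring the `r.getSeq[T](0).toArray` pattern; it is not the actual 
Spark code path):

```scala
import scala.reflect.ClassTag

// Hypothetical helper: convert an untyped Seq to an Array[T].
def toTypedArray[T: ClassTag](xs: Seq[Any]): Array[T] =
  xs.asInstanceOf[Seq[T]].toArray

toTypedArray[String](Seq("a", "b"))  // fine: the ClassTag yields an Array[String]
// toTypedArray(Seq("a", "b"))       // T is inferred as Nothing; as described above,
//                                   // the erased array cannot hold the String elements
//                                   // and the call fails at runtime with ArrayStoreException.
```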


---




[GitHub] spark pull request #20578: [SPARK-23318][ML] FP-growth: WARN FPGrowth: Input...

2018-02-11 Thread tashoyan
GitHub user tashoyan opened a pull request:

https://github.com/apache/spark/pull/20578

[SPARK-23318][ML] FP-growth: WARN FPGrowth: Input data is not cached

## What changes were proposed in this pull request?

Cache the RDD of items in ml.FPGrowth before passing it to mllib.FPGrowth. 
Cache only when the user did not cache the input dataset of transactions. This 
fixes the warning about uncached data emerging from mllib.FPGrowth.
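
A minimal sketch of the caching pattern described above (method and variable 
names are illustrative, not the exact ml.FPGrowth code):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Cache the derived items RDD only if the caller did not already cache the input dataset.
def runWithCaching[T](dataset: Dataset[_], items: RDD[Array[T]])(run: RDD[Array[T]] => Unit): Unit = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  if (handlePersistence) {
    items.persist(StorageLevel.MEMORY_AND_DISK)
  }
  try {
    run(items)  // e.g. hand the RDD over to mllib.FPGrowth
  } finally {
    if (handlePersistence) {
      items.unpersist()
    }
  }
}
```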

## How was this patch tested?

Manually:
1. Run ml.FPGrowthExample - warning is there
2. Apply the fix
3. Run ml.FPGrowthExample again - no warning anymore

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tashoyan/spark SPARK-23318

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20578.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20578


commit d17d3fbee84fcb0072d3030f3118ca18ce783e0c
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-02-10T21:16:51Z

[SPARK-23318][ML]Workaround for 'ArrayStoreException: [Ljava.lang.Object' 
when trying to cache the RDD of items.

commit e0eb8519bf09db12f5d5bc426eaf17d6488e05c1
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-02-11T15:21:39Z

[SPARK-23318][ML] Cache the RDD of items if the user did not cache the 
input dataset of transactions. This should eliminate the warning about uncached 
data in mllib.FPGrowth.

commit 374a49c2bf447f3ddfed655f6eda9c8cd5f45285
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-02-11T15:23:58Z

Merge remote-tracking branch 'upstream/master' into SPARK-23318




---




[GitHub] spark issue #20349: [Minor][DOC] Fix the path to the examples jar

2018-01-22 Thread tashoyan
Github user tashoyan commented on the issue:

https://github.com/apache/spark/pull/20349
  
@jerryshao Not found yet


---




[GitHub] spark pull request #20349: Fix the path to the examples jar

2018-01-22 Thread tashoyan
GitHub user tashoyan opened a pull request:

https://github.com/apache/spark/pull/20349

Fix the path to the examples jar

## What changes were proposed in this pull request?

The examples jar file is now in the ./examples/jars directory of the Spark 
distribution.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tashoyan/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20349.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20349


commit 20d502fd2a271fcec1614a909c3e89934e81582e
Author: Arseniy Tashoyan <tashoyan@...>
Date:   2018-01-22T08:25:17Z

Fix the path to the examples jar




---




[GitHub] spark pull request #19711: [SPARK-22471][SQL] SQLListener consumes much memo...

2017-11-14 Thread tashoyan
Github user tashoyan closed the pull request at:

https://github.com/apache/spark/pull/19711


---




[GitHub] spark issue #19711: [SPARK-22471][SQL] SQLListener consumes much memory caus...

2017-11-10 Thread tashoyan
Github user tashoyan commented on the issue:

https://github.com/apache/spark/pull/19711
  
Retest this please.


---




[GitHub] spark issue #19711: [SPARK-22471][SQL] SQLListener consumes much memory caus...

2017-11-10 Thread tashoyan
Github user tashoyan commented on the issue:

https://github.com/apache/spark/pull/19711
  
Corrupted build node?


---




[GitHub] spark pull request #19700: [SPARK-22471][SQL] SQLListener consumes much memo...

2017-11-10 Thread tashoyan
Github user tashoyan closed the pull request at:

https://github.com/apache/spark/pull/19700


---




[GitHub] spark pull request #19700: [SPARK-22471][SQL] SQLListener consumes much memo...

2017-11-09 Thread tashoyan
Github user tashoyan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19700#discussion_r150100750
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala ---
@@ -101,6 +101,8 @@ class SQLListener(conf: SparkConf) extends SparkListener with Logging {
 
   private val retainedExecutions = conf.getInt("spark.sql.ui.retainedExecutions", 1000)
 
+  private val retainedStages = conf.getInt("spark.ui.retainedStages", 1000)
--- End diff --

Done for branch-2.2


---




[GitHub] spark issue #19700: [SPARK-22471][SQL] SQLListener consumes much memory caus...

2017-11-09 Thread tashoyan
Github user tashoyan commented on the issue:

https://github.com/apache/spark/pull/19700
  
Done for branch-2.2: #19711 


---




[GitHub] spark pull request #19711: [SPARK-22471][SQL] SQLListener consumes much memo...

2017-11-09 Thread tashoyan
GitHub user tashoyan opened a pull request:

https://github.com/apache/spark/pull/19711

[SPARK-22471][SQL] SQLListener consumes much memory causing OutOfMemoryError

## What changes were proposed in this pull request?

This PR addresses 
[SPARK-22471](https://issues.apache.org/jira/browse/SPARK-22471). The modified 
version of `SQLListener` respects the `spark.ui.retainedStages` setting and 
keeps the number of tracked stages within the specified limit. The hash map 
`_stageIdToStageMetrics` does not outgrow the limit, so overall memory 
consumption no longer grows over time.

This is a 2.2-compatible fix. It may be incompatible with 2.3 due to #19681.

## How was this patch tested?

A new unit test covers this fix - see `SQLListenerMemorySuite.scala`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tashoyan/spark SPARK-22471-branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19711.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19711


commit 08b7c82be3effe094e40618fe992d3c50c3e2d98
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-11-08T15:41:36Z

Add reproducer for the issue SPARK-22471

commit 2502a7e9846e359d793c485db1d3abef8a2c1e12
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-11-08T15:41:54Z

Add fix for the issue SPARK-22471

commit 2a13530db9ec611b6ee55fc9d79bd8aac5c01862
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-11-08T20:39:02Z

Remove debug print and irrelevant checks. Add a reference to the issue.

commit 98f7b23fb52ffd11ae92716c871e5aa06ea61428
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-11-08T20:47:44Z

Remove debug print and irrelevant checks. Add a reference to the issue.

commit 80755ece91703b3b6436f88e14eb11251ae6678f
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-11-08T21:21:42Z

Collect memory-related tests on SQLListener in the same suite




---




[GitHub] spark issue #19700: [SPARK-22471][SQL] SQLListener consumes much memory caus...

2017-11-09 Thread tashoyan
Github user tashoyan commented on the issue:

https://github.com/apache/spark/pull/19700
  
Well, it would be good to have this quick fix in a 2.2-compatible bugfix 
release, without waiting for 2.3.0.


---




[GitHub] spark pull request #19700: [SPARK-22471][SQL] SQLListener consumes much memo...

2017-11-09 Thread tashoyan
Github user tashoyan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19700#discussion_r150052324
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala ---
@@ -101,6 +101,8 @@ class SQLListener(conf: SparkConf) extends SparkListener with Logging {
 
   private val retainedExecutions = conf.getInt("spark.sql.ui.retainedExecutions", 1000)
 
+  private val retainedStages = conf.getInt("spark.ui.retainedStages", 1000)
--- End diff --

@dongjoon-hyun, it is already documented in the same file, configuration.md:
```
How many stages the Spark UI and status APIs remember before garbage collecting.
This is a target maximum, and fewer elements may be retained in some circumstances.
```
I did not introduce a new parameter; I just reused an existing one.
As for renaming it to `spark.sql.ui.retainedStages`, I believe that should be 
done in a separate pull request, if at all. This parameter is also used in 
other parts of the Spark code, not only in SQL.
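
For reference, a sketch of how an application can tune this existing parameter 
(the value 200 is only an example):

```scala
import org.apache.spark.SparkConf

// Bound the number of stages retained by the UI/listeners to limit memory usage.
val conf = new SparkConf().set("spark.ui.retainedStages", "200")
```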


---




[GitHub] spark issue #19700: [SPARK-22471][SQL] SQLListener consumes much memory caus...

2017-11-09 Thread tashoyan
Github user tashoyan commented on the issue:

https://github.com/apache/spark/pull/19700
  
@vanzin would you like to review?


---




[GitHub] spark pull request #19700: [SPARK-22471][SQL] SQLListener consumes much memo...

2017-11-08 Thread tashoyan
GitHub user tashoyan opened a pull request:

https://github.com/apache/spark/pull/19700

[SPARK-22471][SQL] SQLListener consumes much memory causing OutOfMemoryError

## What changes were proposed in this pull request?

This PR addresses 
[SPARK-22471](https://issues.apache.org/jira/browse/SPARK-22471). The modified 
version of `SQLListener` respects the `spark.ui.retainedStages` setting and 
keeps the number of tracked stages within the specified limit. The hash map 
`_stageIdToStageMetrics` does not outgrow the limit, so overall memory 
consumption no longer grows over time.
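
A minimal sketch of the trimming idea described above (field and method names 
follow the description in this PR; it is not the exact patch):

```scala
import scala.collection.mutable

// Keep at most `retainedStages` entries in the per-stage metrics map,
// dropping the oldest stage IDs first.
class BoundedStageMetrics(retainedStages: Int) {
  private val stageIdToStageMetrics = mutable.HashMap.empty[Int, String]

  def record(stageId: Int, metrics: String): Unit = synchronized {
    stageIdToStageMetrics(stageId) = metrics
    if (stageIdToStageMetrics.size > retainedStages) {
      val toRemove = stageIdToStageMetrics.size - retainedStages
      stageIdToStageMetrics.keys.toSeq.sorted.take(toRemove).foreach(stageIdToStageMetrics.remove)
    }
  }

  def size: Int = synchronized(stageIdToStageMetrics.size)
}
```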

## How was this patch tested?

A new unit test covers this fix - see `SQLListenerMemorySuite.scala`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tashoyan/spark SPARK-22471

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19700.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19700


commit 0388f6ce50d568a0493e7959ec005ee5afc20bd0
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-11-08T15:41:36Z

Add reproducer for the issue SPARK-22471

commit 42e80272cf0926f0fd978e6b7617685987d8fc93
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-11-08T15:41:54Z

Add fix for the issue SPARK-22471

commit 2f793ad1f001bc58dd09fa4eaec6ae423445f86f
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-11-08T20:39:02Z

Remove debug print and irrelevant checks. Add a reference to the issue.

commit 4780d95b7d58df741eb8d5756c8109fc7dbfb457
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-11-08T20:47:44Z

Remove debug print and irrelevant checks. Add a reference to the issue.

commit 79c83a715d4a36ad00ff3888e8e2953fcc163d17
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-11-08T21:21:42Z

Collect memory-related tests on SQLListener in the same suite




---




[GitHub] spark pull request #18885: [SPARK-21668][CORE] Ability to run driver program...

2017-08-14 Thread tashoyan
Github user tashoyan closed the pull request at:

https://github.com/apache/spark/pull/18885


---



[GitHub] spark issue #18885: [SPARK-21668][CORE] Ability to run driver programs withi...

2017-08-14 Thread tashoyan
Github user tashoyan commented on the issue:

https://github.com/apache/spark/pull/18885
  
So the working configuration is (see the sketch below):
* set `spark.driver.host` to the IP address of the host machine
* set `spark.driver.bindAddress` to the IP address of the container

I tried this configuration with Spark 2.1.1 and 2.2.0. It works fine!
@vanzin thank you for pointing me to the right issue. I think we can close 
this PR and mark 
[SPARK-21668](https://issues.apache.org/jira/browse/SPARK-21668) as a duplicate 
of [SPARK-4563](https://issues.apache.org/jira/browse/SPARK-4563).
The working Docker image is here: 
[docker-spark-submit](https://github.com/tashoyan/docker-spark-submit).
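
For reference, a minimal sketch of these two settings as they could be passed 
programmatically (the addresses are placeholders for a concrete environment):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("driver-in-container")
  // Address advertised to executors: must be reachable from the cluster (host machine IP).
  .set("spark.driver.host", "192.168.1.10")
  // Address the driver binds to inside the container (container IP).
  .set("spark.driver.bindAddress", "172.17.0.2")
```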



---



[GitHub] spark issue #18885: [SPARK-21668][CORE] Ability to run driver programs withi...

2017-08-10 Thread tashoyan
Github user tashoyan commented on the issue:

https://github.com/apache/spark/pull/18885
  
@vanzin @srowen @jerryshao would you review, please?


---



[GitHub] spark pull request #18885: [SPARK-21668][CORE] Ability to run driver program...

2017-08-08 Thread tashoyan
GitHub user tashoyan opened a pull request:

https://github.com/apache/spark/pull/18885

[SPARK-21668][CORE] Ability to run driver programs within a container

## What changes were proposed in this pull request?

When running inside a container, the driver program advertises a driver host 
set to the container IP address. This IP address is visible only on the machine 
where the container is running, so Spark executors running on other machines 
cannot communicate with the driver program.
Now the driver program may use the standard SPARK_PUBLIC_DNS variable in order 
to expose the driver host to executors. Just declare 
SPARK_PUBLIC_DNS=<address of the host machine> in spark-env.sh within the 
container. Thanks to exposed ports, all requests from executors are forwarded 
to the driver program within the container.

## How was this patch tested?

I have tested this modification manually. I have a Spark cluster on 3 
machines. I run my Spark application in a Docker container; the host machine 
belongs to the same network as my Spark cluster.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tashoyan/spark SPARK-21668

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18885.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18885


commit 6b16afb3dbfed3c745820bc3e727b4c9a13017f7
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-08-08T11:58:45Z

Driver program should advertise the hostname specified in SPARK_PUBLIC_DNS 
if specified

commit 12c0b901ea109f7389d1c83afdc16817c2cb0cfd
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-08-08T12:08:28Z

Worker keeps driver on the host provided in the driverUrl. It may differ 
from the original spark.driver.host value if the driver specified 
SPARK_PUBLIC_DNS.

commit bd7399c1552768d54ab7c8cfc1dfeb27667c7f95
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-08-08T12:11:27Z

When starting executor, take SPARK_PUBLIC_DNS into account for the driver 
url

commit e16d334eb58efe2375f4c85d77739ca3bacccecd
Author: Arseniy Tashoyan <tasho...@gmail.com>
Date:   2017-08-08T12:49:54Z

Honor checkstyle




---