[jira] [Assigned] (SPARK-12480) add Hash expression that can calculate hash value for a group of expressions

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12480:


Assignee: Apache Spark

> add Hash expression that can calculate hash value for a group of expressions
> 
>
> Key: SPARK-12480
> URL: https://issues.apache.org/jira/browse/SPARK-12480
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>





[jira] [Assigned] (SPARK-12480) add Hash expression that can calculate hash value for a group of expressions

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12480:


Assignee: (was: Apache Spark)

> add Hash expression that can calculate hash value for a group of expressions
> 
>
> Key: SPARK-12480
> URL: https://issues.apache.org/jira/browse/SPARK-12480
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>





[jira] [Commented] (SPARK-12480) add Hash expression that can calculate hash value for a group of expressions

2015-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068268#comment-15068268
 ] 

Apache Spark commented on SPARK-12480:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10435

> add Hash expression that can calculate hash value for a group of expressions
> 
>
> Key: SPARK-12480
> URL: https://issues.apache.org/jira/browse/SPARK-12480
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>





[jira] [Created] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"

2015-12-22 Thread Paulo Magalhaes (JIRA)
Paulo Magalhaes created SPARK-12479:
---

 Summary:  sparkR collect on GroupedData  throws R error "missing 
value where TRUE/FALSE needed"
 Key: SPARK-12479
 URL: https://issues.apache.org/jira/browse/SPARK-12479
 Project: Spark
  Issue Type: Bug
  Components: R, SparkR
Affects Versions: 1.5.1
Reporter: Paulo Magalhaes


sparkR collect on GroupedData  throws "missing value where TRUE/FALSE needed"

Spark Version: 1.5.1
R Version: 3.2.2

I tracked down the root cause of this exception to a specific key for which 
the hashCode could not be calculated.

The following code recreates the problem when run in SparkR:

hashCode <- getFromNamespace("hashCode","SparkR")
hashCode("bc53d3605e8a5b7de1e8e271c2317645")
Error in if (value > .Machine$integer.max) { :
  missing value where TRUE/FALSE needed

I went one step further and realised that the problem happens because of the 
bitwise shift below returning NA.

bitwShiftL(-1073741824,1)

where bitwShiftL is a base R function. 
I believe the bitwShiftL function is working as it is supposed to; therefore, 
my PR will fix the issue in the SparkR package.
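
For reference, SparkR's hashCode helper appears to mirror Java's String.hashCode 
(h = 31*h + c), which relies on 32-bit integer arithmetic silently wrapping on 
overflow; R's bitwShiftL instead returns NA once the shifted value no longer fits 
in an integer, which is what trips the (value > .Machine$integer.max) check. A 
small Java sketch of the arithmetic being mimicked (an illustration only, not 
SparkR code):

{code}
public class HashCodeDemo {
  // Java's String.hashCode(): h = 31 * h + c over 32-bit ints, which silently
  // wrap around on overflow instead of producing NA or an error.
  static int javaStringHashCode(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
      h = 31 * h + s.charAt(i);  // equivalent to (h << 5) - h + c
    }
    return h;
  }

  public static void main(String[] args) {
    String key = "bc53d3605e8a5b7de1e8e271c2317645";
    // Both lines print the same (possibly negative) 32-bit value.
    System.out.println(javaStringHashCode(key));
    System.out.println(key.hashCode());
  }
}
{code}

The R port has to emulate this wrap-around explicitly, since 
bitwShiftL(-1073741824, 1) yields NA rather than the wrapped value.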




[jira] [Created] (SPARK-12480) add Hash expression that can calculate hash value for a group of expressions

2015-12-22 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-12480:
---

 Summary: add Hash expression that can calculate hash value for a 
group of expressions
 Key: SPARK-12480
 URL: https://issues.apache.org/jira/browse/SPARK-12480
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Wenchen Fan







[jira] [Assigned] (SPARK-12480) add Hash expression that can calculate hash value for a group of expressions

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12480:


Assignee: Apache Spark

> add Hash expression that can calculate hash value for a group of expressions
> 
>
> Key: SPARK-12480
> URL: https://issues.apache.org/jira/browse/SPARK-12480
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>





[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Description: 
The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all interfaces on the 
file server host, the _Spark_ service attempts to call back to the default IP 
of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
* A connection is made from the driver host to the _Spark_ service
** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
{{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the driver host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the driver host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

{code:title=Stacktrace|borderStyle=solid}
15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
172.30.0.3): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.
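
A possible stopgap, assuming the file server derives its advertised address from 
Spark's local-address detection (which the documented {{SPARK_LOCAL_IP}} environment 
variable overrides) rather than from {{spark.driver.host}}: export 
{{SPARK_LOCAL_IP=172.30.0.2}} on the driver machine before launching. A driver-side 
sanity check along those lines, sketched in Java:

{code:title=Sanity check (sketch)|borderStyle=solid}
// Assumption: the file server advertises Spark's detected local address, which
// SPARK_LOCAL_IP overrides; warn if it disagrees with spark.driver.host.
SparkConf conf = new SparkConf()
    .set("spark.driver.host", "172.30.0.2")
    .set("spark.driver.port", "50003")
    .set("spark.fileserver.port", "50005");
String driverHost = conf.get("spark.driver.host");
String localIpOverride = System.getenv("SPARK_LOCAL_IP");
if (localIpOverride == null || !localIpOverride.equals(driverHost)) {
    System.err.println("SPARK_LOCAL_IP (" + localIpOverride + ") does not match "
        + "spark.driver.host (" + driverHost + "); executors may be handed a file "
        + "server URL on the wrong interface.");
}
{code}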

\\

\\
Original report follows:

I initially inquired about this here: 

[jira] [Resolved] (SPARK-12456) Add ExpressionDescription to misc functions

2015-12-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12456.
-
   Resolution: Fixed
 Assignee: Xiu(Joe) Guo
Fix Version/s: 2.0.0

> Add ExpressionDescription to misc functions
> ---
>
> Key: SPARK-12456
> URL: https://issues.apache.org/jira/browse/SPARK-12456
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiu(Joe) Guo
> Fix For: 2.0.0
>
>





[jira] [Commented] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068562#comment-15068562
 ] 

Kyle Sutton commented on SPARK-12482:
-

Wasn't able to reopen the original one, so tried cloning to preserve as much 
info from the original as possible.  Should I still create a new one?

> Spark fileserver not started on same IP as configured in spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.5.2
>Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address 
> configured through {{spark.driver.host}} still exists in _Spark 1.5.2_
> The problem is that, while the file server is listening on all interfaces on 
> the file server host, the _Spark_ service attempts to call back to the default 
> IP of the host, to which it may or may not have connectivity.
> For instance, the following setup causes a 
> {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact 
> the _Spark_ driver host for a JAR:
> * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
> connection IP of {{172.30.0.2}}
> * _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
> * A connection is made from the driver host to the _Spark_ service
> ** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
> {{172.30.0.2}}
> ** {{spark.driver.port}} is set to {{50003}}
> ** {{spark.fileserver.port}} is set to {{50005}}
> * Locally (on the driver host), the following listeners are active:
> ** {{0.0.0.0:50005}}
> ** {{172.30.0.2:50003}}
> * The _Spark_ service calls back to the file server host for a JAR file using 
> the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
> * The _Spark_ service, being on a different network than the driver host, 
> cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
> file server
> ** A {{netstat}} on the _Spark_ service host will show the connection to the 
> file server host as being in {{SYN_SENT}} state until the process gives up 
> trying to connect
> {code:title=Driver|borderStyle=solid}
> SparkConf conf = new SparkConf()
> .setMaster("spark://172.30.0.3:7077")
> .setAppName("TestApp")
> .set("spark.driver.host", "172.30.0.2")
> .set("spark.driver.port", "50003")
> .set("spark.fileserver.port", "50005");
> JavaSparkContext sc = new JavaSparkContext(conf);
> sc.addJar("target/code.jar");
> {code}
> {code:title=Stacktrace|borderStyle=solid}
> 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> 172.30.0.3): java.net.SocketTimeoutException: connect timed out
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> 

[jira] [Comment Edited] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068572#comment-15068572
 ] 

Kyle Sutton edited comment on SPARK-6476 at 12/22/15 7:14 PM:
--

The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all interfaces on the 
file server host, the _Spark_ service attempts to call back to the default IP 
of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
* A connection is made from the driver host to the _Spark_ service
** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
{{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the driver host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the driver host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

{code:title=Driver|borderStyle=solid}
SparkConf conf = new SparkConf()
.setMaster("spark://172.30.0.3:7077")
.setAppName("TestApp")
.set("spark.driver.host", "172.30.0.2")
.set("spark.driver.port", "50003")
.set("spark.fileserver.port", "50005");
JavaSparkContext sc = new JavaSparkContext(conf);
sc.addJar("target/code.jar");
{code}

{code:title=Stacktrace|borderStyle=solid}
15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
172.30.0.3): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

[jira] [Updated] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs

2015-12-22 Thread Ilya Ganelin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Ganelin updated SPARK-12488:
-
Description: 
When running the LDA model and using the describeTopics function, invalid 
values appear in the termID list that is returned.

The example below generates 10 topics on a data set with a vocabulary of 685 terms.

{code}

// Set LDA parameters
val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)

val ldaModel = lda.run(docTermVector)
val distModel = 
ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
{code}

{code}
scala> ldaModel.describeTopics()(0)._1.sorted.reverse
res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 
585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 
569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 
553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 
537, 536, 535, 534, 533, 532, 53...
{code}

{code}
scala> ldaModel.describeTopics()(0)._1.sorted
res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
-1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
-1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
-1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
-1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
-1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
-1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
-563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
-26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 
39, 40, 41, 42, 43, 44, 45, 4...
{code}

  was:
When running the LDA model, and using the describeTopics function, invalid 
values appear in the termID list that is returned:

The below example generated 10 topics on a data set with a vocabulary of 685.

{code}

// Set LDA parameters
val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)

val ldaModel = lda.run(docTermVector)
val distModel = 
ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
{code}

{code}
scala> ldaModel.describeTopics()(0)._1.sorted.reverse
res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 
585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 
569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 
553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 
537, 536, 535, 534, 533, 532, 53...
{code}

{code}
scala> ldaModel.describeTopics()(0)._1.sorted
res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
-1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
-1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
-1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
-1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
-1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
-1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
-563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
-26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 
39, 40, 41, 42, 43, 44, 45, 4...
{code}


> LDA describeTopics() Generates Invalid Term IDs
> ---
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2

[jira] [Resolved] (SPARK-12471) Spark daemons should log their pid in the log file

2015-12-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12471.
-
   Resolution: Fixed
 Assignee: Nong Li  (was: Apache Spark)
Fix Version/s: 2.0.0

> Spark daemons should log their pid in the log file
> --
>
> Key: SPARK-12471
> URL: https://issues.apache.org/jira/browse/SPARK-12471
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 2.0.0
>
>
> This is useful when debugging from the log files without the processes 
> running. This information makes it possible to combine the log files with 
> other system information (e.g. dmesg output)




[jira] [Resolved] (SPARK-12475) Upgrade Zinc from 0.3.5.3 to 0.3.9

2015-12-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12475.

   Resolution: Fixed
 Assignee: Josh Rosen  (was: Apache Spark)
Fix Version/s: 2.0.0

Fixed by my PR for Spark 2.0.0

> Upgrade Zinc from 0.3.5.3 to 0.3.9
> --
>
> Key: SPARK-12475
> URL: https://issues.apache.org/jira/browse/SPARK-12475
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> We should update to the latest version of Zinc in order to match our SBT 
> version.




[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Description: 
The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all interfaces on the 
file server host, the _Spark_ service attempts to call back to the default IP 
of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
* A connection is made from the driver host to the _Spark_ service
** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
{{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the driver host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the driver host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

{code:title=Driver|borderStyle=solid}
SparkConf conf = new SparkConf()
.setMaster("spark://172.30.0.3:7077")
.setAppName("TestApp")
.set("spark.driver.host", "172.30.0.2")
.set("spark.driver.port", "50003")
.set("spark.fileserver.port", "50005");
JavaSparkContext sc = new JavaSparkContext(conf);
sc.addJar("target/code.jar");
{code}

{code:title=Stacktrace|borderStyle=solid}
15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
172.30.0.3): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

[jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-12-22 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068627#comment-15068627
 ] 

Maciej Bryński commented on SPARK-11437:


[~davies]
Are you sure that this patch is OK?

Right now, if I'm creating a DataFrame from an RDD of Rows, there is no schema 
validation, so we can create a schema with wrong types.

{code}
from pyspark.sql.types import *
schema = StructType([StructField("id", IntegerType()), StructField("name", 
IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(id=1, name=None)]
{code}

Even better: columns can change places.
{code}
from pyspark.sql.types import *
schema = StructType([StructField("name", IntegerType())]) 
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(name=1)]
{code}

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
>Assignee: Jason White
> Fix For: 1.6.0
>
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325




[jira] [Created] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12483:
---

 Summary: Data Frame as() does not work in Java
 Key: SPARK-12483
 URL: https://issues.apache.org/jira/browse/SPARK-12483
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
 Environment: Mac El Cap 10.11.2
Java 8
Reporter: Andrew Davidson


The following unit test demonstrates a bug in as(): the column name for aliasDF 
was not changed.

    @Test
    public void bugDataFrameAsTest() {
        DataFrame df = createData();
        df.printSchema();
        df.show();

        DataFrame aliasDF = df.select("id").as("UUID");
        aliasDF.printSchema();
        aliasDF.show();
    }

    DataFrame createData() {
        Features f1 = new Features(1, category1);
        Features f2 = new Features(2, category2);
        ArrayList data = new ArrayList(2);
        data.add(f1);
        data.add(f2);
        // JavaRDD rdd = javaSparkContext.parallelize(Arrays.asList(f1, f2));
        JavaRDD rdd = javaSparkContext.parallelize(data);
        DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
        return df;
    }

This is the output I got (without the Spark log messages):

root
 |-- id: integer (nullable = false)
 |-- labelStr: string (nullable = true)

+---+------------+
| id|    labelStr|
+---+------------+
|  1|       noise|
|  2|questionable|
+---+------------+

root
 |-- id: integer (nullable = false)

+---+
| id|
+---+
|  1|
|  2|
+---+
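
For comparison: aliasing the Column itself (rather than the DataFrame) does rename 
the output column, whereas DataFrame.as() only attaches an alias to the data frame 
as a whole (e.g. for self-joins). A minimal sketch against the 1.5.2 Java API, 
reusing the df built above:

{code}
// Alias the Column, not the DataFrame: select id and rename it to UUID.
DataFrame aliasDF = df.select(df.col("id").as("UUID"));
aliasDF.printSchema();  // the schema should now show UUID instead of id
aliasDF.show();
{code}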





[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-12-22 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068627#comment-15068627
 ] 

Maciej Bryński edited comment on SPARK-11437 at 12/22/15 8:24 PM:
--

[~davies], [~jason.white]
Are you sure that this patch is OK?

In 1.6.0, if I'm creating a DataFrame from an RDD of Rows, there is no schema 
validation, so we can create a schema with wrong types.

{code}
schema = StructType([StructField("id", IntegerType()), StructField("name", 
IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(id=1, name=None)]
{code}

Even better: columns can change places.
{code}
schema = StructType([StructField("name", IntegerType())]) 
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(name=1)]
{code}


was (Author: maver1ck):
[~davies], [~jason.white]
Are you sure that this patch is OK ?

Right now if I'm creating DataFrame from RDD of Rows there is no schema 
validation.
So we can create schema with wrong types.

{code}
schema = StructType([StructField("id", IntegerType()), StructField("name", 
IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(id=1, name=None)]
{code}

Even better. Column can change places.
{code}
schema = StructType([StructField("name", IntegerType())]) 
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(name=1)]
{code}

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
>Assignee: Jason White
> Fix For: 1.6.0
>
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325




[jira] [Assigned] (SPARK-12477) [SQL] Tungsten projection fails for null values in array fields

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12477:


Assignee: Apache Spark

> [SQL] Tungsten projection fails for null values in array fields
> ---
>
> Key: SPARK-12477
> URL: https://issues.apache.org/jira/browse/SPARK-12477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Pierre Borckmans
>Assignee: Apache Spark
>
> Accessing null elements in an array field fails when tungsten is enabled.
> It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.
> Example:
> {code}
> // Array of String
> case class AS( as: Seq[String] )
> val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
> dfAS.registerTempTable("T_AS")
> for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from 
> T_AS").collect.mkString(","))}
> {code}
> With Tungsten disabled:
> {code}
> 0 = [a]
> 1 = [null]
> 2 = [b]
> {code}
> With Tungsten enabled:
> {code}
> 0 = [a]
> 15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
> at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> {code}
> --
> More examples below.
> The following code works in Spark 1.3.1, and in Spark > 1.5 with Tungsten 
> disabled:
> {code}
> // Array of String
> case class AS( as: Seq[String] )
> val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
> dfAS.registerTempTable("T_AS")
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select as[$i] from 
> T_AS").collect.mkString(","))}
> // Array of Int
> case class AI( ai: Seq[Option[Int]] )
> val dfAI = sc.parallelize( Seq( AI ( Seq(Some(1),None,Some(2) ) ) ) ).toDF
> dfAI.registerTempTable("T_AI")
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select ai[$i] from 
> T_AI").collect.mkString(","))}
> // Array of struct[Int,String]
> case class B(x: Option[Int], y: String)
> case class A( b: Seq[B] )
> val df1 = sc.parallelize( Seq( A ( Seq( B(Some(1),"a"),B(Some(2),"b"), 
> B(None, "c"), B(Some(4),null), B(None,null), null ) ) ) ).toDF
> df1.registerTempTable("T1")
> val df2 = sc.parallelize( Seq( A ( Seq( B(Some(1),"a"),B(Some(2),"b"), 
> B(None, "c"), B(Some(4),null), B(None,null), null ) ), A(null) ) ).toDF
> df2.registerTempTable("T2")
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select b[$i].x, 
> b[$i].y from T1").collect.mkString(","))}
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select b[$i].x, 
> b[$i].y from T2").collect.mkString(","))}
> // Struct[Int,String]
> case class C(b: B)
> val df3 = sc.parallelize( Seq( C ( B(Some(1),"test") ), C(null) ) ).toDF
> df3.registerTempTable("T3")
> sqlContext.sql("select b.x, b.y from T3").collect
> {code}
> With Tungsten enabled, it reaches NullPointerException.




[jira] [Updated] (SPARK-12482) Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Summary: Spark fileserver not started on same IP as using spark.driver.host 
 (was: CLONE - Spark fileserver not started on same IP as using 
spark.driver.host)

> Spark fileserver not started on same IP as using spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.5.2
>Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address 
> configured through {{spark.driver.host}} still exists in _Spark 1.5.2_
> The problem is that, while the file server is listening on all interfaces on 
> the file server host, the _Spark_ service attempts to call back to the default 
> IP of the host, to which it may or may not have connectivity.
> For instance, the following setup causes a 
> {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact 
> the _Spark_ driver host for a JAR:
> * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
> connection IP of {{172.30.0.2}}
> * _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
> * A connection is made from the driver host to the _Spark_ service
> ** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
> {{172.30.0.2}}
> ** {{spark.driver.port}} is set to {{50003}}
> ** {{spark.fileserver.port}} is set to {{50005}}
> * Locally (on the driver host), the following listeners are active:
> ** {{0.0.0.0:50005}}
> ** {{172.30.0.2:50003}}
> * The _Spark_ service calls back to the file server host for a JAR file using 
> the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
> * The _Spark_ service, being on a different network than the driver host, 
> cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
> file server
> ** A {{netstat}} on the _Spark_ service host will show the connection to the 
> file server host as being in {{SYN_SENT}} state until the process gives up 
> trying to connect
> {code:title=Driver|borderStyle=solid}
> SparkConf conf = new SparkConf()
> .setMaster("spark://172.30.0.3:7077")
> .setAppName("TestApp")
> .set("spark.driver.host", "172.30.0.2")
> .set("spark.driver.port", "50003")
> .set("spark.fileserver.port", "50005");
> JavaSparkContext sc = new JavaSparkContext(conf);
> sc.addJar("target/code.jar");
> {code}
> {code:title=Stacktrace|borderStyle=solid}
> 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> 172.30.0.3): java.net.SocketTimeoutException: connect timed out
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at 

[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-12-22 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068627#comment-15068627
 ] 

Maciej Bryński edited comment on SPARK-11437 at 12/22/15 7:45 PM:
--

[~davies], [~jason.white]
Are you sure that this patch is OK?

Right now, if I'm creating a DataFrame from an RDD of Rows, there is no schema 
validation, so we can create a schema with wrong types.

{code}
schema = StructType([StructField("id", IntegerType()), StructField("name", 
IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(id=1, name=None)]
{code}

Even better: columns can change places.
{code}
schema = StructType([StructField("name", IntegerType())]) 
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(name=1)]
{code}


was (Author: maver1ck):
[~davies]
Are you sure that this patch is OK ?

Right now if I'm creating DataFrame from RDD of Rows there is no schema 
validation.
So we can create schema with wrong types.

{code}
schema = StructType([StructField("id", IntegerType()), StructField("name", 
IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(id=1, name=None)]
{code}

Even better. Column can change places.
{code}
schema = StructType([StructField("name", IntegerType())]) 
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(name=1)]
{code}

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
>Assignee: Jason White
> Fix For: 1.6.0
>
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325




[jira] [Commented] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions

2015-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068670#comment-15068670
 ] 

Apache Spark commented on SPARK-12462:
--

User 'xguo27' has created a pull request for this issue:
https://github.com/apache/spark/pull/10437

> Add ExpressionDescription to misc non-aggregate functions
> -
>
> Key: SPARK-12462
> URL: https://issues.apache.org/jira/browse/SPARK-12462
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>





[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068696#comment-15068696
 ] 

Andrew Davidson commented on SPARK-12484:
-

What I am really trying to do is rewrite the following Python code in Java. 
Ideally I would implement this code as an MLlib transformation; however, that 
does not seem possible at this point in time using the Java API.

Kind regards

Andy

def convertMultinomialLabelToBinary(dataFrame):
    newColName = "binomialLabel"
    binomial = udf(lambda labelStr: labelStr if (labelStr == "noise") else "signal",
                   StringType())
    ret = dataFrame.withColumn(newColName, binomial(dataFrame["label"]))
    return ret
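
For what it's worth, one possible Java counterpart, sketched against the 1.5.x Java 
API (the class name and the registered UDF name "binomial" are just illustrations, 
and I have not verified that this sidesteps the AnalysisException in the attached 
test):

{code}
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class BinomialLabel {
    // Adds a "binomialLabel" column derived from the "label" column of the same DataFrame.
    public static DataFrame convertMultinomialLabelToBinary(SQLContext sqlContext, DataFrame df) {
        sqlContext.udf().register("binomial", new UDF1<String, String>() {
            @Override
            public String call(String labelStr) {
                return "noise".equals(labelStr) ? "noise" : "signal";
            }
        }, DataTypes.StringType);
        return df.withColumn("binomialLabel", callUDF("binomial", col("label")));
    }
}
{code}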

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)





[jira] [Updated] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"

2015-12-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12485:
--
Target Version/s: 2.0.0

> Rename "dynamic allocation" to "elastic scaling"
> 
>
> Key: SPARK-12485
> URL: https://issues.apache.org/jira/browse/SPARK-12485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Fewer syllables, sounds more natural.




[jira] [Updated] (SPARK-12488) LDA Describe Topics Generates Invalid Term IDs

2015-12-22 Thread Ilya Ganelin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Ganelin updated SPARK-12488:
-
Description: 
When running the LDA model, and using the describeTopics function, invalid 
values appear in the termID list that is returned:

The below example generated 10 topics on a data set with a vocabulary of 685.

{code}

// Set LDA parameters
val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)

val ldaModel = lda.run(docTermVector)
val distModel = 
ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
{code}

{code}
scala> ldaModel.describeTopics()(0)._1.sorted.reverse
res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 
585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 
569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 
553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 
537, 536, 535, 534, 533, 532, 53...
{code}

{code}
scala> ldaModel.describeTopics()(0)._1.sorted
res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
-1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
-1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
-1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
-1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
-1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
-1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
-563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
-26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 
39, 40, 41, 42, 43, 44, 45, 4...
{code}

  was:
When running the LDA model, and using the describeTopics function, invalid 
values appear in the termID list that is returned:

{code}

// Set LDA parameters
val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)

val ldaModel = lda.run(docTermVector)
val distModel = 
ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
{code}


> LDA Describe Topics Generates Invalid Term IDs
> --
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Ilya Ganelin
>
> When running the LDA model, and using the describeTopics function, invalid 
> values appear in the termID list that is returned:
> The below example generated 10 topics on a data set with a vocabulary of 685.
> {code}
> // Set LDA parameters
> val numTopics = 10
> val lda = new LDA().setK(numTopics).setMaxIterations(10)
> val ldaModel = lda.run(docTermVector)
> val distModel = 
> ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted.reverse
> res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
> 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
> 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
> 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
> 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
> 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
> 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 
> 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 
> 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 
> 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 
> 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted
> res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
> -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
> -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
> -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
> -1397991169, 

[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Description: 
The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all interfaces of the 
file server host, the _Spark_ service attempts to call back to the host's default 
IP address, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
* A connection is made from the launch host to the _Spark_ service
** {{spark.driver.host}} is set to the IP of the launch host on the LAN 
{{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the launch host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the launch host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

{code:title=Stacktrace|borderStyle=solid}
15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
172.30.0.3): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.

\\

\\
Original report follows:

I initially inquired about this here: 

[jira] [Commented] (SPARK-1865) Improve behavior of cleanup of disk state

2015-12-22 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068542#comment-15068542
 ] 

Charles Allen commented on SPARK-1865:
--

This is compounded by the fact that some of the shutdown paths call stop() 
during the normal course of the main thread, and then fail to wait for 
completion if stop() is ALSO called via the shutdown hook.
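
One way to make stop() safe in this situation is to make it idempotent and have 
any later caller (such as the shutdown hook) wait for the first invocation to 
finish. A minimal sketch of that pattern (illustrative only, not Spark's actual 
code):

{code}
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean

class StoppableService {
  private val stopRequested = new AtomicBoolean(false)
  private val stopCompleted = new CountDownLatch(1)

  def stop(): Unit = {
    if (stopRequested.compareAndSet(false, true)) {
      try {
        // release disk state, shut down components, etc.
      } finally {
        stopCompleted.countDown()
      }
    } else {
      // A later caller (e.g. the shutdown hook) blocks until the first stop() finishes.
      stopCompleted.await()
    }
  }
}
{code}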

> Improve behavior of cleanup of disk state
> -
>
> Key: SPARK-1865
> URL: https://issues.apache.org/jira/browse/SPARK-1865
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Aaron Davidson
>
> Right now the behavior of disk cleanup is centered around the exit hook of 
> the executor, which attempts to cleanup shuffle files and disk manager 
> blocks, but may fail. We should make this behavior more predictable, perhaps 
> by letting the Standalone Worker cleanup the disk state, and adding a flag to 
> disable having the executor cleanup its own state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12483:

Attachment: SPARK_12483_unitTest.java

added a unit test file

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList<Features> data = new ArrayList<Features>(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD<Features> rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD<Features> rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs

2015-12-22 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068761#comment-15068761
 ] 

Ilya Ganelin commented on SPARK-12488:
--

@jkbradley Would love your feedback here. Thanks!

> LDA describeTopics() Generates Invalid Term IDs
> ---
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Ilya Ganelin
>
> When running the LDA model, and using the describeTopics function, invalid 
> values appear in the termID list that is returned:
> The below example generates 10 topics on a data set with a vocabulary of 685.
> {code}
> // Set LDA parameters
> val numTopics = 10
> val lda = new LDA().setK(numTopics).setMaxIterations(10)
> val ldaModel = lda.run(docTermVector)
> val distModel = 
> ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted.reverse
> res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
> 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
> 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
> 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
> 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
> 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
> 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 
> 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 
> 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 
> 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 
> 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted
> res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
> -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
> -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
> -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
> -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
> -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
> -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
> -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
> -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 
> 38, 39, 40, 41, 42, 43, 44, 45, 4...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068568#comment-15068568
 ] 

Kyle Sutton commented on SPARK-12482:
-

Thanks!  I did.  I think he's saying that the fileserver is listening on all 
interfaces, but if the Spark service can't reach the IP it was given by the 
Spark driver, the ports are immaterial.

> Spark fileserver not started on same IP as configured in spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.5.2
>Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address 
> configured through {{spark.driver.host}} still exists in _Spark 1.5.2_
> The problem is that, while the file server is listening on all interfaces of 
> the file server host, the _Spark_ service attempts to call back to the host's 
> default IP address, to which it may or may not have connectivity.
> For instance, the following setup causes a 
> {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact 
> the _Spark_ driver host for a JAR:
> * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
> connection IP of {{172.30.0.2}}
> * _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
> * A connection is made from the driver host to the _Spark_ service
> ** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
> {{172.30.0.2}}
> ** {{spark.driver.port}} is set to {{50003}}
> ** {{spark.fileserver.port}} is set to {{50005}}
> * Locally (on the driver host), the following listeners are active:
> ** {{0.0.0.0:50005}}
> ** {{172.30.0.2:50003}}
> * The _Spark_ service calls back to the file server host for a JAR file using 
> the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
> * The _Spark_ service, being on a different network than the driver host, 
> cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
> file server
> ** A {{netstat}} on the _Spark_ service host will show the connection to the 
> file server host as being in {{SYN_SENT}} state until the process gives up 
> trying to connect
> {code:title=Driver|borderStyle=solid}
> SparkConf conf = new SparkConf()
> .setMaster("spark://172.30.0.3:7077")
> .setAppName("TestApp")
> .set("spark.driver.host", "172.30.0.2")
> .set("spark.driver.port", "50003")
> .set("spark.fileserver.port", "50005");
> JavaSparkContext sc = new JavaSparkContext(conf);
> sc.addJar("target/code.jar");
> {code}
> {code:title=Stacktrace|borderStyle=solid}
> 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> 172.30.0.3): java.net.SocketTimeoutException: connect timed out
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> 

[jira] [Updated] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12484:

Attachment: UDFTest.java

Add a unit test file

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12486) Executors are not always terminated successfully by the worker.

2015-12-22 Thread Nong Li (JIRA)
Nong Li created SPARK-12486:
---

 Summary: Executors are not always terminated successfully by the 
worker.
 Key: SPARK-12486
 URL: https://issues.apache.org/jira/browse/SPARK-12486
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Nong Li


There are cases when the executor is not killed successfully by the worker. 
One way this can happen: the executor gets into a bad state, fails to 
heartbeat, and the master tells the worker to kill the executor, but the 
executor is in such a bad state that the kill request is ignored. This can 
happen, for example, when the executor is stuck in heavy GC.

The cause is that the Process.destroy() API is not forceful enough. In 
Java 8, a new API, destroyForcibly(), was added. We should use that if available.
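
A minimal sketch of what a more forceful kill could look like, assuming Java 8 
and that the worker already holds the executor's java.lang.Process (the 
10-second grace period is illustrative):

{code}
import java.util.concurrent.TimeUnit

def killExecutor(process: Process): Unit = {
  process.destroy()  // polite request first
  if (!process.waitFor(10, TimeUnit.SECONDS)) {
    // waitFor(timeout, unit) and destroyForcibly() are Java 8 APIs; on Java 7
    // the worker would have to fall back to destroy() only, or use reflection.
    process.destroyForcibly()
    process.waitFor()
  }
}
{code}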



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs

2015-12-22 Thread Ilya Ganelin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Ganelin updated SPARK-12488:
-
Summary: LDA describeTopics() Generates Invalid Term IDs  (was: LDA 
Describe Topics Generates Invalid Term IDs)

> LDA describeTopics() Generates Invalid Term IDs
> ---
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Ilya Ganelin
>
> When running the LDA model, and using the describeTopics function, invalid 
> values appear in the termID list that is returned:
> The below example generated 10 topics on a data set with a vocabulary of 685.
> {code}
> // Set LDA parameters
> val numTopics = 10
> val lda = new LDA().setK(numTopics).setMaxIterations(10)
> val ldaModel = lda.run(docTermVector)
> val distModel = 
> ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted.reverse
> res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
> 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
> 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
> 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
> 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
> 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
> 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 
> 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 
> 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 
> 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 
> 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted
> res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
> -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
> -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
> -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
> -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
> -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
> -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
> -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
> -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 
> 38, 39, 40, 41, 42, 43, 44, 45, 4...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Description: 
The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all interfaces of the 
file server host, the _Spark_ service attempts to call back to the host's default 
IP address, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* A _Spark_ connection is made from the launch host to a service running on 
that LAN, {{172.30.0.0/16}}
** {{spark.driver.host}} is set to the IP of the launch host on that network 
{{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the launch host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the launch host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.

\\

\\
Original report follows:

I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer and the Jetty Server is started on the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work instead:

{code:title=HttpServer.scala#L75}
val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0))
{code}
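
Expanding on that suggestion, a sketch that falls back to binding every 
interface when {{spark.driver.host}} is not set (the real HttpServer also 
configures connectors and security, so this only illustrates the bind address; 
{{conf}} is assumed to be the SparkConf already in scope):

{code}
import java.net.InetSocketAddress
import org.eclipse.jetty.server.Server

// Bind to the configured driver host if present, otherwise keep the current
// behaviour of binding every interface; port 0 asks for an ephemeral port.
val host = conf.getOption("spark.driver.host").getOrElse("0.0.0.0")
val server = new Server(new InetSocketAddress(host, 0))
{code}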

  was:
The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all interfaces of the 
file server host, the _Spark_ service attempts to call back to the host's default 
IP address, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* A _Spark_ connection is made from the launch host to a service running on 
that LAN, {{172.30.0.0/16}}
** {{spark.driver.host}} is set to the IP of that _Spark_ service host 
{{172.30.0.3}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the launch host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the launch host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.

\\

\\
Original report follows:

I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer and the Jetty Server is started on the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work 

[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068698#comment-15068698
 ] 

Andrew Davidson commented on SPARK-12484:
-

Related issue: https://issues.apache.org/jira/browse/SPARK-12483

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12488) LDA Describe Topics Generates Invalid Term IDs

2015-12-22 Thread Ilya Ganelin (JIRA)
Ilya Ganelin created SPARK-12488:


 Summary: LDA Describe Topics Generates Invalid Term IDs
 Key: SPARK-12488
 URL: https://issues.apache.org/jira/browse/SPARK-12488
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.5.2
Reporter: Ilya Ganelin


When running the LDA model, and using the describeTopics function, invalid 
values appear in the termID list that is returned:

{code}

// Set LDA parameters
val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)

val ldaModel = lda.run(docTermVector)
val distModel = 
ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
{code}
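
Since every returned term ID should be an index into the vocabulary, a quick 
sanity check along these lines makes the bad values easy to spot (a sketch; the 
vocabulary size of 685 is the one reported for this data set):

{code}
// Every term ID returned by describeTopics() should be a valid vocabulary index.
val vocabSize = 685  // vocabulary size reported for this data set
val badIds = ldaModel.describeTopics()
  .flatMap { case (termIds, _) => termIds }
  .filter(id => id < 0 || id >= vocabSize)
  .distinct
println(s"${badIds.length} out-of-range term IDs, e.g. ${badIds.take(5).mkString(", ")}")
{code}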



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12487) Add docs for Kafka message handler

2015-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068748#comment-15068748
 ] 

Apache Spark commented on SPARK-12487:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10439

> Add docs for Kafka message handler
> --
>
> Key: SPARK-12487
> URL: https://issues.apache.org/jira/browse/SPARK-12487
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12489) Fix minor issues found by Findbugs

2015-12-22 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-12489:


 Summary: Fix minor issues found by Findbugs
 Key: SPARK-12489
 URL: https://issues.apache.org/jira/browse/SPARK-12489
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core, SQL
Reporter: Shixiong Zhu
Priority: Minor


Just used FindBugs to scan the code and fixed some real issues:

1. Close `java.sql.Statement` (see the sketch after this list).
2. Fix incorrect `asInstanceOf`.
3. Remove unnecessary `synchronized` and `ReentrantLock`.
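
For item 1, the usual pattern is to close the statement in a finally block so 
it is released even when the query throws; a minimal sketch (not the actual 
patched code):

{code}
import java.sql.{Connection, Statement}

def runUpdate(conn: Connection, sql: String): Int = {
  val stmt: Statement = conn.createStatement()
  try {
    stmt.executeUpdate(sql)
  } finally {
    stmt.close()  // release the statement even if executeUpdate throws
  }
}
{code}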



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12490:


Assignee: Apache Spark  (was: Josh Rosen)

> Don't use Javascript for web UI's paginated table navigation controls
> -
>
> Key: SPARK-12490
> URL: https://issues.apache.org/jira/browse/SPARK-12490
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> The web UI's paginated table uses Javascript to implement certain navigation 
> controls, such as table sorting and the "go to page" form. This is 
> unnecessary and should be simplified to use plain HTML form controls and 
> links.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis

2015-12-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12102:
-
Target Version/s:   (was: 1.6.0)

> Cast a non-nullable struct field to a nullable field during analysis
> 
>
> Key: SPARK-12102
> URL: https://issues.apache.org/jira/browse/SPARK-12102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Dilip Biswal
> Fix For: 2.0.0
>
>
> If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, 
> cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will 
> see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > 
> 0) THEN 
> struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4)
>  as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE 
> expressions should all be same type or coercible to a common type; line 1 pos 
> 85}}.
> The problem is the nullability difference between {{4}} (non-nullable) and 
> {{hash(4)}} (nullable).
> Seems it makes sense to cast the nullability in the analysis. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069020#comment-15069020
 ] 

Xiao Li commented on SPARK-12483:
-

That means it is not a bug. Thanks!

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList<Features> data = new ArrayList<Features>(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD<Features> rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD<Features> rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069027#comment-15069027
 ] 

Andrew Davidson commented on SPARK-12483:
-

Hi Xiao

thanks for looking at the issue

Is there a way to change a column name? If you do a select() on a data 
frame, the resulting column name is really strange.

See the attachment on https://issues.apache.org/jira/browse/SPARK-12484:

 // get column from data frame call df.withColumnName
Column newCol = udfDF.col("_c0"); 

renaming data frame columns is very common in R

Kind regards

Andy
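
For renaming: {{as()}} on a DataFrame aliases the whole frame (mainly useful 
for joins); to rename a column, use {{Column.as}} inside a select, or 
{{withColumnRenamed}}. A short Scala sketch (the same methods exist in the Java 
API):

{code}
// as() only sets an alias on the whole DataFrame (handy for joins);
// it does not rename any column.
val aliased  = df.as("UUID")

// Renaming the column itself:
val renamed1 = df.select(df.col("id").as("UUID"))
val renamed2 = df.withColumnRenamed("id", "UUID")
{code}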

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList<Features> data = new ArrayList<Features>(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD<Features> rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD<Features> rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12478) Dataset fields of product types can't be null

2015-12-22 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069044#comment-15069044
 ] 

Cheng Lian commented on SPARK-12478:


I'm leaving this ticket open since we also need to backport this to branch-1.6 
after the release.
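
Until the fix lands, a workaround that follows the hint in the error message is 
to declare the nested product field as an Option; a spark-shell sketch (shown 
only to illustrate the idea, not verified against every branch):

{code}
// Workaround sketch: make the nested product field an Option so the
// encoder treats it as nullable.
case class Inner(f: Int)
case class Outer(i: Option[Inner])

import sqlContext.implicits._
Seq(Outer(None)).toDS().show()
{code}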

> Dataset fields of product types can't be null
> -
>
> Key: SPARK-12478
> URL: https://issues.apache.org/jira/browse/SPARK-12478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>  Labels: backport-needed
>
> Spark shell snippet for reproduction:
> {code}
> import sqlContext.implicits._
> case class Inner(f: Int)
> case class Outer(i: Inner)
> Seq(Outer(null)).toDS().toDF().show()
> Seq(Outer(null)).toDS().show()
> {code}
> Expected output should be:
> {noformat}
> ++
> |   i|
> ++
> |null|
> ++
> ++
> |   i|
> ++
> |null|
> ++
> {noformat}
> Actual output:
> {noformat}
> +--+
> | i|
> +--+
> |[null]|
> +--+
> java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: 
> Null value appeared in non-nullable field Inner.f of type scala.Int. If the 
> schema is inferred from a Scala tuple/case class, or a Java bean, please try 
> to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
> instead of int/scala.Int).
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, 
> StructType(StructField(f,IntegerType,false))])) null else newinstance(class 
> $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class 
> $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3))
> +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null 
> else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>:- isnull(input[0, StructType(StructField(f,IntegerType,false))])
>:  +- input[0, StructType(StructField(f,IntegerType,false))]
>:- null
>+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>   +- assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int)
>  +- input[0, StructType(StructField(f,IntegerType,false))].f
> +- input[0, StructType(StructField(f,IntegerType,false))]
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.Dataset.collect(Dataset.scala:704)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:725)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:240)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:44)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:46)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:48)
> at $iwC$$iwC$$iwC.<init>(<console>:50)
> at $iwC$$iwC.<init>(<console>:52)
> at $iwC.<init>(<console>:54)
> at <init>(<console>:56)
> at .<init>(<console>:60)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> 

[jira] [Commented] (SPARK-10486) Spark intermittently fails to recover from a worker failure (in standalone mode)

2015-12-22 Thread Liang Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069054#comment-15069054
 ] 

Liang Chen commented on SPARK-10486:


I am hitting the same problem.

> Spark intermittently fails to recover from a worker failure (in standalone 
> mode)
> 
>
> Key: SPARK-10486
> URL: https://issues.apache.org/jira/browse/SPARK-10486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Cheuk Lam
>Priority: Critical
>
> We have run into a problem where some Spark job is aborted after one worker 
> is killed in a 2-worker standalone cluster.  The problem is intermittent, but 
> we can consistently reproduce it.  The problem only appears to happen when we 
> kill a worker.  It doesn't seem to happen when we kill an executor directly.
> The program we use to reproduce the problem is some iterative program based 
> on GraphX, although the nature of the issue doesn't seem to be GraphX 
> related.  This is how we reproduce the problem:
> * Set up a standalone cluster of 2 workers;
> * Run a Spark application of some iterative program (ours is some based on 
> GraphX);
> * Kill a worker process (and thus the associated executor);
> * Intermittently some job will be aborted.
> The driver and the executor logs are available, as well as the application 
> history (event log file).  But they are quite large and can't be attached 
> here.
> ~
> After looking into the log files, we think the failure is caused by the 
> following two things combined:
> * The BlockManagerMasterEndpoint in the driver has some stale block info 
> corresponding to the dead executor after the worker has been killed.  The 
> driver does appear to handle the "RemoveExecutor" message and cleans up all 
> related block info.  But subsequently, and intermittently, it receives some 
> Akka messages to re-register the dead BlockManager and re-add some of its 
> blocks.  As a result, upon GetLocations requests from the remaining executor, 
> the driver responds with some stale block info, instructing the remaining 
> executor to fetch blocks from the dead executor.  Please see the driver log 
> excerption below that shows the sequence of events described above.  In the 
> log, there are two executors: 1.2.3.4 was the one which got shut down, while 
> 5.6.7.8 is the remaining executor.  The driver also ran on 5.6.7.8.
> * When the remaining executor's BlockManager issues a doGetRemote() call to 
> fetch the block of data, it fails because the targeted BlockManager which 
> resided in the dead executor is gone.  This failure results in an exception 
> forwarded to the caller, bypassing the mechanism in the doGetRemote() 
> function to trigger a re-computation of the block.  I don't know whether that 
> is intentional or not.
> Driver log excerption that shows that the driver received messages to 
> re-register the dead executor after handling the RemoveExecutor message:
> 11690 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
> AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message 
> (172.236378 ms) 
> AkkaMessage(RegisterExecutor(0,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/Executor#670388190]),1.2.3.4:36140,8,Map(stdout
>  -> 
> http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stdout,
>  stderr -> 
> http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stderr)),true)
>  from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$f]
> 11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
> AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message 
> AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
> 52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)
>  from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g]
> 11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
> AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: 
> AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
> 52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)
> 11718 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] INFO 
> BlockManagerMasterEndpoint: Registering block manager 1.2.3.4:52615 with 6.2 
> GB RAM, BlockManagerId(0, 1.2.3.4, 52615)
> 11719 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
> AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message 
> (1.498313 ms) AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
> 

[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12468:


Assignee: Apache Spark

> getParamMap in Pyspark ML API returns empty dictionary in example for 
> Documentation
> ---
>
> Key: SPARK-12468
> URL: https://issues.apache.org/jira/browse/SPARK-12468
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Zachary Brown
>Assignee: Apache Spark
>Priority: Minor
>
> The `extractParamMap()` method for a model that has been fit returns an empty 
> dictionary, e.g. (from the [Pyspark ML API 
> Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)):
> ```python
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.param import Param, Params
> # Prepare training data from a list of (label, features) tuples.
> training = sqlContext.createDataFrame([
> (1.0, Vectors.dense([0.0, 1.1, 0.1])),
> (0.0, Vectors.dense([2.0, 1.0, -1.0])),
> (0.0, Vectors.dense([2.0, 1.3, 1.0])),
> (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
> # Create a LogisticRegression instance. This instance is an Estimator.
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> # Print out the parameters, documentation, and any default values.
> print "LogisticRegression parameters:\n" + lr.explainParams() + "\n"
> # Learn a LogisticRegression model. This uses the parameters stored in lr.
> model1 = lr.fit(training)
> # Since model1 is a Model (i.e., a transformer produced by an Estimator),
> # we can view the parameters it used during fit().
> # This prints the parameter (name: value) pairs, where names are unique IDs 
> for this
> # LogisticRegression instance.
> print "Model 1 was fit using parameters: "
> print model1.extractParamMap()
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12468:


Assignee: (was: Apache Spark)

> getParamMap in Pyspark ML API returns empty dictionary in example for 
> Documentation
> ---
>
> Key: SPARK-12468
> URL: https://issues.apache.org/jira/browse/SPARK-12468
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Zachary Brown
>Priority: Minor
>
> The `extractParamMap()` method for a model that has been fit returns an empty 
> dictionary, e.g. (from the [Pyspark ML API 
> Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)):
> ```python
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.param import Param, Params
> # Prepare training data from a list of (label, features) tuples.
> training = sqlContext.createDataFrame([
> (1.0, Vectors.dense([0.0, 1.1, 0.1])),
> (0.0, Vectors.dense([2.0, 1.0, -1.0])),
> (0.0, Vectors.dense([2.0, 1.3, 1.0])),
> (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
> # Create a LogisticRegression instance. This instance is an Estimator.
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> # Print out the parameters, documentation, and any default values.
> print "LogisticRegression parameters:\n" + lr.explainParams() + "\n"
> # Learn a LogisticRegression model. This uses the parameters stored in lr.
> model1 = lr.fit(training)
> # Since model1 is a Model (i.e., a transformer produced by an Estimator),
> # we can view the parameters it used during fit().
> # This prints the parameter (name: value) pairs, where names are unique IDs 
> for this
> # LogisticRegression instance.
> print "Model 1 was fit using parameters: "
> print model1.extractParamMap()
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12061) Persist for Map/filter with Lambda Functions don't always read from Cache

2015-12-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058419#comment-15058419
 ] 

Xiao Li edited comment on SPARK-12061 at 12/23/15 12:22 AM:


The logical plans' cleanArgs do not match when we call sameResult.
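
Until that is fixed, the practical workaround is to reuse the cached Dataset 
reference instead of re-deriving it with another {{map(f)}} call; a sketch based 
on the test below:

{code}
import sqlContext.implicits._

val f = (i: Int) => i + 1
val ds = Seq(1, 2, 3).toDS()
val mapped = ds.map(f)
mapped.cache()

// Reuse `mapped` directly. A second ds.map(f) builds a new plan whose cleaned
// arguments (including the lambda instance) differ, so sameResult() does not
// match it against the cached plan.
mapped.count()
{code}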


was (Author: smilegator):
Start working on it. Thanks!

> Persist for Map/filter with Lambda Functions don't always read from Cache
> -
>
> Key: SPARK-12061
> URL: https://issues.apache.org/jira/browse/SPARK-12061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, the existing caching mechanisms do not work on dataset operations 
> when using map/filter with lambda functions. For example, 
> {code}
>   test("persist and then map/filter with lambda functions") {
> val f = (i: Int) => i + 1
> val ds = Seq(1, 2, 3).toDS()
> val mapped = ds.map(f)
> mapped.cache()
> val mapped2 = ds.map(f)
> assertCached(mapped2)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12386) Setting "spark.executor.port" leads to NPE in SparkEnv

2015-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12386:
-
Fix Version/s: (was: 1.6.1)
   (was: 2.0.0)
   1.6.0

> Setting "spark.executor.port" leads to NPE in SparkEnv
> --
>
> Key: SPARK-12386
> URL: https://issues.apache.org/jira/browse/SPARK-12386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 1.6.0
>
>
> From the list:
> {quote}
> when we set spark.executor.port in 1.6, we get thrown a NPE in 
> SparkEnv$.create(SparkEnv.scala:259).
> {quote}
> The fix is simple; it probably should make it into 1.6.0 since it will affect anyone 
> using that config option, but I'll leave that to the release manager's 
> discretion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12410) "." and "|" used for String.split directly

2015-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12410:
-
Fix Version/s: (was: 1.6.1)
   (was: 2.0.0)
   1.6.0

> "." and "|" used for String.split directly
> --
>
> Key: SPARK-12410
> URL: https://issues.apache.org/jira/browse/SPARK-12410
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.4.2, 1.5.3, 1.6.0
>
>
> String.split accepts a regular expression, so we should escape "." and "|".
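
A minimal sketch of the escaping this fix implies, assuming a pipe-delimited input string (the variable names are illustrative, not from the patch):

{code}
String raw = "host.domain|9092";
// "." and "|" are regex metacharacters; splitting on them unescaped is wrong:
String[] broken = raw.split(".");        // empty array: "." matches every character
// Escaping (or Pattern.quote) splits on the literal characters instead:
String[] byDot  = raw.split("\\.");      // ["host", "domain|9092"]
String[] byPipe = raw.split("\\|");      // ["host.domain", "9092"]
String[] quoted = raw.split(java.util.regex.Pattern.quote("|"));  // same as byPipe
{code}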



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069025#comment-15069025
 ] 

Xiao Li commented on SPARK-12484:
-

Please check the email answer by [~zjffdu]. Thanks!

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming

2015-12-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-12429:
--
Assignee: Shixiong Zhu  (was: Apache Spark)

> Update documentation to show how to use accumulators and broadcasts with 
> Spark Streaming
> 
>
> Key: SPARK-12429
> URL: https://issues.apache.org/jira/browse/SPARK-12429
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>
>  Accumulators and Broadcasts cannot work reliably with Spark Streaming when 
> restarting after driver failures. We need to add an example to guide users.
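
A hedged sketch of the kind of example the documentation needs: a lazily initialized singleton holder so the broadcast can be re-created after the driver recovers from a checkpoint. The class name and list contents are illustrative; only JavaSparkContext and Broadcast are Spark API.

{code}
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// Lazily instantiated singleton: checkpoint recovery does not restore driver-side
// fields, so the broadcast is re-created on first use after a restart.
class WordBlacklist {
  private static volatile Broadcast<List<String>> instance = null;

  static Broadcast<List<String>> getInstance(JavaSparkContext jsc) {
    if (instance == null) {
      synchronized (WordBlacklist.class) {
        if (instance == null) {
          instance = jsc.broadcast(Arrays.asList("a", "b", "c"));
        }
      }
    }
    return instance;
  }
}
{code}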



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming

2015-12-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-12429.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Update documentation to show how to use accumulators and broadcasts with 
> Spark Streaming
> 
>
> Key: SPARK-12429
> URL: https://issues.apache.org/jira/browse/SPARK-12429
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>
>  Accumulators and Broadcasts cannot work reliably with Spark Streaming when 
> restarting after driver failures. We need to add an example to guide users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069055#comment-15069055
 ] 

Xiao Li commented on SPARK-12483:
-

Hi, Andy, 

See this example: 
{code}
DataFrame df = hc.range(0, 100).unionAll(hc.range(0, 
100)).select(col("id").as("value"));
{code}

Good luck, 

Xiao
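
A hedged follow-up sketch (not from the thread): in the Java API, {{as}} on a DataFrame sets a plan alias; to rename the column itself, alias the Column inside {{select}} or use {{withColumnRenamed}}. Here {{df}} is the DataFrame from the unit test quoted below, with columns {{id}} and {{labelStr}}.

{code}
DataFrame renamed1 = df.select(df.col("id").as("UUID"));
DataFrame renamed2 = df.select("id").withColumnRenamed("id", "UUID");
renamed1.printSchema();   // root |-- UUID: integer (nullable = false)
{code}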

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> The following unit test demonstrates a bug in as(); the column name for aliasDF 
> was not changed:
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069061#comment-15069061
 ] 

Andrew Davidson commented on SPARK-12484:
-

Hi Xiao

thanks for looking into this

Andy

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12468:


Assignee: (was: Apache Spark)

> getParamMap in Pyspark ML API returns empty dictionary in example for 
> Documentation
> ---
>
> Key: SPARK-12468
> URL: https://issues.apache.org/jira/browse/SPARK-12468
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Zachary Brown
>Priority: Minor
>
> The `extractParamMap()` method for a model that has been fit returns an empty 
> dictionary, e.g. (from the [Pyspark ML API 
> Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)):
> ```python
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.param import Param, Params
> # Prepare training data from a list of (label, features) tuples.
> training = sqlContext.createDataFrame([
> (1.0, Vectors.dense([0.0, 1.1, 0.1])),
> (0.0, Vectors.dense([2.0, 1.0, -1.0])),
> (0.0, Vectors.dense([2.0, 1.3, 1.0])),
> (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
> # Create a LogisticRegression instance. This instance is an Estimator.
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> # Print out the parameters, documentation, and any default values.
> print "LogisticRegression parameters:\n" + lr.explainParams() + "\n"
> # Learn a LogisticRegression model. This uses the parameters stored in lr.
> model1 = lr.fit(training)
> # Since model1 is a Model (i.e., a transformer produced by an Estimator),
> # we can view the parameters it used during fit().
> # This prints the parameter (name: value) pairs, where names are unique IDs 
> for this
> # LogisticRegression instance.
> print "Model 1 was fit using parameters: "
> print model1.extractParamMap()
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12468:


Assignee: Apache Spark

> getParamMap in Pyspark ML API returns empty dictionary in example for 
> Documentation
> ---
>
> Key: SPARK-12468
> URL: https://issues.apache.org/jira/browse/SPARK-12468
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Zachary Brown
>Assignee: Apache Spark
>Priority: Minor
>
> The `extractParamMap()` method for a model that has been fit returns an empty 
> dictionary, e.g. (from the [Pyspark ML API 
> Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)):
> ```python
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.param import Param, Params
> # Prepare training data from a list of (label, features) tuples.
> training = sqlContext.createDataFrame([
> (1.0, Vectors.dense([0.0, 1.1, 0.1])),
> (0.0, Vectors.dense([2.0, 1.0, -1.0])),
> (0.0, Vectors.dense([2.0, 1.3, 1.0])),
> (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
> # Create a LogisticRegression instance. This instance is an Estimator.
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> # Print out the parameters, documentation, and any default values.
> print "LogisticRegression parameters:\n" + lr.explainParams() + "\n"
> # Learn a LogisticRegression model. This uses the parameters stored in lr.
> model1 = lr.fit(training)
> # Since model1 is a Model (i.e., a transformer produced by an Estimator),
> # we can view the parameters it used during fit().
> # This prints the parameter (name: value) pairs, where names are unique IDs 
> for this
> # LogisticRegression instance.
> print "Model 1 was fit using parameters: "
> print model1.extractParamMap()
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12376) Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method

2015-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12376:
-
Fix Version/s: (was: 1.6.1)
   (was: 2.0.0)
   1.6.0

> Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method
> 
>
> Key: SPARK-12376
> URL: https://issues.apache.org/jira/browse/SPARK-12376
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
> Environment: Oracle Java 64-bit (build 1.8.0_66-b17)
>Reporter: Evan Chen
>Assignee: Evan Chen
>Priority: Minor
> Fix For: 1.6.0
>
>
> org.apache.spark.streaming.Java8APISuite.java is failing because the 
> assertOrderInvariantEquals method tries to sort an immutable list.
> Here are the errors:
> Tests run: 27, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 5.948 sec 
> <<< FAILURE! - in org.apache.spark.streaming.Java8APISuite
> testMap(org.apache.spark.streaming.Java8APISuite)  Time elapsed: 0.217 sec  
> <<< ERROR!
> java.lang.UnsupportedOperationException: null
>   at java.util.AbstractList.set(AbstractList.java:132)
>   at java.util.AbstractList$ListItr.set(AbstractList.java:426)
>   at java.util.List.sort(List.java:482)
>   at java.util.Collections.sort(Collections.java:141)
>   at 
> org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444)
> testFlatMap(org.apache.spark.streaming.Java8APISuite)  Time elapsed: 0.203 
> sec  <<< ERROR!
> java.lang.UnsupportedOperationException: null
>   at java.util.AbstractList.set(AbstractList.java:132)
>   at java.util.AbstractList$ListItr.set(AbstractList.java:426)
>   at java.util.List.sort(List.java:482)
>   at java.util.Collections.sort(Collections.java:141)
>   at 
> org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444)
> testFilter(org.apache.spark.streaming.Java8APISuite)  Time elapsed: 0.209 sec 
>  <<< ERROR!
> java.lang.UnsupportedOperationException: null
>   at java.util.AbstractList.set(AbstractList.java:132)
>   at java.util.AbstractList$ListItr.set(AbstractList.java:426)
>   at java.util.List.sort(List.java:482)
>   at java.util.Collections.sort(Collections.java:141)
>   at 
> org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444)
> testTransform(org.apache.spark.streaming.Java8APISuite)  Time elapsed: 0.215 
> sec  <<< ERROR!
> java.lang.UnsupportedOperationException: null
>   at java.util.AbstractList.set(AbstractList.java:132)
>   at java.util.AbstractList$ListItr.set(AbstractList.java:426)
>   at java.util.List.sort(List.java:482)
>   at java.util.Collections.sort(Collections.java:141)
>   at 
> org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444)
> Results :
> Tests in error: 
>   
> Java8APISuite.testFilter:81->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444
>  » UnsupportedOperation
>   
> Java8APISuite.testFlatMap:360->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444
>  » UnsupportedOperation
>   
> Java8APISuite.testMap:63->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444
>  » UnsupportedOperation
>   
> Java8APISuite.testTransform:168->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444
>  » UnsupportedOperation
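
A minimal sketch of the usual fix for this kind of failure, assuming the list handed to the assertion may be immutable (the values are illustrative):

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

List<Integer> immutable = Collections.unmodifiableList(Arrays.asList(3, 1, 2));
// Collections.sort(immutable) throws UnsupportedOperationException.
List<Integer> sortable = new ArrayList<>(immutable);  // copy into a mutable list
Collections.sort(sortable);                           // [1, 2, 3]
{code}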



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11749) Duplicate creating the RDD in file stream when recovering from checkpoint data

2015-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-11749:
-
Fix Version/s: (was: 1.6.1)
   (was: 2.0.0)
   1.6.0

> Duplicate creating the RDD in file stream when recovering from checkpoint data
> --
>
> Key: SPARK-11749
> URL: https://issues.apache.org/jira/browse/SPARK-11749
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Jack Hu
>Assignee: Jack Hu
> Fix For: 1.6.0
>
>
> I have a case that monitors an HDFS folder, enriches the incoming data from 
> that folder via different reference tables (about 15 of them), and sends the 
> results to different Hive tables after some operations. 
> The code is as follows:
> {code}
> val txt = 
> ssc.textFileStream(folder).map(toKeyValuePair).reduceByKey(removeDuplicates)
> val refTable1 = 
> ssc.textFileStream(refSource1).map(parse(_)).updateStateByKey(...)
> txt.join(refTable1).map(..).reduceByKey(...).foreachRDD(
>   rdd => {
>  // insert into hive table
>   }
> )
> val refTable2 = 
> ssc.textFileStream(refSource2).map(parse(_)).updateStateByKey(...)
> txt.join(refTable2).map(..).reduceByKey(...).foreachRDD(
>   rdd => {
>  // insert into hive table
>   }
> )
> /// more refTables in following code
> {code}
>  
> The {{batchInterval}} of this application is set to *30 seconds*, the 
> checkpoint interval is set to *10 minutes*, and every batch in {{txt}} has *60 
> files*.
> After recovering from checkpoint data, I can see lots of log messages about 
> creating the RDDs in the file stream: the RDD in each batch of the file stream 
> was recreated *15 times*, and it took about *5 minutes* to create that many 
> file RDDs. During this period, *10K+ broadcasts* had been created and used 
> almost all of the block manager space. 
> After some investigation, we found that {{DStream.restoreCheckpointData}} 
> would be invoked for each output ({{DStream.foreachRDD}} in this case), and 
> there is no flag to indicate that this {{DStream}} had already been restored, 
> so the RDD in the file stream was recreated. 
> Suggest adding a flag to control the restore process to avoid the duplicated 
> work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12386) Setting "spark.executor.port" leads to NPE in SparkEnv

2015-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12386:
-
Target Version/s: 1.6.0  (was: 1.6.1, 2.0.0)

> Setting "spark.executor.port" leads to NPE in SparkEnv
> --
>
> Key: SPARK-12386
> URL: https://issues.apache.org/jira/browse/SPARK-12386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 1.6.0
>
>
> From the list:
> {quote}
> when we set spark.executor.port in 1.6, we get thrown a NPE in 
> SparkEnv$.create(SparkEnv.scala:259).
> {quote}
> The fix is simple; it probably should make it into 1.6.0 since it will affect anyone 
> using that config option, but I'll leave that to the release manager's 
> discretion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11749) Duplicate creating the RDD in file stream when recovering from checkpoint data

2015-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-11749:
-
Affects Version/s: (was: 1.5.0)
   1.6.0

> Duplicate creating the RDD in file stream when recovering from checkpoint data
> --
>
> Key: SPARK-11749
> URL: https://issues.apache.org/jira/browse/SPARK-11749
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Jack Hu
>Assignee: Jack Hu
> Fix For: 1.6.0
>
>
> I have a case that monitors an HDFS folder, enriches the incoming data from 
> that folder via different reference tables (about 15 of them), and sends the 
> results to different Hive tables after some operations. 
> The code is as follows:
> {code}
> val txt = 
> ssc.textFileStream(folder).map(toKeyValuePair).reduceByKey(removeDuplicates)
> val refTable1 = 
> ssc.textFileStream(refSource1).map(parse(_)).updateStateByKey(...)
> txt.join(refTable1).map(..).reduceByKey(...).foreachRDD(
>   rdd => {
>  // insert into hive table
>   }
> )
> val refTable2 = 
> ssc.textFileStream(refSource2).map(parse(_)).updateStateByKey(...)
> txt.join(refTable2).map(..).reduceByKey(...).foreachRDD(
>   rdd => {
>  // insert into hive table
>   }
> )
> /// more refTables in following code
> {code}
>  
> The {{batchInterval}} of this application is set to *30 seconds*, the 
> checkpoint interval is set to *10 minutes*, and every batch in {{txt}} has *60 
> files*.
> After recovering from checkpoint data, I can see lots of log messages about 
> creating the RDDs in the file stream: the RDD in each batch of the file stream 
> was recreated *15 times*, and it took about *5 minutes* to create that many 
> file RDDs. During this period, *10K+ broadcasts* had been created and used 
> almost all of the block manager space. 
> After some investigation, we found that {{DStream.restoreCheckpointData}} 
> would be invoked for each output ({{DStream.foreachRDD}} in this case), and 
> there is no flag to indicate that this {{DStream}} had already been restored, 
> so the RDD in the file stream was recreated. 
> Suggest adding a flag to control the restore process to avoid the duplicated 
> work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12220) Make Utils.fetchFile support files that contain special characters

2015-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12220:
-
Fix Version/s: (was: 1.6.1)
   (was: 2.0.0)
   1.6.0

> Make Utils.fetchFile support files that contain special characters
> --
>
> Key: SPARK-12220
> URL: https://issues.apache.org/jira/browse/SPARK-12220
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>
> Now if a file name contains some illegal characters, such as " ", 
> Utils.fetchFile will fail because it doesn't handle this case.
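
A small illustration of why a raw space breaks the fetch, and one way to percent-encode a path before it is placed in a URL. This is a sketch under the assumption that the file name ends up in a URL path; it is not the actual Utils.fetchFile change, and the file name is hypothetical.

{code}
String name = "my file.jar";   // hypothetical file name containing a space
// The multi-argument URI constructor encodes illegal path characters
// (it declares the checked URISyntaxException).
java.net.URI uri = new java.net.URI(null, null, "/jars/" + name, null);
System.out.println(uri.toASCIIString());   // /jars/my%20file.jar
{code}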



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069019#comment-15069019
 ] 

Xiao Li commented on SPARK-12483:
-

{{as}} is used to return a new DataFrame with an alias set. For example,

{code}
val x = testData2.as("x")
val y = testData2.as("y")
val join = x.join(y, $"x.a" === $"y.a", 
"inner").queryExecution.optimizedPlan
{code}

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> The following unit test demonstrates a bug in as(); the column name for aliasDF 
> was not changed:
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12478) Dataset fields of product types can't be null

2015-12-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12478:
---
Labels: backport-needed  (was: )

> Dataset fields of product types can't be null
> -
>
> Key: SPARK-12478
> URL: https://issues.apache.org/jira/browse/SPARK-12478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>  Labels: backport-needed
>
> Spark shell snippet for reproduction:
> {code}
> import sqlContext.implicits._
> case class Inner(f: Int)
> case class Outer(i: Inner)
> Seq(Outer(null)).toDS().toDF().show()
> Seq(Outer(null)).toDS().show()
> {code}
> Expected output should be:
> {noformat}
> ++
> |   i|
> ++
> |null|
> ++
> ++
> |   i|
> ++
> |null|
> ++
> {noformat}
> Actual output:
> {noformat}
> +--+
> | i|
> +--+
> |[null]|
> +--+
> java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: 
> Null value appeared in non-nullable field Inner.f of type scala.Int. If the 
> schema is inferred from a Scala tuple/case class, or a Java bean, please try 
> to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
> instead of int/scala.Int).
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, 
> StructType(StructField(f,IntegerType,false))])) null else newinstance(class 
> $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class 
> $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3))
> +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null 
> else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>:- isnull(input[0, StructType(StructField(f,IntegerType,false))])
>:  +- input[0, StructType(StructField(f,IntegerType,false))]
>:- null
>+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>   +- assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int)
>  +- input[0, StructType(StructField(f,IntegerType,false))].f
> +- input[0, StructType(StructField(f,IntegerType,false))]
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.Dataset.collect(Dataset.scala:704)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:725)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:240)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:46)
> at $iwC$$iwC$$iwC$$iwC.(:48)
> at $iwC$$iwC$$iwC.(:50)
> at $iwC$$iwC.(:52)
> at $iwC.(:54)
> at (:56)
> at .(:60)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045)
> at 
> 

[jira] [Updated] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Summary: Spark fileserver not started on same IP as configured in 
spark.driver.host  (was: Spark fileserver not started on same IP as using 
spark.driver.host)

> Spark fileserver not started on same IP as configured in spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.5.2
>Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address 
> configured through {{spark.driver.host}} still exists in _Spark 1.5.2_
> The problem is that, while the file server is listening on all interfaces on 
> the file server host, the _Spark_ service calls back to the host's default IP 
> address, to which it may or may not have connectivity.
> For instance, the following setup causes a 
> {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact 
> the _Spark_ driver host for a JAR:
> * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
> connection IP of {{172.30.0.2}}
> * _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
> * A connection is made from the driver host to the _Spark_ service
> ** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
> {{172.30.0.2}}
> ** {{spark.driver.port}} is set to {{50003}}
> ** {{spark.fileserver.port}} is set to {{50005}}
> * Locally (on the driver host), the following listeners are active:
> ** {{0.0.0.0:50005}}
> ** {{172.30.0.2:50003}}
> * The _Spark_ service calls back to the file server host for a JAR file using 
> the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
> * The _Spark_ service, being on a different network than the driver host, 
> cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
> file server
> ** A {{netstat}} on the _Spark_ service host will show the connection to the 
> file server host as being in {{SYN_SENT}} state until the process gives up 
> trying to connect
> {code:title=Driver|borderStyle=solid}
> SparkConf conf = new SparkConf()
> .setMaster("spark://172.30.0.3:7077")
> .setAppName("TestApp")
> .set("spark.driver.host", "172.30.0.2")
> .set("spark.driver.port", "50003")
> .set("spark.fileserver.port", "50005");
> JavaSparkContext sc = new JavaSparkContext(conf);
> sc.addJar("target/code.jar");
> {code}
> {code:title=Stacktrace|borderStyle=solid}
> 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> 172.30.0.3): java.net.SocketTimeoutException: connect timed out
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> 

[jira] [Reopened] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton reopened SPARK-12482:
-

Reopening SPARK-6476, since it's still an issue

> Spark fileserver not started on same IP as configured in spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.5.2
>Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address 
> configured through {{spark.driver.host}} still exists in _Spark 1.5.2_
> The problem is that, while the file server is listening on all interfaces on 
> the file server host, the _Spark_ service calls back to the host's default IP 
> address, to which it may or may not have connectivity.
> For instance, the following setup causes a 
> {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact 
> the _Spark_ driver host for a JAR:
> * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
> connection IP of {{172.30.0.2}}
> * _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
> * A connection is made from the driver host to the _Spark_ service
> ** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
> {{172.30.0.2}}
> ** {{spark.driver.port}} is set to {{50003}}
> ** {{spark.fileserver.port}} is set to {{50005}}
> * Locally (on the driver host), the following listeners are active:
> ** {{0.0.0.0:50005}}
> ** {{172.30.0.2:50003}}
> * The _Spark_ service calls back to the file server host for a JAR file using 
> the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
> * The _Spark_ service, being on a different network than the driver host, 
> cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
> file server
> ** A {{netstat}} on the _Spark_ service host will show the connection to the 
> file server host as being in {{SYN_SENT}} state until the process gives up 
> trying to connect
> {code:title=Driver|borderStyle=solid}
> SparkConf conf = new SparkConf()
> .setMaster("spark://172.30.0.3:7077")
> .setAppName("TestApp")
> .set("spark.driver.host", "172.30.0.2")
> .set("spark.driver.port", "50003")
> .set("spark.fileserver.port", "50005");
> JavaSparkContext sc = new JavaSparkContext(conf);
> sc.addJar("target/code.jar");
> {code}
> {code:title=Stacktrace|borderStyle=solid}
> 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> 172.30.0.3): java.net.SocketTimeoutException: connect timed out
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>   at 

[jira] [Commented] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068572#comment-15068572
 ] 

Kyle Sutton commented on SPARK-6476:


The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all interfaces on the 
file server host, the _Spark_ service calls back to the host's default IP address, 
to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
* A connection is made from the driver host to the _Spark_ service
** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
{{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the driver host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the driver host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

{code:title=Driver|borderStyle=solid}
SparkConf conf = new SparkConf()
.setMaster("spark://172.30.0.3:7077")
.setAppName("TestApp")
.set("spark.driver.host", "172.30.0.2")
.set("spark.driver.port", "50003")
.set("spark.fileserver.port", "50005");
JavaSparkContext sc = new JavaSparkContext(conf);
sc.addJar("target/code.jar");
{code}

{code:title=Stacktrace|borderStyle=solid}
15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
172.30.0.3): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at 

[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-12-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068627#comment-15068627
 ] 

Maciej Bryński edited comment on SPARK-11437 at 12/22/15 7:44 PM:
--

[~davies]
Are you sure that this patch is OK?

Right now, if I'm creating a DataFrame from an RDD of Rows, there is no schema 
validation, so we can create a schema with wrong types.

{code}
schema = StructType([StructField("id", IntegerType()), StructField("name", 
IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(id=1, name=None)]
{code}

Even better: columns can change places.
{code}
schema = StructType([StructField("name", IntegerType())]) 
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(name=1)]
{code}


was (Author: maver1ck):
[~davies]
Are you sure that this patch is OK ?

Right now if I'm creating DataFrame from RDD of Rows there is no schema 
validation.
So we can create schema with wrong types.

{code}
from pyspark.sql.types import *
schema = StructType([StructField("id", IntegerType()), StructField("name", 
IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(id=1, name=None)]
{code}

Even better. Column can change places.
{code}
from pyspark.sql.types import *
schema = StructType([StructField("name", IntegerType())]) 
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), 
schema).collect()

[Row(name=1)]
{code}

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
>Assignee: Jason White
> Fix For: 1.6.0
>
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12483:

Attachment: SPARK_12483_unitTest.java

add a unit test file 

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> The following unit test demonstrates a bug in as(); the column name for aliasDF 
> was not changed:
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12484:
---

 Summary: DataFrame withColumn() does not work in Java
 Key: SPARK-12484
 URL: https://issues.apache.org/jira/browse/SPARK-12484
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
 Environment: mac El Cap. 10.11.2
Java 8
Reporter: Andrew Davidson


DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises

 org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
transformedByUDF#3];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)
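
A hedged sketch of a usage pattern that avoids this exception (not the reporter's code): the AnalysisException above typically means the new Column was resolved against a different DataFrame than {{df}}; deriving it from the same DataFrame, for example through a registered UDF, keeps the attribute resolvable. It assumes {{sqlContext}} and {{df}} are the SQLContext and DataFrame from the attached UDFTest, and the UDF and column names are illustrative.

{code}
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

// Register a UDF and apply it to a column of the same DataFrame.
sqlContext.udf().register("myUDF",
    (UDF1<String, String>) s -> s.toUpperCase(), DataTypes.StringType);
DataFrame transformed = df.withColumn("transformedByUDF",
    functions.callUDF("myUDF", df.col("labelStr")));
{code}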



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068684#comment-15068684
 ] 

Andrew Davidson commented on SPARK-12484:
-

You can find some more background on the email thread 'should I file a bug? 
Re: trouble implementing Transformer and calling DataFrame.withColumn()' 



> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner

2015-12-22 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068693#comment-15068693
 ] 

Pete Robbins commented on SPARK-12470:
--

I'm fairly sure the code in my PR is correct but it is causing an 
ExchangeCoordinatorSuite test to fail. I'm struggling to see why this test is 
failing with the change I made. The failure is:

 determining the number of reducers: aggregate operator *** FAILED ***
 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316)

Putting some debug into the test, I see that before my change the pre-shuffle 
partition sizes are 600, 600, 600, 600, 600 and after my change they are 800, 800, 
800, 800, 720, but I have no idea why. I'd really appreciate anyone with 
knowledge of this area a) checking my PR and b) helping explain the failing 
test.

> Incorrect calculation of row size in 
> o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
> ---
>
> Key: SPARK-12470
> URL: https://issues.apache.org/jira/browse/SPARK-12470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Pete Robbins
>Priority: Minor
>
> While looking into https://issues.apache.org/jira/browse/SPARK-12319 I 
> noticed that the row size is incorrectly calculated.
> The "sizeReduction" value is calculated in words:
>// The number of words we can reduce when we concat two rows together.
> // The only reduction comes from merging the bitset portion of the two 
> rows, saving 1 word.
> val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords
> but then it is subtracted from the size of the row in bytes:
>|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - 
> $sizeReduction);
>  
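A sketch of the presumably intended calculation (an assumption based on the description above, not the merged patch), converting the word count to bytes before subtracting it from a byte count:

{code}
// The reduction is computed in 8-byte words, so convert it to bytes before
// subtracting it from sizeInBytes (a byte count).
val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords // in words
val sizeReductionBytes = sizeReduction * 8                          // in bytes
// ... and in the generated code:
// |out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - $sizeReductionBytes);
{code}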



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12489) Fix minor issues found by Findbugs

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12489:


Assignee: (was: Apache Spark)

> Fix minor issues found by Findbugs
> --
>
> Key: SPARK-12489
> URL: https://issues.apache.org/jira/browse/SPARK-12489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core, SQL
>Reporter: Shixiong Zhu
>Priority: Minor
>
> Just used FindBugs to scan the codes and fixed some real issues:
> 1. Close `java.sql.Statement`
> 2. Fix incorrect `asInstanceOf`.
> 3. Remove unnecessary `synchronized` and `ReentrantLock`.
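As an illustration of item 1 only (the actual fixes are in the linked PR), closing a java.sql.Statement in a finally block guarantees it is released even when execution throws; `connection` below is an assumed, already-open java.sql.Connection.

{code}
val stmt = connection.createStatement()
try {
  stmt.execute("SELECT 1")
} finally {
  // Always release the statement, even if execute() throws.
  stmt.close()
}
{code}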



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12489) Fix minor issues found by Findbugs

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12489:


Assignee: Apache Spark

> Fix minor issues found by Findbugs
> --
>
> Key: SPARK-12489
> URL: https://issues.apache.org/jira/browse/SPARK-12489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core, SQL
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> Just used FindBugs to scan the codes and fixed some real issues:
> 1. Close `java.sql.Statement`
> 2. Fix incorrect `asInstanceOf`.
> 3. Remove unnecessary `synchronized` and `ReentrantLock`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12491) UDAF result differs in SQL if alias is used

2015-12-22 Thread Tristan (JIRA)
Tristan created SPARK-12491:
---

 Summary: UDAF result differs in SQL if alias is used
 Key: SPARK-12491
 URL: https://issues.apache.org/jira/browse/SPARK-12491
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Tristan


Using the GeometricMean UDAF example 
(https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html),
 I found the following discrepancy in results:

scala> sqlContext.sql("select group_id, gm(id) from simple group by 
group_id").show()
+--------+---+
|group_id|_c1|
+--------+---+
|       0|0.0|
|       1|0.0|
|       2|0.0|
+--------+---+


scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple 
group by group_id").show()
+--------+-----------------+
|group_id|    GeometricMean|
+--------+-----------------+
|       0|8.981385496571725|
|       1|7.301716979342118|
|       2|7.706253151292568|
+--------+-----------------+
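For completeness, a sketch of the setup assumed by the queries above (GeometricMean is the UDAF class from the linked blog post; the exact data used there may differ):

{code}
// Register the UDAF under the name used in the queries and create the
// "simple" table.
import sqlContext.implicits._

sqlContext.udf.register("gm", new GeometricMean)

val df = sqlContext.range(1, 20).withColumn("group_id", $"id" % 3)
df.registerTempTable("simple")
{code}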



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis

2015-12-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12102.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10156
[https://github.com/apache/spark/pull/10156]

> Cast a non-nullable struct field to a nullable field during analysis
> 
>
> Key: SPARK-12102
> URL: https://issues.apache.org/jira/browse/SPARK-12102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
> Fix For: 2.0.0
>
>
> If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, 
> cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will 
> see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > 
> 0) THEN 
> struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4)
>  as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE 
> expressions should all be same type or coercible to a common type; line 1 pos 
> 85}}.
> The problem is the nullability difference between {{4}} (non-nullable) and 
> {{hash(4)}} (nullable).
> Seems it makes sense to cast the nullability in the analysis. 
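A tiny sketch of the underlying mismatch (illustration only, with made-up field names): two struct types that are identical except for the nullability of a field are not equal, which is what the type check trips over before this fix.

{code}
import org.apache.spark.sql.types._

val withNullableField    = StructType(Seq(StructField("c", IntegerType, nullable = true)))
val withNonNullableField = StructType(Seq(StructField("c", IntegerType, nullable = false)))

// Same field name and data type, but the nullability flag differs.
assert(withNullableField != withNonNullableField)
{code}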



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis

2015-12-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12102:
-
Assignee: Dilip Biswal

> Cast a non-nullable struct field to a nullable field during analysis
> 
>
> Key: SPARK-12102
> URL: https://issues.apache.org/jira/browse/SPARK-12102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Dilip Biswal
> Fix For: 2.0.0
>
>
> If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, 
> cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will 
> see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > 
> 0) THEN 
> struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4)
>  as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE 
> expressions should all be same type or coercible to a common type; line 1 pos 
> 85}}.
> The problem is the nullability difference between {{4}} (non-nullable) and 
> {{hash(4)}} (nullable).
> Seems it makes sense to cast the nullability in the analysis. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner

2015-12-22 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068693#comment-15068693
 ] 

Pete Robbins edited comment on SPARK-12470 at 12/22/15 9:47 PM:


I'm fairly sure the code in my PR is correct, but it causes an 
ExchangeCoordinatorSuite test to fail, and I'm struggling to see why. The failure is:

 determining the number of reducers: aggregate operator *** FAILED ***
 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316)

Putting some debug into the test, I see that before my change the pre-shuffle 
partition sizes are 600, 600, 600, 600, 600 and after my change they are 
800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with 
knowledge of this area a) checking my PR and b) helping explain the failing test.

EDIT: Please ignore. After merging with the latest head, which includes the changes 
for SPARK-12388, all tests now pass.


was (Author: robbinspg):
I'm fairly sure the code in my PR is correct, but it causes an 
ExchangeCoordinatorSuite test to fail, and I'm struggling to see why. The failure is:

 determining the number of reducers: aggregate operator *** FAILED ***
 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316)

Putting some debug into the test, I see that before my change the pre-shuffle 
partition sizes are 600, 600, 600, 600, 600 and after my change they are 
800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with 
knowledge of this area a) checking my PR and b) helping explain the failing test.

> Incorrect calculation of row size in 
> o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
> ---
>
> Key: SPARK-12470
> URL: https://issues.apache.org/jira/browse/SPARK-12470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Pete Robbins
>Priority: Minor
>
> While looking into https://issues.apache.org/jira/browse/SPARK-12319 I 
> noticed that the row size is incorrectly calculated.
> The "sizeReduction" value is calculated in words:
>// The number of words we can reduce when we concat two rows together.
> // The only reduction comes from merging the bitset portion of the two 
> rows, saving 1 word.
> val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords
> but then it is subtracted from the size of the row in bytes:
>|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - 
> $sizeReduction);
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-22 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068908#comment-15068908
 ] 

Timothy Hunter commented on SPARK-12247:


It seems to me that the calculation of false positives is more relevant for the 
movie ratings, and that the RMSE computation right above in the example is already 
a good example. What do you think?
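A minimal sketch of the RMSE evaluation being referred to, assuming a `ratings` DataFrame with (user, item, rating) columns (the column names and split ratios here are assumptions, not the final documentation text):

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

val als = new ALS()
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")
val model = als.fit(training)

// Root-mean-square error of the predicted ratings on the held-out set.
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(model.transform(test))
{code}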

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12487) Add docs for Kafka message handler

2015-12-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-12487.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Add docs for Kafka message handler
> --
>
> Key: SPARK-12487
> URL: https://issues.apache.org/jira/browse/SPARK-12487
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls

2015-12-22 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-12490:
--

 Summary: Don't use Javascript for web UI's paginated table 
navigation controls
 Key: SPARK-12490
 URL: https://issues.apache.org/jira/browse/SPARK-12490
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Josh Rosen
Assignee: Josh Rosen


The web UI's paginated table uses Javascript to implement certain navigation 
controls, such as table sorting and the "go to page" form. This is unnecessary 
and should be simplified to use plain HTML form controls and links.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls

2015-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068802#comment-15068802
 ] 

Apache Spark commented on SPARK-12490:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10441

> Don't use Javascript for web UI's paginated table navigation controls
> -
>
> Key: SPARK-12490
> URL: https://issues.apache.org/jira/browse/SPARK-12490
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The web UI's paginated table uses Javascript to implement certain navigation 
> controls, such as table sorting and the "go to page" form. This is 
> unnecessary and should be simplified to use plain HTML form controls and 
> links.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs

2015-12-22 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068827#comment-15068827
 ] 

Ilya Ganelin commented on SPARK-12488:
--

Further investigation identifies the issue as stemming from the docTermVector 
containing zero vectors (i.e., documents with no words from the vocabulary 
present).
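A sketch of a possible workaround (an assumption based on the observation above, not a confirmed fix): drop documents whose term-count vector is all zeros before training, since the invalid term IDs appear to come from such documents.

{code}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Keep only documents that contain at least one vocabulary term.
def withoutEmptyDocs(docs: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
  docs.filter { case (_, termCounts) => termCounts.numNonzeros > 0 }

val ldaModel = new LDA().setK(10).setMaxIterations(10).run(withoutEmptyDocs(docTermVector))
{code}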

> LDA describeTopics() Generates Invalid Term IDs
> ---
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Ilya Ganelin
>
> When running the LDA model, and using the describeTopics function, invalid 
> values appear in the termID list that is returned:
> The below example generates 10 topics on a data set with a vocabulary of 685.
> {code}
> // Set LDA parameters
> val numTopics = 10
> val lda = new LDA().setK(numTopics).setMaxIterations(10)
> val ldaModel = lda.run(docTermVector)
> val distModel = 
> ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted.reverse
> res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
> 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
> 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
> 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
> 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
> 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
> 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 
> 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 
> 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 
> 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 
> 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted
> res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
> -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
> -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
> -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
> -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
> -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
> -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
> -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
> -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 
> 38, 39, 40, 41, 42, 43, 44, 45, 4...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12441) Fixing missingInput in all Logical/Physical operators

2015-12-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12441:

Summary: Fixing missingInput in all Logical/Physical operators  (was: 
Fixing missingInput in Generate/MapPartitions/AppendColumns/MapGroups/CoGroup)

> Fixing missingInput in all Logical/Physical operators
> -
>
> Key: SPARK-12441
> URL: https://issues.apache.org/jira/browse/SPARK-12441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Xiao Li
>
> The value of missingInput in 
> Generate/MapPartitions/AppendColumns/MapGroups/CoGroup is incorrect. 
> {code}
> val df = Seq((1, "a b c"), (2, "a b"), (3, "a")).toDF("number", "letters")
> val df2 =
>   df.explode('letters) {
> case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq
>   }
> df2.explain(true)
> {code}
> {code}
> == Parsed Logical Plan ==
> 'Generate UserDefinedGenerator('letters), true, false, None
> +- Project [_1#0 AS number#2,_2#1 AS letters#3]
>+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]]
> == Analyzed Logical Plan ==
> number: int, letters: string, _1: string
> Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8]
> +- Project [_1#0 AS number#2,_2#1 AS letters#3]
>+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]]
> == Optimized Logical Plan ==
> Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8]
> +- LocalRelation [number#2,letters#3], [[1,a b c],[2,a b],[3,a]]
> == Physical Plan ==
> !Generate UserDefinedGenerator(letters#3), true, false, 
> [number#2,letters#3,_1#8]
> +- LocalTableScan [number#2,letters#3], [[1,a b c],[2,a b],[3,a]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host

2015-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12482.
---
Resolution: Duplicate

I reopened the other one and reclosed this. Please see Rares's comments there 
though.

> Spark fileserver not started on same IP as configured in spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.5.2
>Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address 
> configured through {{spark.driver.host}} still exists in _Spark 1.5.2_
> The problem is that, while the file server is listening on all ports on the 
> file server host, the _Spark_ service attempts to call back to the default 
> port of the host, to which it may or may not have connectivity.
> For instance, the following setup causes a 
> {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact 
> the _Spark_ driver host for a JAR:
> * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
> connection IP of {{172.30.0.2}}
> * _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
> * A connection is made from the driver host to the _Spark_ service
> ** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
> {{172.30.0.2}}
> ** {{spark.driver.port}} is set to {{50003}}
> ** {{spark.fileserver.port}} is set to {{50005}}
> * Locally (on the driver host), the following listeners are active:
> ** {{0.0.0.0:50005}}
> ** {{172.30.0.2:50003}}
> * The _Spark_ service calls back to the file server host for a JAR file using 
> the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
> * The _Spark_ service, being on a different network than the driver host, 
> cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
> file server
> ** A {{netstat}} on the _Spark_ service host will show the connection to the 
> file server host as being in {{SYN_SENT}} state until the process gives up 
> trying to connect
> {code:title=Driver|borderStyle=solid}
> SparkConf conf = new SparkConf()
> .setMaster("spark://172.30.0.3:7077")
> .setAppName("TestApp")
> .set("spark.driver.host", "172.30.0.2")
> .set("spark.driver.port", "50003")
> .set("spark.fileserver.port", "50005");
> JavaSparkContext sc = new JavaSparkContext(conf);
> sc.addJar("target/code.jar");
> {code}
> {code:title=Stacktrace|borderStyle=solid}
> 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> 172.30.0.3): java.net.SocketTimeoutException: connect timed out
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at 

[jira] [Reopened] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-6476:
--

> Spark fileserver not started on same IP as using spark.driver.host
> --
>
> Key: SPARK-6476
> URL: https://issues.apache.org/jira/browse/SPARK-6476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Rares Vernica
>
> I initially inquired about this here: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E
> If the Spark driver host has multiple IPs and spark.driver.host is set to one 
> of them, I would expect the fileserver to start on the same IP. I checked 
> HttpServer and the jetty Server is started the default IP of the machine: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75
> Something like this might work instead:
> {code:title=HttpServer.scala#L75}
> val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 
> 0))
> {code}
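A slightly more defensive variant of the suggestion above (a sketch only, not the actual patch): bind to spark.driver.host when it is configured and otherwise keep the current default-interface behaviour, using the localHostName helper Spark already has.

{code}
// In HttpServer.scala, where `conf` is the SparkConf passed to the server:
// fall back to the existing default when spark.driver.host is unset.
val bindHost = conf.getOption("spark.driver.host").getOrElse(Utils.localHostName())
val server = new Server(new InetSocketAddress(bindHost, 0))
{code}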



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068565#comment-15068565
 ] 

Kyle Sutton commented on SPARK-12482:
-

Actually, how do I reopen a ticket I didn't write?

> Spark fileserver not started on same IP as configured in spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.5.2
>Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address 
> configured through {{spark.driver.host}} still exists in _Spark 1.5.2_
> The problem is that, while the file server is listening on all ports on the 
> file server host, the _Spark_ service attempts to call back to the default 
> port of the host, to which it may or may not have connectivity.
> For instance, the following setup causes a 
> {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact 
> the _Spark_ driver host for a JAR:
> * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN 
> connection IP of {{172.30.0.2}}
> * _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
> * A connection is made from the driver host to the _Spark_ service
> ** {{spark.driver.host}} is set to the IP of the driver host on the LAN 
> {{172.30.0.2}}
> ** {{spark.driver.port}} is set to {{50003}}
> ** {{spark.fileserver.port}} is set to {{50005}}
> * Locally (on the driver host), the following listeners are active:
> ** {{0.0.0.0:50005}}
> ** {{172.30.0.2:50003}}
> * The _Spark_ service calls back to the file server host for a JAR file using 
> the driver host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
> * The _Spark_ service, being on a different network than the driver host, 
> cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
> file server
> ** A {{netstat}} on the _Spark_ service host will show the connection to the 
> file server host as being in {{SYN_SENT}} state until the process gives up 
> trying to connect
> {code:title=Driver|borderStyle=solid}
> SparkConf conf = new SparkConf()
> .setMaster("spark://172.30.0.3:7077")
> .setAppName("TestApp")
> .set("spark.driver.host", "172.30.0.2")
> .set("spark.driver.port", "50003")
> .set("spark.fileserver.port", "50005");
> JavaSparkContext sc = new JavaSparkContext(conf);
> sc.addJar("target/code.jar");
> {code}
> {code:title=Stacktrace|borderStyle=solid}
> 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> 172.30.0.3): java.net.SocketTimeoutException: connect timed out
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at 

[jira] [Assigned] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12462:


Assignee: (was: Apache Spark)

> Add ExpressionDescription to misc non-aggregate functions
> -
>
> Key: SPARK-12462
> URL: https://issues.apache.org/jira/browse/SPARK-12462
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12462:


Assignee: Apache Spark

> Add ExpressionDescription to misc non-aggregate functions
> -
>
> Key: SPARK-12462
> URL: https://issues.apache.org/jira/browse/SPARK-12462
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs

2015-12-22 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068761#comment-15068761
 ] 

Ilya Ganelin edited comment on SPARK-12488 at 12/22/15 9:32 PM:


[~josephkb] Would love your feedback here. Thanks!


was (Author: ilganeli):
@jkbradley Would love your feedback here. Thanks!

> LDA describeTopics() Generates Invalid Term IDs
> ---
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Ilya Ganelin
>
> When running the LDA model, and using the describeTopics function, invalid 
> values appear in the termID list that is returned:
> The below example generates 10 topics on a data set with a vocabulary of 685.
> {code}
> // Set LDA parameters
> val numTopics = 10
> val lda = new LDA().setK(numTopics).setMaxIterations(10)
> val ldaModel = lda.run(docTermVector)
> val distModel = 
> ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted.reverse
> res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
> 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
> 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
> 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
> 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
> 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
> 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 
> 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 
> 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 
> 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 
> 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted
> res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
> -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
> -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
> -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
> -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
> -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
> -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
> -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
> -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 
> 38, 39, 40, 41, 42, 43, 44, 45, 4...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Description: 
The issue of the file server starting up on the default IP instead of the IP 
address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all ports on the 
file server host, the Spark service attempts to call back to the default port 
of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* A _Spark_ connection is made from the launch host to a service running on 
that LAN, {{172.30.0.0/16}}
** {{spark.driver.host}} is set to the IP of that _Spark_ service host 
{{172.30.0.3}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the launch host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the launch host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.



Original report follows:

I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer and the jetty Server is started the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work instead:

{code:title=HttpServer.scala#L75}
val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0))
{code}

  was:
I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer and the jetty Server is started the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work instead:

{code:title=HttpServer.scala#L75}
val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0))
{code}


> CLONE - Spark fileserver not started on same IP as using spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Kyle Sutton
>
> The issue of the file server starting up on the default IP instead of the IP 
> address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_
> The problem is that, while the file server is listening on all ports on the 
> file server host, the Spark service attempts to call back to the default port 
> of the host, to which it may or may not have connectivity.
> For instance, the following setup causes a 
> {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact 
> the _Spark_ driver host for a JAR:
> * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
> connection IP of {{172.30.0.2}}
> * A _Spark_ connection is made from the launch host to a service running on 
> that LAN, {{172.30.0.0/16}}
> ** {{spark.driver.host}} is set to the IP of that _Spark_ service host 
> {{172.30.0.3}}
> ** {{spark.driver.port}} is set to {{50003}}
> ** {{spark.fileserver.port}} is set to {{50005}}
> * Locally (on the launch host), the following listeners are active:
> ** {{0.0.0.0:50005}}
> ** {{172.30.0.2:50003}}
> * The _Spark_ service calls back to the file server host for a JAR file using 
> the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}

[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Description: 
The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all ports on the 
file server host, the Spark service attempts to call back to the default port 
of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* A _Spark_ connection is made from the launch host to a service running on 
that LAN, {{172.30.0.0/16}}
** {{spark.driver.host}} is set to the IP of that _Spark_ service host 
{{172.30.0.3}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the launch host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the launch host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.



Original report follows:

I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer and the jetty Server is started the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work instead:

{code:title=HttpServer.scala#L75}
val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0))
{code}

  was:
The issue of the file server starting up on the default IP instead of the IP 
address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all ports on the 
file server host, the Spark service attempts to call back to the default port 
of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* A _Spark_ connection is made from the launch host to a service running on 
that LAN, {{172.30.0.0/16}}
** {{spark.driver.host}} is set to the IP of that _Spark_ service host 
{{172.30.0.3}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the launch host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the launch host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.



Original report follows:

I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer and the jetty Server is started the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work instead:


[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Affects Version/s: 1.5.2

> CLONE - Spark fileserver not started on same IP as using spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.5.2
>Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address 
> configured through {{spark.driver.host}} still exists in _Spark 1.5.2_
> The problem is that, while the file server is listening on all ports on the 
> file server host, the Spark service attempts to call back to the default port 
> of the host, to which it may or may not have connectivity.
> For instance, the following setup causes a 
> {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact 
> the _Spark_ driver host for a JAR:
> * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
> connection IP of {{172.30.0.2}}
> * A _Spark_ connection is made from the launch host to a service running on 
> that LAN, {{172.30.0.0/16}}
> ** {{spark.driver.host}} is set to the IP of that _Spark_ service host 
> {{172.30.0.3}}
> ** {{spark.driver.port}} is set to {{50003}}
> ** {{spark.fileserver.port}} is set to {{50005}}
> * Locally (on the launch host), the following listeners are active:
> ** {{0.0.0.0:50005}}
> ** {{172.30.0.2:50003}}
> * The _Spark_ service calls back to the file server host for a JAR file using 
> the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
> * The _Spark_ service, being on a different network than the launch host, 
> cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
> file server
> ** A {{netstat}} on the _Spark_ service host will show the connection to the 
> file server host as being in {{SYN_SENT}} state until the process gives up 
> trying to connect
> Given that the configured {{spark.driver.host}} must necessarily be 
> accessible by the _Spark_ service, it makes sense to reuse it for fileserver 
> connections and any other _Spark_ service -> driver host connections.
> 
> Original report follows:
> I initially inquired about this here: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E
> If the Spark driver host has multiple IPs and spark.driver.host is set to one 
> of them, I would expect the fileserver to start on the same IP. I checked 
> HttpServer and the jetty Server is started the default IP of the machine: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75
> Something like this might work instead:
> {code:title=HttpServer.scala#L75}
> val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 
> 0))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Description: 
The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all ports on the 
file server host, the _Spark_ service attempts to call back to the default port 
of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* A _Spark_ connection is made from the launch host to a service running on 
that LAN, {{172.30.0.0/16}}
** {{spark.driver.host}} is set to the IP of that _Spark_ service host 
{{172.30.0.3}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the launch host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the launch host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.

\\

\\
Original report follows:

I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer and the jetty Server is started the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work instead:

{code:title=HttpServer.scala#L75}
val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0))
{code}

  was:
The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all ports on the 
file server host, the _Spark_ service attempts to call back to the default port 
of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* A _Spark_ connection is made from the launch host to a service running on 
that LAN, {{172.30.0.0/16}}
** {{spark.driver.host}} is set to the IP of that _Spark_ service host 
{{172.30.0.3}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the launch host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the launch host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.



Original report follows:

I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer and the jetty Server is started the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work instead:


[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host

2015-12-22 Thread Kyle Sutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Sutton updated SPARK-12482:

Description: 
The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all ports on the 
file server host, the _Spark_ service attempts to call back to the default port 
of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* A _Spark_ connection is made from the launch host to a service running on 
that LAN, {{172.30.0.0/16}}
** {{spark.driver.host}} is set to the IP of that _Spark_ service host 
{{172.30.0.3}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the launch host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the launch host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.



Original report follows:

I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer and the jetty Server is started the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work instead:

{code:title=HttpServer.scala#L75}
val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0))
{code}

  was:
The issue of the file server using the default IP instead of the IP address 
configured through {{spark.driver.host}} still exists in _Spark 1.5.2_

The problem is that, while the file server is listening on all ports on the 
file server host, the Spark service attempts to call back to the default port 
of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} 
when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Launch host has a default IP of {{192.168.1.2}} and a secondary LAN 
connection IP of {{172.30.0.2}}
* A _Spark_ connection is made from the launch host to a service running on 
that LAN, {{172.30.0.0/16}}
** {{spark.driver.host}} is set to the IP of that _Spark_ service host 
{{172.30.0.3}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the launch host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using 
the file server host's default IP:  {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the launch host, 
cannot see the {{192.168.1.0/24}} address space, and fails to connect to the 
file server
** A {{netstat}} on the _Spark_ service host will show the connection to the 
file server host as being in {{SYN_SENT}} state until the process gives up 
trying to connect

Given that the configured {{spark.driver.host}} must necessarily be accessible 
by the _Spark_ service, it makes sense to reuse it for fileserver connections 
and any other _Spark_ service -> driver host connections.



Original report follows:

I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer, and the Jetty Server is started on the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work instead:


[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"

2015-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12479:


Assignee: (was: Apache Spark)

>  sparkR collect on GroupedData  throws R error "missing value where 
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
>  Issue Type: Bug
>  Components: R, SparkR
>Affects Versions: 1.5.1
>Reporter: Paulo Magalhaes
>
> sparkR collect on GroupedData  throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which 
> the hashCode could not be calculated.
> The following code recreates the problem when run in SparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
>   missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because of the 
> bitwise shift below returning NA.
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function. 
> I believe the bitwShiftL function is working as it is supposed to. Therefore, 
> this PR fixes it in the SparkR package: 
> https://github.com/apache/spark/pull/10436
> .
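
For JVM-side context (my own illustration, not part of the fix): SparkR's 
{{hashCode}} is, as far as I can tell, meant to reproduce Java's string hash, 
where 32-bit overflow simply wraps around; R's {{bitwShiftL}} returns NA on 
overflow instead, which is what trips the TRUE/FALSE check above. A minimal 
Scala sketch of the JVM-side behaviour:

{code:title=JvmStringHash.scala (illustration only)}
object JvmStringHash {
  // Java's String.hashCode: h = 31 * h + c for each character, with plain
  // 32-bit wrap-around on overflow rather than NA or an error.
  def javaStringHash(s: String): Int =
    s.foldLeft(0)((h, c) => 31 * h + c.toInt)

  def main(args: Array[String]): Unit = {
    val key = "bc53d3605e8a5b7de1e8e271c2317645"
    println(javaStringHash(key)) // same value as key.hashCode
    println(key.hashCode)
    // The JVM analogue of R's bitwShiftL(-1073741824, 1): it wraps to
    // Int.MinValue instead of becoming NA.
    println(-1073741824 << 1)
  }
}
{code}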



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"

2015-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069092#comment-15069092
 ] 

Apache Spark commented on SPARK-12479:
--

User 'paulomagalhaes' has created a pull request for this issue:
https://github.com/apache/spark/pull/10436

>  sparkR collect on GroupedData  throws R error "missing value where 
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
>  Issue Type: Bug
>  Components: R, SparkR
>Affects Versions: 1.5.1
>Reporter: Paulo Magalhaes
>
> sparkR collect on GroupedData  throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which 
> the hashCode could not be calculated.
> The following code recreates the problem when run in SparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
>   missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because of the 
> bitwise shift below returning NA.
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function. 
> I believe the bitwShiftL function is working as it is supposed to. Therefore, 
> this PR fixes it in the SparkR package: 
> https://github.com/apache/spark/pull/10436
> .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12385) Push projection into Join

2015-12-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069133#comment-15069133
 ] 

Xiao Li commented on SPARK-12385:
-

I will work on it if nobody has started on it yet. Thanks!

> Push projection into Join
> -
>
> Key: SPARK-12385
> URL: https://issues.apache.org/jira/browse/SPARK-12385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> We usually have a Join followed by a projection to prune some columns, but 
> Join already has a result projection to produce UnsafeRow, so we should 
> combine the two.
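
A tiny, self-contained example of the query shape being discussed (table and 
column names are made up); the {{select}} after the {{join}} currently shows up 
as a separate Project on top of the Join, and the idea is to fold that pruning 
into the result projection the Join already performs:

{code:title=ProjectAfterJoin.scala (illustration only)}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ProjectAfterJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("project-after-join"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val orders    = Seq((1, 10.0), (2, 20.0)).toDF("customer_id", "total")
    val customers = Seq((1, "alice"), (2, "bob")).toDF("customer_id", "name")

    // Join produces the columns of both sides, then a projection prunes them.
    val pruned = orders.join(customers, "customer_id").select("customer_id", "total")
    pruned.explain(true) // shows the extra Project on top of the join

    sc.stop()
  }
}
{code}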



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11164) Add InSet pushdown filter back for Parquet

2015-12-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-11164.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10278
[https://github.com/apache/spark/pull/10278]

> Add InSet pushdown filter back for Parquet
> --
>
> Key: SPARK-11164
> URL: https://issues.apache.org/jira/browse/SPARK-11164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11164) Add InSet pushdown filter back for Parquet

2015-12-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11164:
---
Assignee: Xiao Li

> Add InSet pushdown filter back for Parquet
> --
>
> Key: SPARK-11164
> URL: https://issues.apache.org/jira/browse/SPARK-11164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Xiao Li
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12492) SQL page of Spark-sql is always blank

2015-12-22 Thread meiyoula (JIRA)
meiyoula created SPARK-12492:


 Summary: SQL page of Spark-sql is always blank 
 Key: SPARK-12492
 URL: https://issues.apache.org/jira/browse/SPARK-12492
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: meiyoula


When I run a SQL query in spark-sql, the Execution page of the SQL tab is always 
blank. But the same page for the JDBCServer is not blank.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



