[jira] [Commented] (SPARK-19569) could not get APP ID and cause failed to connect to spark driver on yarn-client mode

2017-05-19 Thread Xiaochen Ouyang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016996#comment-16016996
 ] 

Xiaochen Ouyang commented on SPARK-19569:
-

It is really a problem and we should reopen this issue, because we can reproduce 
it programmatically, as follows:
val conf = new SparkConf()
conf.set("spark.app.name", "SparkOnYarnClient")
conf.setMaster("yarn-client")
conf.set("spark.driver.host", "192.168.10.128")
val arg0 = new ArrayBuffer[String]()
arg0 += "--jar"
arg0 += args(0)
arg0 += "--class"
arg0 += "com.hello.SparkPI"
// cArgs and hadoopConf were not shown in the original snippet; presumably something
// like the following (org.apache.spark.deploy.yarn.ClientArguments and a Hadoop
// YarnConfiguration):
val cArgs = new ClientArguments(arg0.toArray)
val hadoopConf = new YarnConfiguration()
val client = new Client(cArgs, hadoopConf, conf)
client.submitApplication()

However, the submission succeeds when we use the spark-submit shell to submit a 
job in yarn-client mode.


> could not  get APP ID and cause failed to connect to spark driver on 
> yarn-client mode
> -
>
> Key: SPARK-19569
> URL: https://issues.apache.org/jira/browse/SPARK-19569
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: hadoop2.7.1
> spark2.0.2
> hive2.2
>Reporter: KaiXu
>
> when I run Hive queries on Spark, got below error in the console, after check 
> the container's log, found it failed to connected to spark driver. I have set 
>  hive.spark.job.monitor.timeout=3600s, so the log said 'Job hasn't been 
> submitted after 3601s', actually during this long-time period it's impossible 
> no available resource, and also did not see any issue related to the network, 
> so the cause is not clear from the message "Possible reasons include network 
> issues, errors in remote driver or the cluster has no available resources, 
> etc.".
> From Hive's log, failed to get APP ID, so this might be the cause why the 
> driver did not start up.
> console log:
> Starting Spark Job = e9ce42c8-ff20-4ac8-803f-7668678c2a00
> Job hasn't been submitted after 3601s. Aborting it.
> Possible reasons include network issues, errors in remote driver or the 
> cluster has no available resources, etc.
> Please check YARN or Spark driver's logs for further information.
> Status: SENT
> FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
> container's log:
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Preparing Local resources
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Prepared Local resources 
> Map(__spark_libs__ -> resource { scheme: "hdfs" host: "hsx-node1" port: 8020 
> file: 
> "/user/root/.sparkStaging/application_1486905599813_0046/__spark_libs__6842484649003444330.zip"
>  } size: 153484072 timestamp: 1486926551130 type: ARCHIVE visibility: 
> PRIVATE, __spark_conf__ -> resource { scheme: "hdfs" host: "hsx-node1" port: 
> 8020 file: 
> "/user/root/.sparkStaging/application_1486905599813_0046/__spark_conf__.zip" 
> } size: 116245 timestamp: 1486926551318 type: ARCHIVE visibility: PRIVATE)
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
> appattempt_1486905599813_0046_02
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls to: root
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls to: root
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls groups to: 
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls groups to: 
> 17/02/13 05:05:54 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(root); groups 
> with view permissions: Set(); users  with modify permissions: Set(root); 
> groups with modify permissions: Set()
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Waiting for Spark driver to be 
> reachable.
> 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Fai

[jira] [Comment Edited] (SPARK-19569) could not get APP ID and cause failed to connect to spark driver on yarn-client mode

2017-05-19 Thread Xiaochen Ouyang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016996#comment-16016996
 ] 

Xiaochen Ouyang edited comment on SPARK-19569 at 5/19/17 7:04 AM:
--

It is really a problem and we should reopen this issue, because we can reproduce 
it programmatically, as follows:
val conf = new SparkConf()
conf.set("spark.app.name", "SparkOnYarnClient")
conf.setMaster("yarn-client")
conf.set("spark.driver.host", "192.168.10.128")
val arg0 = new ArrayBuffer[String]()
arg0 += "--jar"
arg0 += args(0)
arg0 += "--class"
arg0 += "com.hello.SparkPI"
// cArgs and hadoopConf were not shown in the original snippet; presumably something
// like the following (org.apache.spark.deploy.yarn.ClientArguments and a Hadoop
// YarnConfiguration):
val cArgs = new ClientArguments(arg0.toArray)
val hadoopConf = new YarnConfiguration()
val client = new Client(cArgs, hadoopConf, conf)
client.submitApplication()

However, the submission succeeds when we use the spark-submit shell to submit a 
job in yarn-client mode.
[~srowen]


was (Author: ouyangxc.zte):
It is really a problem and we should reopen this issue, because we can reproduce 
it programmatically, as follows:
val conf = new SparkConf()
conf.set("spark.app.name", "SparkOnYarnClient")
conf.setMaster("yarn-client")
conf.set("spark.driver.host", "192.168.10.128")
val arg0 = new ArrayBuffer[String]()
arg0 += "--jar"
arg0 += args(0)
arg0 += "--class"
arg0 += "com.hello.SparkPI"
// cArgs and hadoopConf were not shown in the original snippet; presumably something
// like the following (org.apache.spark.deploy.yarn.ClientArguments and a Hadoop
// YarnConfiguration):
val cArgs = new ClientArguments(arg0.toArray)
val hadoopConf = new YarnConfiguration()
val client = new Client(cArgs, hadoopConf, conf)
client.submitApplication()

However, the submission succeeds when we use the spark-submit shell to submit a 
job in yarn-client mode.


> could not  get APP ID and cause failed to connect to spark driver on 
> yarn-client mode
> -
>
> Key: SPARK-19569
> URL: https://issues.apache.org/jira/browse/SPARK-19569
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: hadoop2.7.1
> spark2.0.2
> hive2.2
>Reporter: KaiXu
>
> when I run Hive queries on Spark, got below error in the console, after check 
> the container's log, found it failed to connected to spark driver. I have set 
>  hive.spark.job.monitor.timeout=3600s, so the log said 'Job hasn't been 
> submitted after 3601s', actually during this long-time period it's impossible 
> no available resource, and also did not see any issue related to the network, 
> so the cause is not clear from the message "Possible reasons include network 
> issues, errors in remote driver or the cluster has no available resources, 
> etc.".
> From Hive's log, failed to get APP ID, so this might be the cause why the 
> driver did not start up.
> console log:
> Starting Spark Job = e9ce42c8-ff20-4ac8-803f-7668678c2a00
> Job hasn't been submitted after 3601s. Aborting it.
> Possible reasons include network issues, errors in remote driver or the 
> cluster has no available resources, etc.
> Please check YARN or Spark driver's logs for further information.
> Status: SENT
> FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
> container's log:
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Preparing Local resources
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Prepared Local resources 
> Map(__spark_libs__ -> resource { scheme: "hdfs" host: "hsx-node1" port: 8020 
> file: 
> "/user/root/.sparkStaging/application_1486905599813_0046/__spark_libs__6842484649003444330.zip"
>  } size: 153484072 timestamp: 1486926551130 type: ARCHIVE visibility: 
> PRIVATE, __spark_conf__ -> resource { scheme: "hdfs" host: "hsx-node1" port: 
> 8020 file: 
> "/user/root/.sparkStaging/application_1486905599813_0046/__spark_conf__.zip" 
> } size: 116245 timestamp: 1486926551318 type: ARCHIVE visibility: PRIVATE)
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
> appattempt_1486905599813_0046_02
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls to: root
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls to: root
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls groups to: 
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls groups to: 
> 17/02/13 05:05:54 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(root); groups 
> with view permissions: Set(); users  with modify permissions: Set(root); 
> groups with modify permissions: Set()
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Waiting for Spark driver to be 
> reachable.
> 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.1

[jira] [Commented] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x

2017-05-19 Thread Morten Hornbech (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017001#comment-16017001
 ] 

Morten Hornbech commented on SPARK-17875:
-

We were just hit by a runtime error caused by this apparently obsolete 
dependency. More specifically, the version of SslHandler used by Netty 3.8 is 
not binary compatible with the one we use (and the one spark-core uses from 
Netty 4.0).

We can get around this by shading our own dependency, but I think it's a bit 
nasty having this floating around risking unnecessary runtime errors - 
dependency management is difficult enough as it is :-) Could we reopen the 
issue?

> Remove unneeded direct dependence on Netty 3.x
> --
>
> Key: SPARK-17875
> URL: https://issues.apache.org/jira/browse/SPARK-17875
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
>
> The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is 
> used. It's best to remove the 3.x dependency (and while we're at it, update a 
> few things like license info)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x

2017-05-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017006#comment-16017006
 ] 

Sean Owen commented on SPARK-17875:
---

Did you see my pull request?

> Remove unneeded direct dependence on Netty 3.x
> --
>
> Key: SPARK-17875
> URL: https://issues.apache.org/jira/browse/SPARK-17875
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
>
> The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is 
> used. It's best to remove the 3.x dependency (and while we're at it, update a 
> few things like license info)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20806) Launcher:redundant code,invalid branch of judgment

2017-05-19 Thread Phoenix_Dad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017041#comment-16017041
 ] 

Phoenix_Dad commented on SPARK-20806:
-

the "libdir.isDirectory()" expression is always true within the "if" branch:

if (new File(sparkHome, "jars").isDirectory()) {
  libdir = new File(sparkHome, "jars");
  // libdir was just verified to be a directory above, so this check can never fail
  checkState(!failIfNotFound || libdir.isDirectory(),
      "Library directory '%s' does not exist.",
      libdir.getAbsolutePath());
}

> Launcher:redundant code,invalid branch of judgment
> --
>
> Key: SPARK-20806
> URL: https://issues.apache.org/jira/browse/SPARK-20806
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.1
>Reporter: Phoenix_Dad
>
>   org.apache.spark.launcher.CommandBuilderUtils
>   In findJarsDir function, there is an “if or else” branch .
>   the first input argument of 'checkState' in 'if' subclause is always true, 
> so 'checkState' is useless here



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20806) Launcher:redundant code,invalid branch of judgment

2017-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-20806:
---

OK, I get it. That should be in the description.

> Launcher:redundant code,invalid branch of judgment
> --
>
> Key: SPARK-20806
> URL: https://issues.apache.org/jira/browse/SPARK-20806
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.1
>Reporter: Phoenix_Dad
>
>   org.apache.spark.launcher.CommandBuilderUtils
>   In findJarsDir function, there is an “if or else” branch .
>   the first input argument of 'checkState' in 'if' subclause is always true, 
> so 'checkState' is useless here



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20806) Launcher: redundant check for Spark lib dir

2017-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20806:
--
Summary: Launcher: redundant check for Spark lib dir  (was: 
Launcher:redundant code,invalid branch of judgment)

> Launcher: redundant check for Spark lib dir
> ---
>
> Key: SPARK-20806
> URL: https://issues.apache.org/jira/browse/SPARK-20806
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.1
>Reporter: Phoenix_Dad
>Priority: Trivial
>
>   org.apache.spark.launcher.CommandBuilderUtils
>   In findJarsDir function, there is an “if or else” branch .
>   the first input argument of 'checkState' in 'if' subclause is always true, 
> so 'checkState' is useless here



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20806) Launcher:redundant code,invalid branch of judgment

2017-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20806:
--
  Priority: Trivial  (was: Major)
Issue Type: Improvement  (was: Bug)

> Launcher:redundant code,invalid branch of judgment
> --
>
> Key: SPARK-20806
> URL: https://issues.apache.org/jira/browse/SPARK-20806
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.1
>Reporter: Phoenix_Dad
>Priority: Trivial
>
>   org.apache.spark.launcher.CommandBuilderUtils
>   In findJarsDir function, there is an “if or else” branch .
>   the first input argument of 'checkState' in 'if' subclause is always true, 
> so 'checkState' is useless here



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20806) Launcher:redundant code,invalid branch of judgment

2017-05-19 Thread Phoenix_Dad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017041#comment-16017041
 ] 

Phoenix_Dad edited comment on SPARK-20806 at 5/19/17 8:08 AM:
--

[~srowen]
the "libdir.isDirectory()" expression is always true within the "if" branch:

if (new File(sparkHome, "jars").isDirectory()) {
  libdir = new File(sparkHome, "jars");
  // libdir was just verified to be a directory above, so this check can never fail
  checkState(!failIfNotFound || libdir.isDirectory(),
      "Library directory '%s' does not exist.",
      libdir.getAbsolutePath());
}


was (Author: phoenix_dad):
the "libdir.isDirectory()" expression is always true within the "if" branch:

if (new File(sparkHome, "jars").isDirectory()) {
  libdir = new File(sparkHome, "jars");
  // libdir was just verified to be a directory above, so this check can never fail
  checkState(!failIfNotFound || libdir.isDirectory(),
      "Library directory '%s' does not exist.",
      libdir.getAbsolutePath());
}

> Launcher:redundant code,invalid branch of judgment
> --
>
> Key: SPARK-20806
> URL: https://issues.apache.org/jira/browse/SPARK-20806
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.1
>Reporter: Phoenix_Dad
>
>   org.apache.spark.launcher.CommandBuilderUtils
>   In findJarsDir function, there is an “if or else” branch .
>   the first input argument of 'checkState' in 'if' subclause is always true, 
> so 'checkState' is useless here



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20806) Launcher:redundant code,invalid branch of judgment

2017-05-19 Thread Phoenix_Dad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017041#comment-16017041
 ] 

Phoenix_Dad edited comment on SPARK-20806 at 5/19/17 8:09 AM:
--

[~srowen]
the "libdir.isDirectory()" expression is always true within the "if" branch:

if (new File(sparkHome, "jars").isDirectory()) {
  libdir = new File(sparkHome, "jars");
  // libdir was just verified to be a directory above, so this check can never fail
  checkState(!failIfNotFound || libdir.isDirectory(),
      "Library directory '%s' does not exist.",
      libdir.getAbsolutePath());
}


was (Author: phoenix_dad):
[~srowen]
the "libdir.isDirectory()" expression is always true within the "if" branch:

if (new File(sparkHome, "jars").isDirectory()) {
  libdir = new File(sparkHome, "jars");
  // libdir was just verified to be a directory above, so this check can never fail
  checkState(!failIfNotFound || libdir.isDirectory(),
      "Library directory '%s' does not exist.",
      libdir.getAbsolutePath());
}

> Launcher:redundant code,invalid branch of judgment
> --
>
> Key: SPARK-20806
> URL: https://issues.apache.org/jira/browse/SPARK-20806
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.1
>Reporter: Phoenix_Dad
>
>   org.apache.spark.launcher.CommandBuilderUtils
>   In findJarsDir function, there is an “if or else” branch .
>   the first input argument of 'checkState' in 'if' subclause is always true, 
> so 'checkState' is useless here



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20806) Launcher: redundant check for Spark lib dir

2017-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017049#comment-16017049
 ] 

Apache Spark commented on SPARK-20806:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/18032

> Launcher: redundant check for Spark lib dir
> ---
>
> Key: SPARK-20806
> URL: https://issues.apache.org/jira/browse/SPARK-20806
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.1
>Reporter: Phoenix_Dad
>Priority: Trivial
>
>   org.apache.spark.launcher.CommandBuilderUtils
>   In findJarsDir function, there is an “if or else” branch .
>   the first input argument of 'checkState' in 'if' subclause is always true, 
> so 'checkState' is useless here



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20806) Launcher: redundant check for Spark lib dir

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20806:


Assignee: (was: Apache Spark)

> Launcher: redundant check for Spark lib dir
> ---
>
> Key: SPARK-20806
> URL: https://issues.apache.org/jira/browse/SPARK-20806
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.1
>Reporter: Phoenix_Dad
>Priority: Trivial
>
>   org.apache.spark.launcher.CommandBuilderUtils
>   In findJarsDir function, there is an “if or else” branch .
>   the first input argument of 'checkState' in 'if' subclause is always true, 
> so 'checkState' is useless here



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20806) Launcher: redundant check for Spark lib dir

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20806:


Assignee: Apache Spark

> Launcher: redundant check for Spark lib dir
> ---
>
> Key: SPARK-20806
> URL: https://issues.apache.org/jira/browse/SPARK-20806
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.1
>Reporter: Phoenix_Dad
>Assignee: Apache Spark
>Priority: Trivial
>
>   org.apache.spark.launcher.CommandBuilderUtils
>   In findJarsDir function, there is an “if or else” branch .
>   the first input argument of 'checkState' in 'if' subclause is always true, 
> so 'checkState' is useless here



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x

2017-05-19 Thread Morten Hornbech (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017059#comment-16017059
 ] 

Morten Hornbech commented on SPARK-17875:
-

Sorry, I have now. If the class files are indeed in the flume assembly, my best 
guess is that this occurs because of binary compatibility issues between 4.0 
and 3.8, triggered by static members when ChannelPipelineFactory is loaded. I 
can see that ChannelPipelineFactory does not exist in 4.0, but its class 
definition references ChannelPipeline, which does. So if that were loaded from 
4.0, things could go wrong. If an upgrade of Flume to Netty 4.0 is a major 
task, a simpler solution would be to shade Netty 3.8 in the flume assembly. 
That way you should be able to get rid of it in spark-core.
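
As an illustration of the shading workaround mentioned above, here is a minimal 
sketch assuming an sbt-assembly based build; the relocated package prefix is 
arbitrary and only meant to show the idea:

{code}
// build.sbt: relocate the Netty 3.x classes inside the assembly so they can no
// longer clash with Netty 4.x on the application classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.jboss.netty.**" -> "shaded.org.jboss.netty.@1").inAll
)
{code}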

> Remove unneeded direct dependence on Netty 3.x
> --
>
> Key: SPARK-17875
> URL: https://issues.apache.org/jira/browse/SPARK-17875
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
>
> The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is 
> used. It's best to remove the 3.x dependency (and while we're at it, update a 
> few things like license info)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x

2017-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-17875:
-

Assignee: (was: Sean Owen)
Target Version/s: 3.0.0

At least we can fix this in Spark 3, when we will likely remove the Flume 
integration or something similar. It's already a dependency liability, and it's 
not clear how well supported it is.

> Remove unneeded direct dependence on Netty 3.x
> --
>
> Key: SPARK-17875
> URL: https://issues.apache.org/jira/browse/SPARK-17875
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Priority: Trivial
>
> The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is 
> used. It's best to remove the 3.x dependency (and while we're at it, update a 
> few things like license info)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19569) could not get APP ID and cause failed to connect to spark driver on yarn-client mode

2017-05-19 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017066#comment-16017066
 ] 

Saisai Shao commented on SPARK-19569:
-

[~ouyangxc.zte] In your code above you directly call 
{{client.submitApplication()}} to launch the Spark application; I assume this 
client is {{org.apache.spark.deploy.yarn.Client}}. From my understanding, 
calling this class directly is not supported. Also, if you use yarn#client 
directly to launch a Spark-on-YARN application, I suspect you will have to do a 
lot of the preparation work that SparkSubmit normally does.
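
For reference, here is a minimal sketch of the supported programmatic path, 
which goes through spark-submit via the launcher module instead of calling the 
YARN {{Client}} directly. The jar path and main class are placeholders, and 
SPARK_HOME is assumed to be set; this is an illustration, not the reporter's 
code:

{code}
import org.apache.spark.launcher.SparkLauncher

object SubmitViaLauncher {
  def main(args: Array[String]): Unit = {
    // Launches through spark-submit, so all of SparkSubmit's preparation work runs.
    val handle = new SparkLauncher()
      .setAppName("SparkOnYarnClient")
      .setMaster("yarn")
      .setDeployMode("client")
      .setAppResource("/path/to/app.jar")   // placeholder application jar
      .setMainClass("com.hello.SparkPI")
      .setConf("spark.driver.host", "192.168.10.128")
      .startApplication()                   // returns a SparkAppHandle

    // The application id becomes available once YARN accepts the application.
    while (handle.getAppId == null && !handle.getState.isFinal) Thread.sleep(500)
    println(s"appId = ${handle.getAppId}, state = ${handle.getState}")
  }
}
{code}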

> could not  get APP ID and cause failed to connect to spark driver on 
> yarn-client mode
> -
>
> Key: SPARK-19569
> URL: https://issues.apache.org/jira/browse/SPARK-19569
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: hadoop2.7.1
> spark2.0.2
> hive2.2
>Reporter: KaiXu
>
> when I run Hive queries on Spark, got below error in the console, after check 
> the container's log, found it failed to connected to spark driver. I have set 
>  hive.spark.job.monitor.timeout=3600s, so the log said 'Job hasn't been 
> submitted after 3601s', actually during this long-time period it's impossible 
> no available resource, and also did not see any issue related to the network, 
> so the cause is not clear from the message "Possible reasons include network 
> issues, errors in remote driver or the cluster has no available resources, 
> etc.".
> From Hive's log, failed to get APP ID, so this might be the cause why the 
> driver did not start up.
> console log:
> Starting Spark Job = e9ce42c8-ff20-4ac8-803f-7668678c2a00
> Job hasn't been submitted after 3601s. Aborting it.
> Possible reasons include network issues, errors in remote driver or the 
> cluster has no available resources, etc.
> Please check YARN or Spark driver's logs for further information.
> Status: SENT
> FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
> container's log:
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Preparing Local resources
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Prepared Local resources 
> Map(__spark_libs__ -> resource { scheme: "hdfs" host: "hsx-node1" port: 8020 
> file: 
> "/user/root/.sparkStaging/application_1486905599813_0046/__spark_libs__6842484649003444330.zip"
>  } size: 153484072 timestamp: 1486926551130 type: ARCHIVE visibility: 
> PRIVATE, __spark_conf__ -> resource { scheme: "hdfs" host: "hsx-node1" port: 
> 8020 file: 
> "/user/root/.sparkStaging/application_1486905599813_0046/__spark_conf__.zip" 
> } size: 116245 timestamp: 1486926551318 type: ARCHIVE visibility: PRIVATE)
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
> appattempt_1486905599813_0046_02
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls to: root
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls to: root
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls groups to: 
> 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls groups to: 
> 17/02/13 05:05:54 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(root); groups 
> with view permissions: Set(); users  with modify permissions: Set(root); 
> groups with modify permissions: Set()
> 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Waiting for Spark driver to be 
> reachable.
> 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:43656, retrying ...
> 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver 
> at 192.168.1.1:4365

[jira] [Created] (SPARK-20807) Add compression/decompression of data to ColumnVector

2017-05-19 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20807:


 Summary: Add compression/decompression of data to ColumnVector
 Key: SPARK-20807
 URL: https://issues.apache.org/jira/browse/SPARK-20807
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


While the current {{CachedBatch}} can compress data using one of multiple 
compression schemes, {{ColumnVector}} cannot compress data. Compression is 
mandatory for the table cache.

This JIRA adds compression/decompression to {{ColumnVector}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20808) External Table unnecessarily not create in Hive-compatible way

2017-05-19 Thread Joachim Hereth (JIRA)
Joachim Hereth created SPARK-20808:
--

 Summary: External Table unnecessarily not create in 
Hive-compatible way
 Key: SPARK-20808
 URL: https://issues.apache.org/jira/browse/SPARK-20808
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.1.0
Reporter: Joachim Hereth
Priority: Minor


In Spark 2.1.0 and 2.1.1 {{spark.catalog.createExternalTable}} creates tables 
unnecessarily in a hive-incompatible way.

For instance executing in a spark shell

{code}
val database = "default"
val table = "table_name"
val path = "/user/daki/"  + database + "/" + table

var data = Array(("Alice", 23), ("Laura", 33), ("Peter", 54))
val df = sc.parallelize(data).toDF("name","age") 

df.write.mode(org.apache.spark.sql.SaveMode.Overwrite).parquet(path)

spark.sql("DROP TABLE IF EXISTS " + database + "." + table)

spark.catalog.createExternalTable(database + "."+ table, path)
{code}

issues the warning

{code}
Search Subject for Kerberos V5 INIT cred (<>, 
sun.security.jgss.krb5.Krb5InitCredential)
17/05/19 11:01:17 WARN hive.HiveExternalCatalog: Could not persist 
`default`.`table_name` in a Hive compatible way. Persisting it into Hive 
metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User 
daki does not have privileges for CREATETABLE)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
...
{code}

The Exception (user does not have privileges for CREATETABLE) is misleading (I 
do have the CREATE TABLE privilege).

Querying the table with Hive does not return any result. With Spark one can 
access the data.

The following code creates the table correctly (workaround):
{code}
def sqlStatement(df : org.apache.spark.sql.DataFrame, database : String, table: 
String, path: String) : String = {
  val rows = (for(col <- df.schema) 
yield "`" + col.name + "` " + 
col.dataType.simpleString).mkString(",\n")
  val sqlStmnt = ("CREATE EXTERNAL TABLE `%s`.`%s` (%s) " +
"STORED AS PARQUET " +
"Location 'hdfs://nameservice1%s'").format(database, table, rows, path)
  return sqlStmnt
}

spark.sql("DROP TABLE IF EXISTS " + database + "." + table)
spark.sql(sqlStatement(df, database, table, path))
{code}

The code is executed via YARN against a Cloudera CDH 5.7.5 cluster with Sentry 
enabled (in case this matters regarding the privilege warning). Spark was built 
against the CDH libraries.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20808) External Table unnecessarily not create in Hive-compatible way

2017-05-19 Thread Joachim Hereth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017108#comment-16017108
 ] 

Joachim Hereth commented on SPARK-20808:


The warning is caused by an Exception raised by a call to [saveTableIntoHive() | 
https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L369].

I was not able to debug what caused the misleading Exception about privileges.


> External Table unnecessarily not create in Hive-compatible way
> --
>
> Key: SPARK-20808
> URL: https://issues.apache.org/jira/browse/SPARK-20808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Joachim Hereth
>Priority: Minor
>
> In Spark 2.1.0 and 2.1.1 {{spark.catalog.createExternalTable}} creates tables 
> unnecessarily in a hive-incompatible way.
> For instance executing in a spark shell
> {code}
> val database = "default"
> val table = "table_name"
> val path = "/user/daki/"  + database + "/" + table
> var data = Array(("Alice", 23), ("Laura", 33), ("Peter", 54))
> val df = sc.parallelize(data).toDF("name","age") 
> df.write.mode(org.apache.spark.sql.SaveMode.Overwrite).parquet(path)
> spark.sql("DROP TABLE IF EXISTS " + database + "." + table)
> spark.catalog.createExternalTable(database + "."+ table, path)
> {code}
> issues the warning
> {code}
> Search Subject for Kerberos V5 INIT cred (<>, 
> sun.security.jgss.krb5.Krb5InitCredential)
> 17/05/19 11:01:17 WARN hive.HiveExternalCatalog: Could not persist 
> `default`.`table_name` in a Hive compatible way. Persisting it into Hive 
> metastore in Spark SQL specific format.
> org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User 
> daki does not have privileges for CREATETABLE)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
> ...
> {code}
> The Exception (user does not have privileges for CREATETABLE) is misleading 
> (I do have the CREATE TABLE privilege).
> Querying the table with Hive does not return any result. With Spark one can 
> access the data.
> The following code creates the table correctly (workaround):
> {code}
> def sqlStatement(df : org.apache.spark.sql.DataFrame, database : String, 
> table: String, path: String) : String = {
>   val rows = (for(col <- df.schema) 
> yield "`" + col.name + "` " + 
> col.dataType.simpleString).mkString(",\n")
>   val sqlStmnt = ("CREATE EXTERNAL TABLE `%s`.`%s` (%s) " +
> "STORED AS PARQUET " +
> "Location 'hdfs://nameservice1%s'").format(database, table, rows, path)
>   return sqlStmnt
> }
> spark.sql("DROP TABLE IF EXISTS " + database + "." + table)
> spark.sql(sqlStatement(df, database, table, path))
> {code}
> The code is executed via YARN against a Cloudera CDH 5.7.5 cluster with 
> Sentry enabled (in case this matters regarding the privilege warning). Spark 
> was built against the CDH libraries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18838) High latency of event processing for large jobs

2017-05-19 Thread Antoine PRANG (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine PRANG updated SPARK-18838:
--
Attachment: SparkListernerComputeTime.xlsx

execution trace

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: SparkListernerComputeTime.xlsx
>
>
> Currently we are observing the issue of very high event processing delay in 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, like `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might 
> hurt job performance significantly or even fail the job. For example, a 
> significant delay in receiving `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to mistakenly remove an executor which is 
> not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listener is very slow, all the listeners will pay the 
> price of delay incurred by the slow listener. In addition to that a slow 
> listener can cause events to be dropped from the event queue which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of the events 
> per listener. The queue used for the executor service will be bounded to 
> guarantee we do not grow memory indefinitely. The downside of this approach 
> is that a separate event queue per listener will increase the driver memory 
> footprint.
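
A rough, self-contained sketch of the per-listener queue idea described in the 
proposal above; the class and parameter names are illustrative and not Spark's 
actual implementation:

{code}
import java.util.concurrent.{ExecutorService, LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Each handler gets its own bounded single-threaded executor, so a slow listener
// only delays its own queue while per-listener event ordering is preserved.
class PerListenerBus[E](handlers: Seq[E => Unit], queueCapacity: Int = 10000) {
  private val executors: Seq[(E => Unit, ExecutorService)] = handlers.map { h =>
    h -> new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
      new LinkedBlockingQueue[Runnable](queueCapacity))  // bounded per-listener queue
  }

  def post(event: E): Unit = executors.foreach { case (handler, exec) =>
    // With the default policy a full queue raises RejectedExecutionException;
    // a real implementation would decide whether to block or drop the event.
    exec.execute(new Runnable { override def run(): Unit = handler(event) })
  }

  def stop(): Unit = executors.foreach { case (_, exec) => exec.shutdown() }
}
{code}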



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-05-19 Thread Antoine PRANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017120#comment-16017120
 ] 

Antoine PRANG commented on SPARK-18838:
---

[~joshrosen] I uploaded the timings I get. I put some counters in the code; you 
can take a look at the metrics branch of my fork.
I do not have an exact profile of the methods.
First, the StorageListener really executes a lot of messages: it does not have 
a no-op handler for the most frequent messages (SparkListenerBlockUpdated), if 
I understand correctly; they are not logged in the EventLoggingListener, for 
example.
The StorageStatusListener listens to this kind of event too, and its execution 
time is not comparable, but it seems to do much more work (together with its 
parent classes).
There is a lot of synchronization that could be avoided, in my opinion.


> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: SparkListernerComputeTime.xlsx
>
>
> Currently we are observing the issue of very high event processing delay in 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, like `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might 
> hurt job performance significantly or even fail the job. For example, a 
> significant delay in receiving `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to mistakenly remove an executor which is 
> not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listener is very slow, all the listeners will pay the 
> price of delay incurred by the slow listener. In addition to that a slow 
> listener can cause events to be dropped from the event queue which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of the events 
> per listener. The queue used for the executor service will be bounded to 
> guarantee we do not grow memory indefinitely. The downside of this approach 
> is that a separate event queue per listener will increase the driver memory 
> footprint.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20809) PySpark: Java heap space issue despite apparently being within memory limits

2017-05-19 Thread James Porritt (JIRA)
James Porritt created SPARK-20809:
-

 Summary: PySpark: Java heap space issue despite apparently being 
within memory limits
 Key: SPARK-20809
 URL: https://issues.apache.org/jira/browse/SPARK-20809
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.1
 Environment: Linux x86_64
Reporter: James Porritt


I have the following script:

{code}
import itertools
import loremipsum
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.cores.max", "16") \
.set("spark.driver.memory", "16g") \
.set("spark.executor.memory", "16g") \
.set("spark.executor.memory_overhead", "16g") \
.set("spark.driver.maxResultsSize", "0")

sc = SparkContext(appName="testRDD", conf=conf)
ss = SparkSession(sc)

j = itertools.cycle(range(8))
rows = [(i, j.next(), ' '.join(map(lambda x: x[2],
    loremipsum.generate_sentences(600)))) for i in range(500)] * 100
rrd = sc.parallelize(rows, 128)
{code}

When I run it with:
{noformat}
/spark-2.1.1-bin-hadoop2.7/bin/spark-submit /writeTest.py
{noformat}

it fails with a 'Java heap space' error:

{noformat}
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
: java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:468)
at 
org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The data I create here approximates my actual data. The third element of each 
tuple should be around 25k, and there are 50k tuples overall. I estimate that I 
should have around 1.2G of data. 

Why then does it fail? All parts of the system should have enough memory?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20797:


Assignee: Apache Spark

> mllib lda's LocalLDAModel's save: out of memory. 
> -
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>Assignee: Apache Spark
>
> When I train an online LDA model with large text data (nearly 1 billion Chinese 
> news abstracts), the training step goes well, but the save step fails. 
> Something like the following happens (e.g. on 1.6.1):
> Problem 1: the serialized result is bigger than spark.kryoserializer.buffer.max 
> (raising that parameter fixes problem 1, but then leads to problem 2).
> Problem 2: the message exceeds spark.akka.frameSize (raising this parameter too 
> far fails with out of memory; on versions > 2.0.0 the error is "exceeds max 
> allowed: spark.rpc.message.maxSize").
> This problem appears when the number of topics is large (topic num k=200 is OK, 
> but k=300 fails) and the vocabulary size is large too (nearly 1,000,000).
> I found that word2vec's save function is similar to LocalLDAModel's save 
> function: word2vec's problem (using repartition(1) to save) was fixed in 
> [https://github.com/apache/spark/pull/9989], but LocalLDAModel still uses 
> repartition(1), i.e. a single partition, when saving.
> word2vec's save method, from the latest code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   
> spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> but the code in mllib.clustering.LDAModel's LocalLDAModel's save:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
> Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
> topicInd)
>   }
>   
> spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> Following word2vec's save (repartition(nPartitions)), I replaced numWords with 
> the topic count k, used repartition(nPartitions) in LocalLDAModel's save 
> method, recompiled the code, and deployed the new LDA job with large data on 
> our cluster; it works.
> I hope this will be fixed in the next version.
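
To make the proposed change concrete, here is a sketch of the modified save 
call using the word2vec-style partition estimate. The names (vocabSize, k, 
bufferSize, topics, Loader, path, spark) are taken from the excerpts quoted 
above, so this is a fragment of the save method rather than standalone code:

{code}
// Estimate the size of the topics matrix and spread it over several partitions,
// analogous to (4L * vectorSize + 15) * numWords in Word2Vec.save.
val approxSize = (4L * vocabSize + 15) * k
val nPartitions = ((approxSize / bufferSize) + 1).toInt  // cap each partition near the buffer limit
spark.createDataFrame(topics)
  .repartition(nPartitions)                              // previously repartition(1)
  .write.parquet(Loader.dataPath(path))
{code}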



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017127#comment-16017127
 ] 

Apache Spark commented on SPARK-20797:
--

User 'd0evi1' has created a pull request for this issue:
https://github.com/apache/spark/pull/18034

> mllib lda's LocalLDAModel's save: out of memory. 
> -
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> when i try online lda model with large text data(nearly 1 billion chinese 
> news' abstract), the training step went well, but the save step failed.  
> something like below happened (etc. 1.6.1):
> problem 1.bigger than spark.kryoserializer.buffer.max.  (turning bigger the 
> param can fix problem 1, but next will lead problem 2),
> problem 2. exceed spark.akka.frameSize. (turning this param too bigger will 
> fail for the reason out of memory,   kill it, version > 2.0.0, exceeds max 
> allowed: spark.rpc.message.maxSize).
> when topics  num is large(set topic num k=200 is ok, but set k=300 failed), 
> and vocab size is large(nearly 1000,000) too. this problem will appear.
> so i found word2vec's save function is similar to the LocalLDAModel's save 
> function :
> word2vec's problem (use repartition(1) to save) has been fixed 
> [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use:  
> repartition(1). use single partition when save.
> word2vec's  save method from latest code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   
> spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> but the code in mllib.clustering.LDAModel's LocalLDAModel's save:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
> Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
> topicInd)
>   }
>   
> spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> refer to word2vec's save (repartition(nPartitions)), i replace numWords to 
> topic K, repartition(nPartitions) in the LocalLDAModel's save method, 
> recompile the code, deploy the new lda's project with large data on our 
> machine cluster, it works.
> hopes it will fixed in the next version.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20797:


Assignee: (was: Apache Spark)

> mllib lda's LocalLDAModel's save: out of memory. 
> -
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> when i try online lda model with large text data(nearly 1 billion chinese 
> news' abstract), the training step went well, but the save step failed.  
> something like below happened (etc. 1.6.1):
> problem 1.bigger than spark.kryoserializer.buffer.max.  (turning bigger the 
> param can fix problem 1, but next will lead problem 2),
> problem 2. exceed spark.akka.frameSize. (turning this param too bigger will 
> fail for the reason out of memory,   kill it, version > 2.0.0, exceeds max 
> allowed: spark.rpc.message.maxSize).
> when topics  num is large(set topic num k=200 is ok, but set k=300 failed), 
> and vocab size is large(nearly 1000,000) too. this problem will appear.
> so i found word2vec's save function is similar to the LocalLDAModel's save 
> function :
> word2vec's problem (use repartition(1) to save) has been fixed 
> [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use:  
> repartition(1). use single partition when save.
> word2vec's  save method from latest code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   
> spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> but the code in mllib.clustering.LDAModel's LocalLDAModel's save:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
> Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
> topicInd)
>   }
>   
> spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> refer to word2vec's save (repartition(nPartitions)), i replace numWords to 
> topic K, repartition(nPartitions) in the LocalLDAModel's save method, 
> recompile the code, deploy the new lda's project with large data on our 
> machine cluster, it works.
> hopes it will fixed in the next version.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-05-19 Thread d0evi1 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017128#comment-16017128
 ] 

d0evi1 commented on SPARK-20797:


ok,  there is: https://github.com/apache/spark/pull/18034

> mllib lda's LocalLDAModel's save: out of memory. 
> -
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> When I train an online LDA model on large text data (nearly 1 billion 
> Chinese news abstracts), the training step goes well, but the save step 
> fails. Something like the following happens (e.g. on 1.6.1):
> problem 1: bigger than spark.kryoserializer.buffer.max (raising this 
> parameter works around problem 1, but it then leads to problem 2),
> problem 2: exceeds spark.akka.frameSize (raising this parameter too far 
> fails with out of memory and the job is killed; on versions > 2.0.0 the 
> error is "exceeds max allowed: spark.rpc.message.maxSize").
> The problem appears when the number of topics is large (k=200 is fine, but 
> k=300 fails) and the vocabulary is also large (nearly 1,000,000 terms).
> I found that Word2Vec's save function is similar to LocalLDAModel's save 
> function:
> Word2Vec had the same problem (saving with repartition(1)); it was fixed in 
> https://github.com/apache/spark/pull/9989, but LocalLDAModel still uses 
> repartition(1), i.e. a single partition, when saving.
> Word2Vec's save method in the latest code 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala):
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> but in LocalLDAModel's save method in mllib.clustering.LDAModel 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala)
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
>     Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd)
>   }
>   spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> Following Word2Vec's save (repartition(nPartitions)), I replaced numWords 
> with the topic count k and used repartition(nPartitions) in LocalLDAModel's 
> save method, recompiled, and deployed the patched LDA with the large dataset 
> on our cluster; it works.
> I hope this will be fixed in the next release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20807) Add compression/decompression of data to ColumnVector

2017-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017154#comment-16017154
 ] 

Apache Spark commented on SPARK-20807:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/18033

> Add compression/decompression of data to ColumnVector
> -
>
> Key: SPARK-20807
> URL: https://issues.apache.org/jira/browse/SPARK-20807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> While the current {{CachedBatch}} can compress data by using one of multiple 
> compression schemes, {{ColumnVector}} cannot compress data. This capability is 
> mandatory for the table cache.
> This JIRA adds compression/decompression to {{ColumnVector}}.
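
As a rough illustration of the idea only: none of the types below exist in Spark, and this is not the API proposed in the PR. It simply shows the "compress when caching, decompress when reading" shape for a column's backing bytes, with a toy run-length codec standing in for the real compression schemes.

{code}
// Hypothetical sketch only; not Spark's ColumnVector API.
trait ColumnCodec {
  def compress(raw: Array[Byte]): Array[Byte]
  def decompress(compressed: Array[Byte], originalLength: Int): Array[Byte]
}

object RunLengthCodec extends ColumnCodec {
  def compress(raw: Array[Byte]): Array[Byte] = {
    val out = scala.collection.mutable.ArrayBuffer[Byte]()
    var i = 0
    while (i < raw.length) {
      var run = 1
      while (i + run < raw.length && raw(i + run) == raw(i) && run < 255) run += 1
      out += run.toByte   // run length (1..255)
      out += raw(i)       // the repeated value
      i += run
    }
    out.toArray
  }

  def decompress(compressed: Array[Byte], originalLength: Int): Array[Byte] = {
    val out = new Array[Byte](originalLength)
    var i = 0
    var pos = 0
    while (i < compressed.length) {
      val run = compressed(i) & 0xFF
      java.util.Arrays.fill(out, pos, pos + run, compressed(i + 1))
      pos += run
      i += 2
    }
    out
  }
}

// A cached column could then hold compressed bytes and decompress on access:
val raw = Array.fill[Byte](1024)(0.toByte) ++ Array.fill[Byte](16)(7.toByte)
val packed = RunLengthCodec.compress(raw)
assert(RunLengthCodec.decompress(packed, raw.length).sameElements(raw))
{code}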



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20807) Add compression/decompression of data to ColumnVector

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20807:


Assignee: (was: Apache Spark)

> Add compression/decompression of data to ColumnVector
> -
>
> Key: SPARK-20807
> URL: https://issues.apache.org/jira/browse/SPARK-20807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> While the current {{CachedBatch}} can compress data by using one of multiple 
> compression schemes, {{ColumnVector}} cannot compress data. This capability is 
> mandatory for the table cache.
> This JIRA adds compression/decompression to {{ColumnVector}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20807) Add compression/decompression of data to ColumnVector

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20807:


Assignee: Apache Spark

> Add compression/decompression of data to ColumnVector
> -
>
> Key: SPARK-20807
> URL: https://issues.apache.org/jira/browse/SPARK-20807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> While the current {{CachedBatch}} can compress data by using one of multiple 
> compression schemes, {{ColumnVector}} cannot compress data. This capability is 
> mandatory for the table cache.
> This JIRA adds compression/decompression to {{ColumnVector}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward

2017-05-19 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017192#comment-16017192
 ] 

Cristian Opris commented on SPARK-16365:


There's another potential argument for exposing 'local' (non-distributed) 
implementations of the algorithms: sometimes it's useful to apply the algorithm 
on relatively small groupings of data in a very large dataset. In this case 
Spark would only serve to distribute the data and apply the algorithm locally 
on each partition/grouping of data, perhaps through a UDF.

This may currently be achieved with the scikit-learn integration, but it would 
be worth considering making it possible to use the Spark implementation of the 
algorithm, where that algorithm is not an inherently distributed 
implementation.
CountVectorizer is a good example; nothing in it inherently requires a 
DataFrame.

In practice this should mostly imply just exposing the core implementation of 
the algorithms where possible.
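
As an illustration of that pattern, a small sketch in which Spark only distributes the groups and a purely local routine runs on each one; localCountVectorize is a hypothetical stand-in for an exposed non-distributed implementation, not an existing API.

{code}
// Sketch of the "local algorithm per group" pattern.
import org.apache.spark.sql.SparkSession

case class Doc(group: String, tokens: Seq[String])

// A local algorithm applied to one small group of documents: simple term counts.
def localCountVectorize(docs: Seq[Seq[String]]): Seq[(String, Long)] =
  docs.flatten.groupBy(identity).map { case (term, occ) => term -> occ.size.toLong }.toSeq

val spark = SparkSession.builder().appName("local-algos-sketch").getOrCreate()
import spark.implicits._

val docs = Seq(
  Doc("a", Seq("spark", "mllib")),
  Doc("a", Seq("spark", "local")),
  Doc("b", Seq("vector", "count"))).toDS()

// Spark only distributes and groups the data; the algorithm itself runs locally per group.
val perGroupVocab = docs
  .groupByKey(_.group)
  .mapGroups((group, rows) => (group, localCountVectorize(rows.map(_.tokens).toSeq)))

perGroupVocab.show(truncate = false)
{code}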

> Ideas for moving "mllib-local" forward
> --
>
> Key: SPARK-16365
> URL: https://issues.apache.org/jira/browse/SPARK-16365
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next 
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's 
> linear algebra", or "investigate how we will implement local models/pipelines 
> in Spark", etc.
> This ticket is for comments, ideas, brainstormings and PoCs. The separation 
> of linalg into a standalone project turned out to be significantly more 
> complex than originally expected. So I vote we devote sufficient discussion 
> and time to planning out the next move :)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-20810:
---

 Summary: ML LinearSVC vs MLlib SVMWithSGD output different solution
 Key: SPARK-20810
 URL: https://issues.apache.org/jira/browse/SPARK-20810
 Project: Spark
  Issue Type: Question
  Components: ML, MLlib
Affects Versions: 2.2.0
Reporter: Yanbo Liang


Fitting an SVM classification model on the same dataset, ML {{LinearSVC}} 
produces a different solution than MLlib {{SVMWithSGD}}. I understand they use 
different optimization solvers (OWLQN vs SGD), but does it make sense for them 
to converge to different solutions?
AFAIK, both of them use hinge loss, which is convex but not differentiable. 
Since the derivative of the hinge loss is not uniquely defined at the kink 
(margin = 1), should we switch to squared hinge loss?
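
For reference, a small standalone sketch (not from the Spark code base) of the two losses being discussed. With margin m = y * (w.x + b), the hinge loss has a kink at m = 1 where any value in [-1, 0] is a valid subgradient, while the squared hinge loss is differentiable everywhere.

{code}
// Illustrative only: hinge vs squared hinge loss for one example,
// written in terms of the margin m = y * (w.x + b) with label y in {-1, +1}.
def hingeLoss(margin: Double): Double = math.max(0.0, 1.0 - margin)

def squaredHingeLoss(margin: Double): Double = {
  val h = math.max(0.0, 1.0 - margin)
  h * h
}

// d/dm of each loss. At margin == 1 the hinge loss is not differentiable, so a
// subgradient (any value in [-1, 0]; here 0.0) has to be chosen by convention.
def hingeGrad(margin: Double): Double = if (margin < 1.0) -1.0 else 0.0
def squaredHingeGrad(margin: Double): Double = -2.0 * math.max(0.0, 1.0 - margin)

Seq(0.5, 1.0, 1.5).foreach { m =>
  println(f"margin=$m%.1f hinge=${hingeLoss(m)}%.2f squaredHinge=${squaredHingeLoss(m)}%.2f")
}
{code}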



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-20810:

Description: 
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution?
AFAIK, both of them use Hinge loss which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use squared hinge loss which is the 
default loss function of {{sklearn.svm.LinearSVC}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

  was:
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution?
AFAIK, both of them use Hinge loss which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use squared hinge loss? 


> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution?
> AFAIK, both of them use Hinge loss which is convex but not differentiable 
> function. Since the derivative of the hinge loss at certain place is 
> non-deterministic, should we switch to use squared hinge loss which is the 
> default loss function of {{sklearn.svm.LinearSVC}}?
> This issue is very easy to reproduce, you can paste the following code 
> snippet to {{LinearSVCSuite}} and then click run in Intellij IDE.
> {code}
> test("LinearSVC vs SVMWithSGD") {
> import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
> import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
> val trainer1 = new LinearSVC()
>   .setRegParam(0.2)
>   .setMaxIter(200)
>   .setTol(1e-4)
> val model1 = trainer1.fit(binaryDataset)
> println(model1.coefficients)
> println(model1.intercept)
> val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
> Vector) =>
> OldLabeledPoint(label, OldVectors.fromML(features))
> }
> val trainer2 = new SVMWithSGD().setIntercept(true)
> 
> trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)
> val model2 = trainer2.run(oldData)
> println(model2.weights)
> println(model2.intercept)
>   }
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-20810:

Description: 
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution?
AFAIK, both of them use Hinge loss which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use squared hinge loss which is the 
default loss function of {{sklearn.svm.LinearSVC}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
0.667790514894194
{code}

  was:
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution?
AFAIK, both of them use Hinge loss which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use squared hinge loss which is the 
default loss function of {{sklearn.svm.LinearSVC}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 


> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution?
> AFAIK, both of them use Hinge loss which is convex but not differentiable 
> function. Since the derivative of the hinge loss at certain place is 
> non-deterministic, should we switch to use squared hinge loss which is the 
> default loss function of {{sklearn.svm.LinearSVC}}?
> This issue is very easy to reproduce, you can paste the following code 
> snippet to {{LinearSVCSuite}} and then click run in Intellij IDE.
> {code}
> test("LinearSVC vs SVMWithSGD") {
> import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
> import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
> val trainer1 = new LinearSVC()
>   .setRegParam(0.2)
>   .setMaxIter(200)
>   .setTol(1e-4)
> val model1 = trainer1.fit(binaryDataset)
> println(model1.coefficients)
> println(model1.intercept)
> val oldData = binaryDataset.rdd.map { case Row(label: Double, feat

[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-20810:

Description: 
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution?
AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use {{squared hinge loss}} which is the 
default loss function of {{sklearn.svm.LinearSVC}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
0.667790514894194
{code}

  was:
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution?
AFAIK, both of them use Hinge loss which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use squared hinge loss which is the 
default loss function of {{sklearn.svm.LinearSVC}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
0.667790514894194
{code}


> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution?
> AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
> function. Since the derivative of the hinge loss at certain place is 
> non-deterministic, should we switch to use {{squared hinge loss}} which is 
> the default loss function of {{sklearn.svm.LinearSVC}}?
> This issue is very easy to reproduce, you can paste the following code 
> snippet to {{LinearSVCSuite}} and then click run in Intellij IDE.
> {code}
> test("LinearSVC vs SVMWithSGD") {
> import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
> import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
> val trainer1 = new LinearSVC()
>   .setRegParam(0.2)

[jira] [Commented] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017217#comment-16017217
 ] 

Yanbo Liang commented on SPARK-20810:
-

cc [~josephkb] [~yuhaoyan]

> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution?
> AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
> function. Since the derivative of the hinge loss at certain place is 
> non-deterministic, should we switch to use {{squared hinge loss}} which is 
> the default loss function of {{sklearn.svm.LinearSVC}}?
> This issue is very easy to reproduce, you can paste the following code 
> snippet to {{LinearSVCSuite}} and then click run in Intellij IDE.
> {code}
> test("LinearSVC vs SVMWithSGD") {
> import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
> import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
> val trainer1 = new LinearSVC()
>   .setRegParam(0.2)
>   .setMaxIter(200)
>   .setTol(1e-4)
> val model1 = trainer1.fit(binaryDataset)
> println(model1.coefficients)
> println(model1.intercept)
> val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
> Vector) =>
> OldLabeledPoint(label, OldVectors.fromML(features))
> }
> val trainer2 = new SVMWithSGD().setIntercept(true)
> 
> trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)
> val model2 = trainer2.run(oldData)
> println(model2.weights)
> println(model2.intercept)
>   }
> {code} 
> The output is:
> {code}
> [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
> 7.373454363024084
> [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
> 0.667790514894194
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-20810:

Description: 
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} 
produce wrong solution. Does it also like this?
AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use {{squared hinge loss}} which is the 
default loss function of {{sklearn.svm.LinearSVC}} and more robust than {{hinge 
loss}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
0.667790514894194
{code}

  was:
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} 
produce wrong solution. Does it also like this?
AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use {{squared hinge loss}} which is the 
default loss function of {{sklearn.svm.LinearSVC}} and more robust then {{hinge 
loss}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
0.667790514894194
{code}


> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
> e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like 
> {{SVMWithSGD}} produce wrong solution. Does it also like this?
> AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
> 

[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-20810:

Description: 
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} 
produce wrong solution. Does it also like this?
AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use {{squared hinge loss}} which is the 
default loss function of {{sklearn.svm.LinearSVC}} and more robust then {{hinge 
loss}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
0.667790514894194
{code}

  was:
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution?
AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use {{squared hinge loss}} which is the 
default loss function of {{sklearn.svm.LinearSVC}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
0.667790514894194
{code}


> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
> e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like 
> {{SVMWithSGD}} produce wrong solution. Does it also like this?
> AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
> function. Since the derivative of the hinge loss at certain place is 
> non-deterministic, should we switch to use {{squared hinge loss}} which is 
> the default loss function of {{sklearn.svm.LinearSVC}} and mo

[jira] [Updated] (SPARK-20808) External Table unnecessarily not created in Hive-compatible way

2017-05-19 Thread Joachim Hereth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joachim Hereth updated SPARK-20808:
---
Summary: External Table unnecessarily not created in Hive-compatible way  
(was: External Table unnecessarily not create in Hive-compatible way)

> External Table unnecessarily not created in Hive-compatible way
> ---
>
> Key: SPARK-20808
> URL: https://issues.apache.org/jira/browse/SPARK-20808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Joachim Hereth
>Priority: Minor
>
> In Spark 2.1.0 and 2.1.1 {{spark.catalog.createExternalTable}} unnecessarily 
> creates tables in a Hive-incompatible way.
> For instance executing in a spark shell
> {code}
> val database = "default"
> val table = "table_name"
> val path = "/user/daki/"  + database + "/" + table
> var data = Array(("Alice", 23), ("Laura", 33), ("Peter", 54))
> val df = sc.parallelize(data).toDF("name","age") 
> df.write.mode(org.apache.spark.sql.SaveMode.Overwrite).parquet(path)
> spark.sql("DROP TABLE IF EXISTS " + database + "." + table)
> spark.catalog.createExternalTable(database + "."+ table, path)
> {code}
> issues the warning
> {code}
> Search Subject for Kerberos V5 INIT cred (<>, 
> sun.security.jgss.krb5.Krb5InitCredential)
> 17/05/19 11:01:17 WARN hive.HiveExternalCatalog: Could not persist 
> `default`.`table_name` in a Hive compatible way. Persisting it into Hive 
> metastore in Spark SQL specific format.
> org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User 
> daki does not have privileges for CREATETABLE)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
> ...
> {code}
> The Exception (user does not have privileges for CREATETABLE) is misleading 
> (I do have the CREATE TABLE privilege).
> Querying the table with Hive does not return any result. With Spark one can 
> access the data.
> The following code creates the table correctly (workaround):
> {code}
> def sqlStatement(df : org.apache.spark.sql.DataFrame, database : String, 
> table: String, path: String) : String = {
>   val rows = (for(col <- df.schema) 
> yield "`" + col.name + "` " + 
> col.dataType.simpleString).mkString(",\n")
>   val sqlStmnt = ("CREATE EXTERNAL TABLE `%s`.`%s` (%s) " +
> "STORED AS PARQUET " +
> "Location 'hdfs://nameservice1%s'").format(database, table, rows, path)
>   return sqlStmnt
> }
> spark.sql("DROP TABLE IF EXISTS " + database + "." + table)
> spark.sql(sqlStatement(df, database, table, path))
> {code}
> The code is executed via YARN against a Cloudera CDH 5.7.5 cluster with 
> Sentry enabled (in case this matters regarding the privilege warning). Spark 
> was built against the CDH libraries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3

2017-05-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017258#comment-16017258
 ] 

Steve Loughran commented on SPARK-20799:


bq. Spark does output the S3xLoginHelper:90 - The Filesystem URI contains login 
details. This is insecure and may be unsupported in future., but this should 
not mean that it shouldn't work anymore.

It probably will stop working at some point in the future as putting secrets in 
the URIs is too dangerous: everything logs them assuming they aren't sensitive 
data. The {{S3xLoginHelper}} not only warns you, it makes a best-effort attempt 
to strip out the secrets from the public URI, hence the logs and the messages 
telling you off.

Prior to Hadoop 2.8, the sole *defensible* use case for secrets in URIs was that 
it was the only way to have different logins on different buckets. In Hadoop 2.8 
we added the ability to configure any of the fs.s3a. options on a per-bucket 
basis, including the secret logins, endpoints, and other important values.

I see what may be happening, in which case it probably constitutes a Hadoop 
regression: if the filesystem's URI is converted to a string it will have these 
secrets stripped, so if something goes path -> URI -> String -> path the secrets 
will be lost.

If you are seeing this stack trace, it means you are using Hadoop 2.8 or 
something else with the HADOOP-3733 patch in it. What version of Hadoop (or 
HDP, CDH...) are you using? If it is based on the full Apache 2.8 release, you 
get:

# per-bucket config to allow you to [configure each bucket 
separately|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]
# the ability to use JCEKS files to keep the secrets out the configs
# session token support.

Accordingly, if you state the version, I may be able to look at what's happening 
in a bit more detail.
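
To make the per-bucket option concrete, a configuration sketch; bucket names, credential values and the endpoint below are placeholders, and the settings are passed to Hadoop through Spark's spark.hadoop.* prefix.

{code}
// Sketch: per-bucket S3A settings (Hadoop 2.8+) set through the SparkSession config.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("per-bucket-s3a-config")
  // Credentials that apply only to the bucket "my-private-bucket".
  .config("spark.hadoop.fs.s3a.bucket.my-private-bucket.access.key", sys.env("PRIVATE_ACCESS_KEY"))
  .config("spark.hadoop.fs.s3a.bucket.my-private-bucket.secret.key", sys.env("PRIVATE_SECRET_KEY"))
  // A different endpoint for another bucket, without touching the global settings.
  .config("spark.hadoop.fs.s3a.bucket.other-bucket.endpoint", "s3.eu-central-1.amazonaws.com")
  .getOrCreate()

// No secrets in the URI itself.
val df = spark.read.orc("s3a://my-private-bucket/tables/events/")
{code}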


> Unable to infer schema for ORC on reading ORC from S3
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jork Zijlstra
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does output the warning "S3xLoginHelper:90 - The Filesystem URI contains 
> login details. This is insecure and may be unsupported in future.", but this 
> should not mean that it stops working.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}
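
Spelled out, the workaround might look like the following minimal sketch; the two config keys come from the description above, while the bucket path, the environment-variable credentials and the ORC read are assumptions.

{code}
// Sketch of the workaround end to end: credentials go into the session config,
// not into the path.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-on-s3-workaround")
  .config("spark.hadoop.fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// With no credentials embedded in the URI, the PartitioningAwareFileIndex keys and
// the qualified paths line up again, so the ORC schema can be inferred.
val df = spark.read.orc("s3n://my-bucket/path/to/orc/")
df.printSchema()
{code}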



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter

2017-05-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20798.
---
   Resolution: Fixed
 Assignee: Ala Luszczak
Fix Version/s: 2.2.0
   2.1.2

> GenerateUnsafeProjection should check if value is null before calling the 
> getter
> 
>
> Key: SPARK-20798
> URL: https://issues.apache.org/jira/browse/SPARK-20798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ala Luszczak
>Assignee: Ala Luszczak
> Fix For: 2.1.2, 2.2.0
>
>
> GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption 
> that one should first make sure the value is not null before calling the 
> getter. This can lead to errors.
> An example of generated code:
> {noformat}
> /* 059 */ final UTF8String fieldName = value.getUTF8String(0);
> /* 060 */ if (value.isNullAt(0)) {
> /* 061 */   rowWriter1.setNullAt(0);
> /* 062 */ } else {
> /* 063 */   rowWriter1.write(0, fieldName);
> /* 064 */ }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017273#comment-16017273
 ] 

Sean Owen commented on SPARK-20810:
---

Are you pretty sure both are converged?
You set the same params but do they have the same meaning in both 
implementations?
I wonder if you can double-check the loss that both are computing to see if 
they even agree about how good a solution the other has found.
I doubt the discontinuity of the hinge loss matters as it only affects the 
gradient when the loss is exactly 0, and defining the derivative as 0 or 1 is 
valid and doesn't matter much, or shouldn't.
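
One way to do that check, as a rough sketch rather than anything from the test suite: evaluate the same regularized hinge objective on binaryDataset for both fitted models. The helper below assumes the variables from the snippet in the description and one particular regularization convention (0.5 * regParam * ||w||^2), which may not match either implementation exactly.

{code}
// Sketch only: compare the two solutions under one explicit objective.
// Assumes binaryDataset, model1 (ml LinearSVC) and model2 (mllib SVMWithSGD)
// from the snippet in the description are in scope.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

def hingeObjective(weights: Array[Double], intercept: Double, regParam: Double): Double = {
  val data = binaryDataset.select("label", "features").rdd.map {
    case Row(label: Double, features: Vector) =>
      (if (label <= 0.0) -1.0 else 1.0, features.toArray)
  }
  val n = data.count()
  val avgHinge = data.map { case (y, x) =>
    val margin = y * (x.zip(weights).map { case (xi, wi) => xi * wi }.sum + intercept)
    math.max(0.0, 1.0 - margin)
  }.sum() / n
  // One common convention; the two trainers may scale or apply regularization differently.
  avgHinge + 0.5 * regParam * weights.map(w => w * w).sum
}

println(hingeObjective(model1.coefficients.toArray, model1.intercept, 0.2))
println(hingeObjective(model2.weights.toArray, model2.intercept, 0.2))
{code}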

> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
> e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like 
> {{SVMWithSGD}} produce wrong solution. Does it also like this?
> AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
> function. Since the derivative of the hinge loss at certain place is 
> non-deterministic, should we switch to use {{squared hinge loss}} which is 
> the default loss function of {{sklearn.svm.LinearSVC}} and more robust than 
> {{hinge loss}}?
> This issue is very easy to reproduce, you can paste the following code 
> snippet to {{LinearSVCSuite}} and then click run in Intellij IDE.
> {code}
> test("LinearSVC vs SVMWithSGD") {
> import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
> import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
> val trainer1 = new LinearSVC()
>   .setRegParam(0.2)
>   .setMaxIter(200)
>   .setTol(1e-4)
> val model1 = trainer1.fit(binaryDataset)
> println(model1.coefficients)
> println(model1.intercept)
> val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
> Vector) =>
> OldLabeledPoint(label, OldVectors.fromML(features))
> }
> val trainer2 = new SVMWithSGD().setIntercept(true)
> 
> trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)
> val model2 = trainer2.run(oldData)
> println(model2.weights)
> println(model2.intercept)
>   }
> {code} 
> The output is:
> {code}
> [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
> 7.373454363024084
> [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
> 0.667790514894194
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3

2017-05-19 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017278#comment-16017278
 ] 

Jork Zijlstra commented on SPARK-20799:
---

Hi Steve, 

Thanks for the quick response. We indeed no longer need the credentials to be 
in the path.

I forgot to mention the version we are running: we are using Spark 2.1.1 with 
Hadoop 2.8.0.
Do you need any other information?

Regards, Jork

> Unable to infer schema for ORC on reading ORC from S3
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jork Zijlstra
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does output the warning "S3xLoginHelper:90 - The Filesystem URI contains 
> login details. This is insecure and may be unsupported in future.", but this 
> should not mean that it stops working.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-20810:

Description: 
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} 
produce wrong solution. Does it also like this?
AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use {{squared hinge loss}} which is the 
default loss function of {{sklearn.svm.LinearSVC}} and more robust than {{hinge 
loss}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(2000).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.9257083966837497,1.8567843250728242,2.7381537413979595,3.7434319370941265]
0.9656577947867953
{code}

  was:
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} 
produce wrong solution. Does it also like this?
AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use {{squared hinge loss}} which is the 
default loss function of {{sklearn.svm.LinearSVC}} and more robust than {{hinge 
loss}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(2000).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.9257083966837497,1.8567843250728242,2.7381537413979595,3.7434319370941265]
0.9656577947867953
{code}


> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
> e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like 
> {{SVMWithSGD}} produce wrong solution. Does it also like this?
> AFAIK, both o

[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-20810:

Description: 
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} 
produce wrong solution. Does it also like this?
AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use {{squared hinge loss}} which is the 
default loss function of {{sklearn.svm.LinearSVC}} and more robust than {{hinge 
loss}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(2000).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.9257083966837497,1.8567843250728242,2.7381537413979595,3.7434319370941265]
0.9656577947867953
{code}

  was:
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} 
produce wrong solution. Does it also like this?
AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use {{squared hinge loss}} which is the 
default loss function of {{sklearn.svm.LinearSVC}} and more robust than {{hinge 
loss}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

val trainer1 = new LinearSVC()
  .setRegParam(0.2)
  .setMaxIter(200)
  .setTol(1e-4)
val model1 = trainer1.fit(binaryDataset)

println(model1.coefficients)
println(model1.intercept)

val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
Vector) =>
OldLabeledPoint(label, OldVectors.fromML(features))
}
val trainer2 = new SVMWithSGD().setIntercept(true)

trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4)

val model2 = trainer2.run(oldData)

println(model2.weights)
println(model2.intercept)
  }
{code} 

The output is:
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
0.667790514894194
{code}


> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
> e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like 
> {{SVMWithSGD}} produce wrong solution. Does it also like this?
> AFAIK, both of them use {{hinge loss}} which is convex but not differentiable

[jira] [Commented] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017300#comment-16017300
 ] 

Yanbo Liang commented on SPARK-20810:
-

[~srowen] Thanks for your comments. I'm sure both have converged: ML LinearSVC 
converged after 143 epochs, and MLlib SVMWithSGD converged after 1794 epochs. It 
seems we should spend some effort investigating the correctness of the old MLlib 
implementation.

> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
> e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like 
> {{SVMWithSGD}} produce wrong solution. Does it also like this?
> AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
> function. Since the derivative of the hinge loss at certain place is 
> non-deterministic, should we switch to use {{squared hinge loss}} which is 
> the default loss function of {{sklearn.svm.LinearSVC}} and more robust than 
> {{hinge loss}}?
> This issue is very easy to reproduce, you can paste the following code 
> snippet to {{LinearSVCSuite}} and then click run in Intellij IDE.
> {code}
> test("LinearSVC vs SVMWithSGD") {
> import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
> import org.apache.spark.mllib.classification.SVMWithSGD
> import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
> val trainer1 = new LinearSVC()
>   .setRegParam(0.2)
>   .setMaxIter(200)
>   .setTol(1e-4)
> val model1 = trainer1.fit(binaryDataset)
> println(model1.coefficients)
> println(model1.intercept)
> val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
> Vector) =>
> OldLabeledPoint(label, OldVectors.fromML(features))
> }
> val trainer2 = new SVMWithSGD().setIntercept(true)
> 
> trainer2.optimizer.setRegParam(0.2).setNumIterations(2000).setConvergenceTol(1e-4)
> val model2 = trainer2.run(oldData)
> println(model2.weights)
> println(model2.intercept)
>   }
> {code} 
> The output is:
> {code}
> [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
> 7.373454363024084
> [0.9257083966837497,1.8567843250728242,2.7381537413979595,3.7434319370941265]
> 0.9656577947867953
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

2017-05-19 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017300#comment-16017300
 ] 

Yanbo Liang edited comment on SPARK-20810 at 5/19/17 12:02 PM:
---

[~srowen] Thanks for your comments. I'm sure both have converged: ML LinearSVC 
converged after 143 epochs, and MLlib SVMWithSGD converged after 1794 epochs. It 
seems we should spend some effort investigating the correctness of the old MLlib 
implementation. Or there may be some implementation differences in the details; 
I'll take a closer look.


was (Author: yanboliang):
[~srowen] Thanks for your comments. I'm sure both are converged. ML LinearSVC 
converged after 143 epoch, and MLlib SVMWithSGD converged after 1794 epoch. It 
seems that we should pay some efforts to investigate the correctness of old 
MLlib implementation.

> ML LinearSVC vs MLlib SVMWithSGD output different solution
> --
>
> Key: SPARK-20810
> URL: https://issues.apache.org/jira/browse/SPARK-20810
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
> produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
> they use different optimization solver (OWLQN vs SGD), does it make sense to 
> converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R 
> e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like 
> {{SVMWithSGD}} produce wrong solution. Does it also like this?
> AFAIK, both of them use {{hinge loss}} which is convex but not differentiable 
> function. Since the derivative of the hinge loss at certain place is 
> non-deterministic, should we switch to use {{squared hinge loss}} which is 
> the default loss function of {{sklearn.svm.LinearSVC}} and more robust than 
> {{hinge loss}}?
> This issue is very easy to reproduce, you can paste the following code 
> snippet to {{LinearSVCSuite}} and then click run in Intellij IDE.
> {code}
> test("LinearSVC vs SVMWithSGD") {
> import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
> import org.apache.spark.mllib.classification.SVMWithSGD
> import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
> val trainer1 = new LinearSVC()
>   .setRegParam(0.2)
>   .setMaxIter(200)
>   .setTol(1e-4)
> val model1 = trainer1.fit(binaryDataset)
> println(model1.coefficients)
> println(model1.intercept)
> val oldData = binaryDataset.rdd.map { case Row(label: Double, features: 
> Vector) =>
> OldLabeledPoint(label, OldVectors.fromML(features))
> }
> val trainer2 = new SVMWithSGD().setIntercept(true)
> 
> trainer2.optimizer.setRegParam(0.2).setNumIterations(2000).setConvergenceTol(1e-4)
> val model2 = trainer2.run(oldData)
> println(model2.weights)
> println(model2.intercept)
>   }
> {code} 
> The output is:
> {code}
> [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
> 7.373454363024084
> [0.9257083966837497,1.8567843250728242,2.7381537413979595,3.7434319370941265]
> 0.9656577947867953
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields

2017-05-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20773.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.2

> ParquetWriteSupport.writeFields is quadratic in number of fields
> 
>
> Key: SPARK-20773
> URL: https://issues.apache.org/jira/browse/SPARK-20773
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: T Poterba
>Priority: Minor
>  Labels: easyfix, performance
> Fix For: 2.1.2, 2.2.0
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all 
> elements. Since the fieldWriters object is a List, this is a quadratic 
> operation.
> See line 123: 
> https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala
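
To illustrate the issue, here is a minimal sketch (hypothetical method names, not 
the actual ParquetWriteSupport code): indexing into a Scala List inside the write 
loop makes each lookup O(i), so writing n fields costs O(n^2), while an indexed 
collection keeps it linear.
{code}
// Hypothetical sketch of the quadratic pattern described above.
def writeFieldsQuadratic(values: IndexedSeq[Any], fieldWriters: List[Any => Unit]): Unit = {
  var i = 0
  while (i < values.length) {
    fieldWriters(i)(values(i)) // List.apply(i) walks i cons cells, O(n^2) overall
    i += 1
  }
}

// One possible linear variant: materialize the writers as an IndexedSeq once.
def writeFieldsLinear(values: IndexedSeq[Any], fieldWriters: List[Any => Unit]): Unit = {
  val writers = fieldWriters.toIndexedSeq
  var i = 0
  while (i < values.length) {
    writers(i)(values(i)) // O(1) lookup per field
    i += 1
  }
}
{code}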



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields

2017-05-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-20773:
-

Assignee: T Poterba

> ParquetWriteSupport.writeFields is quadratic in number of fields
> 
>
> Key: SPARK-20773
> URL: https://issues.apache.org/jira/browse/SPARK-20773
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: T Poterba
>Assignee: T Poterba
>Priority: Minor
>  Labels: easyfix, performance
> Fix For: 2.1.2, 2.2.0
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all 
> elements. Since the fieldWriters object is a List, this is a quadratic 
> operation.
> See line 123: 
> https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly

2017-05-19 Thread Artur Sukhenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artur Sukhenko updated SPARK-17922:
---
Affects Version/s: 2.0.1

> ClassCastException java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator 
> cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection 
> -
>
> Key: SPARK-17922
> URL: https://issues.apache.org/jira/browse/SPARK-17922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: kanika dhuria
> Attachments: spark_17922.tar.gz
>
>
> I am using Spark 2.0.
> I am seeing a class loading issue because whole-stage code generation creates 
> multiple classes with the same name, 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass".
> I am using a DataFrame transform, and within the transform I use OSGi.
> OSGi replaces the thread context class loader with ContextFinder, which looks at 
> all the class loaders on the stack to find the newly generated class. It picks 
> the byte class loader of GeneratedClass with inner class GeneratedIterator 
> (instead of falling back to the byte class loader created by the Janino 
> compiler). Since the class name is the same, that byte class loader loads the 
> class and returns GeneratedClass$GeneratedIterator instead of the expected 
> GeneratedClass$UnsafeProjection.
> Can we generate different classes with different names, or is it expected to 
> generate only one class?
> This is the somewhat I am trying to do 
> {noformat} 
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import com.databricks.spark.avro._
>   def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = {
> //Initialize osgi
>  (rows:Iterator[Row]) => {
>  var outi = Iterator[Row]() 
>  while(rows.hasNext) {
>  val r = rows.next 
>  outi = outi.++(Iterator(Row(r.get(0))))
>  } 
>  //val ors = Row("abc")   
>  //outi =outi.++( Iterator(ors))  
>  outi
>  }
>   }
> def transform1( outType:StructType) :((DataFrame) => DataFrame) = {
>  (d:DataFrame) => {
>   val inType = d.schema
>   val rdd = d.rdd.mapPartitions(exePart(outType))
>   d.sqlContext.createDataFrame(rdd, outType)
> }
>
>   }
> val df = spark.read.avro("file:///data/builds/a1.avro")
> val df1 = df.select($"id2").filter(false)
> val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, 
> true)::Nil))).createOrReplaceTempView("tbl0")
> spark.sql("insert overwrite table testtable select p1 from tbl0")
> {noformat} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20607) Add new unit tests to ShuffleSuite

2017-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20607:
-

Assignee: caoxuewen
Priority: Trivial  (was: Minor)

> Add new unit tests to ShuffleSuite
> --
>
> Key: SPARK-20607
> URL: https://issues.apache.org/jira/browse/SPARK-20607
> Project: Spark
>  Issue Type: Test
>  Components: Shuffle, Tests
>Affects Versions: 2.1.2
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Trivial
> Fix For: 2.3.0
>
>
> 1. Adds new unit tests verifying that when there is no shuffle stage, shuffle 
> does not generate the data file and the index file.
> 2. Modifies the '[SPARK-4085] rerun map stage if reduce stage cannot find its 
> local shuffle file' unit test: the parallelism is 1 rather than 2, and the index 
> file is checked and deleted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20607) Add new unit tests to ShuffleSuite

2017-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20607.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17868
[https://github.com/apache/spark/pull/17868]

> Add new unit tests to ShuffleSuite
> --
>
> Key: SPARK-20607
> URL: https://issues.apache.org/jira/browse/SPARK-20607
> Project: Spark
>  Issue Type: Test
>  Components: Shuffle, Tests
>Affects Versions: 2.1.2
>Reporter: caoxuewen
>Priority: Minor
> Fix For: 2.3.0
>
>
> 1. Adds new unit tests verifying that when there is no shuffle stage, shuffle 
> does not generate the data file and the index file.
> 2. Modifies the '[SPARK-4085] rerun map stage if reduce stage cannot find its 
> local shuffle file' unit test: the parallelism is 1 rather than 2, and the index 
> file is checked and deleted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20759) SCALA_VERSION in _config.yml,LICENSE and Dockerfile should be consistent with pom.xml

2017-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20759.
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 17992
[https://github.com/apache/spark/pull/17992]

> SCALA_VERSION in _config.yml,LICENSE and Dockerfile should be consistent with 
> pom.xml
> -
>
> Key: SPARK-20759
> URL: https://issues.apache.org/jira/browse/SPARK-20759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Priority: Minor
> Fix For: 2.2.0, 2.1.2
>
>
> SCALA_VERSION in _config.yml, LICENSE, and Dockerfile is 2.11.7, but it is 2.11.8 
> in pom.xml. So I think SCALA_VERSION in _config.yml should be consistent with 
> pom.xml.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20759) SCALA_VERSION in _config.yml,LICENSE and Dockerfile should be consistent with pom.xml

2017-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20759:
-

Assignee: liuzhaokun
Priority: Trivial  (was: Minor)

> SCALA_VERSION in _config.yml,LICENSE and Dockerfile should be consistent with 
> pom.xml
> -
>
> Key: SPARK-20759
> URL: https://issues.apache.org/jira/browse/SPARK-20759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Assignee: liuzhaokun
>Priority: Trivial
> Fix For: 2.1.2, 2.2.0
>
>
> SCALA_VERSION in _config.yml, LICENSE, and Dockerfile is 2.11.7, but it is 2.11.8 
> in pom.xml. So I think SCALA_VERSION in _config.yml should be consistent with 
> pom.xml.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-05-19 Thread Mathieu D (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017528#comment-16017528
 ] 

Mathieu D commented on SPARK-18838:
---

I'm not very familiar with this part of Spark, but I'd like to share a thought.
In my experience (SPARK-18881), when events start to be dropped because of full 
event queues, it's much more serious than just a failed job: the Spark driver 
became useless and I had to relaunch it.
So, besides improving the existing bus, listeners and threads, wouldn't a kind of 
back-pressure mechanism (on task emission) be better than dropping events? I mean, 
this would obviously degrade job performance, but it's still better than 
compromising the whole job or even the driver's health.
my2cent

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: SparkListernerComputeTime.xlsx
>
>
> Currently we are observing the issue of very high event processing delay in 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> component of the scheduler like `ExecutorAllocationManager`, 
> `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might 
> hurt the job performance significantly or even fail the job.  For example, a 
> significant delay in receiving the `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` manager to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listener is very slow, all the listeners will pay the 
> price of delay incurred by the slow listener. In addition to that a slow 
> listener can cause events to be dropped from the event queue which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single threaded event processor. Instead each listener will have its own 
> dedicate single threaded executor service . When ever an event is posted, it 
> will be submitted to executor service of all the listeners. The Single 
> threaded executor service will guarantee in order processing of the events 
> per listener.  The queue used for the executor service will be bounded to 
> guarantee we do not grow the memory indefinitely. The downside of this 
> approach is separate event queue per listener will increase the driver memory 
> footprint. 
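
A minimal sketch of the proposed design (hypothetical names, not the actual 
LiveListenerBus API): one bounded, single-threaded executor per listener, so events 
stay in order per listener and a slow listener no longer delays the others. The 
memory trade-off mentioned above is visible here, since memory grows with the 
number of listeners times the queue capacity.
{code}
import java.util.concurrent.{ArrayBlockingQueue, RejectedExecutionException, ThreadPoolExecutor, TimeUnit}

// Hypothetical sketch: each listener gets its own worker thread and bounded queue.
class PerListenerBus[E](listeners: Seq[E => Unit], queueCapacity: Int = 10000) {

  private val workers: Seq[(ThreadPoolExecutor, E => Unit)] = listeners.map { listener =>
    val queue = new ArrayBlockingQueue[Runnable](queueCapacity)
    // A single thread per listener guarantees in-order processing for that listener.
    (new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, queue), listener)
  }

  def post(event: E): Unit = workers.foreach { case (executor, listener) =>
    try {
      executor.execute(new Runnable { override def run(): Unit = listener(event) })
    } catch {
      // The bounded queue is full: this is where a drop (or block) policy applies.
      case _: RejectedExecutionException => // event dropped for this listener
    }
  }

  def stop(): Unit = workers.foreach { case (executor, _) => executor.shutdown() }
}
{code}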



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-05-19 Thread Antoine PRANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017560#comment-16017560
 ] 

Antoine PRANG commented on SPARK-18838:
---

[~mathieude]]: Yep, I introduced a blocking strategy for the LiveListenerBus (if 
the queue is full, we wait for space instead of dropping events).
This is not the default strategy, but it can be activated through a setting.
The default strategy remains the dropping one.
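
For reference, the two strategies on a bounded queue boil down to the following 
sketch (plain java.util.concurrent, not the actual Spark setting or API):
{code}
import java.util.concurrent.ArrayBlockingQueue

object QueueStrategies {
  val queue = new ArrayBlockingQueue[AnyRef](10000)

  // Dropping strategy: offer() returns false immediately when the queue is full,
  // so the event is lost.
  def postDropping(event: AnyRef): Boolean = queue.offer(event)

  // Blocking strategy: put() waits until space is available, applying
  // back-pressure to the producer.
  def postBlocking(event: AnyRef): Unit = queue.put(event)
}
{code}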

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: SparkListernerComputeTime.xlsx
>
>
> Currently we are observing the issue of very high event processing delay in 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> component of the scheduler like `ExecutorAllocationManager`, 
> `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might 
> hurt the job performance significantly or even fail the job.  For example, a 
> significant delay in receiving the `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` manager to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listener is very slow, all the listeners will pay the 
> price of delay incurred by the slow listener. In addition to that a slow 
> listener can cause events to be dropped from the event queue which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single threaded event processor. Instead each listener will have its own 
> dedicate single threaded executor service . When ever an event is posted, it 
> will be submitted to executor service of all the listeners. The Single 
> threaded executor service will guarantee in order processing of the events 
> per listener.  The queue used for the executor service will be bounded to 
> guarantee we do not grow the memory indefinitely. The downside of this 
> approach is separate event queue per listener will increase the driver memory 
> footprint. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18838) High latency of event processing for large jobs

2017-05-19 Thread Antoine PRANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017560#comment-16017560
 ] 

Antoine PRANG edited comment on SPARK-18838 at 5/19/17 3:39 PM:


[~mathieude]: Yep, I introduced a blocking strategy for the LiveListenerBus (if the 
queue is full, we wait for space instead of dropping events).
This is not the default strategy, but it can be activated through a setting.
The default strategy remains the dropping one.


was (Author: boomx):
[~mathieude]]: Yep, I introduced a blocking strategy for the LiveListenerBus 
(If the queue is full, we wait for space instead of dropping events).
This is not the default strategy but it can be activated through a settings.
The default strategy remains the dropping one.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: SparkListernerComputeTime.xlsx
>
>
> Currently we are observing the issue of very high event processing delay in 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> component of the scheduler like `ExecutorAllocationManager`, 
> `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might 
> hurt the job performance significantly or even fail the job.  For example, a 
> significant delay in receiving the `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` manager to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listener is very slow, all the listeners will pay the 
> price of delay incurred by the slow listener. In addition to that a slow 
> listener can cause events to be dropped from the event queue which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single threaded event processor. Instead each listener will have its own 
> dedicate single threaded executor service . When ever an event is posted, it 
> will be submitted to executor service of all the listeners. The Single 
> threaded executor service will guarantee in order processing of the events 
> per listener.  The queue used for the executor service will be bounded to 
> guarantee we do not grow the memory indefinitely. The downside of this 
> approach is separate event queue per listener will increase the driver memory 
> footprint. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20811) GBT Classifier failed with mysterious StackOverflowError

2017-05-19 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-20811:

Summary: GBT Classifier failed with mysterious StackOverflowError  (was: 
GBT Classifier failed with mysterious StackOverflowException )

> GBT Classifier failed with mysterious StackOverflowError
> 
>
> Key: SPARK-20811
> URL: https://issues.apache.org/jira/browse/SPARK-20811
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> I am running GBT Classifier over airline dataset (combining 2005-2008) and in 
> total it's around 22M examples as training data
> code is simple
> {code:title=Bar.scala|borderStyle=solid}
> val gradientBoostedTrees = new GBTClassifier()
>   gradientBoostedTrees.setMaxBins(1000)
>   gradientBoostedTrees.setMaxIter(500)
>   gradientBoostedTrees.setMaxDepth(6)
>   gradientBoostedTrees.setStepSize(1.0)
>   transformedTrainingSet.cache().foreach(_ => Unit)
>   val startTime = System.nanoTime()
>   val model = gradientBoostedTrees.fit(transformedTrainingSet)
>   println(s"===training time cost: ${(System.nanoTime() - startTime) / 
> 1000.0 / 1000.0} ms")
>   val resultDF = model.transform(transformedTestset)
>   val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
>   
> binaryClassificationEvaluator.setRawPredictionCol("prediction").setLabelCol("label")
>   println(s"=test AUC: 
> ${binaryClassificationEvaluator.evaluate(resultDF)}==")
> {code}
> my training job always failed with 
> {quote}
> 17/05/19 13:41:29 WARN TaskSetManager: Lost task 18.0 in stage 3907.0 (TID 
> 137506, 10.0.0.13, executor 3): java.lang.StackOverflowError
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:3037)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3061)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2234)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479)
>   at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
> {quote}
> the above pattern repeated for many times
> Is it a bug or did I make something wrong when using GBTClassifier in ML?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20811) GBT Classifier failed with mysterious StackOverflowException

2017-05-19 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-20811:
---

 Summary: GBT Classifier failed with mysterious 
StackOverflowException 
 Key: SPARK-20811
 URL: https://issues.apache.org/jira/browse/SPARK-20811
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
Reporter: Nan Zhu


I am running the GBT Classifier over the airline dataset (combining 2005-2008); in 
total that is around 22M training examples.

The code is simple:

{code:title=Bar.scala|borderStyle=solid}
val gradientBoostedTrees = new GBTClassifier()
gradientBoostedTrees.setMaxBins(1000)
gradientBoostedTrees.setMaxIter(500)
gradientBoostedTrees.setMaxDepth(6)
gradientBoostedTrees.setStepSize(1.0)
transformedTrainingSet.cache().foreach(_ => Unit)
val startTime = System.nanoTime()
val model = gradientBoostedTrees.fit(transformedTrainingSet)
println(s"===training time cost: ${(System.nanoTime() - startTime) / 1000.0 / 1000.0} ms")
val resultDF = model.transform(transformedTestset)
val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
binaryClassificationEvaluator.setRawPredictionCol("prediction").setLabelCol("label")
println(s"=test AUC: ${binaryClassificationEvaluator.evaluate(resultDF)}==")
{code}


My training job always fails with:

{quote}
17/05/19 13:41:29 WARN TaskSetManager: Lost task 18.0 in stage 3907.0 (TID 137506, 10.0.0.13, executor 3): java.lang.StackOverflowError
  at java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:3037)
  at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3061)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2234)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
  at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479)
  at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
{quote}

The above pattern is repeated many times.

Is this a bug, or did I do something wrong when using GBTClassifier in ML?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20751) Built-in SQL Function Support - COT

2017-05-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20751:
---

Assignee: Yuming Wang

> Built-in SQL Function Support - COT
> ---
>
> Key: SPARK-20751
> URL: https://issues.apache.org/jira/browse/SPARK-20751
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
> Fix For: 2.3.0
>
>
> {noformat}
> COT(expr)
> {noformat}
> Returns the cotangent of expr.
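
Once the function is available (Fix Version 2.3.0 per this ticket), usage would 
look something like the sketch below; `spark` is assumed to be a SparkSession, and 
the identity cot(x) = cos(x) / sin(x) is shown only as a sanity check:
{code}
// Assumes a SparkSession named `spark` running a Spark version that ships COT.
spark.sql("SELECT COT(1.0)").show()

// Sanity check against the trigonometric identity cot(x) = cos(x) / sin(x).
println(math.cos(1.0) / math.sin(1.0)) // roughly 0.642
{code}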



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20751) Built-in SQL Function Support - COT

2017-05-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20751.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Built-in SQL Function Support - COT
> ---
>
> Key: SPARK-20751
> URL: https://issues.apache.org/jira/browse/SPARK-20751
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
> Fix For: 2.3.0
>
>
> {noformat}
> COT(expr)
> {noformat}
> Returns the cotangent of expr.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20811) GBT Classifier failed with mysterious StackOverflowError

2017-05-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017660#comment-16017660
 ] 

Sean Owen commented on SPARK-20811:
---

I assume it's serialization of a very deep tree via the Java mechanism. Does Kryo 
work differently? Does increasing the stack size with something like -Xss1m at 
least work around it?
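
A sketch of how those two suggestions could be tried; the values are illustrative 
only, and in practice these are usually passed via spark-submit --conf rather than 
set in code:
{code}
import org.apache.spark.SparkConf

// Illustrative settings: switch task serialization to Kryo and enlarge the
// executor/driver thread stacks.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.executor.extraJavaOptions", "-Xss4m")
  .set("spark.driver.extraJavaOptions", "-Xss4m")
{code}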

> GBT Classifier failed with mysterious StackOverflowError
> 
>
> Key: SPARK-20811
> URL: https://issues.apache.org/jira/browse/SPARK-20811
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> I am running GBT Classifier over airline dataset (combining 2005-2008) and in 
> total it's around 22M examples as training data
> code is simple
> {code:title=Bar.scala|borderStyle=solid}
> val gradientBoostedTrees = new GBTClassifier()
>   gradientBoostedTrees.setMaxBins(1000)
>   gradientBoostedTrees.setMaxIter(500)
>   gradientBoostedTrees.setMaxDepth(6)
>   gradientBoostedTrees.setStepSize(1.0)
>   transformedTrainingSet.cache().foreach(_ => Unit)
>   val startTime = System.nanoTime()
>   val model = gradientBoostedTrees.fit(transformedTrainingSet)
>   println(s"===training time cost: ${(System.nanoTime() - startTime) / 
> 1000.0 / 1000.0} ms")
>   val resultDF = model.transform(transformedTestset)
>   val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
>   
> binaryClassificationEvaluator.setRawPredictionCol("prediction").setLabelCol("label")
>   println(s"=test AUC: 
> ${binaryClassificationEvaluator.evaluate(resultDF)}==")
> {code}
> my training job always failed with 
> {quote}
> 17/05/19 13:41:29 WARN TaskSetManager: Lost task 18.0 in stage 3907.0 (TID 
> 137506, 10.0.0.13, executor 3): java.lang.StackOverflowError
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:3037)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3061)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2234)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479)
>   at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
> {quote}
> the above pattern repeated for many times
> Is it a bug or did I make something wrong when using GBTClassifier in ML?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12139) REGEX Column Specification for Hive Queries

2017-05-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-12139:
-

> REGEX Column Specification for Hive Queries
> ---
>
> Key: SPARK-12139
> URL: https://issues.apache.org/jira/browse/SPARK-12139
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Derek Sabry
>Priority: Minor
>
> When executing a query of the form
> Select `(a)?\+.\+` from A,
> Hive would interpret this query as a regular expression, which can be 
> supported in the hive parser for spark



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2017-05-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017699#comment-16017699
 ] 

Felix Cheung commented on SPARK-18825:
--

Interesting - do you think knitr can take your change for -method?
I'm actually not sure about the part with dontrun - could you explain a bit?


> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generate the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20763) The function of `month` and `day` return a value which is not we expected

2017-05-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20763:

Fix Version/s: (was: 2.3.0)

> The function of  `month` and `day` return a value which is not we expected
> --
>
> Key: SPARK-20763
> URL: https://issues.apache.org/jira/browse/SPARK-20763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.2.0
>
>
> spark-sql>select month("1582-09-28");
> spark-sql>10
> For this case, the expected result is 9, but it is 10.
> spark-sql>select day("1582-04-18");
> spark-sql>28
> For this case, the expected result is 18, but it is 28.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20763) The function of `month` and `day` return a value which is not we expected

2017-05-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20763.
-
   Resolution: Fixed
 Assignee: liuxian
Fix Version/s: 2.3.0
   2.2.0

> The function of  `month` and `day` return a value which is not we expected
> --
>
> Key: SPARK-20763
> URL: https://issues.apache.org/jira/browse/SPARK-20763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
>
> spark-sql>select month("1582-09-28");
> spark-sql>10
> For this case, the expected result is 9, but it is 10.
> spark-sql>select day("1582-04-18");
> spark-sql>28
> For this case, the expected result is 18, but it is 28.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20812) Add Mesos Secrets support to the spark dispatcher

2017-05-19 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-20812:
---

 Summary: Add Mesos Secrets support to the spark dispatcher
 Key: SPARK-20812
 URL: https://issues.apache.org/jira/browse/SPARK-20812
 Project: Spark
  Issue Type: New Feature
  Components: Mesos
Affects Versions: 2.3.0
Reporter: Michael Gummelt


Mesos 1.3 supports secrets.  In order to support sending keytabs through the 
Spark Dispatcher, or any other secret, we need to integrate this with the Spark 
Dispatcher.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12139) REGEX Column Specification for Hive Queries

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12139:


Assignee: Apache Spark

> REGEX Column Specification for Hive Queries
> ---
>
> Key: SPARK-12139
> URL: https://issues.apache.org/jira/browse/SPARK-12139
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Derek Sabry
>Assignee: Apache Spark
>Priority: Minor
>
> When executing a query of the form
> Select `(a)?\+.\+` from A,
> Hive would interpret this query as a regular expression, which can be 
> supported in the hive parser for spark



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12139) REGEX Column Specification for Hive Queries

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12139:


Assignee: (was: Apache Spark)

> REGEX Column Specification for Hive Queries
> ---
>
> Key: SPARK-12139
> URL: https://issues.apache.org/jira/browse/SPARK-12139
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Derek Sabry
>Priority: Minor
>
> When executing a query of the form
> Select `(a)?\+.\+` from A,
> Hive would interpret this query as a regular expression, which can be 
> supported in the hive parser for spark



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20812) Add Mesos Secrets support to the spark dispatcher

2017-05-19 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt updated SPARK-20812:

Description: 
Mesos 1.3 supports secrets.  In order to support sending keytabs through the 
Spark Dispatcher, or any other secret, we need to integrate this with the Spark 
Dispatcher.

The integration should include support for both file-based and env-based 
secrets.

  was:Mesos 1.3 supports secrets.  In order to support sending keytabs through 
the Spark Dispatcher, or any other secret, we need to integrate this with the 
Spark Dispatcher.


> Add Mesos Secrets support to the spark dispatcher
> -
>
> Key: SPARK-20812
> URL: https://issues.apache.org/jira/browse/SPARK-20812
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Michael Gummelt
>
> Mesos 1.3 supports secrets.  In order to support sending keytabs through the 
> Spark Dispatcher, or any other secret, we need to integrate this with the 
> Spark Dispatcher.
> The integration should include support for both file-based and env-based 
> secrets.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates

2017-05-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20512:
--
Description: 
Before the release, we need to update the SparkR Programming Guide, its 
migration guide, and the R vignettes.  Updates will include:
* Add migration guide subsection.
** Use the results of the QA audit JIRAs and [SPARK-18864].
* Check phrasing, especially in main sections (for outdated items such as "In 
this release, ...")
* Update R vignettes

Note: This task is for large changes to the guides.  New features are handled 
in [SPARK-18330].

  was:
Before the release, we need to update the SparkR Programming Guide, its 
migration guide, and the R vignettes.  Updates will include:
* Add migration guide subsection.
** Use the results of the QA audit JIRAs and [SPARK-17692].
* Check phrasing, especially in main sections (for outdated items such as "In 
this release, ...")
* Update R vignettes

Note: This task is for large changes to the guides.  New features are handled 
in [SPARK-18330].


> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-18864].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates

2017-05-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20512:
--
Description: 
Before the release, we need to update the SparkR Programming Guide, its 
migration guide, and the R vignettes.  Updates will include:
* Add migration guide subsection.
** Use the results of the QA audit JIRAs and [SPARK-18864].
* Check phrasing, especially in main sections (for outdated items such as "In 
this release, ...")
* Update R vignettes

Note: This task is for large changes to the guides.  New features are handled 
in [SPARK-20505].

  was:
Before the release, we need to update the SparkR Programming Guide, its 
migration guide, and the R vignettes.  Updates will include:
* Add migration guide subsection.
** Use the results of the QA audit JIRAs and [SPARK-18864].
* Check phrasing, especially in main sections (for outdated items such as "In 
this release, ...")
* Update R vignettes

Note: This task is for large changes to the guides.  New features are handled 
in [SPARK-18330].


> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-18864].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-20505].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20813) executor page search by status not working

2017-05-19 Thread Jong Yoon Lee (JIRA)
Jong Yoon Lee created SPARK-20813:
-

 Summary: executor page  search by status not working 
 Key: SPARK-20813
 URL: https://issues.apache.org/jira/browse/SPARK-20813
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0
Reporter: Jong Yoon Lee
Priority: Trivial


When searching for status keywords such as Active, Dead, or Blacklisted, nothing is 
returned in the table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20813) Web UI executor page tab search by status not working

2017-05-19 Thread Jong Yoon Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jong Yoon Lee updated SPARK-20813:
--
Summary: Web UI executor page tab search by status not working   (was: 
executor page  search by status not working )

> Web UI executor page tab search by status not working 
> --
>
> Key: SPARK-20813
> URL: https://issues.apache.org/jira/browse/SPARK-20813
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Jong Yoon Lee
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When searching for status keywords such as active, dead or Blacklisted 
> nothing is returned on the table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2017-05-19 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017817#comment-16017817
 ] 

Maciej Szymkiewicz commented on SPARK-18825:


Originally I thought about patching it for our own usage, but I can open an 
issue / PR and see what they have to say. Problematic html is not even 
generated by {{knitr}} so technically speaking we can just {{sed}} this thing 
between:

{code}
. "$FWDIR/install-dev.sh"
{code}

and calling {{knitr}}

Regarding {{dontrun}} -  right now we have a lot of examples which are never 
executed to satisfy CRAN requirements. Calling these could:

- Serve as additional tests.
- Reduce maintenance burden.
- Improve quality of the docs (strip {{## Not run:}} and {{##D}} and provide 
actual output).


> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generate the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2017-05-19 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017817#comment-16017817
 ] 

Maciej Szymkiewicz edited comment on SPARK-18825 at 5/19/17 6:42 PM:
-

Originally I thought about patching it for our own usage, but I can open an 
issue / PR and see what they have to say. Problematic html is not even 
generated by {{knitr}} so technically speaking we can just {{sed}} this thing 
between:

{code}
. "$FWDIR/install-dev.sh"
{code}

and calling {{knitr}}

Regarding {{dontrun}} -  right now we have a lot of examples which are never 
executed to satisfy CRAN requirements but could be run locally when we 
{{create_docs}}. Running these could:

- Serve as additional tests.
- Reduce maintenance burden.
- Improve quality of the docs (strip {{## Not run:}} and {{##D}} and provide 
actual output).



was (Author: zero323):
Originally I thought about patching it for our own usage, but I can open an 
issue / PR and see what they have to say. Problematic html is not even 
generated by {{knitr}} so technically speaking we can just {{sed}} this thing 
between:

{code}
. "$FWDIR/install-dev.sh"
{code}

and calling {{knitr}}

Regarding {{dontrun}} -  right now we have a lot of examples which are never 
executed to satisfy CRAN requirements. Calling these could:

- Serve as additional tests.
- Reduce maintenance burden.
- Improve quality of the docs (strip {{## Not run:}} and {{##D}} and provide 
actual output).


> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generate the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20506) ML, Graph 2.2 QA: Programming guide update and migration guide

2017-05-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20506.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17996
[https://github.com/apache/spark/pull/17996]

> ML, Graph 2.2 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-20506
> URL: https://issues.apache.org/jira/browse/SPARK-20506
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.2.0
>
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20813) Web UI executor page tab search by status not working

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20813:


Assignee: Apache Spark

> Web UI executor page tab search by status not working 
> --
>
> Key: SPARK-20813
> URL: https://issues.apache.org/jira/browse/SPARK-20813
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Jong Yoon Lee
>Assignee: Apache Spark
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When searching for status keywords such as Active, Dead or Blacklisted, 
> nothing is returned in the table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20813) Web UI executor page tab search by status not working

2017-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017907#comment-16017907
 ] 

Apache Spark commented on SPARK-20813:
--

User 'yoonlee95' has created a pull request for this issue:
https://github.com/apache/spark/pull/18036

> Web UI executor page tab search by status not working 
> --
>
> Key: SPARK-20813
> URL: https://issues.apache.org/jira/browse/SPARK-20813
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Jong Yoon Lee
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When searching for status keywords such as Active, Dead or Blacklisted, 
> nothing is returned in the table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20813) Web UI executor page tab search by status not working

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20813:


Assignee: (was: Apache Spark)

> Web UI executor page tab search by status not working 
> --
>
> Key: SPARK-20813
> URL: https://issues.apache.org/jira/browse/SPARK-20813
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Jong Yoon Lee
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When searching for status keywords such as Active, Dead or Blacklisted, 
> nothing is returned in the table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems

2017-05-19 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017912#comment-16017912
 ] 

Paul Praet commented on SPARK-4820:
---

I confirm - still an issue when trying to build Spark 2.1.1 on Ubuntu 16.04.

> Spark build encounters "File name too long" on some encrypted filesystems
> -
>
> Key: SPARK-4820
> URL: https://issues.apache.org/jira/browse/SPARK-4820
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Theodore Vasiloudis
>Priority: Minor
> Fix For: 1.4.0
>
>
> This was reported by Luchesar Cekov on github along with a proposed fix. The 
> fix has some potential downstream issues (it will modify the classnames) so 
> until we understand better how many users are affected we aren't going to 
> merge it. However, I'd like to include the issue and workaround here. If you 
> encounter this issue please comment on the JIRA so we can assess the 
> frequency.
> The issue produces this error:
> {code}
> [error] == Expanded type of tree ==
> [error] 
> [error] ConstantType(value = Constant(Throwable))
> [error] 
> [error] uncaught exception during compilation: java.io.IOException
> [error] File name too long
> [error] two errors found
> {code}
> The workaround is, in Maven, to add the following under the compile options: 
> {code}
> +  <arg>-Xmax-classfile-name</arg>
> +  <arg>128</arg>
> {code}
> In SBT add:
> {code}
> +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong

2017-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20781.
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 18013
[https://github.com/apache/spark/pull/18013]

> the location of Dockerfile in docker.properties.template is wrong
> -
>
> Key: SPARK-20781
> URL: https://issues.apache.org/jira/browse/SPARK-20781
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
> Fix For: 2.2.0, 2.1.2
>
>
> the location of Dockerfile in docker.properties.template should be 
> "../external/docker/spark-mesos/Dockerfile"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong

2017-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20781:
-

  Assignee: liuzhaokun
  Priority: Minor  (was: Major)
Issue Type: Bug  (was: Improvement)

> the location of Dockerfile in docker.properties.template is wrong
> -
>
> Key: SPARK-20781
> URL: https://issues.apache.org/jira/browse/SPARK-20781
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Assignee: liuzhaokun
>Priority: Minor
> Fix For: 2.1.2, 2.2.0
>
>
> the location of Dockerfile in docker.properties.template should be 
> "../external/docker/spark-mesos/Dockerfile"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration

2017-05-19 Thread Gene Pang (JIRA)
Gene Pang created SPARK-20814:
-

 Summary: Mesos scheduler does not respect 
spark.executor.extraClassPath configuration
 Key: SPARK-20814
 URL: https://issues.apache.org/jira/browse/SPARK-20814
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.2.0
Reporter: Gene Pang


When Spark executors are deployed on Mesos, the Mesos scheduler no longer 
respects the "spark.executor.extraClassPath" configuration parameter.

MesosCoarseGrainedSchedulerBackend used to use the environment variable 
"SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the 
executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was removed 
in this commit 
[https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178].

This effectively broke the ability for users to specify 
"spark.executor.extraClassPath" for Spark executors deployed on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration

2017-05-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-20814:
---
Target Version/s: 2.2.0
Priority: Critical  (was: Major)

> Mesos scheduler does not respect spark.executor.extraClassPath configuration
> 
>
> Key: SPARK-20814
> URL: https://issues.apache.org/jira/browse/SPARK-20814
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Gene Pang
>Priority: Critical
>
> When Spark executors are deployed on Mesos, the Mesos scheduler no longer 
> respects the "spark.executor.extraClassPath" configuration parameter.
> MesosCoarseGrainedSchedulerBackend used to use the environment variable 
> "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the 
> executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was 
> removed in this commit 
> [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178].
> This effectively broke the ability for users to specify 
> "spark.executor.extraClassPath" for Spark executors deployed on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration

2017-05-19 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017979#comment-16017979
 ] 

Marcelo Vanzin commented on SPARK-20814:


Hmm, this sucks, we should fix it for 2.2 (FYI [~marmbrus]).

Let me take a stab at fixing just the Mesos usage without re-introducing that 
variable.

> Mesos scheduler does not respect spark.executor.extraClassPath configuration
> 
>
> Key: SPARK-20814
> URL: https://issues.apache.org/jira/browse/SPARK-20814
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Gene Pang
>
> When Spark executors are deployed on Mesos, the Mesos scheduler no longer 
> respects the "spark.executor.extraClassPath" configuration parameter.
> MesosCoarseGrainedSchedulerBackend used to use the environment variable 
> "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the 
> executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was 
> removed in this commit 
> [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178].
> This effectively broke the ability for users to specify 
> "spark.executor.extraClassPath" for Spark executors deployed on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration

2017-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017993#comment-16017993
 ] 

Apache Spark commented on SPARK-20814:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18037

> Mesos scheduler does not respect spark.executor.extraClassPath configuration
> 
>
> Key: SPARK-20814
> URL: https://issues.apache.org/jira/browse/SPARK-20814
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Gene Pang
>Priority: Critical
>
> When Spark executors are deployed on Mesos, the Mesos scheduler no longer 
> respects the "spark.executor.extraClassPath" configuration parameter.
> MesosCoarseGrainedSchedulerBackend used to use the environment variable 
> "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the 
> executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was 
> removed in this commit 
> [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178].
> This effectively broke the ability for users to specify 
> "spark.executor.extraClassPath" for Spark executors deployed on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20814:


Assignee: Apache Spark

> Mesos scheduler does not respect spark.executor.extraClassPath configuration
> 
>
> Key: SPARK-20814
> URL: https://issues.apache.org/jira/browse/SPARK-20814
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Gene Pang
>Assignee: Apache Spark
>Priority: Critical
>
> When Spark executors are deployed on Mesos, the Mesos scheduler no longer 
> respects the "spark.executor.extraClassPath" configuration parameter.
> MesosCoarseGrainedSchedulerBackend used to use the environment variable 
> "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the 
> executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was 
> removed in this commit 
> [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178].
> This effectively broke the ability for users to specify 
> "spark.executor.extraClassPath" for Spark executors deployed on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration

2017-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20814:


Assignee: (was: Apache Spark)

> Mesos scheduler does not respect spark.executor.extraClassPath configuration
> 
>
> Key: SPARK-20814
> URL: https://issues.apache.org/jira/browse/SPARK-20814
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Gene Pang
>Priority: Critical
>
> When Spark executors are deployed on Mesos, the Mesos scheduler no longer 
> respects the "spark.executor.extraClassPath" configuration parameter.
> MesosCoarseGrainedSchedulerBackend used to use the environment variable 
> "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the 
> executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was 
> removed in this commit 
> [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178].
> This effectively broke the ability for users to specify 
> "spark.executor.extraClassPath" for Spark executors deployed on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20683) Make table uncache chaining optional

2017-05-19 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018008#comment-16018008
 ] 

Andrew Ash commented on SPARK-20683:


Thanks for that diff [~shea.parkes] -- we're planning on trying it in our fork 
too: https://github.com/palantir/spark/pull/188

> Make table uncache chaining optional
> 
>
> Key: SPARK-20683
> URL: https://issues.apache.org/jira/browse/SPARK-20683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Not particularly environment sensitive.  
> Encountered/tested on Linux and Windows.
>Reporter: Shea Parkes
>
> A recent change was made in SPARK-19765 that causes table uncaching to chain. 
>  That is, if table B is a child of table A, and they are both cached, now 
> uncaching table A will automatically uncache table B.
> At first I did not understand the need for this, but when reading the unit 
> tests, I see that it is likely that many people do not keep named references 
> to the child table (e.g. B).  Perhaps B is just made and cached as some part 
> of data exploration.  In that situation, it makes sense for B to 
> automatically be uncached when you are finished with A.
> However, we commonly utilize a different design pattern that is now harmed by 
> this automatic uncaching.  It is common for us to cache table A to then make 
> two, independent children tables (e.g. B and C).  Once those two child tables 
> are realized and cached, we'd then uncache table A (as it was no longer 
> needed and could be quite large).  After this change now, when we uncache 
> table A, we suddenly lose our cached status on both table B and C (which is 
> quite frustrating).  All of these tables are often quite large, and we view 
> what we're doing as mindful memory management.  We are maintaining named 
> references to B and C at all times, so we can always uncache them ourselves 
> when it makes sense.
> Would it be acceptable/feasible to make this table uncache chaining optional? 
>  I would be fine if the default is for the chaining to happen, as long as we 
> can turn it off via parameters.
> If acceptable, I can try to work towards making the required changes.  I am 
> most comfortable in Python (and would want the optional parameter surfaced in 
> Python), but have found the places required to make this change in Scala 
> (since I reverted the functionality in a private fork already).  Any help 
> would be greatly appreciated however.
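
To make the pattern in the report concrete, here is a minimal sketch of the 
workflow being described; the path, view names and column are illustrative only, 
and the requested opt-out parameter does not exist today:

{code}
import org.apache.spark.sql.SparkSession

// Sketch of the reported workflow: cache parent A, build and cache two
// children B and C, then uncache A. Since SPARK-19765 this last step also
// uncaches B and C, which is the chaining the reporter wants to be optional.
object UncacheChainingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("uncache-chaining-sketch").getOrCreate()

    spark.read.parquet("/data/a").createOrReplaceTempView("a")   // illustrative path
    spark.sql("SELECT * FROM a WHERE kind = 'b'").createOrReplaceTempView("b")
    spark.sql("SELECT * FROM a WHERE kind = 'c'").createOrReplaceTempView("c")

    Seq("a", "b", "c").foreach(t => spark.catalog.cacheTable(t))
    spark.table("b").count()   // materialize the cached children
    spark.table("c").count()

    spark.catalog.uncacheTable("a")   // currently also drops the cache for b and c
  }
}
{code}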



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20815) NullPointerException in RPackageUtils#checkManifestForR

2017-05-19 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-20815:
--

 Summary: NullPointerException in RPackageUtils#checkManifestForR
 Key: SPARK-20815
 URL: https://issues.apache.org/jira/browse/SPARK-20815
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.1
Reporter: Andrew Ash


Some jars don't have manifest files in them, such as in my case 
javax.inject-1.jar and value-2.2.1-annotations.jar

This causes the below NPE:

{noformat}
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.deploy.RPackageUtils$.checkManifestForR(RPackageUtils.scala:95)
at 
org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply$mcV$sp(RPackageUtils.scala:180)
at 
org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180)
at 
org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1322)
at 
org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:202)
at 
org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:175)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
org.apache.spark.deploy.RPackageUtils$.checkAndBuildRPackage(RPackageUtils.scala:175)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:311)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:152)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}

due to RPackageUtils#checkManifestForR assuming {{jar.getManifest}} is non-null.

However per the JDK spec it can be null:

{noformat}
/**
 * Returns the jar file manifest, or null if none.
 *
 * @return the jar file manifest, or null if none
 *
 * @throws IllegalStateException
 * may be thrown if the jar file has been closed
 * @throws IOException  if an I/O error has occurred
 */
public Manifest getManifest() throws IOException {
return getManifestFromReference();
}
{noformat}

This method should do a null check and return false if the manifest is null 
(meaning no R code in that jar)
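
A minimal sketch of the proposed guard (not the actual Spark patch; the manifest 
attribute name checked below is an assumption):

{code}
import java.util.jar.JarFile

object ManifestNullCheckSketch {
  // Treat a jar without a manifest as containing no R code instead of
  // dereferencing a null manifest.
  def jarHasRCode(jar: JarFile): Boolean = {
    val manifest = jar.getManifest        // may be null per the JarFile javadoc
    if (manifest == null) {
      false
    } else {
      // Placeholder attribute check; the real RPackageUtils logic may differ.
      Option(manifest.getMainAttributes.getValue("Spark-HasRPackage"))
        .exists(_.trim.equalsIgnoreCase("true"))
    }
  }
}
{code}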



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20811) GBT Classifier failed with mysterious StackOverflowError

2017-05-19 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018246#comment-16018246
 ] 

Nan Zhu commented on SPARK-20811:
-

thanks, let me try it

> GBT Classifier failed with mysterious StackOverflowError
> 
>
> Key: SPARK-20811
> URL: https://issues.apache.org/jira/browse/SPARK-20811
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> I am running the GBT Classifier over the airline dataset (combining 2005-2008); 
> in total it's around 22M training examples.
> The code is simple:
> {code:title=Bar.scala|borderStyle=solid}
> val gradientBoostedTrees = new GBTClassifier()
>   gradientBoostedTrees.setMaxBins(1000)
>   gradientBoostedTrees.setMaxIter(500)
>   gradientBoostedTrees.setMaxDepth(6)
>   gradientBoostedTrees.setStepSize(1.0)
>   transformedTrainingSet.cache().foreach(_ => Unit)
>   val startTime = System.nanoTime()
>   val model = gradientBoostedTrees.fit(transformedTrainingSet)
>   println(s"===training time cost: ${(System.nanoTime() - startTime) / 
> 1000.0 / 1000.0} ms")
>   val resultDF = model.transform(transformedTestset)
>   val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
>   
> binaryClassificationEvaluator.setRawPredictionCol("prediction").setLabelCol("label")
>   println(s"=test AUC: 
> ${binaryClassificationEvaluator.evaluate(resultDF)}==")
> {code}
> My training job always fails with:
> {quote}
> 17/05/19 13:41:29 WARN TaskSetManager: Lost task 18.0 in stage 3907.0 (TID 
> 137506, 10.0.0.13, executor 3): java.lang.StackOverflowError
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:3037)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3061)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2234)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479)
>   at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
> {quote}
> The above pattern is repeated many times.
> Is this a bug, or did I do something wrong when using GBTClassifier in ML?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20815) NullPointerException in RPackageUtils#checkManifestForR

2017-05-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018300#comment-16018300
 ] 

Felix Cheung commented on SPARK-20815:
--

Makes sense to me. Would you like to contribute the fix?


> NullPointerException in RPackageUtils#checkManifestForR
> ---
>
> Key: SPARK-20815
> URL: https://issues.apache.org/jira/browse/SPARK-20815
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Andrew Ash
>
> Some jars don't have manifest files in them, such as in my case 
> javax.inject-1.jar and value-2.2.1-annotations.jar
> This causes the below NPE:
> {noformat}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.deploy.RPackageUtils$.checkManifestForR(RPackageUtils.scala:95)
> at 
> org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply$mcV$sp(RPackageUtils.scala:180)
> at 
> org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180)
> at 
> org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1322)
> at 
> org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:202)
> at 
> org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:175)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> org.apache.spark.deploy.RPackageUtils$.checkAndBuildRPackage(RPackageUtils.scala:175)
> at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:311)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:152)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {noformat}
> due to RPackageUtils#checkManifestForR assuming {{jar.getManifest}} is 
> non-null.
> However per the JDK spec it can be null:
> {noformat}
> /**
>  * Returns the jar file manifest, or null if none.
>  *
>  * @return the jar file manifest, or null if none
>  *
>  * @throws IllegalStateException
>  * may be thrown if the jar file has been closed
>  * @throws IOException  if an I/O error has occurred
>  */
> public Manifest getManifest() throws IOException {
> return getManifestFromReference();
> }
> {noformat}
> This method should do a null check and return false if the manifest is null 
> (meaning no R code in that jar)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20727) Skip SparkR tests when missing Hadoop winutils on CRAN windows machines

2017-05-19 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20727:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-15799

> Skip SparkR tests when missing Hadoop winutils on CRAN windows machines
> ---
>
> Key: SPARK-20727
> URL: https://issues.apache.org/jira/browse/SPARK-20727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shivaram Venkataraman
>
> We should skip tests that use the Hadoop libraries while running
> the CRAN check with Windows as the operating system. This is to handle
> cases where the Hadoop winutils binaries are not available on the target
> system. The skipped tests will consist of
> 1. Tests that save, load a model in MLlib
> 2. Tests that save, load CSV, JSON and Parquet files in SQL
> 3. Hive tests
> Note that these tests will still be run on AppVeyor for every PR, so our 
> overall test coverage should not go down



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2017-05-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018313#comment-16018313
 ] 

Felix Cheung commented on SPARK-18825:
--

I see, about dontrun - yes, I don't think we can remove dontrun from the examples 
because they would take too long for the CRAN check (we are already trimming a lot 
and will likely need to trim more to make it work), but if we have a way to run 
the examples during an explicit doc-generation step it could be useful.

> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index. E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generates the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20805) updated updateP in SVD++ is error

2017-05-19 Thread BoLing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018315#comment-16018315
 ] 

BoLing commented on SPARK-20805:


Hi Sean Owen, you can see this URL: 
https://github.com/NewBoLing/GraphSVD-/blob/master/SVDPlusPlus

> updated  updateP in SVD++ is error
> --
>
> Key: SPARK-20805
> URL: https://issues.apache.org/jira/browse/SPARK-20805
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1, 2.1.1
>Reporter: BoLing
>
> In the SVD++ algorithm, we all know that usr._2 stores the value of pu + 
> |N(u)|^(-0.5)*sum(y); the function sendMsgTrainF computes the updated values 
> updateP, updateQ and updateY. During the iterations, the y part of usr._2 is 
> updated, but pu never is, so we should fix the sendMessageToSrcFunction in 
> sendMsgTrainF. The old code is ctx.sendToSrc((updateP, updateY, (err - 
> conf.gamma6 * usr._3) * conf.gamma1)). If we change it to 
> ctx.sendToSrc((updateP, updateP, (err - conf.gamma6 * usr._3) * conf.gamma1)), 
> it may achieve the effect we want.
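
For readability, the proposed one-line change can be sketched as follows; the 
types are simplified stand-ins for the actual GraphX SVDPlusPlus code and are 
only illustrative:

{code}
object SvdPlusPlusMsgSketch {
  final case class Conf(gamma1: Double, gamma6: Double)

  // usr = (pu + |N(u)|^(-0.5)*sum(y), accumulated y part, user bias) -- a
  // simplified stand-in for the vertex attribute described above.
  def messageToSrc(
      usr: (Array[Double], Array[Double], Double),
      updateP: Array[Double],
      updateY: Array[Double],
      err: Double,
      conf: Conf): (Array[Double], Array[Double], Double) = {
    // Old: ctx.sendToSrc((updateP, updateY, (err - conf.gamma6 * usr._3) * conf.gamma1))
    // Proposed: send updateP in the second slot too, so pu is also updated.
    (updateP, updateP, (err - conf.gamma6 * usr._3) * conf.gamma1)
  }
}
{code}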



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20805) updated updateP in SVD++ is error

2017-05-19 Thread BoLing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018315#comment-16018315
 ] 

BoLing edited comment on SPARK-20805 at 5/20/17 4:31 AM:
-

Hi @Sean Owen, you can see this URL: 
https://github.com/NewBoLing/GraphSVD-/blob/master/SVDPlusPlus


was (Author: boling):
hi, Sean Owen, you can see this url 
https://github.com/NewBoLing/GraphSVD-/blob/master/SVDPlusPlus

> updated  updateP in SVD++ is error
> --
>
> Key: SPARK-20805
> URL: https://issues.apache.org/jira/browse/SPARK-20805
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1, 2.1.1
>Reporter: BoLing
>
> In the SVD++ algorithm, we all know that usr._2 stores the value of pu + 
> |N(u)|^(-0.5)*sum(y); the function sendMsgTrainF computes the updated values 
> updateP, updateQ and updateY. During the iterations, the y part of usr._2 is 
> updated, but pu never is, so we should fix the sendMessageToSrcFunction in 
> sendMsgTrainF. The old code is ctx.sendToSrc((updateP, updateY, (err - 
> conf.gamma6 * usr._3) * conf.gamma1)). If we change it to 
> ctx.sendToSrc((updateP, updateP, (err - conf.gamma6 * usr._3) * conf.gamma1)), 
> it may achieve the effect we want.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2017-05-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018314#comment-16018314
 ] 

Felix Cheung commented on SPARK-18825:
--

handling a fork of knitr might be too hard to maintain, given that we don't 
have direct access to the Jenkins boxes.

> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index. E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generates the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20805) updated updateP in SVD++ is error

2017-05-19 Thread BoLing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018315#comment-16018315
 ] 

BoLing edited comment on SPARK-20805 at 5/20/17 4:32 AM:
-

Hi Sean Owen, you can see this URL: 
https://github.com/NewBoLing/GraphSVD-/blob/master/SVDPlusPlus


was (Author: boling):
hi, @Sean Owen, you can see this url 
https://github.com/NewBoLing/GraphSVD-/blob/master/SVDPlusPlus

> updated  updateP in SVD++ is error
> --
>
> Key: SPARK-20805
> URL: https://issues.apache.org/jira/browse/SPARK-20805
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1, 2.1.1
>Reporter: BoLing
>
> In the SVD++ algorithm, we all know that usr._2 stores the value of pu + 
> |N(u)|^(-0.5)*sum(y); the function sendMsgTrainF computes the updated values 
> updateP, updateQ and updateY. During the iterations, the y part of usr._2 is 
> updated, but pu never is, so we should fix the sendMessageToSrcFunction in 
> sendMsgTrainF. The old code is ctx.sendToSrc((updateP, updateY, (err - 
> conf.gamma6 * usr._3) * conf.gamma1)). If we change it to 
> ctx.sendToSrc((updateP, updateP, (err - conf.gamma6 * usr._3) * conf.gamma1)), 
> it may achieve the effect we want.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20751) Built-in SQL Function Support - COT

2017-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018321#comment-16018321
 ] 

Apache Spark commented on SPARK-20751:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/18039

> Built-in SQL Function Support - COT
> ---
>
> Key: SPARK-20751
> URL: https://issues.apache.org/jira/browse/SPARK-20751
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
> Fix For: 2.3.0
>
>
> {noformat}
> COT(expr)
> {noformat}
> Returns the cotangent of expr.
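
For illustration, a minimal usage sketch on a build that already includes the 
function (the issue is marked fixed for 2.3.0); the literal argument is arbitrary:

{code}
import org.apache.spark.sql.SparkSession

object CotUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cot-usage-sketch").getOrCreate()
    // cot(x) = cos(x) / sin(x); COT(1.0) is roughly 0.6421.
    spark.sql("SELECT COT(1.0)").show()
  }
}
{code}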



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20815) NullPointerException in RPackageUtils#checkManifestForR

2017-05-19 Thread James Shuster (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018326#comment-16018326
 ] 

James Shuster commented on SPARK-20815:
---

I have a fix in the works, just adding a test case and running the full test 
suite now.

> NullPointerException in RPackageUtils#checkManifestForR
> ---
>
> Key: SPARK-20815
> URL: https://issues.apache.org/jira/browse/SPARK-20815
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Andrew Ash
>
> Some jars don't have manifest files in them, such as in my case 
> javax.inject-1.jar and value-2.2.1-annotations.jar
> This causes the below NPE:
> {noformat}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.deploy.RPackageUtils$.checkManifestForR(RPackageUtils.scala:95)
> at 
> org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply$mcV$sp(RPackageUtils.scala:180)
> at 
> org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180)
> at 
> org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1322)
> at 
> org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:202)
> at 
> org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:175)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> org.apache.spark.deploy.RPackageUtils$.checkAndBuildRPackage(RPackageUtils.scala:175)
> at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:311)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:152)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {noformat}
> due to RPackageUtils#checkManifestForR assuming {{jar.getManifest}} is 
> non-null.
> However per the JDK spec it can be null:
> {noformat}
> /**
>  * Returns the jar file manifest, or null if none.
>  *
>  * @return the jar file manifest, or null if none
>  *
>  * @throws IllegalStateException
>  * may be thrown if the jar file has been closed
>  * @throws IOException  if an I/O error has occurred
>  */
> public Manifest getManifest() throws IOException {
> return getManifestFromReference();
> }
> {noformat}
> This method should do a null check and return false if the manifest is null 
> (meaning no R code in that jar)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


