[jira] [Updated] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors

2015-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10981:
--
Assignee: Monica Liu

> R semijoin leads to Java errors, R leftsemi leads to Spark errors
> -
>
> Key: SPARK-10981
> URL: https://issues.apache.org/jira/browse/SPARK-10981
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.5.0
> Environment: SparkR from RStudio on Macbook
>Reporter: Monica Liu
>Assignee: Monica Liu
>Priority: Minor
>  Labels: easyfix, newbie
> Fix For: 1.5.2, 1.6.0
>
>
> I am using SparkR from RStudio, and I ran into an error with the join 
> function that I recreated with a smaller example:
> {code:title=joinTest.R|borderStyle=solid}
> Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init("local[4]")
> sqlContext <- sparkRSQL.init(sc) 
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df1= createDataFrame(sqlContext, df)
> showDF(df1)
> x = c(2, 3, 10)
> t = c("dd", "ee", "ff")
> c = c(FALSE, FALSE, TRUE)
> dff = data.frame(x, t, c)
> df2 = createDataFrame(sqlContext, dff)
> showDF(df2)
> res = join(df1, df2, df1$n == df2$x, "semijoin")
> showDF(res)
> {code}
> Running this code, I encountered the error:
> {panel}
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. 
> Supported join types include: 'inner', 'outer', 'full', 'fullouter', 
> 'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
> {panel}
> However, if I changed the joinType to "leftsemi", 
> {code}
> res = join(df1, df2, df1$n == df2$x, "leftsemi")
> {code}
> I would get the error:
> {panel}
> Error in .local(x, y, ...) : 
>   joinType must be one of the following types: 'inner', 'outer', 
> 'left_outer', 'right_outer', 'semijoin'
> {panel}
> Since the join function in R appears to invoke a Java method, I went into 
> DataFrame.R and changed lines 1374 and 1378, replacing "semijoin" with 
> "leftsemi" to match the Java function's parameters. This also makes the 
> accepted R joinType values match those of Scala. 
> semijoin:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "semijoin")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
> }
> {code}
> leftsemi:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "leftsemi")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
> }
> {code}
> This fixed the issue, but I'm not sure whether this solution breaks Hive 
> compatibility or causes other issues. I can submit a pull request to change 
> this.
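For reference, a minimal spark-shell (Scala) check of the JVM-side join types that the R wrapper has to match, mirroring the R example above. This snippet is shown as an illustration against the 1.5-era DataFrame API and is not part of the original report; it assumes the sqlContext that spark-shell pre-defines.

{code}
// Run in spark-shell 1.5.x, which already provides sc and sqlContext.
import sqlContext.implicits._

val df1 = Seq((2, "aa", true), (3, "bb", false), (5, "cc", true)).toDF("n", "s", "b")
val df2 = Seq((2, "dd", false), (3, "ee", false), (10, "ff", true)).toDF("x", "t", "c")

// Accepted by the JVM side: "leftsemi" is in the list from the Java error above.
df1.join(df2, df1("n") === df2("x"), "leftsemi").show()

// Throws java.lang.IllegalArgumentException: Unsupported join type 'semijoin'.
// df1.join(df2, df1("n") === df2("x"), "semijoin").show()
{code}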






[jira] [Commented] (SPARK-11067) Spark SQL thrift server fails to handle decimal value

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956129#comment-14956129
 ] 

Apache Spark commented on SPARK-11067:
--

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/9107

> Spark SQL thrift server fails to handle decimal value
> -
>
> Key: SPARK-11067
> URL: https://issues.apache.org/jira/browse/SPARK-11067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
> Attachments: SPARK-11067.1.patch.txt
>
>
> When executing the following query through beeline connected to the Spark SQL 
> thrift server, it errors out for a decimal column:
> {code}
> Select decimal_column from table
> WARN  2015-10-09 15:04:00 
> org.apache.hive.service.cli.thrift.ThriftCLIService: Error fetching results: 
> java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal
>   at 
> org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:174) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:144)
>  ~[spark-hive-thriftserver_2.10-1.4.1.1.jar:1.4.1.1]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>  [hive-service-0.13.1a.jar:4.8.1-SNAPSHOT]
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>  [libthrift-0.9.2.jar:0.9.2]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
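For context, a minimal sketch of the type mismatch behind the trace: Spark hands back java.math.BigDecimal values, while the Hive 0.13 ColumnValue.toTColumnValue path expects org.apache.hadoop.hive.common.type.HiveDecimal. A conversion along the following lines would avoid the cast failure; the helper name is hypothetical and this is not the attached patch or the linked pull request.

{code}
import java.math.BigDecimal
import org.apache.hadoop.hive.common.type.HiveDecimal

// Hypothetical helper: wrap decimal values before they reach the Hive row set.
def toHiveValue(v: Any): Any = v match {
  case d: BigDecimal => HiveDecimal.create(d) // the type ColumnValue can handle
  case other         => other
}
{code}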






[jira] [Assigned] (SPARK-11067) Spark SQL thrift server fails to handle decimal value

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11067:


Assignee: Apache Spark

> Spark SQL thrift server fails to handle decimal value
> -
>
> Key: SPARK-11067
> URL: https://issues.apache.org/jira/browse/SPARK-11067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
>Assignee: Apache Spark
> Attachments: SPARK-11067.1.patch.txt
>
>
> When executing the following query through beeline connected to the Spark SQL 
> thrift server, it errors out for a decimal column:
> {code}
> Select decimal_column from table
> WARN  2015-10-09 15:04:00 
> org.apache.hive.service.cli.thrift.ThriftCLIService: Error fetching results: 
> java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal
>   at 
> org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:174) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:144)
>  ~[spark-hive-thriftserver_2.10-1.4.1.1.jar:1.4.1.1]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>  [hive-service-0.13.1a.jar:4.8.1-SNAPSHOT]
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>  [libthrift-0.9.2.jar:0.9.2]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}






[jira] [Commented] (SPARK-9443) Expose sampleByKey in SparkR

2015-10-13 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956179#comment-14956179
 ] 

Sun Rui commented on SPARK-9443:


Closing it as a duplicate of SPARK-10996.

> Expose sampleByKey in SparkR
> 
>
> Key: SPARK-9443
> URL: https://issues.apache.org/jira/browse/SPARK-9443
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Hossein Falaki
>
> There is pull request for DataFrames (I believe close to merging) that adds 
> sampleByKey. It would be great to expose it in SparkR.






[jira] [Closed] (SPARK-9443) Expose sampleByKey in SparkR

2015-10-13 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui closed SPARK-9443.
--
Resolution: Duplicate

> Expose sampleByKey in SparkR
> 
>
> Key: SPARK-9443
> URL: https://issues.apache.org/jira/browse/SPARK-9443
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Hossein Falaki
>
> There is pull request for DataFrames (I believe close to merging) that adds 
> sampleByKey. It would be great to expose it in SparkR.






[jira] [Closed] (SPARK-9302) Handle complex JSON types in collect()/head()

2015-10-13 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui closed SPARK-9302.
--
Resolution: Fixed

> Handle complex JSON types in collect()/head()
> -
>
> Key: SPARK-9302
> URL: https://issues.apache.org/jira/browse/SPARK-9302
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Sun Rui
>
> Reported on the mailing list by Exie:
> {noformat}
> A sample record in raw JSON looks like this:
> {"version": 1,"event": "view","timestamp": 1427846422377,"system":
> "DCDS","asset": "6404476","assetType": "myType","assetCategory":
> "myCategory","extras": [{"name": "videoSource","value": "mySource"},{"name":
> "playerType","value": "Article"},{"name": "duration","value":
> "202088"}],"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000","ipAddress":
> "165.69.2.4","title": "myTitle"}
> > head(mydf)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame
> >
> > show(mydf)
> DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, 
> assetType:string, event:string, 
> extras:array<struct<name:string,value:string>>, ipAddress:string, 
> memberId:string, system:string, timestamp:bigint, title:string, 
> trackingId:string, version:bigint]
> >
> {noformat}






[jira] [Commented] (SPARK-10382) Make example code in user guide testable

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956250#comment-14956250
 ] 

Apache Spark commented on SPARK-10382:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9109

> Make example code in user guide testable
> 
>
> Key: SPARK-10382
> URL: https://issues.apache.org/jira/browse/SPARK-10382
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "guide" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Just one way to implement this. It would be nice to hear more ideas.
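As a sketch of what an example file under spark/examples could look like with such marker comments; the // $guide on$ / // $guide off$ syntax below is purely hypothetical, since the ticket explicitly leaves the marker syntax open for discussion.

{code}
package org.apache.spark.examples.ml

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansExample"))
    val sqlContext = new SQLContext(sc)

    // $guide on$
    // Only the lines between the markers would be pulled into the user guide.
    val dataset = sqlContext.createDataFrame(Seq(
      (1, Vectors.dense(0.0, 0.0)),
      (2, Vectors.dense(1.0, 1.0)),
      (3, Vectors.dense(9.0, 8.0)),
      (4, Vectors.dense(8.0, 9.0))
    )).toDF("id", "features")
    val model = new KMeans().setK(2).setSeed(1L).fit(dataset)
    // $guide off$

    println(model.clusterCenters.mkString(", "))
    sc.stop()
  }
}
{code}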






[jira] [Assigned] (SPARK-10382) Make example code in user guide testable

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10382:


Assignee: Xusen Yin  (was: Apache Spark)

> Make example code in user guide testable
> 
>
> Key: SPARK-10382
> URL: https://issues.apache.org/jira/browse/SPARK-10382
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "guide" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Just one way to implement this. It would be nice to hear more ideas.






[jira] [Created] (SPARK-11095) Simplify Netty RPC implementation by using a separate thread pool for each endpoint

2015-10-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11095:
---

 Summary: Simplify Netty RPC implementation by using a separate 
thread pool for each endpoint
 Key: SPARK-11095
 URL: https://issues.apache.org/jira/browse/SPARK-11095
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu


The dispatcher and inbox classes of the current Netty-based RPC 
implementation are fairly complicated. The implementation uses a single, shared 
thread pool to execute all the endpoints. This is similar to how Akka does actor message 
dispatching. The benefit of this design is that this RPC implementation can 
support a very large number of endpoints, as they are all multiplexed into a 
single thread pool for execution. The downside is the complexity resulting from 
synchronization and coordination.

An alternative implementation is to have a separate message queue and thread 
pool for each endpoint. The dispatcher simply routes the messages to the 
appropriate message queue, and the threads poll the queue for messages to 
process.

If the endpoint is single threaded, then the thread pool should contain only a 
single thread. If the endpoint supports concurrent execution, then the thread 
pool should contain more threads.

Two additional things we need to be careful with are:

1. An endpoint should only process normal messages after OnStart has been 
called. This can be done by having the thread that starts the endpoint process 
OnStart first.

2. An endpoint should process OnStop only after all normal messages have been 
processed. I think this can be done by spinning in a busy loop until the 
message queue is empty.
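A minimal, single-threaded sketch of that alternative (the names Endpoint, EndpointInbox and Dispatcher are illustrative, not Spark's actual classes): each endpoint owns its queue and thread, OnStart runs on that thread before any normal message, and the dispatcher does nothing but route.

{code}
import java.util.concurrent.{Executors, LinkedBlockingQueue}
import scala.collection.concurrent.TrieMap

trait Endpoint {
  def onStart(): Unit = ()
  def receive(message: Any): Unit
}

// One queue and one thread per endpoint; a concurrent endpoint would simply
// get a larger pool (and more polling loops) here.
final class EndpointInbox(endpoint: Endpoint) {
  private val queue = new LinkedBlockingQueue[Any]()
  private val pool  = Executors.newSingleThreadExecutor()

  def start(): Unit = pool.execute(new Runnable {
    override def run(): Unit = {
      endpoint.onStart()                 // OnStart is processed before normal messages
      while (!Thread.currentThread().isInterrupted) {
        endpoint.receive(queue.take())   // poll this endpoint's own queue
      }
    }
  })

  def post(message: Any): Unit = queue.put(message)
}

final class Dispatcher {
  private val inboxes = TrieMap.empty[String, EndpointInbox]

  def register(name: String, endpoint: Endpoint): Unit = {
    val inbox = new EndpointInbox(endpoint)
    inboxes.put(name, inbox)
    inbox.start()
  }

  // Routing messages to the right queue is the dispatcher's only job.
  def send(name: String, message: Any): Unit = inboxes(name).post(message)
}
{code}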










[jira] [Assigned] (SPARK-11094) Test runner script fails to parse Java version.

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11094:


Assignee: (was: Apache Spark)

> Test runner script fails to parse Java version.
> ---
>
> Key: SPARK-11094
> URL: https://issues.apache.org/jira/browse/SPARK-11094
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
> Environment: Debian testing
>Reporter: Jakob Odersky
>Priority: Minor
>
> Running {{dev/run-tests}} fails when the local Java version has an extra 
> string appended to the version.
> For example, in Debian Stretch (currently testing distribution), {{java 
> -version}} yields "1.8.0_66-internal" where the extra part "-internal" causes 
> the script to fail.
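For illustration only, a small Scala sketch of version parsing that tolerates such vendor suffixes; the actual runner script is Python, so this merely demonstrates the parsing idea rather than the fix itself.

{code}
// "1.8.0_66-internal" -> (1, 8, 0, Some(66)); any trailing vendor suffix is ignored.
val VersionPattern = """(\d+)\.(\d+)\.(\d+)(?:_(\d+))?.*""".r

def parseJavaVersion(raw: String): (Int, Int, Int, Option[Int]) = raw.trim match {
  case VersionPattern(major, minor, patch, update) =>
    (major.toInt, minor.toInt, patch.toInt, Option(update).map(_.toInt))
  case other =>
    sys.error(s"Could not parse Java version from: $other")
}

println(parseJavaVersion("1.8.0_66-internal"))
{code}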






[jira] [Commented] (SPARK-11094) Test runner script fails to parse Java version.

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956288#comment-14956288
 ] 

Apache Spark commented on SPARK-11094:
--

User 'jodersky' has created a pull request for this issue:
https://github.com/apache/spark/pull/9111

> Test runner script fails to parse Java version.
> ---
>
> Key: SPARK-11094
> URL: https://issues.apache.org/jira/browse/SPARK-11094
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
> Environment: Debian testing
>Reporter: Jakob Odersky
>Priority: Minor
>
> Running {{dev/run-tests}} fails when the local Java version has an extra 
> string appended to the version.
> For example, in Debian Stretch (currently testing distribution), {{java 
> -version}} yields "1.8.0_66-internal" where the extra part "-internal" causes 
> the script to fail.






[jira] [Commented] (SPARK-9694) Add random seed Param to Scala CrossValidator

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956234#comment-14956234
 ] 

Apache Spark commented on SPARK-9694:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9108

> Add random seed Param to Scala CrossValidator
> -
>
> Key: SPARK-9694
> URL: https://issues.apache.org/jira/browse/SPARK-9694
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>







[jira] [Assigned] (SPARK-9694) Add random seed Param to Scala CrossValidator

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9694:
---

Assignee: (was: Apache Spark)

> Add random seed Param to Scala CrossValidator
> -
>
> Key: SPARK-9694
> URL: https://issues.apache.org/jira/browse/SPARK-9694
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>







[jira] [Assigned] (SPARK-9694) Add random seed Param to Scala CrossValidator

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9694:
---

Assignee: Apache Spark

> Add random seed Param to Scala CrossValidator
> -
>
> Key: SPARK-9694
> URL: https://issues.apache.org/jira/browse/SPARK-9694
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Resolved] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors

2015-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10981.
---
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.2

Resolved by https://github.com/apache/spark/pull/9029

> R semijoin leads to Java errors, R leftsemi leads to Spark errors
> -
>
> Key: SPARK-10981
> URL: https://issues.apache.org/jira/browse/SPARK-10981
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.5.0
> Environment: SparkR from RStudio on Macbook
>Reporter: Monica Liu
>Priority: Minor
>  Labels: easyfix, newbie
> Fix For: 1.5.2, 1.6.0
>
>
> I am using SparkR from RStudio, and I ran into an error with the join 
> function that I recreated with a smaller example:
> {code:title=joinTest.R|borderStyle=solid}
> Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init("local[4]")
> sqlContext <- sparkRSQL.init(sc) 
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df1= createDataFrame(sqlContext, df)
> showDF(df1)
> x = c(2, 3, 10)
> t = c("dd", "ee", "ff")
> c = c(FALSE, FALSE, TRUE)
> dff = data.frame(x, t, c)
> df2 = createDataFrame(sqlContext, dff)
> showDF(df2)
> res = join(df1, df2, df1$n == df2$x, "semijoin")
> showDF(res)
> {code}
> Running this code, I encountered the error:
> {panel}
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. 
> Supported join types include: 'inner', 'outer', 'full', 'fullouter', 
> 'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
> {panel}
> However, if I changed the joinType to "leftsemi", 
> {code}
> res = join(df1, df2, df1$n == df2$x, "leftsemi")
> {code}
> I would get the error:
> {panel}
> Error in .local(x, y, ...) : 
>   joinType must be one of the following types: 'inner', 'outer', 
> 'left_outer', 'right_outer', 'semijoin'
> {panel}
> Since the join function in R appears to invoke a Java method, I went into 
> DataFrame.R and changed lines 1374 and 1378, replacing "semijoin" with 
> "leftsemi" to match the Java function's parameters. This also makes the 
> accepted R joinType values match those of Scala. 
> semijoin:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "semijoin")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
> }
> {code}
> leftsemi:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "leftsemi")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
> }
> {code}
> This fixed the issue, but I'm not sure whether this solution breaks Hive 
> compatibility or causes other issues. I can submit a pull request to change 
> this.






[jira] [Commented] (SPARK-9302) Handle complex JSON types in collect()/head()

2015-10-13 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956237#comment-14956237
 ] 

Sun Rui commented on SPARK-9302:


This was fixed once support for complex types in DataFrames was added.

> Handle complex JSON types in collect()/head()
> -
>
> Key: SPARK-9302
> URL: https://issues.apache.org/jira/browse/SPARK-9302
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Sun Rui
>
> Reported on the mailing list by Exie:
> {noformat}
> A sample record in raw JSON looks like this:
> {"version": 1,"event": "view","timestamp": 1427846422377,"system":
> "DCDS","asset": "6404476","assetType": "myType","assetCategory":
> "myCategory","extras": [{"name": "videoSource","value": "mySource"},{"name":
> "playerType","value": "Article"},{"name": "duration","value":
> "202088"}],"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000","ipAddress":
> "165.69.2.4","title": "myTitle"}
> > head(mydf)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame
> >
> > show(mydf)
> DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, 
> assetType:string, event:string, 
> extras:array<struct<name:string,value:string>>, ipAddress:string, 
> memberId:string, system:string, timestamp:bigint, title:string, 
> trackingId:string, version:bigint]
> >
> {noformat}






[jira] [Commented] (SPARK-10055) San Francisco Crime Classification

2015-10-13 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956276#comment-14956276
 ] 

Xusen Yin commented on SPARK-10055:
---

Yes, I will find a new dataset soon and ping you on JIRA.

> San Francisco Crime Classification
> --
>
> Key: SPARK-10055
> URL: https://issues.apache.org/jira/browse/SPARK-10055
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>
> Apply ML pipeline API to San Francisco Crime Classification 
> (https://www.kaggle.com/c/sf-crime).






[jira] [Assigned] (SPARK-11094) Test runner script fails to parse Java version.

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11094:


Assignee: Apache Spark

> Test runner script fails to parse Java version.
> ---
>
> Key: SPARK-11094
> URL: https://issues.apache.org/jira/browse/SPARK-11094
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
> Environment: Debian testing
>Reporter: Jakob Odersky
>Assignee: Apache Spark
>Priority: Minor
>
> Running {{dev/run-tests}} fails when the local Java version has an extra 
> string appended to the version.
> For example, in Debian Stretch (currently testing distribution), {{java 
> -version}} yields "1.8.0_66-internal" where the extra part "-internal" causes 
> the script to fail.






[jira] [Resolved] (SPARK-10996) Implement sampleBy() in DataFrameStatFunctions

2015-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10996.
---
   Resolution: Fixed
 Assignee: Sun Rui
Fix Version/s: 1.6.0

Resolved by https://github.com/apache/spark/pull/9023

> Implement sampleBy() in DataFrameStatFunctions
> --
>
> Key: SPARK-10996
> URL: https://issues.apache.org/jira/browse/SPARK-10996
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-9338) Aliases from SELECT not available in GROUP BY

2015-10-13 Thread fang fang chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956196#comment-14956196
 ] 

fang fang chen commented on SPARK-9338:
---

I also encountered this issue. The SQL is very simple:
select id as id_test from user group by id_test limit 100
The error is:
cannot resolve 'id_test' given input columns ..., id,  ...;

> Aliases from SELECT not available in GROUP BY
> -
>
> Key: SPARK-9338
> URL: https://issues.apache.org/jira/browse/SPARK-9338
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Observed on Mac OS X and Ubuntu 14.04
>Reporter: James Aley
>
> It feels like this should really be a known issue, but I've not been able to 
> find any mailing list or JIRA tickets for exactly this. There are a few 
> closed/resolved tickets about specific types of exceptions, but I couldn't 
> find this exact problem, so apologies if this is a dupe!
> Spark SQL doesn't appear to support referencing aliases from a SELECT in the 
> GROUP BY part of the query. This is confusing our analysts, as it works in 
> most other tools they use. Here's an example to reproduce:
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val schema =
>   StructType(
> StructField("x", IntegerType, nullable=false) ::
> StructField("y",
>   StructType(StructField("a", DoubleType, nullable=false) :: Nil),
>   nullable=false) :: Nil)
> val rdd = sc.parallelize(
>   Row(1, Row(1.0)) :: Row(2, Row(1.34)) :: Row(3, Row(2.3)) :: Row(4, 
> Row(2.5)) :: Nil)
> val df = sqlContext.createDataFrame(rdd, schema)
> // DataFrame content looks like this:
> // x   y
> // 1   {a: 1.0}
> // 2   {a: 1.34}
> // 3   {a: 2.3}
> // 4   {a: 2.5}
> df.registerTempTable("test_data")
> sqlContext.udf.register("roundToInt", (x: Double) => x.toInt)
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data 
> GROUP BY grp").show()
> // => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given 
> input columns x, y
> sqlContext.sql("SELECT y.a as grp, SUM(x) as s FROM test_data GROUP BY 
> grp").show()
> // => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given 
> input columns x, y;
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(y.a) as s FROM test_data 
> GROUP BY roundToInt(y.a)").show()
> // =>
> // +---+----+
> // |grp|   s|
> // +---+----+
> // |  1|2.34|
> // |  2| 4.8|
> // +---+----+
> {code}
> As you can see, it's particularly inconvenient when using UDFs on nested 
> fields, as it means repeating some potentially complex expressions. It's very 
> common for us to want to make a date type conversion (from epoch milliseconds 
> or something) from some nested field, then reference it in multiple places in 
> the query. This issue makes for quite verbose queries. 
> Might it also mean that we're mapping these functions over the data twice? I 
> can't quite tell from the explain output whether that's been optimised out or 
> not, but here it is for somebody who understands :-)
> {code}
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data 
> GROUP BY roundToInt(y.a)").explain()
> // == Physical Plan ==
> // Aggregate false, [PartialGroup#126], [PartialGroup#126 AS 
> grp#116,CombineSum(PartialSum#125L) AS s#117L]
> // Exchange (HashPartitioning 200)
> // Aggregate true, [scalaUDF(y#7.a)], [scalaUDF(y#7.a) AS 
> PartialGroup#126,SUM(CAST(x#6, LongType)) AS PartialSum#125L]
> // PhysicalRDD [x#6,y#7], MapPartitionsRDD[10] at createDataFrame at <console>:31
> {code}






[jira] [Assigned] (SPARK-10382) Make example code in user guide testable

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10382:


Assignee: Apache Spark  (was: Xusen Yin)

> Make example code in user guide testable
> 
>
> Key: SPARK-10382
> URL: https://issues.apache.org/jira/browse/SPARK-10382
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "guide" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Just one way to implement this. It would be nice to hear more ideas.






[jira] [Assigned] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11092:


Assignee: (was: Apache Spark)

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.
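A sketch of the kind of one-line change this could be, assuming the docs are generated with the sbt-unidoc plugin and scaladoc's -doc-source-url option; the key names below are assumptions and the ${FILE_PATH} placeholder is taken verbatim from the description above, so this is not the actual Spark build edit.

{code}
// In the sbt build, with sbt-unidoc's ScalaUnidoc/unidoc keys in scope.
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-doc-source-url",
  "https://github.com/apache/spark/tree/v" + version.value + "/${FILE_PATH}"
)
{code}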






[jira] [Commented] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956270#comment-14956270
 ] 

Apache Spark commented on SPARK-11092:
--

User 'jodersky' has created a pull request for this issue:
https://github.com/apache/spark/pull/9110

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.






[jira] [Assigned] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11092:


Assignee: Apache Spark

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Assignee: Apache Spark
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.






[jira] [Commented] (SPARK-10935) Avito Context Ad Clicks

2015-10-13 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956279#comment-14956279
 ] 

Xusen Yin commented on SPARK-10935:
---

[~mengxr] [~kpl...@gmail.com] Would you still love to work on this? I'd be 
happy to split it with you if you want.

> Avito Context Ad Clicks
> ---
>
> Key: SPARK-10935
> URL: https://issues.apache.org/jira/browse/SPARK-10935
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> From [~kpl...@gmail.com]:
> I would love to do Avito Context Ad Clicks - 
> https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of 
> feature engineering and preprocessing. I would love to split this with 
> somebody else if anybody is interested in working on it.






[jira] [Created] (SPARK-11094) Test runner script fails to parse Java version.

2015-10-13 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-11094:
-

 Summary: Test runner script fails to parse Java version.
 Key: SPARK-11094
 URL: https://issues.apache.org/jira/browse/SPARK-11094
 Project: Spark
  Issue Type: Bug
  Components: Tests
 Environment: Debian testing
Reporter: Jakob Odersky
Priority: Minor


Running {{dev/run-tests}} fails when the local Java version has an extra string 
appended to the version.
For example, in Debian Stretch (currently testing distribution), {{java 
-version}} yields "1.8.0_66-internal" where the extra part "-internal" causes 
the script to fail.






[jira] [Updated] (SPARK-11078) Ensure spilling tests are actually spilling

2015-10-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11078:
--
Description: 
The new unified memory management model in SPARK-10983 uncovered many brittle 
tests that rely on arbitrary thresholds to detect spilling. Some tests don't 
even assert that spilling did occur.

We should go through all the places where we test spilling behavior and correct 
the tests, a subset of which are definitely incorrect. Potential suspects:

- UnsafeShuffleSuite
- ExternalAppendOnlyMapSuite
- ExternalSorterSuite
- SQLQuerySuite
- DistributedSuite

  was:
The new unified memory management model in SPARK-10983 uncovered many brittle 
tests that rely on arbitrary thresholds to detect spilling. Some tests don't 
even assert that spilling did occur.

We should go through all the places where we test spilling behavior and correct 
the tests, a subset of which are definitely incorrect. Potential suspects:

- UnsafeShuffleSuite
- ExternalAppendOnlyMapSuite
- ExternalSorterSuite
- SQLQuerySuite


> Ensure spilling tests are actually spilling
> ---
>
> Key: SPARK-11078
> URL: https://issues.apache.org/jira/browse/SPARK-11078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Andrew Or
>
> The new unified memory management model in SPARK-10983 uncovered many brittle 
> tests that rely on arbitrary thresholds to detect spilling. Some tests don't 
> even assert that spilling did occur.
> We should go through all the places where we test spilling behavior and 
> correct the tests, a subset of which are definitely incorrect. Potential 
> suspects:
> - UnsafeShuffleSuite
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - SQLQuerySuite
> - DistributedSuite
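One way to make such tests assert spilling directly instead of relying on thresholds is sketched below. It assumes only the public SparkListener API and the memoryBytesSpilled/diskBytesSpilled task metrics; the assertSpilled helper is hypothetical, not the planned change.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates spill metrics reported by finished tasks.
class SpillListener extends SparkListener {
  @volatile var bytesSpilled = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) bytesSpilled += m.memoryBytesSpilled + m.diskBytesSpilled
  }
}

// Hypothetical test helper: run a job and require that it actually spilled.
def assertSpilled(sc: SparkContext)(body: => Unit): Unit = {
  val listener = new SpillListener
  sc.addSparkListener(listener)
  body
  // A real suite would drain the listener bus before asserting.
  assert(listener.bytesSpilled > 0, "expected the job to spill, but no spill was recorded")
}
{code}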






[jira] [Commented] (SPARK-11079) Review Netty based RPC implementation

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954531#comment-14954531
 ] 

Apache Spark commented on SPARK-11079:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9091

> Review Netty based RPC implementation
> -
>
> Key: SPARK-11079
> URL: https://issues.apache.org/jira/browse/SPARK-11079
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This is a task for Reynold to review the existing implementation done by 
> [~shixi...@databricks.com].






[jira] [Assigned] (SPARK-11079) Review Netty based RPC implementation

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11079:


Assignee: Apache Spark  (was: Reynold Xin)

> Review Netty based RPC implementation
> -
>
> Key: SPARK-11079
> URL: https://issues.apache.org/jira/browse/SPARK-11079
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> This is a task for Reynold to review the existing implementation done by 
> [~shixi...@databricks.com].






[jira] [Commented] (SPARK-6230) Provide authentication and encryption for Spark's RPC

2015-10-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954523#comment-14954523
 ] 

Patrick Wendell commented on SPARK-6230:


Should we update Spark's documentation to explain this? I think at present it 
only discusses encrypted RPC via Akka, but this will be the new recommended way 
to encrypt RPC.

> Provide authentication and encryption for Spark's RPC
> -
>
> Key: SPARK-6230
> URL: https://issues.apache.org/jira/browse/SPARK-6230
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Marcelo Vanzin
>
> Make sure the RPC layer used by Spark supports the auth and encryption 
> features of the network/common module.
> This kinda ignores akka; adding support for SASL to akka, while possible, 
> seems to be at odds with the direction being taken in Spark, so let's 
> restrict this to the new RPC layer.






[jira] [Assigned] (SPARK-11079) Review Netty based RPC implementation

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11079:


Assignee: Reynold Xin  (was: Apache Spark)

> Review Netty based RPC implementation
> -
>
> Key: SPARK-11079
> URL: https://issues.apache.org/jira/browse/SPARK-11079
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This is a task for Reynold to review the existing implementation done by 
> [~shixi...@databricks.com].






[jira] [Created] (SPARK-11080) NamedExpression.newExprId should only be called on driver

2015-10-13 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-11080:
--

 Summary: NamedExpression.newExprId should only be called on driver
 Key: SPARK-11080
 URL: https://issues.apache.org/jira/browse/SPARK-11080
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


My understanding of {{NamedExpression.newExprId}} is that it is only intended 
to be called on the driver. If it is called on executors, then this may lead to 
scenarios where the same expression id is re-used in two different 
NamedExpressions.

More generally, I think that calling {{NamedExpression.newExprId}} within tasks 
may be an indicator of potential attribute binding bugs. Therefore, I think 
that we should prevent {{NamedExpression.newExprId}} from being called inside 
of tasks by throwing an exception when such calls occur. 
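A minimal sketch of such a guard, assuming only that TaskContext.get() returns null on the driver; the ExprIds object and its counter are stand-ins for the real expression-id allocator, not the actual Spark change.

{code}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.TaskContext

object ExprIds {
  private val counter = new AtomicLong(0L) // stand-in for the real id allocator

  def newExprId(): Long = {
    // TaskContext.get() is non-null only inside a running task, i.e. on executors.
    if (TaskContext.get() != null) {
      throw new IllegalStateException(
        "NamedExpression.newExprId should only be called on the driver")
    }
    counter.getAndIncrement()
  }
}
{code}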






[jira] [Created] (SPARK-11081) Shade Jersey dependency to work around the compatibility issue with Jersey2

2015-10-13 Thread Mingyu Kim (JIRA)
Mingyu Kim created SPARK-11081:
--

 Summary: Shade Jersey dependency to work around the compatibility 
issue with Jersey2
 Key: SPARK-11081
 URL: https://issues.apache.org/jira/browse/SPARK-11081
 Project: Spark
  Issue Type: Bug
Reporter: Mingyu Kim


As seen from this thread 
(https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62yD8H3=2KVMiFs7NZjn929oJ133JkPLrNEj=vrx-d2...@mail.gmail.com%3E),
 Spark is incompatible with Jersey 2, especially when Spark is embedded in an 
application running with Jersey.






[jira] [Updated] (SPARK-10582) Using dynamic executor allocation, if the AM fails, a new AM is started, but the new AM does not allocate executors to the driver

2015-10-13 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-10582:
--
Affects Version/s: 1.5.1

> Using dynamic executor allocation, if the AM fails, a new AM is started, but 
> the new AM does not allocate executors to the driver
> -
>
> Key: SPARK-10582
> URL: https://issues.apache.org/jira/browse/SPARK-10582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.1
>Reporter: KaiXinXIaoLei
>
> While tasks are running, suppose the total number of executors has reached 
> spark.dynamicAllocation.maxExecutors and the AM fails; a new AM is then 
> started. Because the total number of executors tracked in 
> ExecutorAllocationManager has not changed, the driver does not send 
> RequestExecutors to the AM to ask for executors, and the new AM falls back to 
> spark.dynamicAllocation.initialExecutors. So the total number of executors 
> known to the driver and to the AM is different.






[jira] [Created] (SPARK-11078) Ensure spilling tests are actually spilling

2015-10-13 Thread Andrew Or (JIRA)
Andrew Or created SPARK-11078:
-

 Summary: Ensure spilling tests are actually spilling
 Key: SPARK-11078
 URL: https://issues.apache.org/jira/browse/SPARK-11078
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Tests
Reporter: Andrew Or


The new unified memory management model in SPARK-10983 uncovered many brittle 
tests that rely on arbitrary thresholds to detect spilling. Some tests don't 
even assert that spilling did occur.

We should go through all the places where we test spilling behavior and correct 
the tests, a subset of which are definitely incorrect. Potential suspects:

- UnsafeShuffleSuite
- ExternalAppendOnlyMapSuite
- ExternalSorterSuite
- SQLQuerySuite






[jira] [Updated] (SPARK-5293) Enable Spark user applications to use different versions of Akka

2015-10-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5293:
---
Target Version/s: 2+  (was: 1.6.0)

> Enable Spark user applications to use different versions of Akka
> 
>
> Key: SPARK-5293
> URL: https://issues.apache.org/jira/browse/SPARK-5293
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>
> A lot of Spark user applications are using (or want to use) Akka. Akka as a 
> whole can contribute great architectural simplicity and uniformity. However, 
> because Spark depends on Akka, it is not possible for users to rely on 
> different versions, and we have received many requests in the past asking for 
> help about this specific issue. For example, Spark Streaming might be used as 
> the receiver of Akka messages - but our dependency on Akka requires the 
> upstream Akka actors to also use the identical version of Akka.
> Since our usage of Akka is limited (mainly for RPC and single-threaded event 
> loop), we can replace it with alternative RPC implementations and a common 
> event loop in Spark.






[jira] [Created] (SPARK-11079) Review Netty based RPC implementation

2015-10-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11079:
---

 Summary: Review Netty based RPC implementation
 Key: SPARK-11079
 URL: https://issues.apache.org/jira/browse/SPARK-11079
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin


This is a task for Reynold to review the existing implementation done by 
[~shixi...@databricks.com].






[jira] [Updated] (SPARK-7995) Move AkkaRpcEnv to a separate project and remove Akka from the dependencies of Core

2015-10-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7995:
---
Target Version/s: 2+  (was: )

> Move AkkaRpcEnv to a separate project and remove Akka from the dependencies 
> of Core
> ---
>
> Key: SPARK-7995
> URL: https://issues.apache.org/jira/browse/SPARK-7995
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Shixiong Zhu
>







[jira] [Assigned] (SPARK-11082) Cores per executor is wrong when response vcore number is less than requested number

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11082:


Assignee: (was: Apache Spark)

> Cores per executor is wrong when response vcore number is less than requested 
> number
> 
>
> Key: SPARK-11082
> URL: https://issues.apache.org/jira/browse/SPARK-11082
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> When DefaultResourceCalculator is set (the default) for the YARN capacity 
> scheduler, the vcore number in the response container resource is always 1, 
> which may be less than the requested vcore number. ExecutorRunnable should 
> honor this returned vcore number (not the requested number) when passing 
> cores to each executor. Otherwise, the actually allocated vcore number 
> differs from the CPU cores per executor that Spark believes it manages.
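To illustrate the proposed behaviour, here is a minimal sketch (helper name and
structure assumed; this is not the actual ExecutorRunnable code): the executor's
core count is taken from the vcores YARN actually granted, falling back to the
requested value only if the allocation reports nothing usable.

{code}
// Sketch only: assumed helper, not Spark's ExecutorRunnable.
def effectiveExecutorCores(requestedCores: Int, allocatedVcores: Int): Int =
  if (allocatedVcores > 0) allocatedVcores else requestedCores

// With YARN's DefaultResourceCalculator the granted container typically reports 1 vcore,
// so the executor should be told to use 1 core rather than the 4 that were requested.
println(effectiveExecutorCores(requestedCores = 4, allocatedVcores = 1))  // prints 1
{code}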



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11082) Cores per executor is wrong when response vcore number is less than requested number

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11082:


Assignee: Apache Spark

> Cores per executor is wrong when response vcore number is less than requested 
> number
> 
>
> Key: SPARK-11082
> URL: https://issues.apache.org/jira/browse/SPARK-11082
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>
> When DefaultResourceCalculator is set (the default) for the YARN capacity 
> scheduler, the vcore number in the response container resource is always 1, 
> which may be less than the requested vcore number. ExecutorRunnable should 
> honor this returned vcore number (not the requested number) when passing 
> cores to each executor. Otherwise, the actually allocated vcore number 
> differs from the CPU cores per executor that Spark believes it manages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11036) AttributeReference should not be created outside driver

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11036:


Assignee: Apache Spark

> AttributeReference should not be created outside driver
> ---
>
> Key: SPARK-11036
> URL: https://issues.apache.org/jira/browse/SPARK-11036
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> If an AttributeReference is created on an executor, its id could be the same 
> as one created on the driver. We should have a way to prevent that.
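One possible way to enforce this, sketched here with assumed helper names (not
actual Catalyst code): use the fact that {{TaskContext.get()}} returns a non-null
value only inside a running task, and fail fast when an id would be minted on an
executor.

{code}
import org.apache.spark.TaskContext

// Sketch only: assumed names, not Spark's NamedExpression/AttributeReference code.
object DriverOnly {
  def assertOnDriver(op: String): Unit = {
    // TaskContext.get() is non-null only inside a task, i.e. on an executor.
    if (TaskContext.get() != null) {
      throw new IllegalStateException(s"$op should only be called on the driver")
    }
  }
}

object ExprIds {
  private val curId = new java.util.concurrent.atomic.AtomicLong()
  def newExprId(): Long = {
    DriverOnly.assertOnDriver("newExprId")
    curId.getAndIncrement()
  }
}
{code}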



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11036) AttributeReference should not be created outside driver

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954590#comment-14954590
 ] 

Apache Spark commented on SPARK-11036:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9094

> AttributeReference should not be created outside driver
> ---
>
> Key: SPARK-11036
> URL: https://issues.apache.org/jira/browse/SPARK-11036
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> If an AttributeReference is created on an executor, its id could be the same 
> as one created on the driver. We should have a way to prevent that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11036) AttributeReference should not be created outside driver

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11036:


Assignee: (was: Apache Spark)

> AttributeReference should not be created outside driver
> ---
>
> Key: SPARK-11036
> URL: https://issues.apache.org/jira/browse/SPARK-11036
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> If an AttributeReference is created on an executor, its id could be the same 
> as one created on the driver. We should have a way to prevent that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11082) Cores per executor is wrong when response vcore number is less than requested number

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954629#comment-14954629
 ] 

Apache Spark commented on SPARK-11082:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/9095

> Cores per executor is wrong when response vcore number is less than requested 
> number
> 
>
> Key: SPARK-11082
> URL: https://issues.apache.org/jira/browse/SPARK-11082
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> When DefaultResourceCalculator is set (the default) for the YARN capacity 
> scheduler, the vcore number in the response container resource is always 1, 
> which may be less than the requested vcore number. ExecutorRunnable should 
> honor this returned vcore number (not the requested number) when passing 
> cores to each executor. Otherwise, the actually allocated vcore number 
> differs from the CPU cores per executor that Spark believes it manages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10582) Using dynamic executor allocation, if the AM fails a new AM is started, but the new AM does not allocate executors to the driver

2015-10-13 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-10582:
--
Description: While tasks are running, suppose the total number of executors has 
reached spark.dynamicAllocation.maxExecutors and the AM fails; a new AM then 
restarts. Because the total number of executors tracked in 
ExecutorAllocationManager has not changed, the driver does not send a 
RequestExecutors message to the new AM to ask for executors. The new AM 
therefore uses spark.dynamicAllocation.initialExecutors, so the total number of 
executors known to the driver and to the AM differ.  (was: using 
spark-dynamic-executor-allocation, if AM failed during running task, the new AM 
will be started. But the new AM does not allocate executors for driver.)

> Using dynamic executor allocation, if the AM fails a new AM is started, but 
> the new AM does not allocate executors to the driver
> -
>
> Key: SPARK-10582
> URL: https://issues.apache.org/jira/browse/SPARK-10582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
>
> While tasks are running, suppose the total number of executors has reached 
> spark.dynamicAllocation.maxExecutors and the AM fails; a new AM then restarts. 
> Because the total number of executors tracked in ExecutorAllocationManager has 
> not changed, the driver does not send a RequestExecutors message to the new AM 
> to ask for executors. The new AM therefore uses 
> spark.dynamicAllocation.initialExecutors, so the total number of executors 
> known to the driver and to the AM differ.
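As a rough illustration of the missing step (names assumed; this is not actual
Spark code), the idea would be for the driver to re-send its current executor
target whenever a replacement AM registers, instead of letting the new AM start
from spark.dynamicAllocation.initialExecutors:

{code}
// Toy sketch with assumed names.
case class RequestExecutors(requestedTotal: Int)

class DriverAllocationState(var targetNumExecutors: Int) {
  // Invoked when an ApplicationMaster (re)registers with the driver.
  def onAmRegistered(sendToAm: RequestExecutors => Unit): Unit =
    sendToAm(RequestExecutors(targetNumExecutors))  // keep driver and AM in sync
}

// The driver had already ramped up to maxExecutors before the old AM died:
val driver = new DriverAllocationState(targetNumExecutors = 50)
driver.onAmRegistered(msg => println(s"new AM resynced to ${msg.requestedTotal} executors"))
{code}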



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7402) JSON serialization of params

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954493#comment-14954493
 ] 

Apache Spark commented on SPARK-7402:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/9090

> JSON serialization of params
> 
>
> Key: SPARK-7402
> URL: https://issues.apache.org/jira/browse/SPARK-7402
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> Add JSON support to Param in order to persist parameters with transformers, 
> estimators, and models.
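For illustration, a minimal sketch using json4s (already a Spark dependency); the
class and param names below are examples, not the eventual persistence format:

{code}
import org.json4s.DefaultFormats
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Encode a param map as JSON next to the owning class name...
val paramsJson = compact(render(
  ("class" -> "org.apache.spark.ml.classification.LogisticRegression") ~
    ("paramMap" -> (("maxIter" -> 10) ~ ("regParam" -> 0.01)))))
println(paramsJson)
// {"class":"org.apache.spark.ml.classification.LogisticRegression","paramMap":{"maxIter":10,"regParam":0.01}}

// ...and read a single param back when loading.
implicit val formats = DefaultFormats
val maxIter = (parse(paramsJson) \ "paramMap" \ "maxIter").extract[Int]  // 10
{code}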



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11069:


Assignee: (was: Apache Spark)

> Add RegexTokenizer option to convert to lowercase
> -
>
> Key: SPARK-11069
> URL: https://issues.apache.org/jira/browse/SPARK-11069
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer 
> does not.  It would be nice to add an option to RegexTokenizer to convert to 
> lowercase.  Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat 
> upper/lower case differently.
> --> I'd vote for conversion before matching.  If a user needs full control, 
> they can convert to lowercase manually.
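A standalone sketch of the proposed behaviour (the parameter name {{toLowercase}}
follows the proposal above; this is not the actual RegexTokenizer implementation).
Lowercasing happens before the regex is applied, matching the "before" option:

{code}
import java.util.regex.Pattern

// Sketch only: a plain function, not the ML Transformer.
def regexTokenize(text: String,
                  pattern: String = "\\s+",
                  toLowercase: Boolean = true): Seq[String] = {
  val input = if (toLowercase) text.toLowerCase else text
  Pattern.compile(pattern).split(input).filter(_.nonEmpty).toSeq
}

println(regexTokenize("Spark ML RegexTokenizer"))        // List(spark, ml, regextokenizer)
println(regexTokenize("Spark ML", toLowercase = false))  // List(Spark, ML)
{code}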



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11069:


Assignee: Apache Spark

> Add RegexTokenizer option to convert to lowercase
> -
>
> Key: SPARK-11069
> URL: https://issues.apache.org/jira/browse/SPARK-11069
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer 
> does not.  It would be nice to add an option to RegexTokenizer to convert to 
> lowercase.  Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat 
> upper/lower case differently.
> --> I'd vote for conversion before matching.  If a user needs full control, 
> they can convert to lowercase manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954536#comment-14954536
 ] 

Apache Spark commented on SPARK-11069:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/9092

> Add RegexTokenizer option to convert to lowercase
> -
>
> Key: SPARK-11069
> URL: https://issues.apache.org/jira/browse/SPARK-11069
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer 
> does not.  It would be nice to add an option to RegexTokenizer to convert to 
> lowercase.  Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat 
> upper/lower case differently.
> --> I'd vote for conversion before matching.  If a user needs full control, 
> they can convert to lowercase manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11051) NullPointerException when action called on localCheckpointed RDD (that was checkpointed before)

2015-10-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11051:
--
Priority: Critical  (was: Major)

> NullPointerException when action called on localCheckpointed RDD (that was 
> checkpointed before)
> ---
>
> Key: SPARK-11051
> URL: https://issues.apache.org/jira/browse/SPARK-11051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: Spark version 1.6.0-SNAPSHOT built from the sources as 
> of today - Oct, 10th
>Reporter: Jacek Laskowski
>Priority: Critical
>
> While toying with {{RDD.checkpoint}} and {{RDD.localCheckpoint}} methods, the 
> following NullPointerException was thrown:
> {code}
> scala> lines.count
> java.lang.NullPointerException
>   at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
>   ... 48 elided
> {code}
> To reproduce the issue do the following:
> {code}
> $ ./bin/spark-shell
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
>       /_/
> Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val lines = sc.textFile("README.md")
> lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at 
> :24
> scala> sc.setCheckpointDir("checkpoints")
> scala> lines.checkpoint
> scala> lines.count
> res2: Long = 98
> scala> lines.localCheckpoint
> 15/10/10 22:59:20 WARN MapPartitionsRDD: RDD was already marked for reliable 
> checkpointing: overriding with local checkpoint.
> res4: lines.type = MapPartitionsRDD[1] at textFile at :24
> scala> lines.count
> java.lang.NullPointerException
>   at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
>   ... 48 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11080) NamedExpression.newExprId should only be called on driver

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11080:


Assignee: Apache Spark  (was: Josh Rosen)

> NamedExpression.newExprId should only be called on driver
> -
>
> Key: SPARK-11080
> URL: https://issues.apache.org/jira/browse/SPARK-11080
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> My understanding of {{NamedExpression.newExprId}} is that it is only intended 
> to be called on the driver. If it is called on executors, then this may lead 
> to scenarios where the same expression id is re-used in two different 
> NamedExpressions.
> More generally, I think that calling {{NamedExpression.newExprId}} within 
> tasks may be an indicator of potential attribute binding bugs. Therefore, I 
> think that we should prevent {{NamedExpression.newExprId}} from being called 
> inside of tasks by throwing an exception when such calls occur. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11080) NamedExpression.newExprId should only be called on driver

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11080:


Assignee: Josh Rosen  (was: Apache Spark)

> NamedExpression.newExprId should only be called on driver
> -
>
> Key: SPARK-11080
> URL: https://issues.apache.org/jira/browse/SPARK-11080
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> My understanding of {{NamedExpression.newExprId}} is that it is only intended 
> to be called on the driver. If it is called on executors, then this may lead 
> to scenarios where the same expression id is re-used in two different 
> NamedExpressions.
> More generally, I think that calling {{NamedExpression.newExprId}} within 
> tasks may be an indicator of potential attribute binding bugs. Therefore, I 
> think that we should prevent {{NamedExpression.newExprId}} from being called 
> inside of tasks by throwing an exception when such calls occur. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11064) spark streaming checkpoint question

2015-10-13 Thread jason kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954677#comment-14954677
 ] 

jason kim commented on SPARK-11064:
---

Sorry, can you give me a link? I have some trouble opening Google from here.
Thank you so much.


I can't find the right solution here.
My code:
def main(args: Array[String]) {
  if (args.length < 2) {
    System.err.println("Usage: <brokers> <topics>")
    System.exit(1)
  }

  val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.foldLeft(0)(_ + _)
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }

  StreamingExamples.setStreamingLogLevels()

  val Array(brokers, topics) = args

  // Create context with 2 second batch interval
  val sparkConf = new SparkConf().setAppName("KafkaWordCountTest").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  ssc.checkpoint("checkpoint")

  // Create direct kafka stream with brokers and topics
  val topicsSet = topics.split(",").toSet
  val kafkaParams = Map[String, String](
    "metadata.broker.list" -> brokers,
    "serializer.class" -> "kafka.serializer.StringEncoder",
    "group.id" -> "spark-kafka-consumer")

  // update zookeeper offset
  val km = new KafkaManager(kafkaParams)
  val messages = km.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicsSet)

  // example 2:
  messages.transform(rdd => {
    km.updateZKOffsets(rdd)
    rdd
  }).map(x => (66, 1L))
    .reduceByKeyAndWindow(_ + _, _ - _, Seconds(10), Seconds(4))
    .print()

  // Start the computation
  ssc.start()
  ssc.awaitTermination()
}










> spark streaming checkpoint question
> ---
>
> Key: SPARK-11064
> URL: https://issues.apache.org/jira/browse/SPARK-11064
> Project: Spark
>  Issue Type: Question
>Affects Versions: 1.4.1
>Reporter: jason kim
> Fix For: 2+
>
>
> java.io.NotSerializableException: DStream checkpointing has been enabled but 
> the DStreams with their functions are not serializable
> Serialization stack:
>   at 
> org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:550)
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:587)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:586)
>   at 
> com.bj58.spark.streaming.KafkaWordCountTest$.main(KafkaWordCountTest.scala:70)
>   at 
> com.bj58.spark.streaming.KafkaWordCountTest.main(KafkaWordCountTest.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7796) Use the new RPC implementation by default

2015-10-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7796.

  Resolution: Fixed
Assignee: Shixiong Zhu
   Fix Version/s: 1.6.0
Target Version/s:   (was: )

> Use the new RPC implementation by default
> -
>
> Key: SPARK-7796
> URL: https://issues.apache.org/jira/browse/SPARK-7796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11047) Internal accumulators miss the internal flag when replaying events in the history server

2015-10-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11047:
--
Priority: Critical  (was: Major)

> Internal accumulators miss the internal flag when replaying events in the 
> history server
> 
>
> Key: SPARK-11047
> URL: https://issues.apache.org/jira/browse/SPARK-11047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Carson Wang
>Priority: Critical
>
> Internal accumulators don't write the internal flag to the event log, so on 
> the history server Web UI no accumulator is marked as internal. This causes an 
> incorrect peak execution memory value and an unwanted accumulator table to be 
> displayed on the stage page.
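A simplified sketch of the round trip that needs to preserve the flag (class and
field names assumed; not the exact JsonProtocol code):

{code}
import org.json4s.DefaultFormats
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Sketch only: a stand-in for the accumulable info written to the event log.
case class AccInfo(id: Long, name: String, value: String, internal: Boolean)

def toJson(a: AccInfo) =
  ("ID" -> a.id) ~ ("Name" -> a.name) ~ ("Value" -> a.value) ~ ("Internal" -> a.internal)

def fromJson(json: org.json4s.JValue): AccInfo = {
  implicit val formats = DefaultFormats
  AccInfo(
    (json \ "ID").extract[Long],
    (json \ "Name").extract[String],
    (json \ "Value").extract[String],
    (json \ "Internal").extractOrElse[Boolean](false))  // must not silently default on replay
}

val restored = fromJson(parse(compact(render(
  toJson(AccInfo(1L, "internal.metrics.peakExecutionMemory", "0", internal = true))))))
println(restored.internal)  // true
{code}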



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11051) NullPointerException when action called on localCheckpointed RDD (that was checkpointed before)

2015-10-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11051:
--
Affects Version/s: (was: 1.6.0)
   1.5.0
 Target Version/s: 1.5.2, 1.6.0

> NullPointerException when action called on localCheckpointed RDD (that was 
> checkpointed before)
> ---
>
> Key: SPARK-11051
> URL: https://issues.apache.org/jira/browse/SPARK-11051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: Spark version 1.6.0-SNAPSHOT built from the sources as 
> of today - Oct, 10th
>Reporter: Jacek Laskowski
>
> While toying with {{RDD.checkpoint}} and {{RDD.localCheckpoint}} methods, the 
> following NullPointerException was thrown:
> {code}
> scala> lines.count
> java.lang.NullPointerException
>   at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
>   ... 48 elided
> {code}
> To reproduce the issue do the following:
> {code}
> $ ./bin/spark-shell
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
>       /_/
> Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val lines = sc.textFile("README.md")
> lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at 
> :24
> scala> sc.setCheckpointDir("checkpoints")
> scala> lines.checkpoint
> scala> lines.count
> res2: Long = 98
> scala> lines.localCheckpoint
> 15/10/10 22:59:20 WARN MapPartitionsRDD: RDD was already marked for reliable 
> checkpointing: overriding with local checkpoint.
> res4: lines.type = MapPartitionsRDD[1] at textFile at :24
> scala> lines.count
> java.lang.NullPointerException
>   at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
>   ... 48 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11047) Internal accumulators miss the internal flag when replaying events in the history server

2015-10-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11047:
--
Affects Version/s: (was: 1.5.1)
   1.5.0
 Target Version/s: 1.5.2, 1.6.0

> Internal accumulators miss the internal flag when replaying events in the 
> history server
> 
>
> Key: SPARK-11047
> URL: https://issues.apache.org/jira/browse/SPARK-11047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Carson Wang
>
> Internal accumulators don't write the internal flag to the event log, so on 
> the history server Web UI no accumulator is marked as internal. This causes an 
> incorrect peak execution memory value and an unwanted accumulator table to be 
> displayed on the stage page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11080) NamedExpression.newExprId should only be called on driver

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954558#comment-14954558
 ] 

Apache Spark commented on SPARK-11080:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9093

> NamedExpression.newExprId should only be called on driver
> -
>
> Key: SPARK-11080
> URL: https://issues.apache.org/jira/browse/SPARK-11080
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> My understanding of {{NamedExpression.newExprId}} is that it is only intended 
> to be called on the driver. If it is called on executors, then this may lead 
> to scenarios where the same expression id is re-used in two different 
> NamedExpressions.
> More generally, I think that calling {{NamedExpression.newExprId}} within 
> tasks may be an indicator of potential attribute binding bugs. Therefore, I 
> think that we should prevent {{NamedExpression.newExprId}} from being called 
> inside of tasks by throwing an exception when such calls occur. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10181) HiveContext is not used with keytab principal but with user principal

2015-10-13 Thread Bolke de Bruin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954582#comment-14954582
 ] 

Bolke de Bruin commented on SPARK-10181:


Yes that is correct. 

> HiveContext is not used with keytab principal but with user principal
> -
>
> Key: SPARK-10181
> URL: https://issues.apache.org/jira/browse/SPARK-10181
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: kerberos
>Reporter: Bolke de Bruin
>  Labels: hive, hivecontext, kerberos
>
> `bin/spark-submit --num-executors 1 --executor-cores 5 --executor-memory 5G  
> --driver-java-options -XX:MaxPermSize=4G --driver-class-path 
> lib/datanucleus-api-jdo-3.2.6.jar:lib/datanucleus-core-3.2.10.jar:lib/datanucleus-rdbms-3.2.9.jar:conf/hive-site.xml
>  --files conf/hive-site.xml --master yarn --principal sparkjob --keytab 
> /etc/security/keytabs/sparkjob.keytab --conf 
> spark.yarn.executor.memoryOverhead=18000 --conf 
> "spark.executor.extraJavaOptions=-XX:MaxPermSize=4G" --conf 
> spark.eventLog.enabled=false ~/test.py`
> With:
> #!/usr/bin/python
> from pyspark import SparkContext
> from pyspark.sql import HiveContext
> sc = SparkContext()
> sqlContext = HiveContext(sc)
> query = """ SELECT * FROM fm.sk_cluster """
> rdd = sqlContext.sql(query)
> rdd.registerTempTable("test")
> sqlContext.sql("CREATE TABLE wcs.test LOCATION '/tmp/test_gl' AS SELECT * 
> FROM test")
> Ends up with:
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  Permission denied: user=ua80tl, access=READ_EXECUTE, 
> inode="/tmp/test_gl/.hive-staging_hive_2015-08-24_10-43-09_157_7805739002405787834-1/-ext-1":sparkjob:hdfs:drwxr-x---
> (Our umask denies read access to other by default)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11082) Cores per executor is wrong when response vcore number is less than requested number

2015-10-13 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-11082:
---

 Summary: Cores per executor is wrong when response vcore number is 
less than requested number
 Key: SPARK-11082
 URL: https://issues.apache.org/jira/browse/SPARK-11082
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.6.0
Reporter: Saisai Shao


When DefaultResourceCalculator is set (the default) for the YARN capacity 
scheduler, the vcore number in the response container resource is always 1, 
which may be less than the requested vcore number. ExecutorRunnable should honor 
this returned vcore number (not the requested number) when passing cores to each 
executor. Otherwise, the actually allocated vcore number differs from the CPU 
cores per executor that Spark believes it manages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11064) spark streaming checkpoint question

2015-10-13 Thread jason kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954675#comment-14954675
 ] 

jason kim commented on SPARK-11064:
---

I can't find the right solution here.

My code:
def main(args: Array[String]) {
  if (args.length < 2) {
    System.err.println("Usage: <brokers> <topics>")
    System.exit(1)
  }

  val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.foldLeft(0)(_ + _)
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }

  StreamingExamples.setStreamingLogLevels()

  val Array(brokers, topics) = args

  // Create context with 2 second batch interval
  val sparkConf = new SparkConf().setAppName("KafkaWordCountTest").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  ssc.checkpoint("checkpoint")

  // Create direct kafka stream with brokers and topics
  val topicsSet = topics.split(",").toSet
  val kafkaParams = Map[String, String](
    "metadata.broker.list" -> brokers,
    "serializer.class" -> "kafka.serializer.StringEncoder",
    "group.id" -> "spark-kafka-consumer")

  // update zookeeper offset
  val km = new KafkaManager(kafkaParams)
  val messages = km.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicsSet)

  // example 2:
  messages.transform(rdd => {
    km.updateZKOffsets(rdd)
    rdd
  }).map(x => (66, 1L))
    .reduceByKeyAndWindow(_ + _, _ - _, Seconds(10), Seconds(4))
    .print()

  // Start the computation
  ssc.start()
  ssc.awaitTermination()
}
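One common cause of this error, offered here as a guess rather than a confirmed
diagnosis: with checkpointing enabled, everything captured by DStream functions
must be serializable, and the {{KafkaManager}} instance used inside {{transform}}
above is captured by that closure. Making it serializable (or creating it lazily
inside the function) is the usual fix:

{code}
// Sketch only (KafkaManager is the user's own class, assumed here):
class KafkaManager(val kafkaParams: Map[String, String]) extends Serializable {
  // ... offset-tracking helpers referenced from DStream closures ...
}
{code}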

> spark streaming checkpoint question
> ---
>
> Key: SPARK-11064
> URL: https://issues.apache.org/jira/browse/SPARK-11064
> Project: Spark
>  Issue Type: Question
>Affects Versions: 1.4.1
>Reporter: jason kim
> Fix For: 2+
>
>
> java.io.NotSerializableException: DStream checkpointing has been enabled but 
> the DStreams with their functions are not serializable
> Serialization stack:
>   at 
> org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:550)
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:587)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:586)
>   at 
> com.bj58.spark.streaming.KafkaWordCountTest$.main(KafkaWordCountTest.scala:70)
>   at 
> com.bj58.spark.streaming.KafkaWordCountTest.main(KafkaWordCountTest.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8813) Combine files when there're many small files in table

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954698#comment-14954698
 ] 

Apache Spark commented on SPARK-8813:
-

User 'zhichao-li' has created a pull request for this issue:
https://github.com/apache/spark/pull/9097

> Combine files when there're many small files in table
> -
>
> Key: SPARK-8813
> URL: https://issues.apache.org/jira/browse/SPARK-8813
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yadong Qi
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11052) Spaces in the build dir causes failures in the build/mvn script

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11052:
--
Priority: Minor  (was: Major)

> Spaces in the build dir causes failures in the build/mvn script
> ---
>
> Key: SPARK-11052
> URL: https://issues.apache.org/jira/browse/SPARK-11052
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Trystan Leftwich
>Priority: Minor
>
> If you are running make-distribution in a path that contains a space in it 
> the build/mvn script will fail:
> {code}
> mkdir /tmp/test\ spaces
> cd /tmp/test\ spaces
> git clone https://github.com/apache/spark.git
> cd spark
> ./make-distribution.sh --name spark-1.5-test4 --tgz -Pyarn 
> -Phive-thriftserver -Phive
> {code}
> You will get the following errors
> {code}
> /tmp/test spaces/spark/build/mvn: line 107: cd: /../lib: No such file or 
> directory
> usage: dirname path
> /tmp/test spaces/spark/build/mvn: line 108: cd: /../lib: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 138: /tmp/test: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 140: /tmp/test: No such file or 
> directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11066) Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due to j.l.UnsupportedOperationException concerning a finished JobWaiter

2015-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954757#comment-14954757
 ] 

Sean Owen commented on SPARK-11066:
---

Yes, the problem is that anyone who submits a JIRA presumably wants to see it 
addressed, and soon. Few are actually actionable, valid, and something the 
submitter follows through on. Hence Target Version ought to be set only by 
someone who is willing and able to drive the issue to a resolution. The view of 
JIRAs targeted at a release is then a somewhat reliable picture of what could 
happen in that release. It's still used unevenly, but that's the reason.

If an issue is likely to be resolved rapidly, like this one, I usually don't 
even bother, but it would be valid to target it at 1.6 / 1.5.2 after seeing that 
it's probably a fine change that passes tests, etc. (there are still some style 
failures).

> Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler 
> occasionally fails due to j.l.UnsupportedOperationException concerning a 
> finished JobWaiter
> --
>
> Key: SPARK-11066
> URL: https://issues.apache.org/jira/browse/SPARK-11066
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, Tests
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: Multiple OS and platform types.
> (Also observed by others, e.g. see External URL)
>Reporter: Dr Stephen A Hellberg
>Priority: Minor
>
> The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent 
> problem: it creates a job for the DAGScheduler comprising multiple (2) tasks, 
> but whilst the job will fail and a SparkDriverExecutionException will be 
> returned, a race condition exists as to whether the first task's 
> (deliberately) thrown exception causes the job to fail - and having its 
> causing exception set to the DAGSchedulerSuiteDummyException that was thrown 
> as the setup of the misbehaving test - or second (and subsequent) tasks who 
> equally end, but have instead the DAGScheduler's legitimate 
> UnsupportedOperationException (a subclass of RuntimeException) returned 
> instead as their causing exception.  This race condition is likely associated 
> with the vagaries of processing quanta, and expense of throwing two 
> exceptions (under interpreter execution) per thread of control; this race is 
> usually 'won' by the first task throwing the DAGSchedulerDummyException, as 
> desired (and expected)... but not always.
> The problem for the testcase is that the first assertion is largely 
> concerning the test setup, and doesn't (can't? Sorry, still not a ScalaTest 
> expert) capture all the causes of SparkDriverExecutionException that can 
> legitimately arise from a correctly working (not crashed) DAGScheduler.  
> Arguably, this assertion might test something of the DAGScheduler... but not 
> all the possible outcomes for a working DAGScheduler.  Nevertheless, this 
> test - when comprising a multiple task job - will report as a failure when in 
> fact the DAGScheduler is working-as-designed (and not crashed ;-).  
> Furthermore, the test is already failed before it actually tries to use the 
> SparkContext a second time (for an arbitrary processing task), which I think 
> is the real subject of the test?
> The solution, I submit, is to ensure that the job is composed of just one 
> task, and that single task will result in the call to the compromised 
> ResultHandler causing the test's deliberate exception to be thrown and 
> exercising the relevant (DAGScheduler) code paths.  Given tasks are scoped by 
> the number of partitions of an RDD, this could be achieved with a single 
> partitioned RDD (indeed, doing so seems to exercise/would test some default 
> parallelism support of the TaskScheduler?); the pull request offered, 
> however, is based on the minimal change of just using a single partition of 
> the 2 (or more) partition parallelized RDD.  This will result in scheduling a 
> job of just one task, one successful task calling the user-supplied 
> compromised ResultHandler function, which results in failing the job and 
> unambiguously wrapping our DAGSchedulerSuiteException inside a 
> SparkDriverExecutionException; there are no other tasks that on running 
> successfully will find the job failed causing the 'undesired' 
> UnsupportedOperationException to be thrown instead.  This, then, satisfies 
> the test's setup assertion.
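A sketch of the single-task shape of the proposed fix (class names and the
surrounding test scaffolding, such as {{sc}} and ScalaTest's {{intercept}}, are
assumed from the suite; this is not the exact suite code):

{code}
import org.apache.spark.{SparkDriverExecutionException, TaskContext}

class DAGSchedulerSuiteDummyException extends Exception

val rdd = sc.parallelize(1 to 10, 2)            // still a 2-partition RDD
intercept[SparkDriverExecutionException] {
  sc.runJob[Int, Int](
    rdd,
    (context: TaskContext, iter: Iterator[Int]) => iter.size,
    Seq(0),                                     // only partition 0 => exactly one task
    (part: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
}
{code}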
> I have tested this hypothesis having parameterised the number of partitions, N, 
> used by the "misbehaved ResultHandler" job and have observed the 1 x 
> DAGSchedulerSuiteException first, followed by the legitimate N-1 x 
> UnsupportedOperationExceptions ... what propagates back from the job seems to 
> simply become the 

[jira] [Commented] (SPARK-11066) Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due to j.l.UnsupportedOperationException concerning a finished JobWaiter

2015-10-13 Thread Dr Stephen A Hellberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954736#comment-14954736
 ] 

Dr Stephen A Hellberg commented on SPARK-11066:
---

Sean: apologies re: Fix Version and Target Version. I was led astray in 
interpreting their purpose because they were present on the Create Issue template.
Fix Version makes complete sense: until the fix is integrated, it's not fixed. 
Target Version I had interpreted as where I'd hope to see the fix released, or 
where it is suitable for being applied. I know this issue arises in the 1.4.x 
releases (and probably before), but I'm mostly interested in seeing it addressed 
in current/future releases; my fix is likely sufficient in prior releases as 
well, so what criteria are used to decide how far back a committer would 
backport into prior releases (given only Affects Versions)?

> Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler 
> occasionally fails due to j.l.UnsupportedOperationException concerning a 
> finished JobWaiter
> --
>
> Key: SPARK-11066
> URL: https://issues.apache.org/jira/browse/SPARK-11066
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, Tests
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: Multiple OS and platform types.
> (Also observed by others, e.g. see External URL)
>Reporter: Dr Stephen A Hellberg
>Priority: Minor
>
> The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent 
> problem: it creates a job for the DAGScheduler comprising multiple (2) tasks, 
> but whilst the job will fail and a SparkDriverExecutionException will be 
> returned, a race condition exists as to whether the first task's 
> (deliberately) thrown exception causes the job to fail - and having its 
> causing exception set to the DAGSchedulerSuiteDummyException that was thrown 
> as the setup of the misbehaving test - or second (and subsequent) tasks who 
> equally end, but have instead the DAGScheduler's legitimate 
> UnsupportedOperationException (a subclass of RuntimeException) returned 
> instead as their causing exception.  This race condition is likely associated 
> with the vagaries of processing quanta, and expense of throwing two 
> exceptions (under interpreter execution) per thread of control; this race is 
> usually 'won' by the first task throwing the DAGSchedulerDummyException, as 
> desired (and expected)... but not always.
> The problem for the testcase is that the first assertion is largely 
> concerning the test setup, and doesn't (can't? Sorry, still not a ScalaTest 
> expert) capture all the causes of SparkDriverExecutionException that can 
> legitimately arise from a correctly working (not crashed) DAGScheduler.  
> Arguably, this assertion might test something of the DAGScheduler... but not 
> all the possible outcomes for a working DAGScheduler.  Nevertheless, this 
> test - when comprising a multiple task job - will report as a failure when in 
> fact the DAGScheduler is working-as-designed (and not crashed ;-).  
> Furthermore, the test is already failed before it actually tries to use the 
> SparkContext a second time (for an arbitrary processing task), which I think 
> is the real subject of the test?
> The solution, I submit, is to ensure that the job is composed of just one 
> task, and that single task will result in the call to the compromised 
> ResultHandler causing the test's deliberate exception to be thrown and 
> exercising the relevant (DAGScheduler) code paths.  Given tasks are scoped by 
> the number of partitions of an RDD, this could be achieved with a single 
> partitioned RDD (indeed, doing so seems to exercise/would test some default 
> parallelism support of the TaskScheduler?); the pull request offered, 
> however, is based on the minimal change of just using a single partition of 
> the 2 (or more) partition parallelized RDD.  This will result in scheduling a 
> job of just one task, one successful task calling the user-supplied 
> compromised ResultHandler function, which results in failing the job and 
> unambiguously wrapping our DAGSchedulerSuiteException inside a 
> SparkDriverExecutionException; there are no other tasks that on running 
> successfully will find the job failed causing the 'undesired' 
> UnsupportedOperationException to be thrown instead.  This, then, satisfies 
> the test's setup assertion.
> I have tested this hypothesis having parameterised the number of partitions, N, 
> used by the "misbehaved ResultHandler" job and have observed the 1 x 
> DAGSchedulerSuiteException first, followed by the legitimate N-1 x 
> UnsupportedOperationExceptions ... what propagates back from the 

[jira] [Commented] (SPARK-11026) spark.yarn.user.classpath.first doesn't work for 'spark-submit --jars hdfs://user/foo.jar'

2015-10-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954928#comment-14954928
 ] 

Thomas Graves commented on SPARK-11026:
---

If you use --jars hdfs://users/foo.jar with 
spark.yarn.user.classpath.first=true, the jar isn't properly added to the system 
classpath.

> spark.yarn.user.classpath.first doesn't work for 'spark-submit --jars 
> hdfs://user/foo.jar'
> ---
>
> Key: SPARK-11026
> URL: https://issues.apache.org/jira/browse/SPARK-11026
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Lianhui Wang
>
> When spark.yarn.user.classpath.first=true and the added jars are on an HDFS 
> path, the YARN link name of those jars needs to be added to the system classpath.
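For illustration, a rough sketch of the distinction (helper name assumed; not the
actual Client/ExecutorRunnable code): for a {{--jars}} entry on HDFS, YARN
localizes the file under a link name in the container working directory, and it
is that link name, not the hdfs:// URI, that belongs on the container classpath
when spark.yarn.user.classpath.first=true.

{code}
import java.net.URI

// Sketch only: assumed helper.
def classpathEntryFor(jar: String): String = {
  val uri = new URI(jar)
  if (uri.getScheme == null || uri.getScheme == "local") jar
  else new java.io.File(uri.getPath).getName  // the localized link name, e.g. "foo.jar"
}

println(classpathEntryFor("hdfs:///user/foo.jar"))  // foo.jar
{code}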



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11026) spark.yarn.user.classpath.first doesn't work for 'spark-submit --jars hdfs://user/foo.jar'

2015-10-13 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-11026.
---
   Resolution: Fixed
 Assignee: Lianhui Wang
Fix Version/s: 1.6.0
   1.5.2

> spark.yarn.user.classpath.first doesn't work for 'spark-submit --jars 
> hdfs://user/foo.jar'
> ---
>
> Key: SPARK-11026
> URL: https://issues.apache.org/jira/browse/SPARK-11026
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Lianhui Wang
>Assignee: Lianhui Wang
> Fix For: 1.5.2, 1.6.0
>
>
> When spark.yarn.user.classpath.first=true and the added jars are on an HDFS 
> path, the YARN link name of those jars needs to be added to the system classpath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6187) Report full executor exceptions to the driver

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6187.
--
Resolution: Duplicate

Ah, is this what was already resolved in 
https://issues.apache.org/jira/browse/SPARK-8625?

> Report full executor exceptions to the driver
> -
>
> Key: SPARK-6187
> URL: https://issues.apache.org/jira/browse/SPARK-6187
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Piotr Kołaczkowski
>Priority: Minor
>
> If a task fails for some reason, the driver seems to report only the 
> top-level exception, without the cause(s). While it is possible to recover 
> the full stacktrace from the executor's logs, that is quite annoying; it would 
> be better to report the full stacktrace, with all the causes, to the driver 
> application.
> Example stacktrace I just got reported by the application:
> {noformat}
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 
> in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 
> (TID 5, localhost): java.lang.NoClassDefFoundError: Could not initialize 
> class org.apache.cassandra.db.Keyspace
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.writeSSTables(BulkTableWriter.scala:194)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.write(BulkTableWriter.scala:223)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:280)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:280)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> As you see, this is not very informative.
> In fact, the real exception is:
> {noformat}
> java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.cassandra.db.Keyspace
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.writeSSTables(BulkTableWriter.scala:194)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.write(BulkTableWriter.scala:227)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:284)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:284)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> java.lang.ExceptionInInitializerError
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.writeSSTables(BulkTableWriter.scala:194)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.write(BulkTableWriter.scala:227)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:284)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:284)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.cassandra.config.DatabaseDescriptor.createAllDirectories(DatabaseDescriptor.java:741)
>   at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:72)
>   ... 10 more
> {noformat}
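A small sketch of the idea (helper names assumed): walk the {{getCause}} chain so
the driver-side report can include every nested cause rather than just the
top-level exception.

{code}
// Sketch only.
def causeChain(t: Throwable): Seq[Throwable] =
  Iterator.iterate(t)(_.getCause).takeWhile(_ != null).toSeq

def fullReport(t: Throwable): String =
  causeChain(t)
    .map(c => s"${c.getClass.getName}: ${c.getMessage}")
    .mkString("\nCaused by: ")
{code}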



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11085) Add support for HTTP proxy

2015-10-13 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954989#comment-14954989
 ] 

Don Drake commented on SPARK-11085:
---

Neither of the options works.

> Add support for HTTP proxy 
> ---
>
> Key: SPARK-11085
> URL: https://issues.apache.org/jira/browse/SPARK-11085
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> Add a way to update ivysettings.xml for the spark-shell and spark-submit to 
> support proxy settings for clusters that need to access a remote repository 
> through an http proxy.  Typically this would be done like:
> JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080"
> Directly in the ivysettings.xml would look like:
> <setproxy proxyhost="proxy.host" proxyport="8080" nonproxyhosts="nonproxy.host"/>
> Even better would be a way to customize the ivysettings.xml with command 
> options.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11026) spark.yarn.user.classpath.first does not work for 'spark-submit --jars hdfs://user/foo.jar'

2015-10-13 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-11026:
--
Summary: spark.yarn.user.classpath.first does not work for 'spark-submit --jars 
hdfs://user/foo.jar'  (was: spark.yarn.user.classpath.first doesn't work for 
remote addJars)

> spark.yarn.user.classpath.first does not work for 'spark-submit --jars 
> hdfs://user/foo.jar'
> ---
>
> Key: SPARK-11026
> URL: https://issues.apache.org/jira/browse/SPARK-11026
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Lianhui Wang
>
> When spark.yarn.user.classpath.first=true and addJars refers to an HDFS path, 
> the YARN linkName of the added jars needs to be added to the system classpath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11085) Add support for HTTP proxy

2015-10-13 Thread Dustin Cote (JIRA)
Dustin Cote created SPARK-11085:
---

 Summary: Add support for HTTP proxy 
 Key: SPARK-11085
 URL: https://issues.apache.org/jira/browse/SPARK-11085
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell, Spark Submit
Reporter: Dustin Cote


Add a way to update ivysettings.xml for the spark-shell and spark-submit to 
support proxy settings for clusters that need to access a remote repository 
through an http proxy.  Typically this would be done like:
JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 
-Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080"

Directly in the ivysettings.xml would look like:
<setproxy proxyhost="proxy.host" proxyport="8080" nonproxyhosts="nonproxy.host"/>

Even better would be a way to customize the ivysettings.xml with command 
options.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11085) Add support for HTTP proxy

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11085:
--
Priority: Minor  (was: Major)

> Add support for HTTP proxy 
> ---
>
> Key: SPARK-11085
> URL: https://issues.apache.org/jira/browse/SPARK-11085
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> Add a way to update ivysettings.xml for the spark-shell and spark-submit to 
> support proxy settings for clusters that need to access a remote repository 
> through an http proxy.  Typically this would be done like:
> JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080"
> Directly in the ivysettings.xml would look like:
> <setproxy proxyhost="proxy.host" proxyport="8080" nonproxyhosts="nonproxy.host"/>
> Even better would be a way to customize the ivysettings.xml with command 
> options.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6987) Node Locality is determined with String Matching instead of Inet Comparison

2015-10-13 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954983#comment-14954983
 ] 

Piotr Kołaczkowski commented on SPARK-6987:
---

Probably just having the ability to list the host names that Spark knows of would 
be enough.
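
For illustration, a minimal Scala sketch of the Inet-based comparison the 
description below asks for; the helper name and error handling are assumptions, 
not existing Spark code:
{code}
import java.net.InetAddress

// Hypothetical helper: treat two host strings as the same node when they are
// equal or resolve to the same address; resolution failures fall back to the
// plain string comparison the scheduler does today.
def sameNode(a: String, b: String): Boolean =
  a == b || (try {
    InetAddress.getByName(a).getHostAddress == InetAddress.getByName(b).getHostAddress
  } catch {
    case _: java.net.UnknownHostException => false
  })
{code}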

> Node Locality is determined with String Matching instead of Inet Comparison
> ---
>
> Key: SPARK-6987
> URL: https://issues.apache.org/jira/browse/SPARK-6987
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Russell Alexander Spitzer
>
> When determining whether or not a task can be run NodeLocal the 
> TaskSetManager ends up using a direct string comparison between the 
> preferredIp and the executor's bound interface.
> https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L878-L880
> https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L488-L490
> This means that the preferredIp must be a direct string match of the IP the 
> worker is bound to. This means that APIs which are gathering data from 
> other distributed sources must develop their own mapping between the 
> interfaces bound (or exposed) by the external sources and the interface bound 
> by the Spark executor since these may be different. 
> For example, Cassandra exposes a broadcast rpc address which doesn't have to 
> match the address which the service is bound to. This means when adding 
> preferredLocation data we must add both the rpc and the listen address to 
> ensure that we can get a string match (and of course we are out of luck if 
> Spark has been bound on to another interface). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11083) insert overwrite table failed when beeline reconnect

2015-10-13 Thread Weizhong (JIRA)
Weizhong created SPARK-11083:


 Summary: insert overwrite table failed when beeline reconnect
 Key: SPARK-11083
 URL: https://issues.apache.org/jira/browse/SPARK-11083
 Project: Spark
  Issue Type: Bug
  Components: SQL
 Environment: Spark: master branch
Hadoop: 2.7.1
JDK: 1.8.0_60
Reporter: Weizhong


1. Start the Thriftserver
2. Use beeline to connect to the Thriftserver, then execute an "insert overwrite 
table_name ..." clause -- success
3. Exit beeline
4. Reconnect to the Thriftserver, then execute the same "insert overwrite 
table_name ..." clause -- failed
{noformat}
15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING, 
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
at 
org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:144)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
source 
hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/.hive-staging_hive_2015-10-13_18-44-17_606_2400736035447406540-2/-ext-1/cr_returned_date=2003-08-27/part-00048
 to destination 
hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/cr_returned_date=2003-08-27/part-00048
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2892)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398)
at 

[jira] [Assigned] (SPARK-11084) SparseVector.__getitem__ should check if value can be non-zero before executing searchsorted

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11084:


Assignee: Apache Spark

> SparseVector.__getitem__ should check if value can be non-zero before 
> executing searchsorted
> 
>
> Key: SPARK-11084
> URL: https://issues.apache.org/jira/browse/SPARK-11084
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> At this moment SparseVector.\_\_getitem\_\_ executes np.searchsorted first 
> and checks if result is in an expected range after that:
> {code}
> insert_index = np.searchsorted(inds, index)
> if insert_index >= inds.size:
> return 0.
> row_ind = inds[insert_index]
> ...
> {code}
> See: https://issues.apache.org/jira/browse/SPARK-10973
> It is possible to check if index can contain non-zero value before binary 
> search: 
> {code}
> if (inds.size == 0) or (index > inds.item(-1)):
> return 0.
> insert_index = np.searchsorted(inds, index)
> row_ind = inds[insert_index]
> ...
> {code}
> It is not a huge improvement but should save some work on large vectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11084) SparseVector.__getitem__ should check if value can be non-zero before executing searchsorted

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954861#comment-14954861
 ] 

Apache Spark commented on SPARK-11084:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/9098

> SparseVector.__getitem__ should check if value can be non-zero before 
> executing searchsorted
> 
>
> Key: SPARK-11084
> URL: https://issues.apache.org/jira/browse/SPARK-11084
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> At this moment SparseVector.\_\_getitem\_\_ executes np.searchsorted first 
> and checks if result is in an expected range after that:
> {code}
> insert_index = np.searchsorted(inds, index)
> if insert_index >= inds.size:
> return 0.
> row_ind = inds[insert_index]
> ...
> {code}
> See: https://issues.apache.org/jira/browse/SPARK-10973
> It is possible to check if index can contain non-zero value before binary 
> search: 
> {code}
> if (inds.size == 0) or (index > inds.item(-1)):
> return 0.
> insert_index = np.searchsorted(inds, index)
> row_ind = inds[insert_index]
> ...
> {code}
> It is not a huge improvement but should save some work on large vectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11083) insert overwrite table failed when beeline reconnect

2015-10-13 Thread Weizhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weizhong updated SPARK-11083:
-
Description: 
1. Start the Thriftserver
2. Use beeline to connect to the Thriftserver, then execute an "insert overwrite 
table_name ..." clause -- success
3. Exit beeline
4. Reconnect to the Thriftserver, then execute the same "insert overwrite 
table_name ..." clause -- failed
{noformat}
15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING, 
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
at 
org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:144)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
source 
hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/.hive-staging_hive_2015-10-13_18-44-17_606_2400736035447406540-2/-ext-1/cr_returned_date=2003-08-27/part-00048
 to destination 
hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/cr_returned_date=2003-08-27/part-00048
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2892)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398)
at 
org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1593)
... 36 more
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
at 

[jira] [Created] (SPARK-11084) SparseVector.__getitem__ should check if value can be non-zero before executing searchsorted

2015-10-13 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-11084:
--

 Summary: SparseVector.__getitem__ should check if value can be 
non-zero before executing searchsorted
 Key: SPARK-11084
 URL: https://issues.apache.org/jira/browse/SPARK-11084
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.5.0, 1.4.0, 1.3.0, 1.6.0
Reporter: Maciej Szymkiewicz
Priority: Minor


At this moment SparseVector.\_\_getitem\_\_ executes np.searchsorted first and 
only afterwards checks whether the result is in the expected range:

{code}
insert_index = np.searchsorted(inds, index)
if insert_index >= inds.size:
return 0.

row_ind = inds[insert_index]
...
{code}

See: https://issues.apache.org/jira/browse/SPARK-10973

It is possible to check whether the index can hold a non-zero value before the 
binary search: 

{code}
if (inds.size == 0) or (index > inds.item(-1)):
return 0.

insert_index = np.searchsorted(inds, index)
row_ind = inds[insert_index]
...
{code}

It is not a huge improvement but should save some work on large vectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11085) Add support for HTTP proxy

2015-10-13 Thread Dustin Cote (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955001#comment-14955001
 ] 

Dustin Cote commented on SPARK-11085:
-

[~sowen] The problem here is that the dependencies to be downloaded with 
--packages can't be reached because those settings do not get forwarded into 
the respective Spark client.  I'll note this was being tried with Spark on YARN 
and the JAVA_OPTS was being set through spark.driver.extraJavaOptions.  The 
ivysettings change was being done through ~/.m2/ivysettings.xml.  It's more of 
an issue of forwarding the settings to the Spark client.

At least on CDH, the relevant ivysettings.xml is bundled in the assembly jar 
and apparently not modified by the two methods:
:: loading settings :: url = 
jar:file:/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p819.487/jars/spark-assembly-1.3.0-cdh5.4.2-hadoop2.6.0-cdh5.4.2.jar!/org/apache/ivy/core/settings/ivysettings.xml
 

This JIRA would be to come up with a way to modify or override this 
ivysettings.xml so that it can be used with proxy settings.
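
For reference, the properties being discussed are the standard JVM proxy 
properties; a hedged Scala sketch of setting them programmatically in the driver 
(whether spark-submit's bundled Ivy resolution ever sees them is exactly the 
open question here):
{code}
// Standard java.net proxy properties, with the values from the example above.
// Setting them in the driver JVM mirrors the spark.driver.extraJavaOptions
// attempt; the bundled ivysettings.xml apparently does not pick them up.
System.setProperty("http.proxyHost", "proxy.host")
System.setProperty("http.proxyPort", "8080")
System.setProperty("https.proxyHost", "proxy.host.secure")
System.setProperty("https.proxyPort", "8080")
{code}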

> Add support for HTTP proxy 
> ---
>
> Key: SPARK-11085
> URL: https://issues.apache.org/jira/browse/SPARK-11085
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> Add a way to update ivysettings.xml for the spark-shell and spark-submit to 
> support proxy settings for clusters that need to access a remote repository 
> through an http proxy.  Typically this would be done like:
> JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080"
> Directly in the ivysettings.xml would look like:
> <setproxy proxyhost="proxy.host" proxyport="8080" nonproxyhosts="nonproxy.host"/>
> Even better would be a way to customize the ivysettings.xml with command 
> options.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11064) spark streaming checkpoint question

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11064:
--
Target Version/s:   (was: 1.4.1)
   Fix Version/s: (was: 2+)

> spark streaming checkpoint question
> ---
>
> Key: SPARK-11064
> URL: https://issues.apache.org/jira/browse/SPARK-11064
> Project: Spark
>  Issue Type: Question
>Affects Versions: 1.4.1
>Reporter: jason kim
>
> java.io.NotSerializableException: DStream checkpointing has been enabled but 
> the DStreams with their functions are not serializable
> Serialization stack:
>   at 
> org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:550)
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:587)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:586)
>   at 
> com.bj58.spark.streaming.KafkaWordCountTest$.main(KafkaWordCountTest.scala:70)
>   at 
> com.bj58.spark.streaming.KafkaWordCountTest.main(KafkaWordCountTest.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8170) Ctrl-C in pyspark shell doesn't kill running job

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8170:
-
Assignee: Ashwin Shankar

> Ctrl-C in pyspark shell doesn't kill running job
> 
>
> Key: SPARK-8170
> URL: https://issues.apache.org/jira/browse/SPARK-8170
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.1
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
> Fix For: 1.6.0
>
>
> Hitting Ctrl-C in spark-sql(and other tools like presto) cancels any running 
> job and starts a new input line on the prompt. It would be nice if pyspark 
> shell also can do that. Otherwise, in case a user submits a job, say he made 
> a mistake, and wants to cancel it, he needs to exit the shell and re-login to 
> continue his work. Re-login can be a pain especially in Spark on yarn, since 
> it takes a while to allocate AM container and initial executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11064) spark streaming checkpoint question

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11064.
---
Resolution: Duplicate

[~leehom] Please do not open any more JIRAs until you read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark  Both 
JIRAs were invalid and had several problems.

> spark streaming checkpoint question
> ---
>
> Key: SPARK-11064
> URL: https://issues.apache.org/jira/browse/SPARK-11064
> Project: Spark
>  Issue Type: Question
>Affects Versions: 1.4.1
>Reporter: jason kim
>
> java.io.NotSerializableException: DStream checkpointing has been enabled but 
> the DStreams with their functions are not serializable
> Serialization stack:
>   at 
> org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:550)
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:587)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:586)
>   at 
> com.bj58.spark.streaming.KafkaWordCountTest$.main(KafkaWordCountTest.scala:70)
>   at 
> com.bj58.spark.streaming.KafkaWordCountTest.main(KafkaWordCountTest.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10959) PySpark StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10959:
--
Labels: backport-needed  (was: )

> PySpark StreamingLogisticRegressionWithSGD does not train with given regParam 
> and convergenceTol parameters
> ---
>
> Key: SPARK-10959
> URL: https://issues.apache.org/jira/browse/SPARK-10959
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.6.0
>
>
> These parameters are passed into the StreamingLogisticRegressionWithSGD 
> constructor, but do not get transferred to the model to use when training.  
> Same problem with StreamingLinearRegressionWithSGD and the intercept param is 
> in the wrong  argument place where it is being used as regularization value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11008) Spark window function returns inconsistent/wrong results

2015-10-13 Thread Johnathan Garrett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955107#comment-14955107
 ] 

Johnathan Garrett commented on SPARK-11008:
---

As an additional test, I created two separate parquet files on HDFS and did the 
above sequence on both of them.  Whichever one is processed first returns 
incorrect results.  The second one always returns the correct results.

> Spark window function returns inconsistent/wrong results
> 
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>Reporter: Prasad Chalasani
>Priority: Minor
>
> Summary: applying a windowing function on a data-frame, followed by count() 
> gives widely varying results in repeated runs: none exceed the correct value, 
> but of course all but one are wrong. On large data-sets I sometimes get as 
> small as HALF of the correct value.
> A minimal reproducible example is here: 
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", 
> (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the code in step 5 in the same shell, then the result gets 
> "fixed" and I always get 100.
> Note this is only a minimal reproducible example to reproduce the error. In 
> my real application the size of the data is much larger and the window 
> function is not trivial as above (i.e. there are multiple "time" values per 
> "id", etc), and I see results sometimes as small as HALF of the correct value 
> (e.g. 120,000 while the correct value is 250,000). So this is a serious 
> problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11008) Spark window function returns inconsistent/wrong results

2015-10-13 Thread Johnathan Garrett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955107#comment-14955107
 ] 

Johnathan Garrett edited comment on SPARK-11008 at 10/13/15 3:23 PM:
-

As an additional test, I created two separate parquet files on HDFS and did the 
above sequence on both of them.  Regardless of which file I process first, the 
first dataframe returns incorrect results.  The second one always returns the 
correct results.


was (Author: jgarrett):
As an additional test, I created two separate parquet files on HDFS and did the 
above sequence on both of them.  Whichever one is processed first returns 
incorrect results.  The second one always returns the correct results.

> Spark window function returns inconsistent/wrong results
> 
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>Reporter: Prasad Chalasani
>Priority: Minor
>
> Summary: applying a windowing function on a data-frame, followed by count() 
> gives widely varying results in repeated runs: none exceed the correct value, 
> but of course all but one are wrong. On large data-sets I sometimes get as 
> small as HALF of the correct value.
> A minimal reproducible example is here: 
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", 
> (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the code in step 5 in the same shell, then the result gets 
> "fixed" and I always get 100.
> Note this is only a minimal reproducible example to reproduce the error. In 
> my real application the size of the data is much larger and the window 
> function is not trivial as above (i.e. there are multiple "time" values per 
> "id", etc), and I see results sometimes as small as HALF of the correct value 
> (e.g. 120,000 while the correct value is 250,000). So this is a serious 
> problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6187) Report full executor exceptions to the driver

2015-10-13 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954954#comment-14954954
 ] 

Piotr Kołaczkowski commented on SPARK-6187:
---

The problem is not that the exception is not reported at all, but that its full 
stack trace is not reported. Spark seems to report just the top-level exception 
and ignores the stack traces of the nested causes. The full stack trace shows up 
only in the executor logs. 

So your example is too simple to reproduce it, because it throws just one flat 
exception with a null cause.
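
A hedged spark-shell sketch of the nested-cause shape being described (class 
names and messages are made up for illustration):
{code}
// Hypothetical repro: the root cause carries the useful stack trace, but only
// the outer wrapper is what tends to come back in the driver's error report.
sc.parallelize(1 to 4).foreach { _ =>
  try {
    throw new IllegalStateException("interesting root cause")
  } catch {
    case e: Exception => throw new RuntimeException("uninformative wrapper", e)
  }
}
{code}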


> Report full executor exceptions to the driver
> -
>
> Key: SPARK-6187
> URL: https://issues.apache.org/jira/browse/SPARK-6187
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Piotr Kołaczkowski
>Priority: Minor
>
> If the task fails for some reason, the driver seems to report only the 
> top-level exception, without the cause(s). While it is possible to recover 
> the full stacktrace from executor's logs, it is quite annoying and would be 
> better to just report the full stacktrace, with all the causes to the driver 
> application.
> Example stacktrace I just got reported by the application:
> {noformat}
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 
> in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 
> (TID 5, localhost): java.lang.NoClassDefFoundError: Could not initialize 
> class org.apache.cassandra.db.Keyspace
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.writeSSTables(BulkTableWriter.scala:194)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.write(BulkTableWriter.scala:223)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:280)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:280)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> As you see, this is not very informative.
> In fact, the real exception is:
> {noformat}
> java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.cassandra.db.Keyspace
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.writeSSTables(BulkTableWriter.scala:194)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.write(BulkTableWriter.scala:227)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:284)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:284)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> java.lang.ExceptionInInitializerError
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.writeSSTables(BulkTableWriter.scala:194)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter.write(BulkTableWriter.scala:227)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:284)
>   at 
> com.datastax.bdp.spark.writer.BulkTableWriter$BulkSaveRDDFunctions$$anonfun$bulkSaveToCassandra$1.apply(BulkTableWriter.scala:284)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.cassandra.config.DatabaseDescriptor.createAllDirectories(DatabaseDescriptor.java:741)
>   at 

[jira] [Updated] (SPARK-11023) Error initializing SparkContext. java.net.URISyntaxException

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11023:
--
Assignee: Marcelo Vanzin

> Error initializing SparkContext. java.net.URISyntaxException
> 
>
> Key: SPARK-11023
> URL: https://issues.apache.org/jira/browse/SPARK-11023
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0, 1.5.1
> Environment: pyspark + windows 
>Reporter: Jose Antonio
>Assignee: Marcelo Vanzin
>  Labels: windows
> Fix For: 1.5.2, 1.6.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Simliar to SPARK-10326. 
> [https://issues.apache.org/jira/browse/SPARK-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949470#comment-14949470]
> C:\WINDOWS\system32>pyspark --master yarn-client
> Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, Sep 15 2015, 14:26:14) [MSC 
> v.1500 64 bit (AMD64)]
> Type "copyright", "credits" or "license" for more information.
> IPython 4.0.0 – An enhanced Interactive Python.
> ? -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help -> Python's own help system.
> object? -> Details about 'object', use 'object??' for extra details.
> 15/10/08 09:28:05 WARN MetricsSystem: Using default name DAGScheduler for 
> source because spark.app.id is not set.
> 15/10/08 09:28:06 WARN : Your hostname, PC-509512 resolves to a 
> loopback/non-reachable address: fe80:0:0:0:0:5efe:a5f:c318%net3, but we 
> couldn't find any external IP address!
> 15/10/08 09:28:08 WARN BlockReaderLocal: The short-circuit local reads 
> feature cannot be used because UNIX Domain sockets are not available on 
> Windows.
> 15/10/08 09:28:08 ERROR SparkContext: Error initializing SparkContext.
> java.net.URISyntaxException: Illegal character in opaque part at index 2: 
> C:\spark\bin\..\python\lib\pyspark.zip
> at java.net.URI$Parser.fail(Unknown Source)
> at java.net.URI$Parser.checkChars(Unknown Source)
> at java.net.URI$Parser.parse(Unknown Source)
> at java.net.URI.<init>(Unknown Source)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$7.apply(Client.scala:558)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$7.apply(Client.scala:557)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:557)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:628)
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:523)
> at 
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:214)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
> at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Unknown Source)
> 15/10/08 09:28:08 ERROR Utils: Uncaught exception in thread Thread-2
> java.lang.NullPointerException
> at 
> org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
> at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
> at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
> at 
> org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:593)
> at 
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
> at 

[jira] [Updated] (SPARK-11081) Shade Jersey dependency to work around the compatibility issue with Jersey2

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11081:
--
Component/s: Spark Core
 Issue Type: Improvement  (was: Bug)

> Shade Jersey dependency to work around the compatibility issue with Jersey2
> ---
>
> Key: SPARK-11081
> URL: https://issues.apache.org/jira/browse/SPARK-11081
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Mingyu Kim
>
> As seen from this thread 
> (https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62yD8H3=2KVMiFs7NZjn929oJ133JkPLrNEj=vrx-d2...@mail.gmail.com%3E),
>  Spark is incompatible with Jersey 2 especially when Spark is embedded in an 
> application running with Jersey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11085) Add support for HTTP proxy

2015-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954980#comment-14954980
 ] 

Sean Owen commented on SPARK-11085:
---

Dustin, just so I'm clear: do those alternatives work?

> Add support for HTTP proxy 
> ---
>
> Key: SPARK-11085
> URL: https://issues.apache.org/jira/browse/SPARK-11085
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Reporter: Dustin Cote
>
> Add a way to update ivysettings.xml for the spark-shell and spark-submit to 
> support proxy settings for clusters that need to access a remote repository 
> through an http proxy.  Typically this would be done like:
> JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080"
> Directly in the ivysettings.xml would look like:
> <setproxy proxyhost="proxy.host" proxyport="8080" nonproxyhosts="nonproxy.host"/>
> Even better would be a way to customize the ivysettings.xml with command 
> options.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11008) Spark window function returns inconsistent/wrong results

2015-10-13 Thread Johnathan Garrett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955035#comment-14955035
 ] 

Johnathan Garrett commented on SPARK-11008:
---

We have been seeing this issue as well since upgrading to Spark 1.5.1, with ORC 
and parquet files on HDFS.  When using window functions, the first set of 
results is usually incorrect after starting up a new application.  To verify the 
issue, I ran spark-shell and entered the above sequence of commands, but 
replaced the s3 path with a path on HDFS.  After setting up the parquet file, 
every time I restart spark-shell I get incorrect results on the first pass and 
correct results after that.

I am running spark-shell with:
spark-shell --master yarn-client --num-executors 4
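
For reference, step (5) of the quoted reproduction below with the S3 path 
replaced by an HDFS path (the path here is only a placeholder):
{code}
// Run in spark-shell after writing the small parquet file in a previous session.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import sqlContext.implicits._

val df = sqlContext.read.parquet("hdfs:///tmp/id-time-tiny.pqt")  // placeholder path
val win = Window.partitionBy("id").orderBy("time")
df.select($"id", rank().over(win).alias("rnk")).filter("rnk = 1").select("id").count()
{code}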

> Spark window function returns inconsistent/wrong results
> 
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>Reporter: Prasad Chalasani
>Priority: Minor
>
> Summary: applying a windowing function on a data-frame, followed by count() 
> gives widely varying results in repeated runs: none exceed the correct value, 
> but of course all but one are wrong. On large data-sets I sometimes get as 
> small as HALF of the correct value.
> A minimal reproducible example is here: 
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", 
> (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the code in step 5 in the same shell, then the result gets 
> "fixed" and I always get 100.
> Note this is only a minimal reproducible example to reproduce the error. In 
> my real application the size of the data is much larger and the window 
> function is not trivial as above (i.e. there are multiple "time" values per 
> "id", etc), and I see results sometimes as small as HALF of the correct value 
> (e.g. 120,000 while the correct value is 250,000). So this is a serious 
> problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11066) Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due to j.l.UnsupportedOperationException concerning a finished JobWaiter

2015-10-13 Thread Dr Stephen A Hellberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955085#comment-14955085
 ] 

Dr Stephen A Hellberg commented on SPARK-11066:
---

Thanks for the clarification, Sean.  And I've given my patch's comments a bit of 
a haircut... sorry, I probably err on the side of verbosity.
(Ahem, some would likely consider that a stylistic failure ;-) ).

I've also had a go at running the dev/lint-scala tool against the codebase with 
my proposed (revised) patch applied, and it now passes.
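
To make the single-task idea from the quoted description concrete, a hedged 
spark-shell sketch (the exact runJob overload differs between Spark versions, so 
treat the signature as an assumption):
{code}
// Parallelize into 2 partitions but run the job over only partition 0, so the
// scheduler creates exactly one task and only one failure path can win the race.
val rdd = sc.parallelize(1 to 10, 2)
val sums = sc.runJob(rdd, (it: Iterator[Int]) => it.sum, Seq(0))
{code}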

> Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler 
> occasionally fails due to j.l.UnsupportedOperationException concerning a 
> finished JobWaiter
> --
>
> Key: SPARK-11066
> URL: https://issues.apache.org/jira/browse/SPARK-11066
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, Tests
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: Multiple OS and platform types.
> (Also observed by others, e.g. see External URL)
>Reporter: Dr Stephen A Hellberg
>Priority: Minor
>
> The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent 
> problem: it creates a job for the DAGScheduler comprising multiple (2) tasks, 
> but whilst the job will fail and a SparkDriverExecutionException will be 
> returned, a race condition exists as to whether the first task's 
> (deliberately) thrown exception causes the job to fail - and having its 
> causing exception set to the DAGSchedulerSuiteDummyException that was thrown 
> as the setup of the misbehaving test - or second (and subsequent) tasks who 
> equally end, but have instead the DAGScheduler's legitimate 
> UnsupportedOperationException (a subclass of RuntimeException) returned 
> instead as their causing exception.  This race condition is likely associated 
> with the vagaries of processing quanta, and expense of throwing two 
> exceptions (under interpreter execution) per thread of control; this race is 
> usually 'won' by the first task throwing the DAGSchedulerDummyException, as 
> desired (and expected)... but not always.
> The problem for the testcase is that the first assertion is largely 
> concerning the test setup, and doesn't (can't? Sorry, still not a ScalaTest 
> expert) capture all the causes of SparkDriverExecutionException that can 
> legitimately arise from a correctly working (not crashed) DAGScheduler.  
> Arguably, this assertion might test something of the DAGScheduler... but not 
> all the possible outcomes for a working DAGScheduler.  Nevertheless, this 
> test - when comprising a multiple task job - will report as a failure when in 
> fact the DAGScheduler is working-as-designed (and not crashed ;-).  
> Furthermore, the test is already failed before it actually tries to use the 
> SparkContext a second time (for an arbitrary processing task), which I think 
> is the real subject of the test?
> The solution, I submit, is to ensure that the job is composed of just one 
> task, and that single task will result in the call to the compromised 
> ResultHandler causing the test's deliberate exception to be thrown and 
> exercising the relevant (DAGScheduler) code paths.  Given tasks are scoped by 
> the number of partitions of an RDD, this could be achieved with a single 
> partitioned RDD (indeed, doing so seems to exercise/would test some default 
> parallelism support of the TaskScheduler?); the pull request offered, 
> however, is based on the minimal change of just using a single partition of 
> the 2 (or more) partition parallelized RDD.  This will result in scheduling a 
> job of just one task, one successful task calling the user-supplied 
> compromised ResultHandler function, which results in failing the job and 
> unambiguously wrapping our DAGSchedulerSuiteException inside a 
> SparkDriverExecutionException; there are no other tasks that on running 
> successfully will find the job failed causing the 'undesired' 
> UnsupportedOperationException to be thrown instead.  This, then, satisfies 
> the test's setup assertion.
> I have tested this hypothesis having parametised the number of partitions, N, 
> used by the "misbehaved ResultHandler" job and have observed the 1 x 
> DAGSchedulerSuiteException first, followed by the legitimate N-1 x 
> UnsupportedOperationExceptions ... what propagates back from the job seems to 
> simply become the result of the race between task threads and the 
> intermittent failures observed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Commented] (SPARK-11008) Spark window function returns inconsistent/wrong results

2015-10-13 Thread Prasad Chalasani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955139#comment-14955139
 ] 

Prasad Chalasani commented on SPARK-11008:
--

Yes, that's exactly what I was seeing. You're seeing it with HDFS too, so I 
don't think S3 eventual consistency has anything to do with it. I'm starting 
to wonder if this could be a Spark issue after all.

> Spark window function returns inconsistent/wrong results
> 
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>Reporter: Prasad Chalasani
>Priority: Minor
>
> Summary: applying a windowing function on a data-frame, followed by count() 
> gives widely varying results in repeated runs: none exceed the correct value, 
> but of course all but one are wrong. On large data-sets I sometimes get as 
> small as HALF of the correct value.
> A minimal reproducible example is here: 
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", 
> (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the code in step 5 in the same shell, then the result gets 
> "fixed" and I always get 100.
> Note this is only a minimal reproducible example to reproduce the error. In 
> my real application the size of the data is much larger and the window 
> function is not trivial as above (i.e. there are multiple "time" values per 
> "id", etc), and I see results sometimes as small as HALF of the correct value 
> (e.g. 120,000 while the correct value is 250,000). So this is a serious 
> problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955308#comment-14955308
 ] 

Reynold Xin commented on SPARK-:


[~sandyr] I thought a lot about doing this on top of the existing RDD API for a 
while, and that was my preference. However, we would need to break the RDD API, 
which would break all existing applications.



> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result its execution is harder to 
> optimize in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. it is 
> harder to use UDFs, and there is a lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
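
To make the goal concrete, here is a rough, hedged sketch of the kind of usage 
the proposal describes, in spark-shell style; method names such as as[T] and 
the encoder behaviour are assumptions based on the linked proposal, not a 
confirmed final API.
{code}
// Hedged sketch of the intended usage, not a confirmed API.
case class Person(name: String, age: Int)

import sqlContext.implicits._

val df = sqlContext.read.json("people.json")   // schema known only at runtime
val people = df.as[Person]                     // fail-fast, typed view over the DataFrame
val adultNames = people
  .filter(_.age >= 18)                         // compile-time checked lambda
  .map(_.name)
val backToDf = adultNames.toDF()               // seamless trip back to a DataFrame
{code}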



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11067) Spark SQL thrift server fails to handle decimal value

2015-10-13 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955341#comment-14955341
 ] 

Alex Liu commented on SPARK-11067:
--

[~navis] Can you elaborate on what could be done to improve Hive JDBC handling 
of the big decimal type?

> Spark SQL thrift server fails to handle decimal value
> -
>
> Key: SPARK-11067
> URL: https://issues.apache.org/jira/browse/SPARK-11067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
> Attachments: SPARK-11067.1.patch.txt
>
>
> When executing the following query through beeline connected to the Spark SQL 
> thrift server, it errors out for the decimal column:
> {code}
> Select decimal_column from table
> WARN  2015-10-09 15:04:00 
> org.apache.hive.service.cli.thrift.ThriftCLIService: Error fetching results: 
> java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal
>   at 
> org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:174) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:144)
>  ~[spark-hive-thriftserver_2.10-1.4.1.1.jar:1.4.1.1]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>  [hive-service-0.13.1a.jar:4.8.1-SNAPSHOT]
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>  [libthrift-0.9.2.jar:0.9.2]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
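
A minimal sketch of the mismatch (an illustration only, not the attached 
patch): the Thrift server hands Hive's ColumnValue a java.math.BigDecimal, 
while the Hive 0.13 code path expects a HiveDecimal, hence the 
ClassCastException; converting explicitly avoids the cast.
{code}
// Illustration only; where exactly the conversion belongs depends on the fix.
import org.apache.hadoop.hive.common.type.HiveDecimal

val fromSpark: java.math.BigDecimal = new java.math.BigDecimal("12345.6789")
val forHive: HiveDecimal = HiveDecimal.create(fromSpark)  // explicit conversion instead of a blind cast
{code}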



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11079) Post-hoc review Netty based RPC implementation - round 1

2015-10-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11079.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Post-hoc review Netty based RPC implementation - round 1
> 
>
> Key: SPARK-11079
> URL: https://issues.apache.org/jira/browse/SPARK-11079
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>
> This is a task for Reynold to review the existing implementation done by 
> [~shixi...@databricks.com].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9844) File appender race condition during SparkWorker shutdown

2015-10-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9844:
---
Priority: Major  (was: Minor)

> File appender race condition during SparkWorker shutdown
> 
>
> Key: SPARK-9844
> URL: https://issues.apache.org/jira/browse/SPARK-9844
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Alex Liu
>
> We find this issue still exists in 1.3.1
> {code}
> ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 
> - Error writing stream to file 
> /var/lib/spark/worker/worker-0/app-20150728224954-0003/0/stderr
> ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 
> - java.io.IOException: Stream closed
> ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 
> -   at 
> java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) 
> ~[na:1.8.0_40]
> ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 
> -   at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) 
> ~[na:1.8.0_40]
> ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 
> -   at java.io.BufferedInputStream.read(BufferedInputStream.java:345) 
> ~[na:1.8.0_40]
> ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 
> -   at java.io.FilterInputStream.read(FilterInputStream.java:107) 
> ~[na:1.8.0_40]
> ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 
> -   at 
> org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
>  ~[spark-core_2.10-1.3.1.1.jar:1.3.1.1]
> ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 
> -   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
>  [spark-core_2.10-1.3.1.1.jar:1.3.1.1]
> ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 
> -   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
>  [spark-core_2.10-1.3.1.1.jar:1.3.1.1]
> ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 
> -   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
>  [spark-core_2.10-1.3.1.1.jar:1.3.1.1]
> ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 
> -   at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) 
> [spark-core_2.10-1.3.1.1.jar:1.3.1.1]
> ERROR [Thread-6] 2015-07-28 22:49:57,656 SparkWorker-0 ExternalLogger.java:96 
> -   at 
> org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) 
> [spark-core_2.10-1.3.1.1.jar:1.3.1.1]
> {code}
> at  
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L159
> The process shuts itself down, but the log appenders are still running, which 
> causes these error log messages.
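
For illustration, a minimal sketch (not Spark's FileAppender; all names here 
are hypothetical) of the hazard being described: a draining thread keeps 
reading the child process's stderr after shutdown has closed the stream, so 
read() throws "Stream closed". Guarding the loop with a shutdown flag and 
treating an IOException after shutdown as benign is one way to avoid the 
spurious ERROR lines.
{code}
// Hedged sketch, not the actual fix.
import java.io.{IOException, InputStream}

class StreamDrainerSketch(in: InputStream) {
  @volatile private var stopped = false

  private val thread = new Thread(new Runnable {
    override def run(): Unit = {
      val buf = new Array[Byte](8192)
      try {
        var n = in.read(buf)
        while (n != -1 && !stopped) {
          // ... append buf(0 until n) to the log file here ...
          n = in.read(buf)
        }
      } catch {
        case _: IOException if stopped => ()        // stream closed during shutdown: expected, stay quiet
        case e: IOException => e.printStackTrace()  // anything else is a real error
      }
    }
  })
  thread.setDaemon(true)
  thread.start()

  // shut down the drainer before (or while) the child process is torn down
  def stop(): Unit = { stopped = true; in.close(); thread.join() }
}
{code}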



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955304#comment-14955304
 ] 

Sean Owen commented on SPARK-:
--

I had a similar question about how much more this is than the current RDD API. 
For example, is the idea that, with the help of caller-provided annotations 
and/or some code analysis, you could perhaps deduce more about operations and 
optimize them further? A lot of the API already covers the basics, like 
assuming reduce functions are associative, etc.

I get transformations on domain objects in the style of Spark SQL, but I can 
already "groupBy(customer.name)" in a normal RDD.
I can also move fairly easily from DataFrames to RDDs and back.

So is this mainly about static analysis of user functions?
Or about getting to/from a Row faster?

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result its execution is harder to 
> optimize in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. it is 
> harder to use UDFs, and there is a lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11084) SparseVector.__getitem__ should check if value can be non-zero before executing searchsorted

2015-10-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11084:
--
Shepherd: Joseph K. Bradley
Assignee: Maciej Szymkiewicz
Target Version/s: 1.6.0

> SparseVector.__getitem__ should check if value can be non-zero before 
> executing searchsorted
> 
>
> Key: SPARK-11084
> URL: https://issues.apache.org/jira/browse/SPARK-11084
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
>
> At the moment SparseVector.\_\_getitem\_\_ executes np.searchsorted first 
> and only afterwards checks whether the result is in the expected range:
> {code}
> insert_index = np.searchsorted(inds, index)
> if insert_index >= inds.size:
> return 0.
> row_ind = inds[insert_index]
> ...
> {code}
> See: https://issues.apache.org/jira/browse/SPARK-10973
> It is possible to check whether the index can hold a non-zero value before 
> running the binary search: 
> {code}
> if (inds.size == 0) or (index > inds.item(-1)):
> return 0.
> insert_index = np.searchsorted(inds, index)
> row_ind = inds[insert_index]
> ...
> {code}
> It is not a huge improvement but should save some work on large vectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10913) Add attach() function for DataFrame

2015-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10913:
--
Assignee: Weiqiang Zhuang

> Add attach() function for DataFrame
> ---
>
> Key: SPARK-10913
> URL: https://issues.apache.org/jira/browse/SPARK-10913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Weiqiang Zhuang
>Assignee: Weiqiang Zhuang
>Priority: Minor
>
> Need an R-like attach() API: "Attach Set of R Objects to Search Path"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11086) createDataFrame should dropFactor column-wise not cell-wise

2015-10-13 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-11086:
--

 Summary: createDataFrame should dropFactor column-wise not 
cell-wise 
 Key: SPARK-11086
 URL: https://issues.apache.org/jira/browse/SPARK-11086
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Maciej Szymkiewicz


At the moment SparkR {{createDataFrame}} [uses a nested 
loop|https://github.com/apache/spark/blob/896edb51ab7a88bbb31259e565311a9be6f2ca6d/R/pkg/R/SQLContext.R#L99]
 to convert {{factors}} to {{character}} when called on a local {{data.frame}}.

{code}
data <- lapply(1:n, function(i) {
lapply(1:m, function(j) { dropFactor(data[i,j]) })
})
{code}

It works but is incredibly slow, especially with {{data.table}} (~2 orders of 
magnitude slower than the PySpark / Pandas version on a DataFrame of 1M rows 
x 2 columns).

A simple improvement is to apply {{dropFactor}} column-wise and then reshape 
the output list:

{code}
args <- list(FUN=list, SIMPLIFY=FALSE, USE.NAMES=FALSE)  
data <- do.call(mapply, append(args, setNames(lapply(data, dropFactor), NULL)))
{code}

It should at least partially address 
[SPARK-8277|https://issues.apache.org/jira/browse/SPARK-8277].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


