[jira] [Commented] (SPARK-1902) Spark shell prints error when :4040 port already in use
[ https://issues.apache.org/jira/browse/SPARK-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647623#comment-14647623 ] Eugene Morozov commented on SPARK-1902: --- It looks like the package name has changed since then, and log4j.properties now needs a different logger name to turn this off: {noformat} log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR {noformat} I'm not sure what I should do: 1. Reopen this issue 2. Create a new one 3. Or decide it's not important enough to make this change. Please suggest. Spark shell prints error when :4040 port already in use --- Key: SPARK-1902 URL: https://issues.apache.org/jira/browse/SPARK-1902 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Assignee: Andrew Ash Fix For: 1.1.0 When running two shells on the same machine, I get the below error. The issue is that the first shell takes port 4040, then the next tries 4040 and fails so falls back to 4041, then a third would try 4040 and 4041 before landing on 4042, etc. We should catch the error and instead log it as "Unable to use port 4041; already in use. Attempting port 4042..."
{noformat}
14/05/22 11:31:54 WARN component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4041: java.net.BindException: Address already in use
java.net.BindException: Address already in use
 at sun.nio.ch.Net.bind0(Native Method)
 at sun.nio.ch.Net.bind(Net.java:444)
 at sun.nio.ch.Net.bind(Net.java:436)
 at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
 at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
 at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
 at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
 at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
 at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
 at org.eclipse.jetty.server.Server.doStart(Server.java:293)
 at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
 at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
 at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
 at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
 at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
 at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:217)
 at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
 at $line3.$read$$iwC$$iwC.<init>(<console>:8)
 at $line3.$read$$iwC.<init>(<console>:14)
 at $line3.$read.<init>(<console>:16)
 at $line3.$read$.<init>(<console>:20)
 at $line3.$read$.<clinit>(<console>)
 at $line3.$eval$.<init>(<console>:7)
 at $line3.$eval$.<clinit>(<console>)
 at $line3.$eval.$print(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
 at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
 at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
 at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
 at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
 at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
 at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
 at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
 at
{noformat}
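A minimal Scala sketch of the retry behavior the description asks for (not Spark's actual JettyUtils code; the function name and the println logging are illustrative only): try successive ports and emit a one-line message instead of a stack trace when a port is taken.
{code}
import java.net.BindException

// Try basePort, basePort + 1, ... until bind succeeds or maxRetries is
// exhausted (at which point the BindException propagates). Returns the
// port that was successfully bound.
def startOnFreePort(basePort: Int, maxRetries: Int)(bind: Int => Unit): Int = {
  var port = basePort
  var bound = false
  while (!bound) {
    try {
      bind(port)
      bound = true
    } catch {
      case _: BindException if port < basePort + maxRetries =>
        println(s"Unable to use port $port; already in use. Attempting port ${port + 1}...")
        port += 1
    }
  }
  port
}
{code}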
[jira] [Created] (SPARK-9475) Consistent hadoop config for external/*
Cody Koeninger created SPARK-9475: - Summary: Consistent hadoop config for external/* Key: SPARK-9475 URL: https://issues.apache.org/jira/browse/SPARK-9475 Project: Spark Issue Type: Sub-task Reporter: Cody Koeninger Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9473) Consistent hadoop config for SQL
Cody Koeninger created SPARK-9473: - Summary: Consistent hadoop config for SQL Key: SPARK-9473 URL: https://issues.apache.org/jira/browse/SPARK-9473 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cody Koeninger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9380: - Assignee: Alexander Ulanov Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Assignee: Alexander Ulanov Fix For: 1.4.0 The Pregel operator example expressing single-source shortest paths does not work due to an incorrect graph type: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
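For reference, a condensed sketch of the guide's single-source shortest path example with the corrected type (assumes an active SparkContext named sc):
{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

// Edge attributes hold distances; vertex IDs are Longs, so the graph is
// Graph[Long, Double], not Graph[Int, Double] as the guide previously said.
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vertex program
  triplet =>                                       // send message
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    },
  (a, b) => math.min(a, b))                        // merge messages
{code}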
[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files
[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647630#comment-14647630 ] Liang-Chi Hsieh commented on SPARK-9347: It will merge different schemas if the parquet schema merging configuration is enabled. spark load of existing parquet files extremely slow if large number of files Key: SPARK-9347 URL: https://issues.apache.org/jira/browse/SPARK-9347 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Samphel Norden When spark sql shell is launched and we point it to a folder containing a large number of parquet files, the sqlContext.parquetFile() command takes a very long time to load the tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9472) Consistent hadoop config for streaming
[ https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9472: --- Assignee: (was: Apache Spark) Consistent hadoop config for streaming -- Key: SPARK-9472 URL: https://issues.apache.org/jira/browse/SPARK-9472 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Cody Koeninger Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9472) Consistent hadoop config for streaming
[ https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9472: --- Assignee: Apache Spark Consistent hadoop config for streaming -- Key: SPARK-9472 URL: https://issues.apache.org/jira/browse/SPARK-9472 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Cody Koeninger Assignee: Apache Spark Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9447) Update python API to include RandomForest as classifier changes.
[ https://issues.apache.org/jira/browse/SPARK-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9447: - Component/s: PySpark MLlib Update python API to include RandomForest as classifier changes. Key: SPARK-9447 URL: https://issues.apache.org/jira/browse/SPARK-9447 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: holdenk The API should still work after SPARK-9016-make-random-forest-classifiers-implement-classification-trait gets merged in, but we might want to extend it to provide predictRaw and similar methods in the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9377) Shuffle tuning should discuss task size optimisation
[ https://issues.apache.org/jira/browse/SPARK-9377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647599#comment-14647599 ] Jem Tucker commented on SPARK-9377: --- Yes I will do Shuffle tuning should discuss task size optimisation Key: SPARK-9377 URL: https://issues.apache.org/jira/browse/SPARK-9377 Project: Spark Issue Type: Documentation Components: Documentation, Shuffle Reporter: Jem Tucker Priority: Minor Recent issue SPARK-9310 highlighted the negative effects of overly high parallelism caused by task overhead. Although large task counts are unavoidable with high volumes of data, more detail in the documentation would be very beneficial to newcomers optimising the performance of their applications. Areas to discuss could be: - What are the overheads of a Spark task? -- Does this overhead change with task size etc? - How to dynamically calculate a suitable parallelism for a Spark job - Examples of designing code to minimise shuffles -- How to minimise the data volumes when shuffles are required - Differences between sort-based and hash-based shuffles -- Benefits and weaknesses of each -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files
[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647606#comment-14647606 ] Samphel Norden commented on SPARK-9347: --- One additional question: assuming the schema does evolve, and we have folder 1 and folder 2 each with a different _common_metadata file that represents the schema evolution, will spark merge the 2 different _common_metadata files, or would this not work? spark load of existing parquet files extremely slow if large number of files Key: SPARK-9347 URL: https://issues.apache.org/jira/browse/SPARK-9347 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Samphel Norden When spark sql shell is launched and we point it to a folder containing a large number of parquet files, the sqlContext.parquetFile() command takes a very long time to load the tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9472) Consistent hadoop config for streaming
Cody Koeninger created SPARK-9472: - Summary: Consistent hadoop config for streaming Key: SPARK-9472 URL: https://issues.apache.org/jira/browse/SPARK-9472 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Cody Koeninger Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9276) ThriftServer process can't stop if using command yarn application -kill appid
[ https://issues.apache.org/jira/browse/SPARK-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9276. -- Resolution: Not A Problem Reopen if someone can explain in more detail what the problem is ThriftServer process can't stop if using command yarn application -kill appid --- Key: SPARK-9276 URL: https://issues.apache.org/jira/browse/SPARK-9276 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula Reproduction Steps: 1. start the thriftserver 2. use beeline to connect to the thriftserver 3. use the command "yarn application -kill appid" or the YARN web UI to kill the thriftserver's application 4. the ApplicationMaster has stopped, but the driver process will always be there Reproduction Condition: There must be a client connected to the thriftserver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9248) Closing curly-braces should always be on their own line
[ https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9248: --- Assignee: (was: Apache Spark) Closing curly-braces should always be on their own line --- Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor Closing curly-braces should always be on their own line For example, {noformat} inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always be on their own line, unless it's followed by an else. }, error = function(err) { ^ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9248) Closing curly-braces should always be on their own line
[ https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9248: --- Assignee: Apache Spark Closing curly-braces should always be on their own line --- Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Apache Spark Priority: Minor Closing curly-braces should always be on their own line For example, {noformat} inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always be on their own line, unless it's followed by an else. }, error = function(err) { ^ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9248) Closing curly-braces should always be on their own line
[ https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647512#comment-14647512 ] Apache Spark commented on SPARK-9248: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/7795 Closing curly-braces should always be on their own line --- Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor Closing curly-braces should always be on their own line For example, {noformat} inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always be on their own line, unless it's followed by an else. }, error = function(err) { ^ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8978) Implement the DirectKafkaController
[ https://issues.apache.org/jira/browse/SPARK-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8978: --- Assignee: Apache Spark Implement the DirectKafkaController --- Key: SPARK-8978 URL: https://issues.apache.org/jira/browse/SPARK-8978 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Iulian Dragos Assignee: Apache Spark Fix For: 1.5.0 Based on this [design doc|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing]. The DirectKafkaInputDStream should use the rate estimate to control how many records/partition to put in the next batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files
[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647634#comment-14647634 ] Samphel Norden commented on SPARK-9347: --- I am trying to get spark to only look at _common_metadata files for 2 different schemas. But if the new option (respect.summarymetadata?) is turned on, would it merge based on the different _common_metadata files, or would it have to be disabled so that we use regular part-file merging? spark load of existing parquet files extremely slow if large number of files Key: SPARK-9347 URL: https://issues.apache.org/jira/browse/SPARK-9347 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Samphel Norden When spark sql shell is launched and we point it to a folder containing a large number of parquet files, the sqlContext.parquetFile() command takes a very long time to load the tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8978) Implement the DirectKafkaController
[ https://issues.apache.org/jira/browse/SPARK-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647631#comment-14647631 ] Apache Spark commented on SPARK-8978: - User 'dragos' has created a pull request for this issue: https://github.com/apache/spark/pull/7796 Implement the DirectKafkaController --- Key: SPARK-8978 URL: https://issues.apache.org/jira/browse/SPARK-8978 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Iulian Dragos Fix For: 1.5.0 Based on this [design doc|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing]. The DirectKafkaInputDStream should use the rate estimate to control how many records/partition to put in the next batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8978) Implement the DirectKafkaController
[ https://issues.apache.org/jira/browse/SPARK-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8978: --- Assignee: (was: Apache Spark) Implement the DirectKafkaController --- Key: SPARK-8978 URL: https://issues.apache.org/jira/browse/SPARK-8978 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Iulian Dragos Fix For: 1.5.0 Based on this [design doc|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing]. The DirectKafkaInputDStream should use the rate estimate to control how many records/partition to put in the next batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9472) Consistent hadoop config for streaming
[ https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647665#comment-14647665 ] Apache Spark commented on SPARK-9472: - User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/7772 Consistent hadoop config for streaming -- Key: SPARK-9472 URL: https://issues.apache.org/jira/browse/SPARK-9472 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Cody Koeninger Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4666) executor.memoryOverhead config should take a memory string
[ https://issues.apache.org/jira/browse/SPARK-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4666. -- Resolution: Won't Fix I think this timed out and/or got subsumed in another JIRA executor.memoryOverhead config should take a memory string -- Key: SPARK-4666 URL: https://issues.apache.org/jira/browse/SPARK-4666 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams This config value currently takes an integer number of megabytes, but it should also be able to parse strings like 1g, the way several other config params do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9377) Shuffle tuning should discuss task size optimisation
[ https://issues.apache.org/jira/browse/SPARK-9377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jem Tucker updated SPARK-9377: -- Comment: was deleted (was: Yes I will do) Shuffle tuning should discuss task size optimisation Key: SPARK-9377 URL: https://issues.apache.org/jira/browse/SPARK-9377 Project: Spark Issue Type: Documentation Components: Documentation, Shuffle Reporter: Jem Tucker Priority: Minor Recent issue SPARK-9310 highlighted the negative effects of overly high parallelism caused by task overhead. Although large task counts are unavoidable with high volumes of data, more detail in the documentation would be very beneficial to newcomers optimising the performance of their applications. Areas to discuss could be: - What are the overheads of a Spark task? -- Does this overhead change with task size etc? - How to dynamically calculate a suitable parallelism for a Spark job - Examples of designing code to minimise shuffles -- How to minimise the data volumes when shuffles are required - Differences between sort-based and hash-based shuffles -- Benefits and weaknesses of each -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9474) Consistent hadoop config for core
Cody Koeninger created SPARK-9474: - Summary: Consistent hadoop config for core Key: SPARK-9474 URL: https://issues.apache.org/jira/browse/SPARK-9474 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Cody Koeninger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9377) Shuffle tuning should discuss task size optimisation
[ https://issues.apache.org/jira/browse/SPARK-9377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647502#comment-14647502 ] Sean Owen commented on SPARK-9377: -- [~jem.tucker] do you want to open a PR that implements these? Shuffle tuning should discuss task size optimisation Key: SPARK-9377 URL: https://issues.apache.org/jira/browse/SPARK-9377 Project: Spark Issue Type: Documentation Components: Documentation, Shuffle Reporter: Jem Tucker Priority: Minor Recent issue SPARK-9310 highlighted the negative effects of overly high parallelism caused by task overhead. Although large task counts are unavoidable with high volumes of data, more detail in the documentation would be very beneficial to newcomers optimising the performance of their applications. Areas to discuss could be: - What are the overheads of a Spark task? -- Does this overhead change with task size etc? - How to dynamically calculate a suitable parallelism for a Spark job - Examples of designing code to minimise shuffles -- How to minimise the data volumes when shuffles are required - Differences between sort-based and hash-based shuffles -- Benefits and weaknesses of each -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9476) Kafka stream loses leader after 2h of operation
Ruben Ramalho created SPARK-9476: Summary: Kafka stream loses leader after 2h of operation Key: SPARK-9476 URL: https://issues.apache.org/jira/browse/SPARK-9476 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.1 Environment: Docker, Centos, Spark standalone, core i7, 8Gb Reporter: Ruben Ramalho This seems to happen every 2h, it happens both with the direct stream and the regular stream, and I'm doing window operations over a 1h period (if that can help). Here's part of the error message:
{noformat}
2015-07-30 13:27:23 WARN ClientUtils$:89 - Fetching topic metadata with correlation id 10 for topics [Set(updates)] from broker [id:0,host:192.168.3.23,port:3000] failed
java.nio.channels.ClosedChannelException
 at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
 at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
 at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
 at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
 at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
 at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
 at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
2015-07-30 13:27:23 INFO SyncProducer:68 - Disconnecting from 192.168.3.23:3000
2015-07-30 13:27:23 WARN ConsumerFetcherManager$LeaderFinderThread:89 - [spark-group_81563e123e9f-1438259236988-fc3d82bf-leader-finder-thread], Failed to find leader for Set([updates,0])
kafka.common.KafkaException: fetching topic metadata for topics [Set(oversight-updates)] from broker [ArrayBuffer(id:0,host:192.168.3.23,port:3000)] failed
 at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:72)
 at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
 at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
Caused by: java.nio.channels.ClosedChannelException
 at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
 at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
 at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
 at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
 at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
{noformat}
After the crash I tried to communicate with kafka using a simple scala consumer and producer and had no problems at all. Spark though needs a kafka container restart to resume normal operation. There are no errors in the kafka log, apart from an improperly closed connection. I have been trying to solve this problem for days; I suspect this has something to do with spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
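For context, a hedged sketch of the kind of setup being described (broker address and topic are taken from the report; the batch interval, slide interval, and checkpoint path are assumptions, not from the issue):
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct Kafka stream with a 1h window, roughly matching the report.
val conf = new SparkConf().setAppName("kafka-window")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/kafka-window-checkpoint") // required by windowed counts
val kafkaParams = Map("metadata.broker.list" -> "192.168.3.23:3000")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("updates"))
val counts = stream.map(_._2).countByWindow(Minutes(60), Minutes(5))
counts.print()
ssc.start()
ssc.awaitTermination()
{code}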
[jira] [Created] (SPARK-9478) Add class weights to Random Forest
Patrick Crenshaw created SPARK-9478: --- Summary: Add class weights to Random Forest Key: SPARK-9478 URL: https://issues.apache.org/jira/browse/SPARK-9478 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.1 Reporter: Patrick Crenshaw Currently, this implementation of random forest does not support class weights. Class weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647767#comment-14647767 ] Stacy Pedersen commented on SPARK-9477: --- Hi Sean, can we not just list it as a Cluster Manager type? For example in http://spark.apache.org/docs/latest/cluster-overview.html - and point to the IBM Knowledge Center? You guys don't have to document it, just list our product as a type since you list Mesos and YARN. Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647767#comment-14647767 ] Stacy Pedersen edited comment on SPARK-9477 at 7/30/15 3:16 PM: Hi Sean, can we not just list it as a Cluster Manager type? For example in http://spark.apache.org/docs/latest/cluster-overview.html - and point to the IBM Knowledge Center? It doesn't have to be documented again, maybe just have our product listed as a type since you list Mesos and YARN. was (Author: stacyp): Hi Sean, can we not just list it as a Cluster Manager type? For example in http://spark.apache.org/docs/latest/cluster-overview.html - and point to the IBM Knowledge Center? You guys don't have to document it, just list our product as a type since you list Mesos and YARN. Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9479) ReceiverTrackerSuite fails for maven build
[ https://issues.apache.org/jira/browse/SPARK-9479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647836#comment-14647836 ] Apache Spark commented on SPARK-9479: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7797 ReceiverTrackerSuite fails for maven build -- Key: SPARK-9479 URL: https://issues.apache.org/jira/browse/SPARK-9479 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Shixiong Zhu The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3109/ I saw the following exception in the log: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1297) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834) {code} This exception is because SparkEnv.get returns null. I found the maven build is different from the sbt build. The maven build will create all Suite classes at the beginning. `ReceiverTrackerSuite` creates StreamingContext (SparkContext) in the constructor. That means SparkContext is created very early. And the global SparkEnv will be set to null in the previous test. Therefore we saw the above exception when running `Receiver tracker - propagates rate limit` in `ReceiverTrackerSuite`. This test was added recently. Note: the previous tests in `ReceiverTrackerSuite` didn't use SparkContext actually, that's why we didn't see such failure before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647703#comment-14647703 ] Stacy Pedersen commented on SPARK-9477: --- Here is a link to the IBM Knowledge Center with info on Platform Application Service Controller - http://www-01.ibm.com/support/knowledgecenter/SS3MQL/product_welcome_asc.html Here is how we currently integrate with Spark - http://www-01.ibm.com/support/knowledgecenter/SS3MQL_1.1.0/manage_resources/spark_overview.dita Here is the link to a free trial version of Platform Application Service Controller: https://www-01.ibm.com/marketing/iwm/iwm/web/preLogin.do?source=eipasc Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647748#comment-14647748 ] Sean Owen commented on SPARK-9477: -- The usual question is: does this need to live in the Spark docs, if it doesn't live in Spark? This sounds like something that's perfectly well documented already. Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9478) Add class weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647772#comment-14647772 ] Patrick Crenshaw commented on SPARK-9478: - Similar to this ticket for Logistic Regression https://issues.apache.org/jira/browse/SPARK-7685 and this one for SVMWithSGD https://issues.apache.org/jira/browse/SPARK-3246 Add class weights to Random Forest -- Key: SPARK-9478 URL: https://issues.apache.org/jira/browse/SPARK-9478 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.1 Reporter: Patrick Crenshaw Currently, this implementation of random forest does not support class weights. Class weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8998) Collect enough frequent prefixes before projection in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8998. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7783 [https://github.com/apache/spark/pull/7783] Collect enough frequent prefixes before projection in PrefixSpan Key: SPARK-8998 URL: https://issues.apache.org/jira/browse/SPARK-8998 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Zhang JiaJin Fix For: 1.5.0 Original Estimate: 48h Remaining Estimate: 48h The implementation in SPARK-6487 might have scalability issues when the number of frequent items is very small. In this case, we can generate candidate sets of higher orders using Apriori-like algorithms and count them, until we collect enough prefixes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9479) ReceiverTrackerSuite fails for maven build
[ https://issues.apache.org/jira/browse/SPARK-9479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9479: --- Assignee: Apache Spark ReceiverTrackerSuite fails for maven build -- Key: SPARK-9479 URL: https://issues.apache.org/jira/browse/SPARK-9479 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Shixiong Zhu Assignee: Apache Spark The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3109/ I saw the following exception in the log: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1297) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834) {code} This exception is because SparkEnv.get returns null. I found the maven build is different from the sbt build. The maven build will create all Suite classes at the beginning. `ReceiverTrackerSuite` creates StreamingContext (SparkContext) in the constructor. That means SparkContext is created very early. And the global SparkEnv will be set to null in the previous test. Therefore we saw the above exception when running `Receiver tracker - propagates rate limit` in `ReceiverTrackerSuite`. This test was added recently. Note: the previous tests in `ReceiverTrackerSuite` didn't use SparkContext actually, that's why we didn't see such failure before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647741#comment-14647741 ] Yanbo Liang commented on SPARK-6885: [~josephkb] I created a new version of InformationGainStats called ImpurityStats. It stores information gain, impurity, and prediction-related data all in one data structure, which makes LearningNode simpler. Meanwhile it simplifies and optimizes the binsToBestSplit function. I will fix some trivial issues after your review. It looks like a code refactor in a way. Decision trees: predict class probabilities --- Key: SPARK-6885 URL: https://issues.apache.org/jira/browse/SPARK-6885 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Yanbo Liang Under spark.ml, have DecisionTreeClassifier (currently being added) extend ProbabilisticClassifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647767#comment-14647767 ] Stacy Pedersen edited comment on SPARK-9477 at 7/30/15 3:24 PM: Hi Sean, can we not just list it as a Cluster Manager type? For example in http://spark.apache.org/docs/latest/cluster-overview.html - and point to the IBM Knowledge Center? It doesn't have to be documented again, maybe just have our product listed as a type since it lists Mesos and YARN. Just a thought :) was (Author: stacyp): Hi Sean, can we not just list it as a Cluster Manager type? For example in http://spark.apache.org/docs/latest/cluster-overview.html - and point to the IBM Knowledge Center? It doesn't have to be documented again, maybe just have our product listed as a type since you list Mesos and YARN. Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9479) ReceiverTrackerSuite fails for maven build
Shixiong Zhu created SPARK-9479: --- Summary: ReceiverTrackerSuite fails for maven build Key: SPARK-9479 URL: https://issues.apache.org/jira/browse/SPARK-9479 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Shixiong Zhu The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3109/ I saw the following exception in the log: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1297) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834) {code} This exception is because SparkEnv.get returns null. I found the maven build is different from the sbt build. The maven build will create all Suite classes at the beginning. `ReceiverTrackerSuite` creates StreamingContext (SparkContext) in the constructor. That means SparkContext is created very early. And the global SparkEnv will be set to null in the previous test. Therefore we saw the above exception when running `Receiver tracker - propagates rate limit` in `ReceiverTrackerSuite`. This test was added recently. Note: the previous tests in `ReceiverTrackerSuite` didn't use SparkContext actually, that's why we didn't see such failure before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
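A hedged sketch of the fix direction implied by the analysis above: create the StreamingContext per test rather than in the suite constructor, so no SparkContext (and hence no stale SparkEnv) exists before a test actually runs. The suite and test names are illustrative, not Spark's actual test code.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.scalatest.{BeforeAndAfter, FunSuite}

class ExampleTrackerSuite extends FunSuite with BeforeAndAfter {
  private var ssc: StreamingContext = _

  // Constructing the StreamingContext here instead of in the class body
  // means nothing touches SparkEnv until the test itself executes.
  before {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ExampleTrackerSuite")
    ssc = new StreamingContext(conf, Seconds(1))
  }

  after {
    if (ssc != null) {
      ssc.stop()
      ssc = null
    }
  }

  test("receiver tracker test body goes here") {
    assert(ssc != null)
  }
}
{code}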
[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files
[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647673#comment-14647673 ] Liang-Chi Hsieh commented on SPARK-9347: Actually the newly introduced configuration works only if the parquet schema merging configuration is enabled. So you need to turn both on. spark load of existing parquet files extremely slow if large number of files Key: SPARK-9347 URL: https://issues.apache.org/jira/browse/SPARK-9347 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Samphel Norden When spark sql shell is launched and we point it to a folder containing a large number of parquet files, the sqlContext.parquetFile() command takes a very long time to load the tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
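For concreteness, a hedged sketch of turning both settings on (the option names are assumed from the Spark 1.5-era SQLConf; verify them against your version, and note sc is an existing SparkContext):
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Enable schema merging across files/folders.
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
// Trust summary files (_metadata / _common_metadata) instead of reading
// every part-file footer, which is what makes large folders slow.
sqlContext.setConf("spark.sql.parquet.respectSummaryFiles", "true")
val df = sqlContext.read.parquet("/path/to/parquet/folder")
{code}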
[jira] [Resolved] (SPARK-5561) Generalize PeriodicGraphCheckpointer for RDDs
[ https://issues.apache.org/jira/browse/SPARK-5561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5561. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7728 [https://github.com/apache/spark/pull/7728] Generalize PeriodicGraphCheckpointer for RDDs - Key: SPARK-5561 URL: https://issues.apache.org/jira/browse/SPARK-5561 Project: Spark Issue Type: Improvement Components: GraphX, MLlib, Spark Core Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.5.0 PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it could be generalized to work with both Graphs and RDDs. It should be generalized and moved out of MLlib. (For those who are not familiar with it, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of Graphs.) A generalized version might be immediately useful for: * RandomForest * Streaming * GLMs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
Stacy Pedersen created SPARK-9477: - Summary: Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7368) add QR decomposition for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7368. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 5909 [https://github.com/apache/spark/pull/5909] add QR decomposition for RowMatrix -- Key: SPARK-7368 URL: https://issues.apache.org/jira/browse/SPARK-7368 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Assignee: yuhao yang Fix For: 1.5.0 Original Estimate: 48h Remaining Estimate: 48h Add QR decomposition for RowMatrix. There's a great distributed algorithm for QR decomposition, which I'm currently referring to. Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
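A hedged usage sketch of the new API (assuming it is exposed as RowMatrix.tallSkinnyQR(computeQ), returning a QRDecomposition with factors Q and R, and that an active SparkContext named sc is available):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A small tall-and-skinny matrix: many rows, few columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0),
  Vectors.dense(5.0, 6.0)))
val mat = new RowMatrix(rows)
val qr = mat.tallSkinnyQR(computeQ = true)
println(qr.R) // local upper-triangular factor; qr.Q is a distributed RowMatrix
{code}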
[jira] [Assigned] (SPARK-9479) ReceiverTrackerSuite fails for maven build
[ https://issues.apache.org/jira/browse/SPARK-9479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9479: --- Assignee: (was: Apache Spark) ReceiverTrackerSuite fails for maven build -- Key: SPARK-9479 URL: https://issues.apache.org/jira/browse/SPARK-9479 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Shixiong Zhu The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3109/ I saw the following exception in the log: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1297) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834) {code} This exception is because SparkEnv.get returns null. I found the maven build is different from the sbt build. The maven build will create all Suite classes at the beginning. `ReceiverTrackerSuite` creates StreamingContext (SparkContext) in the constructor. That means SparkContext is created very early. And the global SparkEnv will be set to null in the previous test. Therefore we saw the above exception when running `Receiver tracker - propagates rate limit` in `ReceiverTrackerSuite`. This test was added recently. Note: the previous tests in `ReceiverTrackerSuite` didn't use SparkContext actually, that's why we didn't see such failure before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9469) TungstenSort should not do safe -> unsafe conversion itself
[ https://issues.apache.org/jira/browse/SPARK-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648309#comment-14648309 ] Apache Spark commented on SPARK-9469: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7803 TungstenSort should not do safe -> unsafe conversion itself --- Key: SPARK-9469 URL: https://issues.apache.org/jira/browse/SPARK-9469 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical TungstenSort itself assumes input rows are safe rows, and uses a projection to turn the safe rows into UnsafeRows. We should take that part of the logic out of TungstenSort, and let the planner take care of the conversion. In that case, if the input is UnsafeRow already, no conversion is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9454) LDASuite should use vector comparisons
[ https://issues.apache.org/jira/browse/SPARK-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9454: - Assignee: Feynman Liang LDASuite should use vector comparisons -- Key: SPARK-9454 URL: https://issues.apache.org/jira/browse/SPARK-9454 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.5.0 {{LDASuite}}'s "OnlineLDAOptimizer one iteration" test currently compares correctness using hacky string comparisons. We should compare the vectors instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9487: - Target Version/s: 1.5.0 Use the same num. worker threads in Scala/Python unit tests --- Key: SPARK-9487 URL: https://issues.apache.org/jira/browse/SPARK-9487 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, SQL, Tests Affects Versions: 1.5.0 Reporter: Xiangrui Meng In Python we use `local[4]` for unit tests, while in Scala/Java we use `local[2]` and `local` for some unit tests in SQL, MLlib, and other components. If an operation depends on partition IDs, e.g., a random number generator, this will lead to different results in Python and Scala/Java. It would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
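A minimal sketch of the inconsistency described above (assumed test code, not from the issue): logic seeded by partition ID returns different results under local[2] and local[4], because the same elements land in different partitions.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// With setMaster("local[2]") instead, the partition IDs (and thus the
// per-partition random seeds) assigned to each element would differ,
// producing a different collected result for identical input.
val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("unit-test"))
val seededByPartition = sc.parallelize(1 to 8).mapPartitionsWithIndex { (pid, iter) =>
  val rng = new scala.util.Random(pid) // seed depends on partition ID
  iter.map(x => x + rng.nextInt(10))
}.collect()
{code}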
[jira] [Created] (SPARK-9484) Word2Vec import/export for original binary format
Joseph K. Bradley created SPARK-9484: Summary: Word2Vec import/export for original binary format Key: SPARK-9484 URL: https://issues.apache.org/jira/browse/SPARK-9484 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor It would be nice to add model import/export for Word2Vec which handles the original binary format used by [https://code.google.com/p/word2vec/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
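For illustration, a hedged sketch of an exporter for that binary format, assuming the layout used by the original tool (an ASCII header "vocabSize vectorSize", then each token followed by a space and vectorSize little-endian 4-byte floats); saveWord2VecBinary is a hypothetical helper, not an existing MLlib API:
{code}
import java.io.{BufferedOutputStream, FileOutputStream}
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

def saveWord2VecBinary(path: String, vectors: Map[String, Array[Float]]): Unit = {
  val out = new BufferedOutputStream(new FileOutputStream(path))
  try {
    val vectorSize = vectors.headOption.map(_._2.length).getOrElse(0)
    // Header: vocabulary size and vector dimension.
    out.write(s"${vectors.size} $vectorSize\n".getBytes(StandardCharsets.UTF_8))
    vectors.foreach { case (word, vec) =>
      out.write(s"$word ".getBytes(StandardCharsets.UTF_8))
      // The original format stores floats little-endian.
      val buf = ByteBuffer.allocate(4 * vec.length).order(ByteOrder.LITTLE_ENDIAN)
      vec.foreach(buf.putFloat)
      out.write(buf.array())
      out.write('\n')
    }
  } finally {
    out.close()
  }
}
{code}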
[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648180#comment-14648180 ] Joseph K. Bradley commented on SPARK-5692: -- This was not done, but thanks for the reminder; it'd be nice to add. I'll make and link a JIRA for it. Model import/export for Word2Vec Key: SPARK-5692 URL: https://issues.apache.org/jira/browse/SPARK-5692 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Manoj Kumar Fix For: 1.4.0 Support save and load for Word2VecModel. We may want to discuss whether we want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7583) User guide update for RegexTokenizer
[ https://issues.apache.org/jira/browse/SPARK-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648182#comment-14648182 ] Joseph K. Bradley commented on SPARK-7583: -- Yes, please! This can go in after the feature freeze. User guide update for RegexTokenizer Key: SPARK-7583 URL: https://issues.apache.org/jira/browse/SPARK-7583 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9488) pyspark.sql.types.Row very slow when used named arguments
Alexis Benoist created SPARK-9488: - Summary: pyspark.sql.types.Row very slow when used named arguments Key: SPARK-9488 URL: https://issues.apache.org/jira/browse/SPARK-9488 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Reporter: Alexis Benoist We can see that the implementation of the Row is accessing items in O(n). https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1217 We could use an OrderedDict instead of a tuple to make the access time in O(1). Can the keys be of an unhashable type? I'm ok to do the edit. Cheers, Alexis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9488) pyspark.sql.types.Row very slow when using named arguments
[ https://issues.apache.org/jira/browse/SPARK-9488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexis Benoist updated SPARK-9488: -- Summary: pyspark.sql.types.Row very slow when using named arguments (was: pyspark.sql.types.Row very slow when used named arguments) pyspark.sql.types.Row very slow when using named arguments -- Key: SPARK-9488 URL: https://issues.apache.org/jira/browse/SPARK-9488 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Reporter: Alexis Benoist Labels: performance We can see that the implementation of the Row is accessing items in O(n). https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1217 We could use an OrderedDict instead of a tuple to make the access time in O(1). Can the keys be of an unhashable type? I'm ok to do the edit. Cheers, Alexis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8197) date/time function: trunc
[ https://issues.apache.org/jira/browse/SPARK-8197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648351#comment-14648351 ] Apache Spark commented on SPARK-8197: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7805 date/time function: trunc - Key: SPARK-8197 URL: https://issues.apache.org/jira/browse/SPARK-8197 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin trunc(string date[, string format]): string trunc(date date[, string format]): date Returns date truncated to the unit specified by the format (as of Hive 1.2.0). Supported formats: MONTH/MON/MM, YEAR/YYYY/YY. If format is omitted the date will be truncated to the nearest day. Example: trunc('2015-03-17', 'MM') = 2015-03-01. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
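For illustration, a minimal sketch of the described semantics using java.time (Java 8+); this mirrors the Hive behavior quoted above rather than the actual Spark implementation:

{code:java}
import java.time.LocalDate

object DateTruncDemo {
  def trunc(date: LocalDate, format: String): LocalDate = format.toUpperCase match {
    case "MONTH" | "MON" | "MM" => date.withDayOfMonth(1) // first day of the month
    case "YEAR" | "YYYY" | "YY" => date.withDayOfYear(1)  // first day of the year
    case other => throw new IllegalArgumentException(s"Unsupported format: $other")
  }

  def main(args: Array[String]): Unit = {
    println(trunc(LocalDate.parse("2015-03-17"), "MM")) // 2015-03-01, as in the example
    println(trunc(LocalDate.parse("2015-03-17"), "YY")) // 2015-01-01
  }
}
{code}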
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648192#comment-14648192 ] Joseph K. Bradley commented on SPARK-6227: -- That's great you're interested. Please read this for lots of helpful info: [https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] I would download the original source code from the Apache Spark website and install it natively, without using the VM. There are instructions for that in the Spark docs and READMEs. To get started, I recommend finding some small JIRAs which have been resolved already and looking at the PRs which solved them. Those will give you an idea of the code structure. Good luck! PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6684) Add checkpointing to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6684: - Shepherd: Xiangrui Meng Add checkpointing to GradientBoostedTrees - Key: SPARK-6684 URL: https://issues.apache.org/jira/browse/SPARK-6684 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley We should add checkpointing to GradientBoostedTrees since it maintains RDDs with long lineages. keywords: gradient boosting, gbt, gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8176) date/time function: to_date
[ https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648350#comment-14648350 ] Apache Spark commented on SPARK-8176: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7805 date/time function: to_date --- Key: SPARK-8176 URL: https://issues.apache.org/jira/browse/SPARK-8176 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang parse a timestamp string and return the date portion {code} to_date(string timestamp): date {code} Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" (in some date format) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9408) Refactor mllib/linalg.py to mllib/linalg
[ https://issues.apache.org/jira/browse/SPARK-9408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9408: - Shepherd: Davies Liu (was: Xiangrui Meng) Refactor mllib/linalg.py to mllib/linalg Key: SPARK-9408 URL: https://issues.apache.org/jira/browse/SPARK-9408 Project: Spark Issue Type: Task Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Manoj Kumar We need to refactor mllib/linalg.py to mllib/linalg so that the project structure is similar to that of Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-9485: Description: Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.<init>(<console>:9) at $iwC.<init>(<console>:18) at <init>(<console>:20) at .<init>(<console>:24) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) at
[jira] [Commented] (SPARK-967) start-slaves.sh uses local path from master on remote slave nodes
[ https://issues.apache.org/jira/browse/SPARK-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648234#comment-14648234 ] David Chin commented on SPARK-967: -- I won't create a pull request unless asked to, but I have a solution for this. I am running Spark in standalone mode within a Univa Grid Engine cluster. As such, configs and logs, etc should be specific to each UGE job, identified by an integer job ID. Currently, any environment variables on the master are not passed along by the sbin/start-slaves.sh invocation of ssh. I put in a fix on my local version, which works. However, this is still less than ideal in that UGE's job accounting cannot keep track of resource usage by jobs not under its process tree. Not sure, yet, what the correct solution is. I thought I saw a feature request to allow other remote shell programs besides ssh, but I can't find it now. Please see my version of sbin/start-slaves.sh here: https://github.com/prehensilecode/spark/blob/master/sbin/start-slaves.sh start-slaves.sh uses local path from master on remote slave nodes - Key: SPARK-967 URL: https://issues.apache.org/jira/browse/SPARK-967 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.8.0, 0.8.1, 0.9.0 Reporter: Evgeniy Tsvigun Priority: Trivial Labels: script, starter If a slave node has home path other than master, start-slave.sh fails to start a worker instance, for other nodes behaves as expected, in my case: $ ./bin/start-slaves.sh node05.dev.vega.ru: bash: line 0: cd: /usr/home/etsvigun/spark/bin/..: No such file or directory node04.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as process 4796. Stop it first. node03.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as process 61348. Stop it first. I don't mention /usr/home anywhere, the only environment variable I set is $SPARK_HOME, relative to $HOME on every node, which makes me think some script takes `pwd` on master and tries to use it on slaves. Spark version: fb6875dd5c9334802580155464cef9ac4d4cc1f0 OS: FreeBSD 8.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648233#comment-14648233 ] Philip Adetiloye commented on SPARK-9485: - [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. 
spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9485: - Shepherd: (was: MEN CHAMROEUN) Target Version/s: (was: 1.4.1) Environment: (was: DEV) Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- this JIRA had some fields set that should not be. I don't think that helps since it's just a list of your local configs, specific to your environment. Obviously, in general yarn-client mode does not yield a failure on startup, so this isn't quite helpful in understanding the failure. It seems specific to your env. Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.<init>(<console>:9) at $iwC.<init>(<console>:18) at <init>(<console>:20) at .<init>(<console>:24)
[jira] [Comment Edited] (SPARK-967) start-slaves.sh uses local path from master on remote slave nodes
[ https://issues.apache.org/jira/browse/SPARK-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648234#comment-14648234 ] David Chin edited comment on SPARK-967 at 7/30/15 8:34 PM: --- I won't create a pull request unless asked to, but I have a solution for this. I am running Spark in standalone mode within a Univa Grid Engine cluster. As such, configs and logs, etc should be specific to each UGE job, identified by an integer job ID. Currently, any environment variables on the master are not passed along by the sbin/start-slaves.sh invocation of ssh. I put in a fix on my local version, which works. However, this is still less than ideal in that UGE's job accounting cannot keep track of resource usage by jobs not under its process tree. Not sure, yet, what the correct solution is. I thought I saw a feature request to allow other remote shell programs besides ssh, but I can't find it now. Please see my version of sbin/start-slaves.sh here, forked from current master: https://github.com/prehensilecode/spark/blob/master/sbin/start-slaves.sh was (Author: prehensilecode): I won't create a pull request unless asked to, but I have a solution for this. I am running Spark in standalone mode within a Univa Grid Engine cluster. As such, configs and logs, etc should be specific to each UGE job, identified by an integer job ID. Currently, any environment variables on the master are not passed along by the sbin/start-slaves.sh invocation of ssh. I put in a fix on my local version, which works. However, this is still less than ideal in that UGE's job accounting cannot keep track of resource usage by jobs not under its process tree. Not sure, yet, what the correct solution is. I thought I saw a feature request to allow other remote shell programs besides ssh, but I can't find it now. Please see my version of sbin/start-slaves.sh here: https://github.com/prehensilecode/spark/blob/master/sbin/start-slaves.sh start-slaves.sh uses local path from master on remote slave nodes - Key: SPARK-967 URL: https://issues.apache.org/jira/browse/SPARK-967 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.8.0, 0.8.1, 0.9.0 Reporter: Evgeniy Tsvigun Priority: Trivial Labels: script, starter If a slave node has home path other than master, start-slave.sh fails to start a worker instance, for other nodes behaves as expected, in my case: $ ./bin/start-slaves.sh node05.dev.vega.ru: bash: line 0: cd: /usr/home/etsvigun/spark/bin/..: No such file or directory node04.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as process 4796. Stop it first. node03.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as process 61348. Stop it first. I don't mention /usr/home anywhere, the only environment variable I set is $SPARK_HOME, relative to $HOME on every node, which makes me think some script takes `pwd` on master and tries to use it on slaves. Spark version: fb6875dd5c9334802580155464cef9ac4d4cc1f0 OS: FreeBSD 8.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark
[ https://issues.apache.org/jira/browse/SPARK-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9486: --- Assignee: Apache Spark Add aliasing to data sources to allow external packages to register themselves with Spark - Key: SPARK-9486 URL: https://issues.apache.org/jira/browse/SPARK-9486 Project: Spark Issue Type: Improvement Components: SQL Reporter: Joseph Batchik Assignee: Apache Spark Priority: Minor Currently Spark allows users to use external data sources like spark-avro, spark-csv, etc. by having them specify their full class name: {code:java} sqlContext.read.format("com.databricks.spark.avro").load(path) {code} Typing in a full class name is not the best idea, so it would be nice to allow external packages to register themselves with Spark, letting users do something like: {code:java} sqlContext.read.format("avro").load(path) {code} This would make it so that the external data source packages follow the same convention as the built-in data sources (parquet, json, jdbc, etc.) do. This could be accomplished by using a ServiceLoader. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark
[ https://issues.apache.org/jira/browse/SPARK-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9486: --- Assignee: (was: Apache Spark) Add aliasing to data sources to allow external packages to register themselves with Spark - Key: SPARK-9486 URL: https://issues.apache.org/jira/browse/SPARK-9486 Project: Spark Issue Type: Improvement Components: SQL Reporter: Joseph Batchik Priority: Minor Currently Spark allows users to use external data sources like spark-avro, spark-csv, etc. by having them specify their full class name: {code:java} sqlContext.read.format("com.databricks.spark.avro").load(path) {code} Typing in a full class name is not the best idea, so it would be nice to allow external packages to register themselves with Spark, letting users do something like: {code:java} sqlContext.read.format("avro").load(path) {code} This would make it so that the external data source packages follow the same convention as the built-in data sources (parquet, json, jdbc, etc.) do. This could be accomplished by using a ServiceLoader. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9485) Failed to connect to yarn
Philip Adetiloye created SPARK-9485: --- Summary: Failed to connect to yarn Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.<init>(<console>:9) at $iwC.<init>(<console>:18) at <init>(<console>:20) at .<init>(<console>:24) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at
[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648233#comment-14648233 ] Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:16 PM: -- [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: ` export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH ` Hope this helps. Thanks, - Phil was (Author: pkadetiloye): [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. 
spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at
[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648233#comment-14648233 ] Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:16 PM: -- [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil was (Author: pkadetiloye): [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: ` export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH ` Hope this helps. Thanks, - Phil Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. 
spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at
[jira] [Commented] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark
[ https://issues.apache.org/jira/browse/SPARK-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648287#comment-14648287 ] Apache Spark commented on SPARK-9486: - User 'JDrit' has created a pull request for this issue: https://github.com/apache/spark/pull/7802 Add aliasing to data sources to allow external packages to register themselves with Spark - Key: SPARK-9486 URL: https://issues.apache.org/jira/browse/SPARK-9486 Project: Spark Issue Type: Improvement Components: SQL Reporter: Joseph Batchik Priority: Minor Currently Spark allows users to use external data sources like spark-avro, spark-csv, etc. by having them specify their full class name: {code:java} sqlContext.read.format("com.databricks.spark.avro").load(path) {code} Typing in a full class name is not the best idea, so it would be nice to allow external packages to register themselves with Spark, letting users do something like: {code:java} sqlContext.read.format("avro").load(path) {code} This would make it so that the external data source packages follow the same convention as the built-in data sources (parquet, json, jdbc, etc.) do. This could be accomplished by using a ServiceLoader. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
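For illustration, a minimal sketch of the ServiceLoader approach; the trait and class names below are hypothetical, not the interface the pull request actually adds:

{code:java}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Each data source package exposes a short alias for itself.
trait DataSourceRegister {
  def shortName(): String
}

// An external package would ship a class like this plus a
// META-INF/services file naming it, so the JVM can discover it.
class AvroRelationProvider extends DataSourceRegister {
  override def shortName(): String = "avro"
}

object DataSourceResolver {
  // format("avro") then becomes a registry lookup instead of Class.forName.
  def lookup(name: String): Option[DataSourceRegister] =
    ServiceLoader.load(classOf[DataSourceRegister]).asScala
      .find(_.shortName().equalsIgnoreCase(name))
}
{code}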
[jira] [Commented] (SPARK-6684) Add checkpointing to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648310#comment-14648310 ] Apache Spark commented on SPARK-6684: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/7804 Add checkpointing to GradientBoostedTrees - Key: SPARK-6684 URL: https://issues.apache.org/jira/browse/SPARK-6684 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley We should add checkpointing to GradientBoostedTrees since it maintains RDDs with long lineages. keywords: gradient boosting, gbt, gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
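For illustration, a minimal sketch of periodic checkpointing to truncate a long lineage; the checkpoint directory, the interval, and the map standing in for one boosting iteration are illustrative assumptions:

{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CheckpointingLoop {
  def run(sc: SparkContext, data: RDD[Double], numIterations: Int): RDD[Double] = {
    sc.setCheckpointDir("/tmp/gbt-checkpoints") // assumed path
    var current = data
    for (i <- 1 to numIterations) {
      current = current.map(_ * 1.01) // stand-in for one boosting iteration
      if (i % 10 == 0) {              // checkpoint every 10 iterations
        current.checkpoint()
        current.count()               // force materialization so the lineage is cut
      }
    }
    current
  }
}
{code}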
[jira] [Assigned] (SPARK-6684) Add checkpointing to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6684: --- Assignee: Apache Spark (was: Joseph K. Bradley) Add checkpointing to GradientBoostedTrees - Key: SPARK-6684 URL: https://issues.apache.org/jira/browse/SPARK-6684 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark We should add checkpointing to GradientBoostedTrees since it maintains RDDs with long lineages. keywords: gradient boosting, gbt, gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6684) Add checkpointing to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6684: --- Assignee: Joseph K. Bradley (was: Apache Spark) Add checkpointing to GradientBoostedTrees - Key: SPARK-6684 URL: https://issues.apache.org/jira/browse/SPARK-6684 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley We should add checkpointing to GradientBoostedTrees since it maintains RDDs with long lineages. keywords: gradient boosting, gbt, gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5567) Add prediction methods to LDA
[ https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5567: - Assignee: Feynman Liang Add prediction methods to LDA - Key: SPARK-5567 URL: https://issues.apache.org/jira/browse/SPARK-5567 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Feynman Liang Original Estimate: 168h Remaining Estimate: 168h LDA currently supports prediction on the training set. E.g., you can call logLikelihood and topicDistributions to get that info for the training data. However, it should support the same functionality for new (test) documents. This will require inference but should be able to use the same code, with a few modifications to keep the inferred topics fixed. Note: The API for these methods is already in the code but is commented out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
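For context, a minimal sketch of the training-set calls that already exist in the spark.mllib API; the proposal is to offer analogous calls for unseen documents with the inferred topics held fixed:

{code:java}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object LdaTrainingSetPrediction {
  // corpus: (docId, termCountVector) pairs
  def run(corpus: RDD[(Long, Vector)]): Unit = {
    val model = new LDA().setK(10).run(corpus).asInstanceOf[DistributedLDAModel]
    println(model.logLikelihood)              // likelihood of the training set
    println(model.topicDistributions.first()) // per-document topic mixture
  }
}
{code}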
[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648233#comment-14648233 ] Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:17 PM: -- [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, Phil was (Author: pkadetiloye): [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. 
spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at
[jira] [Resolved] (SPARK-5567) Add prediction methods to LDA
[ https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-5567. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7760 [https://github.com/apache/spark/pull/7760] Add prediction methods to LDA - Key: SPARK-5567 URL: https://issues.apache.org/jira/browse/SPARK-5567 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Feynman Liang Fix For: 1.5.0 Original Estimate: 168h Remaining Estimate: 168h LDA currently supports prediction on the training set. E.g., you can call logLikelihood and topicDistributions to get that info for the training data. However, it should support the same functionality for new (test) documents. This will require inference but should be able to use the same code, with a few modifications to keep the inferred topics fixed. Note: The API for these methods is already in the code but is commented out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9133) Add and Subtract should support date/timestamp and interval type
[ https://issues.apache.org/jira/browse/SPARK-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9133. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] Add and Subtract should support date/timestamp and interval type Key: SPARK-9133 URL: https://issues.apache.org/jira/browse/SPARK-9133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Fix For: 1.5.0 Should support: date + interval, interval + date, timestamp + interval, interval + timestamp. The best way to support this is probably to resolve this to a date add/subtract expression, rather than making add/subtract support these types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8194) date/time function: add_months
[ https://issues.apache.org/jira/browse/SPARK-8194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8194. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] date/time function: add_months -- Key: SPARK-8194 URL: https://issues.apache.org/jira/browse/SPARK-8194 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0
{noformat}
add_months(string start_date, int num_months): string
add_months(date start_date, int num_months): date
{noformat}
Returns the date that is num_months after start_date. The time part of start_date is ignored. If start_date is the last day of the month or if the resulting month has fewer days than the day component of start_date, then the result is the last day of the resulting month. Otherwise, the result has the same day component as start_date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
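Since the edge cases are easy to misread, here is a small Scala sketch of the rule exactly as specified above, built on JDK 8's {{java.time}} (an illustration of the semantics, not Spark's implementation):
{code}
import java.time.LocalDate

// Per the spec: the last day of a month maps to the last day of the result month;
// otherwise keep the day component, clamping when the target month is shorter.
def addMonthsSpec(start: LocalDate, numMonths: Int): LocalDate = {
  val shifted = start.plusMonths(numMonths)
  if (start.getDayOfMonth == start.lengthOfMonth) shifted.withDayOfMonth(shifted.lengthOfMonth)
  else shifted // plusMonths already clamps the day when needed
}

addMonthsSpec(LocalDate.of(2015, 1, 31), 1) // 2015-02-28 (February is shorter)
addMonthsSpec(LocalDate.of(2015, 2, 28), 1) // 2015-03-31 (last day -> last day)
{code}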
[jira] [Resolved] (SPARK-8186) date/time function: date_add
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8186. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] date/time function: date_add Key: SPARK-8186 URL: https://issues.apache.org/jira/browse/SPARK-8186 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Fix For: 1.5.0
{noformat}
date_add(timestamp startdate, int days): timestamp
date_add(timestamp startdate, interval i): timestamp
date_add(date date, int days): date
date_add(date date, interval i): date
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
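For reference, a hedged sketch of the int overload as it might be called from the shell (assuming the shell's {{sqlContext}}; the expected value is plain day arithmetic, not verified output):
{code}
// Adding 7 days to 2015-07-30 should yield 2015-08-06; date_sub mirrors this.
sqlContext.sql("SELECT date_add(CAST('2015-07-30' AS DATE), 7)").show()
{code}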
[jira] [Resolved] (SPARK-8187) date/time function: date_sub
[ https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8187. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] date/time function: date_sub Key: SPARK-8187 URL: https://issues.apache.org/jira/browse/SPARK-8187 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Fix For: 1.5.0
{noformat}
date_sub(timestamp startdate, int days): timestamp
date_sub(timestamp startdate, interval i): timestamp
date_sub(date date, int days): date
date_sub(date date, interval i): date
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9290) DateExpressionsSuite is slow to run
[ https://issues.apache.org/jira/browse/SPARK-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9290. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] DateExpressionsSuite is slow to run --- Key: SPARK-9290 URL: https://issues.apache.org/jira/browse/SPARK-9290 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0 We are running way too many test cases in here. {code} [info] - DayOfYear (16 seconds, 998 milliseconds) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8198) date/time function: months_between
[ https://issues.apache.org/jira/browse/SPARK-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8198. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] date/time function: months_between -- Key: SPARK-8198 URL: https://issues.apache.org/jira/browse/SPARK-8198 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0 months_between(date1, date2): double Returns number of months between dates date1 and date2 (as of Hive 1.2.0). If date1 is later than date2, then the result is positive. If date1 is earlier than date2, then the result is negative. If date1 and date2 are either the same days of the month or both last days of months, then the result is always an integer. Otherwise the UDF calculates the fractional portion of the result based on a 31-day month and considers the difference in the time components of date1 and date2. date1 and date2 can be of type date, timestamp or string in the format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss'. The result is rounded to 8 decimal places. Example: months_between('1997-02-28 10:30:00', '1996-10-30') = 3.94959677 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
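The example works out as follows (a quick arithmetic check of the stated formula, not Spark code):
{code}
// 1996-10-30 -> 1997-02-28 spans 4 whole months; the day-of-month difference is 28 - 30;
// date1 carries a 10.5-hour time component; the fraction is taken over a 31-day month.
val months = 4 + (28 - 30 + 10.5 / 24) / 31
// months == 3.9495967741935485, i.e. 3.94959677 after rounding to 8 decimal places
{code}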
[jira] [Commented] (SPARK-9478) Add class weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648184#comment-14648184 ] Joseph K. Bradley commented on SPARK-9478: -- This sounds valuable. Handling it by reweighting examples (as is being done for logreg) seems like the simplest solution for now. I'll keep an eye on the ticket! Add class weights to Random Forest -- Key: SPARK-9478 URL: https://issues.apache.org/jira/browse/SPARK-9478 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.1 Reporter: Patrick Crenshaw Currently, this implementation of random forest does not support class weights. Class weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
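For anyone needing a stopgap today, a hedged sketch of the reweighting idea approximated by resampling ({{oversampleMinority}} is a hypothetical helper, not an MLlib API):
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Approximate an integer class weight by unioning extra copies of the minority class
// before training; factor = 3 roughly mimics a class weight of 3 for that label.
def oversampleMinority(data: RDD[LabeledPoint], minorityLabel: Double, factor: Int): RDD[LabeledPoint] = {
  val minority = data.filter(_.label == minorityLabel)
  (1 until factor).foldLeft(data)((acc, _) => acc.union(minority))
}
{code}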
[jira] [Commented] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648200#comment-14648200 ] Sean Owen commented on SPARK-9485: -- I don't think this is sufficient to be a JIRA bug report; there's no detail for reproducing it. It also just appears to be some kind of (other) error at startup causing initialization to fail. Can you start on user@ please? And if there isn't guidance there, provide a consistent reproduction? Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at
java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at
[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-9485: Shepherd: MEN CHAMROEUN Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at
[jira] [Updated] (SPARK-9481) LocalLDAModel logLikelihood
[ https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9481: - Shepherd: Joseph K. Bradley Assignee: Feynman Liang LocalLDAModel logLikelihood --- Key: SPARK-9481 URL: https://issues.apache.org/jira/browse/SPARK-9481 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Trivial We already have a variational {{bound}} method, so we should provide a public {{logLikelihood}} that uses the model's parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9454) LDASuite should use vector comparisons
[ https://issues.apache.org/jira/browse/SPARK-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9454: - Shepherd: Joseph K. Bradley LDASuite should use vector comparisons -- Key: SPARK-9454 URL: https://issues.apache.org/jira/browse/SPARK-9454 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.5.0 {{LDASuite}}'s "OnlineLDAOptimizer one iteration" test currently checks correctness using hacky string comparisons. We should compare the vectors instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9458) Avoid object allocation in prefix generation
[ https://issues.apache.org/jira/browse/SPARK-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648308#comment-14648308 ] Apache Spark commented on SPARK-9458: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7803 Avoid object allocation in prefix generation Key: SPARK-9458 URL: https://issues.apache.org/jira/browse/SPARK-9458 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 In our existing sort prefix generation code, we use the expression's eval method to generate the prefix, which results in an object allocation for every prefix. We can use the specialized getters available on InternalRow directly to avoid the object allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
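A simplified sketch of the two code paths (catalyst class names as of 1.5; an illustration of the idea, not the actual patch):
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.BoundReference
import org.apache.spark.sql.types.LongType

val ref = BoundReference(0, LongType, nullable = false)

// eval() returns Any, so every call boxes the value into a java.lang.Long.
def boxedPrefix(row: InternalRow): Long = ref.eval(row).asInstanceOf[Long]

// The specialized getter reads the primitive directly: no per-row allocation.
def primitivePrefix(row: InternalRow): Long = row.getLong(0)
{code}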
[jira] [Resolved] (SPARK-9454) LDASuite should use vector comparisons
[ https://issues.apache.org/jira/browse/SPARK-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9454. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7775 [https://github.com/apache/spark/pull/7775] LDASuite should use vector comparisons -- Key: SPARK-9454 URL: https://issues.apache.org/jira/browse/SPARK-9454 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.5.0 {{LDASuite}}'s "OnlineLDAOptimizer one iteration" test currently checks correctness using hacky string comparisons. We should compare the vectors instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark
Joseph Batchik created SPARK-9486: - Summary: Add aliasing to data sources to allow external packages to register themselves with Spark Key: SPARK-9486 URL: https://issues.apache.org/jira/browse/SPARK-9486 Project: Spark Issue Type: Improvement Components: SQL Reporter: Joseph Batchik Priority: Minor Currently Spark allows users to use external data sources like spark-avro, spark-csv, etc. by having them specify the full class name: {code:java} sqlContext.read.format("com.databricks.spark.avro").load(path) {code} Typing in a full class name is not ideal, so it would be nice to allow external packages to register themselves with Spark so that users can do something like: {code:java} sqlContext.read.format("avro").load(path) {code} This would make the external data source packages follow the same convention as the built-in data sources (parquet, json, jdbc, etc.). This could be accomplished by using a ServiceLoader. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
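A minimal sketch of how the ServiceLoader approach could look (the {{DataSourceRegister}} trait and {{shortName}} method here are hypothetical, sketched from the description):
{code}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical provider interface: an external package implements it and lists the
// implementing class in META-INF/services/<fully.qualified.interface.name>.
trait DataSourceRegister {
  def shortName: String
}

// Resolve a short alias like "avro" by scanning all registered providers.
def lookupByAlias(alias: String): Option[DataSourceRegister] =
  ServiceLoader.load(classOf[DataSourceRegister]).asScala.find(_.shortName == alias)
{code}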
[jira] [Created] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
Xiangrui Meng created SPARK-9487: Summary: Use the same num. worker threads in Scala/Python unit tests Key: SPARK-9487 URL: https://issues.apache.org/jira/browse/SPARK-9487 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, SQL, Tests Affects Versions: 1.5.0 Reporter: Xiangrui Meng In Python we use `local[4]` for unit tests, while in Scala/Java we use `local[2]` and `local` for some unit tests in SQL, MLlib, and other components. If the operation depends on partition IDs, e.g., a random number generator, this will lead to different results in Python and Scala/Java. It would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
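To see why the thread count leaks into results, a small sketch (standard Spark APIs; the per-partition seeding is just an example):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))

// parallelize defaults to one partition per worker thread, so the partition IDs,
// and anything seeded from them, differ between local[2] and local[4].
val perPartition = sc.parallelize(1 to 8).mapPartitionsWithIndex { (pid, iter) =>
  Iterator(new java.util.Random(pid).nextInt())
}.collect()
{code}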
[jira] [Updated] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-4823: Attachment: SparkMeetup2015-Experiments2.pdf SparkMeetup2015-Experiments1.pdf rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Attachments: MovieLensSimilarity Comparisons.pdf, SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, beyond brute force. Note that when there are many rows (> 10^6), it is unlikely that brute force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648340#comment-14648340 ] Debasish Das commented on SPARK-4823: - We ran a more detailed experiment for the July 2015 Spark Meetup to understand the shuffle effects on runtime. I attached the data for the experiments to the JIRA. I will update the PR as discussed with Reza. I am targeting one PR for Spark 1.5. rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Attachments: MovieLensSimilarity Comparisons.pdf RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, beyond brute force. Note that when there are many rows (> 10^6), it is unlikely that brute force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9320) Add `summary` as a synonym for `describe`
[ https://issues.apache.org/jira/browse/SPARK-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9320: --- Assignee: (was: Apache Spark) Add `summary` as a synonym for `describe` - Key: SPARK-9320 URL: https://issues.apache.org/jira/browse/SPARK-9320 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman `summary` is used to provide similar functionality in R data frames. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9318: --- Assignee: Apache Spark Add `merge` as synonym for join --- Key: SPARK-9318 URL: https://issues.apache.org/jira/browse/SPARK-9318 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9320) Add `summary` as a synonym for `describe`
[ https://issues.apache.org/jira/browse/SPARK-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648388#comment-14648388 ] Apache Spark commented on SPARK-9320: - User 'falaki' has created a pull request for this issue: https://github.com/apache/spark/pull/7806 Add `summary` as a synonym for `describe` - Key: SPARK-9320 URL: https://issues.apache.org/jira/browse/SPARK-9320 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman `summary` is used to provide similar functionality in R data frames. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648387#comment-14648387 ] Apache Spark commented on SPARK-9318: - User 'falaki' has created a pull request for this issue: https://github.com/apache/spark/pull/7806 Add `merge` as synonym for join --- Key: SPARK-9318 URL: https://issues.apache.org/jira/browse/SPARK-9318 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9320) Add `summary` as a synonym for `describe`
[ https://issues.apache.org/jira/browse/SPARK-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9320: --- Assignee: Apache Spark Add `summary` as a synonym for `describe` - Key: SPARK-9320 URL: https://issues.apache.org/jira/browse/SPARK-9320 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Apache Spark `summary` is used to provide similar functionality in R data frames. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9318: --- Assignee: (was: Apache Spark) Add `merge` as synonym for join --- Key: SPARK-9318 URL: https://issues.apache.org/jira/browse/SPARK-9318 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange
[ https://issues.apache.org/jira/browse/SPARK-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9489: --- Assignee: Apache Spark (was: Josh Rosen) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange --- Key: SPARK-9489 URL: https://issues.apache.org/jira/browse/SPARK-9489 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Apache Spark While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's {{compatible}} check may be incorrectly returning {{false}} in many cases. As far as I know, this is not actually a problem because the {{compatible}}, {{meetsRequirements}}, and {{needsAnySort}} checks are serving only as short-circuit performance optimizations that are not necessary for correctness. In order to reduce code complexity, I think that we should remove these checks and unconditionally rewrite the operator's children. This should be safe because we rewrite the tree in a single bottom-up pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange
[ https://issues.apache.org/jira/browse/SPARK-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648405#comment-14648405 ] Apache Spark commented on SPARK-9489: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7807 Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange --- Key: SPARK-9489 URL: https://issues.apache.org/jira/browse/SPARK-9489 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's {{compatible}} check may be incorrectly returning {{false}} in many cases. As far as I know, this is not actually a problem because the {{compatible}}, {{meetsRequirements}}, and {{needsAnySort}} checks are serving only as short-circuit performance optimizations that are not necessary for correctness. In order to reduce code complexity, I think that we should remove these checks and unconditionally rewrite the operator's children. This should be safe because we rewrite the tree in a single bottom-up pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange
[ https://issues.apache.org/jira/browse/SPARK-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9489: --- Assignee: Josh Rosen (was: Apache Spark) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange --- Key: SPARK-9489 URL: https://issues.apache.org/jira/browse/SPARK-9489 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's {{compatible}} check may be incorrectly returning {{false}} in many cases. As far as I know, this is not actually a problem because the {{compatible}}, {{meetsRequirements}}, and {{needsAnySort}} checks are serving only as short-circuit performance optimizations that are not necessary for correctness. In order to reduce code complexity, I think that we should remove these checks and unconditionally rewrite the operator's children. This should be safe because we rewrite the tree in a single bottom-up pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9479) ReceiverTrackerSuite fails for maven build
[ https://issues.apache.org/jira/browse/SPARK-9479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9479. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 1.5.0 ReceiverTrackerSuite fails for maven build -- Key: SPARK-9479 URL: https://issues.apache.org/jira/browse/SPARK-9479 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.5.0 The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3109/ I saw the following exception in the log: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1297) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834) {code} This exception occurs because SparkEnv.get returns null. The maven build differs from the sbt build: the maven build creates all Suite classes at the beginning, and `ReceiverTrackerSuite` creates its StreamingContext (and hence SparkContext) in the constructor. That means the SparkContext is created very early, and the global SparkEnv will have been set to null by the previous test. Therefore we saw the above exception when running `Receiver tracker - propagates rate limit` in `ReceiverTrackerSuite`. This test was added recently. Note: the previous tests in `ReceiverTrackerSuite` didn't actually use SparkContext, which is why we didn't see such a failure before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
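A hedged sketch of the shape of fix this points to (plain ScalaTest; names and details assumed, not the actual patch): create the context inside the test instead of in the suite constructor, so a SparkEnv torn down by an earlier suite cannot leak in.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.scalatest.FunSuite

class LazyContextSuite extends FunSuite {
  test("Receiver tracker - propagates rate limit") {
    // Created per test, not eagerly when maven instantiates every Suite class.
    val conf = new SparkConf().setMaster("local[2]").setAppName("ReceiverTrackerSuite")
    val ssc = new StreamingContext(conf, Seconds(1))
    try {
      // ... exercise the receiver tracker here ...
    } finally {
      ssc.stop()
    }
  }
}
{code}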