[jira] [Created] (SPARK-7554) Throw errors when an active StreamingContext is used to create DStreams and output operations
Tathagata Das created SPARK-7554: Summary: Throw errors when an active StreamingContext is used to create DStreams and output operations Key: SPARK-7554 URL: https://issues.apache.org/jira/browse/SPARK-7554 Project: Spark Issue Type: Improvement Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7026) LeftSemiJoin can not work when it has both equal condition and not equal condition.
[ https://issues.apache.org/jira/browse/SPARK-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei closed SPARK-7026. - Resolution: Duplicate LeftSemiJoin can not work when it has both equal condition and not equal condition. - Key: SPARK-7026 URL: https://issues.apache.org/jira/browse/SPARK-7026 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Zhongshuai Pei Run a SQL query like the following {panel} select * from web_sales ws1 left semi join web_sales ws2 on ws1.ws_order_number = ws2.ws_order_number and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk {panel} and it fails with an exception {panel} Couldn't find ws_warehouse_sk#287 in {ws_sold_date_sk#237,ws_sold_time_sk#238,ws_ship_date_sk#239,ws_item_sk#240,ws_bill_customer_sk#241,ws_bill_cdemo_sk#242,ws_bill_hdemo_sk#243,ws_bill_addr_sk#244,ws_ship_customer_sk#245,ws_ship_cdemo_sk#246,ws_ship_hdemo_sk#247,ws_ship_addr_sk#248,ws_web_page_sk#249,ws_web_site_sk#250,ws_ship_mode_sk#251,ws_warehouse_sk#252,ws_promo_sk#253,ws_order_number#254,ws_quantity#255,ws_wholesale_cost#256,ws_list_price#257,ws_sales_price#258,ws_ext_discount_amt#259,ws_ext_sales_price#260,ws_ext_wholesale_cost#261,ws_ext_list_price#262,ws_ext_tax#263,ws_coupon_amt#264,ws_ext_ship_cost#265,ws_net_paid#266,ws_net_paid_inc_tax#267,ws_net_paid_inc_ship#268,ws_net_paid_inc_ship_tax#269,ws_net_profit#270,ws_sold_date#236} at scala.sys.package$.error(package.scala:27) {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7545) Bernoulli NaiveBayes should validate data
[ https://issues.apache.org/jira/browse/SPARK-7545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539081#comment-14539081 ] Joseph K. Bradley commented on SPARK-7545: -- OK, I appreciate it! Bernoulli NaiveBayes should validate data - Key: SPARK-7545 URL: https://issues.apache.org/jira/browse/SPARK-7545 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Leah McGuire Priority: Minor Bernoulli NaiveBayes expects input features to take values 0 or 1, but it does not actually check that. It should check and throw an exception if it finds invalid values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
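A rough illustration of the validation requested in SPARK-7545, as a hedged sketch (the helper object and method names are hypothetical, not the MLlib patch): reject any feature value other than 0 or 1 before training a Bernoulli model.
{code}
import org.apache.spark.mllib.regression.LabeledPoint

object BernoulliInputCheck {
  // Throw if any feature value is something other than 0.0 or 1.0.
  def requireZeroOneFeatures(p: LabeledPoint): Unit = {
    val ok = p.features.toArray.forall(v => v == 0.0 || v == 1.0)
    require(ok, s"Bernoulli naive Bayes requires 0/1 feature values, but found: ${p.features}")
  }
}
{code}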
[jira] [Resolved] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala
[ https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7331. - Resolution: Fixed Fix Version/s: 1.2.3 Issue resolved by pull request 6036 [https://github.com/apache/spark/pull/6036] Create HiveConf per application instead of per query in HiveQl.scala Key: SPARK-7331 URL: https://issues.apache.org/jira/browse/SPARK-7331 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Nitin Goyal Priority: Minor Fix For: 1.2.3 A new HiveConf is created per query in the getAst method in HiveQl.scala:
{code}
def getAst(sql: String): ASTNode = {
  /*
   * Context has to be passed in Hive 0.13.1.
   * Otherwise, there will be a NullPointerException
   * when retrieving properties from HiveConf.
   */
  val hContext = new Context(new HiveConf())
  val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
  hContext.clear()
  node
}
{code}
Creating a HiveConf adds a minimum of 90 ms of delay per query, so its creation should be moved into the enclosing object so that it happens only once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
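A minimal sketch of the proposed change, assuming the Hive 0.13 classes referenced in the snippet above; the object layout and the hiveConf field name are illustrative, not the actual patch:
{code}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.Context
import org.apache.hadoop.hive.ql.parse.{ASTNode, ParseDriver, ParseUtils}

object HiveQl {
  // Created once per application instead of once per query.
  private lazy val hiveConf = new HiveConf()

  def getAst(sql: String): ASTNode = {
    // A Context still has to be passed for Hive 0.13.1, but it can reuse the shared conf.
    val hContext = new Context(hiveConf)
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }
}
{code}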
[jira] [Resolved] (SPARK-7538) Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge
[ https://issues.apache.org/jira/browse/SPARK-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-7538. Resolution: Fixed This was a cross post from the mailing list. The poster closed the thread with the following: {code} Ted, many thanks. I'm not used to Java dependencies so this was a real head-scratcher for me. Downloading the two metrics packages from the maven repository (metrics-core, metrics-annotation) and supplying it on the spark-submit command line worked. My final spark-submit for a python project using Kafka as an input source: /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --jars /home/ubuntu/jars/metrics-core-2.2.0.jar,/home/ubuntu/jars/metrics-annotation-2.2.0.jar \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py Now we're seeing data from the stream. Thanks again! {code} Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge --- Key: SPARK-7538 URL: https://issues.apache.org/jira/browse/SPARK-7538 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Environment: Ubuntu 14.04 LTS java version 1.7.0_79 OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2) OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode) Spark 1.3.1 release. Reporter: Lee McFadden We have a simple streaming job, the components of which work fine in a batch environment reading from a cassandra table as the source. We adapted it to work with streaming using the Python libs. Submit command line: {code} /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py {code} When we run the streaming job everything starts just fine, then we see the following in the logs: {code} 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 70, ip-10-10-102-53.us-west-2.compute.internal): java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge at kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:115) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:128) at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: com.yammer.metrics.core.Gauge at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 17 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7520) Install Jekyll On Jenkins Machines
[ https://issues.apache.org/jira/browse/SPARK-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-7520. Resolution: Fixed Fix Version/s: 1.4.0 All green - awesome thanks [~shaneknapp]! Install Jekyll On Jenkins Machines -- Key: SPARK-7520 URL: https://issues.apache.org/jira/browse/SPARK-7520 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: shane knapp Priority: Critical Fix For: 1.4.0 Hey Shane, SPARK-1517 requires us to install Jekyll on the build machines. Any chance we can do that? http://jekyllrb.com/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6876) DataFrame.na.replace value support for Python
[ https://issues.apache.org/jira/browse/SPARK-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6876: -- Assignee: Adrian Wang DataFrame.na.replace value support for Python - Key: SPARK-6876 URL: https://issues.apache.org/jira/browse/SPARK-6876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Scala/Java support is in. We should provide the Python version, similar to what Pandas supports. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7411) CTAS parser is incomplete
[ https://issues.apache.org/jira/browse/SPARK-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7411. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5963 [https://github.com/apache/spark/pull/5963] CTAS parser is incomplete - Key: SPARK-7411 URL: https://issues.apache.org/jira/browse/SPARK-7411 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Michael Armbrust Assignee: Cheng Hao Priority: Blocker Fix For: 1.4.0 The change to use an isolated classloader removed the use of the Semantic Analyzer for parsing CTAS queries. We should fix this before the release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7553) Add methods to maintain a singleton StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7553: - Description: In a REPL/notebook environment, its very easy to lose a reference to a StreamingContext by overriding the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence its best to maintain a singleton reference to the active context, so that we never loose reference for the active context. was: In a REPL/notebook environment, its very easy to lose a reference to a StreamingContext by overriding the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence its best to maintain a singleton reference to the active context, so that we never loose reference for the active context. Add methods to maintain a singleton StreamingContext - Key: SPARK-7553 URL: https://issues.apache.org/jira/browse/SPARK-7553 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker In a REPL/notebook environment, its very easy to lose a reference to a StreamingContext by overriding the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence its best to maintain a singleton reference to the active context, so that we never loose reference for the active context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7553) Add methods to maintain a singleton StreamingContext
Tathagata Das created SPARK-7553: Summary: Add methods to maintain a singleton StreamingContext Key: SPARK-7553 URL: https://issues.apache.org/jira/browse/SPARK-7553 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker In a REPL/notebook environment, it's very easy to lose a reference to a StreamingContext by overwriting the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence it's best to maintain a singleton reference to the active context, so that we never lose the reference to the active context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
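For illustration, a minimal sketch of the singleton pattern this ticket describes, using a hypothetical helper object; this is not necessarily the API shape added by the eventual patch:
{code}
import org.apache.spark.streaming.StreamingContext

object ActiveStreamingContext {
  @volatile private var active: Option[StreamingContext] = None

  // Remember the context that was started, even if the user's variable is later overwritten.
  def set(ssc: StreamingContext): Unit = synchronized { active = Some(ssc) }

  // Return the active context if one exists, otherwise create, remember, and return a new one.
  def getActiveOrCreate(create: () => StreamingContext): StreamingContext = synchronized {
    active.getOrElse {
      val ssc = create()
      active = Some(ssc)
      ssc
    }
  }
}
{code}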
[jira] [Assigned] (SPARK-7552) Close files correctly when iteration is finished in WAL recovery
[ https://issues.apache.org/jira/browse/SPARK-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7552: --- Assignee: Apache Spark Close files correctly when iteration is finished in WAL recovery Key: SPARK-7552 URL: https://issues.apache.org/jira/browse/SPARK-7552 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1, 1.4.0 Reporter: Saisai Shao Assignee: Apache Spark Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7553) Add methods to maintain a singleton StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7553: - Description: In a REPL/notebook environment, its very easy to lose a reference to a StreamingContext by overriding the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence its best to maintain a singleton reference to the active context, so that we never loose reference for the active context. Since this is useful in REPL environments, its best to add this as an Experimental support in the Scala API only. was: In a REPL/notebook environment, its very easy to lose a reference to a StreamingContext by overriding the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence its best to maintain a singleton reference to the active context, so that we never loose reference for the active context. Add methods to maintain a singleton StreamingContext - Key: SPARK-7553 URL: https://issues.apache.org/jira/browse/SPARK-7553 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker In a REPL/notebook environment, its very easy to lose a reference to a StreamingContext by overriding the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence its best to maintain a singleton reference to the active context, so that we never loose reference for the active context. Since this is useful in REPL environments, its best to add this as an Experimental support in the Scala API only. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7553) Add methods to maintain a singleton StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7553: - Description: In a REPL/notebook environment, its very easy to lose a reference to a StreamingContext by overriding the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence its best to maintain a singleton reference to the active context, so that we never loose reference for the active context. Since this problem occurs useful in REPL environments, its best to add this as an Experimental support in the Scala API only so that it can be used in Scala REPLs and notebooks. was: In a REPL/notebook environment, its very easy to lose a reference to a StreamingContext by overriding the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence its best to maintain a singleton reference to the active context, so that we never loose reference for the active context. Since this is useful in REPL environments, its best to add this as an Experimental support in the Scala API only. Add methods to maintain a singleton StreamingContext - Key: SPARK-7553 URL: https://issues.apache.org/jira/browse/SPARK-7553 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker In a REPL/notebook environment, its very easy to lose a reference to a StreamingContext by overriding the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence its best to maintain a singleton reference to the active context, so that we never loose reference for the active context. Since this problem occurs useful in REPL environments, its best to add this as an Experimental support in the Scala API only so that it can be used in Scala REPLs and notebooks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7552) Close files correctly when iteration is finished in WAL recovery
[ https://issues.apache.org/jira/browse/SPARK-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-7552: --- Labels: backport-needed (was: ) Close files correctly when iteration is finished in WAL recovery Key: SPARK-7552 URL: https://issues.apache.org/jira/browse/SPARK-7552 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1, 1.4.0 Reporter: Saisai Shao Labels: backport-needed Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7553) Add methods to maintain a singleton StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7553: --- Assignee: Tathagata Das (was: Apache Spark) Add methods to maintain a singleton StreamingContext - Key: SPARK-7553 URL: https://issues.apache.org/jira/browse/SPARK-7553 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker In a REPL/notebook environment, it's very easy to lose a reference to a StreamingContext by overwriting the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence it's best to maintain a singleton reference to the active context, so that we never lose the reference to the active context. Since this problem occurs mainly in REPL environments, it's best to add this as Experimental support in the Scala API only so that it can be used in Scala REPLs and notebooks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7553) Add methods to maintain a singleton StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539140#comment-14539140 ] Apache Spark commented on SPARK-7553: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/6070 Add methods to maintain a singleton StreamingContext - Key: SPARK-7553 URL: https://issues.apache.org/jira/browse/SPARK-7553 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker In a REPL/notebook environment, it's very easy to lose a reference to a StreamingContext by overwriting the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence it's best to maintain a singleton reference to the active context, so that we never lose the reference to the active context. Since this problem occurs mainly in REPL environments, it's best to add this as Experimental support in the Scala API only so that it can be used in Scala REPLs and notebooks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7553) Add methods to maintain a singleton StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7553: --- Assignee: Apache Spark (was: Tathagata Das) Add methods to maintain a singleton StreamingContext - Key: SPARK-7553 URL: https://issues.apache.org/jira/browse/SPARK-7553 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Assignee: Apache Spark Priority: Blocker In a REPL/notebook environment, it's very easy to lose a reference to a StreamingContext by overwriting the variable name. So if you happen to execute the following commands {{ val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again }} The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence it's best to maintain a singleton reference to the active context, so that we never lose the reference to the active context. Since this problem occurs mainly in REPL environments, it's best to add this as Experimental support in the Scala API only so that it can be used in Scala REPLs and notebooks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7554) Throw exception when an active StreamingContext is used to create DStreams and output operations
[ https://issues.apache.org/jira/browse/SPARK-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7554: - Priority: Blocker (was: Critical) Throw exception when an active StreamingContext is used to create DStreams and output operations Key: SPARK-7554 URL: https://issues.apache.org/jira/browse/SPARK-7554 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7554) Throw errors when an active StreamingContext is used to create DStreams and output operations
[ https://issues.apache.org/jira/browse/SPARK-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7554: - Component/s: Streaming Target Version/s: 1.4.0 Throw errors when an active StreamingContext is used to create DStreams and output operations - Key: SPARK-7554 URL: https://issues.apache.org/jira/browse/SPARK-7554 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7554) Throw exception when an active StreamingContext is used to create DStreams and output operations
[ https://issues.apache.org/jira/browse/SPARK-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7554: - Summary: Throw exception when an active StreamingContext is used to create DStreams and output operations (was: Throw errors when an active StreamingContext is used to create DStreams and output operations) Throw exception when an active StreamingContext is used to create DStreams and output operations Key: SPARK-7554 URL: https://issues.apache.org/jira/browse/SPARK-7554 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7554) Throw exception when an active StreamingContext is used to create DStreams and output operations
[ https://issues.apache.org/jira/browse/SPARK-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539143#comment-14539143 ] Tathagata Das commented on SPARK-7554: -- Currently, adding DStreams to an active context is not supported, but the resulting errors are ambiguous. Make the errors more explicit. Throw exception when an active StreamingContext is used to create DStreams and output operations Key: SPARK-7554 URL: https://issues.apache.org/jira/browse/SPARK-7554 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
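A small sketch of the kind of fail-fast check SPARK-7554 calls for, assuming the getState() API proposed in SPARK-7530; the helper object, method name, and message are illustrative, not the actual patch:
{code}
import org.apache.spark.streaming.{StreamingContext, StreamingContextState}

object DStreamGuards {
  // Intended to be called from DStream constructors and output operations.
  def verifyContextNotStarted(ssc: StreamingContext): Unit = {
    if (ssc.getState() == StreamingContextState.ACTIVE) {
      throw new IllegalStateException(
        "Adding new inputs, transformations, and output operations after " +
          "starting a context is not supported")
    }
  }
}
{code}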
[jira] [Resolved] (SPARK-7509) Add drop column to Python DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7509. Resolution: Fixed Fix Version/s: 1.4.0 Add drop column to Python DataFrame API --- Key: SPARK-7509 URL: https://issues.apache.org/jira/browse/SPARK-7509 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6198) Support select current_database()
[ https://issues.apache.org/jira/browse/SPARK-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei closed SPARK-6198. - Resolution: Won't Fix Support select current_database() --- Key: SPARK-6198 URL: https://issues.apache.org/jira/browse/SPARK-6198 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Zhongshuai Pei The evaluate method in UDFCurrentDB has changed; it now just throws an exception, but hiveUdfs calls this method and fails:
{code}
@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
  throw new IllegalStateException("never");
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5129) make SqlContext support select date +/- XX DAYS from table
[ https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei closed SPARK-5129. - Resolution: Won't Fix make SqlContext support select date +/- XX DAYS from table -- Key: SPARK-5129 URL: https://issues.apache.org/jira/browse/SPARK-5129 Project: Spark Issue Type: Improvement Components: SQL Reporter: Zhongshuai Pei Priority: Minor Example: given a table created with create table test (date: Date) containing 2014-01-01, 2014-01-02, 2014-01-03, running select date + 10 DAYS from test should return 2014-01-11, 2014-01-12, 2014-01-13, and running select date - 10 DAYS from test should return 2013-12-22, 2013-12-23, 2013-12-24. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6768) Do not support float/double union decimal or decimal(a ,b) union decimal(c, d)
[ https://issues.apache.org/jira/browse/SPARK-6768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei closed SPARK-6768. - Resolution: Fixed Do not support float/double union decimal or decimal(a ,b) union decimal(c, d) Key: SPARK-6768 URL: https://issues.apache.org/jira/browse/SPARK-6768 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei SQL statements like the following are not supported: select cast(12.2056999 as float) from testData limit 1 union select cast(12.2041 as decimal(7, 4)) from testData limit 1; select cast(12.2056999 as double) from testData limit 1 union select cast(12.2041 as decimal(7, 4)) from testData limit 1; select cast(1241.20 as decimal(6, 2)) from testData limit 1 union select cast(1.204 as decimal(5, 3)) from testData limit 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7530) Add API to get the current state of a StreamingContext
Tathagata Das created SPARK-7530: Summary: Add API to get the current state of a StreamingContext Key: SPARK-7530 URL: https://issues.apache.org/jira/browse/SPARK-7530 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7531) Install GPG on Jenkins machines
[ https://issues.apache.org/jira/browse/SPARK-7531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538154#comment-14538154 ] shane knapp commented on SPARK-7531: is this version sufficient? -bash-4.1$ gpg --version gpg (GnuPG) 2.0.14 libgcrypt 1.4.5 Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Home: ~/.gnupg Supported algorithms: Pubkey: RSA, ELG, DSA Cipher: 3DES, CAST5, BLOWFISH, AES, AES192, AES256, TWOFISH, CAMELLIA128, CAMELLIA192, CAMELLIA256 Hash: MD5, SHA1, RIPEMD160, SHA256, SHA384, SHA512, SHA224 Compression: Uncompressed, ZIP, ZLIB, BZIP2 Install GPG on Jenkins machines --- Key: SPARK-7531 URL: https://issues.apache.org/jira/browse/SPARK-7531 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: shane knapp This one is also required for us to cut regular snapshot releases from Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7532) Make StreamingContext.start() idempotent
[ https://issues.apache.org/jira/browse/SPARK-7532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7532: - Component/s: Streaming Make StreamingContext.start() idempotent Key: SPARK-7532 URL: https://issues.apache.org/jira/browse/SPARK-7532 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Currently, calling StreamingContext.start() throws an error when the context is already started. This is inconsistent with StreamingContext.stop(), which is idempotent, that is, calling stop() on a stopped context is a no-op. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
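A self-contained sketch of the requested semantics, written as a toy state machine rather than the real StreamingContext internals; all names and states here are illustrative only:
{code}
// Toy model: start() is a no-op on an already-started context, stop() is a
// no-op on an already-stopped one, matching the idempotency asked for above.
class ToyStreamingContext {
  private object State extends Enumeration { val Initialized, Active, Stopped = Value }
  private var state = State.Initialized

  def start(): Unit = synchronized {
    state match {
      case State.Initialized => state = State.Active // really start the scheduler here
      case State.Active      => println("WARN: context already started, ignoring start()")
      case State.Stopped     => throw new IllegalStateException("context has been stopped")
    }
  }

  def stop(): Unit = synchronized {
    if (state != State.Stopped) state = State.Stopped // really stop the scheduler here
  }
}
{code}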
[jira] [Updated] (SPARK-7458) Check 1.3- 1.4 MLlib API compliance using java-compliance-checker
[ https://issues.apache.org/jira/browse/SPARK-7458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7458: - Description: We should do this after 1.4-rc1 is cut. Check 1.3- 1.4 MLlib API compliance using java-compliance-checker -- Key: SPARK-7458 URL: https://issues.apache.org/jira/browse/SPARK-7458 Project: Spark Issue Type: Sub-task Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng We should do this after 1.4-rc1 is cut. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7355) FlakyTest - o.a.s.DriverSuite
[ https://issues.apache.org/jira/browse/SPARK-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7355: --- Assignee: Andrew Or (was: Apache Spark) FlakyTest - o.a.s.DriverSuite - Key: SPARK-7355 URL: https://issues.apache.org/jira/browse/SPARK-7355 Project: Spark Issue Type: Test Components: Spark Core, Tests Reporter: Tathagata Das Assignee: Andrew Or Priority: Blocker Labels: flaky-test -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7355) FlakyTest - o.a.s.DriverSuite
[ https://issues.apache.org/jira/browse/SPARK-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538185#comment-14538185 ] Apache Spark commented on SPARK-7355: - User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/6059 FlakyTest - o.a.s.DriverSuite - Key: SPARK-7355 URL: https://issues.apache.org/jira/browse/SPARK-7355 Project: Spark Issue Type: Test Components: Spark Core, Tests Reporter: Tathagata Das Assignee: Andrew Or Priority: Blocker Labels: flaky-test -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7355) FlakyTest - o.a.s.DriverSuite
[ https://issues.apache.org/jira/browse/SPARK-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7355: --- Assignee: Apache Spark (was: Andrew Or) FlakyTest - o.a.s.DriverSuite - Key: SPARK-7355 URL: https://issues.apache.org/jira/browse/SPARK-7355 Project: Spark Issue Type: Test Components: Spark Core, Tests Reporter: Tathagata Das Assignee: Apache Spark Priority: Blocker Labels: flaky-test -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7532) Make StreamingContext.start() idempotent
[ https://issues.apache.org/jira/browse/SPARK-7532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7532: --- Assignee: Tathagata Das (was: Apache Spark) Make StreamingContext.start() idempotent Key: SPARK-7532 URL: https://issues.apache.org/jira/browse/SPARK-7532 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Currently, calling StreamingContext.start() throws an error when the context is already started. This is inconsistent with StreamingContext.stop(), which is idempotent, that is, calling stop() on a stopped context is a no-op. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7532) Make StreamingContext.start() idempotent
[ https://issues.apache.org/jira/browse/SPARK-7532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538193#comment-14538193 ] Apache Spark commented on SPARK-7532: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/6060 Make StreamingContext.start() idempotent Key: SPARK-7532 URL: https://issues.apache.org/jira/browse/SPARK-7532 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Currently, calling StreamingContext.start() throws an error when the context is already started. This is inconsistent with StreamingContext.stop(), which is idempotent, that is, calling stop() on a stopped context is a no-op. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7532) Make StreamingContext.start() idempotent
[ https://issues.apache.org/jira/browse/SPARK-7532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7532: --- Assignee: Apache Spark (was: Tathagata Das) Make StreamingContext.start() idempotent Key: SPARK-7532 URL: https://issues.apache.org/jira/browse/SPARK-7532 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Apache Spark Priority: Blocker Currently, calling StreamingContext.start() throws an error when the context is already started. This is inconsistent with StreamingContext.stop(), which is idempotent, that is, calling stop() on a stopped context is a no-op. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
[ https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538272#comment-14538272 ] Nicholas Chammas commented on SPARK-7133: - [~rxin] - Should we also implement {{\_\_getitem\_\_}} access in PySpark for {{Row}}? Or does this patch also cover that? As of Spark 1.3.1, you can do {{row.field}} but not {{row\['field'\]}}. Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python -- Key: SPARK-7133 URL: https://issues.apache.org/jira/browse/SPARK-7133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan Priority: Blocker Fix For: 1.4.0 Typing {code} df.col[1] {code} and {code} df.col['field'] {code} is so much easier than {code} df.col.getField('field') df.col.getItem(1) {code} This would require us to define (in Column) an apply function in Scala, and a __getitem__ function in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7361) Throw unambiguous exception when attempting to start multiple StreamingContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7361. -- Resolution: Fixed Fix Version/s: 1.4.0 Throw unambiguous exception when attempting to start multiple StreamingContexts in the same JVM --- Key: SPARK-7361 URL: https://issues.apache.org/jira/browse/SPARK-7361 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Fix For: 1.4.0 Currently, attempting to start a StreamingContext while another one is already started throws a confusing exception saying that the action name JobScheduler is already registered. Instead, it's best to throw a proper exception, as this is not supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
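A hedged sketch of the single-started-context check SPARK-7361 describes; the tracker object, method name, and message are hypothetical, not the actual fix:
{code}
import org.apache.spark.streaming.StreamingContext

object StartedContextTracker {
  @volatile private var started: Option[StreamingContext] = None

  // Called from start(); fails with a clear message instead of the confusing
  // "action name JobScheduler is already registered" error.
  def registerStart(ssc: StreamingContext): Unit = synchronized {
    if (started.exists(_ ne ssc)) {
      throw new IllegalStateException(
        "Only one StreamingContext may be started in this JVM; " +
          "the currently running one must be stopped first")
    }
    started = Some(ssc)
  }

  def registerStop(ssc: StreamingContext): Unit = synchronized {
    if (started.exists(_ eq ssc)) started = None
  }
}
{code}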
[jira] [Updated] (SPARK-7497) test_count_by_value_and_window is flaky
[ https://issues.apache.org/jira/browse/SPARK-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7497: - Assignee: Davies Liu (was: Tathagata Das) test_count_by_value_and_window is flaky --- Key: SPARK-7497 URL: https://issues.apache.org/jira/browse/SPARK-7497 Project: Spark Issue Type: Bug Components: PySpark, Streaming Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Davies Liu Labels: flaky-test Saw this test failure in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32268/console {code} == FAIL: test_count_by_value_and_window (__main__.WindowFunctionTests) -- Traceback (most recent call last): File pyspark/streaming/tests.py, line 418, in test_count_by_value_and_window self._test_func(input, func, expected) File pyspark/streaming/tests.py, line 133, in _test_func self.assertEqual(expected, result) AssertionError: Lists differ: [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]] != [[1], [2], [3], [4], [5], [6], [6], [6]] First list contains 2 additional elements. First extra element 8: [6] - [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]] ? -- + [[1], [2], [3], [4], [5], [6], [6], [6]] -- {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7522) ML Examples option for dataFormat should not be enclosed in angle brackets
[ https://issues.apache.org/jira/browse/SPARK-7522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7522. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6049 [https://github.com/apache/spark/pull/6049] ML Examples option for dataFormat should not be enclosed in angle brackets -- Key: SPARK-7522 URL: https://issues.apache.org/jira/browse/SPARK-7522 Project: Spark Issue Type: Bug Components: Examples Reporter: Bryan Cutler Priority: Minor Fix For: 1.4.0 Some ML examples include an option for specifying the data format, such as DecisionTreeExample, but the option name is enclosed in angle brackets, like opt[String]("<dataFormat>"). This is probably just a typo but makes it awkward to use the option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7534) Fix the Stage table when a stage is missing
Shixiong Zhu created SPARK-7534: --- Summary: Fix the Stage table when a stage is missing Key: SPARK-7534 URL: https://issues.apache.org/jira/browse/SPARK-7534 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7400) PortableDataStream UDT
[ https://issues.apache.org/jira/browse/SPARK-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538236#comment-14538236 ] Eron Wright commented on SPARK-7400: - Given the above, my proposal is to modify [org.apache.spark.sql.catalyst.ScalaReflection::schemaFor|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L74] to return an instance of org.apache.spark.sql.types.PortableDataStreamUDT. PortableDataStream UDT -- Key: SPARK-7400 URL: https://issues.apache.org/jira/browse/SPARK-7400 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Eron Wright Improve support for PortableDataStream in a DataFrame by implementing PortableDataStreamUDT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7443) MLlib 1.4 QA plan
[ https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7443: - Description: TODO: create JIRAs for each task and assign them accordingly. h2. API * Check API compliance using java-compliance-checker (SPARK-7458) * Audit new public APIs (from the generated html doc) ** Scala (do not forget to check the object doc) ** Java compatibility (SPARK-7529) ** Python API coverage * audit Pipeline APIs ** feature transformers ** tree models ** elastic-net ** ML attributes ** developer APIs * graduate spark.ml from alpha ** remove AlphaComponent annotations ** remove mima excludes for spark.ml h2. Algorithms and performance * list missing performance tests from spark-perf * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * Bernoulli naive Bayes (SPARK-7453) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python correctness: * PMML ** scoring using PMML evaluator vs. MLlib models * save/load h2. Documentation and example code * create JIRAs for the user guide to each new algorithm and assign them to the corresponding author * create example code for major components ** cross validation in python ** pipeline with complex feature transformations (scala/java/python) ** elastic-net (possibly with cross validation) was: TODO: create JIRAs for each task and assign them accordingly. h2. API * Check API compliance using java-compliance-checker (SPARK-7458) * Audit new public APIs (from the generated html doc) ** Scala (do not forget to check the object doc) ** Java compatibility ** Python API coverage * audit Pipeline APIs ** feature transformers ** tree models ** elastic-net ** ML attributes ** developer APIs * graduate spark.ml from alpha ** remove AlphaComponent annotations ** remove mima excludes for spark.ml h2. Algorithms and performance * list missing performance tests from spark-perf * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * Bernoulli naive Bayes (SPARK-7453) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python correctness: * PMML ** scoring using PMML evaluator vs. MLlib models * save/load h2. Documentation and example code * create JIRAs for the user guide to each new algorithm and assign them to the corresponding author * create example code for major components ** cross validation in python ** pipeline with complex feature transformations (scala/java/python) ** elastic-net (possibly with cross validation) MLlib 1.4 QA plan - Key: SPARK-7443 URL: https://issues.apache.org/jira/browse/SPARK-7443 Project: Spark Issue Type: Umbrella Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Priority: Critical TODO: create JIRAs for each task and assign them accordingly. h2. API * Check API compliance using java-compliance-checker (SPARK-7458) * Audit new public APIs (from the generated html doc) ** Scala (do not forget to check the object doc) ** Java compatibility (SPARK-7529) ** Python API coverage * audit Pipeline APIs ** feature transformers ** tree models ** elastic-net ** ML attributes ** developer APIs * graduate spark.ml from alpha ** remove AlphaComponent annotations ** remove mima excludes for spark.ml h2. 
Algorithms and performance * list missing performance tests from spark-perf * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * Bernoulli naive Bayes (SPARK-7453) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python correctness: * PMML ** scoring using PMML evaluator vs. MLlib models * save/load h2. Documentation and example code * create JIRAs for the user guide to each new algorithm and assign them to the corresponding author * create example code for major components ** cross validation in python ** pipeline with complex feature transformations (scala/java/python) ** elastic-net (possibly with cross validation) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6092) Add RankingMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6092. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6044 [https://github.com/apache/spark/pull/6044] Add RankingMetrics in PySpark/MLlib --- Key: SPARK-6092 URL: https://issues.apache.org/jira/browse/SPARK-6092 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Yanbo Liang Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7529) Java compatibility check for MLlib 1.4
Xiangrui Meng created SPARK-7529: Summary: Java compatibility check for MLlib 1.4 Key: SPARK-7529 URL: https://issues.apache.org/jira/browse/SPARK-7529 Project: Spark Issue Type: Sub-task Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Check Java compatibility for MLlib 1.4. We should create separate JIRAs for each possible issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7530) Add API to get the current state of a StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7530: --- Assignee: Tathagata Das (was: Apache Spark) Add API to get the current state of a StreamingContext -- Key: SPARK-7530 URL: https://issues.apache.org/jira/browse/SPARK-7530 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7522) ML Examples option for dataFormat should not be enclosed in angle brackets
[ https://issues.apache.org/jira/browse/SPARK-7522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7522: - Target Version/s: 1.1.2, 1.2.3, 1.3.2 Affects Version/s: 1.4.0 1.1.1 1.2.2 1.3.1 ML Examples option for dataFormat should not be enclosed in angle brackets -- Key: SPARK-7522 URL: https://issues.apache.org/jira/browse/SPARK-7522 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.1.1, 1.2.2, 1.3.1, 1.4.0 Reporter: Bryan Cutler Assignee: Bryan Cutler Priority: Minor Fix For: 1.4.0 Some ML examples include an option for specifying the data format, such as DecisionTreeExample, but the option name is enclosed in angle brackets, like opt[String]("<dataFormat>"). This is probably just a typo but makes it awkward to use the option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7520) Install Jekyll On Jenkins Machines
[ https://issues.apache.org/jira/browse/SPARK-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538148#comment-14538148 ] shane knapp commented on SPARK-7520: any particular version of ruby? 1.8.7? Install Jekyll On Jenkins Machines -- Key: SPARK-7520 URL: https://issues.apache.org/jira/browse/SPARK-7520 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: shane knapp Priority: Critical Hey Shane, SPARK-1517 requires us to install Jekyll on the build machines. Any chance we can do that? http://jekyllrb.com/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7504) NullPointerException when initializing SparkContext in YARN-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538159#comment-14538159 ] Marcelo Vanzin commented on SPARK-7504: --- If I understand correctly, what you're doing is running the equivalent of this in your code, right? {code} new SparkContext(new SparkConf().set("spark.master", "yarn-cluster")) {code} That's not really supported, since that will not work in yarn-cluster mode, even if the ApplicationMaster launches successfully. I also took a look at your PR and that won't help. The fix here, if any, is to not allow the above code to work by throwing an exception early. NullPointerException when initializing SparkContext in YARN-cluster mode Key: SPARK-7504 URL: https://issues.apache.org/jira/browse/SPARK-7504 Project: Spark Issue Type: Bug Components: Deploy, YARN Reporter: Zoltán Zvara Labels: deployment, yarn, yarn-client It is not clear to most users that, while running Spark on YARN, a {{SparkContext}} with a given execution plan can be run locally as {{yarn-client}}, but cannot deploy itself to the cluster. This is currently performed using {{org.apache.spark.deploy.yarn.Client}}. {color:gray} I think we should support deployment through {{SparkContext}}, but this is not the point I wish to make here. {color} Configuring a {{SparkContext}} to deploy itself currently will yield an {{ERROR}} while accessing {{spark.yarn.app.id}} in {{YarnClusterSchedulerBackend}}, and after that a {{NullPointerException}} while referencing the {{ApplicationMaster}} instance. Spark should clearly inform the user that it might be running in {{yarn-cluster}} mode without a proper submission using {{Client}} and that deploying is not supported directly from {{SparkContext}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
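For SPARK-7504, the fail-fast check suggested above could look roughly like this (a sketch only; the object and method names are made up, and the real check would live inside SparkContext initialization rather than in a helper):
{code}
import org.apache.spark.{SparkConf, SparkException}

object MasterValidation {
  // Illustrative helper: reject direct SparkContext creation with a cluster-only master.
  def failFastOnClusterMaster(conf: SparkConf): Unit = {
    if (conf.get("spark.master", "") == "yarn-cluster") {
      throw new SparkException(
        "Master 'yarn-cluster' cannot be used when creating a SparkContext directly; " +
        "submit the application with spark-submit or org.apache.spark.deploy.yarn.Client instead.")
    }
  }
}
{code}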
[jira] [Commented] (SPARK-7507) pyspark.sql.types.StructType and Row should implement __iter__()
[ https://issues.apache.org/jira/browse/SPARK-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538283#comment-14538283 ] Nicholas Chammas commented on SPARK-7507: - On a related note, perhaps we should also offer a method to quickly turn Python dicts back into StructTypes or Rows. pyspark.sql.types.StructType and Row should implement __iter__() Key: SPARK-7507 URL: https://issues.apache.org/jira/browse/SPARK-7507 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Nicholas Chammas Priority: Minor {{StructType}} looks an awful lot like a Python dictionary. However, it doesn't implement {{\_\_iter\_\_()}}, so doing a quick conversion like this doesn't work:
{code}
>>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
>>> df.schema
StructType(List(StructField(name,StringType,true)))
>>> dict(df.schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'StructType' object is not iterable
{code}
This would be super helpful for doing any custom schema manipulations without having to go through the whole {{.json() -> json.loads() -> manipulate() -> json.dumps() -> .fromJson()}} charade. Same goes for {{Row}}, which offers an [{{asDict()}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.Row.asDict] method but doesn't support the more Pythonic {{dict(Row)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7528) Java compatibility of RankingMetrics
Xiangrui Meng created SPARK-7528: Summary: Java compatibility of RankingMetrics Key: SPARK-7528 URL: https://issues.apache.org/jira/browse/SPARK-7528 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng This is to check Java compatibility of RankingMetrics, which uses ClassTag. Maybe we should create a factory method for Java users that uses a fake tag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
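For SPARK-7528, one possible shape for such a factory (a hypothetical sketch, not an existing Spark API; the object and method names are invented):
{code}
import scala.reflect.ClassTag
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.evaluation.RankingMetrics

object JavaRankingMetrics {
  // Supply a "fake" ClassTag so Java callers never have to construct one themselves.
  def of[E](predictionAndLabels: JavaRDD[(Array[E], Array[E])]): RankingMetrics[E] = {
    implicit val tag: ClassTag[E] = ClassTag.AnyRef.asInstanceOf[ClassTag[E]]
    new RankingMetrics(predictionAndLabels.rdd)
  }
}
{code}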
[jira] [Commented] (SPARK-7530) Add API to get the current state of a StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538126#comment-14538126 ] Apache Spark commented on SPARK-7530: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/6058 Add API to get the current state of a StreamingContext -- Key: SPARK-7530 URL: https://issues.apache.org/jira/browse/SPARK-7530 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7530) Add API to get the current state of a StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7530: --- Assignee: Apache Spark (was: Tathagata Das) Add API to get the current state of a StreamingContext -- Key: SPARK-7530 URL: https://issues.apache.org/jira/browse/SPARK-7530 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Apache Spark Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538209#comment-14538209 ] Alexander Ulanov commented on SPARK-5575: - Current implementation: https://github.com/avulanov/spark/tree/ann-interface-gemm Artificial neural networks for MLlib deep learning -- Key: SPARK-5575 URL: https://issues.apache.org/jira/browse/SPARK-5575 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Goal: Implement various types of artificial neural networks Motivation: deep learning trend Requirements: 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused 2) Implement complex abstractions, such as feed-forward and recurrent networks 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN) etc. 4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
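For SPARK-5575, requirement 1 could be sketched with traits along these lines (names and signatures are assumptions, not the design in the linked branch):
{code}
// Illustrative shapes only; the actual interface in the ann-interface-gemm branch may differ.
trait Layer {
  def forward(input: Array[Double]): Array[Double]
  def backward(input: Array[Double], outputDelta: Array[Double]): Array[Double]
}

trait ErrorFunction {
  def loss(output: Array[Double], target: Array[Double]): Double
  def gradient(output: Array[Double], target: Array[Double]): Array[Double]
}

// A feed-forward network is then just a composition of layers.
class FeedForwardNetwork(layers: Seq[Layer]) {
  def predict(input: Array[Double]): Array[Double] =
    layers.foldLeft(input)((activation, layer) => layer.forward(activation))
}
{code}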
[jira] [Created] (SPARK-7533) Decrease spacing between AM-RM heartbeats.
Sandy Ryza created SPARK-7533: - Summary: Decrease spacing between AM-RM heartbeats. Key: SPARK-7533 URL: https://issues.apache.org/jira/browse/SPARK-7533 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.1 Reporter: Sandy Ryza The current default of spark.yarn.scheduler.heartbeat.interval-ms is 5 seconds. This is really long. For reference, the MR equivalent is 1 second. To avoid noise and unnecessary communication, we could have a fast rate for when we're waiting for executors and a slow rate for when we're just heartbeating. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
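For SPARK-7533, the dual-rate idea could be sketched as follows (illustrative only; the Allocator trait and the 200 ms fast interval are stand-ins, not the actual YarnAllocator API or a decided value):
{code}
trait Allocator {
  def allocateResources(): Unit   // one AM -> RM heartbeat
  def numPendingAllocate: Int     // executors requested but not yet granted
}

object HeartbeatSketch {
  def heartbeatLoop(allocator: Allocator, isFinished: () => Boolean): Unit = {
    val slowIntervalMs = 5000L // current default of spark.yarn.scheduler.heartbeat.interval-ms
    val fastIntervalMs = 200L  // assumed faster rate while executors are still pending
    while (!isFinished()) {
      allocator.allocateResources()
      val sleepMs = if (allocator.numPendingAllocate > 0) fastIntervalMs else slowIntervalMs
      Thread.sleep(sleepMs)
    }
  }
}
{code}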
[jira] [Created] (SPARK-7531) Install GPG on Jenkins machines
Patrick Wendell created SPARK-7531: -- Summary: Install GPG on Jenkins machines Key: SPARK-7531 URL: https://issues.apache.org/jira/browse/SPARK-7531 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: shane knapp This one is also required for us to cut regular snapshot releases from Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7534) Fix the Stage table when a stage is missing
[ https://issues.apache.org/jira/browse/SPARK-7534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7534: --- Assignee: (was: Apache Spark) Fix the Stage table when a stage is missing --- Key: SPARK-7534 URL: https://issues.apache.org/jira/browse/SPARK-7534 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7534) Fix the Stage table when a stage is missing
[ https://issues.apache.org/jira/browse/SPARK-7534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538222#comment-14538222 ] Apache Spark commented on SPARK-7534: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6061 Fix the Stage table when a stage is missing --- Key: SPARK-7534 URL: https://issues.apache.org/jira/browse/SPARK-7534 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception
[ https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538288#comment-14538288 ] Joseph K. Bradley commented on SPARK-7483: -- I agree; it should work, but I'm not sure why it's failing. I'm not that familiar with Kryo, but I'll ask around. Thanks for reporting this! [MLLib] Using Kryo with FPGrowth fails with an exception Key: SPARK-7483 URL: https://issues.apache.org/jira/browse/SPARK-7483 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Tomasz Bartczak Priority: Minor When using the FPGrowth algorithm with KryoSerializer, Spark fails with
{code}
Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Can not set final scala.collection.mutable.ListBuffer field org.apache.spark.mllib.fpm.FPTree$Summary.nodes to scala.collection.mutable.ArrayBuffer
Serialization trace:
nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
org$apache$spark$mllib$fpm$FPTree$$summaries (org.apache.spark.mllib.fpm.FPTree)
{code}
This can be easily reproduced in the Spark codebase by setting {code} conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") {code} and running FPGrowthSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
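For SPARK-7483, a standalone reproduction along the lines the report describes might look like this (a sketch assuming Spark 1.3.x in local mode; the application name and transaction data are made up):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthKryoRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("FPGrowthKryoRepro")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Tiny made-up transaction data, just enough to force FPTree structures to be shuffled.
    val transactions = sc.parallelize(Seq(
      Array("a", "b", "c"),
      Array("a", "b"),
      Array("b", "c")))

    // The reported KryoException surfaces during the shuffle triggered by run().
    val model = new FPGrowth()
      .setMinSupport(0.5)
      .setNumPartitions(2)
      .run(transactions)

    model.freqItemsets.collect().foreach(println)
    sc.stop()
  }
}
{code}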