[jira] [Commented] (SPARK-5456) Decimal Type comparison issue
[ https://issues.apache.org/jira/browse/SPARK-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383392#comment-14383392 ] Karthik G commented on SPARK-5456: -- This is a blocker when using Spark with databases which have Decimal / BigDecimal columns. Is there a workaround? Decimal Type comparison issue - Key: SPARK-5456 URL: https://issues.apache.org/jira/browse/SPARK-5456 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Kuldeep Not quite able to figure this out but here is a junit test to reproduce this, in JavaAPISuite.java
{code:title=DecimalBug.java}
@Test
public void decimalQueryTest() {
  List<Row> decimalTable = new ArrayList<Row>();
  decimalTable.add(RowFactory.create(new BigDecimal(1), new BigDecimal(2)));
  decimalTable.add(RowFactory.create(new BigDecimal(3), new BigDecimal(4)));
  JavaRDD<Row> rows = sc.parallelize(decimalTable);
  List<StructField> fields = new ArrayList<StructField>(7);
  fields.add(DataTypes.createStructField("a", DataTypes.createDecimalType(), true));
  fields.add(DataTypes.createStructField("b", DataTypes.createDecimalType(), true));
  sqlContext.applySchema(rows.rdd(), DataTypes.createStructType(fields)).registerTempTable("foo");
  Assert.assertEquals(sqlContext.sql("select * from foo where a > 0").collectAsList(), decimalTable);
}
{code}
Fails with java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
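A possible workaround, offered only as an untested sketch: forcing the predicate onto doubles keeps the comparison away from catalyst Decimal values. Whether this actually sidesteps the BigDecimal-to-Decimal cast depends on where Spark performs the conversion, so treat it as a suggestion rather than a confirmed fix; foo is the temp table registered in the reproduction above.
{code}
// Untested workaround sketch (Scala): compare on doubles instead of Decimal values.
val rows = sqlContext.sql("select * from foo where cast(a as double) > 0").collect()
{code}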
[jira] [Updated] (SPARK-6119) better support for working with missing data
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6119: --- Description: Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. Some ideas: 1. Support replacing all null values for a column (or all columns) with a fixed value. 2. Support replacing a set of values with another set of values. 3. interpolate was: Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. Some ideas: 1. Support replacing all null values for a column with a fixed value. 2. Support replacing all null values for all columns with a fixed value. better support for working with missing data Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. Some ideas: 1. Support replacing all null values for a column (or all columns) with a fixed value. 2. Support replacing a set of values with another set of values. 3. interpolate -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6406) Launcher backward compatibility issues
[ https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6406: --- Assignee: (was: Apache Spark) Launcher backward compatibility issues -- Key: SPARK-6406 URL: https://issues.apache.org/jira/browse/SPARK-6406 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Nishkam Ravi Priority: Minor The new launcher library breaks backward compatibility. The "hadoop" string in the spark assembly should not be mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6561) Add partition support in saveAsParquet
Jianshi Huang created SPARK-6561: Summary: Add partition support in saveAsParquet Key: SPARK-6561 URL: https://issues.apache.org/jira/browse/SPARK-6561 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquet(path: String, partitionColumns: Seq[String]) {code} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
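To make the proposal concrete, usage could look like the sketch below. This is hypothetical: neither saveAsParquet(path, partitionColumns) nor the year/month columns exist anywhere yet (later in this thread the method is renamed saveAsParquetFile), and df stands for some existing DataFrame.
{code}
// Hypothetical usage of the proposed API: one directory per distinct
// (year, month) pair, e.g. /data/events/year=2015/month=3/part-*.parquet
df.saveAsParquet("/data/events", Seq("year", "month"))

// Automatic partition discovery on read could then prune directories:
val march = sqlContext.parquetFile("/data/events").filter("year = 2015 and month = 3")
{code}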
[jira] [Updated] (SPARK-6119) better support for working with missing data
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6119: --- Labels: DataFrame (was: ) better support for working with missing data Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6119) better support for working with missing data
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6119: --- Summary: better support for working with missing data (was: missing data support) better support for working with missing data Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6119) better support for working with missing data
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6119: --- Description: Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. Some ideas: 1. Support replacing all null values for a column with a fixed value. 2. Support replacing all null values for all columns with a fixed value. was: Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. better support for working with missing data Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. Some ideas: 1. Support replacing all null values for a column with a fixed value. 2. Support replacing all null values for all columns with a fixed value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6561) Add partition support in saveAsParquet
[ https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6561: - Description: Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquetFile(path: String, partitionColumns: Seq[String]) {code} Jianshi was: Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquet(path: String, partitionColumns: Seq[String]) {code} Jianshi Add partition support in saveAsParquet -- Key: SPARK-6561 URL: https://issues.apache.org/jira/browse/SPARK-6561 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquetFile(path: String, partitionColumns: Seq[String]) {code} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6561) Add partition support in saveAsParquet
[ https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383413#comment-14383413 ] Patrick Wendell commented on SPARK-6561: FYI - I just removed Affects Versions since that is only for bugs (to indicate which version has the bug). Add partition support in saveAsParquet -- Key: SPARK-6561 URL: https://issues.apache.org/jira/browse/SPARK-6561 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquetFile(path: String, partitionColumns: Seq[String]) {code} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6561) Add partition support in saveAsParquet
[ https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6561: --- Affects Version/s: (was: 1.3.1) (was: 1.3.0) Add partition support in saveAsParquet -- Key: SPARK-6561 URL: https://issues.apache.org/jira/browse/SPARK-6561 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquetFile(path: String, partitionColumns: Seq[String]) {code} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6119) better support for working with missing data
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6119: --- Description: Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. Some ideas: 1. Support replacing all null values for a column (or all columns) with a fixed value. 2. Support dropping rows with null values (dropna). 3. Support replacing a set of values with another set of values (i.e. map join) was: Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. Some ideas: 1. Support replacing all null values for a column (or all columns) with a fixed value. 2. Support replacing a set of values with another set of values. 3. interpolate better support for working with missing data Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. Some ideas: 1. Support replacing all null values for a column (or all columns) with a fixed value. 2. Support dropping rows with null values (dropna). 3. Support replacing a set of values with another set of values (i.e. map join) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6406) Launcher backward compatibility issues
[ https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6406: --- Assignee: Apache Spark Launcher backward compatibility issues -- Key: SPARK-6406 URL: https://issues.apache.org/jira/browse/SPARK-6406 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Nishkam Ravi Assignee: Apache Spark Priority: Minor The new launcher library breaks backward compatibility. The "hadoop" string in the spark assembly should not be mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6341) Upgrade breeze from 0.11.1 to 0.11.2 or later
[ https://issues.apache.org/jira/browse/SPARK-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6341. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5222 [https://github.com/apache/spark/pull/5222] Upgrade breeze from 0.11.1 to 0.11.2 or later - Key: SPARK-6341 URL: https://issues.apache.org/jira/browse/SPARK-6341 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.3.0 Reporter: Yu Ishikawa Priority: Minor Fix For: 1.3.1, 1.4.0 There is a bug when dividing a breeze sparse vector that has zero values by a scalar. However, this bug is on breeze's side. I heard that once David fixes it and publishes it to maven, we can upgrade to breeze 0.11.2 or later. - [Apache Spark Developers List: Is there any bugs to divide a Breeze sparse vector at Spark v1.3.0-rc3](http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-td11056.html) - [Is there any bugs to divide a sparse vector with `:/` at v0.11.1? · Issue #382 · scalanlp/breeze](https://github.com/scalanlp/breeze/issues/382#issuecomment-80896698) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6341) Upgrade breeze from 0.11.1 to 0.11.2 or later
[ https://issues.apache.org/jira/browse/SPARK-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6341: - Assignee: Yu Ishikawa Upgrade breeze from 0.11.1 to 0.11.2 or later - Key: SPARK-6341 URL: https://issues.apache.org/jira/browse/SPARK-6341 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.3.0 Reporter: Yu Ishikawa Assignee: Yu Ishikawa Priority: Minor Fix For: 1.3.1, 1.4.0 There is a bug when dividing a breeze sparse vector that has zero values by a scalar. However, this bug is on breeze's side. I heard that once David fixes it and publishes it to maven, we can upgrade to breeze 0.11.2 or later. - [Apache Spark Developers List: Is there any bugs to divide a Breeze sparse vector at Spark v1.3.0-rc3](http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-td11056.html) - [Is there any bugs to divide a sparse vector with `:/` at v0.11.1? · Issue #382 · scalanlp/breeze](https://github.com/scalanlp/breeze/issues/382#issuecomment-80896698) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6443) Could not submit app in standalone cluster mode when HA is enabled
[ https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-6443: Priority: Critical (was: Major) Could not submit app in standalone cluster mode when HA is enabled -- Key: SPARK-6443 URL: https://issues.apache.org/jira/browse/SPARK-6443 Project: Spark Issue Type: Bug Components: Spark Submit Reporter: Tao Wang Priority: Critical After digging through the code, I found that users could not submit apps in standalone cluster mode when HA is enabled. But in client mode it should work. I haven't tried it yet, but I will verify this and file a PR to resolve it if the problem exists. 3/23 update: I started an HA cluster with ZK, and tried to submit the SparkPi example with the command: ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar and it failed with the error message:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077
akka.actor.ActorInitializationException: exception during creation
 at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
 at akka.actor.ActorCell.create(ActorCell.scala:596)
 at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
 at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
 at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: spark://doggie153:7077,doggie159:7077
 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
 at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
 at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
 at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
 at akka.actor.ActorCell.create(ActorCell.scala:580)
 ... 9 more
But in client mode it ended with the correct result. So my guess is right. I will fix it in the related PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
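For context, the failure comes from Master.toAkkaUrl rejecting the comma-separated HA master list as if it were a single URL. One plausible shape for the fix, sketched here with a hypothetical helper (not the actual patch), is to split the list and build one Akka URL per master:
{code}
// Hypothetical helper: accept "spark://host1:port1,host2:port2" and return
// one Akka master URL per host instead of failing on the comma.
def toAkkaUrls(sparkUrl: String): Seq[String] =
  sparkUrl.stripPrefix("spark://").split(",").map { hostPort =>
    s"akka.tcp://sparkMaster@$hostPort/user/Master"
  }

toAkkaUrls("spark://doggie153:7077,doggie159:7077")
// Seq(akka.tcp://sparkMaster@doggie153:7077/user/Master,
//     akka.tcp://sparkMaster@doggie159:7077/user/Master)
{code}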
[jira] [Created] (SPARK-6567) Large linear model parallelism via a join and reduceByKey
Reza Zadeh created SPARK-6567: - Summary: Large linear model parallelism via a join and reduceByKey Key: SPARK-6567 URL: https://issues.apache.org/jira/browse/SPARK-6567 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Reza Zadeh To train a linear model, each training point in the training set needs its dot product computed against the model, per iteration. If the model is large (too large to fit in memory on a single machine) then SPARK-4590 proposes using a parameter server. There is an easier way to achieve this without parameter servers. In particular, if the data is held as a BlockMatrix and the model as an RDD, then each block can be joined with the relevant part of the model, followed by a reduceByKey to compute the dot products. This obviates the need for a parameter server, at least for linear models. However, it's unclear how it compares performance-wise to parameter servers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
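A minimal sketch of the idea with toy data and invented block ids (not Reza's actual design): each training point is split into column blocks keyed by block id, the model is an RDD keyed the same way, and a join plus reduceByKey assembles the full dot products.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("block-dot"))

// (blockId, (pointId, featureBlock)): each point split across column blocks
val dataBlocks = sc.parallelize(Seq(
  (0, (1L, Array(1.0, 2.0))), (1, (1L, Array(3.0))),
  (0, (2L, Array(0.5, 0.0))), (1, (2L, Array(4.0)))))

// (blockId, modelBlock): the large model, partitioned by the same block ids
val modelBlocks = sc.parallelize(Seq((0, Array(0.1, 0.2)), (1, Array(0.3))))

// Join each data block with its model block, form partial dot products,
// then sum the partials per point to get the full dot product.
val dots = dataBlocks.join(modelBlocks)
  .map { case (_, ((pointId, x), w)) =>
    (pointId, x.zip(w).map { case (a, b) => a * b }.sum)
  }
  .reduceByKey(_ + _)

dots.collect().foreach(println) // roughly: (1,1.4), (2,1.25)
{code}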
[jira] [Created] (SPARK-6568) spark-shell.cmd --jars option does not accept the jar that has space in its path
Masayoshi TSUZUKI created SPARK-6568: Summary: spark-shell.cmd --jars option does not accept the jar that has space in its path Key: SPARK-6568 URL: https://issues.apache.org/jira/browse/SPARK-6568 Project: Spark Issue Type: Bug Components: Spark Core, Windows Affects Versions: 1.3.0 Environment: Windows 8.1 Reporter: Masayoshi TSUZUKI spark-shell.cmd --jars option does not accept a jar that has a space in its path. The path of a jar sometimes contains spaces on Windows. {code} bin\spark-shell.cmd --jars "C:\Program Files\some\jar1.jar" {code} this gets {code} Exception in thread "main" java.net.URISyntaxException: Illegal character in path at index 10: C:/Program Files/some/jar1.jar {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
Platon Potapov created SPARK-6569: - Summary: Kafka directInputStream logs what appear to be incorrect warnings Key: SPARK-6569 URL: https://issues.apache.org/jira/browse/SPARK-6569 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: Spark 1.3.0 Reporter: Platon Potapov Priority: Minor During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 * the ${part.fromOffset} is not correctly substituted to a value * is the condition really mandates a warning logged? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
[ https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5111: --- Assignee: (was: Apache Spark) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5 --- Key: SPARK-5111 URL: https://issues.apache.org/jira/browse/SPARK-5111 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhan Zhang This fails due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport some hive-0.14 fixes into spark, since there is no effort to upgrade hive support to 0.14 in spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
[ https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5111: --- Assignee: Apache Spark HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5 --- Key: SPARK-5111 URL: https://issues.apache.org/jira/browse/SPARK-5111 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhan Zhang Assignee: Apache Spark This fails due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport some hive-0.14 fixes into spark, since there is no effort to upgrade hive support to 0.14 in spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6566) Update Spark to use the latest version of Parquet libraries
[ https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383627#comment-14383627 ] Cheng Lian commented on SPARK-6566: --- Hi [~k.shaposhni...@gmail.com], as described in SPARK-5463, we do want to upgrade Parquet. However, currently we have two concerns: # The most recent Parquet RC release introduces subtle API incompatibilities related to filter push-down and Parquet metadata gathering, which I believe requires more work than the patch you provided if we want everything to work perfectly with the best performance. # We'd like to wait for the official release of Parquet 1.6.0. This is the first release for Parquet as an Apache top-level project, so it takes more time than usual. We probably will first try to upgrade to the most recent 1.6.0 RC release in Spark master, and then switch to the official 1.6.0 release in Spark 1.4.0 (and Spark 1.3.2 if there will be one). Update Spark to use the latest version of Parquet libraries --- Key: SPARK-6566 URL: https://issues.apache.org/jira/browse/SPARK-6566 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Konstantin Shaposhnikov There are a lot of bug fixes in the latest version of parquet (1.6.0rc7). E.g. PARQUET-136 It would be good to update Spark to use the latest parquet version. The following changes are required:
{code}
diff --git a/pom.xml b/pom.xml
index 5ad39a9..095b519 100644
--- a/pom.xml
+++ b/pom.xml
@@ -132,7 +132,7 @@
     <!-- Version used for internal directory structure -->
     <hive.version.short>0.13.1</hive.version.short>
     <derby.version>10.10.1.1</derby.version>
-    <parquet.version>1.6.0rc3</parquet.version>
+    <parquet.version>1.6.0rc7</parquet.version>
     <jblas.version>1.2.3</jblas.version>
     <jetty.version>8.1.14.v20131031</jetty.version>
     <orbit.version>3.0.0.v201112011016</orbit.version>
{code}
and
{code}
--- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
@@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
     globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
       mergedMetadata, globalMetaData.getCreatedBy)

-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
       new InitContext(configuration,
         globalMetaData.getKeyValueMetaData,
         globalMetaData.getSchema))
{code}
I am happy to prepare a pull request if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6569: -- Description: During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: {code} [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 {code} * the part.fromOffset placeholder is not correctly substituted to a value * is the condition really mandates a warning logged? was: During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 * the ${part.fromOffset} is not correctly substituted to a value * is the condition really mandates a warning logged? Kafka directInputStream logs what appear to be incorrect warnings - Key: SPARK-6569 URL: https://issues.apache.org/jira/browse/SPARK-6569 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: Spark 1.3.0 Reporter: Platon Potapov Priority: Minor During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: {code} [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 {code} * the part.fromOffset placeholder is not correctly substituted to a value * is the condition really mandates a warning logged? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6569: -- Description: During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: {code} [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 {code} * the part.fromOffset placeholder is not correctly substituted with a value * does the condition really mandate a warning being logged? was: During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: {code} [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 {code} * the part.fromOffset placeholder is not correctly substituted to a value * is the condition really mandates a warning logged? Kafka directInputStream logs what appear to be incorrect warnings - Key: SPARK-6569 URL: https://issues.apache.org/jira/browse/SPARK-6569 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: Spark 1.3.0 Reporter: Platon Potapov Priority: Minor During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: {code} [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 {code} * the part.fromOffset placeholder is not correctly substituted with a value * does the condition really mandate a warning being logged? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
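The literal ${part.fromOffset} in the log output suggests a Scala string written for interpolation but missing the s prefix, so the placeholder is never substituted. A minimal sketch of the distinction (the actual KafkaRDD message text may differ):
{code}
val fromOffset = 42L

// Without the s prefix, the placeholder is logged verbatim:
println("Beginning offset ${fromOffset} is the same as ending offset")
// => Beginning offset ${fromOffset} is the same as ending offset

// With the s prefix, it is substituted:
println(s"Beginning offset ${fromOffset} is the same as ending offset")
// => Beginning offset 42 is the same as ending offset
{code}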
[jira] [Assigned] (SPARK-6548) Adding stddev to DataFrame functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6548: --- Assignee: (was: Apache Spark) Adding stddev to DataFrame functions Key: SPARK-6548 URL: https://issues.apache.org/jira/browse/SPARK-6548 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Labels: DataFrame, starter Fix For: 1.4.0 Add it to the list of aggregate functions: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Also add it to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala We can either add a Stddev Catalyst expression, or just compute it using existing functions like here: https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6548) Adding stddev to DataFrame functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6548: --- Assignee: Apache Spark Adding stddev to DataFrame functions Key: SPARK-6548 URL: https://issues.apache.org/jira/browse/SPARK-6548 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Labels: DataFrame, starter Fix For: 1.4.0 Add it to the list of aggregate functions: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Also add it to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala We can either add a Stddev Catalyst expression, or just compute it using existing functions like here: https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6548) Adding stddev to DataFrame functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383635#comment-14383635 ] Apache Spark commented on SPARK-6548: - User 'dreamquster' has created a pull request for this issue: https://github.com/apache/spark/pull/5228 Adding stddev to DataFrame functions Key: SPARK-6548 URL: https://issues.apache.org/jira/browse/SPARK-6548 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Labels: DataFrame, starter Fix For: 1.4.0 Add it to the list of aggregate functions: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Also add it to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala We can either add a Stddev Catalyst expression, or just compute it using existing functions like here: https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
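The "existing functions" route the ticket mentions could look roughly like the following sketch, assuming a DataFrame df with a numeric column x and a grouping column key. This computes the population standard deviation as sqrt(E[x^2] - E[x]^2); note this one-pass formula is numerically naive when values have a large mean.
{code}
import org.apache.spark.sql.functions._

// Population standard deviation from existing aggregates:
// stddev(x) = sqrt(avg(x*x) - avg(x)^2)
val x = col("x")
val stddevX = sqrt(avg(x * x) - avg(x) * avg(x))

df.groupBy("key").agg(stddevX.as("stddev_x")).show()
{code}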
[jira] [Commented] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns
[ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383650#comment-14383650 ] sdfox commented on SPARK-6489: -- I am interested in this. Optimize lateral view with explode to not read unnecessary columns -- Key: SPARK-6489 URL: https://issues.apache.org/jira/browse/SPARK-6489 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Konstantin Shaposhnikov Labels: starter Currently a query with lateral view explode(...) results in an execution plan that reads all columns of the underlying RDD. E.g. given a *ppl* table that is a DF created from the Person case class: {code} case class Person(val name: String, val age: Int, val data: Array[Int]) {code} the following SQL: {code} select name, sum(d) from ppl lateral view explode(data) d as d group by name {code} executes as follows:
{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS PartialSum#38L]
   Project [name#0,d#21]
    Generate explode(data#2), true, false
     InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35), Some(ppl))
{noformat}
Note that *age* column is not needed to produce the output but it is still read from the underlying RDD. A sample program to demonstrate the issue:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])

object ExplodeDemo extends App {
  val ppl = Array(
    Person("A", 20, Array(10, 12, 19)),
    Person("B", 25, Array(7, 8, 4)),
    Person("C", 19, Array(12, 4, 232)))

  val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
  val sc = new SparkContext(conf)
  val sqlCtx = new HiveContext(sc)
  import sqlCtx.implicits._

  val df = sc.makeRDD(ppl).toDF
  df.registerTempTable("ppl")
  sqlCtx.cacheTable("ppl") // cache table otherwise ExistingRDD will be used that does not support column pruning

  val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
  s.explain(true)
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6564) SQLContext.emptyDataFrame should contain 0 rows, not 1 row
Reynold Xin created SPARK-6564: -- Summary: SQLContext.emptyDataFrame should contain 0 rows, not 1 row Key: SPARK-6564 URL: https://issues.apache.org/jira/browse/SPARK-6564 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Right now emptyDataFrame actually contains 1 row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
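Until that is fixed, a user-level way to get a genuinely row-less DataFrame (a sketch, not the eventual patch; assumes a SQLContext sqlContext and its SparkContext sc) is to build one from an empty RDD and an empty schema:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Zero rows and zero columns, unlike the current emptyDataFrame:
val empty = sqlContext.createDataFrame(sc.emptyRDD[Row], StructType(Nil))
assert(empty.count() == 0)
{code}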
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383599#comment-14383599 ] Masayoshi TSUZUKI commented on SPARK-6435: -- Release 1.3.0 works fine, but the problem occurs in the latest script in the master branch (under development for 1.4). spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error:
{code}
scala> import com.google.common.base.Strings
<console>:19: error: object Strings is not a member of package com.google.common.base
       import com.google.common.base.Strings
              ^
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6562) DataFrame.replace value support
Reynold Xin created SPARK-6562: -- Summary: DataFrame.replace value support Key: SPARK-6562 URL: https://issues.apache.org/jira/browse/SPARK-6562 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Support replacing a set of values with another set of values (i.e. map join), similar to Pandas' replace. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
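Before a built-in replace lands, the behavior can be approximated with conditional expressions. A sketch assuming a DataFrame df with an integer column a; when/otherwise live in org.apache.spark.sql.functions in later Spark releases:
{code}
import org.apache.spark.sql.functions._

// Replace 1 -> 100 and 2 -> 200 in column "a", leaving other values alone:
val replaced = df.withColumn("a",
  when(col("a") === 1, 100).when(col("a") === 2, 200).otherwise(col("a")))
{code}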
[jira] [Updated] (SPARK-6119) DataFrame.dropna support
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6119: --- Summary: DataFrame.dropna support (was: better support for working with missing data) DataFrame.dropna support Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. Some ideas: 1. Support replacing all null values for a column (or all columns) with a fixed value. 2. Support dropping rows with null values (dropna). 3. Support replacing a set of values with another set of values (i.e. map join) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6563) DataFrame.fillna
Reynold Xin created SPARK-6563: -- Summary: DataFrame.fillna Key: SPARK-6563 URL: https://issues.apache.org/jira/browse/SPARK-6563 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Support replacing all null values for a column (or all columns) with a fixed value. Similar to Pandas' fillna. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.fillna.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
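Pending a built-in fillna, per-column null replacement can already be expressed with coalesce, which returns its first non-null argument. A sketch assuming a DataFrame df with a nullable column a:
{code}
import org.apache.spark.sql.functions._

// Fill nulls in column "a" with 0 (coalesce returns the first non-null value):
val filled = df.withColumn("a", coalesce(col("a"), lit(0)))
{code}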
[jira] [Updated] (SPARK-6119) DataFrame.dropna support
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6119: --- Description: Support dropping rows with null values (dropna). Similar to Pandas' dropna http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html was: Real world data can be messy. An important feature of data frames is support for missing data. We should figure out what we want to do here. Some ideas: 1. Support replacing all null values for a column (or all columns) with a fixed value. 2. Support dropping rows with null values (dropna). 3. Support replacing a set of values with another set of values (i.e. map join) DataFrame.dropna support Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Support dropping rows with null values (dropna). Similar to Pandas' dropna http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
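Likewise, dropping rows with nulls can already be written as a filter on isNotNull. A sketch over two invented columns, assuming a DataFrame df; the eventual dropna would presumably generalize this over all columns with how/thresh options as in Pandas:
{code}
import org.apache.spark.sql.functions.col

// Keep only the rows where both "a" and "b" are non-null:
val cleaned = df.filter(col("a").isNotNull && col("b").isNotNull)
{code}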
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383560#comment-14383560 ] Masayoshi TSUZUKI commented on SPARK-6435: -- I looked into the script of the latest version and unfortunately found that it doesn't work properly either. We have the same symptom when we specify multiple jars with the --jars option in spark-shell.cmd, but the cause is different. These work fine.
{code}
bin\spark-shell.cmd --jars C:\jar1.jar
bin\spark-shell.cmd --jars "C:\jar1.jar"
{code}
But this doesn't work.
{code}
bin\spark-shell.cmd --jars C:\jar1.jar,C:\jar2.jar
{code}
this gets
{code}
Exception in thread "main" java.net.URISyntaxException: Illegal character in path at index 11: C:/jar1.jar C:/jar2.jar
{code}
spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error:
{code}
scala> import com.google.common.base.Strings
<console>:19: error: object Strings is not a member of package com.google.common.base
       import com.google.common.base.Strings
              ^
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383550#comment-14383550 ] Masayoshi TSUZUKI commented on SPARK-5389: -- Hmm... sorry, it seems to be a different case from what I expected. And I still have no idea how to reproduce it. spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful:
{code}
spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
else was unexpected at this time.
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6119) DataFrame.dropna support
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383491#comment-14383491 ] Apache Spark commented on SPARK-6119: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5225 DataFrame.dropna support Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Support dropping rows with null values (dropna). Similar to Pandas' dropna http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6119) DataFrame.dropna support
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6119: --- Assignee: (was: Apache Spark) DataFrame.dropna support Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Support dropping rows with null values (dropna). Similar to Pandas' dropna http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6119) DataFrame.dropna support
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6119: --- Assignee: Apache Spark DataFrame.dropna support Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Labels: DataFrame Support dropping rows with null values (dropna). Similar to Pandas' dropna http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6564) SQLContext.emptyDataFrame should contain 0 rows, not 1 row
[ https://issues.apache.org/jira/browse/SPARK-6564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6564: --- Assignee: Apache Spark (was: Reynold Xin) SQLContext.emptyDataFrame should contain 0 rows, not 1 row -- Key: SPARK-6564 URL: https://issues.apache.org/jira/browse/SPARK-6564 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Right now emptyDataFrame actually contains 1 row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6564) SQLContext.emptyDataFrame should contain 0 rows, not 1 row
[ https://issues.apache.org/jira/browse/SPARK-6564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6564: --- Assignee: Reynold Xin (was: Apache Spark) SQLContext.emptyDataFrame should contain 0 rows, not 1 row -- Key: SPARK-6564 URL: https://issues.apache.org/jira/browse/SPARK-6564 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Right now emptyDataFrame actually contains 1 row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6564) SQLContext.emptyDataFrame should contain 0 rows, not 1 row
[ https://issues.apache.org/jira/browse/SPARK-6564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383495#comment-14383495 ] Apache Spark commented on SPARK-6564: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5226 SQLContext.emptyDataFrame should contain 0 rows, not 1 row -- Key: SPARK-6564 URL: https://issues.apache.org/jira/browse/SPARK-6564 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Right now emptyDataFrame actually contains 1 row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6565) Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF
Cheng Lian created SPARK-6565: - Summary: Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF Key: SPARK-6565 URL: https://issues.apache.org/jira/browse/SPARK-6565 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Priority: Minor Since 1.3.0, {{SQLContext.jsonRDD}} actually returns a {{DataFrame}}, the original name becomes confusing. Would be better to deprecate it and add {{jsonDataFrame}} or {{jsonDF}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
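The mechanics of such a rename are standard Scala deprecation: keep the old name as a forwarder annotated @deprecated so existing code still compiles with a warning. A schematic sketch with simplified stand-in signatures (not the real SQLContext methods):
{code}
object JsonApiSketch {
  // Stand-in with simplified signatures, illustrating the pattern only.
  def jsonDF(json: Seq[String]): Seq[String] = json

  @deprecated("Use jsonDF instead; this method returns a DataFrame since 1.3.0", "1.4.0")
  def jsonRDD(json: Seq[String]): Seq[String] = jsonDF(json)
}
{code}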
[jira] [Commented] (SPARK-6406) Launcher backward compatibility issues
[ https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383519#comment-14383519 ] Sean Owen commented on SPARK-6406: -- (Might want to update the title and description to reflect what this is really about now; I wasn't 100% sure what the latest intent was.) Launcher backward compatibility issues -- Key: SPARK-6406 URL: https://issues.apache.org/jira/browse/SPARK-6406 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Nishkam Ravi Priority: Minor The new launcher library breaks backward compatibility. The "hadoop" string in the spark assembly should not be mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383601#comment-14383601 ] Apache Spark commented on SPARK-6435: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/5227 spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error:
{code}
scala> import com.google.common.base.Strings
<console>:19: error: object Strings is not a member of package com.google.common.base
       import com.google.common.base.Strings
              ^
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6435: --- Assignee: (was: Apache Spark) spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error:
{code}
scala> import com.google.common.base.Strings
<console>:19: error: object Strings is not a member of package com.google.common.base
       import com.google.common.base.Strings
              ^
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6435: --- Assignee: Apache Spark spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Assignee: Apache Spark Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars:
{code}
jar cfM jar1.jar log.txt
jar cfM jar2.jar log.txt
jar cfM jar3.jar log.txt
jar cfM jar4.jar log.txt
{code}
Start the spark-shell with the dummy jars and guava at the end:
{code}
%SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
{code}
In the shell, try importing from guava; you'll get an error:
{code}
scala> import com.google.common.base.Strings
<console>:19: error: object Strings is not a member of package com.google.common.base
       import com.google.common.base.Strings
              ^
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6566) Update Spark to use the latest version of Parquet libraries
Konstantin Shaposhnikov created SPARK-6566: -- Summary: Update Spark to use the latest version of Parquet libraries Key: SPARK-6566 URL: https://issues.apache.org/jira/browse/SPARK-6566 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Konstantin Shaposhnikov There are a lot of bug fixes in the latest version of parquet (1.6.0rc7). E.g. PARQUET-136 It would be good to update Spark to use the latest parquet version. The following changes are required:
{code}
diff --git a/pom.xml b/pom.xml
index 5ad39a9..095b519 100644
--- a/pom.xml
+++ b/pom.xml
@@ -132,7 +132,7 @@
     <!-- Version used for internal directory structure -->
     <hive.version.short>0.13.1</hive.version.short>
     <derby.version>10.10.1.1</derby.version>
-    <parquet.version>1.6.0rc3</parquet.version>
+    <parquet.version>1.6.0rc7</parquet.version>
     <jblas.version>1.2.3</jblas.version>
     <jetty.version>8.1.14.v20131031</jetty.version>
     <orbit.version>3.0.0.v201112011016</orbit.version>
{code}
and
{code}
--- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
@@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
     globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
       mergedMetadata, globalMetaData.getCreatedBy)
 
-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
       new InitContext(configuration,
         globalMetaData.getKeyValueMetaData,
         globalMetaData.getSchema))
{code}
I am happy to prepare a pull request if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383566#comment-14383566 ] vijay commented on SPARK-6435: -- Strange - when I test it with multiple jars (with the fixed script) everything works spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars:
{code}
jar cfM jar1.jar log.txt
jar cfM jar2.jar log.txt
jar cfM jar3.jar log.txt
jar cfM jar4.jar log.txt
{code}
Start the spark-shell with the dummy jars and guava at the end:
{code}
%SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
{code}
In the shell, try importing from guava; you'll get an error:
{code}
scala> import com.google.common.base.Strings
<console>:19: error: object Strings is not a member of package com.google.common.base
       import com.google.common.base.Strings
              ^
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383566#comment-14383566 ] vijay edited comment on SPARK-6435 at 3/27/15 9:17 AM: --- Strange - when I test it with multiple jars (with the fixed script) everything works. Something has changed in some other script wrt the released 1.3.0 was (Author: vjapache): Strange - when I test it with multiple jars (with the fixed script) everything works spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars:
{code}
jar cfM jar1.jar log.txt
jar cfM jar2.jar log.txt
jar cfM jar3.jar log.txt
jar cfM jar4.jar log.txt
{code}
Start the spark-shell with the dummy jars and guava at the end:
{code}
%SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
{code}
In the shell, try importing from guava; you'll get an error:
{code}
scala> import com.google.common.base.Strings
<console>:19: error: object Strings is not a member of package com.google.common.base
       import com.google.common.base.Strings
              ^
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6255) Python MLlib API missing items: Classification
[ https://issues.apache.org/jira/browse/SPARK-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6255: --- Assignee: Apache Spark (was: Yanbo Liang) Python MLlib API missing items: Classification -- Key: SPARK-6255 URL: https://issues.apache.org/jira/browse/SPARK-6255 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task.
LogisticRegressionWithLBFGS
* setNumClasses
* setValidateData
LogisticRegressionModel
* getThreshold
* numClasses
* numFeatures
SVMWithSGD
* setValidateData
SVMModel
* getThreshold
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6255) Python MLlib API missing items: Classification
[ https://issues.apache.org/jira/browse/SPARK-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6255: --- Assignee: Yanbo Liang (was: Apache Spark) Python MLlib API missing items: Classification -- Key: SPARK-6255 URL: https://issues.apache.org/jira/browse/SPARK-6255 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Yanbo Liang This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task.
LogisticRegressionWithLBFGS
* setNumClasses
* setValidateData
LogisticRegressionModel
* getThreshold
* numClasses
* numFeatures
SVMWithSGD
* setValidateData
SVMModel
* getThreshold
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5563: --- Assignee: Apache Spark (was: yuhao yang) LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking in to the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5563: --- Assignee: yuhao yang (was: Apache Spark) LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: yuhao yang Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking in to the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383729#comment-14383729 ] Sean Owen commented on SPARK-6435: -- OK, [~vjapache] would you like to submit a PR that changes to use the brackets? You may need two PRs, one for branch-1.3 and one for master, since some occurrences are now gone in master. [~tsudukim] OK I understand your pull request is to fix a similar issue but in the new 1.4 / master code? spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars:
{code}
jar cfM jar1.jar log.txt
jar cfM jar2.jar log.txt
jar cfM jar3.jar log.txt
jar cfM jar4.jar log.txt
{code}
Start the spark-shell with the dummy jars and guava at the end:
{code}
%SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
{code}
In the shell, try importing from guava; you'll get an error:
{code}
scala> import com.google.common.base.Strings
<console>:19: error: object Strings is not a member of package com.google.common.base
       import com.google.common.base.Strings
              ^
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
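For anyone reproducing SPARK-6435, one way to see which --jars entries actually reached the driver is to inspect the spark.jars configuration from inside the shell; a hedged sketch, assuming the shell's implicit SparkContext sc:
{code}
// Sketch: list what --jars actually registered with the driver.
sc.getConf.get("spark.jars", "").split(",").filter(_.nonEmpty).foreach(println)
{code}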
[jira] [Updated] (SPARK-6556) Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver
[ https://issues.apache.org/jira/browse/SPARK-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6556: - Affects Version/s: 1.4.0 Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver Key: SPARK-6556 URL: https://issues.apache.org/jira/browse/SPARK-6556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.4.0 The current reading logic of executorTimeoutMs is:
{code}
private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout",
  sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
{code}
So if "spark.storage.blockManagerSlaveTimeoutMs" is 1 and "spark.network.timeout" is not set, executorTimeoutMs will be 1 * 1000. But the correct value should have been 1. checkTimeoutIntervalMs has the same issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6556) Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver
[ https://issues.apache.org/jira/browse/SPARK-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6556. -- Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Shixiong Zhu Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver Key: SPARK-6556 URL: https://issues.apache.org/jira/browse/SPARK-6556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.4.0 The current reading logic of executorTimeoutMs is:
{code}
private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout",
  sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
{code}
So if "spark.storage.blockManagerSlaveTimeoutMs" is 1 and "spark.network.timeout" is not set, executorTimeoutMs will be 1 * 1000. But the correct value should have been 1. checkTimeoutIntervalMs has the same issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
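To make the SPARK-6556 bug concrete, here is a hedged sketch of corrected reading logic: only the seconds-based spark.network.timeout should be scaled, while the blockManagerSlaveTimeoutMs fallback is already in milliseconds (illustrative, not the merged patch):
{code}
// Sketch: spark.network.timeout is specified in seconds, so it is scaled;
// spark.storage.blockManagerSlaveTimeoutMs is already in milliseconds.
private val executorTimeoutMs = sc.conf.getOption("spark.network.timeout")
  .map(_.toLong * 1000)
  .getOrElse(sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120 * 1000))
{code}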
[jira] [Assigned] (SPARK-5155) Python API for MQTT streaming
[ https://issues.apache.org/jira/browse/SPARK-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5155: --- Assignee: Prabeesh K (was: Apache Spark) Python API for MQTT streaming - Key: SPARK-5155 URL: https://issues.apache.org/jira/browse/SPARK-5155 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu Assignee: Prabeesh K Python API for MQTT Utils -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5155) Python API for MQTT streaming
[ https://issues.apache.org/jira/browse/SPARK-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5155: --- Assignee: Apache Spark (was: Prabeesh K) Python API for MQTT streaming - Key: SPARK-5155 URL: https://issues.apache.org/jira/browse/SPARK-5155 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu Assignee: Apache Spark Python API for MQTT Utils -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6569. -- Resolution: Duplicate Kafka directInputStream logs what appear to be incorrect warnings - Key: SPARK-6569 URL: https://issues.apache.org/jira/browse/SPARK-6569 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: Spark 1.3.0 Reporter: Platon Potapov Priority: Minor During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically:
{code}
[Stage 391:== (3 + 0) / 4]
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0
{code}
* the part.fromOffset placeholder is not correctly substituted with a value
* does the condition really mandate a warning being logged?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
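The literal ${part.fromOffset} in the SPARK-6569 log output is the signature of a missing Scala string interpolator; a minimal illustration (not the actual KafkaRDD source):
{code}
val fromOffset = 42L
// Without the `s` prefix the placeholder is printed literally:
println("Beginning offset ${fromOffset} is the same as ending offset")
// => Beginning offset ${fromOffset} is the same as ending offset

// With the interpolator the value is substituted:
println(s"Beginning offset ${fromOffset} is the same as ending offset")
// => Beginning offset 42 is the same as ending offset
{code}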
[jira] [Resolved] (SPARK-6535) new RDD function that returns intermediate Future
[ https://issues.apache.org/jira/browse/SPARK-6535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6535. -- Resolution: Not a Problem I think it's fair to say that this would not require a change to Spark to implement the desired functionality, so closing it. new RDD function that returns intermediate Future - Key: SPARK-6535 URL: https://issues.apache.org/jira/browse/SPARK-6535 Project: Spark Issue Type: Wish Components: Spark Core Reporter: Eric Johnston Priority: Minor Labels: features, newbie Original Estimate: 168h Remaining Estimate: 168h I'm suggesting a possible Spark RDD method that I think could give value to a number of people. I'd be interested in thoughts and feedback. Is this a good or bad idea in general? Will it work well, but is too specific for Spark-Core?
{code}
def mapIO[V : ClassTag](f1 : T => Future[U], f2 : U => V, batchSize : Int) : RDD[V]
{code}
The idea is that often times we have an RDD[T] containing metadata, for example a file path or a unique identifier to data in an external database. We would like to retrieve this data, process it, and provide the output as an RDD. Right now, one way to do that is with two map calls: the first being T => U, followed by U => V. However, this will block on all T => U IO operations. By wrapping U in a Future, this problem is avoided. The batchSize is added because we do not want to create a future for every row in a partition -- we may get too much data back at once. The batchSize limits the number of outstanding Futures within a partition. Ideally this number is set to be big enough so that there is always data ready to process, but small enough that not too much data is pulled at any one time. We could potentially default the batchSize to 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
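In the spirit of that resolution, a hedged sketch of how the proposed mapIO could be built on mapPartitions without any change to Spark core (the names mapIO, f1, f2 and batchSize come from the proposal; the body is illustrative):
{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def mapIO[T, U, V: ClassTag](rdd: RDD[T], batchSize: Int)
                            (f1: T => Future[U])(f2: U => V): RDD[V] =
  rdd.mapPartitions { iter =>
    // Keep at most batchSize futures outstanding per partition: launch a
    // batch of IO requests, then block for and transform their results.
    iter.grouped(batchSize).flatMap { batch =>
      val futures = batch.map(f1)
      futures.map(f => f2(Await.result(f, Duration.Inf)))
    }
  }
{code}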
[jira] [Assigned] (SPARK-6440) ipv6 URI for HttpServer
[ https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6440: --- Assignee: (was: Apache Spark) ipv6 URI for HttpServer --- Key: SPARK-6440 URL: https://issues.apache.org/jira/browse/SPARK-6440 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster Reporter: Arsenii Krasikov Priority: Minor In {{org.apache.spark.HttpServer}} the URI is generated as {code:java}"spark://" + localHostname + ":" + masterPort{code}, where {{localHostname}} is {code:java}org.apache.spark.util.Utils.localHostName() = customHostname.getOrElse(localIpAddressHostname){code}. If the host has an ipv6 address then it would be interpolated into an invalid URI: {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}. The solution is to separate uri and hostname entities. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6440) ipv6 URI for HttpServer
[ https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6440: --- Assignee: Apache Spark ipv6 URI for HttpServer --- Key: SPARK-6440 URL: https://issues.apache.org/jira/browse/SPARK-6440 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster Reporter: Arsenii Krasikov Assignee: Apache Spark Priority: Minor In {{org.apache.spark.HttpServer}} the URI is generated as {code:java}"spark://" + localHostname + ":" + masterPort{code}, where {{localHostname}} is {code:java}org.apache.spark.util.Utils.localHostName() = customHostname.getOrElse(localIpAddressHostname){code}. If the host has an ipv6 address then it would be interpolated into an invalid URI: {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}. The solution is to separate uri and hostname entities. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
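A hedged sketch of the fix direction for SPARK-6440, bracketing IPv6 literals before splicing them into a URI (toUriHost is a hypothetical helper, and localHostname and masterPort are the values from the description, not defined here):
{code}
// RFC 2732 requires IPv6 literals to be wrapped in [] inside URIs.
def toUriHost(host: String): String =
  if (host.contains(":") && !host.startsWith("[")) s"[$host]" else host

val uri = "spark://" + toUriHost(localHostname) + ":" + masterPort
// fe80:0:0:0:200:f8ff:fe21:67cf => spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42
{code}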
[jira] [Assigned] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name
[ https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6558: --- Assignee: Apache Spark (was: Thomas Graves) Utils.getCurrentUserName returns the full principal name instead of login name -- Key: SPARK-6558 URL: https://issues.apache.org/jira/browse/SPARK-6558 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Thomas Graves Assignee: Apache Spark Priority: Critical Utils.getCurrentUserName returns UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't set. It should return UserGroupInformation.getCurrentUser().getShortUserName() getUserName() returns the user's full principal name (i.e. us...@corp.com). getShortUserName() returns just the user's login name (user1). This just happens to work on YARN because the Client code sets: env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6570) Spark SQL arrays: explode() fails and cannot save array type to Parquet
[ https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jon Chase updated SPARK-6570: - Summary: Spark SQL arrays: explode() fails and cannot save array type to Parquet (was: Spark SQL explode() fails, assumes underlying SQL array is represented by Scala Seq) Spark SQL arrays: explode() fails and cannot save array type to Parquet - Key: SPARK-6570 URL: https://issues.apache.org/jira/browse/SPARK-6570 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jon Chase
{code}
@Rule
public TemporaryFolder tmp = new TemporaryFolder();

@Test
public void testPercentileWithExplode() throws Exception {
    StructType schema = DataTypes.createStructType(Lists.newArrayList(
            DataTypes.createStructField("col1", DataTypes.StringType, false),
            DataTypes.createStructField("col2s", DataTypes.createArrayType(DataTypes.IntegerType, true), true)
    ));

    JavaRDD<Row> rowRDD = sc.parallelize(Lists.newArrayList(
            RowFactory.create("test", new int[]{1, 2, 3})
    ));

    DataFrame df = sql.createDataFrame(rowRDD, schema);
    df.registerTempTable("df");
    df.printSchema();

    List<int[]> ints = sql.sql("select col2s from df").javaRDD()
            .map(row -> (int[]) row.get(0)).collect();
    assertEquals(1, ints.size());
    assertArrayEquals(new int[]{1, 2, 3}, ints.get(0));

    // fails: lateral view explode does not work:
    // java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
    List<Integer> explodedInts = sql.sql("select col2 from df lateral view explode(col2s) splode as col2").javaRDD()
            .map(row -> row.getInt(0)).collect();
    assertEquals(3, explodedInts.size());
    assertEquals(Lists.newArrayList(1, 2, 3), explodedInts);

    // fails: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
    df.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/parquet");
    DataFrame loadedDf = sql.load(tmp.getRoot().getAbsolutePath() + "/parquet");
    loadedDf.registerTempTable("loadedDf");

    List<int[]> moreInts = sql.sql("select col2s from loadedDf").javaRDD()
            .map(row -> (int[]) row.get(0)).collect();
    assertEquals(1, moreInts.size());
    assertArrayEquals(new int[]{1, 2, 3}, moreInts.get(0));
}
{code}
{code}
root
 |-- col1: string (nullable = false)
 |-- col2s: array (nullable = true)
 |    |-- element: integer (containsNull = true)

ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
	at org.apache.spark.sql.catalyst.expressions.Explode.eval(generators.scala:125) ~[spark-catalyst_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:70) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:69) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) ~[scala-library-2.10.4.jar:na]
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
	at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.4.jar:na]
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.4.jar:na]
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6570) Spark SQL explode() fails, assumes underlying SQL array is represented by Scala Seq
[ https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383794#comment-14383794 ] Jon Chase commented on SPARK-6570: -- Stack trace for saveAsParquetFile():
{code}
root
 |-- col1: string (nullable = false)
 |-- col2s: array (nullable = true)
 |    |-- element: integer (containsNull = true)

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
	at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:185) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
	at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) ~[parquet-hadoop-1.6.0rc3.jar:na]
	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) ~[parquet-hadoop-1.6.0rc3.jar:na]
	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) ~[parquet-hadoop-1.6.0rc3.jar:na]
	at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:631) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.scheduler.Task.run(Task.scala:64) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_31]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_31]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
WARN o.a.spark.scheduler.TaskSetManager Lost task 7.0 in stage 1.0 (TID 15, localhost): java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
	at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:185)
	at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
	at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
	at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
	at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:631)
	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
ERROR o.a.spark.scheduler.TaskSetManager Task 7 in stage 1.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 1.0 failed 1 times, most recent failure: Lost task 7.0 in stage 1.0 (TID 15, localhost): java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
	at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:185)
	at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
	at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
	at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
	at
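Both traces above point at the same root cause: Spark SQL 1.3 appears to materialize ArrayType columns as a Scala Seq rather than a primitive array. A hedged workaround sketch in Scala (whether java.util.List values are converted for Java callers is not confirmed here):
{code}
import org.apache.spark.sql.Row

// Build the row with a Seq, which matches what Explode and the Parquet
// writer cast to internally, instead of a primitive int[] / Array(1, 2, 3).
val row = Row("test", Seq(1, 2, 3))
{code}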
[jira] [Assigned] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
[ https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-1684: --- Assignee: Patrick Wendell (was: Apache Spark) Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor Labels: starter Attachments: spark_pulls_before_after.txt If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: Issue" we should convert it to "SPARK-XXX: Issue" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6571) MatrixFactorizationModel created by load fails on predictAll
Charles Hayden created SPARK-6571: - Summary: MatrixFactorizationModel created by load fails on predictAll Key: SPARK-6571 URL: https://issues.apache.org/jira/browse/SPARK-6571 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Charles Hayden This code, adapted from the documentation, fails when using a loaded model.
{code}
from pyspark.mllib.recommendation import ALS, Rating, MatrixFactorizationModel
r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings = sc.parallelize([r1, r2, r3])
model = ALS.trainImplicit(ratings, 1, seed=10)
print '(2, 2)', model.predict(2, 2)
#0.43...
testset = sc.parallelize([(1, 2), (1, 1)])
print 'all', model.predictAll(testset).collect()
#[Rating(user=1, product=1, rating=1.0...), Rating(user=1, product=2, rating=1.9...)]
import os, tempfile
path = tempfile.mkdtemp()
model.save(sc, path)
sameModel = MatrixFactorizationModel.load(sc, path)
print '(2, 2)', sameModel.predict(2,2)
sameModel.predictAll(testset).collect()
{code}
This gives
{code}
(2, 2) 0.443547642944
all [Rating(user=1, product=1, rating=1.1538351103381217), Rating(user=1, product=2, rating=0.7153473708381739)]
(2, 2) 0.443547642944
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-18-af6612bed9d0> in <module>()
     19 sameModel = MatrixFactorizationModel.load(sc, path)
     20 print '(2, 2)', sameModel.predict(2,2)
---> 21 sameModel.predictAll(testset).collect()
     22

/home/ubuntu/spark/python/pyspark/mllib/recommendation.pyc in predictAll(self, user_product)
    104         assert len(first) == 2, "user_product should be RDD of (user, product)"
    105         user_product = user_product.map(lambda (u, p): (int(u), int(p)))
--> 106         return self.call("predict", user_product)
    107
    108     def userFeatures(self):

/home/ubuntu/spark/python/pyspark/mllib/common.pyc in call(self, name, *a)
    134     def call(self, name, *a):
    135         """Call method of java_model"""
--> 136         return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
    137
    138

/home/ubuntu/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, *args)
    111     """ Call Java Function """
    112     args = [_py2java(sc, a) for a in args]
--> 113     return _java2py(sc, func(*args))
    114
    115

/home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    302                 raise Py4JError(
    303                     'An error occurred while calling {0}{1}{2}. Trace:\n{3}\n'.
--> 304                     format(target_id, '.', name, value))
    305             else:
    306                 raise Py4JError(

Py4JError: An error occurred while calling o450.predict. Trace:
py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
	at py4j.Gateway.invoke(Gateway.java:252)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD
[ https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6348: --- Assignee: (was: Apache Spark) Enable useFeatureScaling in SVMWithSGD -- Key: SPARK-6348 URL: https://issues.apache.org/jira/browse/SPARK-6348 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: tanyinyan Priority: Minor Original Estimate: 2h Remaining Estimate: 2h Currently, useFeatureScaling is set to false by default in class GeneralizedLinearAlgorithm, and it is only enabled in LogisticRegressionWithLBFGS. SVMWithSGD is a private class; its train methods are provided in the SVMWithSGD object. So there is no way to set useFeatureScaling when using SVM. I am using SVM on a dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the first day's data (ignoring the fields id/device_id/device_ip; all remaining fields are considered categorical variables and sparse-encoded before SVM) and predicting on the same data with the threshold cleared; the predicted results are all negative. When I set useFeatureScaling to true, the predictions are normal (including both negative and positive results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name
[ https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-6558: Assignee: Thomas Graves Utils.getCurrentUserName returns the full principal name instead of login name -- Key: SPARK-6558 URL: https://issues.apache.org/jira/browse/SPARK-6558 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical Utils.getCurrentUserName returns UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't set. It should return UserGroupInformation.getCurrentUser().getShortUserName() getUserName() returns the user's full principal name (i.e. us...@corp.com). getShortUserName() returns just the user's login name (user1). This just happens to work on YARN because the Client code sets: env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383856#comment-14383856 ] Thomas Graves commented on SPARK-5493: -- [~vanzin] I must be missing something. Why is this feature needed? I can run spark through oozie just fine without this on a secure yarn cluster. (and jobs run as the correct user) Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland Assignee: Marcelo Vanzin Fix For: 1.3.0 When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
[ https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-1684: --- Assignee: Apache Spark (was: Patrick Wendell) Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Apache Spark Priority: Minor Labels: starter Attachments: spark_pulls_before_after.txt If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: Issue" we should convert it to "SPARK-XXX: Issue" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
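A hedged sketch of the normalization SPARK-1684 asks the merge script to apply (the regex and helper name are invented for illustration, and written in Scala for consistency with the other sketches even though the merge script itself is Python):
{code}
val TitlePattern = """(?i)\[?SPARK[- ](\d+)\]?[:.]?\s*(.*)""".r

def standardizeTitle(title: String): String = title match {
  case TitlePattern(num, rest) => s"SPARK-$num: $rest"
  case other                   => other
}

// standardizeTitle("[SPARK-1684] Merge script")  => "SPARK-1684: Merge script"
// standardizeTitle("SPARK-1684. Merge script")   => "SPARK-1684: Merge script"
// standardizeTitle("SPARK 1684: Merge script")   => "SPARK-1684: Merge script"
{code}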
[jira] [Assigned] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD
[ https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6348: --- Assignee: Apache Spark Enable useFeatureScaling in SVMWithSGD -- Key: SPARK-6348 URL: https://issues.apache.org/jira/browse/SPARK-6348 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: tanyinyan Assignee: Apache Spark Priority: Minor Original Estimate: 2h Remaining Estimate: 2h Currently, useFeatureScaling is set to false by default in class GeneralizedLinearAlgorithm, and it is only enabled in LogisticRegressionWithLBFGS. SVMWithSGD is a private class; its train methods are provided in the SVMWithSGD object. So there is no way to set useFeatureScaling when using SVM. I am using SVM on a dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the first day's data (ignoring the fields id/device_id/device_ip; all remaining fields are considered categorical variables and sparse-encoded before SVM) and predicting on the same data with the threshold cleared; the predicted results are all negative. When I set useFeatureScaling to true, the predictions are normal (including both negative and positive results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
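Until useFeatureScaling is exposed for SVM, one hedged workaround sketch is to standardize features manually before training (data: RDD[LabeledPoint] and numIterations are assumed to be defined; withMean stays false so sparse vectors remain sparse):
{code}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// Fit a scaler on the raw features, then train on the scaled copies.
val scaler = new StandardScaler(withMean = false, withStd = true)
  .fit(data.map(_.features))
val scaled = data
  .map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
  .cache()
val model = SVMWithSGD.train(scaled, numIterations)
{code}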
[jira] [Assigned] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name
[ https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6558: --- Assignee: Thomas Graves (was: Apache Spark) Utils.getCurrentUserName returns the full principal name instead of login name -- Key: SPARK-6558 URL: https://issues.apache.org/jira/browse/SPARK-6558 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical Utils.getCurrentUserName returns UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't set. It should return UserGroupInformation.getCurrentUser().getShortUserName() getUserName() returns the user's full principal name (i.e. us...@corp.com). getShortUserName() returns just the user's login name (user1). This just happens to work on YARN because the Client code sets: env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name
[ https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383843#comment-14383843 ] Apache Spark commented on SPARK-6558: - User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/5229 Utils.getCurrentUserName returns the full principal name instead of login name -- Key: SPARK-6558 URL: https://issues.apache.org/jira/browse/SPARK-6558 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical Utils.getCurrentUserName returns UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't set. It should return UserGroupInformation.getCurrentUser().getShortUserName() getUserName() returns the user's full principal name (i.e. us...@corp.com). getShortUserName() returns just the user's login name (user1). This just happens to work on YARN because the Client code sets: env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
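A hedged sketch of the change SPARK-6558 describes, preferring the short login name (illustrative, not the merged patch):
{code}
import org.apache.hadoop.security.UserGroupInformation

// Fall back to the Kerberos short name (user1) rather than the full
// principal (us...@corp.com) when SPARK_USER is not set.
def getCurrentUserName(): String =
  Option(System.getenv("SPARK_USER"))
    .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
{code}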
[jira] [Comment Edited] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383856#comment-14383856 ] Thomas Graves edited comment on SPARK-5493 at 3/27/15 2:00 PM: --- [~vanzin] I must be missing something. Why is this feature needed? I can run spark through oozie just fine without this on a secure yarn cluster. (and jobs run as the correct user) perhaps needed by hive? Or is it just to allow a proxy user to manually run things (ie not through oozie), which seems a bit odd to me. was (Author: tgraves): [~vanzin] I must be missing something. Why is this feature needed? I can run spark through oozie just fine without this on a secure yarn cluster. (and jobs run as the correct user) Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland Assignee: Marcelo Vanzin Fix For: 1.3.0 When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6544) Problem with Avro and Kryo Serialization
[ https://issues.apache.org/jira/browse/SPARK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6544. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5193 [https://github.com/apache/spark/pull/5193] Problem with Avro and Kryo Serialization Key: SPARK-6544 URL: https://issues.apache.org/jira/browse/SPARK-6544 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: Dean Chen Fix For: 1.4.0 We're running in to the following bug with Avro 1.7.6 and the Kryo serializer causing jobs to fail https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249 PR here https://github.com/apache/spark/pull/5193 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6572) When I build Spark 1.3 sbt gives me the following error : unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1:
Frank Domoney created SPARK-6572: Summary: When I build Spark 1.3 sbt gives me the following error:
{code}
unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found
org.scalamacros#quasiquotes_2.11;2.0.1: not found
[error] Total time: 27 s, completed 27-Mar-2015 14:24:39
{code}
Key: SPARK-6572 URL: https://issues.apache.org/jira/browse/SPARK-6572 Project: Spark Issue Type: Bug Reporter: Frank Domoney -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov reopened SPARK-6569: --- Sean, please explain whether the condition really mandates a warning being logged. The scenario in which this record gets logged seems to be just that there is no new data in the Kafka topic (the Kafka reader is at the head of the topic) - isn't that the case? Kafka directInputStream logs what appear to be incorrect warnings - Key: SPARK-6569 URL: https://issues.apache.org/jira/browse/SPARK-6569 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: Spark 1.3.0 Reporter: Platon Potapov Priority: Minor During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically:
{code}
[Stage 391:== (3 + 0) / 4]
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0
{code}
* the part.fromOffset placeholder is not correctly substituted with a value
* does the condition really mandate a warning being logged?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)
Fabian Boehnlein created SPARK-6573: --- Summary: expect pandas null values as numpy.nan (not only as None) Key: SPARK-6573 URL: https://issues.apache.org/jira/browse/SPARK-6573 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Fabian Boehnlein In pandas it is common to use numpy.nan as the null value, for missing data or whatever. http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna createDataFrame however only works with None as null values, parsing them as None in the RDD. I suggest adding support for np.nan values in pandas DataFrames. Current stack trace when creating a DataFrame from object-type columns with np.nan values (which are floats):
{code}
TypeError                                 Traceback (most recent call last)
<ipython-input-38-34f0263f0bf4> in <module>()
----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
    340
--> 341         return self.applySchema(data, schema)
    342
    343     def registerDataFrameAsTable(self, rdd, tableName):

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
    246
    247         for row in rows:
--> 248             _verify_type(row, schema)
    249
    250         # convert python objects to sql data

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1064                             "length of fields (%d)" % (len(obj), len(dataType.fields)))
   1065         for v, f in zip(obj, dataType.fields):
-> 1066             _verify_type(v, f.dataType)
   1067
   1068 _cached_cls = weakref.WeakValueDictionary()

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1048     if type(obj) not in _acceptable_types[_type]:
   1049         raise TypeError("%s can not accept object in type %s"
-> 1050                         % (dataType, type(obj)))
   1051
   1052     if isinstance(dataType, ArrayType):

TypeError: StringType can not accept object in type <type 'float'>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383973#comment-14383973 ] Sean Owen commented on SPARK-6569: -- [~c...@koeninger.org] what do you think of the warning? If it's more suitable as info, we can reopen and address that. The interpolation was already fixed separately. Kafka directInputStream logs what appear to be incorrect warnings - Key: SPARK-6569 URL: https://issues.apache.org/jira/browse/SPARK-6569 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: Spark 1.3.0 Reporter: Platon Potapov Priority: Minor During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically:
{code}
[Stage 391:== (3 + 0) / 4]
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0
{code}
* the part.fromOffset placeholder is not correctly substituted with a value
* does the condition really mandate a warning being logged?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Boehnlein updated SPARK-6573: Issue Type: Sub-task (was: Improvement) Parent: SPARK-6116 expect pandas null values as numpy.nan (not only as None) - Key: SPARK-6573 URL: https://issues.apache.org/jira/browse/SPARK-6573 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Fabian Boehnlein In pandas it is common to use numpy.nan as the null value, for missing data or whatever. http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna createDataFrame however only works with None as null values, parsing them as None in the RDD. I suggest adding support for np.nan values in pandas DataFrames. Current stack trace when creating a DataFrame from object-type columns with np.nan values (which are floats):
{code}
TypeError                                 Traceback (most recent call last)
<ipython-input-38-34f0263f0bf4> in <module>()
----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
    340
--> 341         return self.applySchema(data, schema)
    342
    343     def registerDataFrameAsTable(self, rdd, tableName):

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
    246
    247         for row in rows:
--> 248             _verify_type(row, schema)
    249
    250         # convert python objects to sql data

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1064                             "length of fields (%d)" % (len(obj), len(dataType.fields)))
   1065         for v, f in zip(obj, dataType.fields):
-> 1066             _verify_type(v, f.dataType)
   1067
   1068 _cached_cls = weakref.WeakValueDictionary()

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1048     if type(obj) not in _acceptable_types[_type]:
   1049         raise TypeError("%s can not accept object in type %s"
-> 1050                         % (dataType, type(obj)))
   1051
   1052     if isinstance(dataType, ArrayType):

TypeError: StringType can not accept object in type <type 'float'>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6574) Python Example sql.py not working in version 1.3
[ https://issues.apache.org/jira/browse/SPARK-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6574: --- Assignee: Apache Spark (was: Davies Liu) Python Example sql.py not working in version 1.3 Key: SPARK-6574 URL: https://issues.apache.org/jira/browse/SPARK-6574 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Davies Liu Assignee: Apache Spark Priority: Blocker I downloaded spark version spark-1.3.0-bin-hadoop2.4. When the python version of sql.py is run, the following error occurs: [root@nde-dev8-template python]# /root/spark-1.3.0-bin-hadoop2.4/bin/spark-submit sql.py Spark assembly has been built with Hive, including Datanucleus jars on classpath Traceback (most recent call last): File "/root/spark-1.3.0-bin-hadoop2.4/examples/src/main/python/sql.py", line 22, in <module> from pyspark.sql import Row, StructField, StructType, StringType, IntegerType ImportError: cannot import name StructField -- The sql.py version, spark-1.2.1-bin-hadoop2.4, does not throw the error: [root@nde-dev8-template python]# /root/spark-1.2.1-bin-hadoop2.4/bin/spark-submit sql.py Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/27 14:18:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/27 14:19:41 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy? root |-- age: integer (nullable = true) |-- name: string (nullable = true) root |-- person_name: string (nullable = false) |-- person_age: integer (nullable = false) root |-- age: integer (nullable = true) |-- name: string (nullable = true) Justin - The OS/JAVA environments are: OS: Linux nde-dev8-template 2.6.32-431.17.1.el6.x86_64 #1 SMP Fri Apr 11 17:27:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux JAVA: [root@nde-dev8-template bin]# java -version java version "1.7.0_51" Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) The same error occurs when using bin/pyspark shell. >>> from pyspark.sql import StructField Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: cannot import name StructField --- Any advice for resolving? Thanks in advance. Peter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
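For anyone hitting this ImportError on 1.3: the schema type classes moved out of {{pyspark.sql}} into {{pyspark.sql.types}}, so the example's import needs to be split roughly as below (the actual patch may differ):
{code}
# Spark 1.3: Row still lives in pyspark.sql, but the schema type classes
# moved to pyspark.sql.types, so importing them from pyspark.sql now fails.
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, IntegerType
{code}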
[jira] [Commented] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384140#comment-14384140 ] Marcelo Vanzin commented on SPARK-5493: --- I'm not terribly familiar with how Oozie handles Spark, but Hive with impersonation enabled needs this. Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland Assignee: Marcelo Vanzin Fix For: 1.3.0 When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384145#comment-14384145 ] Marcelo Vanzin commented on SPARK-5493: --- Direct link: https://github.com/apache/hive/blob/spark/spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java#L370 Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland Assignee: Marcelo Vanzin Fix For: 1.3.0 When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4660) JavaSerializer uses wrong classloader
[ https://issues.apache.org/jira/browse/SPARK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384142#comment-14384142 ] sam commented on SPARK-4660: Furthermore, it seems this issue is more likely to happen when I try to process more data. JavaSerializer uses wrong classloader - Key: SPARK-4660 URL: https://issues.apache.org/jira/browse/SPARK-4660 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.1.1 Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Priority: Critical Fix For: 1.1.2, 1.2.1, 1.3.0 Attachments: spark-serializer-classloader.patch During testing we found failures when trying to load some classes of the user application: {noformat} ERROR 2014-11-29 20:01:56 org.apache.spark.storage.BlockManagerWorker: Exception handling buffer message java.lang.ClassNotFoundException: org.apache.spark.demo.HttpReceiverCases$HttpRequest at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235) at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126) at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:104) at org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:76) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:748) at org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:639) at org.apache.spark.storage.BlockManagerWorker.putBlock(BlockManagerWorker.scala:92) at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:73) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at 
org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:48) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38) at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:682) at org.apache.spark.network.ConnectionManager$$anon$10.run(ConnectionManager.scala:520) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6574) Python Example sql.py not working in version 1.3
Davies Liu created SPARK-6574: - Summary: Python Example sql.py not working in version 1.3 Key: SPARK-6574 URL: https://issues.apache.org/jira/browse/SPARK-6574 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker I downloaded spark version spark-1.3.0-bin-hadoop2.4. When the python version of sql.py is run, the following error occurs: [root@nde-dev8-template python]# /root/spark-1.3.0-bin-hadoop2.4/bin/spark-submit sql.py Spark assembly has been built with Hive, including Datanucleus jars on classpath Traceback (most recent call last): File "/root/spark-1.3.0-bin-hadoop2.4/examples/src/main/python/sql.py", line 22, in <module> from pyspark.sql import Row, StructField, StructType, StringType, IntegerType ImportError: cannot import name StructField -- The sql.py version, spark-1.2.1-bin-hadoop2.4, does not throw the error: [root@nde-dev8-template python]# /root/spark-1.2.1-bin-hadoop2.4/bin/spark-submit sql.py Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/27 14:18:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/27 14:19:41 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy? root |-- age: integer (nullable = true) |-- name: string (nullable = true) root |-- person_name: string (nullable = false) |-- person_age: integer (nullable = false) root |-- age: integer (nullable = true) |-- name: string (nullable = true) Justin - The OS/JAVA environments are: OS: Linux nde-dev8-template 2.6.32-431.17.1.el6.x86_64 #1 SMP Fri Apr 11 17:27:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux JAVA: [root@nde-dev8-template bin]# java -version java version "1.7.0_51" Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) The same error occurs when using bin/pyspark shell. >>> from pyspark.sql import StructField Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: cannot import name StructField --- Any advice for resolving? Thanks in advance. Peter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6565) Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF
[ https://issues.apache.org/jira/browse/SPARK-6565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384161#comment-14384161 ] Michael Armbrust commented on SPARK-6565: - It is not that it returns an RDD, it is that it takes an RDD of json data. Just like jsonFile does not return a file. Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF -- Key: SPARK-6565 URL: https://issues.apache.org/jira/browse/SPARK-6565 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Priority: Minor Since 1.3.0, {{SQLContext.jsonRDD}} actually returns a {{DataFrame}}, so the original name becomes confusing. It would be better to deprecate it and add {{jsonDataFrame}} or {{jsonDF}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
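In other words, the method is named for what it consumes, not for what it produces. A minimal PySpark sketch against the 1.3 API:
{code}
# jsonRDD takes an RDD of JSON strings and returns a DataFrame, just as
# jsonFile takes a file path and returns a DataFrame.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "jsonRDD-demo")
sqlContext = SQLContext(sc)

strings = sc.parallelize(['{"a": 1}', '{"a": 2}'])
df = sqlContext.jsonRDD(strings)
print(type(df))  # <class 'pyspark.sql.dataframe.DataFrame'>, despite the name
{code}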
[jira] [Assigned] (SPARK-6574) Python Example sql.py not working in version 1.3
[ https://issues.apache.org/jira/browse/SPARK-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6574: --- Assignee: Davies Liu (was: Apache Spark) Python Example sql.py not working in version 1.3 Key: SPARK-6574 URL: https://issues.apache.org/jira/browse/SPARK-6574 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker I downloaded spark version spark-1.3.0-bin-hadoop2.4. When the python version of sql.py is run, the following error occurs: [root@nde-dev8-template python]# /root/spark-1.3.0-bin-hadoop2.4/bin/spark-submit sql.py Spark assembly has been built with Hive, including Datanucleus jars on classpath Traceback (most recent call last): File "/root/spark-1.3.0-bin-hadoop2.4/examples/src/main/python/sql.py", line 22, in <module> from pyspark.sql import Row, StructField, StructType, StringType, IntegerType ImportError: cannot import name StructField -- The sql.py version, spark-1.2.1-bin-hadoop2.4, does not throw the error: [root@nde-dev8-template python]# /root/spark-1.2.1-bin-hadoop2.4/bin/spark-submit sql.py Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/27 14:18:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/27 14:19:41 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy? root |-- age: integer (nullable = true) |-- name: string (nullable = true) root |-- person_name: string (nullable = false) |-- person_age: integer (nullable = false) root |-- age: integer (nullable = true) |-- name: string (nullable = true) Justin - The OS/JAVA environments are: OS: Linux nde-dev8-template 2.6.32-431.17.1.el6.x86_64 #1 SMP Fri Apr 11 17:27:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux JAVA: [root@nde-dev8-template bin]# java -version java version "1.7.0_51" Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) The same error occurs when using bin/pyspark shell. >>> from pyspark.sql import StructField Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: cannot import name StructField --- Any advice for resolving? Thanks in advance. Peter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384218#comment-14384218 ] Brock Noland commented on SPARK-5493: - I don't know 100% how oozie works, but I believe it submits a map-only task to do the actual job submission. The {{doAs}} is done before submitting the map task, which performs the job submission. In the Hive case we do not have this infrastructure, and it would introduce significant latency to HOS queries. While {{HADOOP_PROXY_USER}} might work for testing, HOS will be used in production in the near future. This feature was created for those production use cases. Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland Assignee: Marcelo Vanzin Fix For: 1.3.0 When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like hive might want to submit jobs as a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
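For reference, the change that shipped for this issue (Fix For: 1.3.0) exposes impersonation through spark-submit's {{--proxy-user}} option. A minimal invocation sketch; the keytab path, principal, and user names below are illustrative, and the submitting principal must be whitelisted as a proxy user in Hadoop's core-site.xml:
{code}
# Sketch: a kerberos-authenticated service submitting on behalf of an end user.
# All host names, keytab paths, class names, and user names are illustrative.
# The service principal needs hadoop.proxyuser.<user>.hosts/.groups entries.
kinit -kt /etc/security/keytabs/hive.keytab hive/gateway.example.com@EXAMPLE.COM
spark-submit --master yarn-cluster --proxy-user alice --class com.example.MyJob myjob.jar
{code}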
[jira] [Commented] (SPARK-6574) Python Example sql.py not working in version 1.3
[ https://issues.apache.org/jira/browse/SPARK-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384172#comment-14384172 ] Apache Spark commented on SPARK-6574: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5230 Python Example sql.py not working in version 1.3 Key: SPARK-6574 URL: https://issues.apache.org/jira/browse/SPARK-6574 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker I downloaded spark version spark-1.3.0-bin-hadoop2.4. When the python version of sql.py is run, the following error occurs: [root@nde-dev8-template python]# /root/spark-1.3.0-bin-hadoop2.4/bin/spark-submit sql.py Spark assembly has been built with Hive, including Datanucleus jars on classpath Traceback (most recent call last): File "/root/spark-1.3.0-bin-hadoop2.4/examples/src/main/python/sql.py", line 22, in <module> from pyspark.sql import Row, StructField, StructType, StringType, IntegerType ImportError: cannot import name StructField -- The sql.py version, spark-1.2.1-bin-hadoop2.4, does not throw the error: [root@nde-dev8-template python]# /root/spark-1.2.1-bin-hadoop2.4/bin/spark-submit sql.py Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/27 14:18:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/27 14:19:41 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy? root |-- age: integer (nullable = true) |-- name: string (nullable = true) root |-- person_name: string (nullable = false) |-- person_age: integer (nullable = false) root |-- age: integer (nullable = true) |-- name: string (nullable = true) Justin - The OS/JAVA environments are: OS: Linux nde-dev8-template 2.6.32-431.17.1.el6.x86_64 #1 SMP Fri Apr 11 17:27:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux JAVA: [root@nde-dev8-template bin]# java -version java version "1.7.0_51" Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) The same error occurs when using bin/pyspark shell. >>> from pyspark.sql import StructField Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: cannot import name StructField --- Any advice for resolving? Thanks in advance. Peter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-5493: - Description: When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like hive might want to submit jobs as a client user. (was: When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user.) Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland Assignee: Marcelo Vanzin Fix For: 1.3.0 When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like hive might want to submit jobs as a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6572) When I build Spark 1.3 sbt gives me the following error: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1
[ https://issues.apache.org/jira/browse/SPARK-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383999#comment-14383999 ] Cheng Lian commented on SPARK-6572: --- Would you please provide the exact command line you used to invoke SBT? When I build Spark 1.3 sbt gives me the following error: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1: not found [error] Total time: 27 s, completed 27-Mar-2015 14:24:39 Key: SPARK-6572 URL: https://issues.apache.org/jira/browse/SPARK-6572 Project: Spark Issue Type: Bug Reporter: Frank Domoney -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384038#comment-14384038 ] Cody Koeninger commented on SPARK-6569: --- I set it as warn because an empty batch can be the source of non-obvious problems that would be obscured if it were at the info level. Streams that don't get even one item during a batch are relatively rare for my use cases. I don't feel super strongly about it, though, if there's a reason to reduce the log level. Kafka directInputStream logs what appear to be incorrect warnings - Key: SPARK-6569 URL: https://issues.apache.org/jira/browse/SPARK-6569 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: Spark 1.3.0 Reporter: Platon Potapov Priority: Minor During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: {code} [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 {code} * the {{part.fromOffset}} placeholder is not substituted with a value (the message appears to be missing Scala's {{s}} string interpolator, so {{${part.fromOffset}}} is printed literally) * does the condition really warrant a warning being logged? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6572) When I build Spark 1.3 sbt gives me the following error: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1
[ https://issues.apache.org/jira/browse/SPARK-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384037#comment-14384037 ] Sean Owen commented on SPARK-6572: -- It builds correctly for me in branch 1.3 with {{build/sbt -Pyarn -Phadoop-2.3 assembly}}. [~Panzerfrank] That is not a URL; it's just a printout of the Maven coordinates, and it is correct. Is your SBT somehow not picking up compiler plugins? That would cause this, I think. When I build Spark 1.3 sbt gives me the following error: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1: not found [error] Total time: 27 s, completed 27-Mar-2015 14:24:39 Key: SPARK-6572 URL: https://issues.apache.org/jira/browse/SPARK-6572 Project: Spark Issue Type: Bug Reporter: Frank Domoney -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6572) When I build Spark 1.3 sbt gives me the following error: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1
[ https://issues.apache.org/jira/browse/SPARK-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384006#comment-14384006 ] Frank Domoney commented on SPARK-6572: -- The correct URL for the Kafka artifact is kafka_2.11-0.8.1.1 When I build Spark 1.3 sbt gives me the following error: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1: not found [error] Total time: 27 s, completed 27-Mar-2015 14:24:39 Key: SPARK-6572 URL: https://issues.apache.org/jira/browse/SPARK-6572 Project: Spark Issue Type: Bug Reporter: Frank Domoney -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6576) DenseMatrix in PySpark should support indexing
[ https://issues.apache.org/jira/browse/SPARK-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384432#comment-14384432 ] Apache Spark commented on SPARK-6576: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/5232 DenseMatrix in PySpark should support indexing -- Key: SPARK-6576 URL: https://issues.apache.org/jira/browse/SPARK-6576 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6576) DenseMatrix in PySpark should support indexing
[ https://issues.apache.org/jira/browse/SPARK-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6576: --- Assignee: (was: Apache Spark) DenseMatrix in PySpark should support indexing -- Key: SPARK-6576 URL: https://issues.apache.org/jira/browse/SPARK-6576 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
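For context, 2-D indexing over a dense matrix reduces to flat-array arithmetic, since MLlib's DenseMatrix stores its values in column-major order. A self-contained sketch of the idea; illustrative only, as the implementation in the pull request above may differ:
{code}
# Sketch of __getitem__ for a column-major dense matrix, mirroring the
# storage layout of pyspark.mllib.linalg.DenseMatrix.
class DenseMatrixSketch(object):
    def __init__(self, numRows, numCols, values):
        assert len(values) == numRows * numCols
        self.numRows, self.numCols = numRows, numCols
        self.values = values  # flat list in column-major order

    def __getitem__(self, indices):
        i, j = indices
        if not (0 <= i < self.numRows and 0 <= j < self.numCols):
            raise IndexError("matrix index out of range")
        # column-major: element (i, j) sits at offset i + j * numRows
        return self.values[i + j * self.numRows]

m = DenseMatrixSketch(2, 2, [0.0, 1.0, 2.0, 3.0])
assert m[1, 0] == 1.0 and m[0, 1] == 2.0
{code}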
[jira] [Assigned] (SPARK-4069) [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM
[ https://issues.apache.org/jira/browse/SPARK-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4069: --- Assignee: (was: Apache Spark) [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM Key: SPARK-4069 URL: https://issues.apache.org/jira/browse/SPARK-4069 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Min Zhou Currently, the ApplicationMaster in YARN mode simply unregisters itself from the YARN master, a.k.a. the ResourceManager. It never releases the executors' containers before that. YARN's master will then decide to kill all the executors' containers when it faces such a scenario, so the ResourceManager log looks like the one below {noformat} 2014-10-22 23:39:09,903 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1414003182949_0004_01 of type UNREGISTERED 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1414003182949_0004_01 State change from RUNNING to FINAL_SAVING 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1414003182949_0004 with final state: FINISHING 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1414003182949_0004 State change from RUNNING to FINAL_SAVING 2014-10-22 23:39:09,903 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1414003182949_0004_01 of type ATTEMPT_UPDATE_SAVED 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1414003182949_0004 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1414003182949_0004_01 State change from FINAL_SAVING to FINISHING 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1414003182949_0004 State change from FINAL_SAVING to FINISHING 2014-10-22 23:39:10,485 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1414003182949_0004_01 of type CONTAINER_FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1414003182949_0004_01_01 Container Transitioned from RUNNING to COMPLETED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1414003182949_0004_01 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: Completed container: container_1414003182949_0004_01_01 in state: COMPLETED event:FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Finish information of container container_1414003182949_0004_01_01 is written 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1414003182949_0004_01 State change from FINISHING to FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1414003182949_0004 CONTAINERID=container_1414003182949_0004_01_01 2014-10-22 
23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: Stored the finish data of container container_1414003182949_0004_01_01 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Released container container_1414003182949_0004_01_01 of capacity memory:3072, vCores:1 on host host1, which currently has 0 containers, memory:0, vCores:0 used and memory:241901, vCores:32 available, release resources=true 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1414003182949_0004 State change from FINISHING to FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Finish information of application attempt appattempt_1414003182949_0004_01 is written 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim OPERATION=Application
[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384520#comment-14384520 ] Steve Loughran commented on SPARK-6479: --- Henry: utterly unrelated. I was merely offering to help define this API more formally and derive tests from it. Create off-heap block storage API (internal) Key: SPARK-6479 URL: https://issues.apache.org/jira/browse/SPARK-6479 Project: Spark Issue Type: Improvement Components: Block Manager, Spark Core Reporter: Reynold Xin Attachments: SparkOffheapsupportbyHDFS.pdf Would be great to create APIs for off-heap block stores, rather than doing a bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384525#comment-14384525 ] Henry Saputra commented on SPARK-6479: -- @Steve: Ah cool, thanks for clarifying =) Create off-heap block storage API (internal) Key: SPARK-6479 URL: https://issues.apache.org/jira/browse/SPARK-6479 Project: Spark Issue Type: Improvement Components: Block Manager, Spark Core Reporter: Reynold Xin Attachments: SparkOffheapsupportbyHDFS.pdf Would be great to create APIs for off-heap block stores, rather than doing a bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org