[jira] [Created] (SPARK-9114) The returned value is not converted into internal type in Python UDF
Davies Liu created SPARK-9114: - Summary: The returned value is not converted into internal type in Python UDF Key: SPARK-9114 URL: https://issues.apache.org/jira/browse/SPARK-9114 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker The returned value is not converted into internal type in Python UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6941) Provide a better error message to explain that tables created from RDDs are immutable
[ https://issues.apache.org/jira/browse/SPARK-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6941: Shepherd: Yin Huai Provide a better error message to explain that tables created from RDDs are immutable - Key: SPARK-6941 URL: https://issues.apache.org/jira/browse/SPARK-6941 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yijie Shen Priority: Blocker We should explicitly let users know that tables created from RDDs are immutable and new rows cannot be inserted into them. We can add a better error message and also explain it in the programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9082) Filter using non-deterministic expressions should not be pushed down
[ https://issues.apache.org/jira/browse/SPARK-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9082: Shepherd: Yin Huai Filter using non-deterministic expressions should not be pushed down Key: SPARK-9082 URL: https://issues.apache.org/jira/browse/SPARK-9082 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Wenchen Fan For example,
{code}
val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))
df.as("a").join(df.filter($"r" < 0.5).as("b"), $"a.id" === $"b.id").explain(true)
{code}
The plan is
{code}
== Physical Plan ==
ShuffledHashJoin [id#55323L], [id#55327L], BuildRight
 Exchange (HashPartitioning 200)
  Project [id#55323L,Rand 0 AS r#55324]
   PhysicalRDD [id#55323L], MapPartitionsRDD[42268] at range at <console>:37
 Exchange (HashPartitioning 200)
  Project [id#55327L,Rand 0 AS r#55325]
   Filter (LessThan)
    PhysicalRDD [id#55327L], MapPartitionsRDD[42268] at range at <console>:37
{code}
The rand gets evaluated twice instead of once. This happens because, when we push down predicates, we replace the attribute reference in the predicate with the actual expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
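For illustration, a minimal, self-contained Scala sketch of the check this sub-task implies (hypothetical names throughout; this is not the actual Catalyst rule): a filter may only be pushed below a projection if, after substituting the projected expressions into its condition, the condition is still deterministic.
{code}
// Toy expression model; Expr, Attr, Rand, LessThan and canPushDown are all hypothetical names.
sealed trait Expr { def deterministic: Boolean }
case class Attr(name: String) extends Expr { val deterministic = true }
case class Literal(value: Double) extends Expr { val deterministic = true }
case class Rand(seed: Long) extends Expr { val deterministic = false }
case class LessThan(left: Expr, right: Expr) extends Expr {
  def deterministic: Boolean = left.deterministic && right.deterministic
}

// A projection maps output attribute names to the expressions that produce them.
// Pushing a filter below the projection substitutes those expressions into the
// filter condition; that is only safe if the result stays deterministic,
// otherwise the non-deterministic expression is evaluated once per use.
def canPushDown(cond: Expr, projection: Map[String, Expr]): Boolean = {
  def substitute(e: Expr): Expr = e match {
    case Attr(n)        => projection.getOrElse(n, e)
    case LessThan(l, r) => LessThan(substitute(l), substitute(r))
    case other          => other
  }
  substitute(cond).deterministic
}

// Mirrors the example above: r comes from rand(0), so "r < 0.5" must not be pushed down.
val projection = Map("id" -> Attr("id"), "r" -> Rand(0))
assert(!canPushDown(LessThan(Attr("r"), Literal(0.5)), projection))
assert(canPushDown(LessThan(Attr("id"), Literal(5.0)), projection))
{code}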
[jira] [Commented] (SPARK-9112) Implement LogisticRegressionSummary similar to LinearRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630053#comment-14630053 ] Manoj Kumar commented on SPARK-9112: Yes, that is the idea. Also, we need not port it to ML right now; we could convert the transformed dataframe to the required input type in mllib. Also it might be useful to return the probability for the predicted class (as done by predict_proba in scikit-learn). How does that sound? Implement LogisticRegressionSummary similar to LinearRegressionSummary -- Key: SPARK-9112 URL: https://issues.apache.org/jira/browse/SPARK-9112 Project: Spark Issue Type: New Feature Components: ML Reporter: Manoj Kumar Priority: Minor Since the API for LinearRegressionSummary has been merged, other models should follow suit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9102) Improve project collapse with nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-9102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9102: Shepherd: Yin Huai Improve project collapse with nondeterministic expressions -- Key: SPARK-9102 URL: https://issues.apache.org/jira/browse/SPARK-9102 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9102) Improve project collapse with nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-9102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9102: Assignee: Wenchen Fan Improve project collapse with nondeterministic expressions -- Key: SPARK-9102 URL: https://issues.apache.org/jira/browse/SPARK-9102 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9015) Maven cleanup / Clean Project Import in scala-ide
[ https://issues.apache.org/jira/browse/SPARK-9015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9015. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7375 [https://github.com/apache/spark/pull/7375] Maven cleanup / Clean Project Import in scala-ide - Key: SPARK-9015 URL: https://issues.apache.org/jira/browse/SPARK-9015 Project: Spark Issue Type: Improvement Components: Build Reporter: Jan Prach Priority: Minor Fix For: 1.5.0 Clean up Maven for a clean import in scala-ide / eclipse. The outstanding PR contains things like removal of the groovy plugin, and some more Maven cleanup goes here. In order to make it a seamless experience, two more things have to be merged upstream: 1) the IDE automatically generates Java sources from IDL - https://issues.apache.org/jira/browse/AVRO-1671 2) set the Scala version in the IDE based on the Maven config - https://github.com/sonatype/m2eclipse-scala/issues/30 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9015) Maven cleanup / Clean Project Import in scala-ide
[ https://issues.apache.org/jira/browse/SPARK-9015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9015: - Assignee: Jan Prach Maven cleanup / Clean Project Import in scala-ide - Key: SPARK-9015 URL: https://issues.apache.org/jira/browse/SPARK-9015 Project: Spark Issue Type: Improvement Components: Build Reporter: Jan Prach Assignee: Jan Prach Priority: Minor Fix For: 1.5.0 Clean up Maven for a clean import in scala-ide / eclipse. The outstanding PR contains things like removal of the groovy plugin, and some more Maven cleanup goes here. In order to make it a seamless experience, two more things have to be merged upstream: 1) the IDE automatically generates Java sources from IDL - https://issues.apache.org/jira/browse/AVRO-1671 2) set the Scala version in the IDE based on the Maven config - https://github.com/sonatype/m2eclipse-scala/issues/30 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6217) insertInto doesn't work in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-6217. -- Assignee: Wenchen Fan insertInto doesn't work in PySpark -- Key: SPARK-6217 URL: https://issues.apache.org/jira/browse/SPARK-6217 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Environment: Mac OS X Yosemite 10.10.2 Python 2.7.9 Spark 1.3.0 Reporter: Charles Cloud Assignee: Wenchen Fan The following code, running in an IPython shell throws an error: {code:none} In [1]: from pyspark import SparkContext, HiveContext In [2]: sc = SparkContext('local[*]', 'test') Spark assembly has been built with Hive, including Datanucleus jars on classpath In [3]: sql = HiveContext(sc) In [4]: import pandas as pd In [5]: df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [1, 2, 3], 'c': list('abc')}) In [6]: df2 = pd.DataFrame({'a': [2.0, 3.0, 4.0], 'b': [4, 5, 6], 'c': list('def')}) In [7]: sdf = sql.createDataFrame(df) In [8]: sdf2 = sql.createDataFrame(df2) In [9]: sql.registerDataFrameAsTable(sdf, 'sdf') In [10]: sql.registerDataFrameAsTable(sdf2, 'sdf2') In [11]: sql.cacheTable('sdf') In [12]: sql.cacheTable('sdf2') In [13]: sdf2.insertInto('sdf') # throws an error {code} Here's the Java traceback: {code:none} Py4JJavaError: An error occurred while calling o270.insertInto. : java.lang.AssertionError: assertion failed: No plan for InsertIntoTable (LogicalRDD [a#0,b#1L,c#2], MapPartitionsRDD[13] at mapPartitions at SQLContext.scala:1167), Map(), false InMemoryRelation [a#6,b#7L,c#8], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [a#6,b#7L,c#8], MapPartitionsRDD[41] at mapPartitions at SQLContext.scala:1167), Some(sdf2) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:1085) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:1083) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at org.apache.spark.sql.DataFrame.insertInto(DataFrame.scala:1134) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} I'd be ecstatic if this was my own fault, and I'm somehow using it incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6941) Provide a better error message to explain that tables created from RDDs are immutable
[ https://issues.apache.org/jira/browse/SPARK-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-6941. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7342 [https://github.com/apache/spark/pull/7342] Provide a better error message to explain that tables created from RDDs are immutable - Key: SPARK-6941 URL: https://issues.apache.org/jira/browse/SPARK-6941 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yijie Shen Priority: Blocker Fix For: 1.5.0 We should explicitly let users know that tables created from RDDs are immutable and new rows cannot be inserted into them. We can add a better error message and also explain it in the programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8682) Range Join for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-8682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630449#comment-14630449 ] Herman van Hovell edited comment on SPARK-8682 at 7/16/15 10:31 PM: I have attached some performance testing code. In this setup RangeJoin is 13-50 times faster than the Cartesian/Filter combination. However the performance profile is a bit unexpected. The fewer records in the broadcasted side, the faster it is. This is the opposite of my expectation, because RangeJoin should have a bigger advantage when the number of broadcasted rows is larger. I am looking into this. was (Author: hvanhovell): Some Performance Testing code. Range Join for Spark SQL Key: SPARK-8682 URL: https://issues.apache.org/jira/browse/SPARK-8682 Project: Spark Issue Type: Improvement Components: SQL Reporter: Herman van Hovell Attachments: perf_testing.scala Currently Spark SQL uses a Broadcast Nested Loop join (or a filtered Cartesian Join) when it has to execute the following range query:
{noformat}
SELECT A.*, B.*
FROM tableA A
JOIN tableB B
  ON A.start <= B.end AND A.end >= B.start
{noformat}
This is horribly inefficient. The performance of this query can be greatly improved, when one of the tables can be broadcasted, by creating a range index. A range index is basically a sorted map containing the rows of the smaller table, indexed by both the high and low keys. Using this structure, the complexity of the query would go from O(N * M) to O(N * 2 * LOG(M)), N = number of records in the larger table, M = number of records in the smaller (indexed) table. I have created a pull request for this. According to the [Spark SQL: Relational Data Processing in Spark|http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf] paper, similar work (page 11, section 7.2) has already been done by the ADAM project (cannot locate the code though). Any comments and/or feedback are greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
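For readers unfamiliar with the proposal, here is a small, self-contained Scala sketch of a range index over the broadcastable side (hypothetical types; this is not the attached perf_testing.scala, and for brevity it only prunes on the low key rather than indexing both keys as the ticket describes):
{code}
// Hypothetical interval row: [low, high] plus a payload.
case class Interval[T](low: Long, high: Long, payload: T)

// Rows of the small (broadcastable) table, kept sorted by their low endpoint.
class RangeIndex[T](rows: Seq[Interval[T]]) {
  private val sortedByLow: Vector[Interval[T]] = rows.sortBy(_.low).toVector

  // All indexed intervals that overlap [qLow, qHigh]: low <= qHigh and high >= qLow.
  def overlapping(qLow: Long, qHigh: Long): Iterator[Interval[T]] =
    sortedByLow.iterator
      .takeWhile(_.low <= qHigh)   // sorted order lets us stop early on the low key
      .filter(_.high >= qLow)      // the full proposal also indexes the high key
}

val index = new RangeIndex(Seq(Interval(0L, 10L, "a"), Interval(20L, 30L, "b"), Interval(40L, 50L, "c")))
index.overlapping(5L, 25L).foreach(println)   // prints the "a" and "b" intervals only
{code}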
[jira] [Commented] (SPARK-9119) In some cases, we may save wrong decimal values to parquet
[ https://issues.apache.org/jira/browse/SPARK-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630556#comment-14630556 ] Yin Huai commented on SPARK-9119: - Actually, the impact of this issue is that whenever we store decimal values using their unscaled values and the type's scale information, we may store the wrong value. In some cases, we may save wrong decimal values to parquet -- Key: SPARK-9119 URL: https://issues.apache.org/jira/browse/SPARK-9119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType,DecimalType}
import org.apache.spark.sql.types.Decimal

val schema = StructType(Array(StructField("name", DecimalType(10, 5), false)))
val rowRDD = sc.parallelize(Array(Row(Decimal(67123.45))))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.registerTempTable("test")
df.show()
// +--------+
// |    name|
// +--------+
// |67123.45|
// +--------+

sqlContext.sql("create table testDecimal as select * from test")
sqlContext.table("testDecimal").show()
// +--------+
// |    name|
// +--------+
// |67.12345|
// +--------+
{code}
The problem is that when we do conversions, we do not use the precision/scale info in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
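To see why the scale matters, here is a hedged illustration using plain java.math.BigDecimal (not Spark's internal conversion code): the same unscaled long denotes very different values when the writer and reader disagree on the scale, which is exactly the 67123.45 vs 67.12345 mismatch shown above.
{code}
import java.math.BigDecimal

// A decimal is stored as an unscaled integer plus a scale: value = unscaled * 10^(-scale).
val unscaled = 6712345L

// Interpreted with the value's own scale (2), the unscaled long means 67123.45.
val writtenAs = BigDecimal.valueOf(unscaled, 2)
println(writtenAs)   // 67123.45

// Interpreted with the column's declared scale (5), the same long means 67.12345.
val readBackAs = BigDecimal.valueOf(unscaled, 5)
println(readBackAs)  // 67.12345
{code}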
[jira] [Updated] (SPARK-4134) Dynamic allocation: tone down scary executor lost messages when killing on purpose
[ https://issues.apache.org/jira/browse/SPARK-4134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4134: - Summary: Dynamic allocation: tone down scary executor lost messages when killing on purpose (was: Tone down scary executor lost messages when killing on purpose) Dynamic allocation: tone down scary executor lost messages when killing on purpose -- Key: SPARK-4134 URL: https://issues.apache.org/jira/browse/SPARK-4134 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or After SPARK-3822 goes in, we are now able to dynamically kill executors after an application has started. However, when we do that we get a ton of scary error messages telling us that we've done something wrong somehow. It would be good to detect when this is the case and prevent these messages from surfacing. This may be difficult, however, because the connection manager tends to be quite verbose in unconditionally logging disconnection messages. This is a very nice-to-have for 1.2 but certainly not a blocker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9116) Class in __main__ cannot be serialized by PySpark
Davies Liu created SPARK-9116: - Summary: Class in __main__ cannot be serialized by PySpark Key: SPARK-9116 URL: https://issues.apache.org/jira/browse/SPARK-9116 Project: Spark Issue Type: Bug Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Priority: Critical It's bad that we could not support classes defined in __main__. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9113) remove unnecessary analysis check code for self join
[ https://issues.apache.org/jira/browse/SPARK-9113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9113: Shepherd: Michael Armbrust Assignee: Wenchen Fan remove unnecessary analysis check code for self join Key: SPARK-9113 URL: https://issues.apache.org/jira/browse/SPARK-9113 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630307#comment-14630307 ] Joseph K. Bradley commented on SPARK-9073: -- Yes, thanks! I'll look at the PR as soon as I can. spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whomever writes a PR for this JIRA should check all spark.ml Model's copy() methods and set copy's {{Model.parent}} when available. Also verify in unit tests (possibly in a standard method checking Models to share code). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
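The pattern being asked for, as a simplified, self-contained Scala sketch (stand-in classes, not the real spark.ml Model/Estimator API): whatever else copy() does, it must carry the parent over when one has been set.
{code}
// Simplified stand-ins for the spark.ml classes; only the parent-handling pattern matters here.
class Estimator(val name: String)

class Model(val uid: String) {
  private var parentOpt: Option[Estimator] = None
  def setParent(p: Estimator): this.type = { parentOpt = Some(p); this }
  def hasParent: Boolean = parentOpt.isDefined
  def parent: Estimator = parentOpt.get

  // The point of the JIRA: the copy must preserve the parent when one is set.
  def copy(): Model = {
    val copied = new Model(uid)
    if (hasParent) copied.setParent(parent)
    copied
  }
}

val estimator = new Estimator("dtc")
val model = new Model("model_1").setParent(estimator)
assert(model.copy().hasParent)   // the kind of check a shared unit-test helper could perform
{code}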
[jira] [Updated] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9073: - Shepherd: Joseph K. Bradley spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Assignee: Kai Sasaki Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whomever writes a PR for this JIRA should check all spark.ml Model's copy() methods and set copy's {{Model.parent}} when available. Also verify in unit tests (possibly in a standard method checking Models to share code). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8807) Add between operator in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8807. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7356 [https://github.com/apache/spark/pull/7356] Add between operator in SparkR -- Key: SPARK-8807 URL: https://issues.apache.org/jira/browse/SPARK-8807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yu Ishikawa Fix For: 1.5.0 Add between operator in SparkR ``` df$age between c(1, 2) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
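For reference, the Scala DataFrame API exposes a similar operator as Column.between; a hedged sketch of the equivalent call (assuming a DataFrame df with an age column):
{code}
// Scala analogue of the proposed SparkR `df$age between c(1, 2)`.
val inRange = df.filter(df("age").between(1, 2))
inRange.show()
{code}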
[jira] [Updated] (SPARK-8807) Add between operator in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8807: - Assignee: Liang-Chi Hsieh Add between operator in SparkR -- Key: SPARK-8807 URL: https://issues.apache.org/jira/browse/SPARK-8807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yu Ishikawa Assignee: Liang-Chi Hsieh Fix For: 1.5.0 Add between operator in SparkR ``` df$age between c(1, 2) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8972) Incorrect result for rollup
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8972: Assignee: Cheng Hao Incorrect result for rollup --- Key: SPARK-8972 URL: https://issues.apache.org/jira/browse/SPARK-8972 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Critical Fix For: 1.5.0
{code:java}
import sqlContext.implicits._
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("foo")
sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key%100 with rollup").show(100)
// output
+---+---+------------+
|cnt|_c1|GROUPING__ID|
+---+---+------------+
|  1|  4|           0|
|  1|  4|           1|
|  1|  5|           0|
|  1|  5|           1|
|  1|  1|           0|
|  1|  1|           1|
|  1|  2|           0|
|  1|  2|           1|
|  1|  3|           0|
|  1|  3|           1|
+---+---+------------+
{code}
After checking the code, it seems we don't support complex expressions (not just simple column names) as GROUP BY keys for rollup, nor for cube. And it will not even report an error if we have a complex expression in the rollup keys, hence we get the very confusing result shown in the example above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8972) Incorrect result for rollup
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-8972. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7343 [https://github.com/apache/spark/pull/7343] Incorrect result for rollup --- Key: SPARK-8972 URL: https://issues.apache.org/jira/browse/SPARK-8972 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Critical Fix For: 1.5.0
{code:java}
import sqlContext.implicits._
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("foo")
sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key%100 with rollup").show(100)
// output
+---+---+------------+
|cnt|_c1|GROUPING__ID|
+---+---+------------+
|  1|  4|           0|
|  1|  4|           1|
|  1|  5|           0|
|  1|  5|           1|
|  1|  1|           0|
|  1|  1|           1|
|  1|  2|           0|
|  1|  2|           1|
|  1|  3|           0|
|  1|  3|           1|
+---+---+------------+
{code}
After checking the code, it seems we don't support complex expressions (not just simple column names) as GROUP BY keys for rollup, nor for cube. And it will not even report an error if we have a complex expression in the rollup keys, hence we get the very confusing result shown in the example above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
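Until complex grouping expressions are supported for rollup, one workaround is to project the expression to a named column first and roll up on that simple name. A hedged sketch, reusing the df and sqlContext from the report (GROUPING__ID assumes a HiveContext-backed sqlContext; foo_mod and key_mod are hypothetical names):
{code}
// Pre-compute the grouping expression as a plain column, then roll up on the simple name.
val modDf = df.selectExpr("key % 100 as key_mod", "value")
modDf.registerTempTable("foo_mod")
sqlContext.sql(
  "select count(*) as cnt, key_mod, GROUPING__ID from foo_mod group by key_mod with rollup"
).show(100)
{code}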
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409 ] Lianhui Wang commented on SPARK-8646: - Yes, when I use this command: ./bin/spark-submit ./pi.py yarn-client 10, the YARN client does not upload pyspark.zip, so it cannot work. I have submitted a PR that resolves this problem based on the master branch. PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs results in a "no module named pyspark" error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9091) Add the codec interface to DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9091. -- Resolution: Invalid [~carlmartin] I think you're familiar with https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- please don't open a JIRA until you can fill it out correctly. This isn't a Bug, Major, can't affect 1.5, has no detail. Add the codec interface to DStream. --- Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Add description later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9093) Fix single-quotes strings in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629477#comment-14629477 ] Apache Spark commented on SPARK-9093: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/7439 Fix single-quotes strings in SparkR --- Key: SPARK-9093 URL: https://issues.apache.org/jira/browse/SPARK-9093 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We should get rid of the warnings about using single-quotes like that. {noformat} inst/tests/test_sparkSQL.R:60:28: style: Only use double-quotes. list(type = 'array', elementType = integer, containsNull = TRUE)) ^~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629483#comment-14629483 ] SaintBacchus commented on SPARK-9091: - [~srowen] Sorry for forgetting to change the type and add the description immediately; it's a small improvement in DStream. I will reopen it after I have the PR. Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile*, which can use a *CompressionCodec* to compress the data, it's better to add a similar interface in DStream -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gisle Ytrestøl updated SPARK-9096: -- Description: When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. I've reproduced this bug in the attached Java file, which I submit with spark-submit. The logs for 1.3.1 and 1.4.1 are attached. In 1.4.1, we see that a few tasks in the count job takes a lot of time: 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600) 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600) 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600) 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600) 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s In comparison, all tasks are more or less equal in size when running the same application in Spark 1.3.1. In overall, this attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1, and completes in roughly 30 seconds in Spark 1.3.1. Spark 1.4.0 behaves similar to Spark 1.4.1 wrt this issue. was: When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. I've reproduced this bug in the attached Java file, which I submit with spark-submit. The logs for 1.3.1 and 1.4.1 are attached. 
In 1.4.1, we see that a few tasks in the count job take a lot of time: 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600) 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600) 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600) 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600) 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s In comparison, all tasks are more or less equal in size when running the same application in Spark 1.3.1. Overall, the attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1, and completes in roughly 30 seconds in Spark 1.3.1. Spark 1.4.0 behaves similarly to Spark 1.4.1 with respect to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9097) Tasks are not completed but the number of executor is zero
KaiXinXIaoLei created SPARK-9097: Summary: Tasks are not completed but the number of executor is zero Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Fix For: 1.5.0 I set the value of spark.dynamicAllocation.enabled to true and submit tasks to run. The tasks are not completed, but the number of executors is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9096: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) I am not sure it is a bug, yet. It's worth explaining the difference though, but we need to rule out environment factors, and know more about the cause. Can you say more about why the data is not evenly distributed? it looks like it should be in your sample. Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Priority: Minor Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. I've reproduced this bug in the attached Java file, which I submit with spark-submit. The logs for 1.3.1 and 1.4.1 are attached. In 1.4.1, we see that a few tasks in the count job takes a lot of time: 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600) 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600) 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600) 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600) 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s In comparison, all tasks are more or less equal in size when running the same application in Spark 1.3.1. In overall, this attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1, and completes in roughly 30 seconds in Spark 1.3.1. Spark 1.4.0 behaves similar to Spark 1.4.1 wrt this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
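While the cause is being investigated, a possible mitigation (shown in Scala, although the report uses the Java API; bigRdd and toRemove are hypothetical) is to repartition the result of subtract() so the skewed partitions are rebalanced before the expensive downstream job:
{code}
// subtract() can leave a few partitions holding most of the data (as reported above);
// an explicit repartition spreads the records out again before the next stage.
val diff = bigRdd.subtract(toRemove)
val rebalanced = diff.repartition(diff.partitions.length)
println(rebalanced.count())
{code}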
[jira] [Commented] (SPARK-9093) Fix single-quotes strings in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629474#comment-14629474 ] Yu Ishikawa commented on SPARK-9093: I'm working this issue. Fix single-quotes strings in SparkR --- Key: SPARK-9093 URL: https://issues.apache.org/jira/browse/SPARK-9093 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We should get rid of the warnings about using single-quotes like that. {noformat} inst/tests/test_sparkSQL.R:60:28: style: Only use double-quotes. list(type = 'array', elementType = integer, containsNull = TRUE)) ^~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9091: - Affects Version/s: (was: 1.5.0) Much better, though it can't affect 1.5.0 as this version does not exist yet. I think TD's opinion is that DStream's methods are mostly redundant, since they are just operations you can access with DStream.foreachRDD and then calling any RDD methods you like. So I think this will not be worth adding to DStream's API. Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile*, which can use a *CompressionCodec* to compress the data, it's better to add a similar interface in DStream. In some IO-bottleneck scenarios, it's very useful for users to have this interface in DStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
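As a concrete version of that suggestion, a hedged sketch (assuming a DStream[String] named dstream and an arbitrary output path): the existing RDD API already takes a compression codec, so foreachRDD covers the use case without a new DStream method.
{code}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// Write each batch with the existing RDD.saveAsTextFile(path, codec) overload.
dstream.foreachRDD { (rdd: RDD[String], time: Time) =>
  rdd.saveAsTextFile(s"/output/batch-${time.milliseconds}", classOf[GzipCodec])
}
{code}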
[jira] [Commented] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629527#comment-14629527 ] Kai Sasaki commented on SPARK-9073: --- [~josephkb] Hi, if possible, can I work on this JIRA? Thank you. spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whomever writes a PR for this JIRA should check all spark.ml Model's copy() methods and set copy's {{Model.parent}} when available. Also verify in unit tests (possibly in a standard method checking Models to share code). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9095) Removes old Parquet support code
Cheng Lian created SPARK-9095: - Summary: Removes old Parquet support code Key: SPARK-9095 URL: https://issues.apache.org/jira/browse/SPARK-9095 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian As the new Parquet external data source matures, we should remove the old Parquet support now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gisle Ytrestøl updated SPARK-9096: -- Description: When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. was:When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gisle Ytrestøl updated SPARK-9096: -- Attachment: reproduce.1.4.1.log.gz reproduce.1.3.1.log.gz ReproduceBug.java Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
Gisle Ytrestøl created SPARK-9096: - Summary: Unevenly distributed task loads after using JavaRDD.subtract() Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9095) Removes old Parquet support code
[ https://issues.apache.org/jira/browse/SPARK-9095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9095: --- Assignee: Cheng Lian (was: Apache Spark) Removes old Parquet support code Key: SPARK-9095 URL: https://issues.apache.org/jira/browse/SPARK-9095 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian As the new Parquet external data source matures, we should remove the old Parquet support now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9093) Fix single-quotes strings in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9093: --- Assignee: Apache Spark Fix single-quotes strings in SparkR --- Key: SPARK-9093 URL: https://issues.apache.org/jira/browse/SPARK-9093 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Apache Spark We should get rid of the warnings about using single-quotes like that. {noformat} inst/tests/test_sparkSQL.R:60:28: style: Only use double-quotes. list(type = 'array', elementType = integer, containsNull = TRUE)) ^~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Description: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream (was: Add description later.) Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9093) Fix single-quotes strings in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9093: --- Assignee: (was: Apache Spark) Fix single-quotes strings in SparkR --- Key: SPARK-9093 URL: https://issues.apache.org/jira/browse/SPARK-9093 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We should get rid of the warnings about using single-quotes like that. {noformat} inst/tests/test_sparkSQL.R:60:28: style: Only use double-quotes. list(type = 'array', elementType = integer, containsNull = TRUE)) ^~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9067) Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
[ https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629482#comment-14629482 ] Liang-Chi Hsieh commented on SPARK-9067: Thanks for reporting that. I updated the PR. Besides calling close(), I also release the reader now. Can you check if it can solve this problem? Thanks. Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD - Key: SPARK-9067 URL: https://issues.apache.org/jira/browse/SPARK-9067 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.3.0, 1.4.0 Environment: Target system: Linux, 16 cores, 400Gb RAM Spark is started locally using the following command: {{ spark-submit --master local[16] --driver-memory 64G --executor-cores 16 --num-executors 1 --executor-memory 64G }} Reporter: konstantin knizhnik If a coalesce transformation with a small number of output partitions (in my case 16) is applied to a large Parquet file (mine has about 150Gb with 215k partitions), then it causes OutOfMemory exceptions (250Gb is not enough) and open file limit exhaustion (with the limit set to 8k). The source of the problem is in the SqlNewHadoopRDD.compute method:
{quote}
val reader = format.createRecordReader( split.serializableHadoopSplit.value, hadoopAttemptContext)
reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext)
// Register an on-task-completion callback to close the input stream.
context.addTaskCompletionListener(context => close())
{quote}
The created Parquet file reader is intended to be closed at task completion time. This reader contains a lot of references to parquet.bytes.BytesInput objects, which in turn contain references to large byte arrays (some of them are several megabytes). Since, in the case of CoalescedRDD, a task is completed only after processing a large number of parquet files, this causes file handle exhaustion and memory overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
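The shape of the fix being discussed, as a minimal, self-contained Scala sketch (hypothetical names; not the actual SqlNewHadoopRDD change): wrap each split's iterator so its reader is released as soon as that split is exhausted, instead of holding every reader open until task completion.
{code}
// Closes the underlying resource as soon as the wrapped iterator is exhausted,
// rather than waiting for the task-completion callback.
class ClosingIterator[A](underlying: Iterator[A], close: () => Unit) extends Iterator[A] {
  private var closed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !closed) { close(); closed = true }
    more
  }

  override def next(): A = underlying.next()
}

// Hypothetical usage: one reader per split, released immediately when drained.
def readSplit(path: String): Iterator[String] = {
  val source = scala.io.Source.fromFile(path)                  // stand-in for a Parquet reader
  new ClosingIterator(source.getLines(), () => source.close())
}
{code}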
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Description: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to have this interface in DStream. was: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to had this interface. Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to have this interface in DStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9095) Removes old Parquet support code
[ https://issues.apache.org/jira/browse/SPARK-9095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629546#comment-14629546 ] Apache Spark commented on SPARK-9095: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/7441 Removes old Parquet support code Key: SPARK-9095 URL: https://issues.apache.org/jira/browse/SPARK-9095 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian As the new Parquet external data source matures, we should remove the old Parquet support now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9095) Removes old Parquet support code
[ https://issues.apache.org/jira/browse/SPARK-9095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9095: --- Assignee: Apache Spark (was: Cheng Lian) Removes old Parquet support code Key: SPARK-9095 URL: https://issues.apache.org/jira/browse/SPARK-9095 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark As the new Parquet external data source matures, we should remove the old Parquet support now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9093) Fix single-quotes strings in SparkR
Yu Ishikawa created SPARK-9093: -- Summary: Fix single-quotes strings in SparkR Key: SPARK-9093 URL: https://issues.apache.org/jira/browse/SPARK-9093 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We should get rid of the warnings about using single-quotes like that. {noformat} inst/tests/test_sparkSQL.R:60:28: style: Only use double-quotes. list(type = 'array', elementType = integer, containsNull = TRUE)) ^~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Summary: Add the codec interface to Text DStream. (was: Add the codec interface to DStream.) Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Add description later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Priority: Minor (was: Major) Add the codec interface to DStream. --- Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Add description later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Issue Type: Improvement (was: Bug) Add the codec interface to DStream. --- Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Add description later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Description: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to had this interface. was: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to had this interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Description: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very was:Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9094) Increase io.dropwizard.metrics dependency to 3.1.2
Carl Anders Düvel created SPARK-9094: Summary: Increase io.dropwizard.metrics dependency to 3.1.2 Key: SPARK-9094 URL: https://issues.apache.org/jira/browse/SPARK-9094 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Carl Anders Düvel Priority: Minor This change is described in pull request: https://github.com/apache/spark/pull/7422 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9052) Fix comments after curly braces
[ https://issues.apache.org/jira/browse/SPARK-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629510#comment-14629510 ] Apache Spark commented on SPARK-9052: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/7440 Fix comments after curly braces --- Key: SPARK-9052 URL: https://issues.apache.org/jira/browse/SPARK-9052 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Right now we have a number of style check errors of the form {code} Opening curly braces should never go on their own line and should always and be followed by a new line. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9052) Fix comments after curly braces
[ https://issues.apache.org/jira/browse/SPARK-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9052: --- Assignee: (was: Apache Spark) Fix comments after curly braces --- Key: SPARK-9052 URL: https://issues.apache.org/jira/browse/SPARK-9052 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Right now we have a number of style check errors of the form {code} Opening curly braces should never go on their own line and should always and be followed by a new line. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9052) Fix comments after curly braces
[ https://issues.apache.org/jira/browse/SPARK-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9052: --- Assignee: Apache Spark Fix comments after curly braces --- Key: SPARK-9052 URL: https://issues.apache.org/jira/browse/SPARK-9052 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Apache Spark Right now we have a number of style check errors of the form {code} Opening curly braces should never go on their own line and should always and be followed by a new line. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9097: - Fix Version/s: (was: 1.5.0) [~KaiXinXIaoLei] Again please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Don't set Fix Version. You need to provide more information. It's normal for a short time for tasks to be waiting, before an executor spins up. You need to clarify exactly how you are running this and what you observe or else this is not a helpful JIRA. Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei I set the value of spark.dynamicAllocation.enabled is true. I submit tasks to run. Tasks are not completed, but the number of executor is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9094) Increase io.dropwizard.metrics dependency to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-9094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629558#comment-14629558 ] Sean Owen commented on SPARK-9094: -- Yes, you also need to update your PR title. The request was as well to describe any potentially incompatible changes (if any) and why the update is needed. Increase io.dropwizard.metrics dependency to 3.1.2 -- Key: SPARK-9094 URL: https://issues.apache.org/jira/browse/SPARK-9094 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Carl Anders Düvel Priority: Minor This change is described in pull request: https://github.com/apache/spark/pull/7422 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8807) Add between operator in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629391#comment-14629391 ] Yu Ishikawa commented on SPARK-8807: [~yalamart] Sorry for the delay of my reply. And great work! Thanks! Add between operator in SparkR -- Key: SPARK-8807 URL: https://issues.apache.org/jira/browse/SPARK-8807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yu Ishikawa Assignee: Liang-Chi Hsieh Fix For: 1.5.0 Add between operator in SparkR ``` df$age between c(1, 2) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
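For reference, the Scala/Python DataFrame API already exposes an equivalent predicate; the SparkR work in this ticket mirrors it. A minimal PySpark sketch, assuming {{Column.between}} (available in recent releases) and an existing SparkContext {{sc}}; the column and data are illustrative:
{code}
# Illustrative PySpark equivalent of the SparkR `between` being added here.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`
df = sqlContext.createDataFrame([(1,), (2,), (5,)], ['age'])

# Explicit form, works on any DataFrame-era release:
df.filter((df.age >= 1) & (df.age <= 2)).show()

# Shortcut form, assuming Column.between is available:
df.filter(df.age.between(1, 2)).show()
{code}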
[jira] [Updated] (SPARK-9092) Make --num-executors compatible with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9092: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Make --num-executors compatible with dynamic allocation --- Key: SPARK-9092 URL: https://issues.apache.org/jira/browse/SPARK-9092 Project: Spark Issue Type: Improvement Components: YARN Reporter: Niranjan Padmanabhan Priority: Minor Currently when you enable dynamic allocation, you can't use --num-executors or the property spark.executor.instances. If we are to enable dynamic allocation by default, we should make these work so that existing workloads don't fail -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
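For context, a minimal sketch of the configuration combination this ticket is about; the property names are the standard ones, and per the ticket the fixed executor count currently cannot be combined with dynamic allocation:
{code}
# Sketch of the conflicting settings described above (values are illustrative).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("dynamic-allocation-example")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")  # dynamic allocation needs the external shuffle service
        # Per this ticket, the following currently cannot be used together
        # with dynamic allocation; the proposal is to make it work:
        .set("spark.executor.instances", "4"))

sc = SparkContext(conf=conf)
{code}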
[jira] [Commented] (SPARK-6442) MLlib Local Linear Algebra Package
[ https://issues.apache.org/jira/browse/SPARK-6442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629320#comment-14629320 ] Sean Owen commented on SPARK-6442: -- [~mengxr] for Commons Math, and point #2: actually they decided to un-deprecate the sparse implementations in 3.3 onwards, and keep supporting them: http://commons.apache.org/proper/commons-math/changes-report.html I think it's a good option. But I also am not sure why *Spark* has to decide this for users. Spark can do whatever it likes internally; apps can do whatever they like externally; both can and should use a library. From an API perspective, all that's needed is a representation of the data that thunks easily into other libraries, rather than provide a library of functions again. MLlib Local Linear Algebra Package -- Key: SPARK-6442 URL: https://issues.apache.org/jira/browse/SPARK-6442 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz Priority: Critical MLlib's local linear algebra package doesn't have any support for any type of matrix operations. With 1.5, we wish to add support to a complete package of optimized linear algebra operations for Scala/Java users. The main goal is to support lazy operations so that element-wise can be implemented in a single for-loop, and complex operations can be interfaced through BLAS. The design doc: http://goo.gl/sf5LCE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8646: --- Assignee: Apache Spark PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Assignee: Apache Spark Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs result in a no module named pyspark when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (ie comands the use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8646: --- Assignee: (was: Apache Spark) PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs result in a no module named pyspark when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (ie comands the use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629407#comment-14629407 ] Apache Spark commented on SPARK-8646: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/7438 PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs result in a no module named pyspark when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (ie comands the use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9019) spark-submit fails on yarn with kerberos enabled
[ https://issues.apache.org/jira/browse/SPARK-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629432#comment-14629432 ] Bolke de Bruin edited comment on SPARK-9019 at 7/16/15 8:39 AM: I tried running this on an updated environment where YARN-3103 was fixed, however it still fails although behavior is a bit different now. The task is now being accepted but stays in the running state forever without executing anything. Please note that the trace below is without key tab usage, but with an authorized user (kinit admin/admin) 15/07/16 04:27:34 DEBUG Client: getting client out of cache: org.apache.hadoop.ipc.Client@53abb73 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(ReviveOffers,false) from Actor[akka://sparkDriver/deadLetters] 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(ReviveOffers,false) 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (1.632126 ms) AkkaMessage(ReviveOffers,false) from Actor[akka://sparkDriver/deadLetters] 15/07/16 04:27:34 DEBUG AbstractService: Service org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl is started 15/07/16 04:27:34 DEBUG AbstractService: Service org.apache.hadoop.yarn.client.api.impl.YarnClientImpl is started 15/07/16 04:27:34 DEBUG Client: The ping interval is 6 ms. 15/07/16 04:27:34 DEBUG Client: Connecting to node6.local/10.79.10.6:8050 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedAction as:admin (auth:SIMPLE) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717) 15/07/16 04:27:34 DEBUG SaslRpcClient: Sending sasl message state: NEGOTIATE 15/07/16 04:27:34 DEBUG SaslRpcClient: Received SASL message state: NEGOTIATE auths { method: TOKEN mechanism: DIGEST-MD5 protocol: serverId: default challenge: realm=\default\,nonce=\wjgFp9L22uDJt41FNtY9M8CP/T+dswfBoF48r9+s\,qop=\auth\,charset=utf-8,algorithm=md5-sess } auths { method: KERBEROS mechanism: GSSAPI protocol: rm serverId: node6.local } 15/07/16 04:27:34 DEBUG SaslRpcClient: Get token info proto:interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB info:org.apache.hadoop.yarn.security.client.ClientRMSecurityInfo$2@69990fa7 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Looking for a token with service 10.79.10.6:8050 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is YARN_AM_RM_TOKEN and the token's service name is 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is HIVE_DELEGATION_TOKEN and the token's service name is 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is TIMELINE_DELEGATION_TOKEN and the token's service name is 10.79.10.6:8188 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is HDFS_DELEGATION_TOKEN and the token's service name is 10.79.10.4:8020 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedActionException as:admin (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedAction as:admin (auth:SIMPLE) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643) 15/07/16 04:27:34 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG UserGroupInformation: 
PrivilegedActionException as:admin (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG Client: closing ipc connection to node6.local/10.79.10.6:8050: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at
[jira] [Updated] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8893: - Assignee: Daniel Darabos Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Daniel Darabos Assignee: Daniel Darabos Priority: Trivial Fix For: 1.5.0 What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8893. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7285 [https://github.com/apache/spark/pull/7285] Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial Fix For: 1.5.0 What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
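The fix (pull request 7285) adds the validation inside Spark; callers on earlier releases can guard against the reporter's scenario themselves. A minimal sketch, where {{a}} and {{b}} are hypothetical inputs whose integer division can round down to zero:
{code}
# Hypothetical caller-side guard: integer division can silently produce a
# non-positive partition count, which today yields an empty result.
def safe_repartition(rdd, a, b):
    p = a // b  # integer division; may round down to 0
    if p < 1:
        raise ValueError("partition count must be positive, got %d" % p)
    return rdd.repartition(p)

# rdd = sc.parallelize(range(1, 4))
# safe_repartition(rdd, 3, 10)  # raises instead of silently returning []
{code}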
[jira] [Commented] (SPARK-9067) Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
[ https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629399#comment-14629399 ] konstantin knizhnik commented on SPARK-9067: I have found workaround for the problem: substitute task context with fake context. I have implemented CombineRDD as replacement of CoalescedRDD and create separate task context for processing each partition: {quote} class CombineRDD[T: ClassTag](prev: RDD[T], maxPartitions: Int) extends RDD[T](prev) { val inputPartitions = prev.partitions class CombineIterator(partitions: Array[Partition], index: Int, context: TaskContext) extends Iterator[T] { var iter : Iterator[T] = null var i = index def hasNext() : Boolean = { while ((iter == null || !iter.hasNext) i partitions.length) { val ctx = new CombineTaskContext(context.stageId, context.partitionId, context.taskAttemptId, context.attemptNumber, null/*context.taskMemoryManager*/, context.isRunningLocally, context.taskMetrics) iter = firstParent[T].compute(partitions(i), ctx) //ctx.complete() partitions(i) = null i = i + maxPartitions } iter != null iter.hasNext } def next() = { iter.next } } class CombineTaskContext(val stageId: Int, val partitionId: Int, override val taskAttemptId: Long, override val attemptNumber: Int, override val taskMemoryManager: TaskMemoryManager, val runningLocally: Boolean = true, val taskMetrics: TaskMetrics = null) extends TaskContext { @transient private val onCompleteCallbacks = new ArrayBuffer[TaskCompletionListener] override def attemptId(): Long = taskAttemptId override def addTaskCompletionListener(listener: TaskCompletionListener): this.type = { onCompleteCallbacks += listener this } def complete(): Unit = { // Process complete callbacks in the reverse order of registration onCompleteCallbacks.reverse.foreach { listener = listener.onTaskCompletion(this) } } override def addTaskCompletionListener(f: TaskContext = Unit): this.type = { onCompleteCallbacks += new TaskCompletionListener { override def onTaskCompletion(context: TaskContext): Unit = f(context) } this } override def addOnCompleteCallback(f: () = Unit) { onCompleteCallbacks += new TaskCompletionListener { override def onTaskCompletion(context: TaskContext): Unit = f() } } override def isCompleted(): Boolean = false override def isRunningLocally(): Boolean = true override def isInterrupted(): Boolean = false } case class CombinePartition(index : Int) extends Partition protected def getPartitions: Array[Partition] = Array.tabulate(maxPartitions){i = CombinePartition(i)} override def compute(partition: Partition, context: TaskContext): Iterator[T] = { new CombineIterator(inputPartitions, partition.index, context) } } {quote} I works: no memory overflow or file limit exhaustion. But certainly it can not be considered as solution of the problem. Also please notice that I have to comment call of ctx.complete(), otherwise I got exception caused by access to closed stream. It is strange because I think that partition corresponds to single parquet file and so it can be proceeded independently. But looks like GCfinalization do their work. 
Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD - Key: SPARK-9067 URL: https://issues.apache.org/jira/browse/SPARK-9067 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.3.0, 1.4.0 Environment: Target system: Linux, 16 cores, 400Gb RAM Spark is started locally using the following command: {{ spark-submit --master local[16] --driver-memory 64G --executor-cores 16 --num-executors 1 --executor-memory 64G }} Reporter: konstantin knizhnik If the coalesce transformation with a small number of output partitions (in my case 16) is applied to a large Parquet file (in my case about 150Gb with 215k partitions), then it causes OutOfMemory exceptions (250Gb is not enough) and open file limit exhaustion (with the limit set to 8k). The source of the problem is in the SqlNewHadoopRDD.compute method: {quote} val reader = format.createRecordReader( split.serializableHadoopSplit.value, hadoopAttemptContext) reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext) // Register an
[jira] [Commented] (SPARK-9067) Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
[ https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629358#comment-14629358 ] konstantin knizhnik commented on SPARK-9067: Sorry, but this patch doesn't help. Looks like close is not closing everything... For example there is reference *parquet.io.RecordReaderT recordReader* in the class *parquet.hadoop.InternalParquetRecordReader* and according to hprof dump at OutOfMemory exception it contains references to array of *parquet.column.impl.ColumnReaderImpl* and after few indirections we reach *parquet.column.values.bitpacking.ByteBitPackingValuesReader* which field +encoded+ references 9Mb array. And InternalParquetRecordReader.close method doesn't close recordReader: {quote} public void close() throws IOException { if (reader != null) { reader.close(); } } {quote} Unfortunately I am not sure that it is the single place where close is not releasing all resources. Moreover I am not sure that even if close if close is done, it clears references to all used buffers. Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD - Key: SPARK-9067 URL: https://issues.apache.org/jira/browse/SPARK-9067 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.3.0, 1.4.0 Environment: Target system: Linux, 16 cores, 400Gb RAM Spark is started locally using the following command: {{ spark-submit --master local[16] --driver-memory 64G --executor-cores 16 --num-executors 1 --executor-memory 64G }} Reporter: konstantin knizhnik If coalesce transformation with small number of output partitions (in my case 16) is applied to large Parquet file (in my has about 150Gb with 215k partitions), then it case OutOfMemory exceptions 250Gb is not enough) and open file limit exhaustion (with limit set to 8k). The source of the problem is in SqlNewHad\oopRDD.compute method: {quote} val reader = format.createRecordReader( split.serializableHadoopSplit.value, hadoopAttemptContext) reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext) // Register an on-task-completion callback to close the input stream. context.addTaskCompletionListener(context = close()) {quote} Created Parquet file reader is intended to be closed at task completion time. This reader contains a lot of references to parquet.bytes.BytesInput object which in turn contains reference sot large byte arrays (some of them are several megabytes). As far as in case of CoalescedRDD task is completed only after processing larger number of parquet files, it cause file handles exhaustion and memory overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
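Given the mechanism described in the report (readers are only closed when a task completes, and a coalesced task reads many files), one possible mitigation until the reader lifecycle is fixed is to reduce the partition count with a shuffle, so each upstream task still reads and closes a single split. A rough PySpark sketch, at the cost of shuffling the data; path and partition count are illustrative:
{code}
# Possible mitigation sketch: a shuffle keeps one read task per input split,
# so each task's completion callback closes its Parquet reader before the
# merged 16-partition stage runs.
df = sqlContext.read.parquet("hdfs:///data/big_parquet_table")

reduced = df.repartition(16)              # DataFrame-level, always shuffles
# or, at the RDD level:
reduced_rdd = df.rdd.coalesce(16, shuffle=True)
{code}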
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409 ] Lianhui Wang edited comment on SPARK-8646 at 7/16/15 8:25 AM: -- yes, when i use this command: ./bin/spark-submit ./pi.py yarn-client 10, yarn' client do not upload pyspark.zip, so that can not be worked. i submit a PR that resolve this problem based on master branch. there is some problems on spark-1.4.0 branch because it finds pyspark libraries in sparkSubmit, not in Client. was (Author: lianhuiwang): yes, when i use this command: ./bin/spark-submit ./pi.py yarn-client 10, yarn' client do not upload pyspark.zip, so that can not be worked. i submit a PR that resolve this problem based on master branch. PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs result in a no module named pyspark when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (ie comands the use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409 ] Lianhui Wang edited comment on SPARK-8646 at 7/16/15 8:26 AM: -- yes, when i use this command: ./bin/spark-submit ./pi.py yarn-client 10, yarn' client do not upload pyspark.zip, so that can not be worked. i submit a PR that resolve this problem based on master branch. there is some problems on spark-1.4.0 branch because it finds pyspark libraries in sparkSubmit, not in Client. if this must be needed in spark-1.4.0, latter i will take a look at it. was (Author: lianhuiwang): yes, when i use this command: ./bin/spark-submit ./pi.py yarn-client 10, yarn' client do not upload pyspark.zip, so that can not be worked. i submit a PR that resolve this problem based on master branch. there is some problems on spark-1.4.0 branch because it finds pyspark libraries in sparkSubmit, not in Client. PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs result in a no module named pyspark when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (ie comands the use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9019) spark-submit fails on yarn with kerberos enabled
[ https://issues.apache.org/jira/browse/SPARK-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629432#comment-14629432 ] Bolke de Bruin commented on SPARK-9019: --- I tried running this on an update environment, however it still fails although behavior is a bit different now. The task is now being accepted but stays in the running state forever without executing anything. Please note that the trace below is without key tab usage, but with an authorized user (kinit admin/admin) 15/07/16 04:27:34 DEBUG Client: getting client out of cache: org.apache.hadoop.ipc.Client@53abb73 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(ReviveOffers,false) from Actor[akka://sparkDriver/deadLetters] 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(ReviveOffers,false) 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (1.632126 ms) AkkaMessage(ReviveOffers,false) from Actor[akka://sparkDriver/deadLetters] 15/07/16 04:27:34 DEBUG AbstractService: Service org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl is started 15/07/16 04:27:34 DEBUG AbstractService: Service org.apache.hadoop.yarn.client.api.impl.YarnClientImpl is started 15/07/16 04:27:34 DEBUG Client: The ping interval is 6 ms. 15/07/16 04:27:34 DEBUG Client: Connecting to node6.local/10.79.10.6:8050 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedAction as:admin (auth:SIMPLE) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717) 15/07/16 04:27:34 DEBUG SaslRpcClient: Sending sasl message state: NEGOTIATE 15/07/16 04:27:34 DEBUG SaslRpcClient: Received SASL message state: NEGOTIATE auths { method: TOKEN mechanism: DIGEST-MD5 protocol: serverId: default challenge: realm=\default\,nonce=\wjgFp9L22uDJt41FNtY9M8CP/T+dswfBoF48r9+s\,qop=\auth\,charset=utf-8,algorithm=md5-sess } auths { method: KERBEROS mechanism: GSSAPI protocol: rm serverId: node6.local } 15/07/16 04:27:34 DEBUG SaslRpcClient: Get token info proto:interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB info:org.apache.hadoop.yarn.security.client.ClientRMSecurityInfo$2@69990fa7 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Looking for a token with service 10.79.10.6:8050 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is YARN_AM_RM_TOKEN and the token's service name is 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is HIVE_DELEGATION_TOKEN and the token's service name is 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is TIMELINE_DELEGATION_TOKEN and the token's service name is 10.79.10.6:8188 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is HDFS_DELEGATION_TOKEN and the token's service name is 10.79.10.4:8020 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedActionException as:admin (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedAction as:admin (auth:SIMPLE) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643) 15/07/16 04:27:34 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedActionException as:admin (auth:SIMPLE) 
cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG Client: closing ipc connection to node6.local/10.79.10.6:8050: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at
[jira] [Created] (SPARK-9098) Inconsistent Dense Vectors hashing between PySpark and Scala
Maciej Szymkiewicz created SPARK-9098: - Summary: Inconsistent Dense Vectors hashing between PySpark and Scala Key: SPARK-9098 URL: https://issues.apache.org/jira/browse/SPARK-9098 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.4.0, 1.3.1 Reporter: Maciej Szymkiewicz Priority: Minor When using Scala it is possible to group an RDD using a DenseVector as a key: {code} import org.apache.spark.mllib.linalg.Vectors val rdd = sc.parallelize( (Vectors.dense(1, 2, 3), 10) :: (Vectors.dense(1, 2, 3), 20) :: Nil) rdd.groupByKey.count {code} returns 1 as expected. In PySpark {{DenseVector}} {{__hash__}} seems to be inherited from {{object}} and based on the memory address: {code} from pyspark.mllib.linalg import DenseVector rdd = sc.parallelize( [(DenseVector([1, 2, 3]), 10), (DenseVector([1, 2, 3]), 20)]) rdd.groupByKey().count() {code} returns 2. Since the underlying `numpy.ndarray` can be used to mutate a DenseVector, hashing doesn't look meaningful at all: {code} dv = DenseVector([1, 2, 3]) hdv1 = hash(dv) dv.array[0] = 3.0 hdv2 = hash(dv) hdv1 == hdv2 True dv == DenseVector([1, 2, 3]) False {code} In my opinion the best approach would be to enforce immutability and provide meaningful hashing. An alternative is to make {{DenseVector}} unhashable, the same as {{numpy.ndarray}}. Source: http://stackoverflow.com/questions/31449412/how-to-groupbykey-a-rdd-with-densevector-as-key-in-spark/31451752 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
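Until the hashing semantics are settled one way or the other, a caller-side workaround is to key by an immutable, hashable representation of the vector. A small sketch; using a tuple of the values is just one illustrative choice:
{code}
# Workaround sketch: group by a tuple of the vector's values rather than by
# the DenseVector object, so equal vectors hash equally.
from pyspark.mllib.linalg import DenseVector

rdd = sc.parallelize([(DenseVector([1, 2, 3]), 10),
                      (DenseVector([1, 2, 3]), 20)])

keyed = rdd.map(lambda kv: (tuple(kv[0].toArray()), kv[1]))
print(keyed.groupByKey().count())  # 1, matching the Scala behaviour
{code}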
[jira] [Created] (SPARK-9099) spark-ec2 does not add important ports to security group
Brian Sung-jin Hong created SPARK-9099: -- Summary: spark-ec2 does not add important ports to security group Key: SPARK-9099 URL: https://issues.apache.org/jira/browse/SPARK-9099 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0, 1.4.1 Reporter: Brian Sung-jin Hong The spark-ec2 script fails to add a few important ports to the security group, including: Master 6066: Needed to submit jobs outside of the cluster Slave 4040: Needed to view worker state Slave 8082: Needed to view some worker logs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
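Until the script change lands, the missing rules can be added by hand. A rough boto 2 sketch under stated assumptions: the region, cluster name and CIDR are placeholders, and spark-ec2 is assumed to have created groups named {{<cluster>-master}} and {{<cluster>-slaves}}:
{code}
# Rough sketch of opening the ports listed above by hand (boto 2).
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")          # placeholder region
groups = dict((g.name, g) for g in conn.get_all_security_groups())

groups["my-cluster-master"].authorize("tcp", 6066, 6066, "0.0.0.0/0")  # job submission
groups["my-cluster-slaves"].authorize("tcp", 4040, 4040, "0.0.0.0/0")  # worker/app UI
groups["my-cluster-slaves"].authorize("tcp", 8082, 8082, "0.0.0.0/0")  # worker logs
{code}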
[jira] [Issue Comment Deleted] (SPARK-9044) Updated RDD name does not reflect under Storage tab
[ https://issues.apache.org/jira/browse/SPARK-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang, Liye updated SPARK-9044: --- Comment: was deleted (was: Well, I think the component is correct, still it's business of Web UI) Updated RDD name does not reflect under Storage tab - Key: SPARK-9044 URL: https://issues.apache.org/jira/browse/SPARK-9044 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1, 1.4.0 Environment: Mac OSX Reporter: Wenjie Zhang Priority: Minor I was playing the spark-shell in my macbook, here is what I did: scala val textFile = sc.textFile(/Users/jackzhang/Downloads/ProdPart.txt); scala textFile.cache scala textFile.setName(test1) scala textFile.collect scala textFile.name res10: String = test1 After this four commands, I can see the test1 RDD listed in the Storage tab. However, if I continually run following commands, nothing will happen from the Storage tab: scala textFile.setName(test2) scala textFile.cache scala textFile.collect scala textFile.name res10: String = test2 I am expecting the name of the RDD shows in Storage tab should be test2, is this a bug? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629632#comment-14629632 ] Apache Spark commented on SPARK-9091: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/7442 Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to have this interface in DStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9100) DataFrame reader/writer shortcut methods for ORC
Cheng Lian created SPARK-9100: - Summary: DataFrame reader/writer shortcut methods for ORC Key: SPARK-9100 URL: https://issues.apache.org/jira/browse/SPARK-9100 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
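The ticket carries no description, but judging from the title it presumably mirrors the existing {{parquet}}/{{json}} shortcuts on {{DataFrameReader}}/{{DataFrameWriter}}. A hedged sketch of the long form that already works (the ORC source requires a Hive-enabled context) next to the kind of shortcut being proposed; paths are illustrative:
{code}
# Long form that works today (ORC needs HiveContext), plus the shortcut this
# ticket presumably adds, shown commented out since it does not exist yet.
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

df = sqlContext.read.format("orc").load("/data/events_orc")
df.write.format("orc").save("/data/events_orc_copy")

# Proposed shortcut, analogous to .parquet()/.json():
# df = sqlContext.read.orc("/data/events_orc")
# df.write.orc("/data/events_orc_copy")
{code}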
[jira] [Commented] (SPARK-9098) Inconsistent Dense Vectors hashing between PySpark and Scala
[ https://issues.apache.org/jira/browse/SPARK-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629685#comment-14629685 ] Abou Haydar Elias commented on SPARK-9098: -- This issue creates an inconsistency in the API. So I totally agree with [~zero323] on enforcing immutability and providing meaningful hashing. That can be a good approach. Inconsistent Dense Vectors hashing between PySpark and Scala Key: SPARK-9098 URL: https://issues.apache.org/jira/browse/SPARK-9098 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.3.1, 1.4.0 Reporter: Maciej Szymkiewicz Priority: Minor When using Scala it is possible to group an RDD using a DenseVector as a key: {code} import org.apache.spark.mllib.linalg.Vectors val rdd = sc.parallelize( (Vectors.dense(1, 2, 3), 10) :: (Vectors.dense(1, 2, 3), 20) :: Nil) rdd.groupByKey.count {code} returns 1 as expected. In PySpark {{DenseVector}} {{__hash__}} seems to be inherited from {{object}} and based on the memory address: {code} from pyspark.mllib.linalg import DenseVector rdd = sc.parallelize( [(DenseVector([1, 2, 3]), 10), (DenseVector([1, 2, 3]), 20)]) rdd.groupByKey().count() {code} returns 2. Since the underlying `numpy.ndarray` can be used to mutate a DenseVector, hashing doesn't look meaningful at all: {code} dv = DenseVector([1, 2, 3]) hdv1 = hash(dv) dv.array[0] = 3.0 hdv2 = hash(dv) hdv1 == hdv2 True dv == DenseVector([1, 2, 3]) False {code} In my opinion the best approach would be to enforce immutability and provide meaningful hashing. An alternative is to make {{DenseVector}} unhashable, the same as {{numpy.ndarray}}. Source: http://stackoverflow.com/questions/31449412/how-to-groupbykey-a-rdd-with-densevector-as-key-in-spark/31451752 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9101) Can't use null in selectExpr
Mateusz Buśkiewicz created SPARK-9101: - Summary: Can't use null in selectExpr Key: SPARK-9101 URL: https://issues.apache.org/jira/browse/SPARK-9101 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Mateusz Buśkiewicz In 1.3.1 this worked: {code:python} df = sqlContext.createDataFrame([[1]], schema=['col']) df.selectExpr('null as newCol').collect() {code} In 1.4.0 it fails with the following stacktrace: {code} Traceback (most recent call last): File input, line 1, in module File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/dataframe.py, line 316, in collect cls = _create_cls(self.schema) File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/dataframe.py, line 229, in schema self._schema = _parse_datatype_json_string(self._jdf.schema().json()) File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/types.py, line 519, in _parse_datatype_json_string return _parse_datatype_json_value(json.loads(json_string)) File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/types.py, line 539, in _parse_datatype_json_value return _all_complex_types[tpe].fromJson(json_value) File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/types.py, line 386, in fromJson return StructType([StructField.fromJson(f) for f in json[fields]]) File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/types.py, line 347, in fromJson _parse_datatype_json_value(json[type]), File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/types.py, line 535, in _parse_datatype_json_value raise ValueError(Could not parse datatype: %s % json_value) ValueError: Could not parse datatype: null {code} https://github.com/apache/spark/blob/v1.4.0/python/pyspark/sql/types.py#L461 The cause:_atomic_types doesn't contain NullType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
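Until NullType is handled on the Python side, one workaround that avoids putting a NullType into the schema is to cast the null to a concrete type in the expression. A minimal sketch, assuming the same {{sqlContext}} and data as above:
{code}
# Workaround sketch: give the null column a concrete type so the returned
# schema contains no NullType and parses cleanly in Python.
df = sqlContext.createDataFrame([[1]], schema=['col'])
df.selectExpr('cast(null as string) as newCol').collect()
{code}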
[jira] [Assigned] (SPARK-9100) DataFrame reader/writer shortcut methods for ORC
[ https://issues.apache.org/jira/browse/SPARK-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9100: --- Assignee: Cheng Lian (was: Apache Spark) DataFrame reader/writer shortcut methods for ORC Key: SPARK-9100 URL: https://issues.apache.org/jira/browse/SPARK-9100 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9100) DataFrame reader/writer shortcut methods for ORC
[ https://issues.apache.org/jira/browse/SPARK-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9100: --- Assignee: Apache Spark (was: Cheng Lian) DataFrame reader/writer shortcut methods for ORC Key: SPARK-9100 URL: https://issues.apache.org/jira/browse/SPARK-9100 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629639#comment-14629639 ] SaintBacchus commented on SPARK-9091: - [~sowen] I agree user can design the output by DStream.foreachRDD, I purpose it for convenience to use. In my case, I had copy a bit code from Spark to adapt this function and I guess others may also have this scenario, so I open this Jira to push it into Spark. Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to have this interface in DStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
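The foreachRDD route mentioned in the comment already allows compressed text output; the proposal is essentially to fold this boilerplate into DStream. A sketch of that workaround in the Python API, where the DStream {{lines}}, the output prefix and the codec class are illustrative:
{code}
# Sketch of the existing foreachRDD workaround: per-batch compressed text
# output without a dedicated DStream method.
def save_compressed(time, rdd):
    # `time` is the batch time; turn it into a filesystem-friendly suffix
    if not rdd.isEmpty():
        rdd.saveAsTextFile(
            "hdfs:///tmp/stream-out-" + time.strftime("%Y%m%d-%H%M%S"),
            compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

lines.foreachRDD(save_compressed)  # `lines` is an assumed existing DStream
{code}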
[jira] [Updated] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-9097: - Attachment: number of executor is zero.png Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Attachments: number of executor is zero.png I set the value of spark.dynamicAllocation.enabled is true. I submit tasks to run. Tasks are not completed, but the number of executor is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9099) spark-ec2 does not add important ports to security group
[ https://issues.apache.org/jira/browse/SPARK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629649#comment-14629649 ] Apache Spark commented on SPARK-9099: - User 'serialx' has created a pull request for this issue: https://github.com/apache/spark/pull/7443 spark-ec2 does not add important ports to security group Key: SPARK-9099 URL: https://issues.apache.org/jira/browse/SPARK-9099 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0, 1.4.1 Reporter: Brian Sung-jin Hong The spark-ec2 script fails to add a few important ports to the security group, including: Master 6066: Needed to submit jobs outside of the cluster Slave 4040: Needed to view worker state Slave 8082: Needed to view some worker logs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629673#comment-14629673 ] KaiXinXIaoLei commented on SPARK-9097: -- I run a big job. During running tasks, five tasks failed. Then executors are killed. But there are many tasks to run. The log info: 2015-07-08 15:03:30,583 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1568.0 in stage 167.0 (TID 25557, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1549.0 in stage 167.0 (TID 25538, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1552.0 in stage 167.0 (TID 25541, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1569.0 in stage 167.0 (TID 25558, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1548.0 in stage 167.0 (TID 25537, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | INFO | [dag-scheduler-event-loop] | Executor lost: 52 (epoch 29) 2015-07-08 15:03:30,584 | INFO | [kill-executor-thread] | Requesting to kill executor(s) 52 2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Trying to remove executor 52 from BlockManagerMaster. 2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Removing block manager BlockManagerId(52, 9.91.8.174, 23424) 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Removed 52 successfully in removeExecutor 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Host added was in lost list earlier: hostname Then I can't find executors to add, and not find failed task to re-submit. Thanks. Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Attachments: number of executor is zero.png, tasks are not completed.png I set the value of spark.dynamicAllocation.enabled is true. I submit tasks to run. Tasks are not completed, but the number of executor is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629673#comment-14629673 ] KaiXinXIaoLei edited comment on SPARK-9097 at 7/16/15 12:57 PM: I run a big job. During running tasks, five tasks failed. Then executors are killed. But there are many tasks to run. The log info: 2015-07-08 15:03:30,583 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1568.0 in stage 167.0 (TID 25557, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1549.0 in stage 167.0 (TID 25538, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1552.0 in stage 167.0 (TID 25541, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1569.0 in stage 167.0 (TID 25558, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1548.0 in stage 167.0 (TID 25537, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | INFO | [dag-scheduler-event-loop] | Executor lost: 52 (epoch 29) 2015-07-08 15:03:30,584 | INFO | [kill-executor-thread] | Requesting to kill executor(s) 52 2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Trying to remove executor 52 from BlockManagerMaster. 2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Removing block manager BlockManagerId(52, 9.91.8.174, 23424) 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Removed 52 successfully in removeExecutor 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Host added was in lost list earlier: hostname Then I can't find executors to add, and not find failed task to re-submit in log. Thanks. was (Author: kaixinxiaolei): I run a big job. During running tasks, five tasks failed. Then executors are killed. But there are many tasks to run. The log info: 2015-07-08 15:03:30,583 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1568.0 in stage 167.0 (TID 25557, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1549.0 in stage 167.0 (TID 25538, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1552.0 in stage 167.0 (TID 25541, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1569.0 in stage 167.0 (TID 25558, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1548.0 in stage 167.0 (TID 25537, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | INFO | [dag-scheduler-event-loop] | Executor lost: 52 (epoch 29) 2015-07-08 15:03:30,584 | INFO | [kill-executor-thread] | Requesting to kill executor(s) 52 2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Trying to remove executor 52 from BlockManagerMaster. 
2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Removing block manager BlockManagerId(52, 9.91.8.174, 23424) 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Removed 52 successfully in removeExecutor 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Host added was in lost list earlier: hostname Then I can't find executors to add, and not find failed task to re-submit. Thanks. Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Attachments: number of executor is zero.png, tasks are not completed.png I set the value of spark.dynamicAllocation.enabled is true. I submit tasks to run. Tasks are not completed, but the number of executor is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gisle Ytrestøl updated SPARK-9096: -- Attachment: hanging-one-task.jpg Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Priority: Minor Attachments: ReproduceBug.java, hanging-one-task.jpg, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the following operations on the new JavaRDD created by subtract(). The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks take a long time to finish. I've reproduced this bug in the attached Java file, which I submit with spark-submit. The logs for 1.3.1 and 1.4.1 are attached. In 1.4.1, we see that a few tasks in the count job take a lot of time: 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600) 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600) 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600) 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600) 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s In comparison, all tasks are more or less equal in size when running the same application in Spark 1.3.1. Overall, the attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1 and completes in roughly 30 seconds on Spark 1.3.1. Spark 1.4.0 behaves similarly to Spark 1.4.1 with respect to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-9097: - Attachment: tasks are not completed.png Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Attachments: number of executor is zero.png, tasks are not completed.png I set spark.dynamicAllocation.enabled to true and submitted tasks to run. The tasks are not completed, but the number of executors is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-9097: - Target Version/s: 1.5.0 Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Attachments: number of executor is zero.png, tasks are not completed.png I set spark.dynamicAllocation.enabled to true and submitted tasks to run. The tasks are not completed, but the number of executors is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9044) Updated RDD name does not reflect under Storage tab
[ https://issues.apache.org/jira/browse/SPARK-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629704#comment-14629704 ] Zhang, Liye commented on SPARK-9044: Well, I think the component is correct; it's still the business of the Web UI. Updated RDD name does not reflect under Storage tab - Key: SPARK-9044 URL: https://issues.apache.org/jira/browse/SPARK-9044 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1, 1.4.0 Environment: Mac OSX Reporter: Wenjie Zhang Priority: Minor I was playing with the spark-shell on my MacBook; here is what I did: scala> val textFile = sc.textFile("/Users/jackzhang/Downloads/ProdPart.txt"); scala> textFile.cache scala> textFile.setName("test1") scala> textFile.collect scala> textFile.name res10: String = test1 After these four commands, I can see the test1 RDD listed in the Storage tab. However, if I then run the following commands, nothing changes in the Storage tab: scala> textFile.setName("test2") scala> textFile.cache scala> textFile.collect scala> textFile.name res10: String = test2 I expect the name of the RDD shown in the Storage tab to be test2; is this a bug? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629584#comment-14629584 ] Gisle Ytrestøl commented on SPARK-9096: --- Hi, thanks for responding. I've added a screenshot (hanging-one-task.jpg) in which one of the tasks processes a lot of data (nearly all of it), while the other tasks are assigned only tiny amounts of data. This screenshot is from 1.4.0. If I run the same application in Spark 1.3.1, all the tasks get roughly the same amount of data and spend about the same amount of time to finish. Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Priority: Minor Attachments: ReproduceBug.java, hanging-one-task.jpg, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the following operations on the new JavaRDD created by subtract(). The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks take a long time to finish. I've reproduced this bug in the attached Java file, which I submit with spark-submit. The logs for 1.3.1 and 1.4.1 are attached. In 1.4.1, we see that a few tasks in the count job take a lot of time: 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600) 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600) 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600) 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600) 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s In comparison, all tasks are more or less equal in size when running the same application in Spark 1.3.1. Overall, the attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1 and completes in roughly 30 seconds on Spark 1.3.1. Spark 1.4.0 behaves similarly to Spark 1.4.1 with respect to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
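A minimal Scala sketch of the pattern being reported (an RDD produced by subtract() followed by a count), with an extra per-partition size check to make any skew visible; the data sizes and partition counts are assumptions for illustration and are not taken from ReproduceBug.java:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("subtract-skew-sketch"))

// Two overlapping datasets; sizes and partition counts are illustrative only.
val all    = sc.parallelize(1 to 1000000, 200)
val remove = sc.parallelize(1 to 900000, 200)

// The RDD produced by subtract is where the uneven task sizes are observed.
val remaining = all.subtract(remove)

// Count, as in ReproduceBug.java: the report says a few tasks end up with nearly all the data on 1.4.x.
println(remaining.count())

// Illustrative diagnostic: record count per partition of the subtracted RDD.
remaining
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()
  .foreach { case (i, n) => println(s"partition $i holds $n records") }
{code}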
[jira] [Commented] (SPARK-6001) K-Means clusterer should return the assignments of input points to clusters
[ https://issues.apache.org/jira/browse/SPARK-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629808#comment-14629808 ] Manoj Kumar commented on SPARK-6001: I just started working on this. K-Means clusterer should return the assignments of input points to clusters --- Key: SPARK-6001 URL: https://issues.apache.org/jira/browse/SPARK-6001 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: Derrick Burns Priority: Minor The K-Means clusterer returns a KMeansModel that contains the cluster centers. However, I suggest that, when available, the K-Means clusterer also return an RDD of the assignments of the input data to the clusters. While the assignments can be computed from the KMeansModel, returning them when they are already available would save re-computation costs. The K-means implementation at https://github.com/derrickburns/generalized-kmeans-clustering returns the assignments when available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
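For context, a minimal sketch of how the assignments currently have to be recomputed from the returned KMeansModel; the input data, k, and iteration count are illustrative, and sc is assumed to be an existing SparkContext:
{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Illustrative input: a tiny RDD of feature vectors.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Train with k = 2 clusters and 20 iterations (illustrative values).
val model = KMeans.train(data, 2, 20)

// What this issue asks the clusterer to return directly: the cluster index of each
// input point. Today it is recomputed with predict(), an extra pass over the data.
val assignments = model.predict(data)   // RDD[Int]
data.zip(assignments).collect().foreach { case (point, cluster) =>
  println(s"$point -> cluster $cluster")
}
{code}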
[jira] [Created] (SPARK-9104) expose network layer memory usage in shuffle part
Zhang, Liye created SPARK-9104: -- Summary: expose network layer memory usage in shuffle part Key: SPARK-9104 URL: https://issues.apache.org/jira/browse/SPARK-9104 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Zhang, Liye The default network transport is Netty, and when transferring blocks for shuffle, the network layer consumes a fair amount of memory; we should collect the memory usage of this part and expose it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9105) Add an additional WebUi Tab for Memory Usage
Zhang, Liye created SPARK-9105: -- Summary: Add an additional WebUi Tab for Memory Usage Key: SPARK-9105 URL: https://issues.apache.org/jira/browse/SPARK-9105 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Zhang, Liye Add a Spark WebUI tab for memory usage. The tab should expose memory usage status in different Spark components. It should show a summary for each executor and possibly also the details for each task. This tab may duplicate some information from the Storage tab, but in a different presentation: take RDD cache for example, the RDD cache size shown on the Storage tab is indexed by RDD name, while on the memory usage tab the RDDs can be indexed by executor or task. Also, the two tabs can share some of the same web pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9073: --- Assignee: (was: Apache Spark) spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whoever writes a PR for this JIRA should check all spark.ml Models' copy() methods and set the copy's {{Model.parent}} when available. Also verify this in unit tests (possibly in a shared standard method for checking Models). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9073: --- Assignee: Apache Spark spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whoever writes a PR for this JIRA should check all spark.ml Models' copy() methods and set the copy's {{Model.parent}} when available. Also verify this in unit tests (possibly in a shared standard method for checking Models). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
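For reference, the fix amounts to having each Model's copy() propagate the parent, roughly along the following lines. This is a schematic fragment inside a model class, not standalone code or the actual patch; MyModel and its constructor are hypothetical stand-ins for concrete classes such as the linked DecisionTreeClassificationModel and ALSModel:
{code}
import org.apache.spark.ml.param.ParamMap

// Schematic only: MyModel stands in for any concrete spark.ml Model subclass.
override def copy(extra: ParamMap): MyModel = {
  val copied = copyValues(new MyModel(uid), extra)
  // The step this JIRA is about: keep the link to the parent Estimator,
  // which is set when the model is produced by Estimator.fit().
  if (parent != null) copied.setParent(parent) else copied
}
{code}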
[jira] [Created] (SPARK-9102) Improve project collapse with nondeterministic expressions
Wenchen Fan created SPARK-9102: -- Summary: Improve project collapse with nondeterministic expressions Key: SPARK-9102 URL: https://issues.apache.org/jira/browse/SPARK-9102 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9103) Tracking spark's memory usage
Zhang, Liye created SPARK-9103: -- Summary: Tracking spark's memory usage Key: SPARK-9103 URL: https://issues.apache.org/jira/browse/SPARK-9103 Project: Spark Issue Type: Umbrella Components: Spark Core, Web UI Reporter: Zhang, Liye Currently Spark provides only a little memory usage information (RDD cache on the web UI) for the executors. Users have no idea what the memory consumption is when they are running Spark applications that use a lot of memory in the executors. Especially when they encounter an OOM, it's really hard to know what the cause of the problem is. So it would be helpful to give out detailed memory consumption information for each part of Spark, so that users can have a clear picture of where exactly the memory is used. The memory usage info to expose should include, but not be limited to, shuffle, cache, network, serializer, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9082) Filter using non-deterministic expressions should not be pushed down
[ https://issues.apache.org/jira/browse/SPARK-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9082: --- Assignee: Apache Spark (was: Wenchen Fan) Filter using non-deterministic expressions should not be pushed down Key: SPARK-9082 URL: https://issues.apache.org/jira/browse/SPARK-9082 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Apache Spark For example,
{code}
val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))
df.as("a").join(df.filter($"r" < 0.5).as("b"), $"a.id" === $"b.id").explain(true)
{code}
The plan is
{code}
== Physical Plan ==
ShuffledHashJoin [id#55323L], [id#55327L], BuildRight
 Exchange (HashPartitioning 200)
  Project [id#55323L,Rand 0 AS r#55324]
   PhysicalRDD [id#55323L], MapPartitionsRDD[42268] at range at <console>:37
 Exchange (HashPartitioning 200)
  Project [id#55327L,Rand 0 AS r#55325]
   Filter (LessThan)
    PhysicalRDD [id#55327L], MapPartitionsRDD[42268] at range at <console>:37
{code}
The rand gets evaluated twice instead of once. This is caused by the fact that when we push down predicates, we replace the attribute reference in the predicate with the actual expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9082) Filter using non-deterministic expressions should not be pushed down
[ https://issues.apache.org/jira/browse/SPARK-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629816#comment-14629816 ] Apache Spark commented on SPARK-9082: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7446 Filter using non-deterministic expressions should not be pushed down Key: SPARK-9082 URL: https://issues.apache.org/jira/browse/SPARK-9082 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Wenchen Fan For example,
{code}
val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))
df.as("a").join(df.filter($"r" < 0.5).as("b"), $"a.id" === $"b.id").explain(true)
{code}
The plan is
{code}
== Physical Plan ==
ShuffledHashJoin [id#55323L], [id#55327L], BuildRight
 Exchange (HashPartitioning 200)
  Project [id#55323L,Rand 0 AS r#55324]
   PhysicalRDD [id#55323L], MapPartitionsRDD[42268] at range at <console>:37
 Exchange (HashPartitioning 200)
  Project [id#55327L,Rand 0 AS r#55325]
   Filter (LessThan)
    PhysicalRDD [id#55327L], MapPartitionsRDD[42268] at range at <console>:37
{code}
The rand gets evaluated twice instead of once. This is caused by the fact that when we push down predicates, we replace the attribute reference in the predicate with the actual expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9100) DataFrame reader/writer shortcut methods for ORC
[ https://issues.apache.org/jira/browse/SPARK-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629752#comment-14629752 ] Apache Spark commented on SPARK-9100: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/7444 DataFrame reader/writer shortcut methods for ORC Key: SPARK-9100 URL: https://issues.apache.org/jira/browse/SPARK-9100 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
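Presumably this covers shortcut methods analogous to the existing parquet()/json() helpers on DataFrameReader/DataFrameWriter. A sketch of the intended usage under that assumption; the paths are illustrative, and sqlContext is assumed to be a HiveContext since ORC support lives in the Hive module:
{code}
// Today ORC goes through the generic data source API:
val people = sqlContext.read.format("orc").load("/tmp/people.orc")
people.write.format("orc").save("/tmp/people_backup.orc")

// With the proposed shortcuts, mirroring read.parquet()/write.parquet():
val people2 = sqlContext.read.orc("/tmp/people.orc")
people2.write.orc("/tmp/people_backup.orc")
{code}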
[jira] [Commented] (SPARK-9059) Update Direct Kafka Word count examples to show the use of HasOffsetRanges
[ https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629770#comment-14629770 ] Benjamin Fradet commented on SPARK-9059: I've started working on this. Update Direct Kafka Word count examples to show the use of HasOffsetRanges -- Key: SPARK-9059 URL: https://issues.apache.org/jira/browse/SPARK-9059 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Labels: starter Update the Scala, Java and Python examples of Direct Kafka word count to access the offset ranges using HasOffsetRanges and print them. For example, in Scala:
{code}
var offsetRanges: Array[OffsetRange] = _
...
directKafkaDStream.foreachRDD { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
}
...
transformedDStream.foreachRDD { rdd =>
  // some operation
  println("Processed ranges: " + offsetRanges)
}
{code}
See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for more info, and the master source code for more up-to-date information on Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org