[jira] [Created] (SPARK-9814) EqualNotNull not passing to data sources
Hyukjin Kwon created SPARK-9814: --- Summary: EqualNotNull not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: Input/Output Environment: Centos 6.6 Reporter: Hyukjin Kwon Priority: Minor When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in `org.apache.spark.sql.sources`, which are appropriately built and picked up by `selectFilters()` in `org.apache.spark.sql.sources.DataSourceStrategy`. However, it does not pass the `EqualNullSafe` filter in `org.apache.spark.sql.catalyst.expressions`, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, `EqualNullSafe` is not passed to `buildScan` (below) in `PrunedFilteredScan` and `PrunedScan`, ``` def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] ``` even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that `CatalystScan` can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. ``` SELECT * FROM table WHERE field = 1; ``` 2. ``` SELECT * FROM table WHERE field <=> 1; ``` The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
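For context, this is roughly what the interface in question looks like from a data source author's side. The relation below is a hypothetical sketch (class name, schema, and println logging are illustrative only): it shows that only `org.apache.spark.sql.sources` filters such as `EqualTo` ever reach `buildScan()`, while no source-level counterpart of the null-safe comparison is forwarded.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Hypothetical relation used only to illustrate which filters reach buildScan().
class DummyRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(StructField("field", IntegerType) :: Nil)

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    filters.foreach {
      case EqualTo(attr, value) =>
        // `WHERE field = 1` shows up here and can be pushed down to the source.
        println(s"can push down: $attr = $value")
      case other =>
        // `WHERE field <=> 1` never shows up at all: selectFilters() passes no
        // source-level equivalent of catalyst's EqualNullSafe.
        println(s"other filter: $other")
    }
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
{code}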
[jira] [Resolved] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-9340. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8070 [https://github.com/apache/spark/pull/8070] CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Assignee: Cheng Lian Fix For: 1.5.0 Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
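As a quick sanity check of the fix, a file written with the schema above should now load with the non-nullable array type quoted in the description. Something along these lines would verify it (the path is a placeholder and `sqlContext` is assumed to be a spark-shell context):
{code}
import org.apache.spark.sql.types._

// Catalyst schema expected for `message root { repeated int32 f1 }`:
// a required list of required int32 elements.
val expected = StructType(
  StructField("f1", ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil)

// Placeholder path to a parquet-protobuf generated file with that schema.
val df = sqlContext.read.parquet("/path/to/protobuf-generated.parquet")
assert(df.schema == expected)
{code}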
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-9814: Description: When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in org.apache.spark.sql.sources, which are appropriately built and picked up by selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy. However, it does not pass the EqualNullSafe filter in org.apache.spark.sql.catalyst.expressions, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, EqualNullSafe is not passed to buildScan in PrunedFilteredScan and PrunedScan, even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that CatalystScan can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. SELECT * FROM table WHERE field = 1; 2. SELECT * FROM table WHERE field <=> 1; The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. was: When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in `org.apache.spark.sql.sources`, which are appropriately built and picked up by `selectFilters()` in `org.apache.spark.sql.sources.DataSourceStrategy`. However, it does not pass the `EqualNullSafe` filter in `org.apache.spark.sql.catalyst.expressions`, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, `EqualNullSafe` is not passed to `buildScan` (below) in `PrunedFilteredScan` and `PrunedScan`, ``` def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] ``` even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that `CatalystScan` can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. ``` SELECT * FROM table WHERE field = 1; ``` 2. ``` SELECT * FROM table WHERE field <=> 1; ``` The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. EqualNotNull not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: Input/Output Environment: Centos 6.6 Reporter: Hyukjin Kwon Priority: Minor When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in org.apache.spark.sql.sources, which are appropriately built and picked up by selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy. However, it does not pass the EqualNullSafe filter in org.apache.spark.sql.catalyst.expressions, even though passing it appears feasible for data sources such as Parquet and JSON. 
In more detail, EqualNullSafe is not passed to buildScan in PrunedFilteredScan and PrunedScan, even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that CatalystScan can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. SELECT * FROM table WHERE field = 1; 2. SELECT * FROM table WHERE field <=> 1; The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9813) Incorrect UNION ALL behavior
[ https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681246#comment-14681246 ] Herman van Hovell commented on SPARK-9813: -- So I am not too sure if we want to maintain that level of Hive compatibility. It seems a bit too strict. Any kind of union should be fine as long as the data types match (IMHO). Is there a realistic use case for this? Incorrect UNION ALL behavior Key: SPARK-9813 URL: https://issues.apache.org/jira/browse/SPARK-9813 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.4.1 Environment: Ubuntu on AWS Reporter: Simeon Simeonov Labels: sql, union According to the [Hive Language Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] for UNION ALL: {quote} The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown. {quote} Spark SQL silently swallows an error when the tables being joined with UNION ALL have the same number of columns but different names. Reproducible example:
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note category vs. cat names of first column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")

// +--------+---+
// |category|num|
// +--------+---+
// |       A|  5|
// |       A|  5|
// +--------+---+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
When the number of columns is different, Spark can even mix in datatypes. 
Reproducible example (requires a new spark-shell session):
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note test_another is missing category column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"num" : 5}""")

// +--------+
// |category|
// +--------+
// |       A|
// |       5|
// +--------+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
At other times, when the schemas are complex, Spark SQL produces a misleading error about an unresolved Union operator:
{code}
scala> ctx.sql("""select * from view_clicks
     | union all
     | select * from view_clicks_aug
     | """)
15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks union all select * from view_clicks_aug
15/08/11 02:40:25 INFO ParseDriver: Parse Completed
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
  at
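However the failure surfaces, a caller-side guard makes the mismatch explicit today. The helper below is a hypothetical sketch of the check being discussed, not a proposed fix for the analyzer:
{code}
import org.apache.spark.sql.DataFrame

// Hypothetical guard: fail fast when the two sides of UNION ALL disagree on
// column names or types, instead of relying on silent positional resolution.
def strictUnionAll(left: DataFrame, right: DataFrame): DataFrame = {
  require(left.schema == right.schema,
    s"UNION ALL schema mismatch: ${left.schema.simpleString} vs ${right.schema.simpleString}")
  left.unionAll(right)
}
{code}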
[jira] [Assigned] (SPARK-9815) Rename PlatformDependent.UNSAFE -> Platform
[ https://issues.apache.org/jira/browse/SPARK-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9815: --- Assignee: Apache Spark (was: Reynold Xin) Rename PlatformDependent.UNSAFE -> Platform --- Key: SPARK-9815 URL: https://issues.apache.org/jira/browse/SPARK-9815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark PlatformDependent.UNSAFE is way too verbose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9815) Rename PlatformDependent.UNSAFE -> Platform
Reynold Xin created SPARK-9815: -- Summary: Rename PlatformDependent.UNSAFE -> Platform Key: SPARK-9815 URL: https://issues.apache.org/jira/browse/SPARK-9815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin PlatformDependent.UNSAFE is way too verbose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9815) Rename PlatformDependent.UNSAFE -> Platform
[ https://issues.apache.org/jira/browse/SPARK-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681250#comment-14681250 ] Apache Spark commented on SPARK-9815: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/8094 Rename PlatformDependent.UNSAFE -> Platform --- Key: SPARK-9815 URL: https://issues.apache.org/jira/browse/SPARK-9815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin PlatformDependent.UNSAFE is way too verbose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9815) Rename PlatformDependent.UNSAFE -> Platform
[ https://issues.apache.org/jira/browse/SPARK-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9815: --- Assignee: Reynold Xin (was: Apache Spark) Rename PlatformDependent.UNSAFE -> Platform --- Key: SPARK-9815 URL: https://issues.apache.org/jira/browse/SPARK-9815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin PlatformDependent.UNSAFE is way too verbose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-9814: Description: When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. However, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, {{EqualNullSafe}} is not passed to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}}, {code} def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] {code} even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. {code:sql} SELECT * FROM table WHERE field = 1; {code} 2. {code:sql} SELECT * FROM table WHERE field <=> 1; {code} The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. was: When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in org.apache.spark.sql.sources, which are appropriately built and picked up by selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy. However, it does not pass the EqualNullSafe filter in org.apache.spark.sql.catalyst.expressions, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, EqualNullSafe is not passed to buildScan in PrunedFilteredScan and PrunedScan, even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that CatalystScan can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. SELECT * FROM table WHERE field = 1; 2. SELECT * FROM table WHERE field <=> 1; The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. EqualNotNull not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: Input/Output Environment: Centos 6.6 Reporter: Hyukjin Kwon Priority: Minor When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. However, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though passing it appears feasible for data sources such as Parquet and JSON. 
In more detail, {{EqualNullSafe}} is not passed to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}}, {code} def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] {code} even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. {code:sql} SELECT * FROM table WHERE field = 1; {code} 2. {code:sql} SELECT * FROM table WHERE field <=> 1; {code} The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681249#comment-14681249 ] Apache Spark commented on SPARK-9790: - User 'markgrover' has created a pull request for this issue: https://github.com/apache/spark/pull/8093 [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover When an executor is killed by yarn because it exceeds the memory overhead, the only thing spark knows is that the executor is lost. The user has to go track search through the NM logs to figure out that its been killed by yarn. It would be much nicer if the spark-driver could be notified why the executor was killed. Ideally it could both log an explanatory message, and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
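For readers hitting the scenario described here, the usual stop-gap while this UI improvement is pending is to give YARN more headroom per executor. A minimal sketch follows; the 1024 MB value is purely illustrative, not a recommendation:
{code}
import org.apache.spark.SparkConf

// Illustrative only: raise the per-executor memory overhead (in MB) that YARN
// enforces, so executors are less likely to be killed for exceeding it.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "1024")
{code}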
[jira] [Assigned] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9790: --- Assignee: (was: Apache Spark) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover When an executor is killed by yarn because it exceeds the memory overhead, the only thing spark knows is that the executor is lost. The user has to go track search through the NM logs to figure out that its been killed by yarn. It would be much nicer if the spark-driver could be notified why the executor was killed. Ideally it could both log an explanatory message, and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9790: --- Assignee: Apache Spark [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover Assignee: Apache Spark When an executor is killed by yarn because it exceeds the memory overhead, the only thing spark knows is that the executor is lost. The user has to go track search through the NM logs to figure out that its been killed by yarn. It would be much nicer if the spark-driver could be notified why the executor was killed. Ideally it could both log an explanatory message, and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9363) SortMergeJoin operator should support UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9363. Resolution: Fixed Fix Version/s: 1.5.0 SortMergeJoin operator should support UnsafeRow --- Key: SPARK-9363 URL: https://issues.apache.org/jira/browse/SPARK-9363 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.5.0 The SortMergeJoin operator should implement the supportsUnsafeRow and outputsUnsafeRow settings when appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9729) Sort Merge Join for Left and Right Outer Join
[ https://issues.apache.org/jira/browse/SPARK-9729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9729. Resolution: Fixed Fix Version/s: 1.5.0 Sort Merge Join for Left and Right Outer Join - Key: SPARK-9729 URL: https://issues.apache.org/jira/browse/SPARK-9729 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Grover updated SPARK-9790: --- Attachment: error_showing_in_UI.png Attaching an image of what the error message in the UI would now look like. [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover Attachments: error_showing_in_UI.png When an executor is killed by yarn because it exceeds the memory overhead, the only thing spark knows is that the executor is lost. The user has to go track search through the NM logs to figure out that its been killed by yarn. It would be much nicer if the spark-driver could be notified why the executor was killed. Ideally it could both log an explanatory message, and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681263#comment-14681263 ] Mark Grover edited comment on SPARK-9790 at 8/11/15 5:17 AM: - Attaching an [image|https://issues.apache.org/jira/secure/attachment/12749771/error_showing_in_UI.png] of what the error message in the UI would now look like. was (Author: mgrover): Attaching an image of what the error message in the UI would now look like. [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover Attachments: error_showing_in_UI.png When an executor is killed by yarn because it exceeds the memory overhead, the only thing spark knows is that the executor is lost. The user has to go track search through the NM logs to figure out that its been killed by yarn. It would be much nicer if the spark-driver could be notified why the executor was killed. Ideally it could both log an explanatory message, and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-9814: Environment: (was: Centos 6.6) EqualNotNull not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: Input/Output Reporter: Hyukjin Kwon Priority: Minor When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. However, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, {{EqualNullSafe}} is not passed to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}}, {code} def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] {code} even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. {code:sql} SELECT * FROM table WHERE field = 1; {code} 2. {code:sql} SELECT * FROM table WHERE field <=> 1; {code} The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-9814: Component/s: (was: Input/Output) SQL EqualNotNull not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Minor When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. However, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, {{EqualNullSafe}} is not passed to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}}, {code} def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] {code} even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. {code:sql} SELECT * FROM table WHERE field = 1; {code} 2. {code:sql} SELECT * FROM table WHERE field <=> 1; {code} The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7726) Maven Install Breaks When Upgrading Scala 2.11.2-->[2.11.3 or higher]
[ https://issues.apache.org/jira/browse/SPARK-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681277#comment-14681277 ] Apache Spark commented on SPARK-7726: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/8095 Maven Install Breaks When Upgrading Scala 2.11.2--[2.11.3 or higher] - Key: SPARK-7726 URL: https://issues.apache.org/jira/browse/SPARK-7726 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Iulian Dragos Priority: Blocker Fix For: 1.4.0 This one took a long time to track down. The Maven install phase is part of our release process. It runs the scala:doc target to generate doc jars. Between Scala 2.11.2 and Scala 2.11.3, the behavior of this plugin changed in a way that breaks our build. In both cases, it returned an error (there has been a long running error here that we've always ignored), however in 2.11.3 that error became fatal and failed the entire build process. The upgrade occurred in SPARK-7092. Here is a simple reproduction: {code} ./dev/change-version-to-2.11.sh mvn clean install -pl network/common -pl network/shuffle -DskipTests -Dscala-2.11 {code} This command exits success when Spark is at Scala 2.11.2 and fails with 2.11.3 or higher. In either case an error is printed: {code} [INFO] [INFO] --- scala-maven-plugin:3.2.0:doc-jar (attach-scaladocs) @ spark-network-shuffle_2.11 --- /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56: error: not found: type Type protected Type type() { return Type.UPLOAD_BLOCK; } ^ /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:37: error: not found: type Type protected Type type() { return Type.STREAM_HANDLE; } ^ /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: error: not found: type Type protected Type type() { return Type.REGISTER_EXECUTOR; } ^ /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: error: not found: type Type protected Type type() { return Type.OPEN_BLOCKS; } ^ model contains 22 documentable templates four errors found {code} Ideally we'd just dig in and fix this error. Unfortunately it's a very confusing error and I have no idea why it is appearing. I'd propose reverting SPARK-7092 in the mean time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9743) Scanning a HadoopFsRelation shouldn't require refreshing
[ https://issues.apache.org/jira/browse/SPARK-9743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-9743. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8035 [https://github.com/apache/spark/pull/8035] Scanning a HadoopFsRelation shouldn't require refreshing - Key: SPARK-9743 URL: https://issues.apache.org/jira/browse/SPARK-9743 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Fix For: 1.5.0 PR #7969 added {{HadoopFsRelation.refresh()}} calls in {{DataSourceStrategy}} to make the test case {{InsertSuite.save directly to the path of a JSON table}} pass. However, this forces every {{HadoopFsRelation}} table scan to do a refresh, which can be super expensive for tables with a large number of partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9714) Cannot insert into a table using pySpark
[ https://issues.apache.org/jira/browse/SPARK-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9714: Sprint: Spark 1.5 doc/QA sprint Cannot insert into a table using pySpark Key: SPARK-9714 URL: https://issues.apache.org/jira/browse/SPARK-9714 Project: Spark Issue Type: Bug Components: SQL Reporter: Yun Park Assignee: Yin Huai Priority: Blocker This is a bug on the master branch. After creating the table (yun is the table name) with the corresponding fields, I ran the following command.
{code}
from pyspark.sql import *
sc.parallelize([Row(id=1, name="test", description="")]).toDF().write.mode("append").saveAsTable("yun")
{code}
I get the following error: {code} Py4JJavaError: An error occurred while calling o100.saveAsTable. : org.apache.spark.SparkException: Task not serializable Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.Path Serialization stack: - object not serializable (class: org.apache.hadoop.fs.Path, value: /user/hive/warehouse/yun) - field (class: org.apache.hadoop.hive.ql.metadata.Table, name: path, type: class org.apache.hadoop.fs.Path) - object (class org.apache.hadoop.hive.ql.metadata.Table, yun) - field (class: org.apache.hadoop.hive.ql.metadata.Partition, name: table, type: class org.apache.hadoop.hive.ql.metadata.Table) - object (class org.apache.hadoop.hive.ql.metadata.Partition, yun()) - field (class: scala.collection.immutable.Stream$Cons, name: hd, type: class java.lang.Object) - object (class scala.collection.immutable.Stream$Cons, Stream(yun())) - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: $outer, type: class scala.collection.immutable.Stream) - object (class scala.collection.immutable.Stream$$anonfun$map$1, function0) - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0) - object (class scala.collection.immutable.Stream$Cons, Stream(HivePartition(List(),HiveStorageDescriptor(/user/hive/warehouse/yun,org.apache.hadoop.mapred.TextInputFormat,org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,Map(serialization.format -> 1) - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: $outer, type: class scala.collection.immutable.Stream) - object (class scala.collection.immutable.Stream$$anonfun$map$1, function0) - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0) - object (class scala.collection.immutable.Stream$Cons, Stream(/user/hive/warehouse/yun)) - field (class: org.apache.spark.sql.hive.MetastoreRelation, name: paths, type: interface scala.collection.Seq) - object (class org.apache.spark.sql.hive.MetastoreRelation, MetastoreRelation default, yun, None ) - field (class: org.apache.spark.sql.hive.execution.InsertIntoHiveTable, name: table, type: class org.apache.spark.sql.hive.MetastoreRelation) - object (class org.apache.spark.sql.hive.execution.InsertIntoHiveTable, InsertIntoHiveTable (MetastoreRelation default, yun, None), Map(), false, false ConvertToSafe TungstenProject [CAST(description#10, FloatType) AS description#16,CAST(id#11L, StringType) AS id#17,name#12] PhysicalRDD [description#10,id#11L,name#12], MapPartitionsRDD[17] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2 ) - field (class: org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3, name: $outer, type: class org.apache.spark.sql.hive.execution.InsertIntoHiveTable) - object (class 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3, function2) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301) ... 30 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9781) KCL Workers should be configurable from Spark configuration
[ https://issues.apache.org/jira/browse/SPARK-9781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Nekhaev updated SPARK-9781: - Description: Currently the KinesisClientLibConfiguration for KCL Workers is created within the KinesisReceiver and user is allowed to change only basic settings such as endpoint URL, stream name, credentials, etc. However, there is no way to tune some advanced settings, e.g. MaxRecords, IdleTimeBetweenReads, FailoverTime, etc. We can add these settings to the Spark configuration and parametrize KinesisClientLibConfiguration with them in KinesisReceiver. was: Currently the KinesisClientLibConfiguration for KCL Workers is created withing the KinesisReceiver and user is allowed to change only basic settings such as endpoint URL, stream name, credentials, etc. However, there is no way to tune some advanced settings, e.g. MaxRecords, IdleTimeBetweenReads, FailoverTime, etc. We can add this settings to the Spark configuration and parametrize KinesisClientLibConfiguration with them in KinesisReceiver. KCL Workers should be configurable from Spark configuration --- Key: SPARK-9781 URL: https://issues.apache.org/jira/browse/SPARK-9781 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Anton Nekhaev Labels: kinesis Currently the KinesisClientLibConfiguration for KCL Workers is created within the KinesisReceiver and user is allowed to change only basic settings such as endpoint URL, stream name, credentials, etc. However, there is no way to tune some advanced settings, e.g. MaxRecords, IdleTimeBetweenReads, FailoverTime, etc. We can add these settings to the Spark configuration and parametrize KinesisClientLibConfiguration with them in KinesisReceiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
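A sketch of what the proposal could look like inside KinesisReceiver. The spark.streaming.kinesis.* configuration keys and default values below are hypothetical names chosen for illustration, not existing settings:
{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration
import org.apache.spark.SparkConf

// Hypothetical mapping from Spark configuration to advanced KCL settings.
def kclConfig(conf: SparkConf, base: KinesisClientLibConfiguration): KinesisClientLibConfiguration = {
  base
    .withMaxRecords(conf.getInt("spark.streaming.kinesis.maxRecords", 10000))
    .withIdleTimeBetweenReadsInMillis(
      conf.getLong("spark.streaming.kinesis.idleTimeBetweenReadsMs", 1000L))
    .withFailoverTimeMillis(conf.getLong("spark.streaming.kinesis.failoverTimeMs", 10000L))
}
{code}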
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680355#comment-14680355 ] Ryan Blue commented on SPARK-9340: -- Sorry to jump in late on this issue... I think you're on the right track here, but just to be sure I'll clarify things as I see them. The specs written for PARQUET-113 allow non-LIST/MAP repeated fields because that's what parquet-protobuf uses. But, we didn't implement support for unannotated repeated groups because we wanted to address the compatibility issues between Hive, Thrift, and Avro as quickly as possible (which are still being cleaned up). So for now, unannotated repeated groups throw the AnalysisException noted above. Those should eventually map to required lists of required elements to give the exact same view of the data that you have in parquet-protobuf. I believe [~damianguy], would like to discuss a different mapping from the protobuf schema to a parquet schema, which is a great discussion to have in the upstream Parquet project. That sounds like a reasonable extension to me, but I want to see what the protobuf model maintainers think of it. ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9600) DataFrameWriter.saveAsTable always writes data to /user/hive/warehouse
[ https://issues.apache.org/jira/browse/SPARK-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9600: Priority: Blocker (was: Critical) DataFrameWriter.saveAsTable always writes data to /user/hive/warehouse Key: SPARK-9600 URL: https://issues.apache.org/jira/browse/SPARK-9600 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1, 1.5.0 Reporter: Cheng Lian Assignee: Sudhakar Thota Priority: Blocker Attachments: SPARK-9600-fl1.txt Get a clean Spark 1.4.1 build:
{noformat}
$ git checkout v1.4.1
$ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly
{noformat}
Stop any running local Hadoop instance and unset all Hadoop environment variables, so that we force Spark to run with the local file system only:
{noformat}
$ unset HADOOP_CONF_DIR
$ unset HADOOP_PREFIX
$ unset HADOOP_LIBEXEC_DIR
$ unset HADOOP_CLASSPATH
{noformat}
In this way we also ensure that the default Hive warehouse location points to local file system {{file:///user/hive/warehouse}}. Now we create warehouse directories for testing:
{noformat}
$ sudo rm -rf /user  # !! WARNING: IT'S /user RATHER THAN /usr !!
$ sudo mkdir -p /user/hive/{warehouse,warehouse_hive13}
$ sudo chown -R lian:staff /user
$ tree /user
/user
└── hive
    ├── warehouse
    └── warehouse_hive13
{noformat}
Create a minimal {{hive-site.xml}}, only override the warehouse location, put it under {{$SPARK_HOME/conf}}:
{noformat}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file:///user/hive/warehouse_hive13</value>
  </property>
</configuration>
{noformat}
Now run our test snippets with {{pyspark}}:
{noformat}
$ ./bin/pyspark
In [1]: sqlContext.range(10).coalesce(1).write.saveAsTable("ds")
{noformat}
Check warehouse directories:
{noformat}
$ tree /user
/user
└── hive
    ├── warehouse
    │   └── ds
    │       ├── _SUCCESS
    │       ├── _common_metadata
    │       ├── _metadata
    │       └── part-r-0-46e4b32a-5c4d-4dba-b8d6-8d30ae910dc9.gz.parquet
    └── warehouse_hive13
        └── ds
{noformat}
Here you may notice the weird part: we have {{ds}} under both {{warehouse}} and {{warehouse_hive13}}, but data are only written into the former. Now let's try HiveQL:
{noformat}
In [2]: sqlContext.range(10).coalesce(1).registerTempTable("t")
In [3]: sqlContext.sql("CREATE TABLE ds_ctas AS SELECT * FROM t")
{noformat}
Check the directories again:
{noformat}
$ tree /user
/user
└── hive
    ├── warehouse
    │   └── ds
    │       ├── _SUCCESS
    │       ├── _common_metadata
    │       ├── _metadata
    │       └── part-r-0-46e4b32a-5c4d-4dba-b8d6-8d30ae910dc9.gz.parquet
    └── warehouse_hive13
        ├── ds
        └── ds_ctas
            ├── _SUCCESS
            └── part-0
{noformat}
So HiveQL works fine. (Hive never writes Parquet summary files, so {{_common_metadata}} and {{_metadata}} are missing in {{ds_ctas}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9782) Add support for YARN application tags running Spark on YARN
Dennis Huo created SPARK-9782: - Summary: Add support for YARN application tags running Spark on YARN Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
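A rough sketch of the reflection approach described above, assuming `appContext` is the ApplicationSubmissionContext, `tags` is a Set[String] parsed from configuration, and `logWarning` comes from Spark's Logging trait; this is illustrative, not the submitted patch:
{code}
import scala.collection.JavaConverters._

// Sketch only: call setApplicationTags when it exists (Hadoop 2.4+), and
// degrade to a warning on older YARN versions.
try {
  val method = appContext.getClass.getMethod(
    "setApplicationTags", classOf[java.util.Set[String]])
  method.invoke(appContext, new java.util.HashSet[String](tags.asJava))
} catch {
  case _: NoSuchMethodException =>
    logWarning("Ignoring application tags; this version of YARN does not support them")
}
{code}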
[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680423#comment-14680423 ] Joseph K. Bradley commented on SPARK-7751: -- [~mengxr] I haven't reviewed these PRs, but are people copying docs whenever they add since tags to overridden methods? Before, the overridden methods would inherit documentation, but with a one-line since tag added, they no longer inherit docs. This PR brought the problem to my attention: [https://github.com/apache/spark/pull/8045/files]. Adding since tags to all methods in MLlib will mean we always copy documentation and never rely on it being inherited. Add @since to stable and experimental methods in MLlib -- Key: SPARK-7751 URL: https://issues.apache.org/jira/browse/SPARK-7751 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Labels: starter This is useful to check whether a feature exists in some version of Spark. This is an umbrella JIRA to track the progress. We want to have @since tag for both stable (those without any Experimental/DeveloperApi/AlphaComponent annotations) and experimental methods in MLlib: (Do NOT tag private or package private classes or methods.) * an example PR for Scala: https://github.com/apache/spark/pull/6101 * an example PR for Python: https://github.com/apache/spark/pull/6295 We need to dig the history of git commit to figure out what was the Spark version when a method was first introduced. Take `NaiveBayes.setModelType` as an example. We can grep `def setModelType` at different version git tags. {code} meng@xm:~/src/spark $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep def setModelType meng@xm:~/src/spark $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep def setModelType def setModelType(modelType: String): NaiveBayes = { {code} If there are better ways, please let us know. We cannot add all @since tags in a single PR, which is hard to review. So we made some subtasks for each package, for example `org.apache.spark.classification`. Feel free to add more sub-tasks for Python and the `spark.ml` package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
[ https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9783: -- Sprint: Spark 1.5 doc/QA sprint Environment: (was: PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar as what we did in {{ParquetRelation}}.) Description: PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar as what we did in {{ParquetRelation}}. Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call - Key: SPARK-9783 URL: https://issues.apache.org/jira/browse/SPARK-9783 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar as what we did in {{ParquetRelation}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
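A sketch of the approach being proposed: the class and constructor below are illustrative only, and `cachedStatuses` is assumed to hold the driver-side FileStatus cache that the relation already maintains.
{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Illustrative input format: serve the cached file listing instead of
// re-listing (and hence re-refreshing) the table directory on every scan.
class CachedStatusTextInputFormat(cachedStatuses: Seq[FileStatus]) extends TextInputFormat {
  override protected def listStatus(job: JobContext): java.util.List[FileStatus] =
    new java.util.ArrayList[FileStatus](cachedStatuses.asJava)
}
{code}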
[jira] [Commented] (SPARK-9622) DecisionTreeRegressor: provide variance of prediction
[ https://issues.apache.org/jira/browse/SPARK-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680510#comment-14680510 ] Joseph K. Bradley commented on SPARK-9622: -- OK, before you do though, it'd be worth discussing how those variances should be returned. E.g., just a Double column of variances? Pros: Simple, applicable to other distributions if we ever move beyond Variance (=Gaussian) as an impurity. Cons: Not extensible if we use other distributions and want to return more details about the distribution. Those are my thoughts. Currently, a Double column of variances seems best to me. But it'd be nice to hear your thoughts. DecisionTreeRegressor: provide variance of prediction - Key: SPARK-9622 URL: https://issues.apache.org/jira/browse/SPARK-9622 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Priority: Minor Variance of predicted value, as estimated from training data. Analogous to class probabilities for classification. See [SPARK-3727] for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
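To make the "plain Double column" option concrete, this is roughly what a caller would see; the column name is hypothetical, since nothing is decided in this discussion:
{code}
import org.apache.spark.sql.DataFrame

// Hypothetical: if variances come back as a plain Double column named
// "variance", consuming them is just an ordinary column selection.
def showPredictionWithVariance(predictions: DataFrame): Unit = {
  predictions.select("prediction", "variance").show()
}
{code}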
[jira] [Updated] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9340: -- Sprint: Spark 1.5 doc/QA sprint Target Version/s: 1.5.0 CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Assignee: Cheng Lian Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9785) HashPartitioning guarantees / compatibleWith violate those methods' contracts
Josh Rosen created SPARK-9785: - Summary: HashPartitioning guarantees / compatibleWith violate those methods' contracts Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test("HashPartitioning compatibility") { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
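The order sensitivity at the heart of the test can be seen in isolation with plain Scala collections (not the Catalyst classes): two sequences that are equal as sets still hash differently, which is exactly what the projection-based hash codes above run into.
{code}
// Two sequences that are equal as *sets* but not as *sequences*:
val exprsA = Seq(2, 3)
val exprsB = Seq(3, 2)

assert(exprsA.toSet == exprsB.toSet)            // same set of "expressions"
println(exprsA.hashCode() == exprsB.hashCode()) // false here: Seq hashing is order-dependent
{code}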
[jira] [Created] (SPARK-9788) LDA docConcentration, gammaShape 1.5 binary incompatibility fixes
Joseph K. Bradley created SPARK-9788: Summary: LDA docConcentration, gammaShape 1.5 binary incompatibility fixes Key: SPARK-9788 URL: https://issues.apache.org/jira/browse/SPARK-9788 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley From [SPARK-9658]: 1. LDA.docConcentration It will be nice to keep the old APIs unchanged. Proposal: * Add “asymmetricDocConcentration” and revert docConcentration changes. * If the (internal) doc concentration vector is a single value, “getDocConcentration” returns it. If it is a constant vector, getDocConcentration returns the first item; otherwise it fails. 2. LDAModel.gammaShape This should be given a default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
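A minimal sketch of the proposed accessor semantics, written as a standalone helper; the helper name and error message are illustrative, not the final API.
{code}
import org.apache.spark.mllib.linalg.Vector

// Proposed behavior: a single value or an all-equal vector collapses to one Double;
// a genuinely asymmetric vector should fail and point users at asymmetricDocConcentration.
def symmetricDocConcentration(docConcentration: Vector): Double = {
  require(docConcentration.size > 0, "docConcentration must be non-empty")
  val first = docConcentration(0)
  val isConstant = (1 until docConcentration.size).forall(i => docConcentration(i) == first)
  if (isConstant) {
    first
  } else {
    throw new UnsupportedOperationException(
      "docConcentration is asymmetric; use asymmetricDocConcentration instead")
  }
}
{code}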
[jira] [Commented] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680413#comment-14680413 ] Apache Spark commented on SPARK-9782: - User 'dennishuo' has created a pull request for this issue: https://github.com/apache/spark/pull/8072 Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called in elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
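A hedged sketch of the reflection pattern described above; the helper name and warning text are illustrative, and in Spark the warning would go through logWarning rather than println.
{code}
import scala.collection.JavaConverters._

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

// Call ApplicationSubmissionContext.setApplicationTags(java.util.Set[String]) reflectively,
// since the method only exists in Hadoop 2.4+; older versions just get a warning.
def trySetApplicationTags(appContext: ApplicationSubmissionContext, tags: Set[String]): Unit = {
  try {
    val method = appContext.getClass
      .getMethod("setApplicationTags", classOf[java.util.Set[String]])
    method.invoke(appContext, tags.asJava)
  } catch {
    case _: NoSuchMethodException =>
      println(s"WARN: this YARN version does not support application tags; ignoring $tags")
  }
}
{code}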
[jira] [Updated] (SPARK-9755) Add method documentation to MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9755: - Shepherd: Joseph K. Bradley Assignee: Feynman Liang Add method documentation to MultivariateOnlineSummarizer Key: SPARK-9755 URL: https://issues.apache.org/jira/browse/SPARK-9755 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Docs present in 1.4 are lost in current 1.5 branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
[ https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680457#comment-14680457 ] Cheng Lian commented on SPARK-9783: --- cc [~yhuai] Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call - Key: SPARK-9783 URL: https://issues.apache.org/jira/browse/SPARK-9783 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar as what we did in {{ParquetRelation}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9720: - Description: It would be nice to include the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not include the UID. (was: It would be nice to print the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not print the UID.) spark.ml Identifiable types should have UID in toString methods --- Key: SPARK-9720 URL: https://issues.apache.org/jira/browse/SPARK-9720 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter It would be nice to include the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not include the UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
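A minimal sketch of the convention being asked for, using a made-up class name; the point is simply that a type overriding toString should still surface the uid that Identifiable's default toString provides.
{code}
import org.apache.spark.ml.util.Identifiable

// Hypothetical example type: keep the uid visible even when adding extra detail.
class ExampleModel(override val uid: String) extends Identifiable {
  private val numFeatures: Int = 10 // illustrative state
  override def toString: String = s"$uid: numFeatures=$numFeatures"
}
{code}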
[jira] [Updated] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9720: - Assignee: Bertrand Dechoux spark.ml Identifiable types should have UID in toString methods --- Key: SPARK-9720 URL: https://issues.apache.org/jira/browse/SPARK-9720 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Bertrand Dechoux Priority: Minor Labels: starter It would be nice to include the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not include the UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680520#comment-14680520 ] Joseph K. Bradley commented on SPARK-9720: -- Oh sorry! I shouldn't have said print. spark.ml Identifiable types should have UID in toString methods --- Key: SPARK-9720 URL: https://issues.apache.org/jira/browse/SPARK-9720 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter It would be nice to include the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not include the UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9785) HashPartitioning compatibility should consider expression ordering
[ https://issues.apache.org/jira/browse/SPARK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9785: --- Assignee: Apache Spark (was: Josh Rosen) HashPartitioning compatibility should consider expression ordering -- Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test(HashPartitioning compatibility) { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
Mark Grover created SPARK-9790: -- Summary: [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover When an executor is killed by YARN because it exceeds the memory overhead, the only thing Spark knows is that the executor is lost. The user has to search through the NodeManager logs to figure out that it has been killed by YARN. It would be much nicer if the Spark driver could be notified why the executor was killed. Ideally it could both log an explanatory message and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9570) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.
[ https://issues.apache.org/jira/browse/SPARK-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680379#comment-14680379 ] Guru Medasani commented on SPARK-9570: -- I won't have time to look at this. Neelesh, can you look at it? Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'. - Key: SPARK-9570 URL: https://issues.apache.org/jira/browse/SPARK-9570 Project: Spark Issue Type: Improvement Components: Documentation, Spark Submit, YARN Affects Versions: 1.4.1 Reporter: Neelesh Srinivas Salian Priority: Minor Labels: starter There are still some inconsistencies in the documentation regarding submission of applications to YARN. SPARK-3629 was done to correct the same, but http://spark.apache.org/docs/latest/submitting-applications.html#master-urls still has the yarn-client and yarn-cluster master URLs as opposed to the norm of having --master yarn and --deploy-mode cluster / client. Need to change this appropriately (if needed) to avoid confusion: https://spark.apache.org/docs/latest/running-on-yarn.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680385#comment-14680385 ] Apache Spark commented on SPARK-9340: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8070 ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
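To make the suggested change concrete, here is a hedged sketch of the decision only; the real toDataType / toPrimitiveDataType methods carry more parameters and cases, and the Parquet package name depends on the Parquet version in use.
{code}
import org.apache.parquet.schema.Type
import org.apache.parquet.schema.Type.Repetition
import org.apache.spark.sql.types.{ArrayType, DataType}

// Treat an unannotated repeated primitive as a required list of required elements,
// instead of falling through to the plain primitive conversion.
def toDataTypeSketch(
    parquetType: Type,
    toPrimitive: Type => DataType,
    toComplex: Type => DataType): DataType = {
  if (parquetType.isPrimitive && !parquetType.isRepetition(Repetition.REPEATED)) {
    toPrimitive(parquetType)
  } else if (parquetType.isPrimitive) {
    ArrayType(toPrimitive(parquetType), containsNull = false)
  } else {
    toComplex(parquetType)
  }
}
{code}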
[jira] [Assigned] (SPARK-9570) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.
[ https://issues.apache.org/jira/browse/SPARK-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9570: --- Assignee: (was: Apache Spark) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'. - Key: SPARK-9570 URL: https://issues.apache.org/jira/browse/SPARK-9570 Project: Spark Issue Type: Improvement Components: Documentation, Spark Submit, YARN Affects Versions: 1.4.1 Reporter: Neelesh Srinivas Salian Priority: Minor Labels: starter There are still some inconsistencies in the documentation regarding submission of the applications for yarn. SPARK-3629 was done to correct the same but http://spark.apache.org/docs/latest/submitting-applications.html#master-urls still has yarn-client and yarn-client as opposed to the nor of having --master yarn and --deploy-mode cluster / client Need to change this appropriately (if needed) to avoid confusion: https://spark.apache.org/docs/latest/running-on-yarn.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9570) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.
[ https://issues.apache.org/jira/browse/SPARK-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680392#comment-14680392 ] Apache Spark commented on SPARK-9570: - User 'nssalian' has created a pull request for this issue: https://github.com/apache/spark/pull/8071 Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'. - Key: SPARK-9570 URL: https://issues.apache.org/jira/browse/SPARK-9570 Project: Spark Issue Type: Improvement Components: Documentation, Spark Submit, YARN Affects Versions: 1.4.1 Reporter: Neelesh Srinivas Salian Priority: Minor Labels: starter There are still some inconsistencies in the documentation regarding submission of the applications for yarn. SPARK-3629 was done to correct the same but http://spark.apache.org/docs/latest/submitting-applications.html#master-urls still has yarn-client and yarn-client as opposed to the nor of having --master yarn and --deploy-mode cluster / client Need to change this appropriately (if needed) to avoid confusion: https://spark.apache.org/docs/latest/running-on-yarn.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9570) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.
[ https://issues.apache.org/jira/browse/SPARK-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9570: --- Assignee: Apache Spark Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'. - Key: SPARK-9570 URL: https://issues.apache.org/jira/browse/SPARK-9570 Project: Spark Issue Type: Improvement Components: Documentation, Spark Submit, YARN Affects Versions: 1.4.1 Reporter: Neelesh Srinivas Salian Assignee: Apache Spark Priority: Minor Labels: starter There are still some inconsistencies in the documentation regarding submission of the applications for yarn. SPARK-3629 was done to correct the same but http://spark.apache.org/docs/latest/submitting-applications.html#master-urls still has yarn-client and yarn-client as opposed to the nor of having --master yarn and --deploy-mode cluster / client Need to change this appropriately (if needed) to avoid confusion: https://spark.apache.org/docs/latest/running-on-yarn.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680418#comment-14680418 ] Cheng Lian commented on SPARK-9340: --- Thanks for the clarification. In [PR #8070|https://github.com/apache/spark/pull/8070] I just try to do the required list of required elements conversion. I understand that cleaning up all those compatibility stuff can be super time consuming, and making sure the most common scenarios work first totally makes sense. I'm so glad that all the backwards-compatibility rules had already been figured out there when I started to investigate these issues. These rules definitely saved my world! ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9340: -- Summary: CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly (was: ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
Cheng Lian created SPARK-9783: - Summary: Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call Key: SPARK-9783 URL: https://issues.apache.org/jira/browse/SPARK-9783 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar as what we did in {{ParquetRelation}}. Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9784) Exchange.isUnsafe should check whether codegen and unsafe are enabled
[ https://issues.apache.org/jira/browse/SPARK-9784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9784: --- Assignee: Apache Spark (was: Josh Rosen) Exchange.isUnsafe should check whether codegen and unsafe are enabled - Key: SPARK-9784 URL: https://issues.apache.org/jira/browse/SPARK-9784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark Exchange needs to check whether unsafe mode is enabled in its {{tungstenMode}} method: {code} override def nodeName: String = if (tungstenMode) TungstenExchange else Exchange /** * Returns true iff we can support the data type, and we are not doing range partitioning. */ private lazy val tungstenMode: Boolean = { GenerateUnsafeProjection.canSupport(child.schema) !newPartitioning.isInstanceOf[RangePartitioning] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680523#comment-14680523 ] Cheng Lian commented on SPARK-9340: --- Great, would you mind to leave a LGTM on the GitHub PR page? Appreciated! CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Assignee: Cheng Lian Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9785) HashPartitioning compatibility should be sensitive to expression ordering
[ https://issues.apache.org/jira/browse/SPARK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9785: -- Summary: HashPartitioning compatibility should be sensitive to expression ordering (was: HashPartitioning guarantees / compatibleWith violate those methods' contracts) HashPartitioning compatibility should be sensitive to expression ordering - Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test(HashPartitioning compatibility) { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9574) Review the contents of uber JARs spark-streaming-XXX-assembly
[ https://issues.apache.org/jira/browse/SPARK-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9574: --- Assignee: Apache Spark (was: Shixiong Zhu) Review the contents of uber JARs spark-streaming-XXX-assembly - Key: SPARK-9574 URL: https://issues.apache.org/jira/browse/SPARK-9574 Project: Spark Issue Type: Task Components: Streaming Reporter: Tathagata Das Assignee: Apache Spark It should not contain Spark core and its dependencies, especially the following. - Hadoop and its dependencies - Scala libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9574) Review the contents of uber JARs spark-streaming-XXX-assembly
[ https://issues.apache.org/jira/browse/SPARK-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680362#comment-14680362 ] Apache Spark commented on SPARK-9574: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/8069 Review the contents of uber JARs spark-streaming-XXX-assembly - Key: SPARK-9574 URL: https://issues.apache.org/jira/browse/SPARK-9574 Project: Spark Issue Type: Task Components: Streaming Reporter: Tathagata Das Assignee: Shixiong Zhu It should not contain Spark core and its dependencies, especially the following. - Hadoop and its dependencies - Scala libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9782: --- Assignee: (was: Apache Spark) Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called in elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9782: --- Assignee: Apache Spark Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo Assignee: Apache Spark https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called in elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680415#comment-14680415 ] Sean Owen commented on SPARK-9782: -- This is different from https://issues.apache.org/jira/browse/SPARK-7173 ? I also doubt this will go in any time soon if it needs Hadoop 2.x, since even 1.x is still supported, even with reflection -- the complexity may not be worth it. Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called in elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-9340: - Assignee: Cheng Lian CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Assignee: Cheng Lian Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9450) HashedRelation.get() could return an Iterator[Row] instead of Seq[Row]
[ https://issues.apache.org/jira/browse/SPARK-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-9450. --- Resolution: Invalid I'm going to resolve this as Invalid, since it turns out that we need to return an Iterable of Rows in order to support full outer join. HashedRelation.get() could return an Iterator[Row] instead of Seq[Row] -- Key: SPARK-9450 URL: https://issues.apache.org/jira/browse/SPARK-9450 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Andrew Or While looking through some HashedRelation code, [~andrewor14] and I noticed that it looks like HashedRelation.get() could return an Iterator of rows instead of a sequence. If we do this, we can reduce object allocation in UnsafeHashedRelation.get(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9784) Exchange.isUnsafe should check whether codegen and unsafe are enabled
Josh Rosen created SPARK-9784: - Summary: Exchange.isUnsafe should check whether codegen and unsafe are enabled Key: SPARK-9784 URL: https://issues.apache.org/jira/browse/SPARK-9784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Exchange needs to check whether unsafe mode is enabled in its {{tungstenMode}} method: {code} override def nodeName: String = if (tungstenMode) "TungstenExchange" else "Exchange" /** * Returns true iff we can support the data type, and we are not doing range partitioning. */ private lazy val tungstenMode: Boolean = { GenerateUnsafeProjection.canSupport(child.schema) && !newPartitioning.isInstanceOf[RangePartitioning] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
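A hedged sketch of the intended predicate, written as a standalone function because the real fix lives inside Exchange; the flag names unsafeEnabled and codegenEnabled are assumptions about how the configuration would be surfaced, not the final patch.
{code}
// Tungsten exchange should only be chosen when both features are switched on
// *and* the schema and partitioning allow it.
def shouldUseTungstenExchange(
    unsafeEnabled: Boolean,
    codegenEnabled: Boolean,
    schemaSupported: Boolean,
    isRangePartitioning: Boolean): Boolean = {
  unsafeEnabled && codegenEnabled && schemaSupported && !isRangePartitioning
}
{code}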
[jira] [Comment Edited] (SPARK-9658) ML 1.5 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659182#comment-14659182 ] Joseph K. Bradley edited comment on SPARK-9658 at 8/10/15 6:49 PM: --- 1 was intentional. I'm OK with creating a new param though. 2. gammaShape could be given a default. topicConcentration is necessary, I'd say. 3. AFAIK this is a Scala compiler bug, not something we can fix easily. 4. That was intentional. We could put it back, though that would create duplicate parameters with sort of confusing semantics. 5. I like having it in a single place to share the implementation. I know it's simple, but it's easy to mess up by swapping the 2 values. was (Author: josephkb): 1 was intentional. I'm OK with creating a new param though. 2. gammaShape could be given a default. topicConcentration is necessary, I'd say. 3. AFAIK this is a Scala compiler bug, not something we can fix easily. 4. That was intentional. We could put it back, though that would create duplicate parameters with sort of confusing semantics. 5. Sounds good. ML 1.5 QA: API: Binary incompatible changes --- Key: SPARK-9658 URL: https://issues.apache.org/jira/browse/SPARK-9658 Project: Spark Issue Type: Sub-task Components: ML, MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Blocker Generated a list of binary incompatible changes using MiMa and filter out some false positives: 1. LDA.docConcentration It will be nice to keep the old APIs unchanged. For example, we can use “asymmetricDocConcentration”. Then “getDocConcentration would return the first item if the concentration vector is a constant vector. 2. LDAModel.gammaShape / topicConcentration Should be okay if we assume that no one extends LDAModel. 3. Params.setDefault If we have time to investigate this issue. We should put it back. 4. LogisticRegressionModel.threshold is missing. 5. LogisticRegression.setThreshold shouldn't be in the Params trait. We need to override it anyway. Will create sub-tasks for each item. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9785) HashPartitioning compatibility should consider expression ordering
[ https://issues.apache.org/jira/browse/SPARK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9785: -- Summary: HashPartitioning compatibility should consider expression ordering (was: HashPartitioning compatibility should be sensitive to expression ordering) HashPartitioning compatibility should consider expression ordering -- Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test(HashPartitioning compatibility) { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9789) Reinstate LogisticRegression threshold Param
Joseph K. Bradley created SPARK-9789: Summary: Reinstate LogisticRegression threshold Param Key: SPARK-9789 URL: https://issues.apache.org/jira/browse/SPARK-9789 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley From [SPARK-9658]: LogisticRegression.threshold was replaced by thresholds, but we could keep threshold for backwards compatibility. We should add it back, but we should maintain the current semantics whereby thresholds overrides threshold. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
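One consistent way to let thresholds override threshold in the binary case, sketched as a standalone helper with illustrative names; the conversion shown is one possible mapping from a two-element thresholds array back to a single probability cutoff.
{code}
// If `thresholds` is set it wins; otherwise fall back to the scalar `threshold`.
def effectiveThreshold(threshold: Double, thresholds: Option[Array[Double]]): Double = {
  thresholds match {
    case Some(ts) =>
      require(ts.length == 2, "threshold is only defined for binary classification")
      // Collapse the two per-class thresholds into a single probability cutoff.
      1.0 / (1.0 + ts(0) / ts(1))
    case None => threshold
  }
}
{code}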
[jira] [Resolved] (SPARK-9710) RPackageUtilsSuite fails if R is not installed
[ https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9710. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8008 [https://github.com/apache/spark/pull/8008] RPackageUtilsSuite fails if R is not installed -- Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin Fix For: 1.5.0 That's because there's a bug in RUtils.scala. PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9710) RPackageUtilsSuite fails if R is not installed
[ https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9710: - Assignee: Marcelo Vanzin RPackageUtilsSuite fails if R is not installed -- Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.5.0 That's because there's a bug in RUtils.scala. PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7536) Audit MLlib Python API for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7536. -- Resolution: Done Thanks [~yanboliang] for putting this (very successful) list together and copying over the few remaining items to the next release list! Audit MLlib Python API for 1.4 -- Key: SPARK-7536 URL: https://issues.apache.org/jira/browse/SPARK-7536 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Yanbo Liang **NOTE: This is targeted at 1.5.0 because it has so many useful links for JIRAs targeted for 1.5.0. In the future, we should create a _new_ JIRA for linking future items.** For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? SPARK-7667 * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. [SPARK-7666], [SPARK-6173] * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. SPARK-7665 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python. ** classification *** StreamingLogisticRegressionWithSGD SPARK-7633 ** clustering *** GaussianMixture SPARK-6258 *** LDA SPARK-6259 *** Power Iteration Clustering SPARK-5962 *** StreamingKMeans SPARK-4118 ** evaluation *** MultilabelMetrics SPARK-6094 ** feature *** ElementwiseProduct SPARK-7605 *** PCA SPARK-7604 ** linalg *** Distributed linear algebra SPARK-6100 ** pmml.export SPARK-7638 ** regression *** StreamingLinearRegressionWithSGD SPARK-4127 ** stat *** KernelDensity SPARK-7639 ** util *** MLUtils SPARK-6263 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9784) Exchange.isUnsafe should check whether codegen and unsafe are enabled
[ https://issues.apache.org/jira/browse/SPARK-9784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680482#comment-14680482 ] Apache Spark commented on SPARK-9784: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/8073 Exchange.isUnsafe should check whether codegen and unsafe are enabled - Key: SPARK-9784 URL: https://issues.apache.org/jira/browse/SPARK-9784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Exchange needs to check whether unsafe mode is enabled in its {{tungstenMode}} method: {code} override def nodeName: String = if (tungstenMode) TungstenExchange else Exchange /** * Returns true iff we can support the data type, and we are not doing range partitioning. */ private lazy val tungstenMode: Boolean = { GenerateUnsafeProjection.canSupport(child.schema) !newPartitioning.isInstanceOf[RangePartitioning] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9784) Exchange.isUnsafe should check whether codegen and unsafe are enabled
[ https://issues.apache.org/jira/browse/SPARK-9784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9784: --- Assignee: Josh Rosen (was: Apache Spark) Exchange.isUnsafe should check whether codegen and unsafe are enabled - Key: SPARK-9784 URL: https://issues.apache.org/jira/browse/SPARK-9784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Exchange needs to check whether unsafe mode is enabled in its {{tungstenMode}} method: {code} override def nodeName: String = if (tungstenMode) TungstenExchange else Exchange /** * Returns true iff we can support the data type, and we are not doing range partitioning. */ private lazy val tungstenMode: Boolean = { GenerateUnsafeProjection.canSupport(child.schema) !newPartitioning.isInstanceOf[RangePartitioning] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7778) Add standard deviation aggregate expression
[ https://issues.apache.org/jira/browse/SPARK-7778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680491#comment-14680491 ] Rakesh Chalasani commented on SPARK-7778: - Hi SsuTing: The aggregate expression interface has changed in 1.5 and the above PR is obsolete. SPARK-6548 which is tracking this now is still open. I guess that is a better place to keep track of it. Rakesh Add standard deviation aggregate expression Key: SPARK-7778 URL: https://issues.apache.org/jira/browse/SPARK-7778 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Rakesh Chalasani Add standard deviation aggregate expression over data frame columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9711) Unable to run spark after restarting cluster with spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-9711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangyang Li updated SPARK-9711: Description: With Spark 1.4.1 and YARN client mode, my application works at the first time the cluster is built. While if I stop and start the cluster with using spark-ec2, the same command fails. At the end of the spark logs, it's shown that it just keeps trying to connect to master node repeatedly: INFO Client: Retrying connect to server: ec2-54-174-232-129.compute-1.amazonaws.com/172.31.36.29:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) I restarted YARN and dfs manually after restarting the cluster, however, I was unable to restart Tachyon and it fails when running ./bin/tachyon runTests, which might be the possible reason. was: With Spark 1.4.1 and YARN client mode, my application works at the first time the cluster is built. While if I stop and start the cluster with using spark-ec2, the same command fails. At the end of the spark logs, it's shown that it just keeps trying to connect to master node repeatedly: INFO Client: Retrying connect to server: ec2-54-174-232-129.compute-1.amazonaws.com/172.31.36.29:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) Unable to run spark after restarting cluster with spark-ec2 --- Key: SPARK-9711 URL: https://issues.apache.org/jira/browse/SPARK-9711 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.1 Reporter: Guangyang Li With Spark 1.4.1 and YARN client mode, my application works at the first time the cluster is built. While if I stop and start the cluster with using spark-ec2, the same command fails. At the end of the spark logs, it's shown that it just keeps trying to connect to master node repeatedly: INFO Client: Retrying connect to server: ec2-54-174-232-129.compute-1.amazonaws.com/172.31.36.29:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) I restarted YARN and dfs manually after restarting the cluster, however, I was unable to restart Tachyon and it fails when running ./bin/tachyon runTests, which might be the possible reason. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9663) ML Python API coverage issues found during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659207#comment-14659207 ] Joseph K. Bradley edited comment on SPARK-9663 at 8/10/15 6:32 PM: --- (complete): Linked unfinished items from previous release [SPARK-7536] here. was (Author: josephkb): *TODO: We need to link unfinished items from [SPARK-7536] here (linked as contains those items).* ML Python API coverage issues found during 1.5 QA - Key: SPARK-9663 URL: https://issues.apache.org/jira/browse/SPARK-9663 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-9770 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-9771 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9574) Review the contents of uber JARs spark-streaming-XXX-assembly
[ https://issues.apache.org/jira/browse/SPARK-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9574: --- Assignee: Shixiong Zhu (was: Apache Spark) Review the contents of uber JARs spark-streaming-XXX-assembly - Key: SPARK-9574 URL: https://issues.apache.org/jira/browse/SPARK-9574 Project: Spark Issue Type: Task Components: Streaming Reporter: Tathagata Das Assignee: Shixiong Zhu It should not contain Spark core and its dependencies, especially the following. - Hadoop and its dependencies - Scala libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680389#comment-14680389 ] Cheng Lian commented on SPARK-9340: --- [~damianguy] Would you mind to help reviewing [PR #8070|https://github.com/apache/spark/pull/8070] and check whether it works for your case? Thanks in advance! ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
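The change suggested in the description can be sketched as follows. This is a minimal illustration rather than the actual ParquetTypesConverter patch: the parquet-mr package name assumes the 1.8 line, and the branch labels stand in for the real Catalyst conversions.
{code}
import org.apache.parquet.schema.Type
import org.apache.parquet.schema.Type.Repetition

// Classifies a Parquet type the way the report proposes: a repeated primitive
// must not fall into the plain-primitive branch. The real converter returns
// Catalyst DataTypes; strings are used here only to keep the sketch self-contained.
def classify(parquetType: Type): String = {
  if (parquetType.isPrimitive && !parquetType.isRepetition(Repetition.REPEATED)) {
    "plain primitive -> toPrimitiveDataType(...)"
  } else if (parquetType.isPrimitive) {
    "unannotated repeated primitive -> ArrayType(elementType, containsNull = false)"
  } else {
    "group type -> handled by the existing group/LIST/MAP rules"
  }
}
{code}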
[jira] [Commented] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680431#comment-14680431 ] Dennis Huo commented on SPARK-9782: --- Correct, from what I understand, the node labels JIRA is a more heavyweight behavioral-change feature, for being able to control packing of requested containers onto machines based on node labels. YARN application tags are distinct from node labels, and are only used by workflow orchestrators on top of YARN, without affecting how YARN does packing at all. Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called in elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
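A minimal sketch of the reflection-based call described above, assuming the tags arrive from some Spark configuration key and using println in place of logWarning; the actual patch may wire this differently.
{code}
import scala.collection.JavaConverters._

// appContext is an o.a.h.yarn.api.records.ApplicationSubmissionContext; it is typed
// as AnyRef here so the sketch compiles without a Hadoop 2.4+ dependency.
def setTagsIfSupported(appContext: AnyRef, tags: Set[String]): Unit = {
  if (tags.nonEmpty) {
    try {
      val method = appContext.getClass
        .getMethod("setApplicationTags", classOf[java.util.Set[String]])
      method.invoke(appContext, new java.util.HashSet[String](tags.asJava))
    } catch {
      case _: NoSuchMethodException =>
        // Hadoop < 2.4 has no application tags; degrade gracefully instead of failing.
        println(s"WARN: ignoring application tags $tags; requires YARN 2.4+")
    }
  }
}
{code}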
[jira] [Updated] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9340: -- Description: SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. was: The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9450) [INVALID] HashedRelation.get() could return an Iterator[Row] instead of Seq[Row]
[ https://issues.apache.org/jira/browse/SPARK-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9450: -- Summary: [INVALID] HashedRelation.get() could return an Iterator[Row] instead of Seq[Row] (was: HashedRelation.get() could return an Iterator[Row] instead of Seq[Row]) [INVALID] HashedRelation.get() could return an Iterator[Row] instead of Seq[Row] Key: SPARK-9450 URL: https://issues.apache.org/jira/browse/SPARK-9450 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Andrew Or While looking through some HashedRelation code, [~andrewor14] and I noticed that it looks like HashedRelation.get() could return an Iterator of rows instead of a sequence. If we do this, we can reduce object allocation in UnsafeHashedRelation.get(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9755) Add method documentation to MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9755. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8045 [https://github.com/apache/spark/pull/8045] Add method documentation to MultivariateOnlineSummarizer Key: SPARK-9755 URL: https://issues.apache.org/jira/browse/SPARK-9755 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.5.0 Docs present in 1.4 are lost in current 1.5 branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680492#comment-14680492 ] Damian Guy commented on SPARK-9340: --- Code looks good and it works as expected. Tests pass. Thanks for your assistance with this. CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Assignee: Cheng Lian Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9786) Test backpressure
Tathagata Das created SPARK-9786: Summary: Test backpressure Key: SPARK-9786 URL: https://issues.apache.org/jira/browse/SPARK-9786 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9787) Test for memory leaks using the streaming tests in spark-perf
[ https://issues.apache.org/jira/browse/SPARK-9787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-9787: - Assignee: Shixiong Zhu Test for memory leaks using the streaming tests in spark-perf - Key: SPARK-9787 URL: https://issues.apache.org/jira/browse/SPARK-9787 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9787) Test for memory leaks using the streaming tests in spark-perf
Tathagata Das created SPARK-9787: Summary: Test for memory leaks using the streaming tests in spark-perf Key: SPARK-9787 URL: https://issues.apache.org/jira/browse/SPARK-9787 Project: Spark Issue Type: Sub-task Reporter: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9785) HashPartitioning compatibility should consider expression ordering
[ https://issues.apache.org/jira/browse/SPARK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680550#comment-14680550 ] Apache Spark commented on SPARK-9785: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/8074 HashPartitioning compatibility should consider expression ordering -- Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test(HashPartitioning compatibility) { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9785) HashPartitioning compatibility should consider expression ordering
[ https://issues.apache.org/jira/browse/SPARK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9785: --- Assignee: Josh Rosen (was: Apache Spark) HashPartitioning compatibility should consider expression ordering -- Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test(HashPartitioning compatibility) { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9750) SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9750: - Assignee: Feynman Liang SparseMatrix should override equals --- Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680277#comment-14680277 ] Cheng Lian commented on SPARK-9340: --- Ah, thanks a lot! I see the problem now. {{parquet-avro}} doesn't allow {{repeated}} fields outside {{LIST}} or {{MAP}}, and I was following {{parquet-avro}} when implementing all the compatibility rules. So I think the real problematic position is [here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala#L102-L104] (and [here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L217] in {{parquet-avro}}). This issue could have a simpler solution, especially the schema conversion part. Row converter needs bigger changes though. I'm working on a simplified version of PR #8063. Will attribute this issue to you since you spot this issue and #8063 inspired me a lot! ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9781) KCL Workers should be configurable from Spark configuration
Anton Nekhaev created SPARK-9781: Summary: KCL Workers should be configurable from Spark configuration Key: SPARK-9781 URL: https://issues.apache.org/jira/browse/SPARK-9781 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Anton Nekhaev Currently the KinesisClientLibConfiguration for KCL Workers is created within the KinesisReceiver and the user is allowed to change only basic settings such as endpoint URL, stream name, credentials, etc. However, there is no way to tune some advanced settings, e.g. MaxRecords, IdleTimeBetweenReads, FailoverTime, etc. We can add these settings to the Spark configuration and parametrize KinesisClientLibConfiguration with them in KinesisReceiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
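A sketch of the proposed wiring: read tuning knobs from SparkConf and apply them when building the KCL configuration. The spark.streaming.kinesis.* key names and the default values are placeholders; only the KinesisClientLibConfiguration builder methods come from the KCL API.
{code}
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration
import org.apache.spark.SparkConf

// Builds the KCL config from Spark configuration instead of hard-coding the
// advanced settings inside KinesisReceiver.
def buildKclConf(sparkConf: SparkConf, appName: String, streamName: String,
    workerId: String): KinesisClientLibConfiguration = {
  new KinesisClientLibConfiguration(
      appName, streamName, new DefaultAWSCredentialsProviderChain(), workerId)
    .withMaxRecords(sparkConf.getInt("spark.streaming.kinesis.maxRecords", 10000))
    .withIdleTimeBetweenReadsInMillis(
      sparkConf.getLong("spark.streaming.kinesis.idleTimeBetweenReadsMs", 1000L))
    .withFailoverTimeMillis(
      sparkConf.getLong("spark.streaming.kinesis.failoverTimeMs", 10000L))
}
{code}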
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680281#comment-14680281 ] Damian Guy commented on SPARK-9340: --- Thanks. I'm sure there is a simpler solution to someone more familiar with the code! ;-) Thanks for looking further into it, appreciated. ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9788) LDA docConcentration, gammaShape 1.5 binary incompatibility fixes
[ https://issues.apache.org/jira/browse/SPARK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680568#comment-14680568 ] Feynman Liang commented on SPARK-9788: -- Assign to me LDA docConcentration, gammaShape 1.5 binary incompatibility fixes - Key: SPARK-9788 URL: https://issues.apache.org/jira/browse/SPARK-9788 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley From [SPARK-9658]: 1. LDA.docConcentration It will be nice to keep the old APIs unchanged. Proposal: * Add “asymmetricDocConcentration” and revert docConcentration changes. * If the (internal) doc concentration vector is a single value, “getDocConcentration returns it. If it is a constant vector, getDocConcentration returns the first item, and fails otherwise. 2. LDAModel.gammaShape This should be given a default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
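A minimal sketch of the accessor behaviour proposed in the description, using a plain Array[Double] in place of the internal vector; the method and parameter names are assumptions, not the final API.
{code}
// Returns the symmetric concentration when one exists, and fails for a truly
// asymmetric vector, which callers would read via asymmetricDocConcentration instead.
def getDocConcentration(asymmetricDocConcentration: Array[Double]): Double = {
  asymmetricDocConcentration match {
    case Array(single) => single
    case arr if arr.nonEmpty && arr.forall(_ == arr.head) => arr.head
    case _ => throw new IllegalStateException(
      "docConcentration is asymmetric; use asymmetricDocConcentration instead")
  }
}
{code}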
[jira] [Commented] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680570#comment-14680570 ] Mark Grover commented on SPARK-9790: I am working on this, will file a Work-In-Progress pull request soon. [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover When an executor is killed by YARN because it exceeds the memory overhead, the only thing Spark knows is that the executor is lost. The user has to search through the NodeManager logs to figure out that it was killed by YARN. It would be much nicer if the Spark driver could be notified why the executor was killed. Ideally it could both log an explanatory message and update the UI (and the event log) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9793) PySpark DenseVector, SparseVector should override __eq__
Joseph K. Bradley created SPARK-9793: Summary: PySpark DenseVector, SparseVector should override __eq__ Key: SPARK-9793 URL: https://issues.apache.org/jira/browse/SPARK-9793 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Priority: Critical See [SPARK-9750]. PySpark DenseVector and SparseVector do not override the equality operator properly. They should use semantics, not representation, for comparison. (This is what Scala currently does.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9795) Dynamic allocation: avoid double counting when killing same executor
Andrew Or created SPARK-9795: Summary: Dynamic allocation: avoid double counting when killing same executor Key: SPARK-9795 URL: https://issues.apache.org/jira/browse/SPARK-9795 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Currently, if we kill the same executor twice in rapid succession, we will lower the executor target by 2 instead of 1. In cases where we don't re-adjust the target upwards frequently, this will result in jobs hanging. This may or may not be the same as SPARK-9745. Until we can verify the correlation, however, this will remain a separate issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
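A toy sketch of the guard the ticket calls for: decrement the target only once per executor id, however many kill requests arrive. The names are illustrative, not the ExecutorAllocationManager API.
{code}
import scala.collection.mutable

class TargetTracker(initialTarget: Int) {
  private var target = initialTarget
  private val pendingToRemove = mutable.Set.empty[String]

  def killExecutor(execId: String): Unit = {
    // Set.add returns false for a duplicate request, so the target is only
    // lowered the first time a given executor is asked to be killed.
    if (pendingToRemove.add(execId)) {
      target -= 1
    }
  }

  def currentTarget: Int = target
}
{code}
With this guard, killing "exec-1" twice in rapid succession lowers the target by 1 rather than 2, which is the behaviour the ticket describes as the fix.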
[jira] [Commented] (SPARK-9788) LDA docConcentration, gammaShape 1.5 binary incompatibility fixes
[ https://issues.apache.org/jira/browse/SPARK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680666#comment-14680666 ] Joseph K. Bradley commented on SPARK-9788: -- Yeah, I guess we should revert getAlpha and setAlpha as well. We can add asymmetric versions. We can fix this duplication for the Pipelines API. LDA docConcentration, gammaShape 1.5 binary incompatibility fixes - Key: SPARK-9788 URL: https://issues.apache.org/jira/browse/SPARK-9788 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Feynman Liang From [SPARK-9658]: 1. LDA.docConcentration It will be nice to keep the old APIs unchanged. Proposal: * Add “asymmetricDocConcentration” and revert docConcentration changes. * If the (internal) doc concentration vector is a single value, “getDocConcentration returns it. If it is a constant vector, getDocConcentration returns the first item, and fails otherwise. 2. LDAModel.gammaShape This should be given a default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9784) Exchange.isUnsafe should check whether codegen and unsafe are enabled
[ https://issues.apache.org/jira/browse/SPARK-9784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-9784. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8073 [https://github.com/apache/spark/pull/8073] Exchange.isUnsafe should check whether codegen and unsafe are enabled - Key: SPARK-9784 URL: https://issues.apache.org/jira/browse/SPARK-9784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Fix For: 1.5.0 Exchange needs to check whether unsafe mode is enabled in its {{tungstenMode}} method: {code} override def nodeName: String = if (tungstenMode) TungstenExchange else Exchange /** * Returns true iff we can support the data type, and we are not doing range partitioning. */ private lazy val tungstenMode: Boolean = { GenerateUnsafeProjection.canSupport(child.schema) !newPartitioning.isInstanceOf[RangePartitioning] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
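Note that the plain-text mail dropped the {{&&}} joining the two checks in the snippet above. A self-contained sketch of the amended condition follows; the boolean parameters stand in for the SQLConf flags and the existing schema/partitioning tests, whose exact accessors inside Exchange are assumptions here.
{code}
// Tungsten exchange should only be reported when unsafe and codegen are both on,
// in addition to the schema-support and non-range-partitioning checks that were
// already present.
def tungstenMode(
    unsafeEnabled: Boolean,
    codegenEnabled: Boolean,
    schemaSupported: Boolean,
    isRangePartitioning: Boolean): Boolean = {
  unsafeEnabled && codegenEnabled && schemaSupported && !isRangePartitioning
}
{code}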
[jira] [Updated] (SPARK-9794) ISO DateTime parser is too strict
[ https://issues.apache.org/jira/browse/SPARK-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9794: -- Affects Version/s: 1.2.2 1.3.1 1.4.1 ISO DateTime parser is too strict - Key: SPARK-9794 URL: https://issues.apache.org/jira/browse/SPARK-9794 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0 Reporter: Alex Angelini The DateTime parser requires 3 millisecond digits, but that is not part of the official ISO8601 spec. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L132 https://en.wikipedia.org/wiki/ISO_8601 This results in the following exception when trying to parse datetime columns {code} java.text.ParseException: Unparseable date: 0001-01-01T00:00:00GMT-00:00 {code} [~joshrosen] [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
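The strictness is easy to reproduce with java.text.SimpleDateFormat. The pattern below mirrors the ".SSS" requirement described above; treat the exact pattern used at the linked line as an assumption.
{code}
import java.text.SimpleDateFormat

val parser = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz")
parser.parse("2015-08-10T12:00:00.000GMT-00:00")  // parses
parser.parse("0001-01-01T00:00:00GMT-00:00")      // throws java.text.ParseException
{code}
A fix would need a fallback (or a less strict pattern) for timestamps that omit the millisecond digits, since ISO 8601 does not require them.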
[jira] [Commented] (SPARK-9794) ISO DateTime parser is too strict
[ https://issues.apache.org/jira/browse/SPARK-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680688#comment-14680688 ] Josh Rosen commented on SPARK-9794: --- The same code exists in 1.4.0 and 1.4.1: https://github.com/apache/spark/blob/v1.4.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateUtils.scala#L86 It's also present in 1.3.0 / 1.3.1: https://github.com/apache/spark/blob/v1.3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66 And in 1.2.x: https://github.com/apache/spark/blob/v1.2.2/sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala#L158 Here's the pull request that originally added that line: https://github.com/apache/spark/pull/3012 ISO DateTime parser is too strict - Key: SPARK-9794 URL: https://issues.apache.org/jira/browse/SPARK-9794 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0 Reporter: Alex Angelini The DateTime parser requires 3 millisecond digits, but that is not part of the official ISO8601 spec. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L132 https://en.wikipedia.org/wiki/ISO_8601 This results in the following exception when trying to parse datetime columns {code} java.text.ParseException: Unparseable date: 0001-01-01T00:00:00GMT-00:00 {code} [~joshrosen] [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9795) Dynamic allocation: avoid double counting when killing same executor twice
[ https://issues.apache.org/jira/browse/SPARK-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9795: - Summary: Dynamic allocation: avoid double counting when killing same executor twice (was: Dynamic allocation: avoid double counting when killing same executor) Dynamic allocation: avoid double counting when killing same executor twice -- Key: SPARK-9795 URL: https://issues.apache.org/jira/browse/SPARK-9795 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Currently, if we kill the same executor twice in rapid succession, we will lower the executor target by 2 instead of 1. In cases where we don't re-adjust the target upwards frequently, this will result in jobs hanging. This may or may not be the same as SPARK-9745. Until we can verify the correlation, however, this will remain a separate issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9795) Dynamic allocation: avoid double counting when killing same executor
[ https://issues.apache.org/jira/browse/SPARK-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9795: --- Assignee: Andrew Or (was: Apache Spark) Dynamic allocation: avoid double counting when killing same executor Key: SPARK-9795 URL: https://issues.apache.org/jira/browse/SPARK-9795 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Currently, if we kill the same executor twice in rapid succession, we will lower the executor target by 2 instead of 1. In cases where we don't re-adjust the target upwards frequently, this will result in jobs hanging. This may or may not be the same as SPARK-9745. Until we can verify the correlation, however, this will remain a separate issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9795) Dynamic allocation: avoid double counting when killing same executor
[ https://issues.apache.org/jira/browse/SPARK-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680738#comment-14680738 ] Apache Spark commented on SPARK-9795: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8078 Dynamic allocation: avoid double counting when killing same executor Key: SPARK-9795 URL: https://issues.apache.org/jira/browse/SPARK-9795 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Currently, if we kill the same executor twice in rapid succession, we will lower the executor target by 2 instead of 1. In cases where we don't re-adjust the target upwards frequently, this will result in jobs hanging. This may or may not be the same as SPARK-9745. Until we can verify the correlation, however, this will remain a separate issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9791) Review API for developer and experimental tags
Tathagata Das created SPARK-9791: Summary: Review API for developer and experimental tags Key: SPARK-9791 URL: https://issues.apache.org/jira/browse/SPARK-9791 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9750) DenseMatrix, SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9750: - Description: [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. Same for DenseMatrix. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. was: [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. DenseMatrix, SparseMatrix should override equals Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. Same for DenseMatrix. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
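A hedged sketch of the semantic comparison the ticket asks for, written as a standalone helper over the public Matrix API; the real change would override equals on DenseMatrix and SparseMatrix themselves.
{code}
import org.apache.spark.mllib.linalg.Matrix

// toArray materializes entries in column-major order for both layouts, so the
// comparison is unaffected by isTransposed or by the physical ordering of `values`.
def semanticallyEqual(a: Matrix, b: Matrix): Boolean = {
  a.numRows == b.numRows &&
    a.numCols == b.numCols &&
    java.util.Arrays.equals(a.toArray, b.toArray)
}
{code}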
[jira] [Updated] (SPARK-9750) DenseMatrix, SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9750: - Summary: DenseMatrix, SparseMatrix should override equals (was: SparseMatrix should override equals) DenseMatrix, SparseMatrix should override equals Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9766) check and add missing docs for PySpark ML
[ https://issues.apache.org/jira/browse/SPARK-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9766: - Target Version/s: 1.5.0 check and add missing docs for PySpark ML - Key: SPARK-9766 URL: https://issues.apache.org/jira/browse/SPARK-9766 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.5.0 Reporter: Yanbo Liang Assignee: Yanbo Liang Check and add missing docs for PySpark ML (this issue only checks missing docs for o.a.s.ml, not o.a.s.mllib). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org