[jira] [Updated] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level
     [ https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-19816:
---------------------------------
    Affects Version/s:     (was: 2.2.0)

> DataFrameCallbackSuite doesn't recover the log level
> ----------------------------------------------------
>
>                 Key: SPARK-19816
>                 URL: https://issues.apache.org/jira/browse/SPARK-19816
>             Project: Spark
>          Issue Type: Test
>          Components: SQL, Tests
>    Affects Versions: 2.1.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>             Fix For: 2.1.1, 2.2.0
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level
     [ https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-19816:
---------------------------------
    Affects Version/s: 2.1.0

> DataFrameCallbackSuite doesn't recover the log level
> ----------------------------------------------------
>
>                 Key: SPARK-19816
>                 URL: https://issues.apache.org/jira/browse/SPARK-19816
>             Project: Spark
>          Issue Type: Test
>          Components: SQL, Tests
>    Affects Versions: 2.1.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>             Fix For: 2.1.1, 2.2.0
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs.
[jira] [Commented] (SPARK-19815) Not orderable should be applied to right key instead of left key
     [ https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895522#comment-15895522 ]

Zhan Zhang commented on SPARK-19815:
------------------------------------

I thought about the logic again, and on reflection it appears correct: in a join, the left and right keys must have the same type, so checking orderability of the left key is equivalent to checking the right key. Please close this JIRA.

> Not orderable should be applied to right key instead of left key
> ----------------------------------------------------------------
>
>                 Key: SPARK-19815
>                 URL: https://issues.apache.org/jira/browse/SPARK-19815
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Zhan Zhang
>            Priority: Minor
>
> When generating ShuffledHashJoinExec, the orderable condition should be applied to the right key instead of the left key.
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
     [ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895502#comment-15895502 ]

Imran Rashid commented on SPARK-19659:
--------------------------------------

I think Reynold has a good point. I really don't like the idea of always having the MapStatus track 2k sizes -- I already have to regularly recommend that users bump their partition count above 2k to avoid an OOM from too many CompressedMapStatus. Going over 2k partitions generally gives big memory savings from using HighlyCompressedMapStatus.

Your point about deciding how many outliers to track is valid; but I think there are a lot of other options you might consider as well, e.g., track all the sizes that are more than 2x the average, or track a few different size buckets and keep a bit set for each bucket, etc. These should allow the MapStatus to stay very compact, but with bounded error on the sizes.

For implementation, I'd also break your proposal down into smaller pieces. In fact, the three ideas are all useful independently (though they are more robust together). But there are two larger pieces I see missing: 1) how will we test the changes out -- not for correctness, but for performance / stability benefits? 2) are there metrics we should be collecting, but currently are not, so we can better answer these questions? E.g., the distribution of sizes in MapStatus is not stored anywhere for later analysis (though it's not easy to come up with a good way to store them, since there are n^2 sizes in one shuffle); how much memory is used by the network layer; how much error there is in the sizes from the MapStatus; etc. I think some parts can be implemented anyway, behind feature flags (perhaps undocumented), but it's something to keep in mind.

> Fetch big blocks to disk when shuffle-read
> ------------------------------------------
>
>                 Key: SPARK-19659
>                 URL: https://issues.apache.org/jira/browse/SPARK-19659
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.1.0
>            Reporter: jin xing
>         Attachments: SPARK-19659-design-v1.pdf
>
> Currently the whole block is fetched into memory (off-heap by default) when doing a shuffle-read. A block is identified by (shuffleId, mapId, reduceId), so it can be large in skew situations. If an OOM happens during shuffle read, the job will be killed and users will be notified to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more memory can resolve the OOM, but that approach is not well suited to production environments, especially data warehouses.
> Using Spark SQL as the data engine in a warehouse, users hope to have a unified parameter (e.g. memory) with less resource wasted (resource allocated but not used). It's not always easy to predict skew situations; when they happen, it makes sense to fetch remote blocks to disk for shuffle-read, rather than kill the job because of OOM. This approach was mentioned during the discussion in SPARK-3019, by [~sandyr] and [~mridulm80].
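The compact-status options Imran sketches above (average plus exact sizes only for outliers more than 2x the average) can be illustrated in a few lines. This is a hypothetical simulation of the idea, not Spark's actual HighlyCompressedMapStatus code; the names `compress_sizes` and `estimate_size` are invented for illustration:

```python
# Hypothetical sketch: store the average block size plus exact sizes
# only for outlier blocks (> outlier_factor x the average), so the
# status stays compact but keeps bounded error on the big blocks.

def compress_sizes(sizes, outlier_factor=2.0):
    """Return (avg, {index: exact_size}) keeping exact sizes for outliers."""
    avg = sum(sizes) / len(sizes)
    outliers = {i: s for i, s in enumerate(sizes) if s > outlier_factor * avg}
    return avg, outliers

def estimate_size(compressed, index):
    """Estimate one block's size: exact for outliers, average otherwise."""
    avg, outliers = compressed
    return outliers.get(index, avg)

sizes = [100, 120, 90, 5000, 110]       # one skewed block among five
compressed = compress_sizes(sizes)
print(estimate_size(compressed, 3))     # outlier kept exactly: 5000
print(estimate_size(compressed, 0))     # non-outlier approximated by the average
```

The memory cost is one float plus one entry per outlier, so skew is captured without tracking all 2k+ sizes; the trade-off Imran raises is exactly how many outliers end up in that dict under heavy skew.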
[jira] [Assigned] (SPARK-19701) the `in` operator in pyspark is broken
     [ https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19701:
------------------------------------
    Assignee:     (was: Apache Spark)

> the `in` operator in pyspark is broken
> --------------------------------------
>
>                 Key: SPARK-19701
>                 URL: https://issues.apache.org/jira/browse/SPARK-19701
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Wenchen Fan
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, in __nonzero__
>     raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}
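The failure above is inherent to Python's `in` operator, not just a pyspark oversight: `x in col` calls `col.__contains__(x)`, and Python then forcibly coerces whatever comes back to `bool`, which is exactly the coercion `Column.__nonzero__` refuses. A minimal pure-Python mock (not the real pyspark `Column`) shows the mechanism:

```python
# Minimal mock (NOT the real pyspark Column) showing why `in` cannot
# return a column expression: Python coerces __contains__'s result to bool.

class FakeColumn:
    def __contains__(self, item):
        # Even if we return a rich "expression" object here...
        return FakeColumn()

    def __bool__(self):
        # ...Python immediately coerces it to bool, so the only honest
        # option is to raise, just like pyspark's Column.__nonzero__.
        raise ValueError("Cannot convert column into bool")

col = FakeColumn()
try:
    "Spark" in col
except ValueError as e:
    print(e)   # Cannot convert column into bool
```

The workaround is to build the expression explicitly, e.g. `textFile.filter(textFile.value.contains("Spark"))`, which never triggers the bool coercion.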
[jira] [Commented] (SPARK-19701) the `in` operator in pyspark is broken
     [ https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895494#comment-15895494 ]

Apache Spark commented on SPARK-19701:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17160

> the `in` operator in pyspark is broken
> --------------------------------------
>
>                 Key: SPARK-19701
>                 URL: https://issues.apache.org/jira/browse/SPARK-19701
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Wenchen Fan
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, in __nonzero__
>     raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}
[jira] [Assigned] (SPARK-19701) the `in` operator in pyspark is broken
     [ https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19701:
------------------------------------
    Assignee: Apache Spark

> the `in` operator in pyspark is broken
> --------------------------------------
>
>                 Key: SPARK-19701
>                 URL: https://issues.apache.org/jira/browse/SPARK-19701
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Wenchen Fan
>            Assignee: Apache Spark
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, in __nonzero__
>     raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}
[jira] [Updated] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level
     [ https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-19816:
-----------------------------
    Fix Version/s: 2.1.1

> DataFrameCallbackSuite doesn't recover the log level
> ----------------------------------------------------
>
>                 Key: SPARK-19816
>                 URL: https://issues.apache.org/jira/browse/SPARK-19816
>             Project: Spark
>          Issue Type: Test
>          Components: SQL, Tests
>    Affects Versions: 2.2.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>             Fix For: 2.1.1, 2.2.0
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs.
[jira] [Updated] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level
     [ https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-19816:
----------------------------
    Fix Version/s: 2.2.0

> DataFrameCallbackSuite doesn't recover the log level
> ----------------------------------------------------
>
>                 Key: SPARK-19816
>                 URL: https://issues.apache.org/jira/browse/SPARK-19816
>             Project: Spark
>          Issue Type: Test
>          Components: SQL, Tests
>    Affects Versions: 2.2.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>             Fix For: 2.2.0
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs.
[jira] [Resolved] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level
     [ https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-19816.
-----------------------------
    Resolution: Fixed

> DataFrameCallbackSuite doesn't recover the log level
> ----------------------------------------------------
>
>                 Key: SPARK-19816
>                 URL: https://issues.apache.org/jira/browse/SPARK-19816
>             Project: Spark
>          Issue Type: Test
>          Components: SQL, Tests
>    Affects Versions: 2.2.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs.
[jira] [Assigned] (SPARK-19818) SparkR union should check for name consistency of input data frames
     [ https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19818:
------------------------------------
    Assignee:     (was: Apache Spark)

> SparkR union should check for name consistency of input data frames
> -------------------------------------------------------------------
>
>                 Key: SPARK-19818
>                 URL: https://issues.apache.org/jira/browse/SPARK-19818
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.1.0
>            Reporter: Wayne Zhang
>            Priority: Minor
>
> The current implementation accepts data frames with different schemas. See the issue below:
> {code}
> df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19)))
> union(df, df[, c(2, 1)])
>      name     age
> 1 Michael     1.0
> 2    Andy    30.0
> 3  Justin    19.0
> 4     1.0 Michael
> {code}
[jira] [Assigned] (SPARK-19818) SparkR union should check for name consistency of input data frames
     [ https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19818:
------------------------------------
    Assignee: Apache Spark

> SparkR union should check for name consistency of input data frames
> -------------------------------------------------------------------
>
>                 Key: SPARK-19818
>                 URL: https://issues.apache.org/jira/browse/SPARK-19818
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.1.0
>            Reporter: Wayne Zhang
>            Assignee: Apache Spark
>            Priority: Minor
>
> The current implementation accepts data frames with different schemas. See the issue below:
> {code}
> df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19)))
> union(df, df[, c(2, 1)])
>      name     age
> 1 Michael     1.0
> 2    Andy    30.0
> 3  Justin    19.0
> 4     1.0 Michael
> {code}
[jira] [Commented] (SPARK-19818) SparkR union should check for name consistency of input data frames
     [ https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895468#comment-15895468 ]

Apache Spark commented on SPARK-19818:
--------------------------------------

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17159

> SparkR union should check for name consistency of input data frames
> -------------------------------------------------------------------
>
>                 Key: SPARK-19818
>                 URL: https://issues.apache.org/jira/browse/SPARK-19818
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.1.0
>            Reporter: Wayne Zhang
>            Priority: Minor
>
> The current implementation accepts data frames with different schemas. See the issue below:
> {code}
> df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19)))
> union(df, df[, c(2, 1)])
>      name     age
> 1 Michael     1.0
> 2    Andy    30.0
> 3  Justin    19.0
> 4     1.0 Michael
> {code}
[jira] [Created] (SPARK-19818) SparkR union should check for name consistency of input data frames
Wayne Zhang created SPARK-19818:
-----------------------------------

             Summary: SparkR union should check for name consistency of input data frames
                 Key: SPARK-19818
                 URL: https://issues.apache.org/jira/browse/SPARK-19818
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 2.1.0
            Reporter: Wayne Zhang
            Priority: Minor

The current implementation accepts data frames with different schemas. See the issue below:

{code}
df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19)))
union(df, df[, c(2, 1)])
     name     age
1 Michael     1.0
2    Andy    30.0
3  Justin    19.0
4     1.0 Michael
{code}
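The bug boils down to union being purely positional while the caller expects it to be name-aware, so the swapped columns from `df[, c(2, 1)]` are stacked under the wrong headers. A small Python sketch (hypothetical helper, not SparkR code) of the name-consistency check this ticket asks for:

```python
# Hypothetical sketch of the requested check: refuse a positional union
# when the two frames' column names differ. Frames are modeled simply
# as (column names, list of row tuples).

def union(columns_a, rows_a, columns_b, rows_b):
    if columns_a != columns_b:
        raise ValueError(
            f"Column names must match: {columns_a} vs {columns_b}")
    return columns_a, rows_a + rows_b

cols = ["name", "age"]
rows = [("Michael", 1.0), ("Andy", 30.0), ("Justin", 19.0)]
# The same frame with its columns swapped, as in df[, c(2, 1)]:
swapped_cols = ["age", "name"]
swapped_rows = [(1.0, "Michael"), (30.0, "Andy"), (19.0, "Justin")]

print(union(cols, rows, cols, rows)[1][3])   # matching schemas: fine
try:
    union(cols, rows, swapped_cols, swapped_rows)
except ValueError as e:
    print(e)   # mismatched names rejected instead of silently misaligned
```

A name-aware variant could instead reorder `rows_b` by matching column names before concatenating; rejecting the mismatch is the minimal fix the ticket title describes.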
[jira] [Resolved] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore
     [ https://issues.apache.org/jira/browse/SPARK-19804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-19804.
-----------------------------
       Resolution: Fixed
         Assignee: Marcelo Vanzin
    Fix Version/s: 2.2.0

> HiveClientImpl does not work with Hive 2.2.0 metastore
> ------------------------------------------------------
>
>                 Key: SPARK-19804
>                 URL: https://issues.apache.org/jira/browse/SPARK-19804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Marcelo Vanzin
>            Assignee: Marcelo Vanzin
>            Priority: Minor
>             Fix For: 2.2.0
>
> I know that Spark currently does not officially support Hive 2.2 (perhaps because it hasn't been released yet); but we have some 2.2 patches in CDH, and the current code in the isolated client fails. The most probable culprit is the set of changes added in HIVE-13149.
> The fix is simple; here's the patch we applied in CDH:
> https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0
> Fixing that doesn't affect any existing Hive version support, but will make it easier to support 2.2 when it's out.
[jira] [Comment Edited] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore
     [ https://issues.apache.org/jira/browse/SPARK-19804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895459#comment-15895459 ]

Xiao Li edited comment on SPARK-19804 at 3/4/17 2:47 AM:
---------------------------------------------------------

Resolved by https://github.com/apache/spark/pull/17154

was (Author: smilegator):
https://github.com/apache/spark/pull/17154

> HiveClientImpl does not work with Hive 2.2.0 metastore
> ------------------------------------------------------
>
>                 Key: SPARK-19804
>                 URL: https://issues.apache.org/jira/browse/SPARK-19804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Marcelo Vanzin
>            Assignee: Marcelo Vanzin
>            Priority: Minor
>             Fix For: 2.2.0
>
> I know that Spark currently does not officially support Hive 2.2 (perhaps because it hasn't been released yet); but we have some 2.2 patches in CDH, and the current code in the isolated client fails. The most probable culprit is the set of changes added in HIVE-13149.
> The fix is simple; here's the patch we applied in CDH:
> https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0
> Fixing that doesn't affect any existing Hive version support, but will make it easier to support 2.2 when it's out.
[jira] [Commented] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore
     [ https://issues.apache.org/jira/browse/SPARK-19804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895459#comment-15895459 ]

Xiao Li commented on SPARK-19804:
---------------------------------

https://github.com/apache/spark/pull/17154

> HiveClientImpl does not work with Hive 2.2.0 metastore
> ------------------------------------------------------
>
>                 Key: SPARK-19804
>                 URL: https://issues.apache.org/jira/browse/SPARK-19804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Marcelo Vanzin
>            Priority: Minor
>
> I know that Spark currently does not officially support Hive 2.2 (perhaps because it hasn't been released yet); but we have some 2.2 patches in CDH, and the current code in the isolated client fails. The most probable culprit is the set of changes added in HIVE-13149.
> The fix is simple; here's the patch we applied in CDH:
> https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0
> Fixing that doesn't affect any existing Hive version support, but will make it easier to support 2.2 when it's out.
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
     [ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895452#comment-15895452 ]

Apache Spark commented on SPARK-16845:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17157

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16845
>                 URL: https://issues.apache.org/jira/browse/SPARK-16845
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: hejie
>            Assignee: Liwei Lin
>             Fix For: 2.1.1, 2.2.0
>
>         Attachments: error.txt.zip
>
> I have a wide table (400 columns); when I try fitting the train data on all columns, this fatal error occurs:
> ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
> at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
     [ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895451#comment-15895451 ]

Apache Spark commented on SPARK-16845:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17158

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16845
>                 URL: https://issues.apache.org/jira/browse/SPARK-16845
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: hejie
>            Assignee: Liwei Lin
>             Fix For: 2.1.1, 2.2.0
>
>         Attachments: error.txt.zip
>
> I have a wide table (400 columns); when I try fitting the train data on all columns, this fatal error occurs:
> ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
> at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
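The error above comes from the JVM's hard limit of 64 KB of bytecode per method: a single generated `compare` over 400 columns simply doesn't fit. The general shape of the fix is to split the one giant comparison into chunked helper methods. A toy Python sketch of that splitting (hypothetical illustration only; the real fix lives in Spark's Janino code generation):

```python
# Toy illustration of splitting one huge N-column row comparison into
# fixed-size chunks, the same shape of fix used when generated code
# would exceed the JVM's 64 KB per-method bytecode limit.

def compare_chunk(row_a, row_b, start, end):
    """Compare one slice of columns; 0 means 'equal so far'."""
    for i in range(start, end):
        if row_a[i] != row_b[i]:
            return -1 if row_a[i] < row_b[i] else 1
    return 0

def compare_rows(row_a, row_b, chunk_size=100):
    """Outer dispatcher: small, and delegates to per-chunk helpers."""
    for start in range(0, len(row_a), chunk_size):
        end = min(start + chunk_size, len(row_a))
        result = compare_chunk(row_a, row_b, start, end)
        if result != 0:
            return result
    return 0

a = [0] * 400
b = [0] * 399 + [1]
print(compare_rows(a, b))   # -1: a sorts before b, differing in the last column
```

In generated Java each chunk becomes its own method, so every method body stays well under the 64 KB ceiling regardless of how wide the table is.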
[jira] [Comment Edited] (SPARK-19659) Fetch big blocks to disk when shuffle-read
     [ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895411#comment-15895411 ]

jin xing edited comment on SPARK-19659 at 3/4/17 2:11 AM:
----------------------------------------------------------

[~rxin] Thanks a lot for the comment. Tracking the average size and also the outliers is a good idea. But there can be multiple huge blocks creating too much pressure (e.g. 10% of the blocks are much bigger than the other 90%), and it is a little bit hard to decide how many outliers we should track. If we track too many outliers, *MapStatus* can cost too much memory. I think the benefit of tracking the max of each group of N/2000 consecutive blocks is that we can keep *MapStatus* from costing too much memory (at most around 2000 bytes after compressing) and still have all outliers under control. Do you think it's worth trying?

was (Author: jinxing6...@126.com):
[~rxin] Thanks a lot for the comment. Tracking the average size and also the outliers is a good idea. But there can be multiple huge blocks creating too much pressure (e.g. 10% of the blocks are much bigger than the other 90%), and it is a little bit hard to decide how many outliers we should track. If we track too many outliers, *MapStatus* can cost too much memory. I think the benefit of tracking the max of each group of N/2000 consecutive blocks is that we can keep *MapStatus* from costing too much memory (at most around 2000 bytes).

> Fetch big blocks to disk when shuffle-read
> ------------------------------------------
>
>                 Key: SPARK-19659
>                 URL: https://issues.apache.org/jira/browse/SPARK-19659
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.1.0
>            Reporter: jin xing
>         Attachments: SPARK-19659-design-v1.pdf
>
> Currently the whole block is fetched into memory (off-heap by default) when doing a shuffle-read. A block is identified by (shuffleId, mapId, reduceId), so it can be large in skew situations. If an OOM happens during shuffle read, the job will be killed and users will be notified to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more memory can resolve the OOM, but that approach is not well suited to production environments, especially data warehouses.
> Using Spark SQL as the data engine in a warehouse, users hope to have a unified parameter (e.g. memory) with less resource wasted (resource allocated but not used). It's not always easy to predict skew situations; when they happen, it makes sense to fetch remote blocks to disk for shuffle-read, rather than kill the job because of OOM. This approach was mentioned during the discussion in SPARK-3019, by [~sandyr] and [~mridulm80].
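The "max of each N/2000 consecutive blocks" proposal can be sketched concretely. This is a hypothetical illustration (the names `chunked_max` and `estimate` are invented, not MapStatus code): each group of consecutive blocks is summarized by its maximum, so the summary stays at about 2000 entries regardless of N, and no block is ever under-estimated:

```python
import math

# Hypothetical sketch: summarize N block sizes by the max of each group
# of consecutive blocks, capping the summary at ~target_entries values.
# Estimates can over-state a block's size but never under-state it, so
# a "big block" decision based on the estimate is always safe.

def chunked_max(sizes, target_entries=2000):
    chunk = max(1, math.ceil(len(sizes) / target_entries))
    maxima = [max(sizes[i:i + chunk]) for i in range(0, len(sizes), chunk)]
    return chunk, maxima

def estimate(summary, index):
    chunk, maxima = summary
    return maxima[index // chunk]

sizes = list(range(10000))            # 10k blocks of growing size
summary = chunked_max(sizes)
assert len(summary[1]) <= 2000        # compact no matter how many blocks
assert all(estimate(summary, i) >= s for i, s in enumerate(sizes))
```

The cost of this safety is over-estimation inside a skewed group, which is the error-vs-size trade-off being debated in the thread above.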
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
     [ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895411#comment-15895411 ]

jin xing commented on SPARK-19659:
----------------------------------

[~rxin] Thanks a lot for the comment. Tracking the average size and also the outliers is a good idea. But there can be multiple huge blocks creating too much pressure (e.g. 10% of the blocks are much bigger than the other 90%), and it is a little bit hard to decide how many outliers we should track. If we track too many outliers, *MapStatus* can cost too much memory. I think the benefit of tracking the max of each group of N/2000 consecutive blocks is that we can keep *MapStatus* from costing too much memory (at most around 2000 bytes).

> Fetch big blocks to disk when shuffle-read
> ------------------------------------------
>
>                 Key: SPARK-19659
>                 URL: https://issues.apache.org/jira/browse/SPARK-19659
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.1.0
>            Reporter: jin xing
>         Attachments: SPARK-19659-design-v1.pdf
>
> Currently the whole block is fetched into memory (off-heap by default) when doing a shuffle-read. A block is identified by (shuffleId, mapId, reduceId), so it can be large in skew situations. If an OOM happens during shuffle read, the job will be killed and users will be notified to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more memory can resolve the OOM, but that approach is not well suited to production environments, especially data warehouses.
> Using Spark SQL as the data engine in a warehouse, users hope to have a unified parameter (e.g. memory) with less resource wasted (resource allocated but not used). It's not always easy to predict skew situations; when they happen, it makes sense to fetch remote blocks to disk for shuffle-read, rather than kill the job because of OOM. This approach was mentioned during the discussion in SPARK-3019, by [~sandyr] and [~mridulm80].
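The fetch-to-disk idea in the description can be sketched minimally: blocks whose reported size exceeds a threshold are streamed to a temporary file instead of being buffered in memory. This is a hypothetical illustration with invented names, not Spark's actual shuffle fetch code:

```python
import io
import tempfile

# Hypothetical sketch: choose the fetch target by the block's reported
# size, spilling big blocks to disk instead of buffering them in memory
# and risking an OOM on skewed partitions.

def fetch_block(data, reported_size, max_in_memory=1 << 20):
    """Return ("memory"|"disk", readable file-like object) for the block."""
    if reported_size <= max_in_memory:
        return "memory", io.BytesIO(data)   # small block: cheap in-memory buffer
    f = tempfile.TemporaryFile()            # big block: stream to disk
    f.write(data)
    f.seek(0)
    return "disk", f

kind, _ = fetch_block(b"x" * 100, reported_size=100)
print(kind)   # memory
kind, _ = fetch_block(b"x" * 100, reported_size=2 << 20)
print(kind)   # disk
```

Note the decision uses the *reported* size from the map status, which is why the accuracy of the compressed size estimates discussed in the comments matters: an under-estimated big block would wrongly take the in-memory path.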
[jira] [Updated] (SPARK-19817) make it clear that `timeZone` option is a general option in DataFrameReader/Writer
     [ https://issues.apache.org/jira/browse/SPARK-19817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-19817:
--------------------------------
    Description: As the timezone setting can also affect partition values, and it works for all formats, we should make this clear.  (was: As the timezone setting can also affect partition values, it doesn't make sense that we only support timezone options for JSON and CSV in `DataFrameReader/Writer`; we should support all formats.)

> make it clear that `timeZone` option is a general option in DataFrameReader/Writer
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-19817
>                 URL: https://issues.apache.org/jira/browse/SPARK-19817
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Wenchen Fan
>            Assignee: Takuya Ueshin
>             Fix For: 2.2.0
>
> As the timezone setting can also affect partition values, and it works for all formats, we should make this clear.
[jira] [Updated] (SPARK-19817) make it clear that `timeZone` option is a general option in DataFrameReader/Writer
     [ https://issues.apache.org/jira/browse/SPARK-19817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-19817:
--------------------------------
    Summary: make it clear that `timeZone` option is a general option in DataFrameReader/Writer  (was: support timeZone option for all formats in `DataFrameReader/Writer`)

> make it clear that `timeZone` option is a general option in DataFrameReader/Writer
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-19817
>                 URL: https://issues.apache.org/jira/browse/SPARK-19817
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Wenchen Fan
>            Assignee: Takuya Ueshin
>             Fix For: 2.2.0
>
> As the timezone setting can also affect partition values, it doesn't make sense that we only support timezone options for JSON and CSV in `DataFrameReader/Writer`; we should support all formats.
[jira] [Created] (SPARK-19817) support timeZone option for all formats in `DataFrameReader/Writer`
Wenchen Fan created SPARK-19817:
-----------------------------------

             Summary: support timeZone option for all formats in `DataFrameReader/Writer`
                 Key: SPARK-19817
                 URL: https://issues.apache.org/jira/browse/SPARK-19817
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Wenchen Fan
            Assignee: Takuya Ueshin

As the timezone setting can also affect partition values, it doesn't make sense that we only support timezone options for JSON and CSV in `DataFrameReader/Writer`; we should support all formats.
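The reason the `timeZone` option matters beyond JSON/CSV parsing is that values derived from timestamps, such as partition values, are rendered per session timezone: the same instant can land on different dates under different zones. A standalone stdlib illustration (plain Python, not pyspark; requires Python 3.9+ for `zoneinfo`):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The same UTC instant rendered in two timezones: the derived date (a
# typical partition value) differs, so a timezone setting affects
# partitioning for every format, not only JSON/CSV parsing.

instant = datetime(2017, 3, 4, 2, 47, tzinfo=timezone.utc)

utc_date = instant.astimezone(ZoneInfo("UTC")).strftime("%Y-%m-%d")
la_date = instant.astimezone(ZoneInfo("America/Los_Angeles")).strftime("%Y-%m-%d")

print(utc_date)   # 2017-03-04
print(la_date)    # 2017-03-03 -- one day earlier in Los Angeles
```

That one-day shift is exactly why a per-format-only timezone option would be misleading: any writer that derives partition directories from timestamps is affected.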
[jira] [Reopened] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-18350: - > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Labels: releasenotes > Fix For: 2.2.0 > > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
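The motivation above (execution should not depend on whatever timezone the machine happens to be in) can be illustrated with a short, language-neutral sketch using Python's stdlib `zoneinfo` (Python 3.9+); the zone names and the timestamp are arbitrary examples, not taken from the issue:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+ stdlib

# One absolute instant...
instant = datetime(2017, 3, 4, 1, 0, tzinfo=timezone.utc)

# ...renders as different local wall-clock times depending on the zone in
# effect. Datetime results would therefore vary with the cluster's machine
# timezone unless a session-level zone is applied consistently.
as_tokyo = instant.astimezone(ZoneInfo("Asia/Tokyo"))
as_la = instant.astimezone(ZoneInfo("America/Los_Angeles"))
assert as_tokyo.hour == 10                   # 2017-03-04 10:00 in Tokyo (UTC+9)
assert as_la.day == 3 and as_la.hour == 17   # 2017-03-03 17:00 in LA (UTC-8, PST)
```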
[jira] [Resolved] (SPARK-19718) Fix flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-19718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19718. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.2.0 > Fix flaky test: > org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: > stress test for failOnDataLoss=false > --- > > Key: SPARK-19718 > URL: https://issues.apache.org/jira/browse/SPARK-19718 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > > SPARK-19617 changed HDFSMetadataLog to enable interrupts when using the local > file system. However, now we hit HADOOP-12074: `Shell.runCommand` converts > `InterruptedException` to `new IOException(ie.toString())` before Hadoop 2.8. > Test failure: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/2504/consoleFull > {code} > [info] - stress test for failOnDataLoss=false *** FAILED *** (1 minute, 1 > second) > [info] org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 27d45f4f-14dc-4c74-8b52-4bbd4f2b9bec, runId = > 23b8c1ea-4da9-4096-967a-692933e4b319] terminated with exception: > java.lang.InterruptedException > [info] at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:304) > [info] at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:190) > [info] Cause: java.io.IOException: java.lang.InterruptedException > [info] at org.apache.hadoop.util.Shell.runCommand(Shell.java:578) > [info] at org.apache.hadoop.util.Shell.run(Shell.java:478) > [info] at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766) > [info] at org.apache.hadoop.util.Shell.execCommand(Shell.java:859) > [info] at org.apache.hadoop.util.Shell.execCommand(Shell.java:842) > [info] at > 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661) > [info] at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:300) > [info] at > org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1014) > [info] at > org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85) > [info] at > org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:354) > [info] at > org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:394) > [info] at > org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) > [info] at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:680) > [info] at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:676) > [info] at > org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > [info] at org.apache.hadoop.fs.FileContext.create(FileContext.java:676) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
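The HADOOP-12074 behavior described above — `Shell.runCommand` rewrapping an `InterruptedException` as a plain `IOException` — can be modeled with a small hypothetical sketch (Python is used as a neutral illustration; `run_command_pre_2_8` is an invented name). Once rewrapped, callers can no longer detect the interrupt by exception type, which is why the stream execution surfaced `java.io.IOException: java.lang.InterruptedException` instead of a clean interrupt:

```python
def run_command_pre_2_8():
    # Hypothetical model of HADOOP-12074 (pre-Hadoop-2.8 Shell.runCommand):
    # the interrupt is caught and re-raised as a generic I/O error.
    try:
        raise InterruptedError("java.lang.InterruptedException")
    except InterruptedError as ie:
        raise OSError(str(ie)) from None

try:
    run_command_pre_2_8()
except OSError as caught:
    err = caught

# The original interrupt type is gone; only its message survives,
# so type-based interrupt handling in the caller no longer fires.
assert not isinstance(err, InterruptedError)
assert "InterruptedException" in str(err)
```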
[jira] [Assigned] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level
[ https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19816: Assignee: Apache Spark (was: Shixiong Zhu) > DataFrameCallbackSuite doesn't recover the log level > > > Key: SPARK-19816 > URL: https://issues.apache.org/jira/browse/SPARK-19816 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > "DataFrameCallbackSuite.execute callback functions when a DataFrame action > failed" sets the log level to "fatal" but doesn't recover it. Hence, tests > running after it won't output any logs except fatal logs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level
[ https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19816: Assignee: Shixiong Zhu (was: Apache Spark) > DataFrameCallbackSuite doesn't recover the log level > > > Key: SPARK-19816 > URL: https://issues.apache.org/jira/browse/SPARK-19816 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > "DataFrameCallbackSuite.execute callback functions when a DataFrame action > failed" sets the log level to "fatal" but doesn't recover it. Hence, tests > running after it won't output any logs except fatal logs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level
[ https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895348#comment-15895348 ] Apache Spark commented on SPARK-19816: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17156 > DataFrameCallbackSuite doesn't recover the log level > > > Key: SPARK-19816 > URL: https://issues.apache.org/jira/browse/SPARK-19816 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > "DataFrameCallbackSuite.execute callback functions when a DataFrame action > failed" sets the log level to "fatal" but doesn't recover it. Hence, tests > running after it won't output any logs except fatal logs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19811) sparksql 2.1 can not prune hive partition
[ https://issues.apache.org/jira/browse/SPARK-19811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895346#comment-15895346 ] sydt edited comment on SPARK-19811 at 3/4/17 1:04 AM: -- this is not a problem because it can be resolved by changing the partition predicate "DAY_ID='20170212' AND PROV_ID ='842'" to lower case was (Author: wangchao2017): this is not a problem because it can be resolved by change partition information "DAY_ID='20170212' AND PROV_ID ='842'" to lower spell. > sparksql 2.1 can not prune hive partition > -- > > Key: SPARK-19811 > URL: https://issues.apache.org/jira/browse/SPARK-19811 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: sydt > > When Spark SQL 2.1 executes a query, it fails with: > java.lang.RuntimeException: Expected only partition pruning predicates: > (isnotnull(DAY_ID#216) && (DAY_ID#216 = 20170212)). The SQL statement is > select PROD_INST_ID from CRM_DB.ITG_PROD_INST WHERE DAY_ID='20170212' AND > PROV_ID ='842' limit 10; where DAY_ID and PROV_ID are partition columns in Hive. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19811) sparksql 2.1 can not prune hive partition
[ https://issues.apache.org/jira/browse/SPARK-19811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895346#comment-15895346 ] sydt commented on SPARK-19811: -- this is not a problem because it can be resolved by changing the partition predicate "DAY_ID='20170212' AND PROV_ID ='842'" to lower case. > sparksql 2.1 can not prune hive partition > -- > > Key: SPARK-19811 > URL: https://issues.apache.org/jira/browse/SPARK-19811 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: sydt > > When Spark SQL 2.1 executes a query, it fails with: > java.lang.RuntimeException: Expected only partition pruning predicates: > (isnotnull(DAY_ID#216) && (DAY_ID#216 = 20170212)). The SQL statement is > select PROD_INST_ID from CRM_DB.ITG_PROD_INST WHERE DAY_ID='20170212' AND > PROV_ID ='842' limit 10; where DAY_ID and PROV_ID are partition columns in Hive. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
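The workaround above (lower-casing the predicate's column names) suggests the catalog stores partition column names in lower case while the query spelled them in upper case. A hypothetical sketch of case-insensitive predicate matching — `prune` and the sample partitions are invented for illustration, not Spark's actual pruning code path:

```python
def prune(partitions, **predicates):
    # Normalize predicate column names to lower case before matching,
    # mirroring the catalog's lower-case spelling (hypothetical model).
    wanted = {col.lower(): val for col, val in predicates.items()}
    return [p for p in partitions
            if all(p.get(col) == val for col, val in wanted.items())]

partitions = [
    {"day_id": "20170212", "prov_id": "842"},
    {"day_id": "20170213", "prov_id": "842"},
]
# Upper-case predicates, as written in the reported query, still match.
assert prune(partitions, DAY_ID="20170212", PROV_ID="842") == [partitions[0]]
```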
[jira] [Created] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level
Shixiong Zhu created SPARK-19816: Summary: DataFrameCallbackSuite doesn't recover the log level Key: SPARK-19816 URL: https://issues.apache.org/jira/browse/SPARK-19816 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 2.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
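The usual fix pattern for a test like this is to save the level and restore it in a `finally` block, so the restore runs even when the action under test fails. A minimal sketch using Python's stdlib `logging` as an analog of the suite's Log4j-based setup (`run_with_log_level` is an invented helper, not Spark's API):

```python
import logging

def run_with_log_level(logger, level, action):
    """Run `action` at a temporary log level, restoring the old one even on failure."""
    saved = logger.level
    logger.setLevel(level)
    try:
        return action()
    finally:
        # Restore unconditionally so later tests keep their normal log output.
        logger.setLevel(saved)

log = logging.getLogger("suite")
log.setLevel(logging.INFO)
try:
    # The action raises, as in "execute callback functions when a DataFrame
    # action failed" -- the level must still come back afterwards.
    run_with_log_level(log, logging.CRITICAL, lambda: 1 / 0)
except ZeroDivisionError:
    pass
assert log.level == logging.INFO
```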
[jira] [Assigned] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-13446: --- Assignee: Xiao Li > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang >Assignee: Xiao Li > Fix For: 2.2.0 > > > Spark provides the HiveContext class to read data from the Hive metastore > directly. However, it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has > been released, it's better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-13446. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17061 [https://github.com/apache/spark/pull/17061] > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang > Fix For: 2.2.0 > > > Spark provides the HiveContext class to read data from the Hive metastore > directly. However, it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has > been released, it's better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. 
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
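The `NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT` above is the classic symptom of compiling against one Hive client version and running against another that dropped the constant. A hypothetical sketch of the defensive pattern (the two `HiveConf*` classes are invented stand-ins, not real Hive classes):

```python
class HiveConfOld:
    # Stand-in for a Hive client version that still defines the constant.
    HIVE_STATS_JDBC_TIMEOUT = 30

class HiveConfNew:
    # Stand-in for a newer Hive client where the constant was removed.
    pass

def stats_jdbc_timeout(conf_cls, default=0):
    # Probe for the field instead of referencing it unconditionally, so a
    # missing constant degrades to a default rather than a hard error.
    return getattr(conf_cls, "HIVE_STATS_JDBC_TIMEOUT", default)

assert stats_jdbc_timeout(HiveConfOld) == 30
assert stats_jdbc_timeout(HiveConfNew) == 0
```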
[jira] [Updated] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19348: -- Fix Version/s: 2.2.0 > pyspark.ml.Pipeline gets corrupted under multi threaded use > --- > > Key: SPARK-19348 > URL: https://issues.apache.org/jira/browse/SPARK-19348 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Vinayak Joshi >Assignee: Bryan Cutler > Fix For: 2.2.0 > > Attachments: pyspark_pipeline_threads.py > > > When pyspark.ml.Pipeline objects are constructed concurrently in separate > python threads, it is observed that the stages used to construct a pipeline > object get corrupted, i.e. the stages supplied to a Pipeline object in one > thread appear inside a different Pipeline object constructed in a different > thread. > Things work fine if construction of pyspark.ml.Pipeline objects is > serialized, so this looks like a thread safety problem with > pyspark.ml.Pipeline object construction. > Confirmed that the problem exists with Spark 1.6.x as well as 2.x. > While the corruption of the Pipeline stages is easily caught, we need to know > whether other pipeline operations, such as pyspark.ml.Pipeline.fit(), > are also affected by the underlying cause of this problem. That is, whether > other pipeline operations like pyspark.ml.Pipeline.fit() may be performed > in separate threads (on distinct pipeline objects) concurrently without any > cross contamination between them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
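The report notes that serializing construction avoids the corruption, which points at shared mutable state touched during `Pipeline.__init__`. A sketch of that workaround — the `Pipeline` class and `make_pipeline` helper below are toy stand-ins invented for illustration, not pyspark's actual classes:

```python
import threading

construction_lock = threading.Lock()

class Pipeline:
    # Toy stand-in for pyspark.ml.Pipeline.
    def __init__(self, stages):
        self.stages = list(stages)

def make_pipeline(stages):
    # Workaround from the report: serialize Pipeline construction across
    # threads so concurrent __init__ calls cannot interleave.
    with construction_lock:
        return Pipeline(stages)

results = []
def worker(tag):
    p = make_pipeline([tag + "-a", tag + "-b"])
    results.append((tag, p.stages))

threads = [threading.Thread(target=worker, args=(str(i),)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# Every pipeline keeps exactly the stages its own thread supplied.
assert all(stages == [tag + "-a", tag + "-b"] for tag, stages in results)
```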
[jira] [Resolved] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18350. - Resolution: Fixed Assignee: Takuya Ueshin Fix Version/s: 2.2.0 > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Labels: releasenotes > Fix For: 2.2.0 > > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18939) Timezone support in partition values.
[ https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-18939: --- Assignee: Takuya Ueshin > Timezone support in partition values. > - > > Key: SPARK-18939 > URL: https://issues.apache.org/jira/browse/SPARK-18939 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 2.2.0 > > > We should also use session local timezone to interpret partition values. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18939) Timezone support in partition values.
[ https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18939. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17053 [https://github.com/apache/spark/pull/17053] > Timezone support in partition values. > - > > Key: SPARK-18939 > URL: https://issues.apache.org/jira/browse/SPARK-18939 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Takuya Ueshin > Fix For: 2.2.0 > > > We should also use session local timezone to interpret partition values. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19815) Not orderable should be applied to right key instead of left key
[ https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-19815: --- Summary: Not orderable should be applied to right key instead of left key (was: Not order able should be applied to right key instead of left key) > Not orderable should be applied to right key instead of left key > > > Key: SPARK-19815 > URL: https://issues.apache.org/jira/browse/SPARK-19815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Priority: Minor > > When generating ShuffledHashJoinExec, the orderable condition should be > applied to right key instead of left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19815) Not order able should be applied to right key instead of left key
[ https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19815: Assignee: (was: Apache Spark) > Not order able should be applied to right key instead of left key > - > > Key: SPARK-19815 > URL: https://issues.apache.org/jira/browse/SPARK-19815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Priority: Minor > > When generating ShuffledHashJoinExec, the orderable condition should be > applied to right key instead of left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19815) Not order able should be applied to right key instead of left key
[ https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895250#comment-15895250 ] Apache Spark commented on SPARK-19815: -- User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/17155 > Not order able should be applied to right key instead of left key > - > > Key: SPARK-19815 > URL: https://issues.apache.org/jira/browse/SPARK-19815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Priority: Minor > > When generating ShuffledHashJoinExec, the orderable condition should be > applied to right key instead of left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19815) Not order able should be applied to right key instead of left key
[ https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19815: Assignee: Apache Spark > Not order able should be applied to right key instead of left key > - > > Key: SPARK-19815 > URL: https://issues.apache.org/jira/browse/SPARK-19815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Assignee: Apache Spark >Priority: Minor > > When generating ShuffledHashJoinExec, the orderable condition should be > applied to right key instead of left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19815) Not order able should be applied to right key instead of left key
Zhan Zhang created SPARK-19815: -- Summary: Not order able should be applied to right key instead of left key Key: SPARK-19815 URL: https://issues.apache.org/jira/browse/SPARK-19815 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor When generating ShuffledHashJoinExec, the orderable condition should be applied to right key instead of left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19084) conditional function: field
[ https://issues.apache.org/jira/browse/SPARK-19084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895154#comment-15895154 ] Apache Spark commented on SPARK-19084: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/17154 > conditional function: field > --- > > Key: SPARK-19084 > URL: https://issues.apache.org/jira/browse/SPARK-19084 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Chenzhao Guo > > field(str, str1, str2, ... ) is a variable-length(>=2) function which returns > the index of str in the list (str1, str2, ... ) or 0 if not found. > Every parameter is required to be subtype of AtomicType. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
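For reference, the described semantics match the usual SQL `FIELD` function: a 1-based index of the first argument among the rest, with 0 when there is no match (1-based indexing is an assumption carried over from MySQL's `FIELD`, not stated explicitly above). A sketch:

```python
def field(needle, *candidates):
    """1-based index of `needle` among `candidates`, or 0 if not found."""
    for i, candidate in enumerate(candidates, start=1):
        if candidate == needle:
            return i
    return 0

assert field("b", "a", "b", "c") == 2  # found at position 2
assert field("z", "a", "b", "c") == 0  # not found
```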
[jira] [Assigned] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource
[ https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19813: Assignee: Burak Yavuz (was: Apache Spark) > maxFilesPerTrigger combo latestFirst may miss old files in combination with > maxFileAge in FileStreamSource > -- > > Key: SPARK-19813 > URL: https://issues.apache.org/jira/browse/SPARK-19813 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > There is a file stream source option called maxFileAge which limits how old > the files can be, relative to the latest file that has been seen. This is used > to limit the files that need to be remembered as "processed". Files older > than the latest processed files are ignored. This value defaults to 7 days. > This causes a problem when both > - latestFirst = true > - maxFilesPerTrigger > total files to be processed. > Here is what happens in all combinations: > 1) latestFirst = false - Since files are processed in order, there won't be > any unprocessed file older than the latest processed file. All files will be > processed. > 2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge > thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is > not set, then all old files get processed in the first batch, and so no file is > left behind. > 3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch > processes the latest X files. That sets the threshold to (latest file - > maxFileAge), so files older than this threshold will never be considered for > processing. > The bug is with case 3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource
[ https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895110#comment-15895110 ] Apache Spark commented on SPARK-19813: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/17153 > maxFilesPerTrigger combo latestFirst may miss old files in combination with > maxFileAge in FileStreamSource > -- > > Key: SPARK-19813 > URL: https://issues.apache.org/jira/browse/SPARK-19813 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > There is a file stream source option called maxFileAge which limits how old > the files can be, relative to the latest file that has been seen. This is used > to limit the files that need to be remembered as "processed". Files older > than the latest processed files are ignored. This value defaults to 7 days. > This causes a problem when both > - latestFirst = true > - maxFilesPerTrigger > total files to be processed. > Here is what happens in all combinations: > 1) latestFirst = false - Since files are processed in order, there won't be > any unprocessed file older than the latest processed file. All files will be > processed. > 2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge > thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is > not set, then all old files get processed in the first batch, and so no file is > left behind. > 3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch > processes the latest X files. That sets the threshold to (latest file - > maxFileAge), so files older than this threshold will never be considered for > processing. > The bug is with case 3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource
[ https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19813: Assignee: Apache Spark (was: Burak Yavuz) > maxFilesPerTrigger combo latestFirst may miss old files in combination with > maxFileAge in FileStreamSource > -- > > Key: SPARK-19813 > URL: https://issues.apache.org/jira/browse/SPARK-19813 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Burak Yavuz >Assignee: Apache Spark > > There is a file stream source option called maxFileAge which limits how old > the files can be, relative to the latest file that has been seen. This is used > to limit the files that need to be remembered as "processed". Files older > than the latest processed files are ignored. This value defaults to 7 days. > This causes a problem when both > - latestFirst = true > - maxFilesPerTrigger > total files to be processed. > Here is what happens in all combinations: > 1) latestFirst = false - Since files are processed in order, there won't be > any unprocessed file older than the latest processed file. All files will be > processed. > 2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge > thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is > not set, then all old files get processed in the first batch, and so no file is > left behind. > 3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch > processes the latest X files. That sets the threshold to (latest file - > maxFileAge), so files older than this threshold will never be considered for > processing. > The bug is with case 3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
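Case 3 can be made concrete with a small simulation (invented file names and day-granularity timestamps; `eligible_files` is a simplified model of the source's thresholding, not its real code):

```python
def eligible_files(files, latest_seen_ts, max_file_age):
    # A file is considered for processing only if it is newer than
    # (latest seen timestamp - maxFileAge) -- the thresholding described above.
    threshold = latest_seen_ts - max_file_age
    return [name for name, ts in files if ts >= threshold]

# Timestamps in days, maxFileAge = 7 (the default).
# With latestFirst=true and maxFilesPerTrigger=1, the first batch takes only
# the newest file (ts=100), which moves the threshold to 100 - 7 = 93.
older = [("old.json", 0), ("mid.json", 50)]
assert eligible_files(older, latest_seen_ts=100, max_file_age=7) == []
# Both older files fall below the threshold and are never processed: the bug.
```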
[jira] [Commented] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC
[ https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894952#comment-15894952 ] Sean Owen commented on SPARK-19814: --- Yes, that already describes further optimizations. I would close this as a duplicate at least if you're not showing a memory leak. > Spark History Server Out Of Memory / Extreme GC > --- > > Key: SPARK-19814 > URL: https://issues.apache.org/jira/browse/SPARK-19814 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0, 2.1.0 > Environment: Spark History Server (we've run it on several different > Hadoop distributions) >Reporter: Simon King > Attachments: SparkHistoryCPUandRAM.png > > > Spark History Server runs out of memory, gets into GC thrashing and eventually > becomes unresponsive. This seems to happen more quickly with heavy use of the > REST API. We've seen this with several versions of Spark. > Running with the following settings (spark 2.1): > spark.history.fs.cleaner.enabled true > spark.history.fs.cleaner.interval 1d > spark.history.fs.cleaner.maxAge 7d > spark.history.retainedApplications 500 > We will eventually get errors like: > 17/02/25 05:02:19 WARN ServletHandler: > javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: > GC overhead limit exceeded (of class java.lang.OutOfMemoryError) > at > org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489) > at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) > at > 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:529) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit > exceeded (of class java.lang.OutOfMemoryError) > at > org.apache.spark.deploy.history.ApplicationCache.getSparkUI(ApplicationCache.scala:148) > at > org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:110) > at > org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:244) > at > org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:49) > at > org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66) > at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter$1.run(SubResourceLocatorRouter.java:158) > at > org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.getResource(SubResourceLocatorRouter.java:178) > at > org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.apply(SubResourceLocatorRouter.java:109) > at > org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:109) > at > org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112) > at > org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112) > at >
[jira] [Commented] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC
[ https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894948#comment-15894948 ] Simon King commented on SPARK-19814: Sean, I think that giving more memory only delays the problem, but we will experiment more with larger heap settings. We're just starting to look into the issue, and are hoping for early help diagnosing or configuring around it. Hope there's a simpler fix than the major overhaul proposed here: https://issues.apache.org/jira/browse/SPARK-18085
[jira] [Commented] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC
[ https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894942#comment-15894942 ] Sean Owen commented on SPARK-19814: --- I'm not sure this is a bug. It depends on how much memory you give it and how much data the history server stores. 4G may not be enough; increase that? Unless it's a memory leak or some obviously too-large data structure, I don't think it's a bug, but if you have a concrete optimization, you can open a pull request.
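The memory-tuning suggestion above can be tried without code changes. A hedged sketch of the relevant knobs (the 8g figure is an example, not a recommendation; `SPARK_DAEMON_MEMORY` is the variable Spark's daemon launch scripts, including the history server's, read for heap sizing):

```shell
# conf/spark-env.sh — raise the heap for Spark daemon JVMs (history server included)
export SPARK_DAEMON_MEMORY=8g

# conf/spark-defaults.conf settings that bound what the SHS keeps in memory,
# shown here as comments; values are illustrative:
#   spark.history.retainedApplications 50   # fewer cached SparkUI instances
#   spark.history.fs.cleaner.maxAge    7d   # age out old event logs on disk
```

Lowering `spark.history.retainedApplications` is worth trying alongside a larger heap, since the cached per-application UIs are a likely driver of the growth described in this ticket.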
[jira] [Updated] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC
[ https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon King updated SPARK-19814: --- Attachment: SparkHistoryCPUandRAM.png Graph showing CPU usage (top) and RSS RAM (bottom). Note the one run of SHS in the middle with a lower max heap setting eventually spent much more CPU time on garbage collection.
[jira] [Updated] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource
[ https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19813: - Target Version/s: 2.2.0 > maxFilesPerTrigger combo latestFirst may miss old files in combination with > maxFileAge in FileStreamSource > -- > > Key: SPARK-19813 > URL: https://issues.apache.org/jira/browse/SPARK-19813 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > There is a file stream source option called maxFileAge which limits how old > the files can be, relative to the latest file that has been seen. This is used > to limit the files that need to be remembered as "processed". Files older > than the latest processed files are ignored. This value defaults to 7 days. > This causes a problem when both > - latestFirst = true > - maxFilesPerTrigger > total files to be processed. > Here is what happens in each combination: > 1) latestFirst = false - Since files are processed in order, there won't be > any unprocessed file older than the latest processed file. All files will be > processed. > 2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge > thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is > not set, then all old files get processed in the first batch, and so no file is > left behind. > 3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch > processes the latest X files. That sets the threshold to (latest file - > maxFileAge), so files older than this threshold will never be considered for > processing. > The bug is with case 3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
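The three cases in the ticket can be checked with a toy model of the described bookkeeping. This is an illustration only, not Spark's actual FileStreamSource code; "age" here means days older than the newest file, so smaller is newer:

```python
# Toy model of the maxFileAge thresholding described in SPARK-19813.
def run_stream(file_ages, latest_first, max_files_per_trigger=None, max_file_age=7):
    # latestFirst=True processes newest (smallest age) first.
    pending = sorted(file_ages, reverse=not latest_first)
    processed, threshold = [], None
    while pending:
        batch = pending[:max_files_per_trigger or len(pending)]
        pending = pending[len(batch):]
        # Files older than (newest processed file - maxFileAge) are ignored.
        batch = [f for f in batch if threshold is None or f < threshold]
        if not batch:
            break  # everything left is past the threshold: silently skipped
        processed += batch
        threshold = min(processed) + max_file_age  # newest file sets the cutoff

    return processed

# Case 3: latestFirst=true, maxFilesPerTrigger=2, 21 files spanning 20 days.
case3 = run_stream(list(range(21)), latest_first=True, max_files_per_trigger=2)
```

Cases 1 and 2 process all 21 files in this model, while case 3 only ever reaches the files newer than the 7-day cutoff, matching the bug report.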
[jira] [Updated] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC
[ https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon King updated SPARK-19814: --- Description: Spark History Server runs out of memory, gets into GC thrash and eventually becomes unresponsive. This seems to happen more quickly with heavy use of the REST API. We've seen this with several versions of Spark. Running with the following settings (spark 2.1): spark.history.fs.cleaner.enabled true spark.history.fs.cleaner.interval 1d spark.history.fs.cleaner.maxAge 7d spark.history.retainedApplications 500 We will eventually get errors like: 17/02/25 05:02:19 WARN ServletHandler: javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
[jira] [Created] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC
Simon King created SPARK-19814: -- Summary: Spark History Server Out Of Memory / Extreme GC Key: SPARK-19814 URL: https://issues.apache.org/jira/browse/SPARK-19814 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0, 2.0.0, 1.6.1 Environment: Spark History Server (we've run it on several different Hadoop distributions) Reporter: Simon King Spark History Server runs out of memory, gets into GC thrash and eventually becomes unresponsive. This seems to happen more quickly with heavy use of the REST API. We've seen this with several versions of Spark. Running with the following settings (spark 2.1): {{spark.history.fs.cleaner.enabled true spark.history.fs.cleaner.interval 1d spark.history.fs.cleaner.maxAge 7d spark.history.retainedApplications 500}} We will eventually get errors like: {{17/02/25 05:02:19 WARN ServletHandler: javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError)}}
[jira] [Updated] (SPARK-19690) Join a streaming DataFrame with a batch DataFrame may not work
[ https://issues.apache.org/jira/browse/SPARK-19690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19690: - Priority: Critical (was: Major) > Join a streaming DataFrame with a batch DataFrame may not work > -- > > Key: SPARK-19690 > URL: https://issues.apache.org/jira/browse/SPARK-19690 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.3, 2.1.0, 2.1.1 >Reporter: Shixiong Zhu >Priority: Critical > > When joining a streaming DataFrame with a batch DataFrame, if the batch > DataFrame has an aggregation, it will be converted to a streaming physical > aggregation. Then the query will crash. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18258: - Target Version/s: (was: 2.2.0) > Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink API that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. > After SPARK-17829 is complete and offsets have a .json method, an API for > this ticket might look like > {code} > trait Sink { > def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: > OffsetSeq): Unit > } > {code} > where start and end were provided by StreamExecution.runBatch using > committedOffsets and availableOffsets. > I'm not 100% certain that the offsets in the seq could always be mapped back > to the correct source when restarting complicated multi-source jobs, but I > think it'd be sufficient. Passing the string/json representation of the seq > instead of the seq itself would probably be sufficient as well, but the > convention of rendering a None as "-" in the json is maybe a little > idiosyncratic to parse, and the constant defining that is private. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
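The transactional pattern the ticket asks for — batch results and the ending offset committed atomically, with an idempotence check on replay — can be sketched against SQLite. The table names and the JSON offset string below are illustrative, not part of any Spark API:

```python
import sqlite3

# Sketch of an "exactly-once" sink: commit a batch's rows and its ending
# offset in ONE transaction, so a restart can resume from the stored offset
# without duplicating output.
def add_batch(conn, batch_id, rows, end_offset):
    cur = conn.cursor()
    # Idempotence guard: a batch whose offset is already committed is a replay.
    cur.execute("SELECT 1 FROM offsets WHERE batch_id = ?", (batch_id,))
    if cur.fetchone():
        return
    cur.executemany("INSERT INTO results(value) VALUES (?)", [(r,) for r in rows])
    cur.execute("INSERT INTO offsets(batch_id, end_offset) VALUES (?, ?)",
                (batch_id, end_offset))
    conn.commit()  # rows and offset become visible atomically

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results(value TEXT)")
conn.execute("CREATE TABLE offsets(batch_id INTEGER PRIMARY KEY, end_offset TEXT)")
add_batch(conn, 0, ["a", "b"], '{"kafka": 42}')
add_batch(conn, 0, ["a", "b"], '{"kafka": 42}')  # replayed batch: no duplicates
```

This is exactly why the ticket wants the offset representation passed to `Sink.addBatch`: without it, the sink has only `batchId` to key the guard on, and the stored position cannot be handed back to another engine.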
[jira] [Created] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource
Burak Yavuz created SPARK-19813: --- Summary: maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource Key: SPARK-19813 URL: https://issues.apache.org/jira/browse/SPARK-19813 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Burak Yavuz Assignee: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19774) StreamExecution should call stop() on sources when a stream fails
[ https://issues.apache.org/jira/browse/SPARK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19774. -- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > StreamExecution should call stop() on sources when a stream fails > - > > Key: SPARK-19774 > URL: https://issues.apache.org/jira/browse/SPARK-19774 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 2.1.1, 2.2.0 > > > We call stop() on a Structured Streaming Source only when the stream is > shutdown when a user calls streamingQuery.stop(). We should actually stop all > sources when the stream fails as well, otherwise we may leak resources, e.g. > connections to Kafka. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19701) the `in` operator in pyspark is broken
[ https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894796#comment-15894796 ] Wenchen Fan commented on SPARK-19701: - let's remove it then > the `in` operator in pyspark is broken > -- > > Key: SPARK-19701 > URL: https://issues.apache.org/jira/browse/SPARK-19701 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Wenchen Fan > > {code} > >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md") > >>> linesWithSpark = textFile.filter("Spark" in textFile.value) > Traceback (most recent call last): > File "", line 1, in > File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, > in __nonzero__ > raise ValueError("Cannot convert column into bool: please use '&' for > 'and', '|' for 'or', " > ValueError: Cannot convert column into bool: please use '&' for 'and', '|' > for 'or', '~' for 'not' when building DataFrame boolean expressions. > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
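The failure mode is generic Python rather than PySpark-specific: `in` truth-tests whatever `__contains__` returns, and a Column-like object that refuses truth-testing raises at that point. A minimal stand-in (the `FakeColumn` class is invented for illustration; it only mimics the relevant behavior of `pyspark.sql.Column`):

```python
class FakeColumn:
    """Mimics pyspark.sql.Column: operators build lazy expressions."""
    def __eq__(self, other):
        return FakeColumn()      # comparisons return a new expression
    def __contains__(self, item):
        return FakeColumn()      # so does membership, instead of a bool
    def __bool__(self):          # __nonzero__ on Python 2
        raise ValueError("Cannot convert column into bool: please use "
                         "'&' for 'and', '|' for 'or'")

col = FakeColumn()
try:
    # Python coerces the result of __contains__ to bool, which raises.
    "Spark" in col
    failed = False
except ValueError:
    failed = True
```

In real PySpark the supported spelling is an explicit Column expression, e.g. `textFile.filter(textFile.value.contains("Spark"))`.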
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894768#comment-15894768 ] Marcelo Vanzin commented on SPARK-18085: bq. does this local db will delete the data as specified by the configuration? The existing log cleaner functionality will be maintained, so the application logs will be cleaned the same way they are today. For the new local DBs, I kinda touch on that in the document. My current plan is to first have a configuration for the maximum amount of data the SHS can use locally (and use a LRU-style approach to delete local DBs), and eventually cache these DBs in remote storage (e.g. HDFS) so that they don't need to be re-created (which can be expensive). > Better History Server scalability for many / large applications > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
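The "maximum local disk budget plus LRU eviction" plan in the comment above can be sketched as follows. This is a hypothetical cache, not actual History Server code; class and method names are invented:

```python
from collections import OrderedDict

class LocalDbCache:
    """Tracks per-application local DBs and evicts least-recently-used
    ones once their total size exceeds a configured byte budget."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.dbs = OrderedDict()  # app_id -> size_bytes, LRU order first

    def touch(self, app_id, size_bytes):
        """Record (re)use of a DB; return app ids evicted to fit budget."""
        if app_id in self.dbs:
            self.dbs.move_to_end(app_id)        # mark as recently used
        else:
            self.dbs[app_id] = size_bytes
        evicted = []
        while sum(self.dbs.values()) > self.max_bytes and len(self.dbs) > 1:
            old_id, _ = self.dbs.popitem(last=False)  # drop LRU entry
            evicted.append(old_id)
        return evicted

cache = LocalDbCache(max_bytes=100)
cache.touch("app-1", 60)
cache.touch("app-2", 30)
evicted = cache.touch("app-3", 40)  # total 130 > 100, evict oldest
```

In the proposed design an evicted DB would later be re-created from the event log (or fetched from a remote cache such as HDFS) rather than lost.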
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894701#comment-15894701 ] Apache Spark commented on SPARK-18278: -- User 'erikerlandson' has created a pull request for this issue: https://github.com/apache/spark/pull/16061 > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19710) Test Failures in SQLQueryTests on big endian platforms
[ https://issues.apache.org/jira/browse/SPARK-19710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19710. --- Resolution: Fixed Assignee: Pete Robbins Fix Version/s: 2.2.0 > Test Failures in SQLQueryTests on big endian platforms > -- > > Key: SPARK-19710 > URL: https://issues.apache.org/jira/browse/SPARK-19710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Pete Robbins >Assignee: Pete Robbins >Priority: Minor > Fix For: 2.2.0 > > > Some of the new test queries introduced by > https://issues.apache.org/jira/browse/SPARK-18871 fail when run on zLinux > (big endian) > The order of the returned rows differs from the results file, hence the > failures, but the results are valid for the queries, as insufficient ordering > is specified to give deterministic results. > The failing tests are in o.a.s.SQLQueryTestSuite > in-joins.sql > not-in-joins.sql > in-set-operations.sql > These can be fixed by adding columns to the ORDER BY clauses to determine the > resulting row order. > A PR is on its way -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories
[ https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894584#comment-15894584 ] Thomas Graves commented on SPARK-19812: --- note that it will go ahead and start using the recovery db, it just doesn't copy over the old one so anything running gets lost. > YARN shuffle service fails to relocate recovery DB directories > -- > > Key: SPARK-19812 > URL: https://issues.apache.org/jira/browse/SPARK-19812 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.1 >Reporter: Thomas Graves >Assignee: Thomas Graves > > The yarn shuffle service tries to switch from the yarn local directories to > the real recovery directory but can fail to move the existing recovery db's. > It fails due to Files.move not doing directories that have contents. > 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move > recovery file sparkShuffleRecovery.ldb to the path > /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle > java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb > at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498) > at > sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) > at java.nio.file.Files.move(Files.java:1395) > at > org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369) > at > org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200) > at > org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684) > This used to use f.renameTo; we switched it in the PR due to review comments, and it looks like we didn't do a final real test. The tests use files rather than directories, so they didn't catch this. We need to fix the test as well. > history: > https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
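The underlying problem, a plain rename-style move that cannot handle a non-empty directory, has a standard fix: try the atomic rename first and fall back to a recursive copy plus delete. A sketch of that fallback in Python (the actual Java fix would do the equivalent with NIO; the function name and paths here are illustrative):

```python
import os
import shutil
import tempfile

def move_recovery_db(src, dst):
    """Move a recovery DB directory, falling back to a recursive
    copy + delete when an atomic rename is not possible."""
    try:
        os.rename(src, dst)       # fast path: atomic on same filesystem
    except OSError:
        shutil.copytree(src, dst) # slow path: works across filesystems
        shutil.rmtree(src)

# Demo on a throwaway temp tree with one file inside the DB directory.
root = tempfile.mkdtemp()
src = os.path.join(root, "sparkShuffleRecovery.ldb")
os.makedirs(src)
with open(os.path.join(src, "CURRENT"), "w") as f:
    f.write("manifest")
dst = os.path.join(root, "recovery", "sparkShuffleRecovery.ldb")
os.makedirs(os.path.dirname(dst))
move_recovery_db(src, dst)
```

After the move, the directory contents travel with it and the source path is gone, which is the behavior the shuffle service needs when relocating the recovery DB.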
[jira] [Updated] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories
[ https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-19812: -- Summary: YARN shuffle service fails to relocate recovery DB directories (was: YARN shuffle service fix moving recovery DB directories) > YARN shuffle service fails to relocate recovery DB directories > -- > > Key: SPARK-19812 > URL: https://issues.apache.org/jira/browse/SPARK-19812 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.1 >Reporter: Thomas Graves >Assignee: Thomas Graves > > The yarn shuffle service tries to switch from the yarn local directories to > the real recovery directory but can fail to move the existing recovery db's. > It fails due to Files.move not doing directories that have contents. > 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move > recovery file sparkShuffleRecovery.ldb to the path > /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle > java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb > at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498) > at > sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) > at java.nio.file.Files.move(Files.java:1395) > at > org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369) > at > org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200) > at > org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684) > This used to use f.renameTo; we switched it in the PR due to review comments, and it looks like we didn't do a final real test. The tests use files rather than directories, so they didn't catch this. We need to fix the test as well. > history: > https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19812) YARN shuffle service fix moving recovery DB directories
Thomas Graves created SPARK-19812: - Summary: YARN shuffle service fix moving recovery DB directories Key: SPARK-19812 URL: https://issues.apache.org/jira/browse/SPARK-19812 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.0.1 Reporter: Thomas Graves Assignee: Thomas Graves The yarn shuffle service tries to switch from the yarn local directories to the real recovery directory but can fail to move the existing recovery db's. It fails due to Files.move not doing directories that have contents. 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move recovery file sparkShuffleRecovery.ldb to the path /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498) at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) at java.nio.file.Files.move(Files.java:1395) at org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369) at org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200) at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357) at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684) This used to use f.renameTo; we switched it in the PR due to review comments, and it looks like we didn't do a final real test. The tests use files rather than directories, so they didn't catch this. We need to fix the test as well. history: https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18389) Disallow cyclic view reference
[ https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18389: Assignee: (was: Apache Spark) > Disallow cyclic view reference > -- > > Key: SPARK-18389 > URL: https://issues.apache.org/jira/browse/SPARK-18389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The following should not be allowed: > {code} > CREATE VIEW testView AS SELECT id FROM jt > CREATE VIEW testView2 AS SELECT id FROM testView > ALTER VIEW testView AS SELECT * FROM testView2 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18389) Disallow cyclic view reference
[ https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18389: Assignee: Apache Spark > Disallow cyclic view reference > -- > > Key: SPARK-18389 > URL: https://issues.apache.org/jira/browse/SPARK-18389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > The following should not be allowed: > {code} > CREATE VIEW testView AS SELECT id FROM jt > CREATE VIEW testView2 AS SELECT id FROM testView > ALTER VIEW testView AS SELECT * FROM testView2 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18389) Disallow cyclic view reference
[ https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894571#comment-15894571 ] Apache Spark commented on SPARK-18389: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/17152 > Disallow cyclic view reference > -- > > Key: SPARK-18389 > URL: https://issues.apache.org/jira/browse/SPARK-18389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The following should not be allowed: > {code} > CREATE VIEW testView AS SELECT id FROM jt > CREATE VIEW testView2 AS SELECT id FROM testView > ALTER VIEW testView AS SELECT * FROM testView2 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
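Disallowing the ALTER VIEW above amounts to checking that the change would not make the view-dependency graph cyclic. A small reachability sketch (not Catalyst code; the dictionary-of-lists graph representation is invented for illustration):

```python
def has_cycle(deps, start):
    """deps maps a view name to the relations it references; detect
    whether `start` can reach itself by following references."""
    seen = set()
    stack = list(deps.get(start, ()))
    while stack:
        v = stack.pop()
        if v == start:
            return True          # view transitively references itself
        if v not in seen:
            seen.add(v)
            stack.extend(deps.get(v, ()))
    return False

# CREATE VIEW testView AS SELECT id FROM jt
# CREATE VIEW testView2 AS SELECT id FROM testView
deps = {"testView": ["jt"], "testView2": ["testView"]}

# ALTER VIEW testView AS SELECT * FROM testView2 would produce:
proposed = dict(deps, testView=["testView2"])
```

The check rejects `proposed` (testView -> testView2 -> testView) while accepting the original `deps`, which is exactly the behavior this sub-task asks for.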
[jira] [Resolved] (SPARK-19758) Casting string to timestamp in inline table definition fails with AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-19758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19758. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.2.0 > Casting string to timestamp in inline table definition fails with > AnalysisException > --- > > Key: SPARK-19758 > URL: https://issues.apache.org/jira/browse/SPARK-19758 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Liang-Chi Hsieh >Priority: Blocker > Fix For: 2.2.0 > > > The following query runs succesfully on Spark 2.1.x but fails in the current > master: > {code} > sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES > TIMESTAMP('1991-12-06 00:00:00.0')""") > {code} > Here's the error: > {code} > scala> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES > TIMESTAMP('1991-12-06 00:00:00.0')""") > org.apache.spark.sql.AnalysisException: failed to evaluate expression > CAST('1991-12-06 00:00:00.0' AS TIMESTAMP): None.get; line 1 pos 50 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:105) > at > org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:95) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:95) > at > 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:94) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.convert(ResolveInlineTables.scala:94) > at > org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:36) > at > org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:32) > at > org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:31) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:128) > at >
[jira] [Commented] (SPARK-15797) To expose groupingSets for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894509#comment-15894509 ] Pau Tallada Crespí commented on SPARK-15797: Hi, any progress on this? :P > To expose groupingSets for DataFrame > > > Key: SPARK-15797 > URL: https://issues.apache.org/jira/browse/SPARK-15797 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1 >Reporter: Priyanka Garg > > Currently, the cube and rollup functions are exposed on DataFrame, but grouping sets are not. > For eg. > df.rollup($"department", $"group", $"designation").avg() results in: > a. All combinations of department, group and designation > b. All combinations of department and group, taking designation as null > c. All departments, taking group and designation as null > d. Taking department and group both as null (i.e. aggregating over the complete > data) > Along the same lines, there should be a groupingSets function in which > custom groupings can be specified. > For eg. > df.groupingSets(($"department", $"group", $"designation"), ($"group"), ($"designation"), ()).avg() > This should result in: > 1. All combinations of department, group and designation > 2. All values of group, taking department and designation as null > 3. All values of designation, taking department and group as null. > 4. Aggregation on the complete data, i.e. taking designation, group and department > as null. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
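The requested semantics, cube and rollup as special cases of arbitrary grouping sets, can be sketched over plain dictionaries. This is illustrative only; the proposed DataFrame `groupingSets` API in the ticket does not exist, and the helper below simply mirrors SQL GROUPING SETS behavior where columns absent from a set appear as null:

```python
from collections import defaultdict

def grouping_sets(rows, sets, value_key):
    """For each grouping set (a tuple of column names), average
    `value_key` grouped by those columns; () aggregates everything.
    Columns absent from a set appear as None in the result key."""
    out = {}
    all_cols = sorted({c for s in sets for c in s})
    for s in sets:
        groups = defaultdict(list)
        for row in rows:
            key = tuple(row[c] if c in s else None for c in all_cols)
            groups[key].append(row[value_key])
        for key, vals in groups.items():
            out[key] = sum(vals) / len(vals)
    return out

rows = [
    {"department": "eng", "group": "a", "salary": 100},
    {"department": "eng", "group": "b", "salary": 200},
    {"department": "ops", "group": "a", "salary": 300},
]

# Two grouping sets: per-department averages, plus the grand average.
result = grouping_sets(rows, [("department",), ()], "salary")
```

Here `result` holds one row per department plus a `(None,)` row for the overall average, which is what item 4 of the description asks for.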
[jira] [Commented] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()
[ https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894496#comment-15894496 ] Herman van Hovell commented on SPARK-19503: --- We do not prune local sorts yet; however a user can explicitly request those. The query should return the requested physical layout, but other than that we should just prune unneeded shuffles and sorts. > Execution Plan Optimizer: avoid sort or shuffle when it does not change end > result such as df.sort(...).count() > --- > > Key: SPARK-19503 > URL: https://issues.apache.org/jira/browse/SPARK-19503 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 > Environment: Perhaps only a pyspark or databricks AWS issue >Reporter: R >Priority: Minor > Labels: execution, optimizer, plan, query > > df.sort(...).count() > performs shuffle and sort and then count! This is wasteful as sort is not > required here and makes me wonder how smart the algebraic optimiser is > indeed! The data may be partitioned by known count (such as parquet files) > and we should not shuffle to just perform count. > This may look trivial, but if optimiser fails to recognise this, I wonder > what else is it missing especially in more complex operations. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
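The rewrite being discussed is an order-insensitivity rule: a global count does not depend on row order, so a sort feeding directly into a count-style aggregate can be dropped. A toy plan-rewrite sketch (not Catalyst; the `Node` class and node kinds are invented for illustration):

```python
class Node:
    """Minimal logical-plan node: a kind and an optional child."""
    def __init__(self, kind, child=None):
        self.kind = kind
        self.child = child

def prune_sorts(plan):
    """Remove Sort nodes directly under an order-insensitive Count."""
    if plan is None:
        return None
    if plan.kind == "Count" and plan.child and plan.child.kind == "Sort":
        plan.child = plan.child.child    # count(sort(x)) == count(x)
    if plan.child:
        plan.child = prune_sorts(plan.child)
    return plan

# df.sort(...).count() builds roughly: Count -> Sort -> Scan
plan = Node("Count", Node("Sort", Node("Scan")))
optimized = prune_sorts(plan)
```

A real optimizer rule would also have to prove the sort is not observable elsewhere (e.g. when the user explicitly requested a physical layout, as noted in the comment above) before pruning it.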
[jira] [Updated] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ari Gesher updated SPARK-19764: --- There's nothing output in the driver. It just appears hung. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
[ https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894402#comment-15894402 ] Jakub Dubovsky commented on SPARK-16599: [~srowen] I tried to create a custom spark build with the change you suggested above. But I am unable to install it locally (see below). I asked on spark dev mailing list but nobody really helped. So I try to post it here. [This is a change|https://gist.github.com/james64/cc158bdb81bc1828937c757fde94ce82] I did to spark on v2.1.0 tag. And [this is a build output|https://gist.github.com/james64/85b3bf4613e7105bebd687502258a518] I got when tried to run this: ./build/mvn -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -Dhadoop.version=2.6.0-cdh5.7.1 clean install I believe the profile selection and versions are right because this was successful: ./dev/make-distribution.sh --name spark-custom-lock --tgz -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -Dhadoop.version=2.6.0-cdh5.7.1 > java.util.NoSuchElementException: None.get at at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343) > -- > > Key: SPARK-16599 > URL: https://issues.apache.org/jira/browse/SPARK-16599 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: centos 6.7 spark 2.0 >Reporter: binde > > run a spark job with spark 2.0, error message > Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343) > at > org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19810) Remove support for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19810: Assignee: Apache Spark (was: Sean Owen) > Remove support for Scala 2.10 > - > > Key: SPARK-19810 > URL: https://issues.apache.org/jira/browse/SPARK-19810 > Project: Spark > Issue Type: Task > Components: ML, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Critical > > This tracks the removal of Scala 2.10 support, as discussed in > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html > and other lists. > The primary motivations are to simplify the code and build, and to enable > Scala 2.12 support later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19810) Remove support for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19810: Assignee: Sean Owen (was: Apache Spark) > Remove support for Scala 2.10 > - > > Key: SPARK-19810 > URL: https://issues.apache.org/jira/browse/SPARK-19810 > Project: Spark > Issue Type: Task > Components: ML, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Critical > > This tracks the removal of Scala 2.10 support, as discussed in > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html > and other lists. > The primary motivations are to simplify the code and build, and to enable > Scala 2.12 support later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19810) Remove support for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894267#comment-15894267 ] Apache Spark commented on SPARK-19810: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/17150 > Remove support for Scala 2.10 > - > > Key: SPARK-19810 > URL: https://issues.apache.org/jira/browse/SPARK-19810 > Project: Spark > Issue Type: Task > Components: ML, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Critical > > This tracks the removal of Scala 2.10 support, as discussed in > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html > and other lists. > The primary motivations are to simplify the code and build, and to enable > Scala 2.12 support later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19811) sparksql 2.1 can not prune hive partition
sydt created SPARK-19811: Summary: sparksql 2.1 can not prune hive partition Key: SPARK-19811 URL: https://issues.apache.org/jira/browse/SPARK-19811 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: sydt When Spark SQL 2.1 executes a query, it fails with: java.lang.RuntimeException: Expected only partition pruning predicates: (isnotnull(DAY_ID#216) && (DAY_ID#216 = 20170212)) The SQL statement is: select PROD_INST_ID from CRM_DB.ITG_PROD_INST WHERE DAY_ID='20170212' AND PROV_ID ='842' limit 10; where DAY_ID and PROV_ID are partition columns in Hive. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19810) Remove support for Scala 2.10
Sean Owen created SPARK-19810: - Summary: Remove support for Scala 2.10 Key: SPARK-19810 URL: https://issues.apache.org/jira/browse/SPARK-19810 Project: Spark Issue Type: Task Components: ML, Spark Core, SQL Affects Versions: 2.1.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Critical This tracks the removal of Scala 2.10 support, as discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html and other lists. The primary motivations are to simplify the code and build, and to enable Scala 2.12 support later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16773) Post Spark 2.0 deprecation & warnings cleanup
[ https://issues.apache.org/jira/browse/SPARK-16773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16773. --- Resolution: Done > Post Spark 2.0 deprecation & warnings cleanup > - > > Key: SPARK-16773 > URL: https://issues.apache.org/jira/browse/SPARK-16773 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark, Spark Core, SQL >Reporter: holdenk > > As part of the 2.0 release we deprecated a number of different internal > components (one of the largest ones being the old accumulator API), and also > upgraded our default build to Scala 2.11. > This has added a large number of deprecation warnings (internal and external) > - some of which can be worked around - and some of which can't (mostly in the > Scala 2.10 -> 2.11 reflection API and various tests). > We should attempt to limit the number of warnings in our build so that we can > notice new ones and thoughtfully consider if they are warranted. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16775) Reduce internal warnings from deprecated accumulator API
[ https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894209#comment-15894209 ] Sean Owen commented on SPARK-16775: --- Are there still areas where uses of deprecated accumulators can be changed? I'm aware they're still referenced from tests, but they kind of have to be in at least most of those cases. > Reduce internal warnings from deprecated accumulator API > > > Key: SPARK-16775 > URL: https://issues.apache.org/jira/browse/SPARK-16775 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: holdenk > > Deprecating the old accumulator API added a large number of warnings - many > of these could be fixed with a bit of refactoring to offer a non-deprecated > internal class while still preserving the external deprecation warnings. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16775) Reduce internal warnings from deprecated accumulator API
[ https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16775: -- Affects Version/s: 2.1.0 Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-16773) > Reduce internal warnings from deprecated accumulator API > > > Key: SPARK-16775 > URL: https://issues.apache.org/jira/browse/SPARK-16775 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: holdenk > > Deprecating the old accumulator API added a large number of warnings - many > of these could be fixed with a bit of refactoring to offer a non-deprecated > internal class while still preserving the external deprecation warnings. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19807) Add reason for cancellation when a stage is killed using web UI
[ https://issues.apache.org/jira/browse/SPARK-19807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894185#comment-15894185 ] Genmao Yu edited comment on SPARK-19807 at 3/3/17 11:35 AM: !https://cloud.githubusercontent.com/assets/7402327/23549702/6a0c93f6-0048-11e7-8a3f-bf58befb887b.png! Do you mean the "Job 0 cancelled" in picture? was (Author: unclegen): !https://cloud.githubusercontent.com/assets/7402327/23549478/70888646-0047-11e7-8e2c-e64a3db43711.png! Do you mean the "Job 0 cancelled" in picture? > Add reason for cancellation when a stage is killed using web UI > --- > > Key: SPARK-19807 > URL: https://issues.apache.org/jira/browse/SPARK-19807 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Jacek Laskowski >Priority: Trivial > > When a user kills a stage using web UI (in Stages page), > {{StagesTab.handleKillRequest}} requests {{SparkContext}} to cancel the stage > without giving a reason. {{SparkContext}} has {{cancelStage(stageId: Int, > reason: String)}} that Spark could use to pass the information for > monitoring/debugging purposes. > {code} > scala> sc.range(0, 5, 1, 1).mapPartitions { nums => { Thread.sleep(60 * > 1000); nums } }.count > {code} > Use http://localhost:4040/stages/ and click Kill. 
> {code} > org.apache.spark.SparkException: Job 0 cancelled because Stage 0 was cancelled > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1426) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply$mcVI$sp(DAGScheduler.scala:1415) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply(DAGScheduler.scala:1408) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply(DAGScheduler.scala:1408) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:234) > at > org.apache.spark.scheduler.DAGScheduler.handleStageCancellation(DAGScheduler.scala:1408) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1670) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1656) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1645) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2019) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2040) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2059) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084) > at org.apache.spark.rdd.RDD.count(RDD.scala:1158) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19807) Add reason for cancellation when a stage is killed using web UI
[ https://issues.apache.org/jira/browse/SPARK-19807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894185#comment-15894185 ] Genmao Yu commented on SPARK-19807: --- !https://cloud.githubusercontent.com/assets/7402327/23549478/70888646-0047-11e7-8e2c-e64a3db43711.png! Do you mean the "Job 0 cancelled" in picture? > Add reason for cancellation when a stage is killed using web UI > --- > > Key: SPARK-19807 > URL: https://issues.apache.org/jira/browse/SPARK-19807 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Jacek Laskowski >Priority: Trivial > > When a user kills a stage using web UI (in Stages page), > {{StagesTab.handleKillRequest}} requests {{SparkContext}} to cancel the stage > without giving a reason. {{SparkContext}} has {{cancelStage(stageId: Int, > reason: String)}} that Spark could use to pass the information for > monitoring/debugging purposes. > {code} > scala> sc.range(0, 5, 1, 1).mapPartitions { nums => { Thread.sleep(60 * > 1000); nums } }.count > {code} > Use http://localhost:4040/stages/ and click Kill. 
> {code} > org.apache.spark.SparkException: Job 0 cancelled because Stage 0 was cancelled > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1426) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply$mcVI$sp(DAGScheduler.scala:1415) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply(DAGScheduler.scala:1408) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply(DAGScheduler.scala:1408) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:234) > at > org.apache.spark.scheduler.DAGScheduler.handleStageCancellation(DAGScheduler.scala:1408) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1670) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1656) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1645) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2019) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2040) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2059) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084) > at org.apache.spark.rdd.RDD.count(RDD.scala:1158) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19782) Spark query available cores from application
[ https://issues.apache.org/jira/browse/SPARK-19782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19782. --- Resolution: Not A Problem > Spark query available cores from application > > > Key: SPARK-19782 > URL: https://issues.apache.org/jira/browse/SPARK-19782 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Tom Lewis > > It might be helpful for Spark jobs to self-regulate resources if they could > query how many cores exist on an executing system, not just how many are > being used at a given time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19701) the `in` operator in pyspark is broken
[ https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894159#comment-15894159 ] Hyukjin Kwon commented on SPARK-19701: -- I was thinking of a way to work around this (e.g., hijacking), but it seems we can't. BTW, the code below seems to throw a {{TypeError}} if {{__nonzero__}} or {{__bool__}} returns another type. {code} class Column(object): def __contains__(self, item): print "I am contains" return Column() def __nonzero__(self): return "a" >>> 1 in Column() I am contains Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: __nonzero__ should return bool or int, returned str {code} > the `in` operator in pyspark is broken > -- > > Key: SPARK-19701 > URL: https://issues.apache.org/jira/browse/SPARK-19701 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Wenchen Fan > > {code} > >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md") > >>> linesWithSpark = textFile.filter("Spark" in textFile.value) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, > in __nonzero__ > raise ValueError("Cannot convert column into bool: please use '&' for > 'and', '|' for 'or', " > ValueError: Cannot convert column into bool: please use '&' for 'and', '|' > for 'or', '~' for 'not' when building DataFrame boolean expressions. > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19701) the `in` operator in pyspark is broken
[ https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894155#comment-15894155 ] Hyukjin Kwon commented on SPARK-19701: -- [~cloud_fan], I took a look at this out of curiosity. It seems this is what happens now: {code} class Column(object): def __contains__(self, item): print "I am contains" return Column() def __nonzero__(self): raise Exception("I am nonzero.") >>> 1 in Column() I am contains Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<stdin>", line 6, in __nonzero__ Exception: I am nonzero. {code} It seems it calls {{__contains__}} first, and then {{__nonzero__}} or {{__bool__}} is called against {{Column()}} to turn the result into a bool. It seems {{__nonzero__}} (for Python 2), {{__bool__}} (for Python 3) and {{__contains__}} force the return value into a bool, unlike other operators. I referred to the references below to check my assumption: http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378 http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777 I tested the code above in 1.6.3, 2.1.0 and in the master branch. It seems it has never worked. Should we maybe remove this? 
> the `in` operator in pyspark is broken > -- > > Key: SPARK-19701 > URL: https://issues.apache.org/jira/browse/SPARK-19701 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Wenchen Fan > > {code} > >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md") > >>> linesWithSpark = textFile.filter("Spark" in textFile.value) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, > in __nonzero__ > raise ValueError("Cannot convert column into bool: please use '&' for > 'and', '|' for 'or', " > ValueError: Cannot convert column into bool: please use '&' for 'and', '|' > for 'or', '~' for 'not' when building DataFrame boolean expressions. > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
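The coercion described in the comments above can be reproduced in plain Python, independent of Spark: whatever {{__contains__}} returns is forced through {{__bool__}} by the {{in}} operator, so a non-boolean result cannot escape. A minimal sketch with a hypothetical Column stand-in (not PySpark's actual class):

```python
class Column:
    """Toy stand-in for pyspark.sql.Column (hypothetical, for illustration)."""

    def __contains__(self, item):
        # The `in` operator coerces whatever we return here to bool.
        return Column()

    def __bool__(self):  # __nonzero__ in Python 2
        # PySpark's Column raises here, which is why `x in col` blows up.
        raise ValueError("Cannot convert column into bool")


try:
    1 in Column()
    result = "no error"
except ValueError as e:
    result = str(e)

print(result)
```

Because the coercion happens inside the interpreter's membership-test machinery, there is no hook for `Column` to return a lazy expression from `in`, which matches the conclusion in the thread that it cannot be worked around.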
[jira] [Assigned] (SPARK-19801) Remove JDK7 from Travis CI
[ https://issues.apache.org/jira/browse/SPARK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19801: - Assignee: Dongjoon Hyun > Remove JDK7 from Travis CI > -- > > Key: SPARK-19801 > URL: https://issues.apache.org/jira/browse/SPARK-19801 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.2.0 > > > Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR > verification (JDK7/JDK8 maven compilation and Java Linter) and contributors > can see the additional result via their Travis CI dashboard (or PC). > This issue aims to make `.travis.yml` up-to-date by removing JDK7 which was > removed via SPARK-19550. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19801) Remove JDK7 from Travis CI
[ https://issues.apache.org/jira/browse/SPARK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19801. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17143 [https://github.com/apache/spark/pull/17143] > Remove JDK7 from Travis CI > -- > > Key: SPARK-19801 > URL: https://issues.apache.org/jira/browse/SPARK-19801 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 2.2.0 > > > Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR > verification (JDK7/JDK8 maven compilation and Java Linter) and contributors > can see the additional result via their Travis CI dashboard (or PC). > This issue aims to make `.travis.yml` up-to-date by removing JDK7 which was > removed via SPARK-19550. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right
[ https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19792: -- Priority: Trivial (was: Major) Hm, I'm honestly not sure. Does this refer to the memory allocated to each executor by the worker, or does it refer to the amount of memory the worker can assign to executors? > In the Master Page,the column named “Memory per Node” ,I think it is not all > right > --- > > Key: SPARK-19792 > URL: https://issues.apache.org/jira/browse/SPARK-19792 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: liuxian >Priority: Trivial > > On the Master page of the Spark web UI there are two tables, Running > Applications and Completed Applications, each with a column named “Memory per > Node”. I think this name is misleading, because a node may have more than one > executor, so the column should be named “Memory per Executor”. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19797: - Assignee: Zhe Sun > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Assignee: Zhe Sun >Priority: Trivial > Labels: documentation > Fix For: 2.1.1, 2.2.0 > > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. > The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19797. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 17137 [https://github.com/apache/spark/pull/17137] > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Priority: Trivial > Labels: documentation > Fix For: 2.1.1, 2.2.0 > > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. > The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19339) StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next on empty iterator
[ https://issues.apache.org/jira/browse/SPARK-19339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19339. --- Resolution: Duplicate > StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next > on empty iterator > - > > Key: SPARK-19339 > URL: https://issues.apache.org/jira/browse/SPARK-19339 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.2, 2.1.0 >Reporter: Barry Becker >Priority: Minor > > This problem is easy to reproduce by running > StatFunctions.multipleApproxQuantiles on an empty dataset, but I think it can > occur in other cases, like if the column is all null or all one value. > I have unit tests that can hit it in several different cases. > The fix that I have introduced locally is to return > {code} > if (sampled.length == 0) 0 else sampled.last.value > {code} > instead of > {code} > sampled.last.value > {code} > at the end of QuantileSummaries.query. > Below is the exception: > {code} > next on empty iterator > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.last(TraversableLike.scala:459) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132) > at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186) > at > 
org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207) > at > org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(SparkPercentileCalculator.scala:91) > at > org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(SparkPercentileCalculator.scala:91) > at > org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(SparkPercentileCalculator.scala:91) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1.apply(SparkPercentileCalculator.scala:91) > at > org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1.apply(SparkPercentileCalculator.scala:91) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.SparkPercentileCalculator.multipleApproxQuantiles(SparkPercentileCalculator.scala:91) > at > com.mineset.spark.statistics.model.ContinuousMinesetStats.quartiles$lzycompute(ContinuousMinesetStats.scala:274) > at > com.mineset.spark.statistics.model.ContinuousMinesetStats.quartiles(ContinuousMinesetStats.scala:272) > at 
> com.mineset.spark.statistics.model.MinesetStats.com$mineset$spark$statistics$model$MinesetStats$$serializeContinuousFeature$1(MinesetStats.scala:66) > at > com.mineset.spark.statistics.model.MinesetStats$$anonfun$calculateWithColumns$1.apply(MinesetStats.scala:118) > at > com.mineset.spark.statistics.model.MinesetStats$$anonfun$calculateWithColumns$1.apply(MinesetStats.scala:114) > at scala.collection.immutable.List.foreach(List.scala:381) > at >
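The fix proposed in the issue above is a one-line guard in Scala ({{if (sampled.length == 0) 0 else sampled.last.value}}). Its effect can be sketched in Python with a hypothetical, heavily simplified quantile lookup (the real logic lives in QuantileSummaries.scala and operates on compressed sample statistics):

```python
def query(sampled, quantile):
    """Approximate-quantile lookup over sorted samples (simplified sketch).

    With the proposed guard, an empty sample set returns 0 instead of
    failing the way `sampled.last` does on an empty buffer in Scala
    ("next on empty iterator").
    """
    if len(sampled) == 0:
        return 0  # proposed fix: avoid NoSuchElementException on empty input
    # Simplified: index into the sorted samples by rank.
    rank = int(quantile * (len(sampled) - 1))
    return sampled[rank]


print(query([], 0.5))         # 0 instead of an exception
print(query([1, 5, 9], 0.5))  # middle sample
```

The same empty case arises for an all-null column, since nulls are dropped before sampling, which is why the reporter hit it outside of literally empty datasets.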
[jira] [Resolved] (SPARK-19739) SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of AWS env vars
[ https://issues.apache.org/jira/browse/SPARK-19739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19739. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17080 [https://github.com/apache/spark/pull/17080] > SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of > AWS env vars > -- > > Key: SPARK-19739 > URL: https://issues.apache.org/jira/browse/SPARK-19739 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Steve Loughran >Priority: Minor > Fix For: 2.2.0 > > > {{SparkHadoopUtil.appendS3AndSparkHadoopConfigurations()}} propagates the AWS > user and secret key to the s3n and s3a config options, getting secrets from > the user to the cluster, if set. > AWS also supports session authentication (env var {{AWS_SESSION_TOKEN}}) and > region endpoints ({{AWS_DEFAULT_REGION}}), the latter being critical if you > want to address V4-auth-only endpoints like Frankfurt and Seoul. > These env vars should be picked up and passed down to s3a too. It is 4+ lines > of code, though impossible to test unless the existing code is refactored to > take the env var map[String, String], allowing a test suite to set the > values in its own map. > Side issue: what if only half the env vars are set and users are trying to > understand why auth is failing? It may be good to build up a string > identifying which env vars had their values propagated, and log that @ debug, > while not logging the values, obviously. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19739) SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of AWS env vars
[ https://issues.apache.org/jira/browse/SPARK-19739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19739: - Assignee: Genmao Yu > SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of > AWS env vars > -- > > Key: SPARK-19739 > URL: https://issues.apache.org/jira/browse/SPARK-19739 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Steve Loughran >Assignee: Genmao Yu >Priority: Minor > Fix For: 2.2.0 > > > {{SparkHadoopUtil.appendS3AndSparkHadoopConfigurations()}} propagates the AWS > user and secret key to the s3n and s3a config options, getting secrets from > the user to the cluster, if set. > AWS also supports session authentication (env var {{AWS_SESSION_TOKEN}}) and > region endpoints ({{AWS_DEFAULT_REGION}}), the latter being critical if you > want to address V4-auth-only endpoints like Frankfurt and Seoul. > These env vars should be picked up and passed down to s3a too. It is 4+ lines > of code, though impossible to test unless the existing code is refactored to > take the env var map[String, String], allowing a test suite to set the > values in its own map. > Side issue: what if only half the env vars are set and users are trying to > understand why auth is failing? It may be good to build up a string > identifying which env vars had their values propagated, and log that @ debug, > while not logging the values, obviously. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19794) Release HDFS Client after read/write checkpoint
[ https://issues.apache.org/jira/browse/SPARK-19794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19794. --- Resolution: Not A Problem See PR > Release HDFS Client after read/write checkpoint > --- > > Key: SPARK-19794 > URL: https://issues.apache.org/jira/browse/SPARK-19794 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: darion yaphet > > RDD checkpointing writes each partition into HDFS and reads it back from HDFS when the RDD > needs recomputation. After interacting with HDFS, the HDFS client and streams should > be closed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
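The resource-handling pattern the report asks for can be shown generically. This is an illustrative Python sketch, not Spark's checkpoint code: `write_checkpoint` and `TrackingStream` are hypothetical names, and an in-memory buffer stands in for an HDFS output stream. The point is that a context manager guarantees the stream is closed whether the write succeeds or raises.

```python
import io

class TrackingStream(io.BytesIO):
    """In-memory stand-in for an HDFS output stream that records close()."""
    closed_explicitly = False

    def close(self):
        self.closed_explicitly = True
        super().close()

def write_checkpoint(partition_bytes, stream):
    # The with-block invokes stream.close() on exit, even if write() raises,
    # which is the "release the client after checkpoint" behavior requested.
    with stream:
        stream.write(partition_bytes)

s = TrackingStream()
write_checkpoint(b"partition-0", s)
```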
[jira] [Commented] (SPARK-19808) About the default blocking arg in unpersist
[ https://issues.apache.org/jira/browse/SPARK-19808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894087#comment-15894087 ] Sean Owen commented on SPARK-19808: --- (Maybe you can rewrite this as a proposed change rather than a question?) They should be consistent, but I don't think they're worth changing now because it's a behavior change for little gain. Consider also the destroy() and unpersist() operations for broadcasts. However, I have never been sure why an application would want to block waiting on an unpersist operation. For that reason, I think most calls in Spark are blocking=false and I'd personally support making this consistent. That is, unless someone highlights why this sometimes isn't a good idea? > About the default blocking arg in unpersist > --- > > Key: SPARK-19808 > URL: https://issues.apache.org/jira/browse/SPARK-19808 > Project: Spark > Issue Type: Question > Components: ML, Spark Core >Affects Versions: 2.1.0 >Reporter: zhengruifeng >Priority: Minor > > Now, {{unpersist}} is commonly used with its default value in ML. > Most algorithms like {{KMeans}} use {{RDD.unpersist}}, and the default > {{blocking}} is {{true}}. > And meta algorithms like {{OneVsRest}} and {{CrossValidator}} use > {{Dataset.unpersist}}, and the default {{blocking}} is {{false}}. > Should the default values for {{RDD.unpersist}} and {{Dataset.unpersist}} be > consistent? > And should all the {{blocking}} args in ML be set to {{false}}? > [~srowen] [~mlnick] [~yanboliang] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894074#comment-15894074 ] DjvuLee commented on SPARK-18085: - [~vanzin] This is a nice design. There is not much information about deletion, though. The history logs can grow large after a few weeks; will this local db delete data as specified by the configuration? > Better History Server scalability for many / large applications > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894039#comment-15894039 ] Apache Spark commented on SPARK-19257: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/17149 > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
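The motivation for the type change above can be illustrated generically. This is a Python stand-in for the Scala/Java code discussed (using `urllib.parse` in place of `java.net.URI`); the class below is a hypothetical miniature of `CatalogStorageFormat`, not Spark's implementation. Storing the location as a raw string means every consumer must re-parse it (the `new Path(new URI(locationUri))` pattern the issue describes) and handle malformed input at each use site; parsing once at construction surfaces errors early and gives typed access to the components.

```python
from urllib.parse import urlparse

class CatalogStorageFormat:
    """Illustrative miniature: store a parsed URI, not a raw string."""

    def __init__(self, location_uri):
        # Parse eagerly, mirroring the proposal to store java.net.URI
        # instead of String: a malformed URI fails here, once, rather
        # than at every call site that re-parses the string.
        self.location_uri = urlparse(location_uri)

    @property
    def path(self):
        return self.location_uri.path

fmt = CatalogStorageFormat("hdfs://namenode:8020/warehouse/db.db/tbl")
```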
[jira] [Created] (SPARK-19809) NullPointerException on empty ORC file
Michał Dawid created SPARK-19809: Summary: NullPointerException on empty ORC file Key: SPARK-19809 URL: https://issues.apache.org/jira/browse/SPARK-19809 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.0.2, 1.6.3 Reporter: Michał Dawid When reading from hive ORC table if there are some 0 byte files we get NullPointerException: {code}java.lang.NullPointerException at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.zeppelin.spark.ZeppelinContext.showDF(ZeppelinContext.java:209) at org.apache.zeppelin.spark.SparkSqlInterpreter.interpret(SparkSqlInterpreter.java:129){code}
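A common workaround for the NPE above, until the reader handles empty files, is to filter out zero-byte files before building the input path list. The sketch below is illustrative Python using local files; the helper name `non_empty_files` is an assumption, and in a real job the same size check would go through the HDFS FileSystem API rather than `os.path.getsize`.

```python
import os
import tempfile

def non_empty_files(paths):
    """Return only paths whose size is greater than zero bytes."""
    return [p for p in paths if os.path.getsize(p) > 0]

# Demonstrate with two local files: one empty (like the 0-byte ORC files
# described in the report), one with content.
tmpdir = tempfile.mkdtemp()
empty = os.path.join(tmpdir, "part-00000.orc")
full = os.path.join(tmpdir, "part-00001.orc")
open(empty, "w").close()
with open(full, "w") as f:
    f.write("data")

readable = non_empty_files([empty, full])
```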
[jira] [Created] (SPARK-19808) About the default blocking arg in unpersist
zhengruifeng created SPARK-19808: Summary: About the default blocking arg in unpersist Key: SPARK-19808 URL: https://issues.apache.org/jira/browse/SPARK-19808 Project: Spark Issue Type: Question Components: ML, Spark Core Affects Versions: 2.1.0 Reporter: zhengruifeng Priority: Minor Now, {{unpersist}} is commonly used with its default value in ML. Most algorithms like {{KMeans}} use {{RDD.unpersist}}, and the default {{blocking}} is {{true}}. And meta algorithms like {{OneVsRest}} and {{CrossValidator}} use {{Dataset.unpersist}}, and the default {{blocking}} is {{false}}. Should the default values for {{RDD.unpersist}} and {{Dataset.unpersist}} be consistent? And should all the {{blocking}} args in ML be set to {{false}}? [~srowen] [~mlnick] [~yanboliang] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org