[jira] [Assigned] (SPARK-27493) Upgrade ASM to 7.1
[ https://issues.apache.org/jira/browse/SPARK-27493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-27493:
------------------------------------

    Assignee: Dongjoon Hyun

> Upgrade ASM to 7.1
> ------------------
>
>                 Key: SPARK-27493
>                 URL: https://issues.apache.org/jira/browse/SPARK-27493
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>
> SPARK-25946 upgraded ASM to 7.0 to support JDK 11. This issue aims to update ASM to 7.1 to bring in its bug fixes.
> - https://asm.ow2.io/versions.html

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27493) Upgrade ASM to 7.1
[ https://issues.apache.org/jira/browse/SPARK-27493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-27493.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24395
[https://github.com/apache/spark/pull/24395]

> Upgrade ASM to 7.1
> ------------------
>
>                 Key: SPARK-27493
>                 URL: https://issues.apache.org/jira/browse/SPARK-27493
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 3.0.0
>
> SPARK-25946 upgraded ASM to 7.0 to support JDK 11. This issue aims to update ASM to 7.1 to bring in its bug fixes.
> - https://asm.ow2.io/versions.html
[jira] [Updated] (SPARK-27498) Built-in parquet code path does not respect hive.enforce.bucketing
[ https://issues.apache.org/jira/browse/SPARK-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-27498:
----------------------------------

    Description:

_Caveat: I can see how this could be intentional if Spark believes that the built-in Parquet code path is creating Hive-compatible bucketed files. However, I assume that is not the case and that this is an actual bug._

Spark makes an effort to avoid corrupting bucketed Hive tables unless the user overrides this behavior by setting hive.enforce.bucketing and hive.enforce.sorting to false. However, this behavior falls down when Spark uses the built-in Parquet code path to write to the Hive table.

Here's an example. In Hive, do this (I create one table where things work as expected, and one where they don't):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, things work as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table `default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;
{noformat}
For the parquet table, the insert just happens:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []

scala>
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in the HMS!) so that it is no longer a bucketed table. I will file a separate Jira on this (SPARK-27497).

If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as expected.

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but InsertIntoHadoopFsRelationCommand does not. Probably the check should be made in an analyzer rule, while the InsertIntoTable node still holds a HiveTableRelation.

> Built-in parquet code path does not respect hive.enforce.bucketing
> ------------------------------------------------------------------
>
>                 Key: SPARK-27498
>                 URL: https://issues.apache.org/jira/browse/SPARK-27498
>             Project: Spark
>          Issue Type: Bug
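The gap described above can be modeled in a few lines. This is a minimal sketch, not Spark code: the function names only mirror the two code paths named in the report (InsertIntoHiveTable, which performs the check, and InsertIntoHadoopFsRelationCommand, which does not), and the table/conf dictionaries are hypothetical stand-ins for Spark's catalog and SQLConf.

```python
# Minimal model of the enforcement gap: parquet tables are routed to the
# built-in code path (when spark.sql.hive.convertMetastoreParquet=true),
# which skips the hive.enforce.bucketing check entirely.

class AnalysisException(Exception):
    pass

def insert_into_hive_table(table, conf):
    """Serde path: rejects writes to bucketed tables unless enforcement is off."""
    if table["bucketed"] and conf.get("hive.enforce.bucketing", "true") == "true":
        raise AnalysisException(
            f"Output Hive table `{table['name']}` is bucketed but Spark currently "
            "does NOT populate bucketed output which is compatible with Hive.")
    return "inserted"

def insert_into_hadoop_fs_relation(table, conf):
    """Built-in Parquet path: performs no bucketing check (the reported bug)."""
    return "inserted"

def run_insert(table, conf):
    # Parquet tables are converted to the built-in path by default.
    if table["format"] == "parquet" and conf.get(
            "spark.sql.hive.convertMetastoreParquet", "true") == "true":
        return insert_into_hadoop_fs_relation(table, conf)
    return insert_into_hive_table(table, conf)
```

Run against the two tables from the transcript, the model rejects the insert into the text table but lets the parquet insert through; setting convertMetastoreParquet to false restores the rejection, matching the behavior described.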
[jira] [Updated] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats
[ https://issues.apache.org/jira/browse/SPARK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-27497:
----------------------------------

    Description:

The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that has the following characteristics:
- the table was created by Hive (or even by Spark, if you use HQL DDL)
- the table is stored in Parquet format
- the table already has at least one Hive-created data file

For example, do the following in Hive:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebucket1;
hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucket1 select * from sourcetable;
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int,
  `b` int,
  `c` int)
CLUSTERED BY (
  a,
  b)
SORTED BY (
  a ASC,
  b ASC)
INTO 10 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'numRows'='1',
  'rawDataSize'='3',
  'totalSize'='352',
  'transient_lastDdlTime'='142971')
Time taken: 0.056 seconds, Fetched: 26 row(s)
hive>
{noformat}
Then, in spark-shell, do the following:
{noformat}
scala> sql("insert into hivebucket1 select 1, 3, 7")
19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
{noformat}
Note: at this point, I would have expected Spark to throw an {{AnalysisException}} with the message "Output Hive table `default`.`hivebucket1` is bucketed...". However, I am ignoring that for now and may open a separate Jira (SPARK-27498).

Return to the Hive CLI and note that the bucket specification is gone from the table definition:
{noformat}
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int,
  `b` int,
  `c` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  ''
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'SORTBUCKETCOLSPREFIX'='TRUE',
  'numFiles'='2',
  'numRows'='-1',
  'rawDataSize'='-1',
  'totalSize'='1144',
  'transient_lastDdlTime'='123374')
Time taken: 1.619 seconds, Fetched: 20 row(s)
hive>
{noformat}
This information is lost when Spark attempts to update the table stats: HiveClientImpl.toHiveTable drops the bucket specification because {{table.provider}} is None instead of "hive". {{table.provider}} is not "hive" because Spark bypassed the serdes and used the built-in parquet code path (by default, spark.sql.hive.convertMetastoreParquet is true).
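The mechanism described (the bucket spec surviving conversion only when table.provider is "hive") can be sketched as a toy model. This is illustrative only: the function names echo HiveClientImpl.toHiveTable, but the dictionaries are hypothetical stand-ins for Spark's CatalogTable, not the real API.

```python
# Toy model: a stats update round-trips the catalog table through a
# Hive-table conversion that keeps the bucket spec only for provider == "hive",
# so the metastore copy silently loses its bucketing.

def to_hive_table(catalog_table):
    """Models the conversion: bucket spec survives only when provider == 'hive'."""
    hive_table = {"name": catalog_table["name"], "stats": catalog_table.get("stats")}
    if catalog_table.get("provider") == "hive":
        hive_table["bucket_spec"] = catalog_table["bucket_spec"]
    return hive_table

def update_table_stats(catalog_table, new_stats):
    """After a write, the table is altered in the metastore with fresh stats."""
    updated = dict(catalog_table, stats=new_stats)
    return to_hive_table(updated)
```

With provider unset (the built-in parquet path), the altered table comes back without its bucket spec; with provider "hive" (the serde path), the spec survives, which is consistent with the convertMetastoreParquet=false behavior described.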
[jira] [Commented] (SPARK-25422) flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)
[ https://issues.apache.org/jira/browse/SPARK-25422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820707#comment-16820707 ]

Mike Chan commented on SPARK-25422:
-----------------------------------

Could this problem potentially hit Spark 2.3.1 as well? I have a new cluster at this version, and I always hit the corrupt-remote-block error when one specific table is involved.

> flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)
>
>                 Key: SPARK-25422
>                 URL: https://issues.apache.org/jira/browse/SPARK-25422
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Imran Rashid
>            Priority: Major
>             Fix For: 2.4.0
>
> stacktrace
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, localhost, executor 1): java.io.IOException: org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
> 	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320)
> 	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
> 	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> 	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> 	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> 	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:121)
> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
> 	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167)
> 	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
> 	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
> 	at scala.collection.immutable.List.foreach(List.scala:392)
> 	at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
> 	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
> 	at scala.Option.getOrElse(Option.scala:121)
> 	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
> 	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313)
> 	... 13 more
> {code}
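The "corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262" error in the trace is a checksum mismatch detected while reassembling fetched broadcast pieces. A rough sketch of that kind of check (illustrative only: the CRC-32 algorithm and function names here are my choices for demonstration, not Spark's actual checksum implementation):

```python
import zlib

def checksum(data: bytes) -> int:
    # Illustrative checksum per broadcast piece; the exact algorithm
    # Spark uses is not specified here, CRC-32 is just for demonstration.
    return zlib.crc32(data)

def verify_piece(piece_id: str, data: bytes, expected: int) -> bytes:
    """Reject a fetched piece whose checksum does not match the sender's."""
    actual = checksum(data)
    if actual != expected:
        # Mirrors the shape of the message in the stack trace:
        # "corrupt remote block <piece> of <broadcast>: X != Y"
        raise IOError(f"corrupt remote block {piece_id}: {actual} != {expected}")
    return data
```

A piece that arrives intact passes through; a piece whose bytes were corrupted in transit raises with the two differing checksums, which is the "X != Y" pair visible in the test failure above.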
[jira] [Updated] (SPARK-27441) Add read/write tests to Hive serde tables(include Parquet vectorized reader)
[ https://issues.apache.org/jira/browse/SPARK-27441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-27441:
--------------------------------

    Summary: Add read/write tests to Hive serde tables(include Parquet vectorized reader)  (was: Add read/write tests to Hive serde tables)

> Add read/write tests to Hive serde tables(include Parquet vectorized reader)
>
>                 Key: SPARK-27441
>                 URL: https://issues.apache.org/jira/browse/SPARK-27441
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Tests
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> The ORC and Parquet versions used by each code path, before and after the built-in Hive upgrade to 2.3.4:
>
> built-in Hive is 1.2.1:
> || ||ORC||Parquet||
> |Spark datasource table|1.5.5|1.10.1|
> |Spark hive table|Hive built-in|1.6.0|
> |Hive 1.2.1|Hive built-in|1.6.0|
>
> built-in Hive is 2.3.4:
> || ||ORC||Parquet||
> |Spark datasource table|1.5.5|1.10.1|
> |Spark hive table|1.5.5|1.8.1|
> |Hive 2.3.4|1.3.3|1.8.1|
>
> We should add read/write tests for Hive serde tables.
[jira] [Updated] (SPARK-27441) Add read/write tests to Hive serde tables
[ https://issues.apache.org/jira/browse/SPARK-27441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-27441:
--------------------------------

    Issue Type: Sub-task  (was: Improvement)
        Parent: SPARK-27500

> Add read/write tests to Hive serde tables
> -----------------------------------------
>
>                 Key: SPARK-27441
>                 URL: https://issues.apache.org/jira/browse/SPARK-27441
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Tests
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> The ORC and Parquet versions used by each code path, before and after the built-in Hive upgrade to 2.3.4:
>
> built-in Hive is 1.2.1:
> || ||ORC||Parquet||
> |Spark datasource table|1.5.5|1.10.1|
> |Spark hive table|Hive built-in|1.6.0|
> |Hive 1.2.1|Hive built-in|1.6.0|
>
> built-in Hive is 2.3.4:
> || ||ORC||Parquet||
> |Spark datasource table|1.5.5|1.10.1|
> |Spark hive table|1.5.5|1.8.1|
> |Hive 2.3.4|1.3.3|1.8.1|
>
> We should add read/write tests for Hive serde tables.
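The version matrix in the issue description lends itself to data-driven test parametrization. A small sketch (the dictionary layout and function are mine; the version numbers are copied from the tables above):

```python
# ORC / Parquet library versions per code path, keyed by built-in Hive version.
# Values are taken from the tables in the SPARK-27441 description.
VERSION_MATRIX = {
    "1.2.1": {
        "Spark datasource table": {"ORC": "1.5.5", "Parquet": "1.10.1"},
        "Spark hive table": {"ORC": "Hive built-in", "Parquet": "1.6.0"},
        "Hive": {"ORC": "Hive built-in", "Parquet": "1.6.0"},
    },
    "2.3.4": {
        "Spark datasource table": {"ORC": "1.5.5", "Parquet": "1.10.1"},
        "Spark hive table": {"ORC": "1.5.5", "Parquet": "1.8.1"},
        "Hive": {"ORC": "1.3.3", "Parquet": "1.8.1"},
    },
}

def reader_writer_cases(hive_version):
    """Yield (code path, format, library version) triples, one per
    read/write round-trip combination a test suite would need to cover."""
    for path, formats in VERSION_MATRIX[hive_version].items():
        for fmt, version in formats.items():
            yield path, fmt, version
```

Enumerating the 2.3.4 matrix this way makes the mismatch visible that motivates the tests: a Spark datasource table writes Parquet 1.10.1 while a Spark hive (serde) table writes Parquet 1.8.1, so both directions need round-trip coverage.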
[jira] [Created] (SPARK-27501) Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress present stream
Yuming Wang created SPARK-27501: --- Summary: Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress present stream Key: SPARK-27501 URL: https://issues.apache.org/jira/browse/SPARK-27501 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27500) Add tests for the built-in Hive 2.3
Yuming Wang created SPARK-27500: --- Summary: Add tests for the built-in Hive 2.3 Key: SPARK-27500 URL: https://issues.apache.org/jira/browse/SPARK-27500 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang Our Spark will use some of the new features and bug fixes of Hive 2.3, and we should add tests for these. This is an umbrella JIRA for tracking this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
Junjie Chen created SPARK-27499: --- Summary: Support mapping spark.local.dir to hostPath volume Key: SPARK-27499 URL: https://issues.apache.org/jira/browse/SPARK-27499 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.1 Reporter: Junjie Chen Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or memory-backed volume. That should satisfy small workloads, but for heavy workloads like TPC-DS both options can cause problems: pods are evicted due to disk pressure when using emptyDir, and executors hit OOM when using tmpfs. In particular, on cloud environments users may allocate a cluster with a minimal configuration and attach cloud storage when running the workload. In this case, we could specify multiple elastic storage volumes as spark.local.dir to accelerate spilling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
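For reference, Spark 2.4's generic Kubernetes volume support can already mount a hostPath volume into executors; a minimal sketch of pointing spark.local.dir at such a mount is below. The volume name, paths, master URL, and jar are illustrative, and whether spark.local.dir actually wins over the emptyDir the executor builder configures is exactly the question this ticket raises.

```shell
# Sketch: mount a node-local (or elastically attached) disk into each executor
# via a hostPath volume and point spark.local.dir at the mount, so spill and
# shuffle data avoids emptyDir/tmpfs. All names and paths are illustrative.
spark-submit \
  --master k8s://https://kube-apiserver:6443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-1.mount.path=/data/spark-local \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-1.options.path=/mnt/fast-disk \
  --conf spark.local.dir=/data/spark-local \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar
```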
[jira] [Commented] (SPARK-17668) Support representing structs with case classes and tuples in spark sql udf inputs
[ https://issues.apache.org/jira/browse/SPARK-17668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820637#comment-16820637 ] william hesch commented on SPARK-17668: --- +1 > Support representing structs with case classes and tuples in spark sql udf > inputs > - > > Key: SPARK-17668 > URL: https://issues.apache.org/jira/browse/SPARK-17668 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: koert kuipers >Priority: Minor > > after having gotten used to have case classes represent complex structures in > Datasets, i am surprised to find out that when i work in DataFrames with udfs > no such magic exists, and i have to fall back to manipulating Row objects, > which is error prone and somewhat ugly. > for example: > {noformat} > case class Person(name: String, age: Int) > val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", > "id") > val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age + > 1) }).apply(col("person"))) > df1.printSchema > df1.show > {noformat} > leads to: > {noformat} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast > to Person > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
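Until case classes are supported in UDF inputs, the Row-based fallback the description alludes to can be written as follows. This is a sketch against the df defined in the example above; the explicit schema argument is needed because udf cannot infer a result type from Row.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// The struct column arrives as a GenericRowWithSchema, so accept a Row and
// rebuild it field by field instead of using the Person case class.
val personSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

val bumpAge = udf((p: Row) => Row(p.getString(0), p.getInt(1) + 1), personSchema)

val df1 = df.withColumn("person", bumpAge(col("person")))
```

This avoids the ClassCastException at the cost of positional field access, which is exactly the error-prone style the ticket argues against.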
[jira] [Commented] (SPARK-27491) SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-27491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820619#comment-16820619 ] Stavros Kontopoulos commented on SPARK-27491: - Can you reach the rest api outside spark submit? > SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty > response! therefore Airflow won't integrate with Spark 2.3.x > -- > > Key: SPARK-27491 > URL: https://issues.apache.org/jira/browse/SPARK-27491 > Project: Spark > Issue Type: Bug > Components: Java API, Scheduler, Spark Core, Spark Shell, Spark > Submit >Affects Versions: 2.3.3 >Reporter: t oo >Priority: Blocker > > This issue must have been introduced after Spark 2.1.1 as it is working in > that version. This issue is affecting me in Spark 2.3.3/2.3.0. I am using > spark standalone mode if that makes a difference. > See below spark 2.3.3 returns empty response while 2.1.1 returns a response. > > Spark 2.1.1: > [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home1/bin/spark-class > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + export SPARK_HOME=/home/ec2here/spark_home1 > + SPARK_HOME=/home/ec2here/spark_home1 > + '[' -z /home/ec2here/spark_home1 ']' > + . /home/ec2here/spark_home1/bin/load-spark-env.sh > ++ '[' -z /home/ec2here/spark_home1 ']' > ++ '[' -z '' ']' > ++ export SPARK_ENV_LOADED=1 > ++ SPARK_ENV_LOADED=1 > ++ parent_dir=/home/ec2here/spark_home1 > ++ user_conf_dir=/home/ec2here/spark_home1/conf > ++ '[' -f /home/ec2here/spark_home1/conf/spark-env.sh ']' > ++ set -a > ++ . 
/home/ec2here/spark_home1/conf/spark-env.sh > +++ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64 > +++ JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64 > ulimit -n 1048576 > ++ set +a > ++ '[' -z '' ']' > ++ ASSEMBLY_DIR2=/home/ec2here/spark_home1/assembly/target/scala-2.11 > ++ ASSEMBLY_DIR1=/home/ec2here/spark_home1/assembly/target/scala-2.10 > ++ [[ -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ]] > ++ '[' -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ']' > ++ export SPARK_SCALA_VERSION=2.10 > ++ SPARK_SCALA_VERSION=2.10 > + '[' -n /usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ']' > + RUNNER=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java > + '[' -d /home/ec2here/spark_home1/jars ']' > + SPARK_JARS_DIR=/home/ec2here/spark_home1/jars > + '[' '!' -d /home/ec2here/spark_home1/jars ']' > + LAUNCH_CLASSPATH='/home/ec2here/spark_home1/jars/*' > + '[' -n '' ']' > + [[ -n '' ]] > + CMD=() > + IFS= > + read -d '' -r ARG > ++ build_command org.apache.spark.deploy.SparkSubmit --master > spark://domainhere:6066 --status driver-20190417130324-0009 > ++ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xmx128m -cp > '/home/ec2here/spark_home1/jars/*' org.apache.spark.launcher.Main > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > ++ printf '%d\0' 0 > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + COUNT=10 > + LAST=9 > + LAUNCHER_EXIT_CODE=0 > + [[ 0 =~ ^[0-9]+$ ]] > + '[' 0 '!=' 0 ']' > + CMD=("${CMD[@]:0:$LAST}") > + exec 
/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -cp > '/home/ec2here/spark_home1/conf/:/home/ec2here/spark_home1/jars/*' -Xmx2048m > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 19/04/17 14:03:27 INFO RestSubmissionClient: Submitting a request for the > status of submission driver-20190417130324-0009 in spark://domainhere:6066. > 19/04/17 14:03:28 INFO RestSubmissionClient: Server responded with > SubmissionStatusResponse: > { > "action" : "SubmissionStatusResponse", > "driverState" : "FAILED", > "serverSparkVersion" : "2.3.3", > "submissionId" : "driver-20190417130324-0009", > "success" : true, > "workerHostPort" : "x.y.211.40:11819", > "workerId" : "worker-20190417115840-x.y.211.40-11819" > } > [ec2here@ip-x-y-160-225 ~]$ > > Spark 2.3.3: > [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home/bin/spark-class > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + '[' -z '' ']' > ++ dirname
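The question above (whether the REST API is reachable outside of spark-submit) can be checked by querying the standalone master's REST submission server directly; the host and driver id below are the ones from the report.

```shell
# Bypass SparkSubmit entirely and hit the standalone REST server's status
# endpoint. If this returns the same JSON shown for 2.1.1, the server side
# is fine and the empty response is a client-side regression.
curl http://domainhere:6066/v1/submissions/status/driver-20190417130324-0009
```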
[jira] [Created] (SPARK-27498) Built-in parquet code path does not respect hive.enforce.bucketing
Bruce Robbins created SPARK-27498: - Summary: Built-in parquet code path does not respect hive.enforce.bucketing Key: SPARK-27498 URL: https://issues.apache.org/jira/browse/SPARK-27498 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: Bruce Robbins _Caveat: I can see how this could be intentional if Spark believes that the built-in Parquet code path is creating Hive-compatible bucketed files. However, I assume that is not the case and that this is an actual bug._ Spark makes an effort to avoid corrupting Hive-bucketed tables unless the user overrides this behavior by setting hive.enforce.bucketing and hive.enforce.sorting to false. However, this behavior falls down when Spark uses the built-in Parquet code path to write to the table. Here's an example. In Hive, do this (I create a table where things work as expected, and one where things don't work as expected): {noformat} hive> create table sourcetable as select 1 a, 3 b, 7 c; hive> drop table hivebuckettext1; hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as textfile; hive> insert into hivebuckettext1 select * from sourcetable; hive> drop table hivebucketparq1; hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet; hive> insert into hivebucketparq1 select * from sourcetable; {noformat} For the text table, things seem to work as expected: {noformat} scala> sql("insert into hivebuckettext1 select 1, 2, 3") 19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException org.apache.spark.sql.AnalysisException: Output Hive table `default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.; {noformat} For the parquet table, the insert just happens: {noformat} scala> sql("insert into hivebucketparq1 select 1, 2, 3") res1: 
org.apache.spark.sql.DataFrame = [] scala> {noformat} Note also that Spark has changed the table definition of hivebucketparq1 (in the HMS!) so that it is no longer a bucketed table. I will file a separate Jira on this (SPARK-27497). If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as expected. Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but InsertIntoHadoopFsRelationCommand does not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
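The workaround mentioned above can be applied when launching the shell; a sketch:

```shell
# With Parquet conversion disabled, writes go through the Hive serde path
# (InsertIntoHiveTable), which respects hive.enforce.bucketing and raises
# the expected AnalysisException instead of silently inserting.
spark-shell --conf spark.sql.hive.convertMetastoreParquet=false
```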
[jira] [Updated] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats
[ https://issues.apache.org/jira/browse/SPARK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-27497: -- Description: The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that has the following characteristics: - table is created by Hive (or even Spark, if you use HQL DDL) - table is stored in Parquet format - table has at least one Hive-created data file already For example, do the following in Hive: {noformat} hive> create table sourcetable as select 1 a, 3 b, 7 c; hive> drop table hivebucket1; hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet; hive> insert into hivebucket1 select * from sourcetable; hive> show create table hivebucket1; OK CREATE TABLE `hivebucket1`( `a` int, `b` int, `c` int) CLUSTERED BY ( a, b) SORTED BY ( a ASC, b ASC) INTO 10 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='true', 'numFiles'='1', 'numRows'='1', 'rawDataSize'='3', 'totalSize'='352', 'transient_lastDdlTime'='142971') Time taken: 0.056 seconds, Fetched: 26 row(s) hive> {noformat} Then in spark-shell, do the following: {noformat} scala> sql("insert into hivebucket1 select 1, 3, 7") 19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException res0: org.apache.spark.sql.DataFrame = [] {noformat} Note: At this point, I would have expected Spark to throw an {{AnalysisException}} with the message "Output Hive table `default`.`hivebucket1` is bucketed...". However, I am ignoring that for now and may open a separate Jira (SPARK-27498). 
Return to some Hive CLI and note that the bucket specification is gone from the table definition: {noformat} hive> show create table hivebucket1; OK CREATE TABLE `hivebucket1`( `a` int, `b` int, `c` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'SORTBUCKETCOLSPREFIX'='TRUE', 'numFiles'='2', 'numRows'='-1', 'rawDataSize'='-1', 'totalSize'='1144', 'transient_lastDdlTime'='123374') Time taken: 1.619 seconds, Fetched: 20 row(s) hive> {noformat} This information is lost when Spark attempts to update table stats. This is because HiveClientImpl.toHiveTable drops the bucket specification. was: The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that has the following characteristics: - table is created by Hive (or even Spark, if you use HQL DDL) - table is stored in Parquet format - table has at least one Hive-created data file already For example, do the following in Hive: {noformat} hive> create table sourcetable as select 1 a, 3 b, 7 c; hive> drop table hivebucket1; hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet; hive> insert into hivebucket1 select * from sourcetable; hive> show create table hivebucket1; OK CREATE TABLE `hivebucket1`( `a` int, `b` int, `c` int) CLUSTERED BY ( a, b) SORTED BY ( a ASC, b ASC) INTO 10 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='true', 'numFiles'='1', 
'numRows'='1', 'rawDataSize'='3', 'totalSize'='352', 'transient_lastDdlTime'='142971') Time taken: 0.056 seconds, Fetched: 26 row(s) hive> {noformat} Then in spark-shell, do the following: {noformat} scala> sql("insert into hivebucket1 select 1, 3, 7") 19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException res0: org.apache.spark.sql.DataFrame = [] {noformat} Note: At this point, I would have expected Spark to throw an {{AnalysisException}} with the message "Output Hive table `default`.`hivebucket1` is bucketed...". However, I am ignoring that for now and may open a separate Jira. Return to some Hive CLI and note that the bucket specification is gone from the table definition: {noformat} hive> show create table hivebucket1; OK CREATE TABLE
[jira] [Updated] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats
[ https://issues.apache.org/jira/browse/SPARK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-27497: -- Description: The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that has the following characteristics: - table is created by Hive (or even Spark, if you use HQL DDL) - table is stored in Parquet format - table has at least one Hive-created data file already For example, do the following in Hive: {noformat} hive> create table sourcetable as select 1 a, 3 b, 7 c; hive> drop table hivebucket1; hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet; hive> insert into hivebucket1 select * from sourcetable; hive> show create table hivebucket1; OK CREATE TABLE `hivebucket1`( `a` int, `b` int, `c` int) CLUSTERED BY ( a, b) SORTED BY ( a ASC, b ASC) INTO 10 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='true', 'numFiles'='1', 'numRows'='1', 'rawDataSize'='3', 'totalSize'='352', 'transient_lastDdlTime'='142971') Time taken: 0.056 seconds, Fetched: 26 row(s) hive> {noformat} Then in spark-shell, do the following: {noformat} scala> sql("insert into hivebucket1 select 1, 3, 7") 19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException res0: org.apache.spark.sql.DataFrame = [] {noformat} Note: At this point, I would have expected Spark to throw an {{AnalysisException}} with the message "Output Hive table `default`.`hivebucket1` is bucketed...". However, I am ignoring that for now and may open a separate Jira. 
Return to some Hive CLI and note that the bucket specification is gone from the table definition: {noformat} hive> show create table hivebucket1; OK CREATE TABLE `hivebucket1`( `a` int, `b` int, `c` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'SORTBUCKETCOLSPREFIX'='TRUE', 'numFiles'='2', 'numRows'='-1', 'rawDataSize'='-1', 'totalSize'='1144', 'transient_lastDdlTime'='123374') Time taken: 1.619 seconds, Fetched: 20 row(s) hive> {noformat} This information is lost when Spark attempts to update table stats. This is because HiveClientImpl.toHiveTable drops the bucket specification. was: The bucket spec gets wiped out after Spark writes to Hive-bucketed table that has the following characteristics: - table is created by Hive (or even Spark, if you use HQL DDL) - table is stored in Parquet format - table has at least one Hive-created data file already For example, do the following in Hive: {noformat} hive> create table sourcetable as select 1 a, 3 b, 7 c; hive> drop table hivebucket1; hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet; hive> insert into hivebucket1 select * from sourcetable; hive> show create table hivebucket1; OK CREATE TABLE `hivebucket1`( `a` int, `b` int, `c` int) CLUSTERED BY ( a, b) SORTED BY ( a ASC, b ASC) INTO 10 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='true', 'numFiles'='1', 
'numRows'='1', 'rawDataSize'='3', 'totalSize'='352', 'transient_lastDdlTime'='142971') Time taken: 0.056 seconds, Fetched: 26 row(s) hive> {noformat} Then in spark-shell, do the following: {noformat} scala> sql("insert into hivebucket1 select 1, 3, 7") 19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException res0: org.apache.spark.sql.DataFrame = [] {noformat} Note: At this point, I would have expected Spark to throw an {{AnalysisException}} with the message "Output Hive table `default`.`hivebucket1` is bucketed...". However, I am ignoring that for now and may open a separate Jira. Return to some Hive CLI and note that the bucket specification is gone from the table definition: {noformat} hive> show create table hivebucket1; OK CREATE TABLE `hivebucket1`( `a` int,
[jira] [Created] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats
Bruce Robbins created SPARK-27497: - Summary: Spark wipes out bucket spec in metastore when updating table stats Key: SPARK-27497 URL: https://issues.apache.org/jira/browse/SPARK-27497 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: Bruce Robbins The bucket spec gets wiped out after Spark writes to Hive-bucketed table that has the following characteristics: - table is created by Hive (or even Spark, if you use HQL DDL) - table is stored in Parquet format - table has at least one Hive-created data file already For example, do the following in Hive: {noformat} hive> create table sourcetable as select 1 a, 3 b, 7 c; hive> drop table hivebucket1; hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet; hive> insert into hivebucket1 select * from sourcetable; hive> show create table hivebucket1; OK CREATE TABLE `hivebucket1`( `a` int, `b` int, `c` int) CLUSTERED BY ( a, b) SORTED BY ( a ASC, b ASC) INTO 10 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='true', 'numFiles'='1', 'numRows'='1', 'rawDataSize'='3', 'totalSize'='352', 'transient_lastDdlTime'='142971') Time taken: 0.056 seconds, Fetched: 26 row(s) hive> {noformat} Then in spark-shell, do the following: {noformat} scala> sql("insert into hivebucket1 select 1, 3, 7") 19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException res0: org.apache.spark.sql.DataFrame = [] {noformat} Note: At this point, I would have expected Spark to throw an {{AnalysisException}} with the message "Output Hive table `default`.`hivebucket1` is bucketed...". 
However, I am ignoring that for now and may open a separate Jira. Return to some Hive CLI and note that the bucket specification is gone from the table definition: {noformat} hive> show create table hivebucket1; OK CREATE TABLE `hivebucket1`( `a` int, `b` int, `c` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'SORTBUCKETCOLSPREFIX'='TRUE', 'numFiles'='2', 'numRows'='-1', 'rawDataSize'='-1', 'totalSize'='1144', 'transient_lastDdlTime'='123374') Time taken: 1.619 seconds, Fetched: 20 row(s) hive> {noformat} This information is lost when Spark attempts to update table stats. This is because HiveClientImpl.toHiveTable drops the bucket specification. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820549#comment-16820549 ] Calvin Park edited comment on SPARK-27365 at 4/17/19 10:34 PM: --- Simple Jenkinsfile {code:java} pipeline { agent { dockerfile { label 'docker-gpu' args '--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all' } } stages { stage('smi') { steps { sh 'nvidia-smi' } } } }{code} with Dockerfile {code:java} FROM ubuntu:xenial RUN apt-get update && \ DEBIAN_FRONTEND=noninteractive apt-get install -y \ curl \ flake8 \ git-core \ openjdk-8-jdk \ python2.7 \ python-pip \ wget RUN pip install \ requests \ numpy # Build script looks for javac in jre dir ENV JAVA_HOME "/usr/lib/jvm/java-8-openjdk-amd64" # http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage # We have a pretty beefy server ENV MAVEN_OPTS "-Xmx20g -XX:ReservedCodeCacheSize=2g" {code} was (Author: calvinatnvidia): Simple Jenkinsfile {code:java} pipeline { agent { dockerfile { label 'docker-gpu' args '--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all' } } stages { stage('smi') { steps { sh 'nvidia-smi' } } } }{code} with Dockerfile {code:java} FROM ubuntu:xenial RUN apt-get update && \ DEBIAN_FRONTEND=noninteractive apt-get install -y \ curl \ flake8 \ git-core \ openjdk-8-jdk \ python2.7 \ python-pip \ wget RUN DEBIAN_FRONTEND=noninteractive pip install \ requests \ numpy # Build script looks for javac in jre dir ENV JAVA_HOME "/usr/lib/jvm/java-8-openjdk-amd64" # http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage # We have a pretty beefy server ENV MAVEN_OPTS "-Xmx20g -XX:ReservedCodeCacheSize=2g" {code} > Spark Jenkins supports testing GPU-aware scheduling features > > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Upgrade Spark Jenkins to install 
GPU cards and run GPU integration tests > triggered by "GPU" in PRs. > cc: [~afeng] [~shaneknapp] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820549#comment-16820549 ] Calvin Park edited comment on SPARK-27365 at 4/17/19 10:34 PM: --- Simple Jenkinsfile {code:java} pipeline { agent { dockerfile { label 'docker-gpu' args '--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all' } } stages { stage('smi') { steps { sh 'nvidia-smi' } } } }{code} with Dockerfile {code:java} FROM ubuntu:xenial RUN apt-get update && \ DEBIAN_FRONTEND=noninteractive apt-get install -y \ curl \ flake8 \ git-core \ openjdk-8-jdk \ python2.7 \ python-pip \ wget RUN DEBIAN_FRONTEND=noninteractive pip install \ requests \ numpy # Build script looks for javac in jre dir ENV JAVA_HOME "/usr/lib/jvm/java-8-openjdk-amd64" # http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage # We have a pretty beefy server ENV MAVEN_OPTS "-Xmx20g -XX:ReservedCodeCacheSize=2g" {code} was (Author: calvinatnvidia): Simple Jenkinsfile {code:java} pipeline { agent { dockerfile { label 'docker-gpu' args '--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all' } } stages { stage('smi') { steps { sh 'nvidia-smi' } } } }{code} with Dockerfile FROM ubuntu:xenial RUN apt-get update && \ DEBIAN_FRONTEND=noninteractive apt-get install -y \ curl \ flake8 \ git-core \ openjdk-8-jdk \ python2.7 \ python-pip \ wget RUN DEBIAN_FRONTEND=noninteractive pip install \ requests \ numpy # Build script looks for javac in jre dir ENV JAVA_HOME "/usr/lib/jvm/java-8-openjdk-amd64" # http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage # We have a pretty beefy server ENV MAVEN_OPTS "-Xmx20g -XX:ReservedCodeCacheSize=2g" > Spark Jenkins supports testing GPU-aware scheduling features > > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Upgrade Spark Jenkins to install GPU
cards and run GPU integration tests > triggered by "GPU" in PRs. > cc: [~afeng] [~shaneknapp] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820549#comment-16820549 ] Calvin Park commented on SPARK-27365: - Simple Jenkinsfile {code:java} pipeline { agent { dockerfile { label 'docker-gpu' args '--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all' } } stages { stage('smi') { steps { sh 'nvidia-smi' } } } }{code} with Dockerfile {code:java} FROM ubuntu:xenial RUN apt-get update && \ DEBIAN_FRONTEND=noninteractive apt-get install -y \ curl \ flake8 \ git-core \ openjdk-8-jdk \ python2.7 \ python-pip \ wget RUN DEBIAN_FRONTEND=noninteractive pip install \ requests \ numpy # Build script looks for javac in jre dir ENV JAVA_HOME "/usr/lib/jvm/java-8-openjdk-amd64" # http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage # We have a pretty beefy server ENV MAVEN_OPTS "-Xmx20g -XX:ReservedCodeCacheSize=2g" {code} > Spark Jenkins supports testing GPU-aware scheduling features > > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Upgrade Spark Jenkins to install GPU cards and run GPU integration tests > triggered by "GPU" in PRs. > cc: [~afeng] [~shaneknapp] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27491) SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-27491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820542#comment-16820542 ] t oo commented on SPARK-27491: -- cc: [~skonto] [~mpmolek] [~gschiavon] [~scrapco...@gmail.com] > SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty > response! therefore Airflow won't integrate with Spark 2.3.x > -- > > Key: SPARK-27491 > URL: https://issues.apache.org/jira/browse/SPARK-27491 > Project: Spark > Issue Type: Bug > Components: Java API, Scheduler, Spark Core, Spark Shell, Spark > Submit >Affects Versions: 2.3.3 >Reporter: t oo >Priority: Blocker > > This issue must have been introduced after Spark 2.1.1 as it is working in > that version. This issue is affecting me in Spark 2.3.3/2.3.0. I am using > spark standalone mode if that makes a difference. > See below spark 2.3.3 returns empty response while 2.1.1 returns a response. > > Spark 2.1.1: > [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home1/bin/spark-class > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + export SPARK_HOME=/home/ec2here/spark_home1 > + SPARK_HOME=/home/ec2here/spark_home1 > + '[' -z /home/ec2here/spark_home1 ']' > + . /home/ec2here/spark_home1/bin/load-spark-env.sh > ++ '[' -z /home/ec2here/spark_home1 ']' > ++ '[' -z '' ']' > ++ export SPARK_ENV_LOADED=1 > ++ SPARK_ENV_LOADED=1 > ++ parent_dir=/home/ec2here/spark_home1 > ++ user_conf_dir=/home/ec2here/spark_home1/conf > ++ '[' -f /home/ec2here/spark_home1/conf/spark-env.sh ']' > ++ set -a > ++ . 
/home/ec2here/spark_home1/conf/spark-env.sh > +++ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64 > +++ JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64 > ulimit -n 1048576 > ++ set +a > ++ '[' -z '' ']' > ++ ASSEMBLY_DIR2=/home/ec2here/spark_home1/assembly/target/scala-2.11 > ++ ASSEMBLY_DIR1=/home/ec2here/spark_home1/assembly/target/scala-2.10 > ++ [[ -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ]] > ++ '[' -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ']' > ++ export SPARK_SCALA_VERSION=2.10 > ++ SPARK_SCALA_VERSION=2.10 > + '[' -n /usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ']' > + RUNNER=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java > + '[' -d /home/ec2here/spark_home1/jars ']' > + SPARK_JARS_DIR=/home/ec2here/spark_home1/jars > + '[' '!' -d /home/ec2here/spark_home1/jars ']' > + LAUNCH_CLASSPATH='/home/ec2here/spark_home1/jars/*' > + '[' -n '' ']' > + [[ -n '' ]] > + CMD=() > + IFS= > + read -d '' -r ARG > ++ build_command org.apache.spark.deploy.SparkSubmit --master > spark://domainhere:6066 --status driver-20190417130324-0009 > ++ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xmx128m -cp > '/home/ec2here/spark_home1/jars/*' org.apache.spark.launcher.Main > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > ++ printf '%d\0' 0 > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + COUNT=10 > + LAST=9 > + LAUNCHER_EXIT_CODE=0 > + [[ 0 =~ ^[0-9]+$ ]] > + '[' 0 '!=' 0 ']' > + CMD=("${CMD[@]:0:$LAST}") > + exec 
/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -cp > '/home/ec2here/spark_home1/conf/:/home/ec2here/spark_home1/jars/*' -Xmx2048m > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 19/04/17 14:03:27 INFO RestSubmissionClient: Submitting a request for the > status of submission driver-20190417130324-0009 in spark://domainhere:6066. > 19/04/17 14:03:28 INFO RestSubmissionClient: Server responded with > SubmissionStatusResponse: > { > "action" : "SubmissionStatusResponse", > "driverState" : "FAILED", > "serverSparkVersion" : "2.3.3", > "submissionId" : "driver-20190417130324-0009", > "success" : true, > "workerHostPort" : "x.y.211.40:11819", > "workerId" : "worker-20190417115840-x.y.211.40-11819" > } > [ec2here@ip-x-y-160-225 ~]$ > > Spark 2.3.3: > [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home/bin/spark-class > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + '[' -z '' ']' > ++ dirname
[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820525#comment-16820525 ] shahid commented on SPARK-27468: [~zsxwing] Thanks > "Storage Level" in "RDD Storage Page" is not correct > > > Key: SPARK-27468 > URL: https://issues.apache.org/jira/browse/SPARK-27468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > Attachments: Screenshot from 2019-04-17 10-42-55.png > > > I ran the following unit test and checked the UI. > {code} > val conf = new SparkConf() > .setAppName("test") > .setMaster("local-cluster[2,1,1024]") > .set("spark.ui.enabled", "true") > sc = new SparkContext(conf) > val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd.count() > Thread.sleep(360) > {code} > The storage level is "Memory Deserialized 1x Replicated" in the RDD storage > page. > I tried to debug and found this is because Spark emitted the following two > events: > {code} > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 > replicas),56,0)) > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, > 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 > replicas),56,0)) > {code} > The storage level in the second event will overwrite the first one. "1 > replicas" comes from this line: > https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 > Maybe AppStatusListener should calculate the replicas from events? > Another fact we may need to think about is when replicas is 2, will two Spark > events arrive in the same order? Currently, two RPCs from different executors > can arrive in any order. > Credit goes to [~srfnmnk] who reported this issue originally. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
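The fix floated above — having AppStatusListener derive the replica count from the stream of block-update events instead of trusting the last event's storage level — can be sketched as follows. This is a hypothetical illustration, not Spark's actual listener code; the event shape is reduced to (block manager id, block id) pairs.

```python
from collections import defaultdict

def replication_from_events(events):
    """Count distinct block managers reporting each block.

    events: iterable of (block_manager_id, block_id) pairs, standing in for
    SparkListenerBlockUpdated notifications. The result does not depend on
    arrival order, which is the point: two RPCs from different executors
    may arrive in any order.
    """
    holders = defaultdict(set)
    for manager_id, block_id in events:
        holders[block_id].add(manager_id)
    return {block_id: len(managers) for block_id, managers in holders.items()}

# The two events from the bug report, in the "wrong" order:
events = [
    ("BlockManagerId(0, 10.8.132.160, 65474)", "rdd_0_0"),
    ("BlockManagerId(1, 10.8.132.160, 65473)", "rdd_0_0"),
]
print(replication_from_events(events))  # {'rdd_0_0': 2}
```

Because the aggregation is a set union, the "1 replicas" event can no longer overwrite the "2 replicas" one.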
[jira] [Created] (SPARK-27496) RPC should send back the fatal errors
Shixiong Zhu created SPARK-27496: Summary: RPC should send back the fatal errors Key: SPARK-27496 URL: https://issues.apache.org/jira/browse/SPARK-27496 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.1 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now, when a fatal error throws from "receiveAndReply", the sender will not be notified. We should try our best to send it back. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
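The requested behavior can be illustrated with a small sketch (the names here are made up, not Spark's RpcEnv API): the receive loop should make a best effort to notify the sender even when the error is fatal, instead of only catching non-fatal exceptions.

```python
def receive_and_reply(handler, message, send_reply, send_failure):
    """Invoke handler and route the outcome back to the sender.

    Catching BaseException (the Python stand-in for Scala's fatal
    Throwables) is deliberate: the sender is notified before the error
    propagates, rather than being left waiting forever.
    """
    try:
        send_reply(handler(message))
    except BaseException as exc:
        send_failure(exc)  # best-effort notification to the sender
        raise              # still surface the fatal error locally

replies, failures = [], []

def boom(_msg):
    raise MemoryError("simulated fatal error")

try:
    receive_and_reply(boom, "status?", replies.append, failures.append)
except MemoryError:
    pass

print(type(failures[0]).__name__)  # MemoryError
```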
[jira] [Commented] (SPARK-27434) memory leak in spark driver
[ https://issues.apache.org/jira/browse/SPARK-27434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820516#comment-16820516 ] Ryne Yang commented on SPARK-27434: --- [~shahid] were you able to reproduce this by the steps I provided? > memory leak in spark driver > --- > > Key: SPARK-27434 > URL: https://issues.apache.org/jira/browse/SPARK-27434 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: OS: Centos 7 > JVM: > **_openjdk version "1.8.0_201"_ > _OpenJDK Runtime Environment (IcedTea 3.11.0) (Alpine 8.201.08-r0)_ > _OpenJDK 64-Bit Server VM (build 25.201-b08, mixed mode)_ > Spark version: 2.4.0 >Reporter: Ryne Yang >Priority: Major > Attachments: Screen Shot 2019-04-10 at 12.11.35 PM.png > > > we got an OOM exception on the driver after the driver has completed multiple > jobs (we are reusing the Spark context). > So we took a heap dump, looked at the leak analysis, and found that under > AsyncEventQueue there are 3.5GB of heap allocated. Possibly a leak. > > Can someone take a look? > here is the heap analysis: > !Screen Shot 2019-04-10 at 12.11.35 PM.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27473) Support filter push down for status fields in binary file data source
[ https://issues.apache.org/jira/browse/SPARK-27473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820500#comment-16820500 ] Xiangrui Meng commented on SPARK-27473: --- Given SPARK-25558 is WIP, we might want to flatten the status column to support filter push down. > Support filter push down for status fields in binary file data source > - > > Key: SPARK-27473 > URL: https://issues.apache.org/jira/browse/SPARK-27473 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > As a user, I can use > `spark.read.format("binaryFile").load(path).filter($"status.length" < > 1L)` to load files that are less than 1e8 bytes. Spark shouldn't even > read files that are bigger than 1e8 bytes in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
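A hedged sketch of the idea in plain Python (not the binaryFile source's actual code path): with the status fields flattened into columns, a length filter can be evaluated against file metadata alone, so oversized files are never opened.

```python
import os
import tempfile

def files_passing_length_filter(paths, max_len):
    """Prune files by size using only stat metadata, without reading bytes."""
    return [p for p in paths if os.stat(p).st_size < max_len]

# Demo: only the small file survives a "< 100 bytes" filter.
d = tempfile.mkdtemp()
small = os.path.join(d, "small.bin")
big = os.path.join(d, "big.bin")
with open(small, "wb") as f:
    f.write(b"x" * 10)
with open(big, "wb") as f:
    f.write(b"x" * 1000)
print(files_passing_length_filter([small, big], 100))
```

The pushdown proposed in the ticket would do the analogous pruning inside the data source, against the file status Spark already has from listing.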
[jira] [Assigned] (SPARK-27473) Support filter push down for status fields in binary file data source
[ https://issues.apache.org/jira/browse/SPARK-27473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27473: - Assignee: Weichen Xu > Support filter push down for status fields in binary file data source > - > > Key: SPARK-27473 > URL: https://issues.apache.org/jira/browse/SPARK-27473 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > As a user, I can use > `spark.read.format("binaryFile").load(path).filter($"status.length" < > 1L)` to load files that are less than 1e8 bytes. Spark shouldn't even > read files that are bigger than 1e8 bytes in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27495) Support Stage level resource scheduling
Thomas Graves created SPARK-27495: - Summary: Support Stage level resource scheduling Key: SPARK-27495 URL: https://issues.apache.org/jira/browse/SPARK-27495 Project: Spark Issue Type: Story Components: Spark Core Affects Versions: 3.0.0 Reporter: Thomas Graves Currently Spark supports CPU level scheduling and we are adding in accelerator aware scheduling with https://issues.apache.org/jira/browse/SPARK-24615, but both of those are scheduling via application level configurations. Meaning there is one configuration that is set for the entire lifetime of the application and the user can't change it between Spark jobs/stages within that application. Many times users have different requirements for different stages of their application so they want to be able to configure at the stage level what resources are required for that stage. For example, I might start a spark application which first does some ETL work that needs lots of cores to run many tasks in parallel, then once that is done I want to run some ML job and at that point I want GPU's, less CPU's, and more memory. With this Jira we want to add the ability for users to specify the resources for different stages. Note that https://issues.apache.org/jira/browse/SPARK-24615 had some discussions on this but this part of it was removed from that. We should come up with a proposal on how to do this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
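As a purely hypothetical sketch of what this proposal might look like to a user (no such API exists in Spark today), a per-stage resource profile could replace the single application-level configuration described above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageResourceProfile:
    """Hypothetical per-stage resource request (illustrative only)."""
    cores_per_task: int
    memory_mb: int
    gpus_per_task: int = 0

# The ETL stage wants many cores; the ML stage wants GPUs and more memory,
# mirroring the example in the ticket.
etl = StageResourceProfile(cores_per_task=4, memory_mb=4096)
ml = StageResourceProfile(cores_per_task=1, memory_mb=16384, gpus_per_task=1)
print(etl != ml)  # True
```

The design question the ticket raises is exactly how such a profile would be attached to a stage and honored by the scheduler.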
[jira] [Updated] (SPARK-27495) Support Stage level resource configuration and scheduling
[ https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-27495: -- Summary: Support Stage level resource configuration and scheduling (was: Support Stage level resource scheduling) > Support Stage level resource configuration and scheduling > - > > Key: SPARK-27495 > URL: https://issues.apache.org/jira/browse/SPARK-27495 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Currently Spark supports CPU level scheduling and we are adding in > accelerator aware scheduling with > https://issues.apache.org/jira/browse/SPARK-24615, but both of those are > scheduling via application level configurations. Meaning there is one > configuration that is set for the entire lifetime of the application and the > user can't change it between Spark jobs/stages within that application. > Many times users have different requirements for different stages of their > application so they want to be able to configure at the stage level what > resources are required for that stage. > For example, I might start a spark application which first does some ETL work > that needs lots of cores to run many tasks in parallel, then once that is > done I want to run some ML job and at that point I want GPU's, less CPU's, > and more memory. > With this Jira we want to add the ability for users to specify the resources > for different stages. > Note that https://issues.apache.org/jira/browse/SPARK-24615 had some > discussions on this but this part of it was removed from that. > We should come up with a proposal on how to do this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820460#comment-16820460 ] Shixiong Zhu commented on SPARK-27468: -- [~shahid] You need to use "--master local-cluster[2,1,1024]". The local mode has only one BlockManager. > "Storage Level" in "RDD Storage Page" is not correct > > > Key: SPARK-27468 > URL: https://issues.apache.org/jira/browse/SPARK-27468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > Attachments: Screenshot from 2019-04-17 10-42-55.png > > > I ran the following unit test and checked the UI. > {code} > val conf = new SparkConf() > .setAppName("test") > .setMaster("local-cluster[2,1,1024]") > .set("spark.ui.enabled", "true") > sc = new SparkContext(conf) > val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd.count() > Thread.sleep(360) > {code} > The storage level is "Memory Deserialized 1x Replicated" in the RDD storage > page. > I tried to debug and found this is because Spark emitted the following two > events: > {code} > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 > replicas),56,0)) > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, > 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 > replicas),56,0)) > {code} > The storage level in the second event will overwrite the first one. "1 > replicas" comes from this line: > https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 > Maybe AppStatusListener should calculate the replicas from events? > Another fact we may need to think about is when replicas is 2, will two Spark > events arrive in the same order? Currently, two RPCs from different executors > can arrive in any order. 
> Credit goes to [~srfnmnk] who reported this issue originally. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27494) Null values don't work in Kafka source v2
Shixiong Zhu created SPARK-27494: Summary: Null values don't work in Kafka source v2 Key: SPARK-27494 URL: https://issues.apache.org/jira/browse/SPARK-27494 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.1 Reporter: Shixiong Zhu Right now Kafka source v2 doesn't support null values. The issue is in org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow which doesn't handle null values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
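The needed fix can be illustrated in miniature (hypothetical Python, not the Scala converter itself): a Kafka record's key and value are each independently nullable, so the row conversion must emit a null column instead of dereferencing a missing payload.

```python
def record_to_row(key, value, topic, partition):
    """Convert one Kafka record to a row dict, preserving null key/value."""
    return {
        "key": None if key is None else bytes(key),
        "value": None if value is None else bytes(value),
        "topic": topic,
        "partition": partition,
    }

row = record_to_row(None, b"payload", "events", 0)
print(row["key"], row["value"])  # None b'payload'
```

Tombstone records (null value) are routine in compacted Kafka topics, which is why the converter cannot assume the payload is present.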
[jira] [Updated] (SPARK-27493) Upgrade ASM to 7.1
[ https://issues.apache.org/jira/browse/SPARK-27493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27493: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-24417 > Upgrade ASM to 7.1 > -- > > Key: SPARK-27493 > URL: https://issues.apache.org/jira/browse/SPARK-27493 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > SPARK-25946 upgrades ASM to 7.0 to support JDK11. This PR aims to update ASM > to 7.1 to bring the bug fixes. > - https://asm.ow2.io/versions.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27493) Upgrade ASM to 7.1
Dongjoon Hyun created SPARK-27493: - Summary: Upgrade ASM to 7.1 Key: SPARK-27493 URL: https://issues.apache.org/jira/browse/SPARK-27493 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun SPARK-25946 upgrades ASM to 7.0 to support JDK11. This PR aims to update ASM to 7.1 to bring the bug fixes. - https://asm.ow2.io/versions.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27276) Increase the minimum pyarrow version to 0.12.1
[ https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820370#comment-16820370 ] shane knapp commented on SPARK-27276: - test currently running: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104671/ > Increase the minimum pyarrow version to 0.12.1 > -- > > Key: SPARK-27276 > URL: https://issues.apache.org/jira/browse/SPARK-27276 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > The current minimum version is 0.8.0, which is pretty ancient since Arrow has > been moving fast and a lot has changed since this version. There are > currently many workarounds checking for different versions or disabling > specific functionality, and the code is getting ugly and difficult to > maintain. Increasing the version will allow cleanup and upgrade the testing > environment. > This involves changing the pyarrow version in setup.py (currently at 0.8.0), > updating Jenkins to test against the new version, code cleanup to remove > workarounds from older versions. Newer versions of pyarrow have dropped > support for Python 3.4, so it might be necessary to update to Python 3.5+ in > Jenkins as well. Users would then need to ensure at least this version of > pyarrow is installed on the cluster. > There is also a 0.12.1 release, so I will need to check what bugs that fixed > to see if that will be a better version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
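The cleanup this enables can be sketched as a single minimum-version gate replacing scattered per-feature version branches. The helper below is hypothetical (PySpark's real check lives in its pandas/arrow utilities), shown only to illustrate the shape of the change:

```python
def parse_version(version):
    """'0.12.1' -> (0, 12, 1); good enough for simple x.y.z strings."""
    return tuple(int(part) for part in version.split("."))

def require_minimum_pyarrow(found, minimum="0.12.1"):
    """Fail fast if the installed pyarrow is older than the one minimum."""
    if parse_version(found) < parse_version(minimum):
        raise ImportError(
            "pyarrow >= %s is required; found %s" % (minimum, found))

require_minimum_pyarrow("0.12.1")  # passes silently
```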
[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6
[ https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820364#comment-16820364 ] shane knapp commented on SPARK-25079: - from my email to dev@: ok. after much wailing and gnashing of teeth (and conversations w/[~bryanc]), i think we're coming to a general idea of how python testing will soon work! i propose the following: py27: master, 2.3, 2.4 py36 + pandas 0.19.2 + pyarrow 0.8.0: 2.3, 2.4 py36 + pandas 0.24.2 + pyarrow 0.12.1: master all of the above combinations have been tested (locally) and pass. i will need to create/deploy the new 2.3/4 branch python envs and then test my two PRs against them. the good: 1) this, IMO, will get us to a place where we can get all spark python tests using py36 as quickly as possible w/o needing to backport and spend a ton of time fixing 2.3/4 tests. 2) there is literally *one* hardcoded path (in dev/run-tests.py) that needs to be updated on 2.3/4 to point to a different python env than 'py3k'. the bad: 1) three python envs to deal with (with the env supporting 2.3 and 2.4 remaining relatively static). since the 'good' definitely outweighs the 'bad', my vote is for 'good'. ;) also: i am putting my foot down and we won't be testing against more than three python envs! > [PYTHON] upgrade python 3.4 -> 3.6 > -- > > Key: SPARK-25079 > URL: https://issues.apache.org/jira/browse/SPARK-25079 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > for the impending arrow upgrade > (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python > 3.4 -> 3.5. 
> i have been testing this here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69] > my methodology: > 1) upgrade python + arrow to 3.5 and 0.10.0 > 2) run python tests > 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and > upgrade centos workers to python3.5 > 4) simultaneously do the following: > - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that > points to python3.5 (this is currently being tested here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)] > - push a change to python/run-tests.py replacing 3.4 with 3.5 > 5) once the python3.5 change to run-tests.py is merged, we will need to > back-port this to all existing branches > 6) then and only then can i remove the python3.4 -> python3.5 symlink -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp closed SPARK-27389. --- > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: shane knapp >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File 
"/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25088) Rest Server default & doc updates
[ https://issues.apache.org/jira/browse/SPARK-25088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820321#comment-16820321 ] Imran Rashid commented on SPARK-25088: -- if you're allowing unauthed rest, what is the point of auth on standard submission? For most users, they'd just think they had a secure setup with auth on standard submission, and not realize they'd left a backdoor wide open. Its not worth that security risk > Rest Server default & doc updates > - > > Key: SPARK-25088 > URL: https://issues.apache.org/jira/browse/SPARK-25088 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > > The rest server could use some updates on defaults & docs, both in standalone > and mesos. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820309#comment-16820309 ] Imran Rashid commented on SPARK-6235: - [~glal14] actually this was fixed in 2.4. There was one open issue, SPARK-24936, but I just closed that as it's just improving an error msg, which I think isn't really worth fixing just for spark 3.0, and so I also resolved this umbrella. > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Priority: Major > Fix For: 2.4.0 > > Attachments: SPARK-6235_Design_V0.02.pdf > > > An umbrella ticket to track the various 2G limit we have in Spark, due to the > use of byte arrays and ByteBuffers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-6235. - Resolution: Fixed Fix Version/s: 2.4.0 > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Priority: Major > Fix For: 2.4.0 > > Attachments: SPARK-6235_Design_V0.02.pdf > > > An umbrella ticket to track the various 2G limit we have in Spark, due to the > use of byte arrays and ByteBuffers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24936) Better error message when trying a shuffle fetch over 2 GB
[ https://issues.apache.org/jira/browse/SPARK-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-24936. -- Resolution: Won't Fix As we've already shipped 2.4, I think it's unlikely we're going to fix this later. I don't think we need to worry that much about spark 3.0 talking to shuffle services < 2.2. If anybody is motivated, feel free to submit a pr here, but I think leaving this open is probably misleading about the status. > Better error message when trying a shuffle fetch over 2 GB > -- > > Key: SPARK-24936 > URL: https://issues.apache.org/jira/browse/SPARK-24936 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > After SPARK-24297, spark will try to fetch shuffle blocks to disk if they're > over 2GB. However, this will fail with an external shuffle service running < > spark 2.2, with an unhelpful error message like: > {noformat} > 18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 > (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1 > , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message= > org.apache.spark.shuffle.FetchFailedException: > java.lang.UnsupportedOperationException > at > org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > ... > {noformat} > We can't do anything to make the shuffle succeed in this situation, but we > should fail with a better error message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
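The unimplemented improvement amounts to catching the opaque failure and naming its cause. A hypothetical sketch, with Python's NotImplementedError playing the role of Java's UnsupportedOperationException:

```python
def explain_stream_fetch_failure(exc):
    """Translate an opaque stream-open failure into an actionable message."""
    if isinstance(exc, NotImplementedError):
        return ("Shuffle fetch-to-disk of blocks over 2 GB requires an "
                "external shuffle service running Spark 2.2 or later; the "
                "remote shuffle service appears to be too old.")
    return str(exc)

print(explain_stream_fetch_failure(NotImplementedError()))
```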
[jira] [Created] (SPARK-27492) High level user documentation
Thomas Graves created SPARK-27492: - Summary: High level user documentation Key: SPARK-27492 URL: https://issues.apache.org/jira/browse/SPARK-27492 Project: Spark Issue Type: Story Components: Documentation Affects Versions: 3.0.0 Reporter: Thomas Graves Add some high level user documentation about how this feature works together and point to things like the example discovery script, etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27403) Fix `updateTableStats` to update table stats always with new stats or None
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27403: -- Summary: Fix `updateTableStats` to update table stats always with new stats or None (was: Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true) > Fix `updateTableStats` to update table stats always with new stats or None > -- > > Key: SPARK-27403 > URL: https://issues.apache.org/jira/browse/SPARK-27403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1 >Reporter: Sujith Chacko >Assignee: Sujith Chacko >Priority: Major > Fix For: 2.4.2, 3.0.0 > > > The system shall update the table stats automatically if the user sets > spark.sql.statistics.size.autoUpdate.enabled as true; currently this property > has no effect whether it is enabled or disabled. This > feature is similar to Hive's auto-gather feature, where statistics are > automatically computed by default if the feature is enabled. 
> Reference: > [https://cwiki.apache.org/confluence/display/Hive/StatsDev] > Reproducing steps: > scala> spark.sql("create table table1 (name string,age int) stored as > parquet") > scala> spark.sql("insert into table1 select 'a',29") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("desc extended table1").show(false) > > +---+---++--- > |col_name|data_type|comment| > +---+---++--- > |name|string|null| > |age|int|null| > | | | | > | # Detailed Table Information| | | > |Database|default| | > |Table|table1| | > |Owner|Administrator| | > |Created Time|Sun Apr 07 23:41:56 IST 2019| | > |Last Access|Thu Jan 01 05:30:00 IST 1970| | > |Created By|Spark 2.4.1| | > |Type|MANAGED| | > |Provider|hive| | > |Table Properties|[transient_lastDdlTime=1554660716]| | > |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| | > |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| | > |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| | > |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| > | > |Storage Properties|[serialization.format=1]| | > |Partition Provider|Catalog| | > +---+---++--- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
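The retitled fix above ("update table stats always with new stats or None") can be sketched as follows. This is a hedged Python model of the behaviour described in the ticket, not the actual Scala `updateTableStats` code; the function name and dictionary shape are illustrative assumptions.

```python
# Hypothetical model of the behaviour described in SPARK-27403: when
# spark.sql.statistics.size.autoUpdate.enabled is true, write the freshly
# computed size; otherwise clear any stale stats (None) rather than
# silently keeping old values behind.
def update_table_stats(auto_update_enabled, new_size_in_bytes):
    if auto_update_enabled:
        return {"sizeInBytes": new_size_in_bytes}  # fresh stats
    return None  # invalidate instead of leaving stale values

print(update_table_stats(True, 4096))   # {'sizeInBytes': 4096}
print(update_table_stats(False, 4096))  # None
```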
[jira] [Updated] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27403: -- Fix Version/s: 2.4.2 > Failed to update the table size automatically even though > spark.sql.statistics.size.autoUpdate.enabled is set as true > > > Key: SPARK-27403 > URL: https://issues.apache.org/jira/browse/SPARK-27403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1 >Reporter: Sujith Chacko >Assignee: Sujith Chacko >Priority: Major > Fix For: 2.4.2, 3.0.0 > > > The system shall update the table stats automatically if the user sets > spark.sql.statistics.size.autoUpdate.enabled to true; currently this property > has no effect whether it is enabled or disabled. This > feature is similar to Hive's auto-gather feature, where statistics are > automatically computed by default if this feature is enabled. > Reference: > [https://cwiki.apache.org/confluence/display/Hive/StatsDev] > Reproducing steps: > scala> spark.sql("create table table1 (name string,age int) stored as > parquet") > scala> spark.sql("insert into table1 select 'a',29") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("desc extended table1").show(false) > > +---+---++--- > |col_name|data_type|comment| > +---+---++--- > |name|string|null| > |age|int|null| > | | | | > | # Detailed Table Information| | | > |Database|default| | > |Table|table1| | > |Owner|Administrator| | > |Created Time|Sun Apr 07 23:41:56 IST 2019| | > |Last Access|Thu Jan 01 05:30:00 IST 1970| | > |Created By|Spark 2.4.1| | > |Type|MANAGED| | > |Provider|hive| | > |Table Properties|[transient_lastDdlTime=1554660716]| | > |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| | > |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| | > |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| | > 
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| > | > |Storage Properties|[serialization.format=1]| | > |Partition Provider|Catalog| | > +---+---++--- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820229#comment-16820229 ] Gowtam Lal commented on SPARK-6235: --- It would be great to see this go out. Any updates? > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-6235_Design_V0.02.pdf > > > An umbrella ticket to track the various 2G limits we have in Spark, due to the > use of byte arrays and ByteBuffers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
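The 2G ceiling tracked by this umbrella comes from the JVM's int-indexed byte arrays and ByteBuffers; a quick illustration of the bound (plain Python arithmetic, not Spark code):

```python
# A single JVM byte[] or ByteBuffer is indexed by a 32-bit int, so it tops
# out at Integer.MAX_VALUE bytes -- one byte short of 2 GiB.
JVM_INT_MAX = 2**31 - 1       # Integer.MAX_VALUE = 2147483647
TWO_GIB = 2 * 1024**3         # 2 GiB = 2147483648 bytes
print(JVM_INT_MAX)            # 2147483647
print(TWO_GIB - JVM_INT_MAX)  # 1: the limit falls one byte short of 2 GiB
```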
[jira] [Updated] (SPARK-27491) SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-27491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t oo updated SPARK-27491: - Summary: SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark 2.3.x (was: SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark) > SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty > response! therefore Airflow won't integrate with Spark 2.3.x > -- > > Key: SPARK-27491 > URL: https://issues.apache.org/jira/browse/SPARK-27491 > Project: Spark > Issue Type: Bug > Components: Java API, Scheduler, Spark Core, Spark Shell, Spark > Submit >Affects Versions: 2.3.3 >Reporter: t oo >Priority: Blocker > > This issue must have been introduced after Spark 2.1.1 as it is working in > that version. This issue is affecting me in Spark 2.3.3/2.3.0. I am using > spark standalone mode if that makes a difference. > See below spark 2.3.3 returns empty response while 2.1.1 returns a response. > > Spark 2.1.1: > [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home1/bin/spark-class > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + export SPARK_HOME=/home/ec2here/spark_home1 > + SPARK_HOME=/home/ec2here/spark_home1 > + '[' -z /home/ec2here/spark_home1 ']' > + . /home/ec2here/spark_home1/bin/load-spark-env.sh > ++ '[' -z /home/ec2here/spark_home1 ']' > ++ '[' -z '' ']' > ++ export SPARK_ENV_LOADED=1 > ++ SPARK_ENV_LOADED=1 > ++ parent_dir=/home/ec2here/spark_home1 > ++ user_conf_dir=/home/ec2here/spark_home1/conf > ++ '[' -f /home/ec2here/spark_home1/conf/spark-env.sh ']' > ++ set -a > ++ . 
/home/ec2here/spark_home1/conf/spark-env.sh > +++ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64 > +++ JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64 > ulimit -n 1048576 > ++ set +a > ++ '[' -z '' ']' > ++ ASSEMBLY_DIR2=/home/ec2here/spark_home1/assembly/target/scala-2.11 > ++ ASSEMBLY_DIR1=/home/ec2here/spark_home1/assembly/target/scala-2.10 > ++ [[ -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ]] > ++ '[' -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ']' > ++ export SPARK_SCALA_VERSION=2.10 > ++ SPARK_SCALA_VERSION=2.10 > + '[' -n /usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ']' > + RUNNER=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java > + '[' -d /home/ec2here/spark_home1/jars ']' > + SPARK_JARS_DIR=/home/ec2here/spark_home1/jars > + '[' '!' -d /home/ec2here/spark_home1/jars ']' > + LAUNCH_CLASSPATH='/home/ec2here/spark_home1/jars/*' > + '[' -n '' ']' > + [[ -n '' ]] > + CMD=() > + IFS= > + read -d '' -r ARG > ++ build_command org.apache.spark.deploy.SparkSubmit --master > spark://domainhere:6066 --status driver-20190417130324-0009 > ++ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xmx128m -cp > '/home/ec2here/spark_home1/jars/*' org.apache.spark.launcher.Main > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > ++ printf '%d\0' 0 > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + COUNT=10 > + LAST=9 > + LAUNCHER_EXIT_CODE=0 > + [[ 0 =~ ^[0-9]+$ ]] > + '[' 0 '!=' 0 ']' > + CMD=("${CMD[@]:0:$LAST}") > + exec 
/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -cp > '/home/ec2here/spark_home1/conf/:/home/ec2here/spark_home1/jars/*' -Xmx2048m > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 19/04/17 14:03:27 INFO RestSubmissionClient: Submitting a request for the > status of submission driver-20190417130324-0009 in spark://domainhere:6066. > 19/04/17 14:03:28 INFO RestSubmissionClient: Server responded with > SubmissionStatusResponse: > { > "action" : "SubmissionStatusResponse", > "driverState" : "FAILED", > "serverSparkVersion" : "2.3.3", > "submissionId" : "driver-20190417130324-0009", > "success" : true, > "workerHostPort" : "x.y.211.40:11819", > "workerId" : "worker-20190417115840-x.y.211.40-11819" > } > [ec2here@ip-x-y-160-225 ~]$ > > Spark 2.3.3: > [ec2here@ip-x-y-160-225 ~]$ bash -x
[jira] [Commented] (SPARK-27491) SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-27491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820211#comment-16820211 ] t oo commented on SPARK-27491: -- cc: [~ash] [~bolke] > SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty > response! therefore Airflow won't integrate with Spark 2.3.x > -- > > Key: SPARK-27491 > URL: https://issues.apache.org/jira/browse/SPARK-27491 > Project: Spark > Issue Type: Bug > Components: Java API, Scheduler, Spark Core, Spark Shell, Spark > Submit >Affects Versions: 2.3.3 >Reporter: t oo >Priority: Blocker > > This issue must have been introduced after Spark 2.1.1 as it is working in > that version. This issue is affecting me in Spark 2.3.3/2.3.0. I am using > spark standalone mode if that makes a difference. > See below spark 2.3.3 returns empty response while 2.1.1 returns a response. > > Spark 2.1.1: > [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home1/bin/spark-class > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + export SPARK_HOME=/home/ec2here/spark_home1 > + SPARK_HOME=/home/ec2here/spark_home1 > + '[' -z /home/ec2here/spark_home1 ']' > + . /home/ec2here/spark_home1/bin/load-spark-env.sh > ++ '[' -z /home/ec2here/spark_home1 ']' > ++ '[' -z '' ']' > ++ export SPARK_ENV_LOADED=1 > ++ SPARK_ENV_LOADED=1 > ++ parent_dir=/home/ec2here/spark_home1 > ++ user_conf_dir=/home/ec2here/spark_home1/conf > ++ '[' -f /home/ec2here/spark_home1/conf/spark-env.sh ']' > ++ set -a > ++ . 
/home/ec2here/spark_home1/conf/spark-env.sh > +++ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64 > +++ JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64 > ulimit -n 1048576 > ++ set +a > ++ '[' -z '' ']' > ++ ASSEMBLY_DIR2=/home/ec2here/spark_home1/assembly/target/scala-2.11 > ++ ASSEMBLY_DIR1=/home/ec2here/spark_home1/assembly/target/scala-2.10 > ++ [[ -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ]] > ++ '[' -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ']' > ++ export SPARK_SCALA_VERSION=2.10 > ++ SPARK_SCALA_VERSION=2.10 > + '[' -n /usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ']' > + RUNNER=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java > + '[' -d /home/ec2here/spark_home1/jars ']' > + SPARK_JARS_DIR=/home/ec2here/spark_home1/jars > + '[' '!' -d /home/ec2here/spark_home1/jars ']' > + LAUNCH_CLASSPATH='/home/ec2here/spark_home1/jars/*' > + '[' -n '' ']' > + [[ -n '' ]] > + CMD=() > + IFS= > + read -d '' -r ARG > ++ build_command org.apache.spark.deploy.SparkSubmit --master > spark://domainhere:6066 --status driver-20190417130324-0009 > ++ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xmx128m -cp > '/home/ec2here/spark_home1/jars/*' org.apache.spark.launcher.Main > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > ++ printf '%d\0' 0 > + CMD+=("$ARG") > + IFS= > + read -d '' -r ARG > + COUNT=10 > + LAST=9 > + LAUNCHER_EXIT_CODE=0 > + [[ 0 =~ ^[0-9]+$ ]] > + '[' 0 '!=' 0 ']' > + CMD=("${CMD[@]:0:$LAST}") > + exec 
/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -cp > '/home/ec2here/spark_home1/conf/:/home/ec2here/spark_home1/jars/*' -Xmx2048m > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 19/04/17 14:03:27 INFO RestSubmissionClient: Submitting a request for the > status of submission driver-20190417130324-0009 in spark://domainhere:6066. > 19/04/17 14:03:28 INFO RestSubmissionClient: Server responded with > SubmissionStatusResponse: > { > "action" : "SubmissionStatusResponse", > "driverState" : "FAILED", > "serverSparkVersion" : "2.3.3", > "submissionId" : "driver-20190417130324-0009", > "success" : true, > "workerHostPort" : "x.y.211.40:11819", > "workerId" : "worker-20190417115840-x.y.211.40-11819" > } > [ec2here@ip-x-y-160-225 ~]$ > > Spark 2.3.3: > [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home/bin/spark-class > org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status > driver-20190417130324-0009 > + '[' -z '' ']' > ++ dirname /home/ec2here/spark_home/bin/spark-class > + source
[jira] [Created] (SPARK-27491) SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark
t oo created SPARK-27491: Summary: SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark Key: SPARK-27491 URL: https://issues.apache.org/jira/browse/SPARK-27491 Project: Spark Issue Type: Bug Components: Java API, Scheduler, Spark Core, Spark Shell, Spark Submit Affects Versions: 2.3.3 Reporter: t oo This issue must have been introduced after Spark 2.1.1 as it is working in that version. This issue is affecting me in Spark 2.3.3/2.3.0. I am using spark standalone mode if that makes a difference. See below spark 2.3.3 returns empty response while 2.1.1 returns a response. Spark 2.1.1: [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home1/bin/spark-class org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status driver-20190417130324-0009 + export SPARK_HOME=/home/ec2here/spark_home1 + SPARK_HOME=/home/ec2here/spark_home1 + '[' -z /home/ec2here/spark_home1 ']' + . /home/ec2here/spark_home1/bin/load-spark-env.sh ++ '[' -z /home/ec2here/spark_home1 ']' ++ '[' -z '' ']' ++ export SPARK_ENV_LOADED=1 ++ SPARK_ENV_LOADED=1 ++ parent_dir=/home/ec2here/spark_home1 ++ user_conf_dir=/home/ec2here/spark_home1/conf ++ '[' -f /home/ec2here/spark_home1/conf/spark-env.sh ']' ++ set -a ++ . 
/home/ec2here/spark_home1/conf/spark-env.sh +++ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64 +++ JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ulimit -n 1048576 ++ set +a ++ '[' -z '' ']' ++ ASSEMBLY_DIR2=/home/ec2here/spark_home1/assembly/target/scala-2.11 ++ ASSEMBLY_DIR1=/home/ec2here/spark_home1/assembly/target/scala-2.10 ++ [[ -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ]] ++ '[' -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ']' ++ export SPARK_SCALA_VERSION=2.10 ++ SPARK_SCALA_VERSION=2.10 + '[' -n /usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ']' + RUNNER=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java + '[' -d /home/ec2here/spark_home1/jars ']' + SPARK_JARS_DIR=/home/ec2here/spark_home1/jars + '[' '!' -d /home/ec2here/spark_home1/jars ']' + LAUNCH_CLASSPATH='/home/ec2here/spark_home1/jars/*' + '[' -n '' ']' + [[ -n '' ]] + CMD=() + IFS= + read -d '' -r ARG ++ build_command org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status driver-20190417130324-0009 ++ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xmx128m -cp '/home/ec2here/spark_home1/jars/*' org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status driver-20190417130324-0009 + CMD+=("$ARG") + IFS= + read -d '' -r ARG + CMD+=("$ARG") + IFS= + read -d '' -r ARG + CMD+=("$ARG") + IFS= + read -d '' -r ARG + CMD+=("$ARG") + IFS= + read -d '' -r ARG + CMD+=("$ARG") + IFS= + read -d '' -r ARG + CMD+=("$ARG") + IFS= + read -d '' -r ARG + CMD+=("$ARG") + IFS= + read -d '' -r ARG + CMD+=("$ARG") + IFS= + read -d '' -r ARG + CMD+=("$ARG") + IFS= + read -d '' -r ARG ++ printf '%d\0' 0 + CMD+=("$ARG") + IFS= + read -d '' -r ARG + COUNT=10 + LAST=9 + LAUNCHER_EXIT_CODE=0 + [[ 0 =~ ^[0-9]+$ ]] + '[' 0 '!=' 0 ']' + CMD=("${CMD[@]:0:$LAST}") + exec /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -cp '/home/ec2here/spark_home1/conf/:/home/ec2here/spark_home1/jars/*' -Xmx2048m org.apache.spark.deploy.SparkSubmit --master 
spark://domainhere:6066 --status driver-20190417130324-0009 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 19/04/17 14:03:27 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20190417130324-0009 in spark://domainhere:6066. 19/04/17 14:03:28 INFO RestSubmissionClient: Server responded with SubmissionStatusResponse: { "action" : "SubmissionStatusResponse", "driverState" : "FAILED", "serverSparkVersion" : "2.3.3", "submissionId" : "driver-20190417130324-0009", "success" : true, "workerHostPort" : "x.y.211.40:11819", "workerId" : "worker-20190417115840-x.y.211.40-11819" } [ec2here@ip-x-y-160-225 ~]$ Spark 2.3.3: [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home/bin/spark-class org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status driver-20190417130324-0009 + '[' -z '' ']' ++ dirname /home/ec2here/spark_home/bin/spark-class + source /home/ec2here/spark_home/bin/find-spark-home dirname /home/ec2here/spark_home/bin/spark-class +++ cd /home/ec2here/spark_home/bin +++ pwd ++ FIND_SPARK_HOME_PYTHON_SCRIPT=/home/ec2here/spark_home/bin/find_spark_home.py ++ '[' '!' -z '' ']' ++ '[' '!' -f /home/ec2here/spark_home/bin/find_spark_home.py ']' dirname /home/ec2here/spark_home/bin/spark-class +++ cd /home/ec2here/spark_home/bin/.. +++ pwd ++ export SPARK_HOME=/home/ec2here/spark_home ++ SPARK_HOME=/home/ec2here/spark_home + . /home/ec2here/spark_home/bin/load-spark-env.sh ++ '[' -z /home/ec2here/spark_home ']' ++
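When the status call works (the 2.1.1 trace above), the server answers with the SubmissionStatusResponse JSON shown in the log; a client such as Airflow can distinguish that from the empty 2.3.3 response by checking for a parseable body. A hedged sketch, using the exact body from the trace (the helper function is illustrative, not Airflow's or Spark's code):

```python
import json

# Response body copied from the Spark 2.1.1 trace above; on 2.3.3 the same
# call prints nothing, which is the regression reported in this ticket.
body = """{
  "action" : "SubmissionStatusResponse",
  "driverState" : "FAILED",
  "serverSparkVersion" : "2.3.3",
  "submissionId" : "driver-20190417130324-0009",
  "success" : true,
  "workerHostPort" : "x.y.211.40:11819",
  "workerId" : "worker-20190417115840-x.y.211.40-11819"
}"""

def driver_state(response_text):
    """Return the reported driver state, or None for an empty response."""
    if not response_text.strip():
        return None  # the empty-response case that breaks Airflow polling
    return json.loads(response_text)["driverState"]

print(driver_state(body))  # FAILED
print(driver_state(""))    # None
```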
[jira] [Commented] (SPARK-27485) Certain query plans fail to run when autoBroadcastJoinThreshold is set to -1
[ https://issues.apache.org/jira/browse/SPARK-27485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820189#comment-16820189 ] shahid commented on SPARK-27485: Could you please share test to reproduce this? > Certain query plans fail to run when autoBroadcastJoinThreshold is set to -1 > > > Key: SPARK-27485 > URL: https://issues.apache.org/jira/browse/SPARK-27485 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.4.0 >Reporter: Muthu Jayakumar >Priority: Minor > > Certain queries fail with > {noformat} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:349) > at scala.None$.get(Option.scala:347) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$reorder$1(EnsureRequirements.scala:238) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$reorder$1$adapted(EnsureRequirements.scala:233) > at scala.collection.immutable.List.foreach(List.scala:388) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:233) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:262) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:289) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:296) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$4(TreeNode.scala:282) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:282) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:296) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:38) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$prepareForExecution$1(QueryExecution.scala:87) > at >
[jira] [Created] (SPARK-27490) File source V2: return correct result for Dataset.inputFiles()
Gengliang Wang created SPARK-27490: -- Summary: File source V2: return correct result for Dataset.inputFiles() Key: SPARK-27490 URL: https://issues.apache.org/jira/browse/SPARK-27490 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang Currently, a `Dataset` with file source V2 always returns an empty result for the method `Dataset.inputFiles()`. We should fix it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27489) UI updates to show executor resource information
Thomas Graves created SPARK-27489: - Summary: UI updates to show executor resource information Key: SPARK-27489 URL: https://issues.apache.org/jira/browse/SPARK-27489 Project: Spark Issue Type: Story Components: Web UI Affects Versions: 3.0.0 Reporter: Thomas Graves Assignee: Thomas Graves We are adding other resource type support to the executors and Spark. We should show the resource information for each executor on the UI Executors page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reopened SPARK-27364: --- reopening since it has a subtask > User-facing APIs for GPU-aware scheduling > - > > Key: SPARK-27364 > URL: https://issues.apache.org/jira/browse/SPARK-27364 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > > Design and implement: > * General guidelines for cluster managers to understand resource requests at > application start. The concrete conf/param will be under the design of each > cluster manager. > * APIs to fetch assigned resources from task context. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820120#comment-16820120 ] Thomas Graves commented on SPARK-27364: --- Based on the lack of comments on this, I'm going to resolve it; we can discuss more in the PRs for the implementation. > User-facing APIs for GPU-aware scheduling > - > > Key: SPARK-27364 > URL: https://issues.apache.org/jira/browse/SPARK-27364 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > > Design and implement: > * General guidelines for cluster managers to understand resource requests at > application start. The concrete conf/param will be under the design of each > cluster manager. > * APIs to fetch assigned resources from task context. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27364. --- Resolution: Fixed > User-facing APIs for GPU-aware scheduling > - > > Key: SPARK-27364 > URL: https://issues.apache.org/jira/browse/SPARK-27364 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > > Design and implement: > * General guidelines for cluster managers to understand resource requests at > application start. The concrete conf/param will be under the design of each > cluster manager. > * APIs to fetch assigned resources from task context. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27488) Driver interface to support GPU resources
[ https://issues.apache.org/jira/browse/SPARK-27488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820119#comment-16820119 ] Thomas Graves commented on SPARK-27488: --- Note: the API design is here: https://issues.apache.org/jira/browse/SPARK-27364 > Driver interface to support GPU resources > -- > > Key: SPARK-27488 > URL: https://issues.apache.org/jira/browse/SPARK-27488 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > We want to have an interface to allow users on the driver to find out what > resources are allocated to them. This is mostly to handle the case where the > cluster manager does not launch the driver in an isolated environment and > users could be sharing hosts. For instance, standalone mode > doesn't support container isolation, so a host may have 4 GPUs but only 2 of > them could be assigned to the driver. In this case we need an interface for > the cluster manager to specify which GPUs the driver should use, and an > interface for the user to get the resource information. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27488) Driver interface to support GPU resources
Thomas Graves created SPARK-27488: - Summary: Driver interface to support GPU resources Key: SPARK-27488 URL: https://issues.apache.org/jira/browse/SPARK-27488 Project: Spark Issue Type: Story Components: Spark Core Affects Versions: 3.0.0 Reporter: Thomas Graves Assignee: Thomas Graves We want to have an interface to allow users on the driver to find out what resources are allocated to them. This is mostly to handle the case where the cluster manager does not launch the driver in an isolated environment and users could be sharing hosts. For instance, standalone mode doesn't support container isolation, so a host may have 4 GPUs but only 2 of them could be assigned to the driver. In this case we need an interface for the cluster manager to specify which GPUs the driver should use, and an interface for the user to get the resource information. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820114#comment-16820114 ] Dave DeCaprio commented on SPARK-23904: --- No, it's just in master, which is the 3.X branch. I do have backports of this and other PRs I have made related to large query plans in my repo: [https://github.com/DaveDeCaprio/spark] - it's the closedloop-2.4 branch. > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I created a question on > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark creates the text representation of the query in any case, even if I don't > need it. > That causes many garbage objects and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
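The fixes and backports discussed above revolve around not eagerly materialising the full plan text. One way to think about the idea is a cap on how many characters of a lazily rendered plan are ever built; the sketch below is purely illustrative (the function name and generator interface are assumptions, not Spark's actual API):

```python
# Illustrative only: build at most max_chars of a plan string from a lazy
# line generator, instead of rendering the whole (possibly huge) plan text
# and producing garbage the GC then has to clean up.
def truncated_plan(lines, max_chars):
    out, total = [], 0
    for line in lines:  # lines are produced lazily, one at a time
        if total + len(line) > max_chars:
            out.append("... (plan truncated)")
            break
        out.append(line)
        total += len(line)
    return "\n".join(out)

plan = (f"Node-{i}" for i in range(1_000_000))  # lazily produced plan lines
print(truncated_plan(plan, 30))  # only a handful of lines ever exist
```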
[jira] [Resolved] (SPARK-27458) Remind developer using IntelliJ to update maven version
[ https://issues.apache.org/jira/browse/SPARK-27458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27458. --- Resolution: Fixed Assignee: William Wong Fix Version/s: 3.0.0 Resolved by https://github.com/apache/spark-website/pull/195 > Remind developer using IntelliJ to update maven version > --- > > Key: SPARK-27458 > URL: https://issues.apache.org/jira/browse/SPARK-27458 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: William Wong >Assignee: William Wong >Priority: Minor > Fix For: 3.0.0 > > > I am using IntelliJ to update a few Spark sources. I tried to follow the guide > at '[http://spark.apache.org/developer-tools.html]' to set up an IntelliJ > project for Spark. However, the project failed to build, due to > missing classes generated via antlr in the sql/catalyst project. I tried clicking > the button 'Generate Sources and Update Folders for all Projects', but it did > not help; the antlr task was not triggered as expected. > I checked the IntelliJ log file and found it was because I had not set the > maven version properly, so the 'Generate Sources and Update Folders for all > Projects' process failed silently: > > _2019-04-14 16:05:24,796 [ 314609] INFO - #org.jetbrains.idea.maven - > [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion > failed with message:_ > _Detected Maven Version: 3.3.9 is not in the allowed range 3.6.0._ > _2019-04-14 16:05:24,813 [ 314626] INFO - #org.jetbrains.idea.maven - > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute > goal org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M2:enforce > (enforce-versions) on project spark-parent_2.12: Some Enforcer rules have > failed.
Look above for specific messages explaining why the rule failed._ > _java.lang.RuntimeException: > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute > goal org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M2:enforce > (enforce-versions) on project spark-parent_2.12: Some Enforcer rules have > failed. Look above for specific messages explaining why the rule failed._ > > To be honest, failing an action silently should be considered an IntelliJ bug. However, > enhancing the page '[http://spark.apache.org/developer-tools.html]' to > remind developers to check their maven version may save new joiners some > time. > >
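As the log shows, the enforcer rule requires Maven 3.6.0 or newer, and IntelliJ's bundled 3.3.9 fails it. A minimal sketch of the check the rule performs (illustrative only; the real maven-enforcer-plugin supports full version-range expressions):

```python
# Illustrative sketch of the RequireMavenVersion check from the log above.
def satisfies_minimum(version: str, minimum: str = "3.6.0") -> bool:
    """Compare dotted versions numerically, component by component."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(version) >= parse(minimum)

# IntelliJ's bundled Maven from the log:
print(satisfies_minimum("3.3.9"))  # False -- fails the >= 3.6.0 requirement
print(satisfies_minimum("3.6.0"))  # True
```

Pointing IntelliJ's Maven home at a standalone Maven 3.6.x installation makes the rule pass and lets the antlr source generation run.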
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820058#comment-16820058 ] Izek Greenfield commented on SPARK-23904: - [~DaveDeCaprio] Does that PR go into the 2.4.1 release?
[jira] [Resolved] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19712. - Resolution: Fixed Issue resolved by pull request 24331 [https://github.com/apache/spark/pull/24331] > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong >Priority: Major > Fix For: 3.0.0 > > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19712: --- Assignee: Dilip Biswal
[jira] [Issue Comment Deleted] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode
[ https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhanfeng Huo updated SPARK-3438: Comment: was deleted (was: This is the newest PR on master, with commit 0d1cc4ae42e1f73538dd8b9b1880ca9e5b124108 (Mon Sep 8 14:32:53 2014 +0530). 1. PR: https://github.com/apache/spark/pull/2320 And this is the original PR that my PR is based on. 2. PR: https://github.com/apache/spark/pull/265/files) > Support for accessing secured HDFS in Standalone Mode > - > > Key: SPARK-3438 > URL: https://issues.apache.org/jira/browse/SPARK-3438 > Project: Spark > Issue Type: New Feature > Components: Deploy, Spark Core >Affects Versions: 1.0.2 >Reporter: Zhanfeng Huo >Priority: Major > > Access to secured HDFS is currently supported in YARN using YARN's built-in > security mechanism. In YARN mode, a user application is authenticated when it > is submitted; it then acquires delegation tokens and ships them (via > YARN) securely to workers. > In Standalone mode, it would be nice to support a mechanism for > accessing HDFS where we rely on a single shared secret to authenticate > communication in the standalone cluster. > 1. A company is running a standalone cluster. > 2. They are fine if all Spark jobs in the cluster share a global secret, i.e. > all Spark jobs can trust one another. > 3. They are able to provide a Hadoop login on the driver node via a keytab or > kinit. They want tokens from this login to be distributed to the executors to > allow access to secure HDFS. > 4. They also don't want to trust the network on the cluster, i.e. they don't want > to allow someone to fetch HDFS tokens easily over a known protocol, without > authentication.
[jira] [Issue Comment Deleted] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode
[ https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhanfeng Huo updated SPARK-3438: Comment: was deleted (was: test)
[jira] [Commented] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode
[ https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820005#comment-16820005 ] Zhanfeng Huo commented on SPARK-3438: - test
[jira] [Resolved] (SPARK-27430) broadcast hint should be respected for broadcast nested loop join
[ https://issues.apache.org/jira/browse/SPARK-27430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27430. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24376 [https://github.com/apache/spark/pull/24376] > broadcast hint should be respected for broadcast nested loop join > - > > Key: SPARK-27430 > URL: https://issues.apache.org/jira/browse/SPARK-27430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 >
[jira] [Created] (SPARK-27487) Spark - Scala 2.12 compatibility
Vadym Holubnychyi created SPARK-27487: - Summary: Spark - Scala 2.12 compatibility Key: SPARK-27487 URL: https://issues.apache.org/jira/browse/SPARK-27487 Project: Spark Issue Type: Bug Components: Build, Deploy Affects Versions: 2.4.1 Environment: Scala 2.12.7, Hadoop 2.7.7, Spark 2.4.1. Reporter: Vadym Holubnychyi Hi, I've run into an interesting problem during development. It's documented that Spark 2.4.1 is compatible with Scala 2.12 (a minor version is not specified!). So I tried to deploy an application written with Scala 2.12.7 and got a lot of serialization errors. Later I found that Spark had been built with Scala 2.12.8; I switched to it, and everything works well now. Isn't it an error that Spark 2.4.1 doesn't support other minor versions?
[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
[ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819844#comment-16819844 ] Song Jun commented on SPARK-19842: -- I think Constraint should be designed together with DataSource v2, and it can do more than this jira. Constraints can be used for: 1. data integrity (not included in this jira) 2. the optimizer can use them to rewrite queries to gain performance (not just PK/FK; unique/not null is also useful). For data integrity, we have two scenarios: 1.1 The DataSource natively supports data integrity, such as mysql/oracle and so on. Spark should only call the read/write API of this DataSource and do nothing about data integrity. 1.2 The DataSource does not support data integrity, such as csv/json/parquet and so on. Spark can provide data integrity for this DataSource like Hive does (maybe with a switch to turn it off), and we can discuss which kinds of Constraint to support. For example, Hive supports PK/FK/UNIQUE(DISABLE RELY)/NOT NULL/DEFAULT; the NOT NULL ENFORCE check is implemented by adding an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan (https://issues.apache.org/jira/browse/HIVE-16605). For optimizer query rewrites: 2.1 We can add constraint information into CatalogTable, which is returned by the catalog.getTable API. The optimizer can then use it to rewrite queries.
2.2 If we cannot get constraint information, we can use a hint in the SQL. Above all, we can bring the Constraint feature into the DataSource v2 design: a) to support feature 2.1, we can add constraint information to the createTable/alterTable/getTable APIs in this SPIP (https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#) b) to support data integrity, we can add a ConstraintSupport mix-in for DataSource v2: if a DataSource supports Constraint, then Spark does nothing when inserting data; if a DataSource does not support Constraint but still wants constraint checks, then Spark should do the constraint check like Hive does (e.g. for not null, Hive adds an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan); if a DataSource does not support Constraint and does not want constraint checks, then Spark does nothing. The Hive catalog supports constraints, so we can implement this logic in the createTable/alterTable APIs. Then we can use SparkSQL DDL to create a table with constraints, which is stored in the HiveMetaStore via the Hive catalog API. For example: CREATE TABLE t(a STRING, b STRING NOT NULL DISABLE, CONSTRAINT pk1 PRIMARY KEY (a) DISABLE) USING parquet; As for how to store constraints: because Hive 2.1 provides a constraint API in Hive.java, we can call it directly in the createTable/alterTable APIs of the Hive catalog. There is no need for Spark to store this constraint information in table properties. There are some concerns about using the Hive 2.1 catalog API directly in the docs (https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit#heading=h.lnxbz9), such as Spark's built-in Hive being 1.2.1; upgrading Hive to 2.3.4 is in progress (https://issues.apache.org/jira/browse/SPARK-23710). [~cloud_fan] [~ioana-delaney] If this proposal is reasonable, please give me some feedback. Thanks!
> Informational Referential Integrity Constraints Support in Spark > > > Key: SPARK-19842 > URL: https://issues.apache.org/jira/browse/SPARK-19842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ioana Delaney >Priority: Major > Attachments: InformationalRIConstraints.doc > > > *Informational Referential Integrity Constraints Support in Spark* > This work proposes support for _informational primary key_ and _foreign key > (referential integrity) constraints_ in Spark. The main purpose is to open up > an area of query optimization techniques that rely on referential integrity > constraints semantics. > An _informational_ or _statistical constraint_ is a constraint such as a > _unique_, _primary key_, _foreign key_, or _check constraint_, that can be > used by Spark to improve query performance. Informational constraints are not > enforced by the Spark SQL engine; rather, they are used by Catalyst to > optimize the query processing. They provide semantics information that allows > Catalyst to rewrite queries to eliminate joins, push down aggregates, remove > unnecessary Distinct operations, and perform a number of other optimizations. > Informational constraints are primarily targeted to applications that load > and analyze data that originated from a data warehouse. For such > applications, the conditions for a given constraint are known to be true, so > the constraint does not need to be enforced during data load operations. > The attached document covers constraint definition, metastore storage, > constraint
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819809#comment-16819809 ] Gabor Somogyi commented on SPARK-27409: --- I mean does this cause any data processing issue other than the stack? > Micro-batch support for Kafka Source in Spark 2.3 > - > > Key: SPARK-27409 > URL: https://issues.apache.org/jira/browse/SPARK-27409 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Prabhjot Singh Bharaj >Priority: Major > > It seems with this change - > [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50] > in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in > micro-batch mode but only in continuous mode. Is that understanding correct ? > {code:java} > E Py4JJavaError: An error occurred while calling o217.load. > E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549) > E at > org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > E at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > E at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > E at java.lang.reflect.Method.invoke(Method.java:498) > E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > E at py4j.Gateway.invoke(Gateway.java:282) > E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > E at py4j.commands.CallCommand.execute(CallCommand.java:79) > E at py4j.GatewayConnection.run(GatewayConnection.java:238) > E at java.lang.Thread.run(Thread.java:748) > E Caused by: org.apache.kafka.common.KafkaException: > org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: > non-existent (No such file or directory) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44) > E at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93) > E at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51) > E at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657) > E ... 19 more > E Caused by: org.apache.kafka.common.KafkaException: > java.io.FileNotFoundException: non-existent (No such file or directory) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41) > E ... 
23 more > E Caused by: java.io.FileNotFoundException: non-existent (No such file or > directory) > E at java.io.FileInputStream.open0(Native Method) > E at java.io.FileInputStream.open(FileInputStream.java:195) > E at java.io.FileInputStream.(FileInputStream.java:138) > E at java.io.FileInputStream.(FileInputStream.java:93) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201) > E at > org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119) > E ... 24 more{code} > When running a simple data stream loader for kafka without an SSL cert, it > goes through this code block - > > {code:java} > ... > ... > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > ... > ...{code} > > Note
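The root cause in the trace above is that the Kafka consumer eagerly opens the configured SSL truststore when it is constructed. As a hedged PySpark sketch (option names follow the Kafka source integration docs; the broker address and paths are placeholders, and an active SparkSession `spark` is assumed):

```
# PySpark configuration fragment; needs the spark-sql-kafka-0-10 package and a
# reachable broker. The FileNotFoundException in the trace comes from
# kafka.ssl.truststore.location pointing at a file that does not exist on the
# driver/executor filesystem.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9093")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/etc/ssl/kafka/truststore.jks")  # must exist
      .option("kafka.ssl.truststore.password", "changeit")
      .option("subscribe", "my-topic")
      .load())
```

Because the consumer is built inside createContinuousReader/createMicroBatchReader, a bad truststore path surfaces at `load()` time rather than when the query starts.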
[jira] [Updated] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
[ https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27475: Issue Type: Sub-task (was: Bug) Parent: SPARK-23710 > dev/deps/spark-deps-hadoop-3.2 is incorrect > --- > > Key: SPARK-27475 > URL: https://issues.apache.org/jira/browse/SPARK-27475 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar.
[jira] [Updated] (SPARK-27402) Fix hadoop-3.2 test issue(except the hive-thriftserver module)
[ https://issues.apache.org/jira/browse/SPARK-27402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27402: Description: Fix sql/core and sql/hive modules test issue for hadoop-3.2 > Fix hadoop-3.2 test issue(except the hive-thriftserver module) > -- > > Key: SPARK-27402 > URL: https://issues.apache.org/jira/browse/SPARK-27402 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Fix sql/core and sql/hive modules test issue for hadoop-3.2
[jira] [Updated] (SPARK-27402) Fix hadoop-3.2 test issue(except the hive-thriftserver module)
[ https://issues.apache.org/jira/browse/SPARK-27402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27402: Description: (was: When we upgrade the built-in Hive to 2.3.4, the default spark.sql.hive.metastore.version should be 2.3.4. This will not be compatible with spark-2.3.3-bin-hadoop2.7.tgz and spark-2.4.1-bin-hadoop2.7.tgz.) > Fix hadoop-3.2 test issue(except the hive-thriftserver module) > -- > > Key: SPARK-27402 > URL: https://issues.apache.org/jira/browse/SPARK-27402 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Updated] (SPARK-27402) Fix hadoop-3.2 test issue(except the hive-thriftserver module)
[ https://issues.apache.org/jira/browse/SPARK-27402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27402: Summary: Fix hadoop-3.2 test issue(except the hive-thriftserver module) (was: Support HiveExternalCatalog backward compatibility test) > Fix hadoop-3.2 test issue(except the hive-thriftserver module) > -- > > Key: SPARK-27402 > URL: https://issues.apache.org/jira/browse/SPARK-27402 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > When we upgrade the built-in Hive to 2.3.4, the default > spark.sql.hive.metastore.version should be 2.3.4. This will not be compatible > with spark-2.3.3-bin-hadoop2.7.tgz and spark-2.4.1-bin-hadoop2.7.tgz.
[jira] [Commented] (SPARK-25088) Rest Server default & doc updates
[ https://issues.apache.org/jira/browse/SPARK-25088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819764#comment-16819764 ] t oo commented on SPARK-25088: -- Why block REST if auth is on? For example, I want to be able to use unauthenticated REST AND authenticated standard submission. > Rest Server default & doc updates > - > > Key: SPARK-25088 > URL: https://issues.apache.org/jira/browse/SPARK-25088 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > > The rest server could use some updates on defaults & docs, both in standalone > and mesos.
[jira] [Created] (SPARK-27486) Enable History server storage information test
shahid created SPARK-27486: -- Summary: Enable History server storage information test Key: SPARK-27486 URL: https://issues.apache.org/jira/browse/SPARK-27486 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.1, 2.3.3, 3.0.0 Reporter: shahid After SPARK-22050, we can store information about block update events in the event log if we enable "spark.eventLog.logBlockUpdates.enabled=true". The test related to storage in the History server suite was disabled after SPARK-13845. We can re-enable the test by adding an event log for an application that was run with "spark.eventLog.logBlockUpdates.enabled=true".
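The flags in question can be set in spark-defaults.conf; a sketch (the event-log directory is a placeholder):

```
spark.eventLog.enabled                   true
spark.eventLog.dir                       hdfs:///spark-event-logs
spark.eventLog.logBlockUpdates.enabled   true
```

With these set, block-update events are written to the application's event log, which the History server can replay to populate the storage tab.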
[jira] [Commented] (SPARK-27486) Enable History server storage information test
[ https://issues.apache.org/jira/browse/SPARK-27486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819759#comment-16819759 ] shahid commented on SPARK-27486: - I will raise a PR