[jira] [Commented] (SPARK-30442) Write mode ignored when using CodecStreams
[ https://issues.apache.org/jira/browse/SPARK-30442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047024#comment-17047024 ] Abhishek Madav commented on SPARK-30442: In case of a task failure, say the task fails to write to local disk or is interrupted, an empty file is still materialized on the file system. The next task attempt that retries the write to this location sees the existing file and fails with a FileAlreadyExistsException, so the write path is not resilient to task failures. > Write mode ignored when using CodecStreams > -- > > Key: SPARK-30442 > URL: https://issues.apache.org/jira/browse/SPARK-30442 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.4 >Reporter: Jesse Collins >Priority: Major > > Overwrite is hardcoded to false in the codec stream. This can cause issues, > particularly with AWS tools, that make retries impossible. > Ideally, this should be read from the write mode set for the DataWriter that > is writing through this codec class. > [https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala#L81] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
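For reference, the retry failure mode can be sketched directly against Hadoop's FileSystem API (a minimal illustration with a hypothetical output path, not the actual CodecStreams code):

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch of the failure mode: attempt 1 materializes an (empty) file before
// failing; attempt 2 calls create() with overwrite = false, sees the leftover
// file, and throws FileAlreadyExistsException.
val fs  = FileSystem.get(new Configuration())
val out = new Path("/tmp/part-00000.csv.gz") // hypothetical output path

fs.create(out, /* overwrite = */ false).close() // attempt 1: leaves a file behind
fs.create(out, /* overwrite = */ false)         // attempt 2 (retry): FileAlreadyExistsException
{code}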
[jira] [Commented] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551296#comment-16551296 ] Abhishek Madav commented on SPARK-24864: Thanks for the reply. The views are currently created by the customer, and the Spark job hasn't kept pace with their upgrade from 1.6 to 2.0+, so they see this as a regression. Is there anything that can be done to go back to the 1.6 way of column referencing? > Cannot resolve auto-generated column ordinals in a hive view > > > Key: SPARK-24864 > URL: https://issues.apache.org/jira/browse/SPARK-24864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Abhishek Madav >Priority: Major > > Spark job reading from a hive-view fails with analysis exception when > resolving column ordinals which are autogenerated. > *Exception*: > {code:java} > scala> spark.sql("Select * from vsrc1new").show > org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given > input columns: [id, upper(name)]; line 1 pos 24; > 'Project [*] > +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` > +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] > +- SubqueryAlias vsrc1new > +- Project [id#634, upper(name#635) AS upper(name)#636] > +- MetastoreRelation default, src1 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > {code} > *Steps to reproduce:* > 1: Create a simple table, say src > {code:java} > CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS > TERMINATED BY ',' > {code} > 2: Create a view, say with name vsrc1new > {code:java} > CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, > upper(name) FROM src1) vsrc1new; > {code} > 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. > 4. Creating a dataframe using: > {code:java} > spark.sql("Select * from vsrc1new").show //throws error > {code} > The auto-generated column names for the view are not resolved. Am I possibly > missing some spark-sql configuration here? I tried the repro-case against > spark 1.6 and that worked fine. Any inputs are appreciated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
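One workaround worth trying, if the customer's view definitions can be recreated (a sketch under that assumption, not a confirmed fix): alias the computed column explicitly inside the subquery, so the view body never references Hive's auto-generated `_c1` ordinal at all.

{code:java}
// Hypothetical workaround: recreate the view with an explicit alias on the
// computed column, leaving no auto-generated `_c1` ordinal for Spark 2.x to
// resolve. Assumes a Spark 2.x session with Hive support.
spark.sql("DROP VIEW IF EXISTS vsrc1new")
spark.sql("""
  CREATE VIEW vsrc1new AS
  SELECT id, uname
  FROM (SELECT id, upper(name) AS uname FROM src1) t
""")
spark.sql("SELECT * FROM vsrc1new").show()
{code}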
[jira] [Updated] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-24864: --- Description: Spark job reading from a hive-view fails with analysis exception when resolving column ordinals which are autogenerated. *Exception*: {code:java} scala> spark.sql("Select * from vsrc1new").show org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given input columns: [id, upper(name)]; line 1 pos 24; 'Project [*] +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] +- SubqueryAlias vsrc1new +- Project [id#634, upper(name#635) AS upper(name)#636] +- MetastoreRelation default, src1 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) {code} *Steps to reproduce:* 1: Create a simple table, say src {code:java} CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' {code} 2: Create a view, say with name vsrc1new {code:java} CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, upper(name) FROM src1) vsrc1new; {code} 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. 4. Creating a dataframe using: {code:java} spark.sql("Select * from vsrc1new").show //throws error {code} The auto-generated column names for the view are not resolved. Am I possibly missing some spark-sql configuration here? I tried the repro-case against spark 1.6 and that worked fine. Any inputs are appreciated. was: Spark job reading from a hive-view fails with analysis exception when resolving column ordinals which are autogenerated. 
*Exception*: {code:java} scala> spark.sql("Select * from vsrc1new").show org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given input columns: [id, upper(name)]; line 1 pos 24; 'Project [*] +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] +- SubqueryAlias vsrc1new +- Project [id#634, upper(name#635) AS upper(name)#636] +- MetastoreRelation default, src1 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) {code} Steps to reproduce: 1: Create a simple table, say src {code:java} CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' {code} 2: Create a view, say with name vsrc1new {code:java} CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, upper(name) FROM src1) vsrc1new; {code} 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. 4. Creating a dataframe using: {code:java} spark.sql("Select * from vsrc1new").show //throws error {code} The auto-generated column names for the view are not resolved. Am I possibly missing some spark-sql configuration here? I tried the repro-case against spark 1.6 and that worked fine. Any inputs are appreciated. > Cannot resolve auto-generated column ordinals in a hive view > > > Key: SPARK-24864 > URL: https://issues.apache.org/jira/browse/SPARK-24864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Abhishek Madav >Priority: Major > Fix For: 2.4.0 > > > Spark job reading from a hive-view fails with analysis exception when > resolving column ordinals which are autogenerated. > *Exception*: > {code:java} > scala> spark.sql("Select * from vsr
[jira] [Updated] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-24864: --- Summary: Cannot resolve auto-generated column ordinals in a hive view (was: Cannot reference auto-generated column ordinals in a hive view) > Cannot resolve auto-generated column ordinals in a hive view > > > Key: SPARK-24864 > URL: https://issues.apache.org/jira/browse/SPARK-24864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Abhishek Madav >Priority: Major > Fix For: 2.4.0 > > > Spark job reading from a hive-view fails with analysis exception when > resolving column ordinals which are autogenerated. > *Exception*: > {code:java} > scala> spark.sql("Select * from vsrc1new").show > org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given > input columns: [id, upper(name)]; line 1 pos 24; > 'Project [*] > +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` > +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] > +- SubqueryAlias vsrc1new > +- Project [id#634, upper(name#635) AS upper(name)#636] > +- MetastoreRelation default, src1 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > {code} > Steps to reproduce: > 1: Create a simple table, say src > {code:java} > CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS > TERMINATED BY ',' > {code} > 2: Create a view, say with name vsrc1new > {code:java} > CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, > upper(name) FROM src1) vsrc1new; > {code} > 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. > 4. Creating a dataframe using: > {code:java} > spark.sql("Select * from vsrc1new").show //throws error > {code} > The auto-generated column names for the view are not resolved. Am I possibly > missing some spark-sql configuration here? I tried the repro-case against > spark 1.6 and that worked fine. Any inputs are appreciated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24864) Cannot reference auto-generated column ordinals in a hive view
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-24864: --- Summary: Cannot reference auto-generated column ordinals in a hive view (was: Cannot reference auto-generated column ordinals in a hive-view. ) > Cannot reference auto-generated column ordinals in a hive view > -- > > Key: SPARK-24864 > URL: https://issues.apache.org/jira/browse/SPARK-24864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Abhishek Madav >Priority: Major > Fix For: 2.4.0 > > > Spark job reading from a hive-view fails with analysis exception when > resolving column ordinals which are autogenerated. > *Exception*: > {code:java} > scala> spark.sql("Select * from vsrc1new").show > org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given > input columns: [id, upper(name)]; line 1 pos 24; > 'Project [*] > +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` > +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] > +- SubqueryAlias vsrc1new > +- Project [id#634, upper(name#635) AS upper(name)#636] > +- MetastoreRelation default, src1 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > {code} > Steps to reproduce: > 1: Create a simple table, say src > {code:java} > CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS > TERMINATED BY ',' > {code} > 2: Create a view, say with name vsrc1new > {code:java} > CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, > upper(name) FROM src1) vsrc1new; > {code} > 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. > 4. Creating a dataframe using: > {code:java} > spark.sql("Select * from vsrc1new").show //throws error > {code} > The auto-generated column names for the view are not resolved. Am I possibly > missing some spark-sql configuration here? I tried the repro-case against > spark 1.6 and that worked fine. Any inputs are appreciated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24864) Cannot reference auto-generated column ordinals in a hive-view.
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-24864: --- Description: Spark job reading from a hive-view fails with analysis exception when resolving column ordinals which are autogenerated. *Exception*: {code:java} scala> spark.sql("Select * from vsrc1new").show org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given input columns: [id, upper(name)]; line 1 pos 24; 'Project [*] +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] +- SubqueryAlias vsrc1new +- Project [id#634, upper(name#635) AS upper(name)#636] +- MetastoreRelation default, src1 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) {code} Steps to reproduce: 1: Create a simple table, say src {code:java} CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' {code} 2: Create a view, say with name vsrc1new {code:java} CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, upper(name) FROM src1) vsrc1new; {code} 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. 4. Creating a dataframe using: {code:java} spark.sql("Select * from vsrc1new").show //throws error {code} The auto-generated column names for the view are not resolved. Am I possibly missing some spark-sql configuration here? I tried the repro-case against spark 1.6 and that worked fine. Any inputs are appreciated. > Cannot reference auto-generated column ordinals in a hive-view. > > > Key: SPARK-24864 > URL: https://issues.apache.org/jira/browse/SPARK-24864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Abhishek Madav >Priority: Major > Fix For: 2.4.0 > > > Spark job reading from a hive-view fails with analysis exception when > resolving column ordinals which are autogenerated. 
> *Exception*: > {code:java} > scala> spark.sql("Select * from vsrc1new").show > org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given > input columns: [id, upper(name)]; line 1 pos 24; > 'Project [*] > +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` > +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] > +- SubqueryAlias vsrc1new > +- Project [id#634, upper(name#635) AS upper(name)#636] > +- MetastoreRelation default, src1 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > {code} > Steps to reproduce: > 1: Create a simple table, say src > {code:java} > CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS > TERMINATED BY ',' > {code} > 2: Create a view, say with name vsrc1new > {code:java} > CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, > upper(name) FROM src1) vsrc1new; > {code} > 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. > 4. Creating a dataframe using: > {code:java} > spark.sql("Select * from vsrc1new").show //throws error > {code} > The auto-generated column names for the view are not resolved. Am I possibly > missing some spark-sql configuration here? I tried the repro-case against > spark 1.6 and that worked fine. Any inputs are appreciated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) ---
[jira] [Created] (SPARK-24864) Cannot reference auto-generated column ordinals in a hive-view.
Abhishek Madav created SPARK-24864: -- Summary: Cannot reference auto-generated column ordinals in a hive-view. Key: SPARK-24864 URL: https://issues.apache.org/jira/browse/SPARK-24864 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.1 Reporter: Abhishek Madav Fix For: 2.4.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
[ https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-20697: --- Priority: Critical (was: Major) > MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables. > -- > > Key: SPARK-20697 > URL: https://issues.apache.org/jira/browse/SPARK-20697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.2.1, 2.3.0 >Reporter: Abhishek Madav >Priority: Critical > > MSCK REPAIR TABLE, when used to recover partitions for a partitioned+bucketed > table, does not restore the bucketing information to the storage descriptor in > the metastore. > Steps to reproduce: > 1) Create a partitioned+bucketed table in hive: CREATE TABLE partbucket(a int) > PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED > FIELDS TERMINATED BY ','; > 2) In Hive-CLI issue a desc formatted for the table. > # col_name data_type comment > > a int > > # Partition Information > # col_name data_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner: devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: hdfs://localhost:8020/user/hive/warehouse/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > transient_lastDdlTime 1494437467 > > # Storage Information > SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: 10 > Bucket Columns: [a] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > 3) In spark-shell, > scala> spark.sql("MSCK REPAIR TABLE partbucket") > 4) Back to Hive-CLI > desc formatted partbucket; > # col_name data_type comment > > a int > > # Partition Information > # col_name data_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner: devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > spark.sql.partitionProvider catalog > transient_lastDdlTime 1494437647 > > # Storage Information > SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > Further inserts to this table cannot be made in bucketed fashion through > Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
[ https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-20697: --- Affects Version/s: 2.2.0 2.2.1 2.3.0 > MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables. > -- > > Key: SPARK-20697 > URL: https://issues.apache.org/jira/browse/SPARK-20697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.2.1, 2.3.0 >Reporter: Abhishek Madav >Priority: Major > > MSCK REPAIR TABLE, when used to recover partitions for a partitioned+bucketed > table, does not restore the bucketing information to the storage descriptor in > the metastore. > Steps to reproduce: > 1) Create a partitioned+bucketed table in hive: CREATE TABLE partbucket(a int) > PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED > FIELDS TERMINATED BY ','; > 2) In Hive-CLI issue a desc formatted for the table. > # col_name data_type comment > > a int > > # Partition Information > # col_name data_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner: devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: hdfs://localhost:8020/user/hive/warehouse/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > transient_lastDdlTime 1494437467 > > # Storage Information > SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: 10 > Bucket Columns: [a] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > 3) In spark-shell, > scala> spark.sql("MSCK REPAIR TABLE partbucket") > 4) Back to Hive-CLI > desc formatted partbucket; > # col_name data_type comment > > a int > > # Partition Information > # col_name data_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner: devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > spark.sql.partitionProvider catalog > transient_lastDdlTime 1494437647 > > # Storage Information > SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > Further inserts to this table cannot be made in bucketed fashion through > Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
[ https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-20697: --- Description: MSCK REPAIR TABLE, when used to recover partitions for a partitioned+bucketed table, does not restore the bucketing information to the storage descriptor in the metastore. Steps to reproduce: 1) Create a partitioned+bucketed table in hive: CREATE TABLE partbucket(a int) PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; 2) In Hive-CLI issue a desc formatted for the table. # col_name data_type comment a int # Partition Information # col_name data_type comment b int # Detailed Table Information Database: sparkhivebucket Owner: devbld CreateTime: Wed May 10 10:31:07 PDT 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://localhost:8020/user/hive/warehouse/partbucket Table Type: MANAGED_TABLE Table Parameters: transient_lastDdlTime 1494437467 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: 10 Bucket Columns: [a] Sort Columns: [] Storage Desc Params: field.delim , serialization.format, 3) In spark-shell, scala> spark.sql("MSCK REPAIR TABLE partbucket") 4) Back to Hive-CLI desc formatted partbucket; # col_name data_type comment a int # Partition Information # col_name data_type comment b int # Detailed Table Information Database: sparkhivebucket Owner: devbld CreateTime: Wed May 10 10:31:07 PDT 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket Table Type: MANAGED_TABLE Table Parameters: spark.sql.partitionProvider catalog transient_lastDdlTime 1494437647 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: field.delim , serialization.format, Further inserts to this table cannot be made in bucketed fashion through Hive. was: MSCK REPAIR TABLE, when used to recover partitions for a partitioned+bucketed table, does not restore the bucketing information to the storage descriptor in the metastore. Steps to reproduce: 1) Create a partitioned+bucketed table in hive: CREATE TABLE partbucket(a int) PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; 2) In Hive-CLI issue a desc formatted for the table. # col_name data_type comment a int # Partition Information # col_name data_type comment b int # Detailed Table Information Database: sparkhivebucket Owner:
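The regression can be verified end-to-end from spark-shell as well (a sketch using standard Spark 2.x APIs, equivalent to steps 3 and 4 above):

{code:java}
// Sketch: run the repair and inspect the table's storage information from
// Spark instead of Hive-CLI. After the repair, the bucketing spec is gone
// (Num Buckets: -1, Bucket Columns: []).
spark.sql("MSCK REPAIR TABLE partbucket")
spark.sql("DESCRIBE FORMATTED partbucket").show(100, false)
{code}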
[jira] [Created] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
Abhishek Madav created SPARK-20697: -- Summary: MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables. Key: SPARK-20697 URL: https://issues.apache.org/jira/browse/SPARK-20697 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Abhishek Madav MSCK REPAIR TABLE, when used to recover partitions for a partitioned+bucketed table, does not restore the bucketing information to the storage descriptor in the metastore. Steps to reproduce: 1) Create a partitioned+bucketed table in hive: CREATE TABLE partbucket(a int) PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; 2) In Hive-CLI issue a desc formatted for the table. # col_name data_type comment a int # Partition Information # col_name data_type comment b int # Detailed Table Information Database: sparkhivebucket Owner: devbld CreateTime: Wed May 10 10:31:07 PDT 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://localhost:8020/user/hive/warehouse/partbucket Table Type: MANAGED_TABLE Table Parameters: transient_lastDdlTime 1494437467 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: 10 Bucket Columns: [a] Sort Columns: [] Storage Desc Params: field.delim , serialization.format, 3) In spark-shell, scala> spark.sql("MSCK REPAIR TABLE partbucket") 4) Back to Hive-CLI desc formatted partbucket; # col_name data_type comment a int # Partition Information # col_name data_type comment b int # Detailed Table Information Database: sparkhivebucket Owner: devbld CreateTime: Wed May 10 10:31:07 PDT 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket Table Type: MANAGED_TABLE Table Parameters: spark.sql.partitionProvider catalog transient_lastDdlTime 1494437647 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: field.delim , serialization.format, -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
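A possible mitigation until this is fixed (an assumption on my part, not a verified fix): after Spark's MSCK REPAIR TABLE has reset the storage descriptor, re-declare the bucketing spec from the Hive side so Num Buckets and Bucket Columns are restored. This uses standard Hive DDL:

{code:java}
-- Hypothetical mitigation, run from Hive-CLI/beeline after the repair:
-- re-apply the bucketing spec that Spark's MSCK REPAIR TABLE dropped.
-- Whether Hive then accepts bucketed inserts again was not verified here.
ALTER TABLE partbucket CLUSTERED BY (a) INTO 10 BUCKETS;
{code}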
[jira] [Commented] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true
[ https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999016#comment-15999016 ] Abhishek Madav commented on SPARK-19532: I am running into this issue, wherein a code path similar to hiveWriterContainer is trying to write to the HDFS location. I tried setting spark.speculation to false, but that doesn't seem to be the issue. Is there any workaround? This wait time makes the job run very slowly. > [Core]`DataStreamer for file` threads of DFSOutputStream leak if set > `spark.speculation` to true > > > Key: SPARK-19532 > URL: https://issues.apache.org/jira/browse/SPARK-19532 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Priority: Critical > > When `spark.speculation` is set to true, on the thread dump page of an Executor > in the WebUI, I found about 1300 threads named "DataStreamer for file > /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" > in TIMED_WAITING state. > {code} > java.lang.Object.wait(Native Method) > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564) > {code} > Off-heap memory usage grows steadily until the Executor exits with an OOM > exception. This problem occurs only when writing data to Hadoop (tasks may be > killed by the Executor during writing). > Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]? > The version of Hadoop is 2.6.4. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
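For reference, one sanity check worth doing (a sketch with standard Spark 2.x APIs): confirm the speculation flag actually reached the running session, since the reported leak only occurs with speculation enabled:

{code:java}
// Sketch: set spark.speculation explicitly and verify what the running
// session actually sees. spark.speculation is a standard Spark 2.x key.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.speculation", "false")
  .getOrCreate()

// Print the effective value ("unset" if the key never reached the conf):
println(spark.sparkContext.getConf.get("spark.speculation", "unset"))

// Equivalent at submit time:
//   spark-submit --conf spark.speculation=false ...
{code}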
[jira] [Commented] (SPARK-17302) Cannot set non-Spark SQL session variables in hive-site.xml, spark-defaults.conf, or using --conf
[ https://issues.apache.org/jira/browse/SPARK-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870616#comment-15870616 ] Abhishek Madav commented on SPARK-17302: I believe this is fixed as part of SPARK-15887. Could you check? > Cannot set non-Spark SQL session variables in hive-site.xml, > spark-defaults.conf, or using --conf > - > > Key: SPARK-17302 > URL: https://issues.apache.org/jira/browse/SPARK-17302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue > > When configuration changed for 2.0 to the new SparkSession structure, Spark > stopped using Hive's internal HiveConf for session state and now uses > HiveSessionState and an associated SQLConf. Now, session options like > hive.exec.compress.output and hive.exec.dynamic.partition.mode are pulled > from this SQLConf. This doesn't include session properties from hive-site.xml > (including hive.exec.compress.output), and no longer contains Spark-specific > overrides from spark-defaults.conf that used the spark.hadoop.hive... pattern. > Also, setting these variables on the command-line no longer works because > settings must start with "spark.". > Is there a recommended way to set Hive session properties? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
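If SPARK-15887 (which restored hive-site.xml handling in Spark 2.0.1+) does cover this, then session-level Hive properties should be settable again; a sketch of two routes, assuming a Spark 2.x session with Hive support:

{code:java}
// 1) Per-session, via the SQL SET command (stored in the session's SQLConf):
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET hive.exec.compress.output=true")

// 2) At launch, via the spark.hadoop.* prefix (copied into the Hadoop conf):
//    spark-submit --conf spark.hadoop.hive.exec.compress.output=true ...
{code}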