[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734150#comment-15734150 ]

Dongjoon Hyun commented on SPARK-11374:
---------------------------------------

For this issue, there is a discussion now on the PR.
It seems that we can make a decision now, YES(Resolved) or NO(Wont Fix).

> skip.header.line.count is ignored in HiveContext
> ------------------------------------------------
>
>                 Key: SPARK-11374
>                 URL: https://issues.apache.org/jira/browse/SPARK-11374
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Daniel Haviv
>
> csv table in Hive which is configured to skip the header row using
> TBLPROPERTIES("skip.header.line.count"="1").
> When querying from Hive the header row is not included in the data, but when
> running the same query via HiveContext I get the header row.
> "show create table " via the HiveContext confirms that it is aware of the
> setting.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15482812#comment-15482812 ]

Dongjoon Hyun commented on SPARK-11374:
---------------------------------------

Which version of Spark are you using now?

For **one** line header removal, the `spark-csv` package provides a workaround for Spark 1.6.x and below. In addition, Spark 2.0 supports that package's functionality natively.

https://github.com/databricks/spark-csv

If you want this as a SQL table option, we don't have a workaround.
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15476343#comment-15476343 ]

Rahul Jain commented on SPARK-11374:
------------------------------------

Hey guys, I am facing the same issue. Just wondering if there is any workaround for it, or if we can skip the first row somehow.
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420503#comment-15420503 ]

Dongjoon Hyun commented on SPARK-11374:
---------------------------------------

Hi [~stephane.maa...@gmail.com],

Thank you for the comments.

Yep. I noticed that option too, but that seems trickier.
The current approach of the Spark Scala API and my PR is to check whether the partition's file start position is zero.
So, it's not straightforward to apply to the footer option.

For this issue, I think it could be acceptable since the Spark Scala API already supports the `header` option.
However, for the `footer` option, I think we need a new JIRA issue to get some attention and to build consensus for that option.

Thanks,
Dongjoon.
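The start-position check described in the comment above can be illustrated with a small, Spark-free sketch. This is not the code from the PR: `read_split`, the split layout, and the byte offsets are hypothetical names and values chosen only to show the idea that header lines are dropped solely by the split that begins at offset zero.

```python
from typing import Iterable, Iterator, List, Tuple


def read_split(lines: Iterable[str], start_offset: int,
               skip_header_lines: int = 1) -> Iterator[str]:
    """Yield the records of one file split.

    Header lines are dropped only when the split starts at byte offset 0,
    i.e. when it is the first split of the file (hypothetical helper).
    """
    to_skip = skip_header_lines if start_offset == 0 else 0
    for index, line in enumerate(lines):
        if index < to_skip:
            continue  # still inside the header of the first split
        yield line


# Hypothetical file "col1,col2\na,1\nb,2\nc,3\n" cut into two splits.
# The first split covers bytes 0..13, the second starts at byte 14.
splits: List[Tuple[int, List[str]]] = [
    (0, ["col1,col2", "a,1"]),
    (14, ["b,2", "c,3"]),
]

rows = [row for start, lines in splits for row in read_split(lines, start)]
print(rows)
```

Each split can make this decision from its own start offset alone, which is what makes the header case cheap. As the comment notes, the same check does not carry over directly to a footer option.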
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420491#comment-15420491 ]

Stephane Maarek commented on SPARK-11374:
-----------------------------------------

Hi,

Thanks for the PR. Can you also test for the footer option? Might as well solve both issues.

Thanks
Stéphane
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420489#comment-15420489 ]

Apache Spark commented on SPARK-11374:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14638
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238689#comment-15238689 ]

Stephane Maarek commented on SPARK-11374:
-----------------------------------------

any updates on this? Just some log:

{code}
CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='', "skip.header.line.count"="1");
select * from spark_testing.test_csv_2;

hive> select * from spark_testing.test_csv_2;
OK
NULL    3
{code}

spark:

{code}
scala> sqlContext.sql("select * from spark_testing.test_csv_2").show()
+--------+--------+
|column_1|column_2|
+--------+--------+
|       a|    null|
|    null|    3.00|
+--------+--------+
{code}

That's a big problem
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229401#comment-15229401 ]

Stephane Maarek commented on SPARK-11374:
-----------------------------------------

I may add that more metadata isn't processed, namely
TBLPROPERTIES ('serialization.null.format'='')

Also, another issue (may still be related to Spark not reading Hive Metadata or not properly using Hive), but if you create a csv with the following (spaces intended)

1, 2,3
4, 5,6

use Hive as this:

CREATE EXTERNAL TABLE `my_table`(
  `c1` DECIMAL,
  `c2` DECIMAL,
  `c3` DECIMAL)
... etc

select * from my_table will return in Hive

1,2,3
4,5,6

But using a hive context, in Spark

1,null,3
4,null,6
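The difference reported above is consistent with one reader trimming whitespace before converting a field to a decimal and the other not. The sketch below only models that observable behaviour; it makes no claim about how Hive or Spark implement the conversion, and `parse_strict` and `parse_trimmed` are hypothetical helpers.

```python
import re
from decimal import Decimal
from typing import List, Optional

# A field counts as a decimal only if the whole string is digits with an
# optional sign and fraction; surrounding whitespace makes it invalid.
_DECIMAL = re.compile(r"[+-]?\d+(\.\d+)?")


def parse_strict(field: str) -> Optional[Decimal]:
    """Return the decimal value, or None (null) if the field is not clean."""
    return Decimal(field) if _DECIMAL.fullmatch(field) else None


def parse_trimmed(field: str) -> Optional[Decimal]:
    """Strip surrounding whitespace first, then parse strictly."""
    return parse_strict(field.strip())


# The two CSV lines from the comment, spaces intended.
csv_lines: List[str] = ["1, 2,3", "4, 5,6"]

strict_rows = [[parse_strict(f) for f in line.split(",")] for line in csv_lines]
trimmed_rows = [[parse_trimmed(f) for f in line.split(",")] for line in csv_lines]

print(strict_rows)   # second column is None, as in the HiveContext result
print(trimmed_rows)  # all three columns parse, as in the Hive result
```

Under this model the second column is null only in the strict variant, which matches the `1,null,3` rows seen through the hive context.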