Re: ORC Transaction Table - Spark
> Or, is this an artifact of an incompatibility between ORC files written by
> the Hive 2.x ORC serde not being readable by the Hive 1.x ORC serde?
> 3. Is there a difference in the ORC file format spec. at play here?

Nope, we're still defaulting to hive-0.12 format ORC files in Hive 2.x. We haven't changed the format compatibility in 5 years, so we're due for a refresh soon.

> 5. What's the mechanism that affects Spark here?

SparkSQL has never properly supported ACID, because to do this correctly Spark has to grab locks on the table and heartbeat the lock, to prevent a compaction from removing a currently used ACID snapshot. AFAIK, there's no code in SparkSQL to handle transactions in Hive. This is not related to the file format; it is related to the directory structure used to maintain ACID snapshots, so that you can delete a row without failing queries in progress. However, that's mostly an operational issue for production.

Off the raw filesystem (i.e. not as a table), I've used SparkSQL to read the ACID 2.x raw data to write an acidfsck which checks the underlying structures by reading them as raw data, so that I could easily run tests like "there's only 1 delete for each ROW__ID" when ACID 2.x was in development. You can think of the ACID data as basically Struct, Struct when reading it raw.

> 6. Any similar issues with Parquet format in Hive 1.x and 2.x?

Not similar - but a different set of Parquet incompatibilities is inbound, with parquet.writer.version=v2.

Cheers,
Gopal
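The "only 1 delete for each ROW__ID" test Gopal describes can be sketched in plain Python. This is an illustrative reconstruction, not the actual acidfsck code; it assumes the delete events have already been read out of the delta files as (originalTransaction, bucket, rowId) tuples:

```python
from collections import Counter

def check_single_delete(delete_events):
    """Check the ACID 2.x invariant that each ROW__ID, identified here by an
    (originalTransaction, bucket, rowId) tuple, is deleted at most once.
    Returns the ROW__IDs that violate the invariant."""
    counts = Counter(delete_events)
    return [row_id for row_id, n in counts.items() if n > 1]

# A double delete of the same row is flagged; single deletes pass.
events = [(17, 0, 0), (17, 0, 1), (17, 0, 1)]
print(check_single_delete(events))  # -> [(17, 0, 1)]
```

The same grouping could be done at scale with a SparkSQL aggregation over the raw files; the point is only that the check is a simple group-and-count over ROW__IDs.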
RE: ORC Transaction Table - Spark
Just some clarifying points, please:

1. Is this the general case for all file formats?
2. Or, is this an artifact of an incompatibility between ORC files written by the Hive 2.x ORC serde not being readable by the Hive 1.x ORC serde?
3. Is there a difference in the ORC file format spec. at play here?
4. Or, is any incompatibility limited to the Hive ORC serde implementations in Hive 1.x and 2.x?
5. What's the mechanism that affects Spark here?
   a. Same ORC serdes as Hive?
   b. Similar issues in the Spark ORC serde implementation(s) as in the Hive 1.x ORC serde?
6. Any similar issues with Parquet format in Hive 1.x and 2.x?

From: Aviral Agarwal
Sent: Wednesday, August 23, 2017 10:34 PM
To: user@hive.apache.org
Subject: Re: ORC Transaction Table - Spark

> So, there is no way possible right now for Spark to read Hive 2.x data?
Re: ORC Transaction Table - Spark
As far as I know, Spark can't read Hive's transactional tables yet: https://issues.apache.org/jira/browse/SPARK-16996

On Thu, Aug 24, 2017 at 4:34 AM, Aviral Agarwal wrote:
> So, there is no way possible right now for Spark to read Hive 2.x data?
Re: ORC Transaction Table - Spark
So, there is no way possible right now for Spark to read Hive 2.x data?

On Thu, Aug 24, 2017 at 12:17 AM, Eugene Koifman wrote:
> This looks like you have some data written by Hive 2.x and Hive 1.x code trying to read it.
>
> That is not supported.
Re: ORC Transaction Table - Spark
This looks like you have some data written by Hive 2.x and Hive 1.x code trying to read it.

That is not supported.

From: Aviral Agarwal
Date: Wednesday, August 23, 2017 at 12:24 AM
Subject: Re: ORC Transaction Table - Spark

> Yes, it is caused by the wrong naming convention of the delta directory:
> /apps/hive/warehouse/foo.db/bar/year=2017/month=5/delta_0645253_0645253_0001
> How do I solve this?
Re: ORC Transaction Table - Spark
Hi,

Yes, it is caused by the wrong naming convention of the delta directory:

/apps/hive/warehouse/foo.db/bar/year=2017/month=5/delta_0645253_0645253_0001

How do I solve this?

Thanks!
Aviral Agarwal

On Tue, Aug 22, 2017 at 11:50 PM, Eugene Koifman wrote:
> Could you do recursive “ls” in your table or partition that you are trying to read?
> Most likely you have files that don’t follow expected naming convention
Re: ORC Transaction Table - Spark
Could you do recursive “ls” in your table or partition that you are trying to read?

Most likely you have files that don’t follow expected naming convention

Eugene

From: Aviral Agarwal
Date: Tuesday, August 22, 2017 at 5:39 AM
Subject: ORC Transaction Table - Spark
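The recursive listing Eugene suggests can be automated for a locally mounted warehouse path. A small illustrative sketch; the delta_<minTxn>_<maxTxn> pattern is inferred from the directory names seen in this thread, not taken from Hive's source:

```python
import os
import re

# Base delta pattern from this thread: delta_<minTxn>_<maxTxn>.
# Hive 2.x may append a statement id, which this pattern deliberately rejects.
DELTA_1X = re.compile(r"delta_\d+_\d+$")

def odd_names(root):
    """Walk a table/partition directory tree and yield delta directories
    whose names a Hive 1.x-style reader would not expect."""
    for dirpath, dirnames, _ in os.walk(root):
        for d in dirnames:
            if d.startswith("delta_") and not DELTA_1X.match(d):
                yield os.path.join(dirpath, d)
```

For HDFS you would run the equivalent `hdfs dfs -ls -R` and inspect the names by hand, but the check is the same: flag any delta directory with more (or fewer) than two numeric components.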
ORC Transaction Table - Spark
Hi,

I am trying to read a Hive ORC transaction table through Spark, but I am getting the following error:

Caused by: java.lang.RuntimeException: serious problem
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
    .
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0645253_0001"
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
    ... 118 more

Any help would be appreciated.

Thanks and Regards,
Aviral Agarwal