Re: NPE when reading Parquet using Hive on Tez
> I dug a little deeper and it appears that the configuration property
> "columns.types", which is used in
> org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(),
> is not being set. When I manually set that property in hive, your
> example works fine.

Good to know more about the NPE. ORC uses the exact same parameter:

ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java:
    columnTypeProperty = conf.get(serdeConstants.LIST_COLUMN_TYPES);

But I think this could have a very simple explanation. Assuming you have a
build of Tez, I would recommend adding a couple of LOG.warn lines in
TezGroupedSplitsInputFormat, in

    public RecordReader getRecordReader(InputSplit split, JobConf job,
        Reporter reporter) throws IOException {

In particular, does the "this.conf" or the "job" conf object have
"columns.types" set? My guess is that the set command is setting that up in
the JobConf, while the default compiler places it in the this.conf object.
If that is the case, we can fix Parquet to pick it up off the right one.

Cheers,
Gopal
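[Editor's note] A stand-alone sketch of the diagnostic Gopal describes, using java.util.Properties as a stand-in for the two Hadoop configuration objects visible inside getRecordReader (the input format's own this.conf and the JobConf passed in by the caller); the class and method names here are illustrative, not Tez's actual code:

```java
import java.util.Properties;

public class ConfDiagnostic {
    // Stand-ins for the two conf objects inside
    // TezGroupedSplitsInputFormat.getRecordReader: "this.conf" and "job".
    // In the real class these would be LOG.warn lines.
    static void logColumnTypes(Properties thisConf, Properties job) {
        System.out.println("this.conf columns.types = "
                + thisConf.getProperty("columns.types"));
        System.out.println("job       columns.types = "
                + job.getProperty("columns.types"));
    }

    public static void main(String[] args) {
        Properties thisConf = new Properties();
        Properties job = new Properties();
        // Hypothesis from the thread: "set columns.types=...;" lands in the
        // JobConf while the compiler populates only this.conf (or vice
        // versa), so Parquet reads the property from the wrong object and
        // gets null.
        job.setProperty("columns.types", "int");
        logColumnTypes(thisConf, job); // one side prints null
    }
}
```

If the two print statements disagree, that would confirm Parquet is reading the property off the wrong configuration object.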
Re: NPE when reading Parquet using Hive on Tez
Hi Gopal,

With the release of 0.8.2, I thought I would give Tez another shot.
Unfortunately, I got the same NPE. I dug a little deeper and it appears
that the configuration property "columns.types", which is used in
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(),
is not being set. When I manually set that property in hive, your example
works fine.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> set columns.types=int;
hive> select count(*) from x where x.x > 1;
OK
1

I also saw that the configuration parameter parquet.column.index.access is
checked in that same function. Setting that property to "true" also fixes
my issue.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> set parquet.column.index.access=true;
hive> select count(*) from x where x.x > 1;
OK
1

Thanks for your help.

Best,
Adam

On Tue, Jan 5, 2016 at 9:10 AM, Adam Hunt wrote:
> Hi Gopal,
>
> Spark does offer dynamic allocation, but it doesn't always work as
> advertised. My experience with Tez has been more in line with my
> expectations. I'll bring up my issues with Spark on that list.
>
> I tried your example and got the same NPE. It might be a mapr-hive issue.
> Thanks for your help.
>
> Adam
>
> On Mon, Jan 4, 2016 at 12:58 PM, Gopal Vijayaraghavan wrote:
>>
>> > select count(*) from alexa_parquet;
>>
>> > Caused by: java.lang.NullPointerException
>> >   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
>> >   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
>> >   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
>> >   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
>> >   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
>> >   at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)
>>
>> This might be an NPE triggered off by a specific case of the type parser.
>>
>> I tested it out on my current build with simple types and it looks like
>> the issue needs more detail on the column types for a repro.
>>
>> hive> create temporary table x (x int) stored as parquet;
>> hive> insert into x values(1),(2);
>> hive> select count(*) from x where x.x > 1;
>> Status: DAG finished successfully in 0.18 seconds
>> OK
>> 1
>> Time taken: 0.792 seconds, Fetched: 1 row(s)
>> hive>
>>
>> Do you have INT96 in the schema?
>>
>> > I'm currently evaluating Hive on Tez as an alternative to keeping the
>> > SparkSQL thrift server running all the time locking up resources.
>>
>> Tez has a tunable value in tez.am.session.min.held-containers (i.e.,
>> something small like 10).
>>
>> And HiveServer2 can be made to work similarly because Spark's
>> HiveThriftServer2.scala is a wrapper around Hive's ThriftBinaryCLIService.
>>
>> Cheers,
>> Gopal
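[Editor's note] Why the parquet.column.index.access workaround sidesteps the NPE: with index access the reader resolves requested columns by position against the file schema instead of parsing the "columns.types" string. A toy illustration of the two paths, not Hive's actual code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnAccessDemo {
    // Name/type path: requires parsing the "columns.types" string; a
    // missing property (null) fails, mirroring the NPE in the thread.
    static List<String> typesFromProperty(String columnTypesProperty) {
        return Arrays.asList(columnTypesProperty.split(":"));
    }

    // Index path: columns are resolved by position against the file
    // schema, so the types string is never parsed at all.
    static List<String> typesByIndex(List<String> fileSchemaTypes, int[] wantedIndexes) {
        List<String> out = new ArrayList<>();
        for (int i : wantedIndexes) {
            out.add(fileSchemaTypes.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> fileSchema = Arrays.asList("int", "string");
        // Works even though "columns.types" was never set:
        System.out.println(typesByIndex(fileSchema, new int[]{0}));
        try {
            typesFromProperty(null); // property never set -> NPE
        } catch (NullPointerException e) {
            System.out.println("NPE, as in DataWritableReadSupport.init");
        }
    }
}
```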
Re: NPE when reading Parquet using Hive on Tez
Hi Gopal,

Spark does offer dynamic allocation, but it doesn't always work as
advertised. My experience with Tez has been more in line with my
expectations. I'll bring up my issues with Spark on that list.

I tried your example and got the same NPE. It might be a mapr-hive issue.
Thanks for your help.

Adam

On Mon, Jan 4, 2016 at 12:58 PM, Gopal Vijayaraghavan wrote:
>
> > select count(*) from alexa_parquet;
>
> > Caused by: java.lang.NullPointerException
> >   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
> >   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
> >   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
> >   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
> >   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
> >   at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)
>
> This might be an NPE triggered off by a specific case of the type parser.
>
> I tested it out on my current build with simple types and it looks like
> the issue needs more detail on the column types for a repro.
>
> hive> create temporary table x (x int) stored as parquet;
> hive> insert into x values(1),(2);
> hive> select count(*) from x where x.x > 1;
> Status: DAG finished successfully in 0.18 seconds
> OK
> 1
> Time taken: 0.792 seconds, Fetched: 1 row(s)
> hive>
>
> Do you have INT96 in the schema?
>
> > I'm currently evaluating Hive on Tez as an alternative to keeping the
> > SparkSQL thrift server running all the time locking up resources.
>
> Tez has a tunable value in tez.am.session.min.held-containers (i.e.,
> something small like 10).
>
> And HiveServer2 can be made to work similarly because Spark's
> HiveThriftServer2.scala is a wrapper around Hive's ThriftBinaryCLIService.
>
> Cheers,
> Gopal
Re: NPE when reading Parquet using Hive on Tez
> select count(*) from alexa_parquet;

> Caused by: java.lang.NullPointerException
>   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
>   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
>   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
>   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
>   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
>   at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)

This might be an NPE triggered off by a specific case of the type parser.

I tested it out on my current build with simple types and it looks like the
issue needs more detail on the column types for a repro.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> select count(*) from x where x.x > 1;
Status: DAG finished successfully in 0.18 seconds
OK
1
Time taken: 0.792 seconds, Fetched: 1 row(s)
hive>

Do you have INT96 in the schema?

> I'm currently evaluating Hive on Tez as an alternative to keeping the
> SparkSQL thrift server running all the time locking up resources.

Tez has a tunable value in tez.am.session.min.held-containers (i.e.,
something small like 10).

And HiveServer2 can be made to work similarly because Spark's
HiveThriftServer2.scala is a wrapper around Hive's ThriftBinaryCLIService.

Cheers,
Gopal
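[Editor's note] For reference, the held-containers knob mentioned above can be set per session from the Hive CLI; a sketch, where the value 10 is just the small starting point suggested above:

```
hive> set tez.am.session.min.held-containers=10;
```

The same property can also be set cluster-wide in tez-site.xml.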
NPE when reading Parquet using Hive on Tez
Hi,

When I perform any operation on a data set stored in Parquet format using
Hive on Tez, I get an NPE (see bottom for stack trace). The same operation
works fine on tables stored as text, Avro, ORC and Sequence files. The same
query on the Parquet tables also works fine if I use Hive on MR. I'm
running MapR 5.0.0 with Hive 1.2.0-mapr-1510, Hadoop 2.7.0-mapr-1506 and
Tez 0.7.0 compiled from source.

I'm currently evaluating Hive on Tez as an alternative to keeping the
SparkSQL thrift server running all the time locking up resources.
Unfortunately, this is a blocker since most of our data is stored in
Parquet files.

Thanks,
Adam

select count(*) from alexa_parquet;

or

create table kmeans_results_100_orc stored as orc as
select * from kmeans_results_100;

], TaskAttempt 3 failed, info=[Error: Failure while running task:
java.lang.RuntimeException: java.lang.RuntimeException:
java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:337)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:192)
        at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:131)
        at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:97)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
        at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:614)
        at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:593)
        at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:141)
        at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:370)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:127)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
        ... 14 more
Caused by: java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:252)
        at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:189)
        ... 25 more
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:99)
        at org
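[Editor's note] The innermost frame above is TypeInfoParser.tokenize walking the value of "columns.types". A minimal stand-alone illustration (a simplified stand-in, not Hive's actual parser) of how an unset property, i.e. a null string, produces exactly this kind of NPE:

```java
public class TokenizeNpe {
    // Simplified stand-in for TypeInfoUtils$TypeInfoParser.tokenize:
    // walks the colon-separated type string character by character.
    // If "columns.types" was never set, the string is null and the very
    // first access throws NullPointerException.
    static int tokenize(String typeInfoString) {
        int tokens = 0;
        for (int i = 0; i < typeInfoString.length(); i++) {
            if (typeInfoString.charAt(i) == ':') {
                tokens++;
            }
        }
        return tokens + 1; // number of column types found
    }

    public static void main(String[] args) {
        System.out.println(tokenize("int:string:bigint")); // prints 3
        try {
            tokenize(null); // the unset-property case
        } catch (NullPointerException e) {
            System.out.println("NullPointerException, as in the Tez stack trace");
        }
    }
}
```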