Hi,

A couple of questions:

1. It seems the error is due to number format:

       Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0003024_0000"
               at java.util.concurrent.FutureTask.report(FutureTask.java:122)
               at java.util.concurrent.FutureTask.get(FutureTask.java:192)
               at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
               ... 75 more

   Why do you think it is due to ACID? (A small illustration of the parse failure is in the first sketch below this list.)

2. You should not be creating a HiveContext again in the REPL; there is no need for that. The REPL already reports: "SparkContext available as sc, HiveContext available as sqlContext." (See the second sketch below.)

3. Have you tried the same with Spark 2.x? (A Spark 2.x sketch is below as well.)
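On the number format itself: the stack trace shows Long.parseLong receiving the entire string "0003024_0000" inside AcidUtils.parseDelta, and the embedded underscore can never parse as a number. A quick illustration in Python 2.7 (the same interpreter version as your shell), where int() rejects the string for the same reason Long.parseLong does:

    # Python 2.7: an embedded underscore is not a valid digit, so this fails
    # just like Long.parseLong("0003024_0000") fails in the Hive code path.
    try:
        int("0003024_0000")
    except ValueError as e:
        print(e)  # invalid literal for int() with base 10: '0003024_0000'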
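For point 2, a minimal sketch of what the 1.6 PySpark shell session could look like, using the sqlContext the shell has already created (it is a HiveContext), instead of constructing a new one:

    # sc and sqlContext are pre-created by the PySpark shell; no imports needed.
    sqlContext.sql(
        "select count(*) from load_etl.trpt_geo_defect_prod_dec07_del_blank"
    ).show()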
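For point 3, a minimal Spark 2.x sketch, assuming a Spark 2.x installation with Hive support (the app name below is just a placeholder, and in the Spark 2.x shell the session is already available as `spark`, so the builder is only needed in a standalone script):

    from pyspark.sql import SparkSession

    # Create (or reuse) a session that can read Hive tables.
    spark = (SparkSession.builder
             .appName("acid-table-read-check")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql(
        "select count(*) from load_etl.trpt_geo_defect_prod_dec07_del_blank"
    ).show()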
On Sat, Mar 3, 2018 at 5:00 AM, Debabrata Ghosh <mailford...@gmail.com> wrote:

> Hi All,
>
> Greetings! I needed some help to read a Hive table via PySpark for which the transactional property is set to 'true' (in other words, the ACID property is enabled). Following is the entire stack trace and the description of the Hive table. Would you please be able to help me resolve the error:
>
> 18/03/01 11:06:22 INFO BlockManagerMaster: Registered BlockManager
> 18/03/01 11:06:22 INFO EventLoggingListener: Logging events to hdfs:///spark-history/local-1519923982155
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 1.6.3
>       /_/
>
> Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> from pyspark.sql import HiveContext
> >>> hive_context = HiveContext(sc)
> >>> hive_context.sql("select count(*) from load_etl.trpt_geo_defect_prod_dec07_del_blank").show()
> 18/03/01 11:09:45 INFO HiveContext: Initializing execution hive, version 1.2.1
> 18/03/01 11:09:45 INFO ClientWrapper: Inspected Hadoop version: 2.7.3.2.6.0.3-8
> 18/03/01 11:09:45 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.3.2.6.0.3-8
> 18/03/01 11:09:46 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 18/03/01 11:09:46 INFO ObjectStore: ObjectStore, initialize called
> 18/03/01 11:09:46 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 18/03/01 11:09:46 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
> 18/03/01 11:09:50 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 18/03/01 11:09:50 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
> 18/03/01 11:09:50 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
> 18/03/01 11:09:53 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
> 18/03/01 11:09:53 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
> 18/03/01 11:09:54 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
> 18/03/01 11:09:54 INFO ObjectStore: Initialized ObjectStore
> 18/03/01 11:09:54 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
> 18/03/01 11:09:54 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
> 18/03/01 11:09:54 INFO HiveMetaStore: Added admin role in metastore
> 18/03/01 11:09:54 INFO HiveMetaStore: Added public role in metastore
> 18/03/01 11:09:55 INFO HiveMetaStore: No user is added in admin role, since config is empty
> 18/03/01 11:09:55 INFO HiveMetaStore: 0: get_all_databases
> 18/03/01 11:09:55 INFO audit: ugi=devu...@ip.com ip=unknown-ip-addr cmd=get_all_databases
> 18/03/01 11:09:55 INFO HiveMetaStore: 0: get_functions: db=default pat=*
> 18/03/01 11:09:55 INFO audit: ugi=devu...@ip.com ip=unknown-ip-addr cmd=get_functions: db=default pat=*
> 18/03/01 11:09:55 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
> 18/03/01 11:09:55 INFO SessionState: Created local directory: /tmp/22ea9ac9-23d1-4247-9e02-ce45809cd9ae_resources
> 18/03/01 11:09:55 INFO SessionState: Created HDFS directory: /tmp/hive/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae
> 18/03/01 11:09:55 INFO SessionState: Created local directory: /tmp/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae
> 18/03/01 11:09:55 INFO SessionState: Created HDFS directory: /tmp/hive/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae/_tmp_space.db
> 18/03/01 11:09:55 INFO HiveContext: default warehouse location is /user/hive/warehouse
> 18/03/01 11:09:55 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
> 18/03/01 11:09:55 INFO ClientWrapper: Inspected Hadoop version: 2.7.3.2.6.0.3-8
> 18/03/01 11:09:55 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.3.2.6.0.3-8
> 18/03/01 11:09:56 INFO metastore: Trying to connect to metastore with URI thrift://ip.com:9083
> 18/03/01 11:09:56 INFO metastore: Connected to metastore.
> 18/03/01 11:09:56 INFO SessionState: Created local directory: /tmp/24379bb3-8ddf-4716-b68d-07ac0f92d9f1_resources
> 18/03/01 11:09:56 INFO SessionState: Created HDFS directory: /tmp/hive/hdetldev/24379bb3-8ddf-4716-b68d-07ac0f92d9f1
> 18/03/01 11:09:56 INFO SessionState: Created local directory: /tmp/hdetldev/24379bb3-8ddf-4716-b68d-07ac0f92d9f1
> 18/03/01 11:09:56 INFO SessionState: Created HDFS directory: /tmp/hive/hdetldev/24379bb3-8ddf-4716-b68d-07ac0f92d9f1/_tmp_space.db
> 18/03/01 11:09:56 INFO ParseDriver: Parsing command: select count(*) from load_etl.trpt_geo_defect_prod_dec07_del_blank
> 18/03/01 11:09:57 INFO ParseDriver: Parse Completed
> 18/03/01 11:09:57 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 813.6 KB, free 510.3 MB)
> 18/03/01 11:09:57 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 57.5 KB, free 510.3 MB)
> 18/03/01 11:09:57 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35508 (size: 57.5 KB, free: 511.1 MB)
> 18/03/01 11:09:57 INFO SparkContext: Created broadcast 0 from showString at NativeMethodAccessorImpl.java:-2
> 18/03/01 11:09:58 INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
> 18/03/01 11:09:58 INFO deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.py", line 257, in show
>     print(self._jdf.showString(n, truncate))
>   File "/var/opt/teradata/anaconda4.1.1/anaconda/lib/python2.7/site-packages/py4j-0.10.6-py2.7.egg/py4j/java_gateway.py", line 1160, in __call__
>     answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
>     return f(*a, **kw)
>   File "/var/opt/teradata/anaconda4.1.1/anaconda/lib/python2.7/site-packages/py4j-0.10.6-py2.7.egg/py4j/protocol.py", line 320, in get_return_value
>     format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o44.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#60L])
> +- TungstenExchange SinglePartition, None
>    +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#63L])
>       +- HiveTableScan MetastoreRelation load_etl, trpt_geo_defect_prod_dec07_del_blank, None
>
>         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:80)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>         at org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>         at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:187)
>         at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>         at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>         at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
>         at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
>         at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>         at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2087)
>         at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1499)
>         at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1506)
>         at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1376)
>         at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>         at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2100)
>         at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1375)
>         at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1457)
>         at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:209)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> TungstenExchange SinglePartition, None
> +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#63L])
>    +- HiveTableScan MetastoreRelation load_etl, trpt_geo_defect_prod_dec07_del_blank, None
>
>         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>         at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:86)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:80)
>         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
>         ... 36 more
> Caused by: java.lang.RuntimeException: serious problem
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
>         at org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:220)
>         at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
>         at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
>         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
>         ... 44 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0003024_0000"
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
>         ... 75 more
> Caused by: java.lang.NumberFormatException: For input string: "0003024_0000"
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>         at java.lang.Long.parseLong(Long.java:589)
>         at java.lang.Long.parseLong(Long.java:631)
>         at org.apache.hadoop.hive.ql.io.AcidUtils.parseDelta(AcidUtils.java:310)
>         at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:379)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:634)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:620)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         ... 1 more
>
> Here is the detail of the table creation:
>
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 1.2.1000.2.6.0.3-8 by Apache Hive
> 0: jdbc:hive2://toplxhdmd001.rights.com> show create table load_etl.trpt_geo_defect_prod_dec07_del_blank;
> CREATE TABLE `load_etl.trpt_geo_defect_prod_dec07_del_blank`(
>   `line_seg_nbr` int,
>   `track_type` string,
>   `track_sdtk_nbr` string,
>   `mile_post_beg` double,
>   `ss_nbr` int,
>   `ss_len` int,
>   `ris1mpb` double,
>   `mile_label` string,
>   `test_dt` string,
>   `def_prty` string,
>   `def_nbr` int,
>   `def_type` string,
>   `def_ampltd` double,
>   `def_lgth` int,
>   `car_cd` string,
>   `tsc_cd` string,
>   `class` string,
>   `test_fspd` string,
>   `test_pspd` string,
>   `restr_fspd` string,
>   `restr_pspd` string,
>   `def_land_mark` string,
>   `repeat_cd` string,
>   `mp_incr_cd` string,
>   `test_trk_dir` string,
>   `eff_dt` string,
>   `trk_file` string,
>   `dfct_cor_dt` string,
>   `dfct_acvt` string,
>   `dfct_slw_ord_ind` string,
>   `emp_id` string,
>   `eff_ts` string,
>   `dfct_cor_tm` string,
>   `dfct_freight_spd` int,
>   `dfct_amtrak_spd` int,
>   `mile_post_sfx` string,
>   `work_order_id` string,
>   `loc_id_beg` string,
>   `loc_id_end` string,
>   `link_id` string,
>   `lst_maint_ts` string,
>   `del_ts` string,
>   `gps_longitude` double,
>   `gps_latitude` double,
>   `geo_car_nme` string,
>   `rept_gc_nme` string,
>   `rept_dfct_tst` string,
>   `rept_dfct_nbr` int,
>   `restr_trk_cls` string,
>   `tst_hist_cd` string,
>   `cret_ts` string,
>   `ylw_grp_nbr` int,
>   `geo_dfct_grp_nme` string,
>   `supv_rollup_cd` string,
>   `dfct_stat_cd` string,
>   `lst_maint_id` string,
>   `del_rsn_cd` string,
>   `umt_prcs_user_id` string,
>   `gdfct_vinsp_srestr` string,
>   `gc_opr_init` string)
> CLUSTERED BY (
>   geo_car_nme)
> INTO 2 BUCKETS
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION
>   'hdfs://HADOOP02/apps/hive/warehouse/load_etl.db/trpt_geo_defect_prod_dec07_del_blank'
> TBLPROPERTIES (
>   'numFiles'='4',
>   'numRows'='0',
>   'rawDataSize'='0',
>   'totalSize'='2566942',
>   'transactional'='true',
>   'transient_lastDdlTime'='1518695199')
>
> Thanks,
> D

--
Best Regards,
Ayan Guha