Yes. Something is wrong even when I query a table in Hive that has correct data:
it throws a warning about corrupt statistics before showing the result of 1 row.

hive> select * from abc limit 1;

Jul 30, 2016 12:52:14 PM WARNING: org.apache.parquet.CorruptStatistics:
Ignoring statistics because created_by could not be parsed (see
PARQUET-251): parquet-mr version 1.6.0
org.apache.parquet.VersionParser$VersionParseException: Could not parse
created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*)
)?\(build ?(.*)\)
        at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
        at
org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
        at
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
        at
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
        at
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
        at
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:130)
        at
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
        at
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
        at
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:117)
        at
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:80)
        at
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
        at
org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:682)
        at org....
2009-12-31       CPT     30-79-72       18780869       LTSB STH KENSINGTO CD 5710 31DEC09      90.0    NULL    400.0

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 July 2016 at 12:39, Chanh Le <giaosu...@gmail.com> wrote:

> I received this log during a recent debug session.
> Is that related to PARQUET-251?
> But I checked: Spark is currently using Parquet 1.8.1, which means it should
> already be fixed.
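>
> A quick way to double-check which parquet-mr the Spark reader side is actually on
> (just a sketch, and it assumes the org.apache.parquet.Version class from
> parquet-common is on the classpath, as it is in 1.8.x) is to print its version
> string from spark-shell:
>
> // prints the parquet-mr version string that gets written as created_by
> println(org.apache.parquet.Version.FULL_VERSION)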
>
>
> *16/07/30 18:32:11 INFO SparkExecuteStatementOperation: Running query
> 'select * from topic18' with 72649e37-3ef4-4acd-8d01-4a28e79a1f9a*
> 16/07/30 18:32:11 INFO SparkSqlParser: Parsing command: select * from
> topic18
> 16/07/30 18:32:11 INFO SessionState: Created local directory:
> /var/folders/3c/_6cznybx2571l0b7f5dstkfr0000gn/T/e8d2eb4d-1682-40fd-ad66-f0643692ded7_resources
> 16/07/30 18:32:11 INFO SessionState: Created HDFS directory:
> /tmp/hive/anonymous/e8d2eb4d-1682-40fd-ad66-f0643692ded7
> 16/07/30 18:32:11 INFO SessionState: Created local directory:
> /var/folders/3c/_6cznybx2571l0b7f5dstkfr0000gn/T/giaosudau/e8d2eb4d-1682-40fd-ad66-f0643692ded7
> 16/07/30 18:32:11 INFO SessionState: Created HDFS directory:
> /tmp/hive/anonymous/e8d2eb4d-1682-40fd-ad66-f0643692ded7/_tmp_space.db
> 16/07/30 18:32:11 INFO HiveClientImpl: Warehouse location for Hive client
> (version 1.2.1) is file:/Users/giaosudau/IdeaProjects/spark/spark-warehouse
> 16/07/30 18:32:12 INFO HiveMetaStore: 1: create_database:
> Database(name:default, description:default database,
> locationUri:file:/Users/giaosudau/IdeaProjects/spark/spark-warehouse,
> parameters:{})
> 16/07/30 18:32:12 INFO audit: ugi=anonymous ip=unknown-ip-addr 
> cmd=create_database:
> Database(name:default, description:default database,
> locationUri:file:/Users/giaosudau/IdeaProjects/spark/spark-warehouse,
> parameters:{})
> 16/07/30 18:32:12 INFO HiveMetaStore: 1: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/07/30 18:32:12 INFO ObjectStore: ObjectStore, initialize called
> 16/07/30 18:32:12 INFO Query: Reading in results for query
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used
> is closing
> 16/07/30 18:32:12 INFO MetaStoreDirectSql: Using direct SQL, underlying DB
> is DERBY
> 16/07/30 18:32:12 INFO ObjectStore: Initialized ObjectStore
> 16/07/30 18:32:12 INFO HiveMetaStore: 1: get_table : db=default tbl=topic18
> 16/07/30 18:32:12 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table
> : db=default tbl=topic18
> 16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: int
> 16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: string
> 16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: int
> 16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: string
> 16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: int
> 16/07/30 18:32:23 INFO FileSourceStrategy: Pruning directories with:
> 16/07/30 18:32:23 INFO FileSourceStrategy: Post-Scan Filters:
> *16/07/30 18:32:23 INFO FileSourceStrategy: Pruned Data Schema:
> struct<topic_id: int, topic_name_en: string, parent_id: int, full_parent:
> string, level_id: int ... 3 more fields>*
> 16/07/30 18:32:23 INFO FileSourceStrategy: Pushed Filters:
> 16/07/30 18:32:24 INFO MemoryStore: Block broadcast_0 stored as values in
> memory (estimated size 142.6 KB, free 2004.5 MB)
> 16/07/30 18:32:24 INFO MemoryStore: Block broadcast_0_piece0 stored as
> bytes in memory (estimated size 15.2 KB, free 2004.4 MB)
> 16/07/30 18:32:24 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on 192.168.1.101:64196 (size: 15.2 KB, free: 2004.6 MB)
> 16/07/30 18:32:24 INFO SparkContext: Created broadcast 0 from run at
> AccessController.java:-2
> 16/07/30 18:32:24 INFO FileSourceStrategy: Planning scan with bin packing,
> max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
> 16/07/30 18:32:26 INFO CodeGenerator: Code generated in 292.668421 ms
> 16/07/30 18:32:26 INFO SparkContext: Starting job: run at
> AccessController.java:-2
> 16/07/30 18:32:26 INFO DAGScheduler: Got job 0 (run at
> AccessController.java:-2) with 2 output partitions
> 16/07/30 18:32:26 INFO DAGScheduler: Final stage: ResultStage 0 (run at
> AccessController.java:-2)
> 16/07/30 18:32:26 INFO DAGScheduler: Parents of final stage: List()
> 16/07/30 18:32:26 INFO DAGScheduler: Missing parents: List()
> 16/07/30 18:32:26 INFO DAGScheduler: Submitting ResultStage 0
> (MapPartitionsRDD[2] at run at AccessController.java:-2), which has no
> missing parents
> 16/07/30 18:32:26 INFO MemoryStore: Block broadcast_1 stored as values in
> memory (estimated size 9.5 KB, free 2004.4 MB)
> 16/07/30 18:32:26 INFO MemoryStore: Block broadcast_1_piece0 stored as
> bytes in memory (estimated size 4.5 KB, free 2004.4 MB)
> 16/07/30 18:32:26 INFO BlockManagerInfo: Added broadcast_1_piece0 in
> memory on 192.168.1.101:64196 (size: 4.5 KB, free: 2004.6 MB)
> 16/07/30 18:32:26 INFO SparkContext: Created broadcast 1 from broadcast at
> DAGScheduler.scala:996
> 16/07/30 18:32:26 INFO DAGScheduler: Submitting 2 missing tasks from
> ResultStage 0 (MapPartitionsRDD[2] at run at AccessController.java:-2)
> 16/07/30 18:32:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
> 16/07/30 18:32:26 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
> 0, localhost, partition 0, PROCESS_LOCAL, 6067 bytes)
> 16/07/30 18:32:26 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
> 1, localhost, partition 1, PROCESS_LOCAL, 6067 bytes)
> 16/07/30 18:32:26 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 16/07/30 18:32:26 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> *16/07/30 18:32:26 INFO FileScanRDD: Reading File path:
> file:///Users/giaosudau/topic.parquet/part-r-00001-98ce3a7b-0a80-4ee6-8f8b-a9d6c4d621d6.gz.parquet,
> range: 0-2231, partition values: [empty row]*
> *16/07/30 18:32:26 INFO FileScanRDD: Reading File path:
> file:///Users/giaosudau/topic.parquet/part-r-00000-98ce3a7b-0a80-4ee6-8f8b-a9d6c4d621d6.gz.parquet,
> range: 0-2256, partition values: [empty row]*
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> *Jul 30, 2016 6:32:27 PM WARNING: org.apache.parquet.CorruptStatistics:
> Ignoring statistics because created_by could not be parsed (see
> PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)*
> *org.apache.parquet.VersionParser$VersionParseException: Could not parse
> created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
> using format: (.+) version ((.*) )?\(build ?(.*)\)*
> * at org.apache.parquet.VersionParser.parse(VersionParser.java:112)*
> * at
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)*
> * at
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)*
> * at
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)*
> * at
> org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)*
> * at
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)*
> * at
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:101)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)*
> * at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:122)*
> * at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)*
> * at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
> Source)*
> * at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source)*
> * at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)*
> * at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)*
> * at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)*
> * at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)*
> * at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:790)*
> * at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:790)*
> * at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)*
> * at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)*
> * at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)*
> * at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)*
> * at org.apache.spark.scheduler.Task.run(Task.scala:85)*
> * at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)*
> * at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)*
> * at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)*
> * at java.lang.Thread.run(Thread.java:745)*
> *Jul 30, 2016 6:32:27 PM WARNING: org.apache.parquet.CorruptStatistics:
> Ignoring statistics because created_by could not be parsed (see
> PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)*
> *org.apache.parquet.VersionParser$VersionParseException: Could not parse
> created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
> using format: (.+) version ((.*) )?\(build ?(.*)\)*
> * at org.apache.parquet.VersionParser.parse(VersionParser.java:112)*
> * at
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)*
> * at
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)*
> * at
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)*
> * at
> org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)*
> * at
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)*
> * at
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:101)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)*
> * at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:122)*
> * at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)*
> * at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
> Source)*
> * at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source)*
> * at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)*
> * at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)*
> * at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)*
> * at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)*
> * at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:790)*
> * at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:790)*
> * at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)*
> * at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)*
> * at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)*
> * at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)*
> * at org.apache.spark.scheduler.Task.run(Task.scala:85)*
> * at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)*
> * at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)*
> * at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)*
> * at java.lang.Thread.run(Thread.java:745)*
> *Jul 30, 2016 6:32:27 PM WARNING: org.apache.parquet.CorruptStatistics:
> Ignoring statistics because created_by could not be parsed (see
> PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)*
> *org.apache.parquet.VersionParser$VersionParseException: Could not parse
> created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
> using format: (.+) version ((.*) )?\(build ?(.*)\)*
> * at org.apache.parquet.VersionParser.parse(VersionParser.java:112)*
> * at
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)*
> * at
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)*
> * at
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)*
> * at
> org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)*
> * at
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)*
> * at
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:101)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)*
> * at
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)*
> * at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterato...*
> 16/07/30 18:32:27 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1423
> bytes result sent to driver
> 16/07/30 18:32:27 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1).
> 1422 bytes result sent to driver
> 16/07/30 18:32:27 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID
> 0) in 482 ms on localhost (1/2)
> 16/07/30 18:32:27 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID
> 1) in 454 ms on localhost (2/2)
> 16/07/30 18:32:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
> have all completed, from pool
> 16/07/30 18:32:27 INFO DAGScheduler: ResultStage 0 (run at
> AccessController.java:-2) finished in 0.509 s
> 16/07/30 18:32:27 INFO DAGScheduler: Job 0 finished: run at
> AccessController.java:-2, took 0.625829 s
> 16/07/30 18:32:27 INFO CodeGenerator: Code generated in 18.418581 ms
> 16/07/30 18:32:28 INFO SparkExecuteStatementOperation: Result Schema:
> List(topic_id#0, topic_name_en#1, parent_id#2, full_parent#3, level_id#4)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Jul 30, 2016, at 6:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Actually, Hive SQL is a superset of Spark SQL, so data types may not be the
> issue.
>
> If, after creating the DataFrame, I explicitly create a Hive Parquet table
> through Spark, Hive sees it and you can see it with its data in the Spark
> thrift server (basically you are using a Hive Thrift server under the bonnet).
>
> If instead I let Spark write the Parquet files itself with
> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet"),
> then Hive does not seem to see the data when an external Hive table is
> created on top of it!
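>
> To make the two paths concrete, here is a rough Scala sketch of what I mean (the
> managed table name is just a placeholder, and saveAsTable is only one way of
> creating the Hive Parquet table explicitly):
>
> // Path 1: register a Parquet table in the Hive metastore through Spark itself;
> // Hive and the thrift server then see both the table and its data
> df.write.mode("overwrite").format("parquet").saveAsTable("ll_18740868")
>
> // Path 2: write plain Parquet files and create an external Hive table on top;
> // this is the case where Hive does not seem to see the data
> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")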
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
> On 30 July 2016 at 11:52, Chanh Le <giaosu...@gmail.com> wrote:
>
>> I agree with you. Maybe there has been some change to data types in Spark that
>> Hive does not yet support or is not compatible with, which is why it shows NULL.
>>
>>
>> On Jul 30, 2016, at 5:47 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> I think it is still a Hive problem because Spark thrift server is
>> basically a Hive thrift server.
>>
>> An ACID test would be to log in to the Hive CLI or Hive thrift server (you
>> are actually using a Hive thrift server on port 10000 when using the Spark
>> thrift server) and see whether you see the data.
>>
>> When you use Spark, it should work.
>>
>> I still believe it is a bug in Hive.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>>
>>
>> On 30 July 2016 at 11:43, Chanh Le <giaosu...@gmail.com> wrote:
>>
>>> Hi Mich,
>>> Thanks for the support. Here are some of my thoughts.
>>>
>>> BTW, can you log in to the thrift server and do select * from <TABLE> limit 10?
>>>
>>> Do you see the rows?
>>>
>>>
>>> Yes, I can see the rows, but all the field values are NULL.
>>>
>>> Works OK for me
>>>
>>>
>>> You only tested the number of rows. In my case I checked and it shows 117
>>> rows, but the problem is that the data is NULL in all fields.
>>>
>>>
>>> As I see it, the issue is that a Hive table created as external on a Parquet
>>> file somehow does not see the data. Rows are all nulls.
>>>
>>> I don't think this is specific to the thrift server. Just log in to Hive and
>>> see whether you can read the data from your table topic created as external.
>>>
>>> I noticed the same issue
>>>
>>>
>>> I don’t think it’s a Hive issue. Right now I am using Spark and Zeppelin.
>>>
>>>
>>> And the point is: why, with the same parquet file (I converted it from CSV to
>>> parquet), *can it be read in Spark but not in STS*?
>>>
>>> One more thing: with the same file and the same method of creating the table
>>> in STS, *it works fine in Spark 1.6.1*.
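>>>
>>> As a side-by-side illustration (just a sketch, using the file and table names
>>> already used in this thread), this is the comparison I mean:
>>>
>>> // reading the Parquet output directly in spark-shell returns the values correctly
>>> spark.read.parquet("/Users/giaosudau/Documents/Topics.parquet").show(5)
>>>
>>> // but the equivalent query against the external table served by STS
>>> // (e.g. over JDBC on port 10000) comes back with every field NULL:
>>> //   SELECT * FROM topic LIMIT 5;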
>>>
>>>
>>> Regards,
>>> Chanh
>>>
>>>
>>>
>>> On Jul 30, 2016, at 2:10 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> BTW, can you log in to the thrift server and do select * from <TABLE> limit 10?
>>>
>>> Do you see the rows?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> On 30 July 2016 at 07:20, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> Works OK for me
>>>
>>> scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
>>> "true").option("header", "false").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
>>> df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2:
>>> string, C3: string, C4: string, C5: string, C6: string, C7: string, C8:
>>> string]
>>> scala> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")
>>> scala> sqlContext.read.parquet("/user/hduser/ll_18740868.parquet").count
>>> res2: Long = 3651
>>> scala> val ff =
>>> sqlContext.read.parquet("/user/hduser/ll_18740868.parquet")
>>> ff: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2:
>>> string, C3: string, C4: string, C5: string, C6: string, C7: string, C8:
>>> string]
>>> scala> ff.take(5)
>>> res3: Array[org.apache.spark.sql.Row] = Array([Transaction
>>> Date,Transaction Type,Sort Code,Account
>>> Number,Transaction Description,Debit Amount,Credit Amount,Balance,],
>>> [31/12/2009,CPT,'30-64-72,18740868,LTSB STH KENSINGTO CD 5710
>>> 31DEC09 ,90.00,,400.00,null], [31/12/2009,CPT,'30-64-72,18740868,LTSB
>>> CHELSEA (3091 CD 5710 31DEC09
>>> ,10.00,,490.00,null], [31/12/2009,DEP,'30-64-72,18740868,CHELSEA
>>> ,,500.00,500.00,null], [Transaction Date,Transaction Type,Sort
>>> Code,Account Number,Transaction Description,Debit Amount,Credit
>>> Amount,Balance,])
>>>
>>> Now in Zeppelin create an external table and read it
>>>
>>> <image.png>
>>>
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> On 29 July 2016 at 09:04, Chanh Le <giaosu...@gmail.com> wrote:
>>> I continued debugging:
>>>
>>> 16/07/29 13:57:35 INFO FileScanRDD: Reading File path:
>>> file:///Users/giaosudau/Documents/Topics.parquet/part-r-00000-8997050f-e063-427e-b53c-f0a61739706f.gz.parquet,
>>> range: 0-3118, partition values: [empty row]
>>>
>>> vs. the OK one:
>>>
>>> 16/07/29 15:02:47 INFO FileScanRDD: Reading File path:
>>> file:///Users/giaosudau/data_example/FACT_ADMIN_HOURLY/time=2016-07-24-18/network_id=30206/part-r-00000-c5f5e18d-c8a1-4831-8903-3c60b02bdfe8.snappy.parquet,
>>> range: 0-6050, partition values: [2016-07-24-18,30206]
>>>
>>> I attached 2 files.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Jul 29, 2016, at 9:44 AM, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> For further investigation, I attached the file that I converted from CSV to parquet.
>>>
>>> Spark Code
>>>
>>> I loaded it from the CSV file:
>>>
>>> val df = spark.sqlContext.read
>>>   .format("com.databricks.spark.csv")
>>>   .option("delimiter", ",")
>>>   .option("header", "true")
>>>   .option("inferSchema", "true")
>>>   .load("/Users/giaosudau/Downloads/Topics.xls - Sheet 1.csv")
>>> I create a Parquet file:
>>>
>>> df.write.mode("overwrite").parquet("/Users/giaosudau/Documents/Topics.parquet")
>>>
>>> It’s OK in Spark-Shell
>>>
>>> scala> df.take(5)
>>> res22: Array[org.apache.spark.sql.Row] = Array([124,Nghệ thuật & Giải
>>> trí,Arts & Entertainment,0,124,1], [53,Scandal,Scandal,124,124,53,2],
>>> [54,Showbiz - World,Showbiz-World,124,124,54,2], [52,Âm
>>> nhạc,Entertainment-Music,124,124,52,2], [47,Bar - Karaoke -
>>> Massage,Bar-Karaoke-Massage-Prostitution,124,124,47,2])
>>>
>>> When I create a table in STS:
>>>
>>> 0: jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE topic (TOPIC_ID
>>> int, TOPIC_NAME_VN String, TOPIC_NAME_EN String, PARENT_ID int,
>>> FULL_PARENT String, LEVEL_ID int) STORED AS PARQUET LOCATION
>>> '/Users/giaosudau/Documents/Topics.parquet';
>>>
>>> But I get all results as NULL:
>>>
>>> <Screen Shot 2016-07-29 at 9.42.26 AM.png>
>>>
>>>
>>>
>>> I think it's really a bug, right?
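>>>
>>> One quick cross-check (just a sketch) is to print the schema Spark recorded in the
>>> Parquet files and line it up against the column names and types in the DDL above:
>>>
>>> // dump the schema stored in the Parquet footers as Spark sees it
>>> spark.read.parquet("/Users/giaosudau/Documents/Topics.parquet").printSchema()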
>>>
>>> Regards,
>>> Chanh
>>>
>>>
>>> <Topics.parquet>
>>>
>>>
>>> <Topics.xls - Sheet 1.csv>
>>>
>>>
>>>
>>>
>>>
>>> On Jul 28, 2016, at 4:25 PM, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> I have a problem when I create an external table in the Spark Thrift Server
>>> (STS) and query the data.
>>>
>>> Scenario:
>>> Spark 2.0
>>> Alluxio 1.2.0
>>> Zeppelin 0.7.0
>>> STS start script
>>> /home/spark/spark-2.0.0-bin-hadoop2.6/sbin/start-thriftserver.sh
>>> --master mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf
>>> spark.driver.memory=5G --conf spark.scheduler.mode=FAIR --class
>>> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --jars
>>> /home/spark/spark-2.0.0-bin-hadoop2.6/jars/alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar
>>> --total-executor-cores 35 spark-internal --hiveconf
>>> hive.server2.thrift.port=10000
>>> --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf
>>> hive.metastore.metadb.dir=/user/hive/metadb --conf
>>> spark.sql.shuffle.partitions=20
>>>
>>> I have a file stored in Alluxio at alluxio://master2:19998/etl_info/TOPIC
>>>
>>> Then I create a table in STS with:
>>> CREATE EXTERNAL TABLE topic (topic_id int, topic_name_vn String,
>>> topic_name_en String, parent_id int, full_parent String, level_id int)
>>> STORED AS PARQUET LOCATION 'alluxio://master2:19998/etl_info/TOPIC';
>>>
>>> To compare STS with Spark, I create a temp table named topics:
>>> spark.sqlContext.read.parquet("alluxio://master2:19998/etl_info/TOPIC").registerTempTable("topics")
>>>
>>> Then I query both and compare.
>>> <Screen Shot 2016-07-28 at 4.18.59 PM.png>
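>>>
>>> (The screenshot is not reproduced here; the comparison essentially amounts to
>>> running the same query on both sides -- a sketch, not the exact statements:)
>>>
>>> // against the temp table registered in the Spark/Zeppelin context:
>>> spark.sqlContext.sql("SELECT * FROM topics LIMIT 10").show()
>>> // against the external table served by STS (over JDBC on port 10000):
>>> //   SELECT * FROM topic LIMIT 10;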
>>>
>>>
>>> As you can see, the results are different.
>>> Is that a bug, or did I do something wrong?
>>>
>>> Regards,
>>> Chanh
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
