I got the log below during some recent debugging. Is it related to PARQUET-251? But I checked: the current Spark build uses parquet 1.8.1, which means that issue should already be fixed.
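For reference, here is a minimal sketch (plain Scala, no Spark needed) of why the CorruptStatistics warning further down fires. The regex and the created_by string are copied verbatim from the exception message below; only the variable names are mine:

```scala
// The format string from VersionParser, copied from the exception below.
val versionFormat = """(.+) version ((.*) )?\(build ?(.*)\)""".r

// The created_by written into our Parquet footers (from the warning below):
// note there is no " version x.y " token between the app name and "(build ...)".
val createdBy = "parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)"

createdBy match {
  case versionFormat(app, _, version, build) =>
    println(s"parsed: app=$app version=$version build=$build")
  case _ =>
    // This branch is taken: parsing fails, VersionParseException is thrown,
    // and parquet-mr falls back to ignoring the column statistics.
    println("no match -> VersionParseException")
}
```

So the warning only means statistics are skipped for files whose writer version cannot be identified; whether it is connected to the NULL values is the open question.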
16/07/30 18:32:11 INFO SparkExecuteStatementOperation: Running query 'select * from topic18' with 72649e37-3ef4-4acd-8d01-4a28e79a1f9a
16/07/30 18:32:11 INFO SparkSqlParser: Parsing command: select * from topic18
16/07/30 18:32:11 INFO SessionState: Created local directory: /var/folders/3c/_6cznybx2571l0b7f5dstkfr0000gn/T/e8d2eb4d-1682-40fd-ad66-f0643692ded7_resources
16/07/30 18:32:11 INFO SessionState: Created HDFS directory: /tmp/hive/anonymous/e8d2eb4d-1682-40fd-ad66-f0643692ded7
16/07/30 18:32:11 INFO SessionState: Created local directory: /var/folders/3c/_6cznybx2571l0b7f5dstkfr0000gn/T/giaosudau/e8d2eb4d-1682-40fd-ad66-f0643692ded7
16/07/30 18:32:11 INFO SessionState: Created HDFS directory: /tmp/hive/anonymous/e8d2eb4d-1682-40fd-ad66-f0643692ded7/_tmp_space.db
16/07/30 18:32:11 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/Users/giaosudau/IdeaProjects/spark/spark-warehouse
16/07/30 18:32:12 INFO HiveMetaStore: 1: create_database: Database(name:default, description:default database, locationUri:file:/Users/giaosudau/IdeaProjects/spark/spark-warehouse, parameters:{})
16/07/30 18:32:12 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=create_database: Database(name:default, description:default database, locationUri:file:/Users/giaosudau/IdeaProjects/spark/spark-warehouse, parameters:{})
16/07/30 18:32:12 INFO HiveMetaStore: 1: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/07/30 18:32:12 INFO ObjectStore: ObjectStore, initialize called
16/07/30 18:32:12 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
16/07/30 18:32:12 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/07/30 18:32:12 INFO ObjectStore: Initialized ObjectStore
16/07/30 18:32:12 INFO HiveMetaStore: 1: get_table : db=default tbl=topic18
16/07/30 18:32:12 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=topic18
16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: int
16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: string
16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: int
16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: string
16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: int
16/07/30 18:32:23 INFO FileSourceStrategy: Pruning directories with:
16/07/30 18:32:23 INFO FileSourceStrategy: Post-Scan Filters:
16/07/30 18:32:23 INFO FileSourceStrategy: Pruned Data Schema: struct<topic_id: int, topic_name_en: string, parent_id: int, full_parent: string, level_id: int ... 3 more fields>
16/07/30 18:32:23 INFO FileSourceStrategy: Pushed Filters:
16/07/30 18:32:24 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 142.6 KB, free 2004.5 MB)
16/07/30 18:32:24 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 15.2 KB, free 2004.4 MB)
16/07/30 18:32:24 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.101:64196 (size: 15.2 KB, free: 2004.6 MB)
16/07/30 18:32:24 INFO SparkContext: Created broadcast 0 from run at AccessController.java:-2
16/07/30 18:32:24 INFO FileSourceStrategy: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
16/07/30 18:32:26 INFO CodeGenerator: Code generated in 292.668421 ms
16/07/30 18:32:26 INFO SparkContext: Starting job: run at AccessController.java:-2
16/07/30 18:32:26 INFO DAGScheduler: Got job 0 (run at AccessController.java:-2) with 2 output partitions
16/07/30 18:32:26 INFO DAGScheduler: Final stage: ResultStage 0 (run at AccessController.java:-2)
16/07/30 18:32:26 INFO DAGScheduler: Parents of final stage: List()
16/07/30 18:32:26 INFO DAGScheduler: Missing parents: List()
16/07/30 18:32:26 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at run at AccessController.java:-2), which has no missing parents
16/07/30 18:32:26 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 9.5 KB, free 2004.4 MB)
16/07/30 18:32:26 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.5 KB, free 2004.4 MB)
16/07/30 18:32:26 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.101:64196 (size: 4.5 KB, free: 2004.6 MB)
16/07/30 18:32:26 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
16/07/30 18:32:26 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at run at AccessController.java:-2)
16/07/30 18:32:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/07/30 18:32:26 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, PROCESS_LOCAL, 6067 bytes)
16/07/30 18:32:26 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1, PROCESS_LOCAL, 6067 bytes)
16/07/30 18:32:26 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/07/30 18:32:26 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/07/30 18:32:26 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/topic.parquet/part-r-00001-98ce3a7b-0a80-4ee6-8f8b-a9d6c4d621d6.gz.parquet, range: 0-2231, partition values: [empty row]
16/07/30 18:32:26 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/topic.parquet/part-r-00000-98ce3a7b-0a80-4ee6-8f8b-a9d6c4d621d6.gz.parquet, range: 0-2256, partition values: [empty row]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Jul 30, 2016 6:32:27 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) using format: (.+) version ((.*) )?\(build ?(.*)\)
        at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
        at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
        at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:101)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:122)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:790)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:790)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
(The identical warning and stack trace were printed twice more; the repeats are omitted here.)
16/07/30 18:32:27 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1423 bytes result sent to driver
16/07/30 18:32:27 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1422 bytes result sent to driver
16/07/30 18:32:27 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 482 ms on localhost (1/2)
16/07/30 18:32:27 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 454 ms on localhost (2/2)
16/07/30 18:32:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/30 18:32:27 INFO DAGScheduler: ResultStage 0 (run at AccessController.java:-2) finished in 0.509 s
16/07/30 18:32:27 INFO DAGScheduler: Job 0 finished: run at AccessController.java:-2, took 0.625829 s
16/07/30 18:32:27 INFO CodeGenerator: Code generated in 18.418581 ms
16/07/30 18:32:28 INFO SparkExecuteStatementOperation: Result Schema: List(topic_id#0, topic_name_en#1, parent_id#2, full_parent#3, level_id#4)

> On Jul 30, 2016, at 6:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Actually Hive SQL is a superset of Spark SQL. Data types may not be the issue.
>
> If, after creating the DataFrame, I create the table explicitly as a Hive Parquet table through Spark, Hive sees it and you can see it in the Spark Thrift Server with data in it (basically you are using the Hive Thrift Server under the bonnet).
>
> If I let Spark create the table with
> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")
>
> then Hive does not seem to see the data when an external Hive table is created on it!
>
> HTH
>
> Dr Mich Talebzadeh
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 30 July 2016 at 11:52, Chanh Le <giaosu...@gmail.com> wrote:
> I agree with you.
> Maybe there was some change in data types in Spark that Hive does not support yet, which is why it shows NULL.
>
>> On Jul 30, 2016, at 5:47 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> I think it is still a Hive problem, because the Spark Thrift Server is basically a Hive Thrift Server.
>>
>> An ACID test would be to log in to the Hive CLI or the Hive Thrift Server (you are actually using the Hive Thrift Server on port 10000 when using the Spark Thrift Server) and see whether you see the data.
>>
>> When you use Spark it should work.
>>
>> I still believe it is a bug in Hive.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> On 30 July 2016 at 11:43, Chanh Le <giaosu...@gmail.com> wrote:
>> Hi Mich,
>> Thanks for the support. Here are some of my thoughts.
>>
>>> BTW can you log in to the thrift server and do select * from <TABLE> limit 10?
>>>
>>> Do you see the rows?
>>
>> Yes, I can see the rows, but all the field values are NULL.
>>
>>> Works OK for me
>>
>> You only tested the number of rows. In my case I checked too, and it shows 117 rows, but the problem is that the data is NULL in all fields.
>>
>>> As I see it, the issue is that the Hive table created as external on the Parquet data somehow does not see the data. Rows are all nulls.
>>>
>>> I don't think this is specific to the thrift server. Just log in to Hive and see whether you can read the data from your table topic created as external.
>>>
>>> I noticed the same issue.
>>
>> I don't think it's a Hive issue. Right now I am using Spark and Zeppelin.
>>
>> And the point is: with the same Parquet file (which I converted from CSV to Parquet) the data can be read in Spark but not in STS.
>>
>> One more thing: with the same file and the same method of creating the table in STS, it works fine in Spark 1.6.1.
>>
>> Regards,
>> Chanh
>>
>>> On Jul 30, 2016, at 2:10 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> BTW can you log in to the thrift server and do select * from <TABLE> limit 10?
>>>
>>> Do you see the rows?
>>>
>>> Dr Mich Talebzadeh
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 30 July 2016 at 07:20, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> Works OK for me:
>>>
>>> scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "false").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
>>> df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
>>> scala> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")
>>> scala> sqlContext.read.parquet("/user/hduser/ll_18740868.parquet").count
>>> res2: Long = 3651
>>> scala> val ff = sqlContext.read.parquet("/user/hduser/ll_18740868.parquet")
>>> ff: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
>>> scala> ff.take(5)
>>> res3: Array[org.apache.spark.sql.Row] = Array([Transaction Date,Transaction Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit Amount,Balance,], [31/12/2009,CPT,'30-64-72,18740868,LTSB STH KENSINGTO CD 5710 31DEC09 ,90.00,,400.00,null], [31/12/2009,CPT,'30-64-72,18740868,LTSB CHELSEA (3091 CD 5710 31DEC09 ,10.00,,490.00,null], [31/12/2009,DEP,'30-64-72,18740868,CHELSEA ,,500.00,500.00,null], [Transaction Date,Transaction Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit Amount,Balance,])
>>>
>>> Now in Zeppelin create an external table and read it:
>>>
>>> <image.png>
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 29 July 2016 at 09:04, Chanh Le <giaosu...@gmail.com> wrote:
>>> I continued to debug. The broken case:
>>>
>>> 16/07/29 13:57:35 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/Documents/Topics.parquet/part-r-00000-8997050f-e063-427e-b53c-f0a61739706f.gz.parquet, range: 0-3118, partition values: [empty row]
>>>
>>> versus the OK one:
>>>
>>> 16/07/29 15:02:47 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/data_example/FACT_ADMIN_HOURLY/time=2016-07-24-18/network_id=30206/part-r-00000-c5f5e18d-c8a1-4831-8903-3c60b02bdfe8.snappy.parquet, range: 0-6050, partition values: [2016-07-24-18,30206]
>>>
>>> I attached the 2 files.
>>>
>>>> On Jul 29, 2016, at 9:44 AM, Chanh Le <giaosu...@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> For further investigation I attached the file that I converted from CSV to Parquet.
>>>>
>>>> Spark code:
>>>>
>>>> I loaded from the CSV file:
>>>> val df = spark.sqlContext.read.format("com.databricks.spark.csv").option("delimiter", ",").option("header", "true").option("inferSchema", "true").load("/Users/giaosudau/Downloads/Topics.xls - Sheet 1.csv")
>>>>
>>>> I created a Parquet file:
>>>> df.write.mode("overwrite").parquet("/Users/giaosudau/Documents/Topics.parquet")
>>>>
>>>> It's OK in spark-shell:
>>>>
>>>> scala> df.take(5)
>>>> res22: Array[org.apache.spark.sql.Row] = Array([124,Nghệ thuật & Giải trí,Arts & Entertainment,0,124,1], [53,Scandal,Scandal,124,124,53,2], [54,Showbiz - World,Showbiz-World,124,124,54,2], [52,Âm nhạc,Entertainment-Music,124,124,52,2], [47,Bar - Karaoke - Massage,Bar-Karaoke-Massage-Prostitution,124,124,47,2])
>>>>
>>>> But when I create a table in STS:
>>>>
>>>> 0: jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE topic (TOPIC_ID int, TOPIC_NAME_VN String, TOPIC_NAME_EN String, PARENT_ID int, FULL_PARENT String, LEVEL_ID int) STORED AS PARQUET LOCATION '/Users/giaosudau/Documents/Topics.parquet';
>>>>
>>>> I get all results NULL:
>>>>
>>>> <Screen Shot 2016-07-29 at 9.42.26 AM.png>
>>>>
>>>> I think it's really a BUG, right?
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>> <Topics.parquet>
>>>> <Topics.xls - Sheet 1.csv>
>>>>
>>>>> On Jul 28, 2016, at 4:25 PM, Chanh Le <giaosu...@gmail.com> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I have a problem when I create an external table in the Spark Thrift Server (STS) and query the data.
>>>>>
>>>>> Scenario:
>>>>> Spark 2.0
>>>>> Alluxio 1.2.0
>>>>> Zeppelin 0.7.0
>>>>>
>>>>> STS start script:
>>>>> /home/spark/spark-2.0.0-bin-hadoop2.6/sbin/start-thriftserver.sh --master mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf spark.driver.memory=5G --conf spark.scheduler.mode=FAIR --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --jars /home/spark/spark-2.0.0-bin-hadoop2.6/jars/alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar --total-executor-cores 35 spark-internal --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.metadb.dir=/user/hive/metadb --conf spark.sql.shuffle.partitions=20
>>>>>
>>>>> I have a file stored in Alluxio at alluxio://master2:19998/etl_info/TOPIC
>>>>>
>>>>> Then I create a table in STS with:
>>>>> CREATE EXTERNAL TABLE topic (topic_id int, topic_name_vn String, topic_name_en String, parent_id int, full_parent String, level_id int) STORED AS PARQUET LOCATION 'alluxio://master2:19998/etl_info/TOPIC';
>>>>>
>>>>> To compare STS with Spark, I create a temp table named topics:
>>>>> spark.sqlContext.read.parquet("alluxio://master2:19998/etl_info/TOPIC").registerTempTable("topics")
>>>>>
>>>>> Then I query and compare:
>>>>> <Screen Shot 2016-07-28 at 4.18.59 PM.png>
>>>>>
>>>>> As you can see, the results are different.
>>>>> Is that a bug? Or did I do something wrong?
>>>>>
>>>>> Regards,
>>>>> Chanh
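For what it's worth, here is a minimal sketch (Spark 2.0 Scala, untested against this exact setup) of the workaround Mich describes above: instead of writing bare Parquet files and layering an external table on top, let Spark register the table through the Hive metastore with saveAsTable, so STS/Hive and Spark share the same schema metadata. The CSV path and the table name topic18 are taken from the examples in this thread; this is the approach as I understand it, not a verified fix for the NULLs.

import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the workaround discussed above, not a confirmed fix.
val spark = SparkSession.builder()
  .appName("topics-to-hive")
  .enableHiveSupport()   // needed so saveAsTable goes through the Hive metastore
  .getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/Users/giaosudau/Downloads/Topics.xls - Sheet 1.csv")

// Writes the Parquet files *and* records the table (schema included) in the
// metastore, so both Hive and the Spark Thrift Server see the same metadata.
df.write.mode("overwrite").format("parquet").saveAsTable("topic18")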