I got the log below during some recent debugging. Is it related to PARQUET-251? But I checked: the current Spark build uses parquet 1.8.1, which means that issue should already be fixed.
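For reference, here is a minimal sketch (plain Scala, no Spark needed) of why the CorruptStatistics warning further down fires. The regex and the created_by string are copied verbatim from the exception message below; only the variable names are mine:

```scala
// The format string from VersionParser, copied from the exception below.
val versionFormat = """(.+) version ((.*) )?\(build ?(.*)\)""".r

// The created_by written into our Parquet footers (from the warning below):
// note there is no " version x.y " token between the app name and "(build ...)".
val createdBy = "parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)"

createdBy match {
  case versionFormat(app, _, version, build) =>
    println(s"parsed: app=$app version=$version build=$build")
  case _ =>
    // This branch is taken: parsing fails, VersionParseException is thrown,
    // and parquet-mr falls back to ignoring the column statistics.
    println("no match -> VersionParseException")
}
```

So the warning only means statistics are skipped for files whose writer version cannot be identified; whether it is connected to the NULL values is the open question.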
16/07/30 18:32:11 INFO SparkExecuteStatementOperation: Running query 'select * from topic18' with 72649e37-3ef4-4acd-8d01-4a28e79a1f9a
16/07/30 18:32:11 INFO SparkSqlParser: Parsing command: select * from topic18
16/07/30 18:32:11 INFO SessionState: Created local directory: /var/folders/3c/_6cznybx2571l0b7f5dstkfr0000gn/T/e8d2eb4d-1682-40fd-ad66-f0643692ded7_resources
16/07/30 18:32:11 INFO SessionState: Created HDFS directory: /tmp/hive/anonymous/e8d2eb4d-1682-40fd-ad66-f0643692ded7
16/07/30 18:32:11 INFO SessionState: Created local directory: /var/folders/3c/_6cznybx2571l0b7f5dstkfr0000gn/T/giaosudau/e8d2eb4d-1682-40fd-ad66-f0643692ded7
16/07/30 18:32:11 INFO SessionState: Created HDFS directory: /tmp/hive/anonymous/e8d2eb4d-1682-40fd-ad66-f0643692ded7/_tmp_space.db
16/07/30 18:32:11 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/Users/giaosudau/IdeaProjects/spark/spark-warehouse
16/07/30 18:32:12 INFO HiveMetaStore: 1: create_database: Database(name:default, description:default database, locationUri:file:/Users/giaosudau/IdeaProjects/spark/spark-warehouse, parameters:{})
16/07/30 18:32:12 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=create_database: Database(name:default, description:default database, locationUri:file:/Users/giaosudau/IdeaProjects/spark/spark-warehouse, parameters:{})
16/07/30 18:32:12 INFO HiveMetaStore: 1: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/07/30 18:32:12 INFO ObjectStore: ObjectStore, initialize called
16/07/30 18:32:12 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
16/07/30 18:32:12 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/07/30 18:32:12 INFO ObjectStore: Initialized ObjectStore
16/07/30 18:32:12 INFO HiveMetaStore: 1: get_table : db=default tbl=topic18
16/07/30 18:32:12 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=topic18
16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: int
16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: string
16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: int
16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: string
16/07/30 18:32:12 INFO CatalystSqlParser: Parsing command: int
16/07/30 18:32:23 INFO FileSourceStrategy: Pruning directories with:
16/07/30 18:32:23 INFO FileSourceStrategy: Post-Scan Filters:
16/07/30 18:32:23 INFO FileSourceStrategy: Pruned Data Schema: struct<topic_id: int, topic_name_en: string, parent_id: int, full_parent: string, level_id: int ... 3 more fields>
16/07/30 18:32:23 INFO FileSourceStrategy: Pushed Filters:
16/07/30 18:32:24 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 142.6 KB, free 2004.5 MB)
16/07/30 18:32:24 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 15.2 KB, free 2004.4 MB)
16/07/30 18:32:24 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.101:64196 (size: 15.2 KB, free: 2004.6 MB)
16/07/30 18:32:24 INFO SparkContext: Created broadcast 0 from run at AccessController.java:-2
16/07/30 18:32:24 INFO FileSourceStrategy: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
16/07/30 18:32:26 INFO CodeGenerator: Code generated in 292.668421 ms
16/07/30 18:32:26 INFO SparkContext: Starting job: run at AccessController.java:-2
16/07/30 18:32:26 INFO DAGScheduler: Got job 0 (run at AccessController.java:-2) with 2 output partitions
16/07/30 18:32:26 INFO DAGScheduler: Final stage: ResultStage 0 (run at AccessController.java:-2)
16/07/30 18:32:26 INFO DAGScheduler: Parents of final stage: List()
16/07/30 18:32:26 INFO DAGScheduler: Missing parents: List()
16/07/30 18:32:26 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at run at AccessController.java:-2), which has no missing parents
16/07/30 18:32:26 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 9.5 KB, free 2004.4 MB)
16/07/30 18:32:26 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.5 KB, free 2004.4 MB)
16/07/30 18:32:26 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.101:64196 (size: 4.5 KB, free: 2004.6 MB)
16/07/30 18:32:26 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
16/07/30 18:32:26 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at run at AccessController.java:-2)
16/07/30 18:32:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/07/30 18:32:26 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, PROCESS_LOCAL, 6067 bytes)
16/07/30 18:32:26 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1, PROCESS_LOCAL, 6067 bytes)
16/07/30 18:32:26 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/07/30 18:32:26 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/07/30 18:32:26 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/topic.parquet/part-r-00001-98ce3a7b-0a80-4ee6-8f8b-a9d6c4d621d6.gz.parquet, range: 0-2231, partition values: [empty row]
16/07/30 18:32:26 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/topic.parquet/part-r-00000-98ce3a7b-0a80-4ee6-8f8b-a9d6c4d621d6.gz.parquet, range: 0-2256, partition values: [empty row]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Jul 30, 2016 6:32:27 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) using format: (.+) version ((.*) )?\(build ?(.*)\)
        at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
        at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
        at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:101)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:122)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:790)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:790)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
(The identical warning and stack trace were printed twice more; the repeats are omitted here.)
16/07/30 18:32:27 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1423 bytes result sent to driver
16/07/30 18:32:27 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1422 bytes result sent to driver
16/07/30 18:32:27 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 482 ms on localhost (1/2)
16/07/30 18:32:27 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 454 ms on localhost (2/2)
16/07/30 18:32:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/30 18:32:27 INFO DAGScheduler: ResultStage 0 (run at AccessController.java:-2) finished in 0.509 s
16/07/30 18:32:27 INFO DAGScheduler: Job 0 finished: run at AccessController.java:-2, took 0.625829 s
16/07/30 18:32:27 INFO CodeGenerator: Code generated in 18.418581 ms
16/07/30 18:32:28 INFO SparkExecuteStatementOperation: Result Schema: List(topic_id#0, topic_name_en#1, parent_id#2, full_parent#3, level_id#4)

> On Jul 30, 2016, at 6:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Actually Hive SQL is a superset of Spark SQL. Data types may not be the issue.
>
> If, after creating the DataFrame, I create the table explicitly as a Hive Parquet table through Spark, Hive sees it and you can see it in the Spark Thrift Server with data in it (basically you are using the Hive Thrift Server under the bonnet).
>
> If I let Spark create the table with
> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")
>
> then Hive does not seem to see the data when an external Hive table is created on it!
>
> HTH
>
> Dr Mich Talebzadeh
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 30 July 2016 at 11:52, Chanh Le <giaosu...@gmail.com> wrote:
> I agree with you.
> Maybe there was some change in data types in Spark that Hive does not support yet, which is why it shows NULL.
>
>> On Jul 30, 2016, at 5:47 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> I think it is still a Hive problem, because the Spark Thrift Server is basically a Hive Thrift Server.
>>
>> An ACID test would be to log in to the Hive CLI or the Hive Thrift Server (you are actually using the Hive Thrift Server on port 10000 when using the Spark Thrift Server) and see whether you see the data.
>>
>> When you use Spark it should work.
>>
>> I still believe it is a bug in Hive.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> On 30 July 2016 at 11:43, Chanh Le <giaosu...@gmail.com> wrote:
>> Hi Mich,
>> Thanks for the support. Here are some of my thoughts.
>>
>>> BTW can you log in to the thrift server and do select * from <TABLE> limit 10?
>>>
>>> Do you see the rows?
>>
>> Yes, I can see the rows, but all the field values are NULL.
>>
>>> Works OK for me
>>
>> You only tested the number of rows. In my case I checked too, and it shows 117 rows, but the problem is that the data is NULL in all fields.
>>
>>> As I see it, the issue is that the Hive table created as external on the Parquet data somehow does not see the data. Rows are all nulls.
>>>
>>> I don't think this is specific to the thrift server. Just log in to Hive and see whether you can read the data from your table topic created as external.
>>>
>>> I noticed the same issue.
>>
>> I don't think it's a Hive issue. Right now I am using Spark and Zeppelin.
>>
>> And the point is: with the same Parquet file (which I converted from CSV to Parquet) the data can be read in Spark but not in STS.
>>
>> One more thing: with the same file and the same method of creating the table in STS, it works fine in Spark 1.6.1.
>>
>> Regards,
>> Chanh
>>
>>> On Jul 30, 2016, at 2:10 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> BTW can you log in to the thrift server and do select * from <TABLE> limit 10?
>>>
>>> Do you see the rows?
>>>
>>> Dr Mich Talebzadeh
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 30 July 2016 at 07:20, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> Works OK for me:
>>>
>>> scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "false").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
>>> df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
>>> scala> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")
>>> scala> sqlContext.read.parquet("/user/hduser/ll_18740868.parquet").count
>>> res2: Long = 3651
>>> scala> val ff = sqlContext.read.parquet("/user/hduser/ll_18740868.parquet")
>>> ff: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
>>> scala> ff.take(5)
>>> res3: Array[org.apache.spark.sql.Row] = Array([Transaction Date,Transaction Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit Amount,Balance,], [31/12/2009,CPT,'30-64-72,18740868,LTSB STH KENSINGTO CD 5710 31DEC09 ,90.00,,400.00,null], [31/12/2009,CPT,'30-64-72,18740868,LTSB CHELSEA (3091 CD 5710 31DEC09 ,10.00,,490.00,null], [31/12/2009,DEP,'30-64-72,18740868,CHELSEA ,,500.00,500.00,null], [Transaction Date,Transaction Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit Amount,Balance,])
>>>
>>> Now in Zeppelin create an external table and read it:
>>>
>>> <image.png>
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 29 July 2016 at 09:04, Chanh Le <giaosu...@gmail.com> wrote:
>>> I continued to debug. The broken case:
>>>
>>> 16/07/29 13:57:35 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/Documents/Topics.parquet/part-r-00000-8997050f-e063-427e-b53c-f0a61739706f.gz.parquet, range: 0-3118, partition values: [empty row]
>>>
>>> versus the OK one:
>>>
>>> 16/07/29 15:02:47 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/data_example/FACT_ADMIN_HOURLY/time=2016-07-24-18/network_id=30206/part-r-00000-c5f5e18d-c8a1-4831-8903-3c60b02bdfe8.snappy.parquet, range: 0-6050, partition values: [2016-07-24-18,30206]
>>>
>>> I attached the 2 files.
>>>
>>>> On Jul 29, 2016, at 9:44 AM, Chanh Le <giaosu...@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> For further investigation I attached the file that I converted from CSV to Parquet.
>>>>
>>>> Spark code:
>>>>
>>>> I loaded from the CSV file:
>>>> val df = spark.sqlContext.read.format("com.databricks.spark.csv").option("delimiter", ",").option("header", "true").option("inferSchema", "true").load("/Users/giaosudau/Downloads/Topics.xls - Sheet 1.csv")
>>>>
>>>> I created a Parquet file:
>>>> df.write.mode("overwrite").parquet("/Users/giaosudau/Documents/Topics.parquet")
>>>>
>>>> It's OK in spark-shell:
>>>>
>>>> scala> df.take(5)
>>>> res22: Array[org.apache.spark.sql.Row] = Array([124,Nghệ thuật & Giải trí,Arts & Entertainment,0,124,1], [53,Scandal,Scandal,124,124,53,2], [54,Showbiz - World,Showbiz-World,124,124,54,2], [52,Âm nhạc,Entertainment-Music,124,124,52,2], [47,Bar - Karaoke - Massage,Bar-Karaoke-Massage-Prostitution,124,124,47,2])
>>>>
>>>> But when I create a table in STS:
>>>>
>>>> 0: jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE topic (TOPIC_ID int, TOPIC_NAME_VN String, TOPIC_NAME_EN String, PARENT_ID int, FULL_PARENT String, LEVEL_ID int) STORED AS PARQUET LOCATION '/Users/giaosudau/Documents/Topics.parquet';
>>>>
>>>> I get all results NULL:
>>>>
>>>> <Screen Shot 2016-07-29 at 9.42.26 AM.png>
>>>>
>>>> I think it's really a BUG, right?
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>> <Topics.parquet>
>>>> <Topics.xls - Sheet 1.csv>
>>>>
>>>>> On Jul 28, 2016, at 4:25 PM, Chanh Le <giaosu...@gmail.com> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I have a problem when I create an external table in the Spark Thrift Server (STS) and query the data.
>>>>>
>>>>> Scenario:
>>>>> Spark 2.0
>>>>> Alluxio 1.2.0
>>>>> Zeppelin 0.7.0
>>>>>
>>>>> STS start script:
>>>>> /home/spark/spark-2.0.0-bin-hadoop2.6/sbin/start-thriftserver.sh --master mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf spark.driver.memory=5G --conf spark.scheduler.mode=FAIR --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --jars /home/spark/spark-2.0.0-bin-hadoop2.6/jars/alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar --total-executor-cores 35 spark-internal --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.metadb.dir=/user/hive/metadb --conf spark.sql.shuffle.partitions=20
>>>>>
>>>>> I have a file stored in Alluxio at alluxio://master2:19998/etl_info/TOPIC
>>>>>
>>>>> Then I create a table in STS with:
>>>>> CREATE EXTERNAL TABLE topic (topic_id int, topic_name_vn String, topic_name_en String, parent_id int, full_parent String, level_id int) STORED AS PARQUET LOCATION 'alluxio://master2:19998/etl_info/TOPIC';
>>>>>
>>>>> To compare STS with Spark, I create a temp table named topics:
>>>>> spark.sqlContext.read.parquet("alluxio://master2:19998/etl_info/TOPIC").registerTempTable("topics")
>>>>>
>>>>> Then I query and compare:
>>>>> <Screen Shot 2016-07-28 at 4.18.59 PM.png>
>>>>>
>>>>> As you can see, the results are different.
>>>>> Is that a bug? Or did I do something wrong?
>>>>>
>>>>> Regards,
>>>>> Chanh
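For what it's worth, here is a minimal sketch (Spark 2.0 Scala, untested against this exact setup) of the workaround Mich describes above: instead of writing bare Parquet files and layering an external table on top, let Spark register the table through the Hive metastore with saveAsTable, so STS/Hive and Spark share the same schema metadata. The CSV path and the table name topic18 are taken from the examples in this thread; this is the approach as I understand it, not a verified fix for the NULLs.

import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the workaround discussed above, not a confirmed fix.
val spark = SparkSession.builder()
  .appName("topics-to-hive")
  .enableHiveSupport()   // needed so saveAsTable goes through the Hive metastore
  .getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/Users/giaosudau/Downloads/Topics.xls - Sheet 1.csv")

// Writes the Parquet files *and* records the table (schema included) in the
// metastore, so both Hive and the Spark Thrift Server see the same metadata.
df.write.mode("overwrite").format("parquet").saveAsTable("topic18")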