I agree with you. Maybe there was some change to a data type in Spark that Hive does not 
yet support or is not compatible with, and that is why it shows NULL.
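
One way to check that (just a sketch; the Alluxio path and the table name simply follow the 
examples later in this thread) is to compare the schema Spark wrote into the Parquet footer 
with what the external table declares:

scala> spark.read.parquet("alluxio://master2:19998/etl_info/TOPIC").printSchema()  // schema as stored in the Parquet file

0: jdbc:hive2://localhost:10000> DESCRIBE topic;  -- schema as the STS/Hive table sees it

If a column type differs between the two, that would be one way to end up with all-NULL fields.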


> On Jul 30, 2016, at 5:47 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> I think it is still a Hive problem because Spark thrift server is basically a 
> Hive thrift server.
> 
> An ACID test would be to log in to the Hive CLI or the Hive thrift server (you are 
> actually using the Hive thrift server on port 10000 when using the Spark thrift 
> server) and see whether you see the data.
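> 
> For example, something like this (the host name is just a placeholder):
> 
> $ beeline -u jdbc:hive2://localhost:10000
> 0: jdbc:hive2://localhost:10000> select * from topic limit 10;
> 
> If the rows come back NULL there as well, that points at the Hive read path rather 
> than at Spark itself.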
> 
> When you use Spark it should work.
> 
> I still believe it is a bug in Hive.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 30 July 2016 at 11:43, Chanh Le <giaosu...@gmail.com> wrote:
> Hi Mich,
> Thanks for supporting. Here some of my thoughts.
> 
>> BTW can you log in to thrift server and do select * from <TABLE> limit 10
>> 
>> Do you see the rows?
> 
> Yes, I can see the rows, but all the field values are NULL.
> 
>> Works OK for me
> 
> You just tested the number of rows. In my case I checked and it shows 117 rows, but 
> the problem is that the data is NULL in all fields.
> 
> 
>> As I see it, the issue is that a Hive table created as external on Parquet data 
>> somehow does not see the data. Rows are all nulls.
>> 
>> I don't think this is specific to the thrift server. Just log in to Hive and see 
>> whether you can read the data from your table topic created as external.
>> 
>> I noticed the same issue
> 
> I don’t think it’s a Hive issue. Right now I am using Spark and Zeppelin.
> 
> 
> And the point is: why can the same parquet file (which I converted from CSV to 
> parquet) be read in Spark but not in STS?
> 
> One more thing: with the same file and the same method of creating the table in STS, 
> it works fine in Spark 1.6.1.
> 
> 
> Regards,
> Chanh
> 
> 
> 
>> On Jul 30, 2016, at 2:10 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> 
>> BTW can you log in to thrift server and do select * from <TABLE> limit 10
>> 
>> Do you see the rows?
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>> On 30 July 2016 at 07:20, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> Works OK for me
>> 
>> scala> val df = 
>> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", 
>> "true").option("header", 
>> "false").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
>> df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, 
>> C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
>> scala> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")
>> scala> sqlContext.read.parquet("/user/hduser/ll_18740868.parquet").count
>> res2: Long = 3651
>> scala> val ff = sqlContext.read.parquet("/user/hduser/ll_18740868.parquet")
>> ff: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, 
>> C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
>> scala> ff.take(5)
>> res3: Array[org.apache.spark.sql.Row] = Array([Transaction Date,Transaction 
>> Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit 
>> Amount,Balance,], [31/12/2009,CPT,'30-64-72,18740868,LTSB STH KENSINGTO CD 
>> 5710 31DEC09 ,90.00,,400.00,null], [31/12/2009,CPT,'30-64-72,18740868,LTSB 
>> CHELSEA (3091 CD 5710 31DEC09 ,10.00,,490.00,null], 
>> [31/12/2009,DEP,'30-64-72,18740868,CHELSEA ,,500.00,500.00,null], 
>> [Transaction Date,Transaction Type,Sort Code,Account Number,Transaction 
>> Description,Debit Amount,Credit Amount,Balance,])
>> 
>> Now in Zeppelin create an external table and read it
>> 
>> <image.png>
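>> 
>> For reference, the DDL behind that screenshot would be roughly as follows (a sketch; 
>> the table name and the C0-C8 column names just follow the inferred schema above):
>> 
>> CREATE EXTERNAL TABLE ll_18740868 (C0 string, C1 string, C2 string, C3 string, 
>> C4 string, C5 string, C6 string, C7 string, C8 string)
>> STORED AS PARQUET LOCATION '/user/hduser/ll_18740868.parquet';
>> 
>> select * from ll_18740868 limit 10;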
>> 
>> 
>> 
>> HTH
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>> On 29 July 2016 at 09:04, Chanh Le <giaosu...@gmail.com> wrote:
>> I continued debugging:
>> 
>> 16/07/29 13:57:35 INFO FileScanRDD: Reading File path: 
>> file:///Users/giaosudau/Documents/Topics.parquet/part-r-00000-8997050f-e063-427e-b53c-f0a61739706f.gz.parquet,
>>  range: 0-3118, partition values: [empty row]
>> vs OK one
>> 16/07/29 15:02:47 INFO FileScanRDD: Reading File path: 
>> file:///Users/giaosudau/data_example/FACT_ADMIN_HOURLY/time=2016-07-24-18/network_id=30206/part-r-00000-c5f5e18d-c8a1-4831-8903-3c60b02bdfe8.snappy.parquet,
>>  range: 0-6050, partition values: [2016-07-24-18,30206]
>> 
>> I attached 2 files.
>> 
>> 
>> 
>> 
>> 
>> 
>>> On Jul 29, 2016, at 9:44 AM, Chanh Le <giaosu...@gmail.com> wrote:
>>> 
>>> Hi everyone,
>>> 
>>> For further investigation, I attached the file that I converted from CSV to parquet.
>>> 
>>> Spark Code
>>> 
>>> I loaded it from the CSV file:
>>> val df = spark.sqlContext.read 
>>> .format("com.databricks.spark.csv").option("delimiter", 
>>> ",").option("header", "true").option("inferSchema", 
>>> "true").load("/Users/giaosudau/Downloads/Topics.xls - Sheet 1.csv")
>>> I create a Parquet file:
>>> df.write.mode("overwrite").parquet("/Users/giaosudau/Documents/Topics.parquet")
>>> 
>>> It’s OK in Spark-Shell
>>> 
>>> scala> df.take(5)
>>> res22: Array[org.apache.spark.sql.Row] = Array([124,Nghệ thuật & Giải 
>>> trí,Arts & Entertainment,0,124,1], [53,Scandal,Scandal,124,124,53,2], 
>>> [54,Showbiz - World,Showbiz-World,124,124,54,2], [52,Âm 
>>> nhạc,Entertainment-Music,124,124,52,2], [47,Bar - Karaoke - 
>>> Massage,Bar-Karaoke-Massage-Prostitution,124,124,47,2])
>>> 
>>> When I create a table in STS:
>>> 
>>> 0: jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE topic (TOPIC_ID int, 
>>> TOPIC_NAME_VN String, TOPIC_NAME_EN String, PARENT_ID int, FULL_PARENT 
>>> String, LEVEL_ID int) STORED AS PARQUET LOCATION 
>>> '/Users/giaosudau/Documents/Topics.parquet';
>>> 
>>> But I get all results as NULL:
>>> 
>>> <Screen Shot 2016-07-29 at 9.42.26 AM.png>
>>> 
>>> 
>>> 
>>> I think it’s really a BUG, right?
>>> 
>>> Regards,
>>> Chanh
>>> 
>>> 
>>> <Topics.parquet>
>>> 
>>> 
>>> <Topics.xls - Sheet 1.csv>
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On Jul 28, 2016, at 4:25 PM, Chanh Le <giaosu...@gmail.com> wrote:
>>>> 
>>>> Hi everyone,
>>>> 
>>>> I have a problem when I create an external table in Spark Thrift Server (STS) 
>>>> and query the data.
>>>> 
>>>> Scenario:
>>>> Spark 2.0
>>>> Alluxio 1.2.0 
>>>> Zeppelin 0.7.0
>>>> STS start script 
>>>> /home/spark/spark-2.0.0-bin-hadoop2.6/sbin/start-thriftserver.sh --master 
>>>> mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf 
>>>> spark.driver.memory=5G --conf spark.scheduler.mode=FAIR --class 
>>>> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --jars 
>>>> /home/spark/spark-2.0.0-bin-hadoop2.6/jars/alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar
>>>>  --total-executor-cores 35 spark-internal --hiveconf 
>>>> hive.server2.thrift.port=10000 --hiveconf 
>>>> hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
>>>> hive.metastore.metadb.dir=/user/hive/metadb --conf 
>>>> spark.sql.shuffle.partitions=20
>>>> 
>>>> I have a file stored in Alluxio at alluxio://master2:19998/etl_info/TOPIC
>>>> 
>>>> Then I create a table in STS with:
>>>> CREATE EXTERNAL TABLE topic (topic_id int, topic_name_vn String, 
>>>> topic_name_en String, parent_id int, full_parent String, level_id int)
>>>> STORED AS PARQUET LOCATION 'alluxio://master2:19998/etl_info/TOPIC';
>>>> 
>>>> To compare STS with Spark, I create a temp table named topics:
>>>> spark.sqlContext.read.parquet("alluxio://master2:19998/etl_info/TOPIC").registerTempTable("topics")
>>>> 
>>>> Then I query both and compare.
>>>> <Screen Shot 2016-07-28 at 4.18.59 PM.png>
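>>>> 
>>>> The two queries behind that screenshot are roughly (a sketch):
>>>> 
>>>> -- via STS (beeline), against the external table backed by the Alluxio path
>>>> select * from topic limit 10;
>>>> 
>>>> // via Zeppelin / spark-shell, against the temp table registered above
>>>> spark.sql("select * from topics limit 10").show()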
>>>> 
>>>> 
>>>> As you can see, the results are different.
>>>> Is that a bug? Or did I do something wrong?
>>>> 
>>>> Regards,
>>>> Chanh
>>> 
>> 
>> 
>> 
>> 
> 
> 
