Pyspark - Kerberos authentication error

2019-08-07 Thread Pravinkumar vp
Hello, I’m new to PySpark programming and facing an issue when trying to connect to HDFS with a Kerberos auth principal and keytab. My environment is a Docker container, so I installed the necessary client libraries, including pyspark and krb5-*. In my scenario, before creating the Spark context I
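
A minimal sketch of one common pattern (hypothetical principal and paths): acquire a Kerberos ticket with kinit inside the container before the Spark context is created, then point Spark at the same keytab for renewal on YARN:

```python
import subprocess
from pyspark.sql import SparkSession

principal = "user@EXAMPLE.COM"                # hypothetical principal
keytab = "/etc/security/keytabs/user.keytab"  # hypothetical keytab path

# Obtain a Kerberos ticket inside the container before Spark starts
subprocess.run(["kinit", "-kt", keytab, principal], check=True)

spark = (SparkSession.builder
         .appName("kerberos-hdfs")
         .config("spark.yarn.keytab", keytab)       # used for ticket renewal on YARN
         .config("spark.yarn.principal", principal)
         .getOrCreate())

# Read from the secured HDFS (hypothetical path)
spark.read.text("hdfs:///tmp/sample.txt").show()
```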

Sharing ideas on using Databricks Delta Lake

2019-08-07 Thread Mich Talebzadeh
I upgraded my Spark to 2.4.3, which allows using the Delta Lake storage layer. Actually, I wish Databricks had chosen a different name for it :) Anyhow, although most storage examples are on a normal file system (/tmp/), I managed to put data
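
For reference, a minimal sketch of the write/read round trip on a plain file system (assuming the delta-core package matching your Scala version is on the classpath, e.g. io.delta:delta-core_2.11:0.3.0 for Spark 2.4.x):

```python
from pyspark.sql import SparkSession

# Requires the Delta Lake package, e.g.:
#   pyspark --packages io.delta:delta-core_2.11:0.3.0
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Write a small table in Delta format under /tmp, then read it back
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta-table")
spark.read.format("delta").load("/tmp/delta-table").show()
```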

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Mich Talebzadeh
Have you updated partition statistics by any chance? I assume you can access the table and data through Hive itself? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
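
If stale partition metadata is the culprit, a minimal sketch (hypothetical table name) of refreshing it from Spark:

```python
# Re-register any partitions that exist on HDFS but not in the metastore,
# then recompute table-level statistics (hypothetical table name)
spark.sql("MSCK REPAIR TABLE mydb.mytable")
spark.sql("ANALYZE TABLE mydb.mytable COMPUTE STATISTICS")
```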

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Patrick McCarthy
Do the permissions on the Hive table files on HDFS correspond with what the Spark user is able to read? This might arise from Spark being run as different users. On Wed, Aug 7, 2019 at 3:15 PM Rishikesh Gawade wrote: > Hi, > I did not explicitly create a Hive Context. I have been using the >
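
A quick way to check, sketched below (hypothetical warehouse path); compare the file owners and modes against the user that actually runs the Spark job:

```python
import subprocess

# Hypothetical table location under the Hive warehouse
path = "/user/hive/warehouse/mydb.db/mytable"

subprocess.run(["whoami"], check=True)                          # user running Spark
subprocess.run(["hdfs", "dfs", "-ls", "-R", path], check=True)  # file owners/modes
```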

Re: Spark scala/Hive scenario

2019-08-07 Thread Jörn Franke
You can use the map datatype on the Hive table for the columns that are uncertain: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes However, maybe you can share more concrete details, because there could also be other solutions. > On
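
A minimal sketch (hypothetical table and column names) of such a table, with the uncertain columns folded into a map:

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.events (
        id BIGINT,
        event_time TIMESTAMP,
        extra_attrs MAP<STRING, STRING>  -- fields that vary between feeds
    )
    STORED AS PARQUET
""")
```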

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Hi, I did not explicitly create a Hive Context. I have been using the spark.sqlContext that gets created upon launching the spark-shell. Isn't this sqlContext the same as the hiveContext? Thanks, Rishikesh On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke wrote: > Do you use the HiveContext in Spark? Do
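
In Spark 2.x both wrappers sit on top of a single SparkSession; whether it is Hive-aware depends on the catalog implementation. A minimal sketch of checking this, and of enabling Hive support explicitly in a standalone app:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-enabled")
         .enableHiveSupport()  # the SparkSession replacement for HiveContext
         .getOrCreate())

# Prints "hive" when the session uses the Hive metastore catalog
print(spark.sparkContext.getConf().get("spark.sql.catalogImplementation", "in-memory"))
```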

Spark scala/Hive scenario

2019-08-07 Thread anbutech
Hi All, I have a scenario in Spark Scala/Hive. Day 1: I have a file with 5 columns which needs to be processed and loaded into Hive tables. Day 2: the next day, the same feed (file) has 8 columns (additional fields) which need to be processed and loaded into Hive tables. How do we approach this
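
One common approach, sketched under assumptions (hypothetical paths, CSV feeds, Parquet staging): land each day's feed as Parquet and let schema merging surface the union of columns, so day-1 rows simply carry nulls for the three new day-2 fields:

```python
# Day 1 feed: 5 columns; Day 2 feed: 8 columns (hypothetical paths)
day1 = spark.read.csv("/data/feed/2019-08-06", header=True, inferSchema=True)
day2 = spark.read.csv("/data/feed/2019-08-07", header=True, inferSchema=True)

day1.write.parquet("/data/staged/day=2019-08-06")
day2.write.parquet("/data/staged/day=2019-08-07")

# Merge the evolving schemas; older rows get NULL for the new columns
merged = spark.read.option("mergeSchema", "true").parquet("/data/staged")
merged.write.mode("overwrite").saveAsTable("mydb.feed_table")  # hypothetical table
```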

Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-07 Thread Hao Ren
Hi, I am using Spark SQL 2.3.3 to read a Hive table which is partitioned by day, hour, platform, request_status and is_sampled. The underlying data is in Parquet format on HDFS. Here is the SQL query to read just *one partition*. ``` spark.sql(""" SELECT rtb_platform_id, SUM(e_cpm) FROM
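
A sketch of the shape of such a query (hypothetical table name and partition values); with pruning working, the physical plan's PartitionFilters should list these predicates and only one leaf directory should be scanned:

```python
df = spark.sql("""
    SELECT rtb_platform_id, SUM(e_cpm)
    FROM mydb.requests
    WHERE day = '2019-08-07' AND hour = '00' AND platform = 'web'
      AND request_status = 'OK' AND is_sampled = true
    GROUP BY rtb_platform_id
""")
df.explain(True)  # inspect PartitionFilters / partition count in the plan
```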

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Jörn Franke
Do you use the HiveContext in Spark? Do you configure the same options there? Can you share some code? > On 07.08.2019 at 08:50, Rishikesh Gawade wrote: > > Hi. > I am using Spark 2.3.2 and Hive 3.1.0. > Even if I use Parquet files the result would be the same, because after all > sparkSQL
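
For context, a sketch of what "the same options" could look like on a Hive-enabled session (these are standard Hive/MapReduce properties; whether Spark's native reader honors them varies by version):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()
         .config("hive.mapred.supports.subdirectories", "true")
         .config("mapreduce.input.fileinputformat.input.dir.recursive", "true")
         .getOrCreate())
```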

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Hi. I am using Spark 2.3.2 and Hive 3.1.0. Even if I use Parquet files the result would be the same, because after all sparkSQL isn't able to descend into the subdirectories over which the table is created. Could there be any other way? Thanks, Rishikesh On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh
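
One possible workaround (hypothetical path, assuming a single level of subdirectories): bypass the table definition and read the files directly with a glob, then query the result as a temp view:

```python
# Reach one level of subdirectories under the table location (hypothetical path)
df = spark.read.parquet("/user/hive/warehouse/mydb.db/mytable/*/")
df.createOrReplaceTempView("mytable_flat")
spark.sql("SELECT COUNT(*) FROM mytable_flat").show()
```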