Thanks Michael

Thanks for the response. Here is my understanding; please correct me if I am wrong.

1) Partitioned tables written by Spark SQL do not have their partition
metadata written to the Hive metastore. Spark SQL discovers partitions from
the table location on the underlying DFS, not from the metastore. It does
this the first time a table is accessed, so if the underlying partitions
change, a refresh table <tableName> is required (see the sketch below). Is
there a way to see the partitions discovered by Spark SQL? show partitions
<tableName> does not work on Spark SQL partitioned tables. Also, Hive allows
different partitions to live in different physical locations; I guess this
won't be possible in Spark SQL.
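
A minimal sketch of forcing re-discovery after new partition directories
appear (the table name "events", the dt partition column and the use of a
HiveContext are just my assumptions for illustration):

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)  // assumes an existing SparkContext sc

    // re-discover partitions after directories were added outside of Spark
    sqlContext.refreshTable("events")     // programmatic form of: refresh table events

    // queries now see the newly added partition directories
    sqlContext.sql("SELECT count(*) FROM events WHERE dt = '2015-11-22'").show()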

2) If you want to retain compatibility with other SQL-on-Hadoop engines,
register your DataFrame as a temp table and then use Hive's dynamic
partitioned insert syntax (sketch below). Spark SQL uses this for Hive-style
tables.
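
A rough sketch of the pattern I mean, assuming a partitioned Hive table
events (id int, value string) partitioned by (dt string) already exists; all
names below are made up for illustration:

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = Seq((1, "a", "2015-11-22"), (2, "b", "2015-11-23"))
      .toDF("id", "value", "dt")
    df.registerTempTable("staging")

    // Hive needs these when all partition columns are dynamic
    sqlContext.sql("SET hive.exec.dynamic.partition=true")
    sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // the dynamic partitioned insert keeps the Hive metastore in sync
    sqlContext.sql(
      """INSERT OVERWRITE TABLE events PARTITION (dt)
        |SELECT id, value, dt FROM staging""".stripMargin)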

3) Automatic schema discovery. I presume this is Parquet only, and only if
spark.sql.parquet.mergeSchema / the mergeSchema read option is set to true.
What happens when mergeSchema is set to false? (I guess I can check this out;
see the sketch below.)
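
For reference, the two read paths I would compare (the path /data/events is
made up; my understanding is that mergeSchema defaults to false since 1.5.0,
in which case the schema is taken from a summary file or a single part-file
rather than merged across all partitions):

    // merge schemas across all Parquet part-files of the table
    val merged = sqlContext.read.option("mergeSchema", "true").parquet("/data/events")
    merged.printSchema()

    // default: schema from summary/single file, so columns that only exist
    // in newer partitions may not show up
    val single = sqlContext.read.option("mergeSchema", "false").parquet("/data/events")
    single.printSchema()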

My two cents

a) it would help if there were something like an equivalent of Hive's
nonstrict mode, which would enforce schema compatibility for all partitions
written to a table.
b) refresh table is annoying for tables whose partitions are being written
frequently for other reasons; I am not sure if there is a way around this.
c) it would be great if DataFrameWriter had an option to maintain
compatibility with the Hive metastore. registerTempTable followed by "insert
overwrite table ... select ... from" is quite ugly and cumbersome.
d) it would be helpful to resurrect the Databricks Spark Knowledge Base
(https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/);
I am happy to help out with the Spark SQL portions.

Regards
Deenar


On 22 November 2015 at 18:54, Michael Armbrust <mich...@databricks.com>
wrote:

> Is it possible to add a new partition to a persistent table using Spark
>> SQL ? The following call works and data gets written in the correct
>> directories, but no partition metadata is added to the Hive metastore.
>>
> I believe if you use Hive's dynamic partitioned insert syntax then we will
> fall back on the metastore and do the update.
>
>> In addition I see nothing preventing any arbitrary schema being appended
>> to the existing table.
>>
> This is perhaps kind of a feature; we do automatic schema discovery and
> merging when loading a new parquet table.
>
>> Does SparkSQL not need partition metadata when reading data back?
>>
> No, we dynamically discover it in a distributed job when the table is
> loaded.
>
