(Cross posted from u...@spark.apache.org)
Hello,
I am in the process of evaluating Spark (1.5.2) for a wide range of use
cases. In particular I'm keen to understand the depth of the integration
with HCatalog (aka the Hive Metastore). I am very encouraged when browsing
the source contained within the org.apache.spark.sql.hive package. My goals
are to evaluate how effectively Spark handles the following scenarios:
1. Reading from an unpartitioned HCatalog table.
2. Reading from a partitioned HCatalog table with partition pruning from
filter pushdown.
3. Writing to a new unpartitioned HCatalog table.
4. Writing to a new partitioned HCatalog table.
5. Adding a partition to a partitioned HCatalog table.
I found that the first three cases appear to function beautifully. However,
I cannot seem to effectively create new HCatalog aware partitions either in
a new table or on and existing table (cases 4 & 5). I suspect this may be
due to my inexperience with Spark so wonder if you could advise me on what
to try next. Here's what I have:
*Case 4: Writing to a new partitioned HCatalog table*
Create a source in Hive (could be plain data file also):
hive (default)> create table foobar ( id int, name string );
hive (default)> insert into table foobar values (1, "xxx"), (2, "zzz");
Read the source with Spark, partition the data, and write to a new table:
sqlContext.sql("select *
from foobar").write.format("orc").partitionBy("id").saveAsTable("raboof")
Check for the new table in Hive, it is partitioned correctly although the
formats and schema are unexpected:
hive (default)> show table extended like 'raboof';
OK
tab_name
tableName: raboof
location:hdfs://host:port/user/hive/warehouse/raboof
inputformat:org.apache.hadoop.mapred.SequenceFileInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
columns:struct columns { list col}
partitioned:true
partitionColumns:struct partition_columns { i32 id}
Check for correctly partitioned data on HDFS, it appears to be there:
[me@host]$ hdfs dfs -ls -R /user/hive/warehouse/raboof
/user/hive/warehouse/raboof/_SUCCESS
/user/hive/warehouse/raboof/id=1
/user/hive/warehouse/raboof/id=1/part-r-0-.orc
/user/hive/warehouse/raboof/id=2
/user/hive/warehouse/raboof/id=2/part-r-0-.orc
Something is wrong however, no data is returned from this query and the
column names appear incorrect:
hive (default)> select * from default.raboof;
OK
col id
HCatalog reports no partitions for the table:
hive (default)> show partitions default.raboof;
OK
partition
*Case 5: Adding a partition to a partitioned HCatalog table*
Created partitioned source table in Hive:
hive (default)> create table foobar ( name string )
> partitioned by ( id int )
> stored as orc;
hive (default)> insert into table foobar PARTITION (id)
> values ("xxx", 1), ("yyy", 2);
Created a source for a new record to add to new_record_source:
hive (default)> create table new_record_source ( id int, name string )
> stored as orc;
hive (default)> insert into table new_record_source
> values (3, "zzz");
Trying to add a partition with:
sqlContext.sql("select *
from
new_record_source").write.mode("append").partitionBy("id").saveAsTable("foobar")
This almost did what I wanted:
hive (default)> show partitions default.foobar;
partition
id=1
id=2
id=__HIVE_DEFAULT_PARTITION__
hive (default)> select * from default.foobar;
name id
xxx 1
yyy 2
3 NULL
Any assistance would be greatly appreciated.
Many thanks - Elliot.