Re: Data Contracts

2023-06-12 Thread Elliot West
Hi Phillip,

While not as fine-grained as your example, there do exist schema systems,
such as Avro's, that can evaluate whether a change to a schema is compatible
or incompatible from the perspective of the reader, the writer, or both.
This provides some degree of enforcement, and a means to communicate a
contract. Interestingly, I believe this approach has been applied to both
JSON Schema and Protobuf as part of the Confluent Schema Registry.
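
For illustration, here is a minimal sketch of that compatibility check using
Avro's org.apache.avro.SchemaCompatibility API from Scala (the record and
field names below are hypothetical):

import org.apache.avro.{Schema, SchemaCompatibility}

// Writer schema: the shape of the data already being produced.
val writer = new Schema.Parser().parse(
  """{"type":"record","name":"User","fields":[
    |  {"name":"id","type":"int"},
    |  {"name":"name","type":"string"}
    |]}""".stripMargin)

// Reader schema: a consumer that adds an optional field with a default,
// which Avro treats as a compatible change from the reader's perspective.
val reader = new Schema.Parser().parse(
  """{"type":"record","name":"User","fields":[
    |  {"name":"id","type":"int"},
    |  {"name":"name","type":"string"},
    |  {"name":"email","type":["null","string"],"default":null}
    |]}""".stripMargin)

val result = SchemaCompatibility.checkReaderWriterCompatibility(reader, writer)
println(result.getType) // COMPATIBLE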

Elliot.

On Mon, 12 Jun 2023 at 12:43, Phillip Henry  wrote:

> Hi, folks.
>
> There currently seems to be a buzz around "data contracts". From what I
> can tell, these mainly advocate a cultural solution. But instead, could big
> data tools be used to enforce these contracts?
>
> My questions really are: are there any plans to implement data constraints
> in Spark (e.g., an integer must be between 0 and 100; the date in column X
> must be before that in column Y)? And if not, is there an appetite for them?
>
> Maybe we could associate constraints with the schema metadata and enforce
> them in the implementation of a FileFormatDataWriter?
>
> Just throwing it out there and wondering what other people think. It's an
> area that interests me as it seems that over half my problems at the day
> job are because of dodgy data.
>
> Regards,
>
> Phillip
>
>
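
As a sketch of the kinds of constraint Phillip describes, the checks can be
expressed today as plain Column predicates evaluated before a write. The
table, column, and path names below are hypothetical; this validates at the
application level rather than inside FileFormatDataWriter, but it shows the
predicates such a contract would need to express:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("contract-check").getOrCreate()

// Hypothetical dataset with the two constraints from the question:
// `score` must lie in [0, 100], and `start_date` must precede `end_date`.
val df = spark.read.parquet("/data/events")

val violations = df.filter(
  !(col("score").between(0, 100) && col("start_date") < col("end_date")))

// Abort the write if any row breaks the contract.
if (!violations.isEmpty) {
  violations.show(20, truncate = false)
  sys.error("Data contract violated; aborting write")
}
df.write.mode("append").parquet("/data/validated")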


Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Elliot West
Hi Ankit,

While not a part of Spark, there is a project called 'Waggle Dance' that can
federate multiple Hive metastores so that they are accessible via a single
URI: https://github.com/ExpediaGroup/waggle-dance

This may be useful or perhaps serve as inspiration.

Thanks,

Elliot.

On Mon, 17 Apr 2023 at 16:38, Ankit Gupta  wrote:

> ++
> User Mailing List
>
> Just a reminder, in case anyone can help with this.
>
> Thanks a lot !
>
> Ankit Prakash Gupta
>
> On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta  wrote:
>
>> Hi All
>>
>> The question is regarding support for multiple Remote Hive Metastore
>> catalogs in Spark. Starting with Spark 3, multiple catalog support was
>> added, but is there any CatalogPlugin implementation that can help us
>> configure multiple Remote Hive Metastore catalogs? If yes, can anyone help
>> me with the fully qualified class name that I can try using to configure a
>> Hive Metastore catalog? If not, I would like to work on an implementation
>> of CatalogPlugin that we can use to configure multiple Hive Metastore
>> servers.
>>
>> Thanks and Regards.
>>
>> Ankit Prakash Gupta
>> +91 8750101321
>> info.ank...@gmail.com
>>
>>
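
For reference, Spark 3's CatalogPlugin interface is small; a skeleton for
the kind of plugin Ankit describes might look like the following. The class
and option names are hypothetical, and a real implementation would also need
to implement TableCatalog (and SupportsNamespaces) and delegate to a Hive
metastore client built from the configured URI:

import org.apache.spark.sql.connector.catalog.CatalogPlugin
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class RemoteHiveMetastoreCatalog extends CatalogPlugin {
  private var catalogName: String = _
  private var metastoreUri: String = _

  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
    catalogName = name
    // Each catalog instance reads its own metastore URI from the Spark conf,
    // i.e. spark.sql.catalog.<name>.uri
    metastoreUri = options.get("uri")
  }

  override def name(): String = catalogName
}

Two metastores could then be registered side by side with, for example,
spark.sql.catalog.hms_a=com.example.RemoteHiveMetastoreCatalog and
spark.sql.catalog.hms_a.uri=thrift://metastore-a:9083 (and likewise for
hms_b).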


Fwd: Spark integration with HCatalog (specifically regarding partitions)

2016-01-19 Thread Elliot West
(Cross-posted from u...@spark.apache.org)

Hello,

I am in the process of evaluating Spark (1.5.2) for a wide range of use
cases. In particular, I'm keen to understand the depth of its integration
with HCatalog (aka the Hive Metastore). I was very encouraged when browsing
the source contained in the org.apache.spark.sql.hive package. My goals
are to evaluate how effectively Spark handles the following scenarios:

   1. Reading from an unpartitioned HCatalog table.
   2. Reading from a partitioned HCatalog table with partition pruning from
   filter pushdown.
   3. Writing to a new unpartitioned HCatalog table.
   4. Writing to a new partitioned HCatalog table.
   5. Adding a partition to a partitioned HCatalog table.

I found that the first three cases appear to function beautifully. However,
I cannot seem to create new HCatalog-aware partitions, either in a new table
or on an existing table (cases 4 & 5). I suspect this may be due to my
inexperience with Spark, so I wonder if you could advise me on what to try
next. Here's what I have:

*Case 4: Writing to a new partitioned HCatalog table*

Create a source table in Hive (this could also be a plain data file):


hive (default)> create table foobar ( id int, name string );
hive (default)> insert into table foobar values (1, "xxx"), (2, "zzz");

Read the source with Spark, partition the data, and write to a new table:

sqlContext.sql("select *
from foobar").write.format("orc").partitionBy("id").saveAsTable("raboof")


Check for the new table in Hive; it is partitioned correctly, although the
formats and schema are unexpected:

hive (default)> show table extended like 'raboof';
OK
tab_name
tableName: raboof
location:hdfs://host:port/user/hive/warehouse/raboof
inputformat:org.apache.hadoop.mapred.SequenceFileInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
columns:struct columns { list col}
partitioned:true
partitionColumns:struct partition_columns { i32 id}


Check for correctly partitioned data on HDFS; it appears to be there:

[me@host]$ hdfs dfs -ls -R /user/hive/warehouse/raboof
/user/hive/warehouse/raboof/_SUCCESS
/user/hive/warehouse/raboof/id=1
/user/hive/warehouse/raboof/id=1/part-r-0-.orc
/user/hive/warehouse/raboof/id=2
/user/hive/warehouse/raboof/id=2/part-r-0-.orc


Something is wrong, however: no data is returned by this query, and the
column names appear incorrect:

hive (default)> select * from default.raboof;
OK
col id

HCatalog reports no partitions for the table:

hive (default)> show partitions default.raboof;
OK
partition
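
(One workaround that may apply here, sketched below: write the partitioned
ORC data to a plain path rather than via saveAsTable, declare an external
Hive table over it, and ask Hive to discover the partition directories. The
path and table names follow the example above and are illustrative.)

sqlContext.sql("select * from foobar")
  .write.format("orc").partitionBy("id")
  .save("/user/hive/warehouse/raboof_data")

sqlContext.sql(
  """CREATE EXTERNAL TABLE raboof (name STRING)
    |PARTITIONED BY (id INT)
    |STORED AS ORC
    |LOCATION '/user/hive/warehouse/raboof_data'""".stripMargin)

// Register the partition directories with the metastore.
sqlContext.sql("MSCK REPAIR TABLE raboof")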

*Case 5: Adding a partition to a partitioned HCatalog table*

Create a partitioned source table in Hive:

hive (default)> create table foobar ( name string )
  > partitioned by ( id int )
  > stored as orc;
hive (default)> insert into table foobar PARTITION (id)
  > values ("xxx", 1), ("yyy", 2);


Create a source table, new_record_source, holding the new record to add:

hive (default)> create table new_record_source ( id int, name string )
  > stored as orc;
hive (default)> insert into table new_record_source
  > values (3, "zzz");


Try to add the partition with:

sqlContext.sql("select *
from 
new_record_source").write.mode("append").partitionBy("id").saveAsTable("foobar")


This almost did what I wanted:

hive (default)> show partitions default.foobar;
partition
id=1
id=2
id=__HIVE_DEFAULT_PARTITION__

hive (default)> select * from default.foobar;
name id
xxx 1
yyy 2
3 NULL
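
(The output above suggests the appended columns were matched by position
rather than by name: 3 landed in `name`, and "zzz" could not be cast to an
int for `id`, hence the NULL and the __HIVE_DEFAULT_PARTITION__ entry. If
that is the cause, selecting the source columns in the target order, data
columns first and the partition column last, might avoid it:)

sqlContext.sql("select name, id from new_record_source")
  .write.mode("append").partitionBy("id").saveAsTable("foobar")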

Any assistance would be greatly appreciated.

Many thanks - Elliot.