[DISCUSS] Support detecting table schema from external files.

2024-01-15 Thread Yisha Zhou
Hi dev,

Currently, we usually create a table either by listing all physical columns or 
by using the LIKE syntax to reuse a table schema from a Catalog.
However, in our company there are many cases where the messages in external 
systems come with very complex schemas. In the worst
case, some Protobuf data contains thousands of fields.

In these cases, listing the fields in the DDL is very hard work, and creating 
and updating such complex schemas in Catalogs is also costly.
Therefore, I’d like to introduce the ability to detect the table schema from 
external files in DDL.

A good precedent from Snowflake [1] works as follows:

CREATE TABLE mytable
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
      FROM TABLE(
        INFER_SCHEMA(
          LOCATION=>'@mystage/json/',
          FILE_FORMAT=>'my_json_format'
        )
      ));

INFER_SCHEMA is a table function that 'automatically detects the file 
metadata schema in a set of staged data files that contain 
semi-structured data and retrieves the column definitions.' The files can be in 
Parquet, Avro, ORC, JSON, or CSV format.
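
For context, the same table function can also be queried on its own to inspect 
the inferred columns before any table is created. A minimal sketch, reusing the 
stage and file format names from the example above (the exact output columns 
are from memory and may differ slightly):

SELECT *
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION=>'@mystage/json/',
      FILE_FORMAT=>'my_json_format'
    )
  );

The result is typically one row per detected column, with fields such as 
COLUMN_NAME, TYPE, NULLABLE and an EXPRESSION that can be reused in a DDL.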

We don’t need to follow this exact syntax, but the functionality is exactly 
what I want. In addition, the file could be more than just a semi-structured 
data file; it could also be a metadata file, for example a .proto file or a 
.thrift file, as sketched below.
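
To make the idea concrete, here is a purely hypothetical sketch of what such a 
DDL could look like in Flink. None of the schema-related option keys below 
exist today; they are only illustrative:

CREATE TABLE user_events
WITH (
  'connector' = 'kafka',
  'topic' = 'user_events',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'protobuf',
  -- hypothetical option: point the planner at a metadata file and let it
  -- derive the physical columns instead of listing them by hand
  'schema.file' = 'file:///schemas/user_events.proto'
);

The point is only that the column list is omitted and derived from the 
referenced file, much like the LIKE clause reuses a Catalog schema today.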

As this will be a big feature, it deserves a FLIP to describe it in detail. I'm 
looking forward to your feedback and suggestions before I start working on it.

Best,
Yisha

[1] https://docs.snowflake.com/en/sql-reference/functions/infer_schema


Re: [DISCUSS] Support detecting table schema from external files.

2024-01-16 Thread Benchao Li
Thanks Yisha for bringing up this discussion. Schema inference is a
very interesting and useful feature, especially for formats with
well-defined schemas such as Protobuf and Parquet. I'm looking forward
to the FLIP.

-- 

Best,
Benchao Li


Re: [DISCUSS] Support detecting table schema from external files.

2024-01-23 Thread Jane Chan
Hi Yisha,

Thanks for driving this discussion. I think it's a valuable feature;
feel free to go ahead.

Best,
Jane
