Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-18 Thread Charles Givre
Hey Mike,
So it looks like I was wrong: the XML reader does not support arrays.
However, once DRILL-8450 is merged, I'll add the readers for arrays.
The XML reader itself still won't be able to detect them dynamically until we
finish the XSD support, but at least the infrastructure will be there.
Best,
-- C


> On Aug 15, 2023, at 11:39 PM, Charles Givre  wrote:
> 
> I stand corrected...  It does not look like the XML reader has any support 
> for arrays.
> -- C
> 
>> On Aug 15, 2023, at 12:01 AM, Paul Rogers  wrote:
>> 
>> IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>" such
>> as "ARRAY<INT>". This works, however, only if the XML reader uses the
>> (very complex) EVF framework and has a way to control parsing based on the
>> data type (and to set the data type based on parsing). The JSON reader has
>> such an integration. Charles, did you do the work to add that kind of
>> dynamic state machine to the XML parser?
>> 
>> - Paul
>> 
>> On Mon, Aug 14, 2023 at 6:28 PM Charles Givre  wrote:
>> 
>>> Hi Mike,
>>> It is theoretically possible but I don't have an example of the syntax.
>>> As you've probably figured out, Drill vectors have both a type and data
>>> mode.  The mode is either NULLABLE or REPEATED if I remember correctly.
>>> Thus, you could tell Drill via the inline schema that the data mode for a
>>> given field is REPEATED and that would be the Drill equivalent of an
>>> Array.  I've never actually done this, so I don't really know if it would
>>> work for inline schemata but I'd assume that it would.
>>> 
>>> I'll do some digging to see whether I have any examples of this.
>>> Best,
>>> --C
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
>>>> 
>>>> I'm trying to get my Drill SQL queries to produce the right thing from
>>> XML.
>>>> 
>>>> A major thing that you can't easily infer from looking at just XML data
>>> is
>>>> what is an array. XML lacks an array starting indicator.
>>>> 
>>>> Is there an inline schema notation in the Drill Query language for
>>>> array-ness, so that one can inform Drill what is an array?
>>>> 
>>>> For example this provides simple types for all the fields directly in the
>>>> query.
>>>> 
>>>> @Test
>>>> 
>>>> public void testSimpleProvidedSchema() throws Exception {
>>>> 
>>>> String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
>>>> (type => 'xml', schema " +
>>>> 
>>>>   "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
>>>> FLOAT, `double_field` DOUBLE, `boolean_field` " +
>>>> 
>>>>   "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
>>>> TIMESTAMP, `string_field`" +
>>>> 
>>>>   " VARCHAR, `date2_field` DATE properties {`drill.format` =
>>>> `MM/dd/`})'))";
>>>> 
>>>> RowSet results = client.queryBuilder().sql(sql).rowSet();
>>>> 
>>>> assertEquals(2, results.rowCount());
>>>> 
>>>> 
>>>> Can one also tell Drill what fields or child elements are arrays?
>>> 
>>> 
> 





Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-15 Thread Charles Givre
I stand corrected...  It does not look like the XML reader has any support for 
arrays.
-- C

> On Aug 15, 2023, at 12:01 AM, Paul Rogers  wrote:
> 
> IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>" such
> as "ARRAY<INT>". This works, however, only if the XML reader uses the
> (very complex) EVF framework and has a way to control parsing based on the
> data type (and to set the data type based on parsing). The JSON reader has
> such an integration. Charles, did you do the work to add that kind of
> dynamic state machine to the XML parser?
> 
> - Paul
> 
> On Mon, Aug 14, 2023 at 6:28 PM Charles Givre  wrote:
> 
>> Hi Mike,
>> It is theoretically possible but I don't have an example of the syntax.
>> As you've probably figured out, Drill vectors have both a type and data
>> mode.  The mode is either NULLABLE or REPEATED if I remember correctly.
>> Thus, you could tell Drill via the inline schema that the data mode for a
>> given field is REPEATED and that would be the Drill equivalent of an
>> Array.  I've never actually done this, so I don't really know if it would
>> work for inline schemata but I'd assume that it would.
>> 
>> I'll do some digging to see whether I have any examples of this.
>> Best,
>> --C
>> 
>> 
>> 
>> 
>> 
>>> On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
>>> 
>>> I'm trying to get my Drill SQL queries to produce the right thing from
>> XML.
>>> 
>>> A major thing that you can't easily infer from looking at just XML data
>> is
>>> what is an array. XML lacks an array starting indicator.
>>> 
>>> Is there an inline schema notation in the Drill Query language for
>>> array-ness, so that one can inform Drill what is an array?
>>> 
>>> For example this provides simple types for all the fields directly in the
>>> query.
>>> 
>>> @Test
>>> 
>>> public void testSimpleProvidedSchema() throws Exception {
>>> 
>>> String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
>>> (type => 'xml', schema " +
>>> 
>>>   "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
>>> FLOAT, `double_field` DOUBLE, `boolean_field` " +
>>> 
>>>   "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
>>> TIMESTAMP, `string_field`" +
>>> 
>>>   " VARCHAR, `date2_field` DATE properties {`drill.format` =
>>> `MM/dd/`})'))";
>>> 
>>> RowSet results = client.queryBuilder().sql(sql).rowSet();
>>> 
>>> assertEquals(2, results.rowCount());
>>> 
>>> 
>>> Can one also tell Drill what fields or child elements are arrays?
>> 
>> 





Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-15 Thread Charles Givre
Hey Paul,
The XML reader was implemented using the EVF2 Framework and in theory does have 
writers for repeated data types.  I'm not sure to what extent this has been 
tested.
Best,
-- C

> On Aug 15, 2023, at 12:01 AM, Paul Rogers  wrote:
> 
> IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>" such
> as "ARRAY<INT>". This works, however, only if the XML reader uses the
> (very complex) EVF framework and has a way to control parsing based on the
> data type (and to set the data type based on parsing). The JSON reader has
> such an integration. Charles, did you do the work to add that kind of
> dynamic state machine to the XML parser?
> 
> - Paul
> 
> On Mon, Aug 14, 2023 at 6:28 PM Charles Givre  wrote:
> 
>> Hi Mike,
>> It is theoretically possible but I don't have an example of the syntax.
>> As you've probably figured out, Drill vectors have both a type and data
>> mode.  The mode is either NULLABLE or REPEATED if I remember correctly.
>> Thus, you could tell Drill via the inline schema that the data mode for a
>> given field is REPEATED and that would be the Drill equivalent of an
>> Array.  I've never actually done this, so I don't really know if it would
>> work for inline schemata but I'd assume that it would.
>> 
>> I'll do some digging to see whether I have any examples of this.
>> Best,
>> --C
>> 
>> 
>> 
>> 
>> 
>>> On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
>>> 
>>> I'm trying to get my Drill SQL queries to produce the right thing from
>> XML.
>>> 
>>> A major thing that you can't easily infer from looking at just XML data
>> is
>>> what is an array. XML lacks an array starting indicator.
>>> 
>>> Is there an inline schema notation in the Drill Query language for
>>> array-ness, so that one can inform Drill what is an array?
>>> 
>>> For example this provides simple types for all the fields directly in the
>>> query.
>>> 
>>> @Test
>>> 
>>> public void testSimpleProvidedSchema() throws Exception {
>>> 
>>> String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
>>> (type => 'xml', schema " +
>>> 
>>>   "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
>>> FLOAT, `double_field` DOUBLE, `boolean_field` " +
>>> 
>>>   "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
>>> TIMESTAMP, `string_field`" +
>>> 
>>>   " VARCHAR, `date2_field` DATE properties {`drill.format` =
>>> `MM/dd/`})'))";
>>> 
>>> RowSet results = client.queryBuilder().sql(sql).rowSet();
>>> 
>>> assertEquals(2, results.rowCount());
>>> 
>>> 
>>> Can one also tell Drill what fields or child elements are arrays?
>> 
>> 





Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-14 Thread Paul Rogers
IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>" such
as "ARRAY<INT>". This works, however, only if the XML reader uses the
(very complex) EVF framework and has a way to control parsing based on the
data type (and to set the data type based on parsing). The JSON reader has
such an integration. Charles, did you do the work to add that kind of
dynamic state machine to the XML parser?

- Paul

On Mon, Aug 14, 2023 at 6:28 PM Charles Givre  wrote:

> Hi Mike,
> It is theoretically possible but I don't have an example of the syntax.
> As you've probably figured out, Drill vectors have both a type and data
> mode.  The mode is either NULLABLE or REPEATED if I remember correctly.
> Thus, you could tell Drill via the inline schema that the data mode for a
> given field is REPEATED and that would be the Drill equivalent of an
> Array.  I've never actually done this, so I don't really know if it would
> work for inline schemata but I'd assume that it would.
>
> I'll do some digging to see whether I have any examples of this.
> Best,
> --C
>
>
>
>
>
> > On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
> >
> > I'm trying to get my Drill SQL queries to produce the right thing from
> XML.
> >
> > A major thing that you can't easily infer from looking at just XML data
> is
> > what is an array. XML lacks an array starting indicator.
> >
> > Is there an inline schema notation in the Drill Query language for
> > array-ness, so that one can inform Drill what is an array?
> >
> > For example this provides simple types for all the fields directly in the
> > query.
> >
> > @Test
> >
> > public void testSimpleProvidedSchema() throws Exception {
> >
> >  String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
> > (type => 'xml', schema " +
> >
> >"=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
> > FLOAT, `double_field` DOUBLE, `boolean_field` " +
> >
> >"BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
> > TIMESTAMP, `string_field`" +
> >
> >" VARCHAR, `date2_field` DATE properties {`drill.format` =
> > `MM/dd/`})'))";
> >
> >  RowSet results = client.queryBuilder().sql(sql).rowSet();
> >
> >  assertEquals(2, results.rowCount());
> >
> >
> > Can one also tell Drill what fields or child elements are arrays?
>
>
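
[Editorial sketch, not part of the original message.] Assuming the
provided-schema array syntax Paul recalls above is ARRAY<type>, an inline
schema that declares an array column might be assembled as shown below. The
class name, the file name `xml/with_arrays.xml`, and the column names are
hypothetical. Per Charles's follow-up in this thread, the XML reader does not
yet support arrays, so this only illustrates the shape of the query string,
not a query that is known to work.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class InlineArraySchemaSketch {

  /** Builds the inline=( ... ) schema text from column name -> type pairs. */
  static String inlineSchema(Map<String, String> columns) {
    StringJoiner joiner = new StringJoiner(", ", "inline=(", ")");
    for (Map.Entry<String, String> column : columns.entrySet()) {
      // Back-tick quote each column name, as in the query from the original question.
      joiner.add("`" + column.getKey() + "` " + column.getValue());
    }
    return joiner.toString();
  }

  /** Wraps the schema in a table function call shaped like the one in the original test. */
  static String query(String path, Map<String, String> columns) {
    return "SELECT * FROM table(cp.`" + path + "` (type => 'xml', schema => '"
        + inlineSchema(columns) + "'))";
  }

  public static void main(String[] args) {
    Map<String, String> columns = new LinkedHashMap<>();
    columns.put("id", "INT");
    // Hypothetical repeated child element, declared as an array of strings.
    columns.put("tags", "ARRAY<VARCHAR>");
    System.out.println(query("xml/with_arrays.xml", columns));
  }
}
```

The helper only concatenates strings; whether a given reader honors the
declared array type is a separate question, which is what the rest of this
thread is about.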


Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-14 Thread Charles Givre
Hi Mike,
It is theoretically possible but I don't have an example of the syntax.  As 
you've probably figured out, Drill vectors have both a type and data mode.  The 
mode is either NULLABLE or REPEATED if I remember correctly.  Thus, you could 
tell Drill via the inline schema that the data mode for a given field is 
REPEATED and that would be the Drill equivalent of an Array.  I've never 
actually done this, so I don't really know if it would work for inline schemata 
but I'd assume that it would.

I'll do some digging to see whether I have any examples of this.
Best,
--C





> On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
> 
> I'm trying to get my Drill SQL queries to produce the right thing from XML.
> 
> A major thing that you can't easily infer from looking at just XML data is
> what is an array. XML lacks an array starting indicator.
> 
> Is there an inline schema notation in the Drill Query language for
> array-ness, so that one can inform Drill what is an array?
> 
> For example this provides simple types for all the fields directly in the
> query.
> 
> @Test
> 
> public void testSimpleProvidedSchema() throws Exception {
> 
>  String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
> (type => 'xml', schema " +
> 
>"=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
> FLOAT, `double_field` DOUBLE, `boolean_field` " +
> 
>"BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
> TIMESTAMP, `string_field`" +
> 
>" VARCHAR, `date2_field` DATE properties {`drill.format` =
> `MM/dd/`})'))";
> 
>  RowSet results = client.queryBuilder().sql(sql).rowSet();
> 
>  assertEquals(2, results.rowCount());
> 
> 
> Can one also tell Drill what fields or child elements are arrays?
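
[Editorial sketch, not part of the original message.] The type-plus-mode idea
described above can be modeled in a few lines. This is not Drill code: the
`Mode` enum and `Column` class are hypothetical stand-ins. As far as I know,
Drill's own data mode has three values (REQUIRED, OPTIONAL for nullable, and
REPEATED), and the schema text rendered here (ARRAY<type>, NOT NULL) is an
assumption about the provided-schema syntax rather than something confirmed
in this thread.

```java
public class DataModeSketch {

  /** Stand-in for a vector's data mode; REPEATED is the array case. */
  enum Mode { REQUIRED, OPTIONAL, REPEATED }

  /** Hypothetical column descriptor: a name, a scalar type name, and a mode. */
  static final class Column {
    final String name;
    final String type;
    final Mode mode;

    Column(String name, String type, Mode mode) {
      this.name = name;
      this.type = type;
      this.mode = mode;
    }

    /** A column is an array exactly when its mode is REPEATED. */
    boolean isArray() {
      return mode == Mode.REPEATED;
    }

    /** Renders the column the way it might appear in a provided schema (assumed syntax). */
    String toSchemaText() {
      switch (mode) {
        case REPEATED:
          return "`" + name + "` ARRAY<" + type + ">";
        case REQUIRED:
          return "`" + name + "` " + type + " NOT NULL";
        default:
          return "`" + name + "` " + type;
      }
    }
  }

  public static void main(String[] args) {
    Column scalar = new Column("id", "INT", Mode.OPTIONAL);
    Column array = new Column("tags", "VARCHAR", Mode.REPEATED);
    System.out.println(scalar.toSchemaText() + " array? " + scalar.isArray());
    System.out.println(array.toSchemaText() + " array? " + array.isArray());
  }
}
```

The point of the sketch is that array-ness is a property of the mode, not of
the scalar type, so the same VARCHAR type can describe either a single value
or a list of values.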





is there a way to provide inline array metadata to inform the xml_reader?

2023-08-14 Thread Mike Beckerle
I'm trying to get my Drill SQL queries to produce the right thing from XML.

A major thing that you can't easily infer from looking at just XML data is
what is an array. XML lacks an array starting indicator.

Is there an inline schema notation in the Drill Query language for
array-ness, so that one can inform Drill what is an array?

For example this provides simple types for all the fields directly in the
query.

@Test
public void testSimpleProvidedSchema() throws Exception {
  String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml` (type => 'xml', schema " +
      "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field` FLOAT, `double_field` DOUBLE, `boolean_field` " +
      "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field` TIMESTAMP, `string_field`" +
      " VARCHAR, `date2_field` DATE properties {`drill.format` = `MM/dd/`})'))";

  RowSet results = client.queryBuilder().sql(sql).rowSet();
  assertEquals(2, results.rowCount());
}


Can one also tell Drill what fields or child elements are arrays?
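
[Editorial sketch, not part of the original message.] The ambiguity described
above, that XML has no array start indicator, can be seen with the JDK's DOM
parser alone. The class name, element names, and sample records below are
made up for illustration. A record with one <phone> child looks exactly like
a scalar field; only a record that happens to repeat the element reveals that
it is a list.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlArrayAmbiguity {

  /** Counts how many times each element name occurs directly under the root element. */
  static Map<String, Integer> childCounts(String xml) {
    Element root;
    try {
      root = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
    Map<String, Integer> counts = new LinkedHashMap<>();
    NodeList children = root.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      // Skip text and whitespace nodes; only child elements matter here.
      if (child.getNodeType() == Node.ELEMENT_NODE) {
        counts.merge(child.getNodeName(), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    String onePhone = "<person><name>Ann</name><phone>555-0100</phone></person>";
    String twoPhones = "<person><name>Bob</name><phone>555-0101</phone>"
        + "<phone>555-0102</phone></person>";
    // From the first record alone, phone cannot be told apart from a scalar field.
    System.out.println(childCounts(onePhone));
    // Only the repetition in the second record shows that phone is a list.
    System.out.println(childCounts(twoPhones));
  }
}
```

A reader that infers structure from the data can therefore only guess, which
is why some out-of-band declaration (an inline schema, or eventually an XSD)
is needed to mark a field as an array.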