Re: default to waiting for net/full schema? [was: Re: Desired Behavior when a table has both files and folders?]

Jacques Nadeau Thu, 30 Apr 2015 19:12:32 -0700

removing all the cross-posting...

Because Drill is a streaming engine, it has no way to know the schema of all
the data in schemaless sources without first reading it.  Holding the entire
dataset in memory (or reading it twice) is too big a penalty.  Enhancements to
do sampling would be ideal.  Generally, we either guess or we know.  Right now
we guess (with not very good information).  It seems like we should definitely
improve our guesses.  Knowing is too expensive in some cases.
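
To make the guess-versus-know trade-off concrete, here is a minimal sketch in
Python.  It is not Drill code: guess_schema and sample_size are hypothetical
names, and the records are made up to resemble the two JSON files in the quoted
thread.  It guesses the column list from a bounded sample of records rather
than reading everything.

```python
import json
from itertools import islice


def guess_schema(lines, sample_size):
    """Return the union of top-level keys seen in the first sample_size records."""
    columns = []
    for line in islice(lines, sample_size):
        for key in json.loads(line):
            if key not in columns:
                columns.append(key)
    return columns


# Hypothetical records, in the order a scan might happen to deliver them.
records = ['{"col1": 1}', '{"col2": 2}', '{"col2": 3}', '{"col1": 4}']

# Guessing from one record misses col2; a slightly larger sample finds it.
first_record_guess = guess_schema(records, sample_size=1)
sampled_guess = guess_schema(records, sample_size=3)

# "Knowing" means reading every record, which is the expensive case.
full_scan = guess_schema(records, sample_size=len(records))
```

A larger sample improves the guess but can still miss a column that first
appears later in the data, so this only narrows the gap between guessing and
knowing.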


On Thu, Apr 30, 2015 at 6:08 PM, Daniel Barclay <[email protected]>
wrote:

> Should Drill default to not sending changing schema information (that is,
> to waiting until it has all the schema information before returning any
> through JDBC), and only send changing schemas when the client has somehow
> told Drill that it can handle changing schemas (e.g., when the client
> registers a handler for schema changes, or in some connection property)?
>
> Then Drill would work "normally" in regular JDBC tools (they won't fail
> to show columns that didn't exist in earlier rows or--worse--crash trying
> to access columns that no longer exist in later rows), but Drill could
> still incrementally return changing schema information to clients that
> can handle it.
>
> Daniel
>
>
> Steven Phillips wrote:
>
>> I believe the missing columns are due to a limitation in sqlline itself. For
>> this query, Drill doesn't know in advance what columns will be returned. It
>> just returns them as they come. When the first batch gets back to sqlline,
>> it will assume that whatever columns it receives in that batch are the only
>> columns this query will return. And it ignores any new columns that show up.
>>
>> On Wed, Apr 29, 2015 at 6:20 PM, Hao Zhu <[email protected]> wrote:
>>
>>> You can specify the column names.
>>> "select *"  explores the schema by itself.
>>>
>>>  select * from `data`;
>>>>
>>> +------------+------------+
>>> |    dir0    |    col1    |
>>> +------------+------------+
>>> | null       | 1          |
>>> | folder1    | null       |
>>> | folder1    | null       |
>>> | folder1    | 4          |
>>> +------------+------------+
>>> 4 rows selected (0.074 seconds)
>>>
>>>> select dir0,col1,col2 from `data`;
>>>>
>>> +------------+------------+------------+
>>> |    dir0    |    col1    |    col2    |
>>> +------------+------------+------------+
>>> | null       | 1          | null       |
>>> | folder1    | null       | 3          |
>>> | folder1    | null       | 2          |
>>> | folder1    | 4          | null       |
>>> +------------+------------+------------+
>>> 4 rows selected (0.088 seconds)
>>>
>>>> select dir0,col1,col2,col3 from `data`;
>>>>
>>> +------------+------------+------------+------------+
>>> |    dir0    |    col1    |    col2    |    col3    |
>>> +------------+------------+------------+------------+
>>> | null       | 1          | null       | null       |
>>> | folder1    | null       | 3          | null       |
>>> | folder1    | null       | 2          | null       |
>>> | folder1    | 4          | null       | null       |
>>> +------------+------------+------------+------------+
>>> 4 rows selected (0.098 seconds)
>>>
>>> Thanks,
>>> Hao
>>>
>>> On Wed, Apr 29, 2015 at 5:14 PM, rahul challapalli <
>>> [email protected]> wrote:
>>>
>>>> What is the desired behavior when I run "select * from data;" on the
>>>> below structure?
>>>>
>>>> data/
>>>>    -- file1.json
>>>>    -- folder1/
>>>>         -- file2.json
>>>>
>>>> file1.json : {"col1" : 1}
>>>> file2.json : {"col2" : 2}
>>>>
>>>> This is what drill returns :
>>>> +------------+------------+
>>>> |    dir0    |    col2    |
>>>> +------------+------------+
>>>> | folder1    | 2          |
>>>> | null       | null       |
>>>> +------------+------------+
>>>>
>>>> Looks like drill ignored the columns from the first file.
>>>>
>>>> - Rahul
>>>>
>>>>
>>>
>>
>>
>>
>
> --
> Daniel Barclay
> MapR Technologies
>
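
The two client behaviors discussed in the quoted thread can be sketched as
follows.  This is an illustration in Python, not sqlline or Drill code: the
function names are hypothetical, and the way rows are grouped into batches is
an assumption made for the example.  The first function fixes its columns from
the first batch and ignores later ones; the second buffers every batch and
uses the net (union) schema.

```python
def columns_of(rows):
    """Collect column names in first-seen order."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return columns


def rows_with_first_batch_schema(batches):
    """Fix the columns from the first batch; columns that appear later are dropped."""
    columns = columns_of(batches[0])
    rows = [[row.get(c) for c in columns] for batch in batches for row in batch]
    return columns, rows


def rows_with_net_schema(batches):
    """Buffer all batches, then emit rows against the union of all columns."""
    all_rows = [row for batch in batches for row in batch]
    columns = columns_of(all_rows)
    rows = [[row.get(c) for c in columns] for row in all_rows]
    return columns, rows


# Assumed batching of the four rows from the quoted example.
batches = [
    [{"dir0": None, "col1": 1}],
    [{"dir0": "folder1", "col2": 3}, {"dir0": "folder1", "col2": 2}],
    [{"dir0": "folder1", "col1": 4}],
]

first_cols, first_rows = rows_with_first_batch_schema(batches)
net_cols, net_rows = rows_with_net_schema(batches)
```

With this batching, the first function yields only dir0 and col1, like the
quoted "select *" output, while the second yields dir0, col1 and col2 at the
cost of holding every batch before returning anything.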
