Re: querying nested parquet directory structure

Charles Givre Fri, 20 Aug 2021 07:39:38 -0700

Rafael, 
Thanks for the pointers, James.  The INFORMATION_SCHEMA provides a lot of info 
and you can filter by whether a file is a directory or not.
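
As a rough sketch of that filter: the column names (SCHEMA_NAME, RELATIVE_PATH, IS_DIRECTORY) are from my recollection of the INFORMATION_SCHEMA docs, so run a plain select * against the files table on your version to confirm them first. The schema name 'dfs.studies' is only a placeholder for whichever workspace holds the data.

```sql
-- List only the directories Drill can see in one workspace.
-- 'dfs.studies' is a placeholder schema (workspace) name, not a real one.
SELECT schema_name, relative_path
FROM information_schema.`files`
WHERE schema_name = 'dfs.studies'
  AND is_directory = true;
```

As James notes below, this can be slow on a large tree, especially with storage.list_files_recursively enabled.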



There's one other thing you should be aware of: when you are querying nested 
directories, Drill adds implicit fields called dirN, where N is the level of 
nesting, starting at zero.  

So ... Using the structure provided below, you could do something like:

SELECT *, dir0, dir1 
FROM dfs.studies
WHERE dir0='study1'
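
If the directory names are not known up front, the same implicit columns can be used to list them. This is only a sketch, and it assumes dfs.studies resolves to the directory that contains study1 and study2, so that dir0 is the study and dir1 is the data directory:

```sql
-- Enumerate the first two directory levels without knowing their names.
-- Assumes dfs.studies points at the parent of study1 and study2.
SELECT DISTINCT dir0, dir1
FROM dfs.studies
ORDER BY dir0, dir1;
```

Keep in mind that Drill may have to touch every file under that path to answer this, so it can be slow on a large dataset.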

There are a few functions for directory traversals as this can be an expensive 
operation.  I can point you to the docs for those as well. 

Best,
-- C



> On Aug 20, 2021, at 8:34 AM, James Turton <[email protected]> 
> wrote:
> 
> Some pointers, in case you're not already aware of them.
> 
> https://drill.apache.org/docs/querying-the-information-schema/
> 
> show files in dfs.foo;
> show files in dfs.`/foo/bar`;
> 
> select * from information_schema.`files`;
> 
> In my experience, be careful of performance when using the last one.  
> Especially if you've set the option storage.list_files_recursively = true;
> 
> 
> On 2021/08/20 13:57, Rafael Jaimes III wrote:
>> Thanks Charles.
>> 
>> My question is more about the case where you don't know the names of the 
>> directories. In your example, you have to know that data1 and study1 are 
>> called that. How do you find this information? Sure, you can examine the 
>> file system separately from Drill.
>> 
>> Is there information of the file system path names within Drill, such as in 
>> INFORMATION_SCHEMA or similar?
>> 
>> In short, I'm wondering if it's possible to run a command like LIST TABLES; 
>> and have it return study1.data1, study1.data2
>> 
>> On August 20, 2021 6:49:42 AM EDT, luoc <[email protected]> wrote:
>>> Best practices. Schema-free in Drill.
>>> 
>>>> On Aug 20, 2021, at 12:04, Charles Givre <[email protected]> wrote:
>>>> 
>>>> Hi Rafael,
>>>> If you're asking what I think you're asking, it sounds as if you'd like to 
>>>> query multiple files in a nested directory.  If that's the case, I have 
>>>> some good news...
>>>> Drill allows you to query entire directories as if they were one big file. 
>>>>  Effectively Drill performs a UNION on those files, so the end result is 
>>>> that they appear to be one big table.
>>>> Thus, with the structure you provided, you could do the following:
>>>> 
>>>> SELECT ...
>>>> FROM dfs.`<path>/study1/data1`
>>>> 
>>>> That would roll up all the files under that directory path.  Now, there 
>>>> are some tricks that you should be aware of.  The first is implicit 
>>>> columns.  These can help you figure out the directory structure as well as 
>>>> some basic filtering.  There are also some specific functions that are 
>>>> unique to querying directories.  Take a look at the links below for 
>>>> references about the implicit fields as well as the directory functions.
>>>> 
>>>> https://drill.apache.org/docs/querying-a-file-system-introduction/
>>>> https://drill.apache.org/docs/querying-directories/
>>>> 
>>>> Best,
>>>> -- C
>>>> 
>>>> 
>>>> 
>>>>> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <[email protected]> 
>>>>> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I have a large dataset of parquet files that are nested within several
>>>>> subdirectories. For example:
>>>>> 
>>>>> study1
>>>>> |----data1
>>>>>   |----2020-01-01
>>>>>       |---0001.parquet
>>>>> |----data2
>>>>> 
>>>>> study2
>>>>> |----dataA
>>>>> |----dataB
>>>>> 
>>>>> Is it possible for Drill to report back the "directories" as "tables"? For
>>>>> example to perform a query and return something that tells me the
>>>>> directory structure?
>>>>> 
>>>>> I've read something about creating workspaces, but to do so for each of
>>>>> the directories seems onerous, and also requires going into the storage
>>>>> plugin configuration.
>>>>> 
>>>>> The alternative would be to implement some logic and traverse the file
>>>>> system, outside of Drill, and then use that information to drive the
>>>>> "tables" for the queries. That seems unintuitive, though, given Drill's
>>>>> ability to traverse the file system, infer schema, create cache, and so
>>>>> on.
>>>>> 
>>>>> Thanks,
>>>>> Rafael
>
