Re: querying nested parquet directory structure

Rafael Jaimes III Fri, 20 Aug 2021 04:57:32 -0700

Thanks Charles.

I'm asking more about the case where you don't know the names of the
directories. In your example, you have to know that data1 and study1 are called
that. How do you find this information? Sure, you can examine the file system
separately from Drill.


Is information about the file system path names available within Drill, such as
in INFORMATION_SCHEMA or similar?

In short, I'm wondering if it's possible to run a command like LIST TABLES; and
have it return study1.data1, study1.data2.
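
For illustration, the "examine the file system separately from Drill" route could look roughly like the sketch below. It lists every directory a fixed number of levels under a root as a dotted name such as study1.data1, and builds a path reference in the dfs.`<path>/study1/data1` form used in the quoted reply. The function names, the depth parameter, and the default plugin name dfs are assumptions made for this example, not part of Drill, and directory names that themselves contain dots are not handled.

```python
from pathlib import Path


def list_directory_tables(root, depth=2):
    """Return sorted dotted names (e.g. 'study1.data1') for every
    directory exactly `depth` levels below `root`."""
    root = Path(root)
    # "*/*" matches entries exactly two levels down, "*/*/*" three, etc.
    pattern = "/".join(["*"] * depth)
    names = []
    for path in root.glob(pattern):
        if path.is_dir():  # skip plain files at that level
            names.append(".".join(path.relative_to(root).parts))
    return sorted(names)


def drill_from_clause(root, dotted_name, plugin="dfs"):
    """Turn 'study1.data1' into a backtick-quoted path reference.
    The plugin name 'dfs' is an assumption; use your own configuration."""
    full_path = Path(root, *dotted_name.split("."))
    return f"{plugin}.`{full_path.as_posix()}`"


if __name__ == "__main__":
    import sys

    # Usage: python list_tables.py /path/to/root
    root_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in list_directory_tables(root_dir):
        print(name, "->", drill_from_clause(root_dir, name))
```

Each printed reference could then be placed in the FROM clause of a query, which is the "drive the tables" idea from the original question.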

On August 20, 2021 6:49:42 AM EDT, luoc <[email protected]> wrote:
>
>Best practices. Schema-free in Drill.
>
>> On Aug 20, 2021, at 12:04, Charles Givre <[email protected]> wrote:
>> 
>> Hi Rafael, 
>> If you're asking what I think you're asking, it sounds as if you'd like to 
>> query multiple files in a nested directory.  If that's the case, I have some 
>> good news...
>> Drill allows you to query entire directories as if they were one big file.  
>> Effectively Drill performs a UNION on those files, so the end result is that 
>> they appear to be one big table. 
>> Thus, with the structure you provided, you could do the following:
>> 
>> SELECT ...
>> FROM dfs.`<path>/study1/data1`
>> 
>> That would roll up all the files under that directory path.  Now, there are
>> some tricks that you should be aware of.  The first is implicit columns.
>> These can help you figure out the directory structure and do some basic
>> filtering.  There are also some functions that are specific to
>> querying directories.  Take a look at the links below for references on
>> the implicit columns as well as the directory functions.
>> 
>> https://drill.apache.org/docs/querying-a-file-system-introduction/
>> https://drill.apache.org/docs/querying-directories/
>> 
>> Best,
>> -- C
>> 
>> 
>> 
>>> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <[email protected]> wrote:
>>> 
>>> Hi all,
>>> 
>>> I have a large dataset of parquet files that are nested within several
>>> subdirectories. For example:
>>> 
>>> study1
>>> |----data1
>>>   |----2020-01-01
>>>       |---0001.parquet
>>> |----data2
>>> 
>>> study2
>>> |----dataA
>>> |----dataB
>>> 
>>> Is it possible for Drill to report back the "directories" as "tables"? For
>>> example to perform a query and return something that tells me the directory
>>> structure?
>>> 
>>> I've read something about creating workspaces, but to do so for each of the
>>> directories seems onerous, and also requires going into the storage plugin
>>> configuration.
>>> 
>>> The alternative would be to implement some logic to traverse the file
>>> system outside of Drill, and then use that information to drive the
>>> "tables" for the queries. That seems unintuitive, though, given Drill's
>>> ability to traverse the file system, infer schema, create cache, and so on.
>>> 
>>> Thanks,
>>> Rafael
>> 
>
