Re: querying nested parquet directory structure

luoc Fri, 20 Aug 2021 03:50:16 -0700


What Charles describes is the best practice. Drill is schema-free.


> On Aug 20, 2021, at 12:04, Charles Givre <[email protected]> wrote:
> 
> Hi Rafael, 
> If you're asking what I think you're asking, it sounds as if you'd like to 
> query multiple files in a nested directory.  If that's the case, I have some 
> good news...
> Drill allows you to query entire directories as if they were one big file.  
> Effectively Drill performs a UNION on those files, so the end result is that 
> they appear to be one big table. 
> Thus, with the structure you provided, you could do the following:
> 
> SELECT ...
> FROM dfs.`<path>/study1/data1`
> 
> That would roll up all the files under that directory path.  Now, there are 
> some tricks that you should be aware of.  The first is implicit columns.  
> These can help you figure out the directory structure and do some basic 
> filtering.  There are also some functions that are specific to querying 
> directories.  Take a look at the links below for references about the 
> implicit columns as well as the directory functions.  
> 
> https://drill.apache.org/docs/querying-a-file-system-introduction/
> https://drill.apache.org/docs/querying-directories/
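> 
> As a rough sketch of how the implicit columns could report the directory 
> structure: the queries below assume the layout from the original question 
> (study1/data1/2020-01-01/0001.parquet) and keep <path> as the same 
> unexpanded placeholder used above.  The aliases data_dir, date_dir and 
> row_count are made up for illustration; dir0, dir1 and filename are 
> Drill's implicit columns.
> 
> ```sql
> -- List each subdirectory level and file found under study1.
> -- dir0 is the first directory level below the queried path (e.g. data1),
> -- dir1 is the second level (e.g. 2020-01-01).
> SELECT DISTINCT dir0 AS data_dir, dir1 AS date_dir, filename
> FROM dfs.`<path>/study1`
> ORDER BY data_dir, date_dir, filename;
> 
> -- Use the same columns to restrict a query to one subdirectory.
> SELECT COUNT(*) AS row_count
> FROM dfs.`<path>/study1`
> WHERE dir0 = 'data1' AND dir1 = '2020-01-01';
> ```
> 
> Queried this way, the directories behave like partition columns of one 
> table, so no workspace per directory should be needed.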
> 
> Best,
> -- C
> 
> 
> 
>> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <[email protected]> wrote:
>> 
>> Hi all,
>> 
>> I have a large dataset of parquet files that are nested within several
>> subdirectories. For example:
>> 
>> study1
>> |----data1
>>   |----2020-01-01
>>       |---0001.parquet
>> |----data2
>> 
>> study2
>> |----dataA
>> |----dataB
>> 
>> Is it possible for Drill to report back the "directories" as "tables"? For
>> example to perform a query and return something that tells me the directory
>> structure?
>> 
>> I've read something about creating workspaces, but to do so for each of the
>> directories seems onerous, and also requires going into the storage plugin
>> configuration.
>> 
>> The alternative would be to implement some logic and traverse the file
>> system, outside of Drill, and then use that information to drive the
>> "tables" for the queries. Although, that seems unintuitive given Drill's
>> ability to traverse the file system, infer schema, create cache, and so on.
>> 
>> Thanks,
>> Rafael
>
