Best practices. Schema-free in Drill.
> On Aug 20, 2021, at 12:04, Charles Givre <[email protected]> wrote:
>
> Hi Rafael,
> If you're asking what I think you're asking, it sounds as if you'd like to
> query multiple files in a nested directory. If that's the case, I have some
> good news...
> Drill allows you to query entire directories as if they were one big file.
> Effectively, Drill performs a UNION on those files, so the end result is that
> they appear to be one big table.
> Thus, with the structure you provided, you could do the following:
>
> SELECT ...
> FROM dfs.`<path>/study1/data1`
>
> That would roll up all the files under that directory path. Now, there are
> some tricks that you should be aware of. The first is implicit columns.
> These can help you figure out the directory structure and do some basic
> filtering. There are also some specific functions that are unique to
> querying directories. Take a look at the links below for references about
> the implicit fields as well as the directory functions.
>
> https://drill.apache.org/docs/querying-a-file-system-introduction/
> https://drill.apache.org/docs/querying-directories/
>
> Best,
> -- C
>
>> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <[email protected]> wrote:
>>
>> Hi all,
>>
>> I have a large dataset of parquet files that are nested within several
>> subdirectories. For example:
>>
>> study1
>> |----data1
>>      |----2020-01-01
>>           |---0001.parquet
>> |----data2
>>
>> study2
>> |----dataA
>> |----dataB
>>
>> Is it possible for Drill to report back the "directories" as "tables"? For
>> example, to perform a query and return something that tells me the directory
>> structure?
>>
>> I've read something about creating workspaces, but to do so for each of the
>> directories seems onerous, and also requires going into the storage plugin
>> configuration.
>>
>> The alternative would be to implement some logic and traverse the file
>> system, outside of Drill, and then use that information to drive the
>> "tables" for the queries. That seems unintuitive, though, given Drill's
>> ability to traverse the file system, infer schema, create cache, and so on.
>>
>> Thanks,
>> Rafael
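
For anyone finding this thread later, here is a minimal sketch of how the implicit columns mentioned in the reply might answer the original question about listing directories. It assumes the layout from the thread (study1/data1/2020-01-01/0001.parquet) and keeps the `<path>` placeholder exactly as written; the column aliases are made up for illustration. When a query targets a directory, Drill exposes the subdirectory levels beneath it as dir0, dir1, and so on, and the file name as the implicit column filename. As far as I know, filename has to be named explicitly in the select list.

```sql
-- Sketch only: <path> is the placeholder from the thread; aliases are illustrative.
-- The query targets study1, so dir0 is the first level below it (data1, data2)
-- and dir1 is the second level (for example 2020-01-01).
SELECT DISTINCT
  dir0     AS data_dir,
  dir1     AS date_dir,
  filename AS file_name
FROM dfs.`<path>/study1`;

-- Restrict the scan to one subdirectory by filtering on dir0.
SELECT COUNT(*) AS row_count
FROM dfs.`<path>/study1`
WHERE dir0 = 'data1';
```

The first query returns one row per distinct directory/file combination, which can stand in for a directory listing without defining a workspace per directory. Where a branch has fewer levels than the deepest one, the unused dirN columns should come back as NULL.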
