Rafael,
Thanks for the pointers, James. INFORMATION_SCHEMA provides a lot of info, and you can filter by whether an entry is a directory or not.
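For what it's worth, a rough sketch of that kind of filter is below. Treat it as an illustration rather than something I've run against your setup: dfs.studies is a placeholder workspace name, and I'm assuming the FILES table exposes SCHEMA_NAME, RELATIVE_PATH and IS_DIRECTORY columns, which is worth checking against your Drill version.

```sql
-- Sketch only: list the directories Drill can see in one workspace.
-- 'dfs.studies' is a placeholder workspace name; the column names
-- (SCHEMA_NAME, RELATIVE_PATH, IS_DIRECTORY) are assumed and should be
-- confirmed with: SELECT * FROM INFORMATION_SCHEMA.`FILES` LIMIT 1;
SELECT SCHEMA_NAME, RELATIVE_PATH
FROM INFORMATION_SCHEMA.`FILES`
WHERE SCHEMA_NAME = 'dfs.studies'
  AND IS_DIRECTORY = true;
```

As James notes, how deep this listing goes depends on storage.list_files_recursively, and turning that on can make the query slow on large trees.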
There's one other thing you should be aware of: when you query nested directories, Drill adds implicit fields called dirN, where N is the level of nesting, starting at zero. So, using the structure provided below, you could do something like:

    SELECT *, dir0, dir1
    FROM dfs.studies
    WHERE dir1='study1'

There are also a few functions for directory traversal, since this can be an expensive operation. I can point you to the docs for those as well.
Best,
-- C

> On Aug 20, 2021, at 8:34 AM, James Turton <[email protected]> wrote:
>
> Some pointers, in case you're not already aware of them.
>
> https://drill.apache.org/docs/querying-the-information-schema/
>
> show files in dfs.foo;
> show files in dfs.`/foo/bar`;
>
> select * from information_schema.`files`;
>
> In my experience, be careful of performance when using the last one.
> Especially if you've set the option storage.list_files_recursively = true;
>
>
> On 2021/08/20 13:57, Rafael Jaimes III wrote:
>> Thanks Charles.
>>
>> I'm wondering more along the lines if you don't know the name of the
>> directories. In your example, you have to know that data1 and study1 are
>> called that. How do you find this information? Sure you can examine the file
>> system separate from Drill.
>>
>> Is there information of the file system path names within Drill, such as in
>> INFORMATION_SCHEMA or similar?
>>
>> In short I'm wondering if it's possible to have a command like LIST TABLES;
>> and have returned study1.data1 , study1.data2
>>
>> On August 20, 2021 6:49:42 AM EDT, luoc <[email protected]> wrote:
>>> Best practices. Schema-free in Drill.
>>>
>>>> On Aug 20, 2021, at 12:04, Charles Givre <[email protected]> wrote:
>>>>
>>>> Hi Rafael,
>>>> If you're asking what I think you're asking, it sounds as if you'd like to
>>>> query multiple files in a nested directory. If that's the case, I have
>>>> some good news...
>>>> Drill allows you to query entire directories as if they were one big file.
>>>> Effectively Drill performs a UNION on those files, so the end result is
>>>> that they appear to be one big table.
>>>> Thus, with the structure you provided, you could do the following:
>>>>
>>>> SELECT ...
>>>> FROM dfs.`<path>/study1/data1`
>>>>
>>>> That would roll up all the files under that directory path. Now, there
>>>> are some tricks that you should be aware of. The first are implicit
>>>> columns. These can help you figure out the directory structure as well as
>>>> some basic filtering. There are also some specific functions that are
>>>> unique to querying directories. Take a look at the links below for
>>>> references about the implicit fields as well as the directory functions.
>>>>
>>>> https://drill.apache.org/docs/querying-a-file-system-introduction/
>>>> https://drill.apache.org/docs/querying-directories/
>>>>
>>>> Best,
>>>> -- C
>>>>
>>>>
>>>>
>>>>> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a large dataset of parquet files that are nested within several
>>>>> subdirectories. For example:
>>>>>
>>>>> study1
>>>>> |----data1
>>>>>      |----2020-01-01
>>>>>           |---0001.parquet
>>>>> |----data2
>>>>>
>>>>> study2
>>>>> |----dataA
>>>>> |----dataB
>>>>>
>>>>> Is it possible for Drill to report back the "directories" as "tables"? For
>>>>> example to perform a query and return something that tells me the
>>>>> directory structure?
>>>>>
>>>>> I've read something about creating workspaces, but to do so for each of
>>>>> the directories seems onerous, and also requires going into the storage
>>>>> plugin configuration.
>>>>>
>>>>> The alternative would be to implement some logic and traverse the file
>>>>> system, outside of Drill, and then use that information to drive the
>>>>> "tables" for the queries.
>>>>> Although, that seems unintuitive provided Drill's ability to traverse
>>>>> the file system, infer schema, create cache, and so on.
>>>>>
>>>>> Thanks,
>>>>> Rafael
