Hi John, I've tried something like the following successfully -
select foo from tablename/*/p_hour=1

and that will read all directories 'p_day=nnn' where the subdirectory is
'p_hour=1'

Parth

On Thu, Jun 30, 2016 at 6:43 AM, John Omernik <[email protected]> wrote:

> Vince - That is what I am doing now. Using MapR Volumes, I am creating a
> .stage_%epoch% for each file copy. Once the data is fully copied (and no
> longer has the _COPYING_) I do an NFS filesystem mv to the directory it
> actually belongs in.
>
> Now, this is messy, and forced me to add more to my ETL. A couple of
> ideal things:
>
> 1. In my view, I could use the select with options feature to add a
> filemask. I.e. if I am looking at a directory (any directory, not just
> parquet) let me specify a filesystem glob (or fancier, a regex) that would
> allow me to tell Drill, only use these files. It has to be a select with
> options type thing, because a setting like this should be on a per-table
> basis, not a system or session level option.
>
> 2. Make Drill smart enough to handle wildcards in directories (in the
> "FROM" definitions)
>
> 3. Allow a global "ignore these files for everything" user configurable
> setting. Drill already does this for hidden files (preceded by a .), but
> given everyone's unique snowflake systems, an admin may have other "always
> ignore this in queries" patterns. (hadoop fs client users may specify
> *._COPYING_ as an always exclude, but there may be others.)
>
> On Thu, Jun 30, 2016 at 7:08 AM, Vince Gonzalez <[email protected]>
> wrote:
>
> > I know it doesn't go right to the question of how to make drill ignore
> > things, but could you copy the data into some parallel tree, then rename
> > it into the appropriate directory once the copy is done?
> >
> > Or could that still cause a running query to fail?
> >
> > On Thursday, June 30, 2016, John Omernik <[email protected]> wrote:
> >
> > > I am doing a query of source data that is two levels deep.
> > >
> > > tablename/p_day=2016-05-01/p_hour=1/file1.parquet
> > >
> > > I wasn't able to get wildcards at that level to work with dir0 etc.
> > >
> > > On Thu, Jun 30, 2016 at 12:39 AM, Ted Dunning <[email protected]>
> > > wrote:
> > >
> > > > Does it work to provide a wild card in your source spec?
> > > >
> > > > a la dfs.tdunning.`/user/tdunning/foo/data/*.parquet`
> > > >
> > > > ?
> > > >
> > > > On Wed, Jun 29, 2016 at 1:06 PM, John Omernik <[email protected]>
> > > > wrote:
> > > >
> > > > > When the Hadoop FS client copies files (say parquet files) it adds
> > > > > a ._COPYING_ at the end of the file until it's complete. If that's
> > > > > there, Drill fails (partial files etc).
> > > > >
> > > > > I know I can ignore files that start with . (or directories) but
> > > > > is there a good way to tell Drill to ignore files that are not
> > > > > *.parquet, or that have ._COPYING_ at the end of them?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > John
> >
> > --
> > Vince Gonzalez
> > Systems Engineer
> > 212.694.3879
> >
> > mapr.com
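
For anyone finding this thread later, the stage-then-rename workaround
discussed above (copy into a hidden staging directory, then mv into the
real partition directory once the copy is complete) might look roughly
like the Python sketch below. This is only an illustration under stated
assumptions: the function name publish_file, the ".stage_" prefix handling,
and the directory layout are hypothetical, and it assumes the staging
directory and the target partition live on the same filesystem so that the
final rename is a single atomic step. It has not been tested against Drill
or a MapR NFS mount.

```python
import os
import shutil
import tempfile


def publish_file(src_path, table_root, partition):
    """Copy src_path under table_root/partition without exposing a partial file.

    Hypothetical helper: the file is first copied into a hidden staging
    directory (name starts with '.', which the thread says Drill ignores),
    then renamed into its final partition directory.
    """
    os.makedirs(table_root, exist_ok=True)
    final_dir = os.path.join(table_root, partition)
    os.makedirs(final_dir, exist_ok=True)

    # Unique hidden staging directory, e.g. tablename/.stage_ab12cd
    stage_dir = tempfile.mkdtemp(prefix=".stage_", dir=table_root)
    name = os.path.basename(src_path)
    staged = os.path.join(stage_dir, name)
    final_path = os.path.join(final_dir, name)
    try:
        # The slow part: readers only ever see this inside the hidden dir.
        shutil.copyfile(src_path, staged)
        # The fast part: one rename on the same filesystem.
        os.rename(staged, final_path)
    finally:
        # Remove the staging directory (and any leftover partial copy).
        shutil.rmtree(stage_dir, ignore_errors=True)
    return final_path
```

Calling publish_file("/tmp/file1.parquet", "/mapr/cluster/tablename",
"p_day=2016-05-01/p_hour=1") (paths made up for the example) would leave
only the finished file1.parquet visible under the partition directory.
Whether a query that is already running tolerates a new file appearing
mid-scan is a separate question that this sketch does not address.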
