Vince - That is what I am doing now. Using MapR Volumes, I create a .stage_%epoch% for each file copy. Once the data is fully copied (and no longer has the _COPYING_ suffix), I do an NFS filesystem mv to the directory it actually belongs in.
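For illustration, the stage-then-rename step described above could look roughly like the sketch below. It is not the actual ETL code. It assumes that .stage_%epoch% is a hidden directory on the same filesystem as the table directory, and the function name `stage_then_publish` and its parameters are hypothetical. A random suffix is added to the epoch so that two copies started in the same second do not share a staging directory.

```python
import os
import shutil
import tempfile
import time


def stage_then_publish(src_file, table_dir, volume_root):
    """Copy src_file into a hidden staging directory, then rename it into table_dir.

    Hypothetical helper: the names and layout are assumptions for illustration.
    """
    # Dot-prefixed staging directory, so a reader that skips hidden paths
    # never sees the partially written file.
    stage_dir = tempfile.mkdtemp(
        prefix=".stage_%d_" % int(time.time()), dir=volume_root
    )

    staged = os.path.join(stage_dir, os.path.basename(src_file))
    shutil.copyfile(src_file, staged)  # the slow step happens out of sight

    os.makedirs(table_dir, exist_ok=True)
    final = os.path.join(table_dir, os.path.basename(src_file))

    # A rename within one filesystem does not rewrite the data; os.rename
    # raises OSError instead of copying if the two paths are on different
    # filesystems.
    os.rename(staged, final)

    os.rmdir(stage_dir)  # staging directory is empty again
    return final
```

The point of the design is that the file only ever appears in the table directory under its final name and with its full contents, so a query that lists the directory mid-copy has nothing partial to trip over.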
Now, this is messy, and it forced me to add more to my ETL. A couple of ideal things:

1. In my view, I could use the select with options feature to add a filemask. I.e., if I am looking at a directory (any directory, not just parquet), let me specify a filesystem glob (or fancier, a regex) that would allow me to tell Drill to use only these files. It has to be a select with options type of thing, because a setting like this should be on a per-table basis, not a system- or session-level option.

2. Make Drill smart enough to handle wildcards in directories (in the "FROM" definitions).

3. Allow a global, user-configurable "ignore these files for everything" setting. Drill already does this for hidden files (preceded by a .), but given everyone's unique snowflake systems, an admin may have other "always ignore this in queries" patterns. (Hadoop fs client users may specify *._COPYING_ as an always-exclude, but there may be others.)

On Thu, Jun 30, 2016 at 7:08 AM, Vince Gonzalez <[email protected]> wrote:

> I know it doesn't go right to the question of how to make drill ignore
> things, but could you copy the data into some parallel tree, then rename it
> into the appropriate directory once the copy is done?
>
> Or could that still cause a running query to fail?
>
> On Thursday, June 30, 2016, John Omernik <[email protected]> wrote:
>
> > I am doing query of source data that is two levels deep.
> >
> > tablename/p_day=2016-05-01/p_hour=1/file1.parquet
> >
> > I wasn't able to get wildcards at that level to work with dir0 etc.
> >
> > On Thu, Jun 30, 2016 at 12:39 AM, Ted Dunning <[email protected]> wrote:
> >
> > > Does it work to provide a wild card in your source spec?
> > >
> > > a la dfs.tdunning.`/user/tdunning/foo/data/*.parquet`
> > >
> > > ?
> > >
> > > On Wed, Jun 29, 2016 at 1:06 PM, John Omernik <[email protected]> wrote:
> > >
> > > > When the Hadoop FS client copies files (say parquet files) It adds a
> > > > ._COPYING_ at the end of the file until it's complete. If that's there
> > > > Drill fails (partial files etc).
> > > >
> > > > I know I can ignore files that start with . (or directories) but is there a
> > > > good way to tell Drill to ignore files that are not *.parquet, or that have
> > > > ._COPYING_ at the end of them?
> > > >
> > > > Thanks!
> > > >
> > > > John
>
> --
> ----
> Vince Gonzalez
> Systems Engineer
> 212.694.3879
>
> mapr.com
