Vince - That is what I am doing now. Using MapR Volumes, I create a
.stage_%epoch% for each file copy. Once the data is fully copied (and no
longer has the _COPYING_ suffix) I do an NFS filesystem mv to the directory
it actually belongs in.
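
For anyone who wants to try the same approach, here is a rough sketch of the stage-then-rename step in Python. It is not my actual ETL code. It assumes the volume is NFS mounted, that .stage_%epoch% is a hidden directory inside the table directory (so the final rename stays on one filesystem), and the function name stage_then_publish is made up for illustration.

```python
import os
import shutil
import time


def stage_then_publish(src_file, table_dir):
    """Copy src_file into a hidden staging directory, then rename it into table_dir.

    Assumption: the staging directory sits inside table_dir, so the rename
    happens within one filesystem and the file appears in a single step.
    """
    # The leading "." keeps the staging directory hidden while the copy runs.
    stage_dir = os.path.join(table_dir, ".stage_%d" % int(time.time()))
    os.makedirs(stage_dir, exist_ok=True)

    name = os.path.basename(src_file)
    staged = os.path.join(stage_dir, name)
    shutil.copyfile(src_file, staged)  # the slow part happens out of sight

    final = os.path.join(table_dir, name)
    os.rename(staged, final)           # the "mv" into the real directory
    os.rmdir(stage_dir)                # staging directory is empty again
    return final
```

The point of the design is that a query only ever sees the file under its final name, after the copy has finished.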

Now, this is messy, and it forced me to add more to my ETL.  A few things
that would be ideal:

1. In my view, I could use the select with options feature to add a
filemask.  I.e., if I am looking at a directory (any directory, not just
parquet), let me specify a filesystem glob (or fancier, a regex) that would
allow me to tell Drill to use only these files.  It has to be a select with
options type thing, because a setting like this should be on a per-table
basis, not a system or session level option.

2.  Make Drill smart enough to handle wildcards in directories (in the
"FROM" definitions)

3. Allow a global "ignore these files for everything" user-configurable
setting. Drill already does this for hidden files (preceded by a .), but
given everyone's unique snowflake systems, an admin may have other "always
ignore this in queries" patterns. (Hadoop fs client users may specify
*._COPYING_ as an always exclude, but there may be others.)
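
To make the semantics I am asking for concrete, here is a small Python sketch. It is not a Drill feature or API, just an illustration: the function select_files and its include/exclude parameters are hypothetical. The include glob plays the role of the per-table filemask from item 1, and the exclude globs play the role of the global ignore list from item 3.

```python
from fnmatch import fnmatchcase


def select_files(names, include="*", exclude=(".*", "*._COPYING_")):
    """Return the file names a scan should read.

    include: per-table glob (hypothetical "filemask" option).
    exclude: global "always ignore" globs (hypothetical admin setting);
             the defaults cover hidden files and in-flight Hadoop copies.
    """
    return [
        n for n in names
        if fnmatchcase(n, include)
        and not any(fnmatchcase(n, pat) for pat in exclude)
    ]


listing = [
    "file1.parquet",
    "file2.parquet._COPYING_",
    ".hidden.parquet",
    "notes.txt",
]
print(select_files(listing, include="*.parquet"))  # ['file1.parquet']
```

With only the default include of "*", notes.txt would also be returned, which is why the mask needs to be settable per table.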



On Thu, Jun 30, 2016 at 7:08 AM, Vince Gonzalez <[email protected]> wrote:

> I know it doesn't go right to the question of how to make drill ignore
> things, but could you copy the data into some parallel tree, then rename it
> into the appropriate directory once the copy is done?
>
> Or could that still cause a running query to fail?
>
> On Thursday, June 30, 2016, John Omernik <[email protected]> wrote:
>
> > I am doing query of source data that is two levels deep.
> >
> > tablename/p_day=2016-05-01/p_hour=1/file1.parquet
> >
> > I wasn't able to get wildcards at that level to work with dir0 etc.
> >
> >
> >
> >
> > On Thu, Jun 30, 2016 at 12:39 AM, Ted Dunning <[email protected]> wrote:
> >
> > > Does it work to provide a wild card in your source spec?
> > >
> > > a la dfs.tdunning.`/user/tdunning/foo/data/*.parquet`
> > >
> > > ?
> > >
> > >
> > >
> > > On Wed, Jun 29, 2016 at 1:06 PM, John Omernik <[email protected]> wrote:
> > >
> > > > When the Hadoop FS client copies files (say parquet files) It adds a
> > > > ._COPYING_ at the end of the file until it's complete.  If that's there
> > > > Drill fails (partial files etc).
> > > >
> > > > I know I can ignore files that start with . (or directories) but is there a
> > > > good way to tell Drill to ignore files that are not *.parquet, or that have
> > > > ._COPYING_ at the end of them?
> > > >
> > > > Thanks!
> > > >
> > > > John
> > > >
> > >
> >
>
>
> --
>  ----
>  Vince Gonzalez
>  Systems Engineer
>  212.694.3879
>
>  mapr.com
>
