Re: Tricks for Copying Data where Drill is actively querying

Parth Chandra Thu, 30 Jun 2016 09:57:18 -0700

Hi John,

I've tried something like the following successfully -


    select foo from tablename/*/p_hour=1

That reads every 'p_day=nnn' directory whose subdirectory is 'p_hour=1'.
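
For reference, a minimal sketch of how the fully qualified form of that query might be assembled, following the backtick-quoted path style Ted uses further down the thread. The helper name `wildcard_query`, the `dfs.root` workspace, and the `/data/tablename` root are assumptions for illustration, not taken from John's setup.

```python
def wildcard_query(column, workspace, table_root, hour):
    """Build a query string that wildcards the p_day level for one p_hour."""
    # Backticks quote the path so '*' and '=' are treated as part of the
    # path rather than as SQL operators.
    path = f"{table_root}/*/p_hour={hour}"
    return f"SELECT {column} FROM {workspace}.`{path}`"


# Hypothetical workspace and table root; substitute your own.
query = wildcard_query("foo", "dfs.root", "/data/tablename", 1)
print(query)
```

The wildcard sits at the 'p_day=nnn' level only, so the 'p_hour=1' component still has to match literally.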


Parth


On Thu, Jun 30, 2016 at 6:43 AM, John Omernik <[email protected]> wrote:

> Vince - That is what I am doing now, using MapR Volumes, I am creating a
> .stage_%epoch% for each file copy. Once the data is fully copied (and no
> longer has the _COPYING_) I do a NFS filesystem mv to the directory it
> actually belongs in.
>
> Now, this is messy, and it forced me to add more to my ETL.  A couple of
> ideal things:
>
> 1. In my view, I could use the select with options feature to add a
> filemask.  I.e., if I am looking at a directory (any directory, not just
> parquet), let me specify a filesystem glob (or fancier, a regex) that would
> allow me to tell Drill to use only these files.  It has to be a select with
> options type thing, because a setting like this should be on a per-table
> basis, not a system- or session-level option.
>
> 2.  Make Drill smart enough to handle wildcards in directories (in the
> "FROM" definitions)
>
> 3. Allow a global, user-configurable "ignore these files for everything"
> setting. Drill already does this for hidden files (preceded by a .), but
> given everyone's unique snowflake systems, an admin may have other "always
> ignore this in queries" patterns. (hadoop fs client users may specify
> *._COPYING_ as an always-exclude, but there may be others.)
>
>
>
> On Thu, Jun 30, 2016 at 7:08 AM, Vince Gonzalez <[email protected]>
> wrote:
>
> > I know it doesn't go right to the question of how to make Drill ignore
> > things, but could you copy the data into some parallel tree, then rename
> > it into the appropriate directory once the copy is done?
> >
> > Or could that still cause a running query to fail?
> >
> > On Thursday, June 30, 2016, John Omernik <[email protected]> wrote:
> >
> > > I am querying source data that is two levels deep.
> > >
> > > tablename/p_day=2016-05-01/p_hour=1/file1.parquet
> > >
> > > I wasn't able to get wildcards at that level to work with dir0 etc.
> > >
> > >
> > >
> > >
> > > On Thu, Jun 30, 2016 at 12:39 AM, Ted Dunning <[email protected]> wrote:
> > >
> > > > Does it work to provide a wild card in your source spec?
> > > >
> > > > a la dfs.tdunning.`/user/tdunning/foo/data/*.parquet`
> > > >
> > > > ?
> > > >
> > > >
> > > >
> > > > On Wed, Jun 29, 2016 at 1:06 PM, John Omernik <[email protected]> wrote:
> > > >
> > > > > When the Hadoop FS client copies files (say parquet files) it adds a
> > > > > ._COPYING_ at the end of the file until it's complete.  If that's there,
> > > > > Drill fails (partial files etc).
> > > > >
> > > > > I know I can ignore files that start with . (or directories) but is there a
> > > > > good way to tell Drill to ignore files that are not *.parquet, or that have
> > > > > ._COPYING_ at the end of them?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > John
> > > > >
> > > >
> > >
> >
> >
> > --
> >  ----
> >  Vince Gonzalez
> >  Systems Engineer
> >  212.694.3879
> >
> >  mapr.com
> >
>
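
The stage-then-rename workaround discussed in the thread (copy into a hidden '.stage_%epoch%' directory, then mv into place) could be sketched roughly as below. This is only an illustration under stated assumptions: the function name `stage_then_move` and the demo paths are hypothetical, the copy goes through an ordinary filesystem mount (such as NFS) rather than the hadoop fs client, and the staging directory is assumed to live on the same volume as the destination so the final rename does not turn into a second copy.

```python
import os
import shutil
import tempfile
import time


def stage_then_move(src, volume_root, dest_dir):
    """Copy src into a hidden staging dir, then rename it into dest_dir."""
    # Hidden (dot-prefixed) directory, which the thread notes Drill already
    # ignores. mkdtemp adds a unique suffix after the epoch so that two
    # copies started in the same second do not share a directory.
    stage_dir = tempfile.mkdtemp(prefix=f".stage_{int(time.time())}_",
                                 dir=volume_root)
    staged = os.path.join(stage_dir, os.path.basename(src))
    shutil.copyfile(src, staged)  # the slow part happens out of Drill's sight

    os.makedirs(dest_dir, exist_ok=True)
    final = os.path.join(dest_dir, os.path.basename(src))
    # Assumption: same filesystem/volume, so this is a rename, not a copy.
    os.rename(staged, final)
    os.rmdir(stage_dir)  # staging dir is empty once the file has moved
    return final


# Demo against temporary directories standing in for a real volume.
volume_root = tempfile.mkdtemp()
src = os.path.join(tempfile.mkdtemp(), "file1.parquet")
with open(src, "wb") as f:
    f.write(b"placeholder bytes, not real parquet data")

dest = os.path.join(volume_root, "tablename", "p_day=2016-05-01", "p_hour=1")
final = stage_then_move(src, volume_root, dest)
print(final)
```

Whether a query that is already running when the rename lands can still fail is the open question Vince raises, and this sketch does not answer it.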
