Thanks Andries. Good point about organisation of the data too!

Robin.

On 22 September 2016 at 15:27, Andries Engelbrecht <aengelbre...@maprtech.com> wrote:

> I noticed that if you specifically use * for file matching, it will still
> read hidden files. However, if you only point Drill at a directory, it will
> read the directory and its substructure without reading any hidden files.
>
> select * from `/dir1/*`  - will read hidden files
> select * from `/dir1`    - will not read hidden files
>
> So it depends on whether you need file name pattern matching or not. Most
> of the time it is a good idea not to mix different data in the same
> directory structure, but rather to use directory structures to separate
> different data and types, as that makes for easier matching and management
> down the road.
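>
> For example (just a sketch - this workspace path and layout are made up;
> with extension-less JSON files you would still need the
> table(... (type => 'json')) wrapper shown below), if data is separated
> into per-day subdirectories you can query the parent directory and use
> Drill's implicit dirN columns to filter or group, instead of relying on
> file name patterns:
>
> select dir0 as `day`, count(*)
> from `hdfs`.`/incoming/twitter/2016/09`
> group by dir0;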
>
> At any rate, you found a solution for your needs. I have not looked into
> whether there are exclusion parameters for pattern matching.
>
> --Andries
>
> > On Sep 22, 2016, at 4:11 AM, Robin Moffatt <robin.moff...@rittmanmead.com> wrote:
> >
> > Hi,
> >
> > It still tried to read it, even with a . prefix:
> >
> > 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> > table(`hdfs`.`/user/flume/incoming/twitter/2016/09/22/*`(type => 'json'));
> > Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block
> > length for
> > LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074005711_265071;
> > getBlockSize()=39945; corrupt=false; offset=0;
> > locs=[DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK],
> > DatanodeInfoWithStorage[192.168.10.114:50010,DS-6c2cd5a6-22c5-4445-9018-ca0f2549a6cf,DISK],
> > DatanodeInfoWithStorage[192.168.10.117:50010,DS-70946f9d-95d9-4f35-b19d-97b8dc01cb88,DISK]]}
> >
> > File  /user/flume/incoming/twitter/2016/09/22/.FlumeData.1474530954642.tmp
> > Record  1
> > Fragment 0:0
> >
> > [Error Id: 04d44b5d-6d02-4062-9f4a-ebf3831d9ba1 on
> > cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
> >
> >
> > However - good call on the Flume hdfs.inUsePrefix - by using that I can
> > set my Drill file pattern so that it doesn't match the in-use file:
> >
> > 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> > table(`hdfs`.`/user/flume/incoming/twitter/2016/09/22/Flume*`(type => 'json'));
> > +---------+
> > | EXPR$0  |
> > +---------+
> > | 12652   |
> > +---------+
> > 1 row selected (6.34 seconds)
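> >
> > For reference, the Flume side of that is a one-line sink setting - a
> > sketch, with made-up agent and sink names (hdfs.inUseSuffix shown for
> > completeness; it defaults to .tmp):
> >
> > # prefix in-progress files so they no longer match the FlumeData* pattern
> > # (a . or _ prefix would also make Drill treat them as hidden files)
> > agent1.sinks.hdfs1.hdfs.inUsePrefix = _
> > agent1.sinks.hdfs1.hdfs.inUseSuffix = .tmp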
> >
> >
> > So, problem sidestepped for now - but it would be good to understand: if I
> > couldn't modify the producer, is there a way to get Drill to omit certain
> > patterns from its file matching?
> >
> > thanks, Robin.
> >
> >
> > On 21 September 2016 at 23:15, Andries Engelbrecht <aengelbre...@maprtech.com> wrote:
> >
> >> Add a . prefix to the Flume temp files. Drill will ignore the hidden
> >> files when you query the directory structure.
> >>
> >> --Andries
> >>
> >>> On Sep 21, 2016, at 2:36 PM, Robin Moffatt <robin.moff...@rittmanmead.com> wrote:
> >>>
> >>> Hi,
> >>> I have a stream of data from Flume landing in HDFS in files of a set
> >>> size. I can query these files individually just fine, and across
> >>> multiple ones too - except if the wildcard encompasses the *currently
> >>> open HDFS file that Flume is writing to*. When this happens, Drill
> >>> understandably barfs.
> >>>
> >>> 0: jdbc:drill:drillbit=localhost> show files in
> >>> `hdfs`.`/user/flume/incoming/twitter/2016/09/21/`;
> >>> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> >>> |             name             | isDirectory  | isFile  |  length  | owner  |    group    | permissions  |        accessTime        |     modificationTime     |
> >>> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> >>> [...]
> >>> | FlumeData.1474467815652      | false        | true    | 1055490  | flume  | supergroup  | rw-r--r--    | 2016-09-21 21:52:07.219  | 2016-09-21 21:58:58.28   |
> >>> | FlumeData.1474467815653      | false        | true    | 1050470  | flume  | supergroup  | rw-r--r--    | 2016-09-21 21:58:58.556  | 2016-09-21 22:06:28.636  |
> >>> | FlumeData.1474467815654      | false        | true    | 1051043  | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:06:29.564  | 2016-09-21 22:13:40.808  |
> >>> | FlumeData.1474467815655      | false        | true    | 1052657  | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:13:40.978  | 2016-09-21 22:23:00.409  |
> >>> | FlumeData.1474467815656.tmp  | false        | true    | 9447     | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:23:00.788  | 2016-09-21 22:23:00.788  |
> >>> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> >>> 59 rows selected (0.265 seconds)
> >>>
> >>> Note the .tmp file as the last one in the folder.
> >>>
> >>> Querying a single file works:
> >>>
> >>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> >>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815655`(type
> >>> => 'json'));
> >>> +---------+
> >>> | EXPR$0  |
> >>> +---------+
> >>> | 221     |
> >>> +---------+
> >>> 1 row selected (0.685 seconds)
> >>>
> >>>
> >>> As does querying across multiple files, where the wildcard pattern
> >>> excludes the .tmp file:
> >>>
> >>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> >>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.147446781564*`(type
> >>> => 'json'));
> >>> +---------+
> >>> | EXPR$0  |
> >>> +---------+
> >>> | 2178    |
> >>> +---------+
> >>> 1 row selected (1.24 seconds)
> >>>
> >>>
> >>> But if I try to query all the files, Drill includes the .tmp file and
> >>> errors:
> >>>
> >>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> >>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/*`(type => 'json'));
> >>> Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block
> >>> length for
> >>> LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074004983_264343;
> >>> getBlockSize()=9447; corrupt=false; offset=0;
> >>> locs=[DatanodeInfoWithStorage[192.168.10.116:50010,DS-39bf5e74-3eec-4447-9cd2-f17b5cc259b8,DISK],
> >>> DatanodeInfoWithStorage[192.168.10.113:50010,DS-845945e7-0bc8-44aa-945c-a140ad1f55ab,DISK],
> >>> DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK]]}
> >>>
> >>> File  /user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815656.tmp
> >>> Record  1
> >>> Fragment 0:0
> >>>
> >>> [Error Id: d3f322cb-c64d-43c8-9231-fb2c96e8589d on
> >>> cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
> >>> 0: jdbc:drill:drillbit=localhost>
> >>>
> >>>
> >>> Is there a way around this with Drill? For example, can I use a regex
> >>> in the path? I've tried, but just hit:
> >>> Error: VALIDATION ERROR: null
> >>>
> >>> thanks, Robin.
> >>
> >>
>
>
