Thanks Andries. Good point about organisation of the data too! Robin.
On 22 September 2016 at 15:27, Andries Engelbrecht <aengelbre...@maprtech.com> wrote:

> I noticed that if you specifically use * for file matching it will still
> read hidden files. However, if you only point Drill at a directory it will
> read the directory and its substructure without reading any hidden files.
>
> select * from `/dir1/*` - will read hidden files
> select * from `/dir1` - will not read hidden files
>
> So it depends on whether you need file name pattern matching or not. Most
> of the time it is a good idea not to mix different data in the same
> directory structure, but rather to use the directory structure to separate
> different data and types, as it makes for easier matching and management
> down the road.
>
> At any rate, you found a solution for your needs. I have not looked at
> whether there are exclusion parameters for pattern matching.
>
> --Andries
>
>
> > On Sep 22, 2016, at 4:11 AM, Robin Moffatt <robin.moff...@rittmanmead.com> wrote:
> >
> > Hi,
> >
> > It still tried to read it, even with a . prefix:
> >
> > 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> > table(`hdfs`.`/user/flume/incoming/twitter/2016/09/22/*`(type => 'json'));
> > Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block
> > length for
> > LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074005711_265071;
> > getBlockSize()=39945; corrupt=false; offset=0;
> > locs=[DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK],
> > DatanodeInfoWithStorage[192.168.10.114:50010,DS-6c2cd5a6-22c5-4445-9018-ca0f2549a6cf,DISK],
> > DatanodeInfoWithStorage[192.168.10.117:50010,DS-70946f9d-95d9-4f35-b19d-97b8dc01cb88,DISK]]}
> >
> > File /user/flume/incoming/twitter/2016/09/22/.FlumeData.1474530954642.tmp
> > Record 1
> > Fragment 0:0
> >
> > [Error Id: 04d44b5d-6d02-4062-9f4a-ebf3831d9ba1 on
> > cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
> >
> > However - good call on the Flume hdfs.inUsePrefix - by using that I can
> > then set my Drill file pattern so that it doesn't match the in-use file:
> >
> > 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> > table(`hdfs`.`/user/flume/incoming/twitter/2016/09/22/Flume*`(type => 'json'));
> > +---------+
> > | EXPR$0  |
> > +---------+
> > | 12652   |
> > +---------+
> > 1 row selected (6.34 seconds)
> >
> > So the problem is sidestepped for now - but it would be good to
> > understand, in case I couldn't modify the producer, whether there's a
> > way to get Drill to omit certain patterns from its file matching?
> >
> > thanks, Robin.
> >
> >
> > On 21 September 2016 at 23:15, Andries Engelbrecht <aengelbre...@maprtech.com> wrote:
> >
> >> Add a . prefix to the Flume temp files. Drill will ignore the hidden
> >> files when you query the directory structure.
> >>
> >> --Andries
> >>
> >>> On Sep 21, 2016, at 2:36 PM, Robin Moffatt <robin.moff...@rittmanmead.com> wrote:
> >>>
> >>> Hi,
> >>> I have a stream of data from Flume landing in HDFS in files of a set size.
> >>> I can query these files individually just fine, and across multiple ones
> >>> too - except if the wildcard encompasses the *currently open HDFS file
> >>> that Flume is writing to*. When this happens, Drill understandably barfs.
> >>>
> >>> 0: jdbc:drill:drillbit=localhost> show files in
> >>> `hdfs`.`/user/flume/incoming/twitter/2016/09/21/`;
> >>> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> >>> | name                         | isDirectory  | isFile  | length   | owner  | group       | permissions  | accessTime               | modificationTime         |
> >>> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> >>> [...]
> >>> | FlumeData.1474467815652      | false        | true    | 1055490  | flume  | supergroup  | rw-r--r--    | 2016-09-21 21:52:07.219  | 2016-09-21 21:58:58.28   |
> >>> | FlumeData.1474467815653      | false        | true    | 1050470  | flume  | supergroup  | rw-r--r--    | 2016-09-21 21:58:58.556  | 2016-09-21 22:06:28.636  |
> >>> | FlumeData.1474467815654      | false        | true    | 1051043  | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:06:29.564  | 2016-09-21 22:13:40.808  |
> >>> | FlumeData.1474467815655      | false        | true    | 1052657  | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:13:40.978  | 2016-09-21 22:23:00.409  |
> >>> | FlumeData.1474467815656.tmp  | false        | true    | 9447     | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:23:00.788  | 2016-09-21 22:23:00.788  |
> >>> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> >>> 59 rows selected (0.265 seconds)
> >>>
> >>> Note the .tmp file as the last one in the folder.
> >>>
> >>> Querying a single file works:
> >>>
> >>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> >>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815655`(type => 'json'));
> >>> +---------+
> >>> | EXPR$0  |
> >>> +---------+
> >>> | 221     |
> >>> +---------+
> >>> 1 row selected (0.685 seconds)
> >>>
> >>> As does querying across multiple files, where the wildcard pattern
> >>> excludes the .tmp file:
> >>>
> >>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> >>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.147446781564*`(type => 'json'));
> >>> +---------+
> >>> | EXPR$0  |
> >>> +---------+
> >>> | 2178    |
> >>> +---------+
> >>> 1 row selected (1.24 seconds)
> >>>
> >>> But if I try to query all the files, Drill includes the .tmp file and
> >>> errors:
> >>>
> >>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> >>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/*`(type => 'json'));
> >>> Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block
> >>> length for
> >>> LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074004983_264343;
> >>> getBlockSize()=9447; corrupt=false; offset=0;
> >>> locs=[DatanodeInfoWithStorage[192.168.10.116:50010,DS-39bf5e74-3eec-4447-9cd2-f17b5cc259b8,DISK],
> >>> DatanodeInfoWithStorage[192.168.10.113:50010,DS-845945e7-0bc8-44aa-945c-a140ad1f55ab,DISK],
> >>> DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK]]}
> >>>
> >>> File /user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815656.tmp
> >>> Record 1
> >>> Fragment 0:0
> >>>
> >>> [Error Id: d3f322cb-c64d-43c8-9231-fb2c96e8589d on
> >>> cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
> >>> 0: jdbc:drill:drillbit=localhost>
> >>>
> >>>
> >>> Is there a way around this with Drill? For example, can I use a regex in
> >>> the path? I've tried, but just hit
> >>> Error: VALIDATION ERROR: null
> >>>
> >>> thanks, Robin.
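
For reference, the producer-side fix the thread settles on is set on the Flume HDFS sink. A minimal sketch of the relevant sink properties follows; the agent and sink names (`a1`, `k1`) and the path are hypothetical placeholders, but `hdfs.inUsePrefix` and `hdfs.inUseSuffix` are the Flume HDFS sink properties being discussed:

```properties
# Hypothetical agent/sink names (a1, k1) - adapt to your Flume agent config.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/flume/incoming/twitter/%Y/%m/%d
a1.sinks.k1.hdfs.filePrefix = FlumeData
# Prefix in-flight files with a dot so they are hidden files in HDFS;
# Drill skips hidden files when querying the directory (not when globbing with *).
a1.sinks.k1.hdfs.inUsePrefix = .
a1.sinks.k1.hdfs.inUseSuffix = .tmp
```

With this in place, the in-progress file is named like `.FlumeData.1474530954642.tmp`, so both a plain directory query and a `FlumeData*` pattern avoid it; once Flume rolls the file, it is renamed without the prefix and becomes visible to queries.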