Add a "." prefix to the Flume temp files. Drill treats dot-prefixed files as hidden and will ignore them when you query the directory structure.
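In the Flume agent properties that would look something like the sketch below. The agent and sink names are placeholders, and the path is only a guess based on the directory layout in your message; hdfs.inUsePrefix is the setting that controls the prefix given to the file Flume is still writing.

# Sketch of an HDFS sink config -- "agent" and "k1" are placeholder names
agent.sinks.k1.type = hdfs
# Path assumed from the dated directory layout shown in the question
agent.sinks.k1.hdfs.path = /user/flume/incoming/twitter/%Y/%m/%d
# Prefix in-progress files with "." so Drill skips them as hidden files
agent.sinks.k1.hdfs.inUsePrefix = .
# Default in-use suffix; the open file then appears as .FlumeData.<timestamp>.tmp
agent.sinks.k1.hdfs.inUseSuffix = .tmp

Once the in-use file is dot-prefixed, the directory-wide query in your message should only read the completed FlumeData files.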
--Andries

> On Sep 21, 2016, at 2:36 PM, Robin Moffatt <[email protected]> wrote:
>
> Hi,
> I have a stream of data from Flume landing in HDFS in files of a set size.
> I can query these files individually just fine, and across multiple ones
> too - except if the wildcard encompasses the *currently open HDFS file that
> Flume is writing to*. When this happens, Drill understandably barfs.
>
> 0: jdbc:drill:drillbit=localhost> show files in `hdfs`.`/user/flume/incoming/twitter/2016/09/21/`;
> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> | name                         | isDirectory  | isFile  | length   | owner  | group        | permissions  | accessTime               | modificationTime         |
> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> [...]
> | FlumeData.1474467815652     | false        | true    | 1055490  | flume  | supergroup   | rw-r--r--    | 2016-09-21 21:52:07.219  | 2016-09-21 21:58:58.28   |
> | FlumeData.1474467815653     | false        | true    | 1050470  | flume  | supergroup   | rw-r--r--    | 2016-09-21 21:58:58.556  | 2016-09-21 22:06:28.636  |
> | FlumeData.1474467815654     | false        | true    | 1051043  | flume  | supergroup   | rw-r--r--    | 2016-09-21 22:06:29.564  | 2016-09-21 22:13:40.808  |
> | FlumeData.1474467815655     | false        | true    | 1052657  | flume  | supergroup   | rw-r--r--    | 2016-09-21 22:13:40.978  | 2016-09-21 22:23:00.409  |
> | FlumeData.1474467815656.tmp | false        | true    | 9447     | flume  | supergroup   | rw-r--r--    | 2016-09-21 22:23:00.788  | 2016-09-21 22:23:00.788  |
> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> 59 rows selected (0.265 seconds)
>
> Note the .tmp file as the last one in the folder.
>
> Querying a single file works:
>
> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815655`(type => 'json'));
> +---------+
> | EXPR$0  |
> +---------+
> | 221     |
> +---------+
> 1 row selected (0.685 seconds)
>
> As does across multiple files where the wildcard pattern would exclude the .tmp file:
>
> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.147446781564*`(type => 'json'));
> +---------+
> | EXPR$0  |
> +---------+
> | 2178    |
> +---------+
> 1 row selected (1.24 seconds)
>
> But if I try to query all the files, Drill includes the .tmp file and errors:
>
> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/*`(type => 'json'));
> Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block length for
> LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074004983_264343;
> getBlockSize()=9447; corrupt=false; offset=0;
> locs=[DatanodeInfoWithStorage[192.168.10.116:50010,DS-39bf5e74-3eec-4447-9cd2-f17b5cc259b8,DISK],
> DatanodeInfoWithStorage[192.168.10.113:50010,DS-845945e7-0bc8-44aa-945c-a140ad1f55ab,DISK],
> DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK]]}
>
> File /user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815656.tmp
> Record 1
> Fragment 0:0
>
> [Error Id: d3f322cb-c64d-43c8-9231-fb2c96e8589d on cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
> 0: jdbc:drill:drillbit=localhost>
>
> Is there a way around this with Drill? For example, can I use a regex in the path?
> I've tried, but just hit
> Error: VALIDATION ERROR: null
>
> thanks, Robin.
