Add a "." prefix to the Flume temp files. Drill treats dot-prefixed files as hidden and will ignore them when you query the directory structure.
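In the Flume agent properties that would look something like the sketch below. The agent and sink names are placeholders, and the path is only a guess based on the directory layout in your message; hdfs.inUsePrefix is the setting that controls the prefix given to the file Flume is still writing.

# Sketch of an HDFS sink config -- "agent" and "k1" are placeholder names
agent.sinks.k1.type = hdfs
# Path assumed from the dated directory layout shown in the question
agent.sinks.k1.hdfs.path = /user/flume/incoming/twitter/%Y/%m/%d
# Prefix in-progress files with "." so Drill skips them as hidden files
agent.sinks.k1.hdfs.inUsePrefix = .
# Default in-use suffix; the open file then appears as .FlumeData.<timestamp>.tmp
agent.sinks.k1.hdfs.inUseSuffix = .tmp

Once the in-use file is dot-prefixed, the directory-wide query in your message should only read the completed FlumeData files.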
--Andries

> On Sep 21, 2016, at 2:36 PM, Robin Moffatt <[email protected]> wrote:
>
> Hi,
> I have a stream of data from Flume landing in HDFS in files of a set size.
> I can query these files individually just fine, and across multiple ones
> too - except if the wildcard encompasses the *currently open HDFS file that
> Flume is writing to*. When this happens, Drill understandably barfs.
>
> 0: jdbc:drill:drillbit=localhost> show files in `hdfs`.`/user/flume/incoming/twitter/2016/09/21/`;
> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> | name                         | isDirectory  | isFile  | length   | owner  | group        | permissions  | accessTime               | modificationTime         |
> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> [...]
> | FlumeData.1474467815652     | false        | true    | 1055490  | flume  | supergroup   | rw-r--r--    | 2016-09-21 21:52:07.219  | 2016-09-21 21:58:58.28   |
> | FlumeData.1474467815653     | false        | true    | 1050470  | flume  | supergroup   | rw-r--r--    | 2016-09-21 21:58:58.556  | 2016-09-21 22:06:28.636  |
> | FlumeData.1474467815654     | false        | true    | 1051043  | flume  | supergroup   | rw-r--r--    | 2016-09-21 22:06:29.564  | 2016-09-21 22:13:40.808  |
> | FlumeData.1474467815655     | false        | true    | 1052657  | flume  | supergroup   | rw-r--r--    | 2016-09-21 22:13:40.978  | 2016-09-21 22:23:00.409  |
> | FlumeData.1474467815656.tmp | false        | true    | 9447     | flume  | supergroup   | rw-r--r--    | 2016-09-21 22:23:00.788  | 2016-09-21 22:23:00.788  |
> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> 59 rows selected (0.265 seconds)
>
> Note the .tmp file as the last one in the folder.
>
> Querying a single file works:
>
> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815655`(type => 'json'));
> +---------+
> | EXPR$0  |
> +---------+
> | 221     |
> +---------+
> 1 row selected (0.685 seconds)
>
> As does across multiple files where the wildcard pattern would exclude the .tmp file:
>
> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.147446781564*`(type => 'json'));
> +---------+
> | EXPR$0  |
> +---------+
> | 2178    |
> +---------+
> 1 row selected (1.24 seconds)
>
> But if I try to query all the files, Drill includes the .tmp file and errors:
>
> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/*`(type => 'json'));
> Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block length for
> LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074004983_264343;
> getBlockSize()=9447; corrupt=false; offset=0;
> locs=[DatanodeInfoWithStorage[192.168.10.116:50010,DS-39bf5e74-3eec-4447-9cd2-f17b5cc259b8,DISK],
> DatanodeInfoWithStorage[192.168.10.113:50010,DS-845945e7-0bc8-44aa-945c-a140ad1f55ab,DISK],
> DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK]]}
>
> File /user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815656.tmp
> Record 1
> Fragment 0:0
>
> [Error Id: d3f322cb-c64d-43c8-9231-fb2c96e8589d on cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
> 0: jdbc:drill:drillbit=localhost>
>
> Is there a way around this with Drill? For example, can I use a regex in the path?
> I've tried, but just hit
> Error: VALIDATION ERROR: null
>
> thanks, Robin.
