Hi,
I have a stream of data from Flume landing in HDFS in files of a set size.
I can query these files individually just fine, and across multiple ones
too - except if the wildcard encompasses the *currently open HDFS file that
Flume is writing to*. When this happens, Drill understandably barfs.
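For context, the Flume HDFS sink is set up roughly like this (a sketch from memory rather than my exact config - the agent/channel/sink names are illustrative - but the relevant parts are the size-only roll at ~1MB, and hdfs.inUseSuffix, which defaults to .tmp and is the suffix on the file Flume currently has open):

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = mem-channel
agent.sinks.hdfs-sink.hdfs.path = /user/flume/incoming/twitter/%Y/%m/%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
# roll on size only (~1MB); disable count- and time-based rolling
agent.sinks.hdfs-sink.hdfs.rollSize = 1048576
agent.sinks.hdfs-sink.hdfs.rollCount = 0
agent.sinks.hdfs-sink.hdfs.rollInterval = 0
# hdfs.inUseSuffix is left at its default (.tmp)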

0: jdbc:drill:drillbit=localhost> show files in `hdfs`.`/user/flume/incoming/twitter/2016/09/21/`;
+------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
|             name             | isDirectory  | isFile  |  length  | owner  |    group    | permissions  |        accessTime        |     modificationTime     |
+------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
[...]
| FlumeData.1474467815652      | false        | true    | 1055490  | flume  | supergroup  | rw-r--r--    | 2016-09-21 21:52:07.219  | 2016-09-21 21:58:58.28   |
| FlumeData.1474467815653      | false        | true    | 1050470  | flume  | supergroup  | rw-r--r--    | 2016-09-21 21:58:58.556  | 2016-09-21 22:06:28.636  |
| FlumeData.1474467815654      | false        | true    | 1051043  | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:06:29.564  | 2016-09-21 22:13:40.808  |
| FlumeData.1474467815655      | false        | true    | 1052657  | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:13:40.978  | 2016-09-21 22:23:00.409  |
| FlumeData.1474467815656.tmp  | false        | true    | 9447     | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:23:00.788  | 2016-09-21 22:23:00.788  |
+------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
59 rows selected (0.265 seconds)

Note the .tmp file, the last one in the folder.

Querying a single file works:

0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815655`(type => 'json'));
+---------+
| EXPR$0  |
+---------+
| 221     |
+---------+
1 row selected (0.685 seconds)


As does querying across multiple files, where the wildcard pattern excludes the .tmp file:

0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.147446781564*`(type => 'json'));
+---------+
| EXPR$0  |
+---------+
| 2178    |
+---------+
1 row selected (1.24 seconds)


But if I try to query all the files, Drill includes the .tmp file and errors - presumably because the last block of a file that is still open for write doesn't yet have a finalised length:

0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/*`(type => 'json'));
Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block length for LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074004983_264343; getBlockSize()=9447; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.10.116:50010,DS-39bf5e74-3eec-4447-9cd2-f17b5cc259b8,DISK], DatanodeInfoWithStorage[192.168.10.113:50010,DS-845945e7-0bc8-44aa-945c-a140ad1f55ab,DISK], DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK]]}

File  /user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815656.tmp
Record  1
Fragment 0:0

[Error Id: d3f322cb-c64d-43c8-9231-fb2c96e8589d on cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
0: jdbc:drill:drillbit=localhost>


Is there a way around this with Drill? For example, can I use a regex in the path? I've tried, but just hit:

Error: VALIDATION ERROR: null
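If Drill passes the path through Hadoop's glob matching, then perhaps a character class, rather than a full regex, could exclude the .tmp file. Something like this untested sketch, which should only match files ending in a digit:

SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.*[0-9]`(type => 'json'));

But I'm not sure whether that syntax is actually supported here.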

thanks, Robin.
