Hi,
It still tried to read it, even with a . prefix:
0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
table(`hdfs`.`/user/flume/incoming/twitter/2016/09/22/*`(type => 'json'));
Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block length for
LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074005711_265071;
getBlockSize()=39945; corrupt=false; offset=0;
locs=[DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK],
DatanodeInfoWithStorage[192.168.10.114:50010,DS-6c2cd5a6-22c5-4445-9018-ca0f2549a6cf,DISK],
DatanodeInfoWithStorage[192.168.10.117:50010,DS-70946f9d-95d9-4f35-b19d-97b8dc01cb88,DISK]]}
File /user/flume/incoming/twitter/2016/09/22/.FlumeData.1474530954642.tmp
Record 1
Fragment 0:0
[Error Id: 04d44b5d-6d02-4062-9f4a-ebf3831d9ba1 on
cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
However - good call on the Flume hdfs.inUsePrefix - by using that I can
then make my Drill file pattern specific enough that it doesn't match the
in-use file:
0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
table(`hdfs`.`/user/flume/incoming/twitter/2016/09/22/Flume*`(type => 'json'));
+---------+
| EXPR$0 |
+---------+
| 12652 |
+---------+
1 row selected (6.34 seconds)
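For anyone else hitting this, a sketch of the relevant Flume HDFS sink settings - the agent and sink names here are hypothetical, but hdfs.inUsePrefix and hdfs.inUseSuffix are the standard sink properties:

```properties
# Hypothetical agent/sink names - substitute your own.
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /user/flume/incoming/twitter/%Y/%m/%d
# Rename the file Flume is still writing to so it no longer
# matches the FlumeData* pattern used in the Drill query.
agent1.sinks.hdfs-sink.hdfs.inUsePrefix = .
agent1.sinks.hdfs-sink.hdfs.inUseSuffix = .tmp
```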
So the problem is sidestepped for now - but if I couldn't modify the
producer, it would be good to understand whether there's a way to get
Drill to exclude certain patterns from its file matching?
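As a quick sanity check of which files a given pattern picks up (outside of Drill), the glob matching can be sketched in a few lines - the completed file names below are made up, but the in-use name mirrors the error above:

```python
from fnmatch import fnmatch

# Hypothetical directory listing; the last entry is the in-use file
# after being renamed by hdfs.inUsePrefix = .
files = [
    "FlumeData.1474530954640",
    "FlumeData.1474530954641",
    ".FlumeData.1474530954642.tmp",
]

# 'Flume*' matches only the completed files; the dotted in-use file
# doesn't start with 'F', so it falls outside the pattern.
matched = [f for f in files if fnmatch(f, "Flume*")]
print(matched)  # ['FlumeData.1474530954640', 'FlumeData.1474530954641']
```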
thanks, Robin.
On 21 September 2016 at 23:15, Andries Engelbrecht <
[email protected]> wrote:
> Add a . prefix to the Flume temp files. Drill will ignore the hidden files
> when you query the directory structure.
>
> --Andries
>
> > On Sep 21, 2016, at 2:36 PM, Robin Moffatt <
> [email protected]> wrote:
> >
> > Hi,
> > I have a stream of data from Flume landing in HDFS in files of a set
> size.
> > I can query these files individually just fine, and across multiple ones
> > too - except if the wildcard encompasses the *currently open HDFS file
> that
> > Flume is writing to*. When this happens, Drill understandably barfs.
> >
> > 0: jdbc:drill:drillbit=localhost> show files in
> > `hdfs`.`/user/flume/incoming/twitter/2016/09/21/`;
> > +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> > | name                         | isDirectory  | isFile  | length   | owner  | group       | permissions  | accessTime               | modificationTime         |
> > +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> > [...]
> > | FlumeData.1474467815652      | false        | true    | 1055490  | flume  | supergroup  | rw-r--r--    | 2016-09-21 21:52:07.219  | 2016-09-21 21:58:58.28   |
> > | FlumeData.1474467815653      | false        | true    | 1050470  | flume  | supergroup  | rw-r--r--    | 2016-09-21 21:58:58.556  | 2016-09-21 22:06:28.636  |
> > | FlumeData.1474467815654      | false        | true    | 1051043  | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:06:29.564  | 2016-09-21 22:13:40.808  |
> > | FlumeData.1474467815655      | false        | true    | 1052657  | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:13:40.978  | 2016-09-21 22:23:00.409  |
> > | FlumeData.1474467815656.tmp  | false        | true    | 9447     | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:23:00.788  | 2016-09-21 22:23:00.788  |
> > +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
> > 59 rows selected (0.265 seconds)
> >
> > Note the .tmp file as the last one in the folder
> >
> > Querying a single file works:
> >
> > 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> > table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815655`(type => 'json'));
> > +---------+
> > | EXPR$0 |
> > +---------+
> > | 221 |
> > +---------+
> > 1 row selected (0.685 seconds)
> >
> >
> > As does querying across multiple files, where the wildcard pattern
> > excludes the .tmp file:
> >
> > 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> > table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.147446781564*`(type => 'json'));
> > +---------+
> > | EXPR$0 |
> > +---------+
> > | 2178 |
> > +---------+
> > 1 row selected (1.24 seconds)
> >
> >
> > But if I try to query all the files, Drill includes the .tmp file and
> > errors:
> >
> > 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> > table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/*`(type => 'json'));
> > Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block length for
> > LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074004983_264343;
> > getBlockSize()=9447; corrupt=false; offset=0;
> > locs=[DatanodeInfoWithStorage[192.168.10.116:50010,DS-39bf5e74-3eec-4447-9cd2-f17b5cc259b8,DISK],
> > DatanodeInfoWithStorage[192.168.10.113:50010,DS-845945e7-0bc8-44aa-945c-a140ad1f55ab,DISK],
> > DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK]]}
> >
> > File /user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815656.tmp
> > Record 1
> > Fragment 0:0
> >
> > [Error Id: d3f322cb-c64d-43c8-9231-fb2c96e8589d on
> > cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
> > 0: jdbc:drill:drillbit=localhost>
> >
> >
> > Is there a way around this with Drill? For example, can I use a regex in
> > the path? I've tried, but just hit
> > Error: VALIDATION ERROR: null
> >
> > thanks, Robin.
>
>