Hi all, 

I'm running multiple CTAS statements that each select and filter a json file to 
produce another json file on my local filesystem. The first CTAS works, but 
subsequent ones don't complete fully and somehow end up getting cancelled. The 
table is created (as a json file), but the data in the file is 
truncated/interrupted. Select statements always seem to complete without issue.

I'm running Apache Drill 0.8 in distributed mode on my Mac (1 zk and 1 
drillbit). I've written a Java app that uses the drill-jdbc-all-0.8.jar 
driver. My app:
- creates a new JDBC Connection
- queries a json file that contains "rows" of json data; each row contains a 
timestamp, and the query gives me the distinct set of timestamps (yyyy-mm-dd)
- for each distinct timestamp, executes a CTAS statement that selects and 
filters the json file on that timestamp, producing a json file that only 
contains data for that timestamp

The reason I'm doing this is that I'd like to partition the json data by 
timestamp on the file system, e.g. dfs.tmp.dataset/yyyy/mm/dd/data.json. A 
stripped-down sketch of the flow is below.
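
In case it helps, this is roughly what the app does. It's a simplified sketch, 
not the exact code: the class name and the target table name are placeholders, 
the JDBC URL assumes my single local ZooKeeper on the default port, and the 
source path / column names are the ones you can see in the query in the log 
further down.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class JsonFileSplitterSketch {
    public static void main(String[] args) throws Exception {
        // drill-jdbc-all-0.8.jar is on the classpath; connect via the local ZooKeeper.
        Class.forName("org.apache.drill.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181")) {

            // Step 1: collect the distinct yyyy-mm-dd timestamps from the source file.
            List<String> days = new ArrayList<>();
            String distinctSql =
                "select distinct cast(a.`properties`.`dateTimeCreated`.`$date` as date) as ymd "
              + "from `dfs`.`/Users/adam/devel/fs/drop/dataset.json` as a";
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(distinctSql)) {
                while (rs.next()) {
                    days.add(rs.getString("ymd"));
                }
            }

            // Step 2: one CTAS per timestamp, reusing the same connection but
            // creating and closing a fresh Statement for each CTAS.
            for (String ymd : days) {
                String ctas =
                    "create table dfs.`tmp`.`split/" + ymd + "` as "   // table name simplified here
                  + "select a.* from `dfs`.`/Users/adam/devel/fs/drop/dataset.json` as a "
                  + "where cast(a.`properties`.`dateTimeCreated`.`$date` as date) = date '" + ymd + "'";
                try (Statement st = conn.createStatement()) {
                    st.execute(ctas);
                }
            }
        }
    }
}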

Here's a snippet of the drillbit.log showing that, for some reason, the CTAS 
query is being cancelled:
2015-04-22 22:36:23,580 [2ac86a38-1777-a189-5721-240247cf04d9:foreman] INFO  
o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running on host 
localhost.  Skipping affinity to that host.
2015-04-22 22:36:23,580 [2ac86a38-1777-a189-5721-240247cf04d9:foreman] INFO  
o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 using 
1 threads. Time: 0ms total, 0.569000ms avg, 0ms max.
2015-04-22 22:36:23,630 [2ac86a38-1777-a189-5721-240247cf04d9:foreman] INFO  
o.a.drill.exec.work.foreman.Foreman - State change requested.  PENDING --> 
RUNNING
2015-04-22 22:36:23,708 [2ac86a38-1777-a189-5721-240247cf04d9:frag:0:0] INFO  
o.a.d.exec.vector.BaseValueVector - Realloc vector null. [16384] -> [32768]
2015-04-22 22:36:23,708 [2ac86a38-1777-a189-5721-240247cf04d9:frag:0:0] INFO  
o.a.d.exec.vector.BaseValueVector - Realloc vector null. [16384] -> [32768]
2015-04-22 22:36:23,708 [2ac86a38-1777-a189-5721-240247cf04d9:frag:0:0] INFO  
o.a.d.exec.vector.BaseValueVector - Realloc vector null. [16384] -> [32768]
2015-04-22 22:36:23,708 [2ac86a38-1777-a189-5721-240247cf04d9:frag:0:0] INFO  
o.a.d.exec.vector.BaseValueVector - Realloc vector null. [16384] -> [32768]
2015-04-22 22:36:23,709 [2ac86a38-1777-a189-5721-240247cf04d9:frag:0:0] INFO  
o.a.d.exec.vector.BaseValueVector - Realloc vector null. [16384] -> [32768]
2015-04-22 22:36:23,709 [2ac86a38-1777-a189-5721-240247cf04d9:frag:0:0] INFO  
o.a.d.exec.vector.BaseValueVector - Realloc vector null. [16384] -> [32768]
2015-04-22 22:36:24,115 [UserServer-1] INFO  
o.a.drill.exec.work.foreman.Foreman - State change requested.  RUNNING --> 
CANCELLATION_REQUESTED
2015-04-22 22:36:24,157 [2ac86a38-1777-a189-5721-240247cf04d9:frag:0:0] INFO  
o.a.drill.exec.work.foreman.Foreman - State change requested.  
CANCELLATION_REQUESTED --> COMPLETED

This is a snippet of my app's log file, which contains logging entries from the 
Drill components:
, data=DrillBuf(ridx: 0, widx: 19, cap: 19/19, unwrapped: DrillBuf(ridx: 132, 
widx: 132, cap: 132/132, unwrapped: 
UnsafeDirectLittleEndian(PooledUnsafeDirectByteBuf(ridx: 0, widx: 0, cap: 
132/132))))]
22:36:16.801 [Client-1] DEBUG o.a.drill.exec.rpc.user.UserClient - Sending 
response with Sender 2050211006
22:36:16.801 [main] DEBUG o.a.drill.exec.client.DrillClient - Cancelling query 
2ac86a3f-7148-b56c-01cf-1015c2d393fc
22:36:16.801 [main] DEBUG o.a.d.jdbc.DrillStatementRegistry - Removing from 
open-statements registry: 
org.apache.drill.jdbc.DrillJdbc41Factory$DrillJdbc41Statement@7aaba36d
22:36:16.801 [main] INFO  c.c.c.etl.process.JsonFileSplitter - ymd: 2014-11-10 
cnt:319
22:36:16.802 [main] INFO  c.c.c.etl.db.CreateSplitTableCommand - query: create 
table 
dfs.`tmp`.`split/5e1b82f6-6002-458f-882e-f6a6fa528150/cim_clinical.allergyintolerance-2014-11-10`
 as select a.* from `dfs`.`/Users/adam/devel/fs/drop/dataset.json` as a where 
cast(a.`properties`.`dateTimeCreated`.`$date` as date) = date '2014-11-10'
22:36:16.802 [main] DEBUG o.a.d.jdbc.DrillStatementRegistry - Adding to 
open-statements registry: 
org.apache.drill.jdbc.DrillJdbc41Factory$DrillJdbc41Statement@7a5eb93a
22:36:16.808 [Client-1] DEBUG o.a.d.e.rpc.user.QueryResultHandler - 
batchArrived: isLastChunk: true, queryState: PENDING, queryId = part1: 
3082830765651178860
part2: 130340599866037244

22:36:16.809 [Client-1] DEBUG o.a.drill.exec.rpc.user.UserClient - Sending 
response with Sender 1448811781
22:36:16.815 [Client-1] DEBUG o.a.d.e.rpc.user.QueryResultHandler - Received 
QueryId part1: 3082830761881707592
part2: -368065489313249241
 successfully.  Adding results listener 
org.apache.drill.jdbc.DrillResultSet$ResultsListener@77ce5f8d.
22:36:17.017 [Client-1] DEBUG o.a.d.e.rpc.user.QueryResultHandler - 
batchArrived: isLastChunk: true, queryState: CANCELED, queryId = part1: 
3082830765651178860
part2: 130340599866037244

22:36:17.017 [Client-1] DEBUG org.apache.drill.jdbc.DrillResultSet - Result 
arrived QueryResultBatch [header=query_state: CANCELED
query_id {
  part1: 3082830765651178860
  part2: 130340599866037244
}

For the CTAS queries that are reported as cancelled, I'm finding that the new 
json file is created and contains json data, but the results are not fully 
written out. That is, the file just ends mid-record.

Any idea why the subsequent CTAS statements are being cancelled? Also, what's 
the recommended approach to connection management when I need to execute a 
series of statements like this? Should I configure a DataSource instead?

thanks in advance,
a.
