paul-rogers edited a comment on pull request #2364:
URL: https://github.com/apache/drill/pull/2364#issuecomment-1027598044


   @vdiravka reached out to me on this bug. His explanation:
   
   > The issue is the schema is changed for the second batch and it is reported 
   > by SchemaTracker#isSameSchema. I suppose V1 reader just compared the same 
   > field type, but the new one compares vectors by identity. So schemaChange 
   > is right thing for that test case. But Drill Hash aggregate doesn't 
   > support schema change.
   
   Here is my analysis:
   
   You are right that the `SchemaTracker` is the part of EVF which looks for 
schema changes. The rule is pretty simple: if the kind of vector differs from 
the previous batch, then a schema change occurred. This is based, in part, on 
the observation that the sort operator can't handle any schema changes at all: 
not INT to BIGINT, not NOT NULL to NULL. Sort combines vectors, and if they are 
of different types, the values won't fit together. The same is probably true of 
the hash agg.
   
   OK, so we agree that a type change is bad. What you're describing is the 
same type, but a different vector. This is a bug in several ways, and not the 
way you might think.
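   To make that distinction concrete, here is a minimal sketch (hypothetical 
names, not Drill's actual `SchemaTracker` API) contrasting a type-based check 
with an identity-based check:

```java
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for a Drill value vector: it has a declared
// type and, like any Java object, its own identity.
class FakeVector {
    final String type;            // e.g. "INT", "BIGINT"
    FakeVector(String type) { this.type = type; }
}

class SchemaCheck {
    // V1-style check: a schema change only if the declared types differ.
    static boolean typesDiffer(List<FakeVector> prev, List<FakeVector> next) {
        if (prev.size() != next.size()) return true;
        for (int i = 0; i < prev.size(); i++) {
            if (!Objects.equals(prev.get(i).type, next.get(i).type)) return true;
        }
        return false;
    }

    // EVF-style check: a schema change if any vector object was replaced,
    // even when the replacement has the same declared type.
    static boolean vectorsDiffer(List<FakeVector> prev, List<FakeVector> next) {
        if (prev.size() != next.size()) return true;
        for (int i = 0; i < prev.size(); i++) {
            if (prev.get(i) != next.get(i)) return true;   // identity, not equals
        }
        return false;
    }
}
```

   With the same declared type but a fresh vector object, the first check 
reports no change while the second reports a change: exactly the situation 
described above.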
   
   First, it seems logical that if column x is a non-null INT in one batch, and 
is a non-null INT in the next batch, that no schema change has occurred. But, 
Drill will surprise you. Most operators assume that the vector itself has not 
changed: only the data stored within the vector. This is a subtle point, so let 
me expand it.
   
   A vector is a permanent holder for a column. The vector holds one or more 
data buffers, and those buffers change from batch to batch. Think of it as a 
bucket brigade: the old-style way that people fought fires. A chain of people 
forms (the vectors), and they hand buckets one to the next (the data buffers).
   
   Now, I was pretty surprised when I first discovered this. So was Boaz when 
he wrote hash agg. But, most of Drill's code assumes that, in setup, the code 
binds to a specific vector. In the next batch, that same binding is valid; only 
the data changes. This shows up in code gen and other places. In fact, a goodly 
part of EVF is concerned with this need for "vector continuity" even across 
readers. (Let that sink in: two readers that have nothing to do with each other 
must somehow still share common vectors. That alone should tell us the vector 
design in Drill is a bit strange and needs rethinking. But, that's another 
topic.)
   
   So, now let's review the bug. On the one hand, `SchemaTracker` has noticed 
the vectors changed, and is (correctly) telling the downstream operators that 
they must redo their bindings to the new vector.
   
   But, on the other hand, `SchemaTracker` has been presented with two 
different vectors for the same column. That should never happen unless there is 
an actual type change (non-null to null, INT to BIGINT, etc.). There are 
elaborate mechanisms to reuse vectors from one reader to the next.
   
   So, the first thing to check is if the types are at all different. (If the 
"major types" differ.) If so, then we have a legitimate schema change that 
Drill can't handle.
   
   Now, if you find that the types are the same, then we have a bug in EVF, 
specifically in the `ResultVectorCacheImpl` class, since that's the thing which 
is supposed to do the magic.
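   For reference, the intended behavior of a result-vector cache looks roughly 
like this (a hypothetical sketch, not the actual `ResultVectorCacheImpl` code): 
reuse the cached vector when the type is unchanged, and hand out a new one only 
on a genuine type change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a result-vector cache keyed by column name.
class VectorCache {
    static class Vector {
        final String type;
        Vector(String type) { this.type = type; }
    }

    private final Map<String, Vector> cache = new HashMap<>();

    Vector vectorFor(String column, String type) {
        Vector cached = cache.get(column);
        if (cached != null && cached.type.equals(type)) {
            return cached;                  // vector continuity: same object
        }
        Vector fresh = new Vector(type);    // first request, or a real type change
        cache.put(column, fresh);
        return fresh;
    }
}
```

   If the reported bug is real, something is taking the "fresh vector" path 
even though the requested type matches the cached one.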
   
   Let me ask another question. When this occurs, is it in the first batch of 
the second (or later) reader? If so, then that reader should have pulled the 
vector from the cache. The question is, why didn't it?
   
   If the error occurs within the first reader, then the bug is more subtle. 
How could the JSON reader accomplish that? The `ResultSetLoader` hides all that 
detail, and it defers to the `ResultVectorCacheImpl` for vector continuity.
   
   Yet another question is why these tests didn't fail 3 months ago when the 
tests first ran. And, why did the tests pass way back when I wrote this code? 
Did anything change elsewhere that could trigger this? I don't see how, but we 
should probably ask the question. 
   
   Still another possibility is that there has always been a schema change, 
that the prior code failed to report it (by returning an `OK_NEW_SCHEMA` 
status), and yet somehow the hash agg handled that case. If so, then there is a 
whole chain of bugs. That would not surprise me: I had to fix bug after bug 
after bug to get the EVF stuff working; often someone broke something in one 
place to work around something broken in another. When I fixed one, the other 
failed when it should not have. Loads of fun.
   
   Fortunately, these are unit tests, so we should be able to sort out the 
issues without too much trouble.
   
   I'll also note that there were a few fixes to EVF V2 that I made in my PR, 
but those were around a subtle bug in implicit columns.
   
   Suggestion: give the above a shot to see if you can see the problem. 
Otherwise, tomorrow I'll try to squeeze in time to grab this branch and do some 
debugging. After all, I wrote this stuff originally, so I should try to make it 
work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

