AWESOME! I had just been in the process of writing up a long user story asking for and supporting exactly this. I modified it and have included it here:
To start out, I want to say how much I love the Drill project and the potential it has. I've put this together based on my experiences and want to contribute a perspective as a user, not just put a bunch of critiques in an email. I hope it's all taken in that spirit. An additional note: I wrote this prior to seeing the design document shared by Hsuan Yi Chu yesterday. If you are reading it and think to yourself "that wording is odd…", please consider it from the "I didn't want to throw away the user story" perspective and the "I wrote it before the design doc" perspective. I also understand that some of what I am suggesting may not be easy from a development perspective. I am just being upfront about my experience so we can look to determine what can be done; I am not looking for a silver bullet here, just looking for improvement. Some suggestions may be as simple as better documentation; others may be harder to implement. Either way, I thought a verbose user story might be useful to the community as a whole.

John

*User Story*

As I have been working with Drill for data exploration, I came across multiple "things" that were just hard. Some data, especially JSON data, can be ugly, and scaled ugly is even worse! For this story, I am working with a JSON dump from MongoDB. You would think it would be well structured, and for the most part it is. There were some application-level mistakes made (I will go into that in a moment), but in general Drill handles this well. With this data set, there are a few main challenges I am seeing:

1. A field holds a float, and then a later record has the number 0 in it (which Drill takes as an INT). This is a known problem and one that Drill has a solution for.
2. A field is of one type (a map), and then a later record has a string in it. No easy solution here.
3. SELECT * when there is a JSON field with a . in its name. I won't go into details here, but I feel this factors into data exploration because it changes the user's ability to "stay in Drill" to explore their data (https://issues.apache.org/jira/browse/DRILL-3922).
4. Error reporting challenges.

With the problem summary laid out, I want to walk through my process in working with this data and point out where Drill could have been much more helpful to a user. Here is the process I went through:

1. Copy the data into the filesystem.
2. Use Drill to run "select * from `path_to/dump.json` limit 1" (I just want to see what it looks like!).

Here I get this error:

> select * from `path_to/dump.json` limit 1;
Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableFloat8WriterImpl.
File /data/dev/path_to/dump.json
Record 1
Line 1
Column 9054
Field entropy
Fragment 0:0

This isn't incredibly helpful from a user perspective. When I Googled around, I realized that the docs talk about "Schema Changes" and that one possible fix is the setting below. However, showing examples of the data that Drill was trying to handle (with its implied type) would help users grok what is happening. At least in this case it showed me the field name!

ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;

This is a great example of where the error message could be more helpful: we have a known failure case (the numbers are doubles, but someone stores a 0, which is read as an INT), and the devs have already added a setting to let a user get through it, so the error could point the user at that setting.
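To make the first problem concrete, here is a hypothetical pair of records (the values are invented; only the entropy field name comes from the error above) of the kind that triggers this:

{"name": "a", "entropy": 3.7}
{"name": "b", "entropy": 0}

As I understand it, Drill types the field as a double from the earlier value, then hits the record where the value is 0, tries to write it as an integer (BigInt), and fails unless read_numbers_as_double is set.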
In this case, showing two record numbers (line numbers) with different types, the field values with their implied types, and perhaps a suggestion about using the setting to address the problem could make it more intuitive for the user to stay in Drill, and stay in the data. As it was, I looked at the head of the file, saw the issue, and was able to proceed.

Also, as a corollary, the user documentation does not show this error in connection with the schema change problem. That would be a great place to state, "if you see an error that looks like X, this is what is happening and what you can do about it."

*Side note on documentation*

We should look to make the documentation role based. In this case, the documentation says to use "ALTER SYSTEM". I would argue, and I am guessing others would concur, that for this use case "ALTER SESSION" may be a better suggestion: it is a specific alteration to address loading/querying a specific data set, and it is likely done by a user of the system. If a user is doing self-serve data in an enterprise environment, they may not have the ability to use ALTER SYSTEM, will get an error, and may be confused about how to proceed. In addition, ALTER SYSTEM issued by a user who doesn't understand what they are changing, yet has the rights to change it, may introduce future data problems they didn't expect. I like that the default is the more restrictive method, because it makes people be explicit about their data, but the documentation should also aim to be explicit about something like a system-wide change.

*Back to the story*

OK, so now I ALTER SESSION SET the read_numbers_as_double setting and run the query again.

> select * from `path_to/dump.json` limit 1;
Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when you are using a ValueWriter of type SingleMapWriter.
File /data/dev/path_to/dump.json
Record 4009
Fragment 0:0

Another error. But what does this one mean? Now that I have been living in the docs and on the Drill user list, and because it's similar to the schema change issue, I know that's what we are looking at here. Instead of double vs. int, we have one field that is a map most of the time and, in some cases, a string. But this doesn't really help me as a user, and Drill doesn't offer any options to troubleshoot it. This file is 500 MB of dense, nested JSON data with 51k records. My solution? I took the record number and went to my NFS-mounted clustered file system (thank goodness I had MapR here; without that NFS mount I am not sure how I would have used Posix tools for this).

My command:

$ head -4009 dump.json | tail -1

That (I hoped) showed me the record in question. Note that the error from Drill didn't tell me which field was at fault, so I had to visually align things to figure that out. However, I was able to spot the difference and work with the developer to understand why it happened. I removed those records, and things worked correctly. Could there have been a way to identify that within Drill? My solution was a Python script (sketched below) that read through the file and discarded the records where that field was not a map. On 500 MB that can work, but what about 500 GB? I guess a Spark job could clean the data… but could Drill be given some tools to help with this situation? For example, the first thing I asked was: what field is at issue? I had no way to see that inside Drill. I had to use other tools to look at the data so I could understand the problem.
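For reference, here is a minimal sketch of the kind of Python clean-up script I mean. The field name and file names are hypothetical placeholders, and it assumes one JSON record per line, as in my MongoDB dump:

import json

# Hypothetical names: the offending field and the file names are placeholders,
# not the real ones from my data set.
FIELD = "details"
SRC = "dump.json"
DST = "dump_clean.json"

kept = dropped = 0
with open(SRC) as src, open(DST, "w") as dst:
    for line in src:
        if not line.strip():
            continue
        record = json.loads(line)
        # Keep the record only when the field is missing or is a map (a JSON
        # object); drop the records where it came through as a string.
        if FIELD not in record or isinstance(record[FIELD], dict):
            dst.write(line)
            kept += 1
        else:
            dropped += 1

print("kept %d records, dropped %d" % (kept, dropped))

That works, but it is exactly the kind of step I would rather not have to take outside of Drill.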
Then, once I understood the problem, I had to use Python along those lines to produce data that was queryable. Given the design document Hsuan Yi Chu just posted to the mailing list, at this point my post is simply a user story to support that document. To summarize the points I'd like to see included in the design document (from a user perspective, without prescribing "how or why"):

1. *Error messages that are more verbose in explaining the problem*
   a. Filename, row number, column number or name.
   b. An option to output the "offending row".
   c. Showing the data that is causing the error WITH the type Drill inferred.
   d. If there are options to help work through dirty data, perhaps the error message could include those: "Data was a double, then Drill found this data: 0 (an int) in file X, at row 24, in column "myfloatingdata"; consider using store.json.read_numbers_as_double to address the issue."

2. *A way to determine how common this exception is*
   a. If I am playing with a messy data set and this error happens, does it happen on 1 record? 2? 5000? Knowing that would:
      i. Help users understand how Drill is seeing that particular column.
      ii. Help users decide on excluding data rather than just removing it. What if the first 10 records were errors, and you then excluded the remaining 10 million because they were correct yet different from the first 10?
   b. Perhaps there could be a "stats" function that only works if it's the only selected item, or if everything selected is one of those stats functions?
      i. Select type_stats(fieldname) from data
      ii. (that wouldn't error on different types)

3. *An ability to say "return null for this field on error, or if it is not castable to type X", especially in a view, perhaps via a function*
   a. Allow users to avoid reparsing data outside Drill.
   b. Allow loading data into a sane format (one-time loads/ETL to clean data).
   c. Not be a system- or session-wide exception.
      i. I think this is important because I may have a field where I want the numbers read as double, but another field in the same dataset where I don't. A SYSTEM or SESSION level variable takes away that granularity.
   d. Select field1, CASTORNULL(field2, int) as field2, CASTORNULL(field3, double) as field3 from ugly_data
   e. That's an example in the select, but I could also see it in a where clause:
   f. Select field1, field2, field3 from ugly_data where ISTYPE(field2, int) and ISTYPE(field3, double)

4. *Updating the documentation related to ALTER SESSION vs. ALTER SYSTEM with an eye to the majority use case of the documented feature*
   a. For data loads, the documentation uses ALTER SYSTEM, and that's problematic because:
      i. Not all users have the privileges to issue an ALTER SYSTEM. A new user trying to figure things out may not realize they can just ALTER SESSION after getting an ALTER SYSTEM error.
      ii. ALTER SYSTEM on data loading items, especially ones that make Drill's data interpretation more permissive, can lead to unintended consequences later. An admin, who may be a good systems admin helping a data user troubleshoot an error, may issue an ALTER SYSTEM without realizing this changes all future data imports.
   b. Note: I found a few cases, but I would suggest a thorough review of the various use cases throughout the documentation, and in areas where it really could be either command, a small paragraph indicating the ramifications of each.

5. *A Philosophy within the Drill Community to "Stay in Drill" for data exploration*
   a. This is obviously not so much a development item as a mindset. If someone says "I tried to do X, and I got an error" and the community's response is Y, where Y is "look through your data and do Z to it so Drill can read it", then we should reconsider that scenario and try to provide an option within Drill to handle the edge case intuitively. This is difficult.
   b. There are cases even in the documentation where this happens: https://drill.apache.org/docs/json-data-model/ discusses arrays at the root level and reading some empty arrays. In these cases, we have to leave Drill to fix the problem. That works on small data, but may not work on large or wide data. Consider the array-at-root-level limitation: what if some process out of the user's control produces 1000 JSON files of 100 MB each and we want to read them? To fix it, we have to modify all of those files. That is a lot of work, either manual or automated.
   c. Once again, I know this isn't easy, but we shouldn't answer questions about how to do something by saying "fix this outside of Drill so Drill can read your data" if we can at all avoid it.

I hope this story helps support the design document presented. I am happy to participate in more discussion around these topics, as I have enjoyed digging into the internals of Drill.

John Omernik