AWESOME! I had just been in the process of writing up a long user story asking for and supporting exactly this. I modified it and have included it here:
To start out, I want to say how much I love the Drill project and the potential it has. I've put this together based on my experiences and want to contribute a perspective as a user, not just put a bunch of critiques in an email. I hope it's all taken in that spirit. An additional note: I wrote this prior to seeing the design document shared by Hsuan Yi Chu yesterday. If you are reading it and think to yourself "that wording is odd…", please consider it from the "I didn't want to throw away the user story" perspective and the "I wrote it before the design doc" perspective. I also understand that some of what I am suggesting may not be easy from a development perspective. I am just being upfront about my experience so we can look to determine what can be done; I am not looking for a silver bullet here, just looking for improvement. Some suggestions may be as simple as better documentation; others may be harder to implement. Either way, I thought a verbose user story might be useful to the community as a whole.

John

*User Story*

As I have been working with Drill for data exploration, I came across multiple "things" that were just hard. Some data, especially JSON data, can be ugly, and scaled ugly is even worse! For this story, I am working with a JSON dump from MongoDB. You would think it would be well structured, and for the most part it is. There were some application-level mistakes made (I will go into that in a moment), but in general Drill handles this well. With this data set, there are a few main challenges I am seeing:

1. A field holds a float, and then a later record has the number 0 in it (which Drill takes as an INT). This is a known problem and one that Drill has a solution for.
2. A field is of one type (a map), and then a later record has a string in it. No easy solution here.
3. SELECT * when there is a JSON field with a . in its name. I won't go into details here, but I feel this factors into data exploration because it changes the user's ability to "stay in Drill" to explore their data (https://issues.apache.org/jira/browse/DRILL-3922).
4. Error reporting challenges.

With the problem summary laid out, I want to walk through my process in working with this data and point out where Drill could have been much more helpful to a user. Here is the process I went through:

1. Copy the data into the filesystem.
2. Use Drill to run "select * from `path_to/dump.json` limit 1" (I just want to see what it looks like!).

Here I get this error:

> select * from `path_to/dump.json` limit 1;
Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableFloat8WriterImpl.
File /data/dev/path_to/dump.json
Record 1
Line 1
Column 9054
Field entropy
Fragment 0:0

This isn't incredibly helpful from a user perspective. When I Googled around, I realized that the docs talk about "Schema Changes" and that one possible fix is the setting below. However, showing examples of the data that Drill was trying to handle (with its implied type) would help users grok what is happening. At least in this case it showed me the field name!

ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;

This is a great example of where the error message could be more helpful: we have a known failure case (the numbers are doubles, but someone stores a 0, which is read as an INT), and the devs have already added a setting to let a user get through it, so the error could point the user at that setting.
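To make the first problem concrete, here is a hypothetical pair of records (the values are invented; only the entropy field name comes from the error above) of the kind that triggers this:

{"name": "a", "entropy": 3.7}
{"name": "b", "entropy": 0}

As I understand it, Drill types the field as a double from the earlier value, then hits the record where the value is 0, tries to write it as an integer (BigInt), and fails unless read_numbers_as_double is set.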
In this case, showing two record numbers (line numbers) with different types, the field values with their implied types, and perhaps a suggestion about using the setting to address the problem could make it more intuitive for the user to stay in Drill, and stay in the data. As it was, I looked at the head of the file, saw the issue, and was able to proceed.

Also, as a corollary, the user documentation does not show this error in connection with the schema change problem. That would be a great place to state, "if you see an error that looks like X, this is what is happening and what you can do about it."

*Side note on documentation*

We should look to make the documentation role based. In this case, the documentation says to use "ALTER SYSTEM". I would argue, and I am guessing others would concur, that for this use case "ALTER SESSION" may be a better suggestion: it is a specific alteration to address loading/querying a specific data set, and it is likely done by a user of the system. If a user is doing self-serve data in an enterprise environment, they may not have the ability to use ALTER SYSTEM, will get an error, and may be confused about how to proceed. In addition, ALTER SYSTEM issued by a user who doesn't understand what they are changing, yet has the rights to change it, may introduce future data problems they didn't expect. I like that the default is the more restrictive method, because it makes people be explicit about their data, but the documentation should also aim to be explicit about something like a system-wide change.

*Back to the story*

OK, so now I ALTER SESSION SET the read_numbers_as_double setting and run the query again.

> select * from `path_to/dump.json` limit 1;
Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when you are using a ValueWriter of type SingleMapWriter.
File /data/dev/path_to/dump.json
Record 4009
Fragment 0:0

Another error. But what does this one mean? Now that I have been living in the docs and on the Drill user list, and because it's similar to the schema change issue, I know that's what we are looking at here. Instead of double vs. int, we have one field that is a map most of the time and, in some cases, a string. But this doesn't really help me as a user, and Drill doesn't offer any options to troubleshoot it. This file is 500 MB of dense, nested JSON data with 51k records. My solution? I took the record number and went to my NFS-mounted clustered file system (thank goodness I had MapR here; without that NFS mount I am not sure how I would have used Posix tools for this).

My command:

$ head -4009 dump.json | tail -1

That (I hoped) showed me the record in question. Note that the error from Drill didn't tell me which field was at fault, so I had to visually align things to figure that out. However, I was able to spot the difference and work with the developer to understand why it happened. I removed those records, and things worked correctly. Could there have been a way to identify that within Drill? My solution was a Python script (sketched below) that read through the file and discarded the records where that field was not a map. On 500 MB that can work, but what about 500 GB? I guess a Spark job could clean the data… but could Drill be given some tools to help with this situation? For example, the first thing I asked was: what field is at issue? I had no way to see that inside Drill. I had to use other tools to look at the data so I could understand the problem.
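For reference, here is a minimal sketch of the kind of Python clean-up script I mean. The field name and file names are hypothetical placeholders, and it assumes one JSON record per line, as in my MongoDB dump:

import json

# Hypothetical names: the offending field and the file names are placeholders,
# not the real ones from my data set.
FIELD = "details"
SRC = "dump.json"
DST = "dump_clean.json"

kept = dropped = 0
with open(SRC) as src, open(DST, "w") as dst:
    for line in src:
        if not line.strip():
            continue
        record = json.loads(line)
        # Keep the record only when the field is missing or is a map (a JSON
        # object); drop the records where it came through as a string.
        if FIELD not in record or isinstance(record[FIELD], dict):
            dst.write(line)
            kept += 1
        else:
            dropped += 1

print("kept %d records, dropped %d" % (kept, dropped))

That works, but it is exactly the kind of step I would rather not have to take outside of Drill.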
Then, once I understood the problem, I had to use Python along those lines to produce data that was queryable. Given the design document Hsuan Yi Chu just posted to the mailing list, at this point my post is simply a user story to support that document. To summarize the points I'd like to see included in the design document (from a user perspective, without prescribing "how or why"):

1. *Error messages that are more verbose in explaining the problem*
   a. Filename, row number, column number or name.
   b. An option to output the "offending row".
   c. Showing the data that is causing the error WITH the type Drill inferred.
   d. If there are options to help work through dirty data, perhaps the error message could include those: "Data was a double, then Drill found this data: 0 (an int) in file X, at row 24, in column "myfloatingdata"; consider using store.json.read_numbers_as_double to address the issue."

2. *A way to determine how common this exception is*
   a. If I am playing with a messy data set and this error happens, does it happen on 1 record? 2? 5000? Knowing that would:
      i. Help users understand how Drill is seeing that particular column.
      ii. Help users decide on excluding data rather than just removing it. What if the first 10 records were errors, and you then excluded the remaining 10 million because they were correct yet different from the first 10?
   b. Perhaps there could be a "stats" function that only works if it's the only selected item, or if everything selected is one of those stats functions?
      i. Select type_stats(fieldname) from data
      ii. (that wouldn't error on different types)

3. *An ability to say "return null for this field on error, or if it is not castable to type X", especially in a view, perhaps via a function*
   a. Allow users to avoid reparsing data outside Drill.
   b. Allow loading data into a sane format (one-time loads/ETL to clean data).
   c. Not be a system- or session-wide exception.
      i. I think this is important because I may have a field where I want the numbers read as double, but another field in the same dataset where I don't. A SYSTEM or SESSION level variable takes away that granularity.
   d. Select field1, CASTORNULL(field2, int) as field2, CASTORNULL(field3, double) as field3 from ugly_data
   e. That's an example in the select, but I could also see it in a where clause:
   f. Select field1, field2, field3 from ugly_data where ISTYPE(field2, int) and ISTYPE(field3, double)

4. *Updating the documentation related to ALTER SESSION vs. ALTER SYSTEM with an eye to the majority use case of the documented feature*
   a. For data loads, the documentation uses ALTER SYSTEM, and that's problematic because:
      i. Not all users have the privileges to issue an ALTER SYSTEM. A new user trying to figure things out may not realize they can just ALTER SESSION after getting an ALTER SYSTEM error.
      ii. ALTER SYSTEM on data loading items, especially ones that make Drill's data interpretation more permissive, can lead to unintended consequences later. An admin, who may be a good systems admin helping a data user troubleshoot an error, may issue an ALTER SYSTEM without realizing this changes all future data imports.
   b. Note: I found a few cases, but I would suggest a thorough review of the various use cases throughout the documentation, and in areas where it really could be either command, a small paragraph indicating the ramifications of each.

5. *A Philosophy within the Drill Community to "Stay in Drill" for data exploration*
   a. This is obviously not so much a development item as a mindset. If someone says "I tried to do X, and I got an error" and the community's response is Y, where Y is "look through your data and do Z to it so Drill can read it", then we should reconsider that scenario and try to provide an option within Drill to handle the edge case intuitively. This is difficult.
   b. There are cases even in the documentation where this happens: https://drill.apache.org/docs/json-data-model/ discusses arrays at the root level and reading some empty arrays. In these cases, we have to leave Drill to fix the problem. That works on small data, but may not work on large or wide data. Consider the array-at-root-level limitation: what if some process out of the user's control produces 1000 JSON files of 100 MB each and we want to read them? To fix it, we have to modify all of those files. That is a lot of work, either manual or automated.
   c. Once again, I know this isn't easy, but we shouldn't answer questions about how to do something by saying "fix this outside of Drill so Drill can read your data" if we can at all avoid it.

I hope this story helps support the design document presented. I am happy to participate in more discussion around these topics, as I have enjoyed digging into the internals of Drill.

John Omernik