I'd like to propose a few things to solve this:

a) Functions should be able to express failures in a standardized way. I'm thinking of a new type of injectable and/or a certain type of exception (although that is more dangerous and possibly requires a rewrite, given the compilation model).
b) Users should be able to set an option (at session or system level) that controls how function errors are handled. Options could include failing the query, ignoring the error and reporting it as a warning/notice, and saving the records for later analysis (maybe in v2).
c) Readers that have a notorious problem (e.g. Text) should support projection/expression pushdown so that they can create these kinds of errors and provide additional context as part of that.
d) We should also implement dot drill files so that users can prescribe this projection/data validation process by default for files/directories (which would provide the same behavior as c above).
e) We should get more serious about providing useful virtual fields. This should include filename (similar to directory name).
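To make (b), (c) and (e) a bit more concrete, here is a rough sketch in plain Python rather than Drill code. Everything in it (OnError, CastError, QueryFailure, scan_and_cast) is a made-up name for illustration and not an existing Drill API or option. It only shows the idea: when the cast runs inside the reader, the file name, line and column index are still known, and a user-selected policy decides whether a bad value fails the query or is skipped and reported.

```python
import csv
import io
from dataclasses import dataclass
from enum import Enum


class OnError(Enum):
    """Hypothetical user-facing policy, standing in for a session/system option."""
    FAIL = "fail"  # abort on the first bad value
    WARN = "warn"  # skip the bad record and report it as a warning


@dataclass
class CastError:
    """Context that only the reader can supply cheaply."""
    filename: str
    line: int    # 1-based record number (equals the line number when no field spans lines)
    column: int  # 0-based index, matching columns[n] in the query
    value: str

    def __str__(self):
        return (f"{self.filename}:{self.line}: cannot cast "
                f"columns[{self.column}] value {self.value!r} to int")


class QueryFailure(Exception):
    """Raised when the policy is FAIL; the message carries the source context."""


def scan_and_cast(filename, text, on_error=OnError.FAIL):
    """Read CSV text and cast every column to int inside the 'reader'."""
    rows, warnings = [], []
    for line_no, record in enumerate(csv.reader(io.StringIO(text)), start=1):
        bad = None
        converted = []
        for col, value in enumerate(record):
            try:
                converted.append(int(value))
            except ValueError:
                bad = CastError(filename, line_no, col, value)
                break
        if bad is None:
            rows.append(converted)
        elif on_error is OnError.FAIL:
            raise QueryFailure(str(bad))
        else:
            warnings.append(bad)  # record is dropped, context is kept
    return rows, warnings


if __name__ == "__main__":
    data = "10,2001\n11,http://www.cnn.com\n"
    rows, warnings = scan_and_cast("t4.csv", data, OnError.WARN)
    print(rows)
    for w in warnings:
        print("WARNING:", w)
```

With the two-record t4.csv from Aman's mail, the WARN policy keeps the first record and reports the second one together with its file, line and column index. The FAIL policy raises an error with the same context instead of a bare NumberFormatException.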
Once a record leaves an operator, I don't think we should carry any additional provenance with it. It would be too heavyweight as a default behavior.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <[email protected]> wrote:

> Drill can point out the filename and location of corrupted records in a
> file but we don't have a good mechanism to deal with the following
> scenario:
>
> Consider a text file with 2 records:
> $ cat t4.csv
> 10,2001
> 11,http://www.cnn.com
>
> 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
>
> 0: jdbc:drill:zk=local> select cast(columns[0] as init), cast(columns[1] as
> bigint) from dfs.`/Users/asinha/data/t4.csv`;
>
> Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
>
> Fragment 0:0
>
> [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
>
> (java.lang.NumberFormatException) http://www.cnn.com
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
>
> The problem is user does not have a clue about the original source of this
> error. This is a pain point especially when dealing with thousands of
> files.
>
> 1. We can start by providing the column index where the problem occurred.
> 2. Can a scan batch keep track of the file it originated from ? Since the
>    Project in the above query is pushed right above the scan, it could get
>    the filename from the record batch (assuming we can store this piece of
>    information). This won't be possible for other Projects elsewhere in
>    the plan.
> 3. What about the location within the file ? Unless the projection is
>    pushed into the scan itself, I don't see a good way to provide this
>    information.
>
> A related topic is how to tell Drill to ignore such records when doing a
> query or a CTAS ?
> That could be a separate discussion.
>
> Thoughts ?
> Aman
>
