Is it possible for a record batch coming out of the scan to know the file name and the range of line numbers it covers?
The second one sounds difficult?

On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha wrote:

> Drill can point out the filename and location of corrupted records in a
> file but we don't have a good mechanism to deal with the following
> scenario:
>
> Consider a text file with 2 records:
> $ cat t4.csv
> 10,2001
> 11,http://www.cnn.com
>
> 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
>
> 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as bigint) from dfs.`/Users/asinha/data/t4.csv`;
>
> Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
>
> Fragment 0:0
>
> [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
>
> (java.lang.NumberFormatException) http://www.cnn.com
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
>
> The problem is that the user has no clue about the original source of
> this error. This is a pain point, especially when dealing with thousands
> of files.
>
> 1. We can start by providing the column index where the problem occurred.
> 2. Can a scan batch keep track of the file it originated from? Since the
>    Project in the above query is pushed right above the scan, it could
>    get the filename from the record batch (assuming we can store this
>    piece of information). This won't be possible for other Projects
>    elsewhere in the plan.
> 3. What about the location within the file? Unless the projection is
>    pushed into the scan itself, I don't see a good way to provide this
>    information.
>
> A related topic is how to tell Drill to ignore such records when doing a
> query or a CTAS. That could be a separate discussion.
>
> Thoughts?
> Aman
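
To make points 1 to 3 a bit more concrete, here is a rough, standalone sketch of a batch that carries its file name and line range, so that a cast failure can be reported with file, line and column. None of these classes exist in Drill: `BatchProvenance`, `Provenance`, `TextBatch` and `castColumnToLong` are hypothetical names made up for illustration, and the sketch assumes one record per line (no multi-line quoted fields), which is exactly where the line-number part gets difficult.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; these are not Drill classes. */
final class BatchProvenance {

    /** Source file plus the 1-based, inclusive line range covered by one scan batch. */
    static final class Provenance {
        final String fileName;
        final long firstLine;
        final long lastLine;

        Provenance(String fileName, long firstLine, long lastLine) {
            this.fileName = fileName;
            this.firstLine = firstLine;
            this.lastLine = lastLine;
        }

        /** Maps a 0-based row index in the batch to a line number in the file. */
        long lineOf(int rowIndex) {
            long line = firstLine + rowIndex;
            if (rowIndex < 0 || line > lastLine) {
                throw new IndexOutOfBoundsException("row " + rowIndex + " is outside the batch");
            }
            return line;
        }
    }

    /** A batch of raw text rows tagged with where they came from. */
    static final class TextBatch {
        final Provenance source;
        final List<String[]> rows;

        TextBatch(Provenance source, List<String[]> rows) {
            this.source = source;
            this.rows = rows;
        }
    }

    /** Casts one column of every row to long; failures name the file, line and column. */
    static long[] castColumnToLong(TextBatch batch, int column) {
        long[] out = new long[batch.rows.size()];
        for (int i = 0; i < out.length; i++) {
            String value = batch.rows.get(i)[column];
            try {
                out[i] = Long.parseLong(value);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException(String.format(
                        "Cannot cast '%s' to BIGINT at %s, line %d, column %d",
                        value, batch.source.fileName, batch.source.lineOf(i), column), e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The two records from t4.csv in the quoted message.
        TextBatch batch = new TextBatch(
                new Provenance("t4.csv", 1, 2),
                Arrays.asList(new String[] {"10", "2001"},
                              new String[] {"11", "http://www.cnn.com"}));
        try {
            castColumnToLong(batch, 1);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This only works while the operator doing the cast still sees the batch exactly as the scan produced it, which matches Aman's point that a Project sitting directly above the scan could use the information but Projects elsewhere in the plan could not.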
