It seems hard to output the file name and record number unless the cast happens exactly at the Scan operator. Otherwise, the input of a Project could be any operator, including Scan, and it's hard to track the source of a record through a chain of operators.
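To illustrate why the scan is the natural place for this, here is a minimal Python sketch (not Drill code). It assumes the cast runs inside the loop that reads each file, so the file name and line number are still in scope when the cast fails. The function name `scan_with_cast` and the in-memory `files` dict are hypothetical and only stand in for a real scan over files on disk.

```python
import csv
import io


def scan_with_cast(files, column, cast=int):
    """Yield the cast value of `column` for every row of every file.

    `files` maps a file name to its CSV text (a stand-in for real files).
    Because the cast happens while scanning, a failure can be reported
    with the file name and the 1-based line number of the bad record.
    """
    for name, text in files.items():
        for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
            try:
                yield cast(row[column])
            except ValueError as exc:
                raise ValueError(
                    f"{name}:{line_no}: cannot cast column {column} "
                    f"value {row[column]!r}"
                ) from exc


if __name__ == "__main__":
    # Same two records as the t4.csv example in the quoted message.
    files = {"t4.csv": "10,2001\n11,http://www.cnn.com\n"}
    try:
        print(list(scan_with_cast(files, column=1)))
    except ValueError as err:
        print(err)
```

Once the rows leave this loop and flow through other operators, the file name and line number are gone unless they are carried along with each record or batch, which is the difficulty described above.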

I checked how Postgres reports the error in a similar case. Suppose I have two columns:

  empno : integer
  ename : character varying(20)

mydb=# select empno, ename from emp;
 empno |   ename
-------+------------
   100 | John Jones
   200 | 200
(2 rows)

mydb=# select empno, cast(ename as int) from emp;
ERROR:  invalid input syntax for integer: "John Jones"

Given the casting error, Postgres does not report where the offending record is. The user has to run a query like "select * from emp where ename = 'John Jones'" to figure out which record it is.

In Drill, one could probably use a similar approach to locate the error record:

select dir0, file_name, columns[1] where columns[1] = 'http://www.cnn.com';

(supposing we extend the dir* columns to include file_name as well).

On Tue, Sep 1, 2015 at 10:46 AM, Hsuan Yi Chu wrote:

> Is it possible to let record batch from scan to know "file name" and "the
> range of line numbers in this batch"?
>
> The second one sounds difficult ?
>
> On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha wrote:
>
> > Drill can point out the filename and location of corrupted records in a
> > file but we don't have a good mechanism to deal with the following
> > scenario:
> >
> > Consider a text file with 2 records:
> > $ cat t4.csv
> > 10,2001
> > 11,http://www.cnn.com
> >
> > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> >
> > 0: jdbc:drill:zk=local> select cast(columns[0] as init), cast(columns[1] as
> > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> >
> > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> >
> > Fragment 0:0
> >
> > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> >
> >   (java.lang.NumberFormatException) http://www.cnn.com
> >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> >     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> >
> > The problem is user does not have a clue about the original source of this
> > error. This is a pain point especially when dealing with thousands of
> > files.
> >
> > 1. We can start by providing the column index where the problem occurred.
> > 2. Can a scan batch keep track of the file it originated from ? Since the
> >    Project in the above query is pushed right above the scan, it could get
> >    the filename from the record batch (assuming we can store this piece of
> >    information). This won't be possible for other Projects elsewhere in
> >    the plan.
> > 3. What about the location within the file ? Unless the projection is
> >    pushed into the scan itself, I don't see a good way to provide this
> >    information.
> >
> > A related topic is how to tell Drill to ignore such records when doing a
> > query or a CTAS ?
> > That could be a separate discussion.
> >
> > Thoughts ?
> > Aman
