@Jinfeng,

> select dir0, file_name, columns[1] where columns[1] = 'http://www.cnn.com';
> (suppose we extend the dir* and include file_name as well)
Getting the file_name is where the details come in. Supporting the select of
file_name as a special column is more work than what I was initially proposing.

@Jason, yes, propagating the file name past joins and other operators seems
like too much overhead. In my email, item #2, I was only intending this for
the Project that has been pushed right above the Scan. This addresses the
vast majority of use cases.

Aman

On Wed, Sep 2, 2015 at 1:39 PM, Jason Altekruse <[email protected]> wrote:

> While I initially wanted to say I agree with Jinfeng, I think the user
> experience should be better than having to essentially run the query twice.
> It doesn't seem like a crazy idea for Drill to carry around a source table
> for a batch of records (maybe even an offset/line number in the scan if it
> is known), and to continue to propagate it as long as the data has not been
> modified. It would get a little hairier when we look at the case of a join,
> where most of the columns would be simple copies of the incoming data. If
> an expression is evaluated against one of these columns downstream, which
> could fail with one of these poor error messages, we should still be able
> to propagate the source table information through the join, but
> unfortunately that means we would have to store a list of source tables and
> a map to particular columns with each batch.
>
> This would not be a trivial task, but this might be a common enough pain
> point that it would be worth at least trying to come up with a design and
> work estimate for a decent solution.
>
> On Tue, Sep 1, 2015 at 12:14 PM, Jinfeng Ni <[email protected]> wrote:
>
> > It seems hard to output the filename and # of records unless the cast
> > happens exactly at the Scan operator. Otherwise, the input of Project
> > could be any operator, including Scan. It's hard to track the source of
> > a record in the chain of operators.
> >
> > I checked how Postgres reported the error in a similar case.
> > Suppose I have two columns:
> >
> >   empno : integer
> >   ename : character varying(20)
> >
> >   mydb=# select empno, ename from emp;
> >    empno |   ename
> >   -------+------------
> >      100 | John Jones
> >      200 | 200
> >   (2 rows)
> >
> >   mydb=# select empno, cast(ename as int) from emp;
> >   ERROR:  invalid input syntax for integer: "John Jones"
> >
> > Given the casting error, Postgres did not report where the error record
> > is. The user has to use a query like "select * from emp where ename =
> > 'John Jones'" to figure out which record it is.
> >
> > In Drill, one could probably use a similar approach to locate the error
> > record:
> >
> >   select dir0, file_name, columns[1] where columns[1] =
> >   'http://www.cnn.com';
> >
> > (suppose we extend the dir* and include file_name as well).
> >
> > On Tue, Sep 1, 2015 at 10:46 AM, Hsuan Yi Chu <[email protected]> wrote:
> >
> > > Is it possible to let a record batch from scan know the "file name"
> > > and "the range of line numbers in this batch"?
> > >
> > > The second one sounds difficult?
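The manual workaround Jinfeng describes (re-scan the data and filter for the
offending value) can also be approximated client-side, which additionally
recovers the line numbers Hsuan Yi asks about. Below is a minimal sketch in
plain Java, not Drill code; `BadRecordLocator` and `findBadLines` are
hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: given the raw lines of a CSV file, report the
// 1-based line numbers whose given column would fail a cast to bigint.
public class BadRecordLocator {

    public static List<Integer> findBadLines(List<String> csvLines, int column) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < csvLines.size(); i++) {
            // -1 keeps trailing empty fields so column indexes stay stable
            String[] cols = csvLines.get(i).split(",", -1);
            try {
                Long.parseLong(cols[column].trim());
            } catch (NumberFormatException e) {
                bad.add(i + 1);   // record the offending line, 1-based
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // The two records from t4.csv in the original message.
        List<String> lines = List.of("10,2001", "11,http://www.cnn.com");
        System.out.println(findBadLines(lines, 1)); // → [2]
    }
}
```

Like the "select * from emp where ename = ..." query in Postgres, this is a
second pass over the data; it only shifts the locate step outside the engine.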
> > > On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <[email protected]> wrote:
> > >
> > > > Drill can point out the filename and location of corrupted records
> > > > in a file, but we don't have a good mechanism to deal with the
> > > > following scenario:
> > > >
> > > > Consider a text file with 2 records:
> > > >
> > > >   $ cat t4.csv
> > > >   10,2001
> > > >   11,http://www.cnn.com
> > > >
> > > >   0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> > > >
> > > >   0: jdbc:drill:zk=local> select cast(columns[0] as int),
> > > >   cast(columns[1] as bigint) from dfs.`/Users/asinha/data/t4.csv`;
> > > >
> > > >   Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> > > >
> > > >   Fragment 0:0
> > > >
> > > >   [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> > > >
> > > >   (java.lang.NumberFormatException) http://www.cnn.com
> > > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > > >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > > >     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > > >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> > > >
> > > > The problem is that the user does not have a clue about the original
> > > > source of this error. This is a pain point especially when dealing
> > > > with thousands of files.
> > > >
> > > > 1. We can start by providing the column index where the problem
> > > >    occurred.
> > > > 2. Can a scan batch keep track of the file it originated from? Since
> > > >    the Project in the above query is pushed right above the scan, it
> > > >    could get the filename from the record batch (assuming we can
> > > >    store this piece of information).
> > > >    This won't be possible for other Projects elsewhere in the plan.
> > > > 3. What about the location within the file? Unless the projection is
> > > >    pushed into the scan itself, I don't see a good way to provide
> > > >    this information.
> > > >
> > > > A related topic is how to tell Drill to ignore such records when
> > > > doing a query or a CTAS? That could be a separate discussion.
> > > >
> > > > Thoughts?
> > > > Aman
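The item #2 proposal above (and Jason's batch-level source tracking) can be
sketched in miniature: a scanned batch carries the name of its source file
plus the row offset of its first record, so an operator sitting directly
above the Scan can attach that context when a cast fails. This is not actual
Drill code; `BatchSource`, `Batch`, and `castToBigint` are hypothetical names
illustrating the idea:

```java
import java.util.List;

// Minimal sketch of per-batch provenance, assuming a single unmodified
// source file per batch (the case the Project-above-Scan proposal covers).
public class BatchProvenanceSketch {

    // File name and starting row offset carried alongside the batch.
    record BatchSource(String fileName, long firstRowOffset) {}

    // A tiny stand-in for one varchar column of a scanned batch.
    record Batch(BatchSource source, List<String> column) {}

    // Evaluates CAST(col AS BIGINT); on failure, rethrows with the file
    // name and 1-based row number attached, as proposed in the thread.
    static long[] castToBigint(Batch batch) {
        long[] out = new long[batch.column().size()];
        for (int i = 0; i < out.length; i++) {
            String v = batch.column().get(i);
            try {
                out[i] = Long.parseLong(v);
            } catch (NumberFormatException e) {
                throw new NumberFormatException(
                    v + " (file: " + batch.source().fileName()
                      + ", row: " + (batch.source().firstRowOffset() + i + 1) + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Batch batch = new Batch(new BatchSource("t4.csv", 0),
                                List.of("2001", "http://www.cnn.com"));
        try {
            castToBigint(batch);
        } catch (NumberFormatException e) {
            // The failing value now carries its file and row context.
            System.out.println(e.getMessage());
        }
    }
}
```

As the thread notes, this only works while the batch still maps one-to-one to
a source file; past a join, each batch would need a list of sources and a
per-column map, which is where the overhead concern comes in.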
