Re: Identifying the source of problematic records

Jinfeng Ni Tue, 01 Sep 2015 12:14:41 -0700

It seems hard to output the file name and record number unless the cast
happens exactly at the Scan operator. Otherwise, the input of Project could
be any operator, including Scan, and it is hard to track the source of a
record through the chain of operators.


I checked how Postgres reports the error in a similar case. Suppose I
have a table emp with two columns:

empno   : integer
ename   : character varying(20)


mydb=# select empno, ename from emp;
 empno |   ename
-------+------------
   100 | John Jones
   200 | 200
(2 rows)

mydb=# select empno, cast(ename as int) from emp;
ERROR:  invalid input syntax for integer: "John Jones"

Given the casting error, Postgres did not report which record caused it.
The user has to run a query like "select * from emp where ename = 'John Jones'"
to figure out which record it is.

In Drill, one could probably use a similar approach to locate the error
record:

select dir0, file_name, columns[1] from dfs.`/Users/asinha/data/t4.csv`
where columns[1] = 'http://www.cnn.com';

(This supposes we extend the dir* columns to include file_name as well.)
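
Until something like file_name exists, the same lookup can be done outside
Drill. Below is a minimal Python sketch, not Drill code: it scans CSV text,
tries the integer cast on one column, and reports the file name, line number
and offending value. The function name find_bad_casts and the in-memory
"sources" dict are made up for illustration; real use would read the files
from disk.

```python
import csv
import io


def find_bad_casts(sources, column_index):
    """Yield (file_name, line_number, value) for values that fail int().

    `sources` maps a file name to its CSV text. This is a hypothetical
    stand-in for reading thousands of real files from a directory.
    """
    for file_name, text in sources.items():
        reader = csv.reader(io.StringIO(text))
        for row in reader:
            if column_index >= len(row):
                continue  # short row: nothing to cast in this column
            value = row[column_index]
            try:
                int(value)
            except ValueError:
                # reader.line_num is the number of source lines read so far
                yield file_name, reader.line_num, value


# Same two records as t4.csv in the quoted message below.
sources = {"t4.csv": "10,2001\n11,http://www.cnn.com\n"}

bad = list(find_bad_casts(sources, 1))
for file_name, line_number, value in bad:
    print(f"{file_name}:{line_number}: cannot cast {value!r} to integer")
```

This only helps when the cast is applied directly to raw file columns, which
is the same limitation as doing the check at Scan.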




On Tue, Sep 1, 2015 at 10:46 AM, Hsuan Yi Chu <[email protected]> wrote:

> Is it possible to let the record batch from the scan know the "file name"
> and "the range of line numbers in this batch"?
>
> The second one sounds difficult?
>
> On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <[email protected]> wrote:
>
> > Drill can point out the filename and location of corrupted records in a
> > file but we don't have a good mechanism to deal with the following
> > scenario:
> >
> > Consider a text file with 2 records:
> > $ cat t4.csv
> > 10,2001
> > 11,http://www.cnn.com
> >
> > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> >
> > 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as
> > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> >
> > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> >
> > Fragment 0:0
> >
> > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> >
> >   (java.lang.NumberFormatException) http://www.cnn.com
> >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> >     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> >
> > The problem is that the user does not have a clue about the original
> > source of this error.  This is a pain point especially when dealing with
> > thousands of files.
> >
> > 1.  We can start by providing the column index where the problem occurred.
> > 2.  Can a scan batch keep track of the file it originated from?  Since the
> >     Project in the above query is pushed right above the scan, it could get
> >     the filename from the record batch (assuming we can store this piece of
> >     information).  This won't be possible for other Projects elsewhere in
> >     the plan.
> > 3.  What about the location within the file?  Unless the projection is
> >     pushed into the scan itself, I don't see a good way to provide this
> >     information.
> >
> > A related topic is how to tell Drill to ignore such records when doing a
> > query or a CTAS ?
> > That could be a separate discussion.
> >
> > Thoughts ?
> > Aman
> >
>
