Re: Identifying the source of problematic records

Jason Altekruse Wed, 02 Sep 2015 13:39:57 -0700

While I initially wanted to say I agree with Jinfeng, I think the user
experience should be better than having to essentially run the query twice.
It doesn't seem crazy idea for Drill to carry around a source table for a
batch of records (maybe even an offset/line number in the scan if it is
known), and to continue to propagate it as long as the data has not been
modified. It would get a little hairier when we look at the case of join,
where most of the columns would be simple copies of the incoming data. If
an expression is evaluated against one of these columns downstream, which
could fail with one of these poor error messages, we should still be able
to propagate the source table information through the join, but
unfortunately that means we would have to store a list of source tables and
a map to particular columns with each batch.


This would not be a trivial task, but this might be a common enough pain
point that it would be worth at least trying to come up with a design and
work estimate for a decent solution.

On Tue, Sep 1, 2015 at 12:14 PM, Jinfeng Ni <[email protected]> wrote:

> It seems hard to output the filename, # of records, unless the cast happens
> exactly at Scan operator. Otherwise, the input of Project could be any
> operator, including Scan. It's hard to track the source of record in the
> chain of operators.
>
> I checked how Postgres reported the error in a similar case. Suppose I
> have two columns :
>
> empno   : integer
> ename   : character varying(20)
>
>
> mydb=# select empno, ename from emp;
>  empno |   ename
> -------+------------
>    100 | John Jones
>    200 | 200
> (2 rows)
>
> mydb=# select empno, cast(ename as int) from emp;
> ERROR:  invalid input syntax for integer: "John Jones"
>
> Given the casting error, postgres did not report where the error record is.
> Use has to use a query like "select * from emp where ename = 'John Jones"
> to figure out which record.
>
> In Drill, one probably could use similar approach to locate the error
> record.
>
> select dir0, file_name, columns[1] where columns[1]  = 'http://www.cnn.com
> ';
>
>  (suppose we extend the dir* and include file_name as well).
>
>
>
>
> On Tue, Sep 1, 2015 at 10:46 AM, Hsuan Yi Chu <[email protected]> wrote:
>
> > Is it possible to let record batch from scan to know "file name" and "the
> > range of line numbers in this batch"?
> >
> > The second one sounds difficult ?
> >
> > On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <[email protected]> wrote:
> >
> > > Drill can point out the filename and location of corrupted records in a
> > > file but we don't have a good mechanism to deal with the following
> > > scenario:
> > >
> > > Consider a text file with 2 records:
> > > $ cat t4.csv
> > > 10,2001
> > > 11,http://www.cnn.com
> > >
> > > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> > >
> > > 0: jdbc:drill:zk=local> select cast(columns[0] as init),
> cast(columns[1]
> > as
> > > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> > >
> > > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> > >
> > > Fragment 0:0
> > >
> > > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010
> ]
> > >
> > >   (java.lang.NumberFormatException) http://www.cnn.com
> > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > >
> > >
> >
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > >
> >  org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > >
> > >
> >
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> > >
> > > The problem is user does not have a clue about the original source of
> > this
> > > error.  This is a pain point especially when dealing with thousands of
> > > files.
> > >
> > > 1.  We can start by providing the column index where the problem
> > occurred.
> > > 2.  Can a scan batch keep track of the file it originated from ? Since
> > the
> > > Project in the
> > >      above query is pushed right above the scan, it could get the
> > filename
> > > from the record
> > >      batch (assuming we can store this piece of information).  This
> won't
> > > be possible
> > >      for other Projects elsewhere in the plan.
> > > 3.  What about the location within the file ?   Unless the projection
> is
> > > pushed into the scan
> > >      itself, I don't see a good way to provide this information.
> > >
> > > A related topic is how to tell Drill to ignore such records when doing
> a
> > > query or a CTAS ?
> > > That could be a separate discussion.
> > >
> > > Thoughts ?
> > > Aman
> > >
> >
>

Re: Identifying the source of problematic records

Reply via email to