@Jinfeng,
  > select dir0, file_name, columns[1] where columns[1] = 'http://www.cnn.com';
  > "(suppose we extend the dir* and include file_name as well)"

Getting the file_name is where the details come in.  Supporting file_name as
a special selectable column is more work than what I was initially proposing.

@Jason,
yes, propagating the file name past joins and other operators seems like too
much overhead.  In my email, item #2, I was only intending this for the
Project that has been pushed right above the Scan.  This addresses the vast
majority of use cases.
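The narrow version of item #2 can be sketched roughly as follows (a Python
sketch with made-up names, not Drill's actual classes): the scan tags each
batch with its source file, and only a Project sitting directly on top of the
scan consults that tag when a cast fails.

```python
# Hypothetical sketch (illustrative names, not Drill APIs): the scan tags
# each batch with its source file, and a Project directly above the scan
# uses that tag to enrich cast errors.

class RecordBatch:
    def __init__(self, rows, source_file=None):
        self.rows = rows                # list of row tuples
        self.source_file = source_file  # known only for scan output

def scan(filename, rows):
    return RecordBatch(rows, source_file=filename)

def project_cast_bigint(batch, col):
    out = []
    for i, row in enumerate(batch.rows):
        try:
            out.append(int(row[col]))
        except ValueError:
            # Because this Project sits right above the scan, the batch
            # still knows which file the bad value came from.
            raise ValueError(
                "NumberFormatException: %r (file: %s, column index: %d, "
                "row in batch: %d)" % (row[col], batch.source_file, col, i))
    return RecordBatch([(v,) for v in out])

batch = scan("t4.csv", [("10", "2001"), ("11", "http://www.cnn.com")])
try:
    project_cast_bigint(batch, 1)
except ValueError as e:
    print(e)
```

A Project anywhere else in the plan would see `source_file=None`, which is
exactly the limitation noted above.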

Aman




On Wed, Sep 2, 2015 at 1:39 PM, Jason Altekruse <[email protected]>
wrote:

> While I initially wanted to say I agree with Jinfeng, I think the user
> experience should be better than having to essentially run the query twice.
> It doesn't seem like a crazy idea for Drill to carry around a source table for a
> batch of records (maybe even an offset/line number in the scan if it is
> known), and to continue to propagate it as long as the data has not been
> modified. It would get a little hairier when we look at the case of join,
> where most of the columns would be simple copies of the incoming data. If
> an expression is evaluated against one of these columns downstream, which
> could fail with one of these poor error messages, we should still be able
> to propagate the source table information through the join, but
> unfortunately that means we would have to store a list of source tables and
> a map to particular columns with each batch.
>
> This would not be a trivial task, but this might be a common enough pain
> point that it would be worth at least trying to come up with a design and
> work estimate for a decent solution.
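To make the join case concrete, here is one hypothetical shape for that
bookkeeping (names and structure invented for illustration, not Drill code):
each batch carries a map from column name to source table, a join merges the
maps from both sides, and a downstream expression error looks the failing
column up in the map.

```python
# Illustrative sketch: propagate a column -> source table map through a
# join so a downstream cast error can still name the origin file.

class Batch:
    def __init__(self, columns, rows, origins):
        self.columns = columns   # list of column names
        self.rows = rows         # list of dicts
        self.origins = origins   # column name -> source table/file

def scan(table, columns, rows):
    return Batch(columns, rows, {c: table for c in columns})

def join(left, right, key):
    # The joined columns are simple copies of the inputs, so the origin
    # maps can be merged (right side wins on name collisions, for brevity).
    origins = {**left.origins, **right.origins}
    rows = [{**l, **r} for l in left.rows for r in right.rows
            if l[key] == r[key]]
    return Batch(left.columns + right.columns, rows, origins)

def cast_int(batch, col):
    # Validates that every value in `col` parses as an int.
    for row in batch.rows:
        try:
            int(row[col])
        except ValueError:
            raise ValueError("bad value %r in column %r, which came from %s"
                             % (row[col], col, batch.origins.get(col)))

emp  = scan("emp.csv",  ["empno", "ename"],
            [{"empno": "100", "ename": "John Jones"}])
dept = scan("dept.csv", ["empno", "dept"],
            [{"empno": "100", "dept": "eng"}])
joined = join(emp, dept, "empno")
try:
    cast_int(joined, "ename")
except ValueError as e:
    print(e)   # names emp.csv even though the error fires past the join
```

The cost is visible even in the toy version: every batch drags a per-column
map along, which is the "list of source tables and a map to particular
columns" overhead described above.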
>
> On Tue, Sep 1, 2015 at 12:14 PM, Jinfeng Ni <[email protected]> wrote:
>
> > It seems hard to output the filename / # of records unless the cast
> > happens exactly at the Scan operator.  Otherwise, the input of the
> > Project could be any operator, including Scan.  It's hard to track the
> > source of a record in the chain of operators.
> >
> > I checked how Postgres reported the error in a similar case. Suppose I
> > have two columns :
> >
> > empno   : integer
> > ename   : character varying(20)
> >
> >
> > mydb=# select empno, ename from emp;
> >  empno |   ename
> > -------+------------
> >    100 | John Jones
> >    200 | 200
> > (2 rows)
> >
> > mydb=# select empno, cast(ename as int) from emp;
> > ERROR:  invalid input syntax for integer: "John Jones"
> >
> > Given the casting error, Postgres did not report where the error record
> > is.  The user has to run a query like "select * from emp where ename =
> > 'John Jones'" to figure out which record.
> >
> > In Drill, one probably could use similar approach to locate the error
> > record.
> >
> > select dir0, file_name, columns[1] where columns[1] = 'http://www.cnn.com';
> >
> > (suppose we extend the dir* and include file_name as well).
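That workaround amounts to a second pass: take the offending value from the
error message and search the inputs for it. A rough Python equivalent (file
names and contents invented for the example):

```python
# Sketch of the "run a second query" workaround: given the bad value from
# the error message, find which file and line it lives in.

def locate(files, bad_value):
    """files: dict of filename -> list of CSV lines.
    Returns a list of (filename, line_number) hits."""
    hits = []
    for name, lines in files.items():
        for line_no, line in enumerate(lines, start=1):
            if bad_value in line.split(","):
                hits.append((name, line_no))
    return hits

files = {
    "t3.csv": ["10,2001", "12,2002"],
    "t4.csv": ["10,2001", "11,http://www.cnn.com"],
}
print(locate(files, "http://www.cnn.com"))  # [('t4.csv', 2)]
```

Workable, but it rescans all the data, which is exactly the double-query
cost Jason objects to below.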
> >
> >
> >
> >
> > On Tue, Sep 1, 2015 at 10:46 AM, Hsuan Yi Chu <[email protected]>
> wrote:
> >
> > > Is it possible to let the record batch from the scan know the "file
> > > name" and "the range of line numbers in this batch"?
> > >
> > > The second one sounds difficult?
> > >
> > > On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <[email protected]>
> wrote:
> > >
> > > > Drill can point out the filename and location of corrupted records
> > > > in a file, but we don't have a good mechanism to deal with the
> > > > following scenario:
> > > >
> > > > Consider a text file with 2 records:
> > > > $ cat t4.csv
> > > > 10,2001
> > > > 11,http://www.cnn.com
> > > >
> > > > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> > > >
> > > > 0: jdbc:drill:zk=local> select cast(columns[0] as int),
> > > > cast(columns[1] as bigint) from dfs.`/Users/asinha/data/t4.csv`;
> > > >
> > > > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> > > >
> > > > Fragment 0:0
> > > >
> > > > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> > > >
> > > >   (java.lang.NumberFormatException) http://www.cnn.com
> > > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > > >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > > >     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > > >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> > > >
> > > > The problem is the user does not have a clue about the original
> > > > source of this error.  This is a pain point, especially when dealing
> > > > with thousands of files.
> > > >
> > > > 1.  We can start by providing the column index where the problem
> > > >     occurred.
> > > > 2.  Can a scan batch keep track of the file it originated from?
> > > >     Since the Project in the above query is pushed right above the
> > > >     scan, it could get the filename from the record batch (assuming
> > > >     we can store this piece of information).  This won't be possible
> > > >     for other Projects elsewhere in the plan.
> > > > 3.  What about the location within the file?  Unless the projection
> > > >     is pushed into the scan itself, I don't see a good way to
> > > >     provide this information.
> > > >
> > > > A related topic: how do we tell Drill to ignore such records when
> > > > doing a query or a CTAS?  That could be a separate discussion.
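For the "ignore such records" direction, one common shape (a sketch of the
general idea, not a proposal for Drill's actual syntax or behavior) is a
lenient cast that drops or nulls unparseable rows and counts them instead of
failing the whole query:

```python
# Sketch of a lenient CTAS-style cast: bad rows are dropped and counted
# instead of aborting the whole query.

def cast_column_lenient(rows, col):
    """Cast column `col` of each row tuple to int; skip rows that fail.
    Returns (good_rows, skipped_count)."""
    good, skipped = [], 0
    for row in rows:
        try:
            good.append(row[:col] + (int(row[col]),) + row[col + 1:])
        except ValueError:
            skipped += 1   # a real implementation could log file/offset here
    return good, skipped

rows = [("10", "2001"), ("11", "http://www.cnn.com")]
good, skipped = cast_column_lenient(rows, 1)
print(good, skipped)  # [('10', 2001)] 1
```

The skipped count matters: silently dropping rows without reporting how many
were lost would trade one debugging pain point for another.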
> > > >
> > > > Thoughts ?
> > > > Aman
> > > >
> > >
> >
>
