Is it possible for a record batch coming from the scan to know the "file name"
and "the range of line numbers in this batch"?

The second one sounds difficult.
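
To make the question concrete, here is a rough sketch of the idea. It is
written in Python for brevity rather than in Drill's Java, and every name in
it (BatchOrigin, RecordBatch, scan, project_cast) is hypothetical and is not
an existing Drill class or method. It assumes one record per line and that no
operator between the scan and the project drops or reorders rows, which is
exactly the case where the Project sits right above the scan.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class BatchOrigin:
    """Hypothetical per-batch metadata; not an existing Drill class."""
    file_name: str
    first_line: int  # 1-based line number of the first record in the batch
    last_line: int   # 1-based line number of the last record in the batch


@dataclass
class RecordBatch:
    rows: List[List[str]]
    origin: Optional[BatchOrigin] = None  # None once the origin is unknown


def scan(file_name: str, lines: List[str], batch_size: int) -> Iterator[RecordBatch]:
    """Yield batches that remember their source file and line range."""
    for start in range(0, len(lines), batch_size):
        chunk = lines[start:start + batch_size]
        yield RecordBatch(
            rows=[line.split(",") for line in chunk],
            origin=BatchOrigin(file_name, start + 1, start + len(chunk)),
        )


def project_cast(batch: RecordBatch, column: int, cast: Callable[[str], object]) -> list:
    """Cast one column; on failure, report the column index and batch origin."""
    out = []
    for offset, row in enumerate(batch.rows):
        try:
            out.append(cast(row[column]))
        except ValueError as e:
            where = f"column {column}"
            if batch.origin is not None:
                o = batch.origin
                # Exact line is only valid under the one-record-per-line,
                # nothing-dropped assumption stated above.
                where += (f", file {o.file_name}, line {o.first_line + offset}"
                          f" (batch lines {o.first_line}-{o.last_line})")
            raise ValueError(f"cannot cast {row[column]!r}: {where}") from e
    return out
```

Run against the two records of t4.csv from the message below, casting column 1
with int fails on the second record, and the error message names the column
index, the file and the line. A Project further up the plan would see
origin=None and could only report the column index.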

On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <[email protected]> wrote:

> Drill can point out the filename and location of corrupted records in a
> file but we don't have a good mechanism to deal with the following
> scenario:
>
> Consider a text file with 2 records:
> $ cat t4.csv
> 10,2001
> 11,http://www.cnn.com
>
> 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
>
> 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as
> bigint) from dfs.`/Users/asinha/data/t4.csv`;
>
> Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
>
> Fragment 0:0
>
> [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
>
>   (java.lang.NumberFormatException) http://www.cnn.com
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
>     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
>
> The problem is that the user has no clue about the original source of this
> error.  This is a pain point, especially when dealing with thousands of
> files.
>
> 1.  We can start by providing the column index where the problem occurred.
> 2.  Can a scan batch keep track of the file it originated from?  Since the
>      Project in the above query is pushed right above the scan, it could
>      get the filename from the record batch (assuming we can store this
>      piece of information).  This won't be possible for other Projects
>      elsewhere in the plan.
> 3.  What about the location within the file?  Unless the projection is
>      pushed into the scan itself, I don't see a good way to provide this
>      information.
>
> A related topic is how to tell Drill to ignore such records when running a
> query or a CTAS.
> That could be a separate discussion.
>
> Thoughts?
> Aman
>
