[ 
https://issues.apache.org/jira/browse/DRILL-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950005#comment-14950005
 ] 

Mehant Baid commented on DRILL-3764:
------------------------------------

I previously worked with [~jnadeau] on similar functionality: a framework 
(annotations for errors in the function template, plus the necessary additions 
to the runtime code generation) for dealing with errors during function 
evaluation. Here is the branch: 
https://github.com/mehant/drill/commit/3e81a776d1c1bb0ce7f64d8c5a905c87d71e42e0 
(it is old and most likely won't rebase cleanly; I can work on rebasing it if 
that is deemed useful). The basic idea was to provide a way to specify different 
types of errors within the UDF and, in case of an error, to use null for that row. 
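
To make the "use null for that row" idea concrete, here is a minimal 
standalone sketch in plain Java. It is not the code from the branch above and 
does not use Drill's UDF interfaces; the class name, the OnError enum and the 
method names are all made up for illustration. It only shows the behavior: a 
failed cast either propagates (as it does today) or yields null for that row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NullOnErrorSketch {

    /** Hypothetical error policy; an assumption, not a Drill option or annotation. */
    public enum OnError { FAIL, USE_NULL }

    /** Casts one text value to a long; on a parse failure, rethrows or returns null. */
    public static Long castToLong(String value, OnError policy) {
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            if (policy == OnError.FAIL) {
                throw e; // current behavior: the whole query aborts
            }
            return null; // null stands in for this row's value
        }
    }

    /** Applies the cast to every value of a column, one row at a time. */
    public static List<Long> castColumn(List<String> values, OnError policy) {
        List<Long> out = new ArrayList<>();
        for (String v : values) {
            out.add(castToLong(v, policy));
        }
        return out;
    }

    public static void main(String[] args) {
        // Second column of t4.csv from the issue description.
        List<String> column = Arrays.asList("2001", "http://www.cnn.com");
        System.out.println(castColumn(column, OnError.USE_NULL)); // prints [2001, null]
    }
}
```

With OnError.FAIL the same input throws NumberFormatException on the second 
row, which matches the failure reported in the issue.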

> Support the ability to identify and/or skip records when a function 
> evaluation fails
> ------------------------------------------------------------------------------------
>
>                 Key: DRILL-3764
>                 URL: https://issues.apache.org/jira/browse/DRILL-3764
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Functions - Drill
>    Affects Versions: 1.1.0
>            Reporter: Aman Sinha
>             Fix For: Future
>
>
> Drill can point out the filename and location of corrupted records in a file 
> but it does not have a good mechanism to deal with the following scenario: 
> Consider a text file with 2 records:
> {code}
> $ cat t4.csv
> 10,2001
> 11,http://www.cnn.com
> {code}
> {code}
> 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as bigint) from dfs.`t4.csv`;
> Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> Fragment 0:0
> [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
>   (java.lang.NumberFormatException) http://www.cnn.com
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
>     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> {code}
> The problem is that the user has no context for where the error occurred: 
> neither the file name nor the record number. This becomes a pain point 
> especially when CTAS is used to convert data from (say) text 
> format to Parquet format. The CTAS may be accessing thousands of files, and a 
> single such cast (or other function) failure aborts the query. 
> It would substantially improve the user experience if we provided: 
> 1) the filename and record number where this failure occurred
> 2) the ability to skip such records depending on a session option
> 3) the ability to write such records to a staging table for future ingestion
> Please see discussion on dev list: 
> http://mail-archives.apache.org/mod_mbox/drill-dev/201509.mbox/%3cCAFyDVvLuPLgTNZ56S6=J=9Vb=aBs=pdw7nrhkkdupbdxgfa...@mail.gmail.com%3e
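
As a rough illustration of the three requested capabilities, the sketch below 
converts one file's values while tracking the file name and record number, and 
either fails with that context or sets the bad record aside. It is plain Java 
under stated assumptions, not Drill code: BadRecordSketch, Rejected, Result, 
and the skipBadRecords flag (standing in for a session option) are hypothetical 
names, and the "staging table" is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BadRecordSketch {

    /** Context for one record whose cast failed (items 1 and 3 of the list). */
    public static final class Rejected {
        public final String fileName;
        public final long recordNumber; // 1-based position within the file
        public final String rawValue;

        Rejected(String fileName, long recordNumber, String rawValue) {
            this.fileName = fileName;
            this.recordNumber = recordNumber;
            this.rawValue = rawValue;
        }

        @Override
        public String toString() {
            return fileName + ", record " + recordNumber + ": " + rawValue;
        }
    }

    /** Converted values plus the records set aside for later ingestion. */
    public static final class Result {
        public final List<Long> converted = new ArrayList<>();
        public final List<Rejected> rejected = new ArrayList<>();
    }

    /**
     * Casts each value to a long. When skipBadRecords is false, the first
     * failure aborts with file name and record number in the message; when
     * true, the record is skipped and kept in the rejected list (item 2).
     */
    public static Result convert(String fileName, List<String> values,
                                 boolean skipBadRecords) {
        Result result = new Result();
        long recordNumber = 0;
        for (String v : values) {
            recordNumber++;
            try {
                result.converted.add(Long.parseLong(v.trim()));
            } catch (NumberFormatException e) {
                Rejected r = new Rejected(fileName, recordNumber, v);
                if (!skipBadRecords) {
                    throw new IllegalStateException("Cast failed at " + r, e);
                }
                result.rejected.add(r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = convert("t4.csv",
                Arrays.asList("2001", "http://www.cnn.com"), true);
        System.out.println("converted: " + r.converted);
        System.out.println("rejected:  " + r.rejected);
    }
}
```

A real implementation would need the reader to pass file and record position 
down to the operator that evaluates the function, which this sketch sidesteps 
by doing both in one loop.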



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
