Hi Ming,

Yes, it would be good if Trafodion behaved similarly to Hive here. It would also be good if errors in the data didn't make the entire bulk load fail. Instead, those error rows should ideally go into a separate file or table (with additional fields for the source file name and line number, if possible). That's not an easy thing to do, though. Could you say how much of this error-logging functionality you would want, and how Hive solves it?
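For illustration, here is a minimal sketch of the kind of error-row capture I have in mind. ErrorRowLogger, the reject-file layout, and the field order are all hypothetical, not an existing Trafodion or Hive API:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: collect rows rejected during bulk load into a
// side file, tagged with the source file name and line number so each
// error can be traced back to its origin.
public class ErrorRowLogger implements AutoCloseable {
    private final BufferedWriter out;

    public ErrorRowLogger(Path rejectFile) throws IOException {
        this.out = Files.newBufferedWriter(rejectFile, StandardCharsets.UTF_8);
    }

    // Write one rejected row: source location first, then the reason,
    // then the raw row text, all field-delimited for later inspection.
    public void logBadRow(String sourceFile, long lineNumber,
                          String reason, String rawRow) throws IOException {
        out.write(sourceFile + "|" + lineNumber + "|" + reason + "|" + rawRow);
        out.newLine();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    public static void main(String[] args) throws IOException {
        try (ErrorRowLogger logger = new ErrorRowLogger(Paths.get("bulkload.rejects"))) {
            logger.logBadRow("orders.dat", 1042, "string overflow", "bad|row|data");
        }
    }
}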
I don't quite understand how we would recognize record delimiters in a data field. Do you want to ignore record delimiters that appear between field delimiters? If so, a record delimiter in the last field would truncate the record, while one in any other field would be fine? I think it would be better to treat a row with a record delimiter in it as an error row (two error rows, actually). Possibly we could allow quoted strings, like

"this is a single field with a field delimiter | and a record delimiter \n and a quote "" in it"

(a sketch of this quote handling, and of the tolerant conversions you list, follows after the quoted mail below).

Thanks,

Hans

On Mon, Mar 28, 2016 at 8:13 PM, Liu, Ming (Ming) <[email protected]> wrote:
> Hi, all,
>
> Trafodion can bulk load data from HDFS into Trafodion tables. Currently,
> it has some strict requirements on the source data for a load to succeed.
> Typically, the source data should be clean and contain relatively little
> 'dirty' data. However, there are some special cases where the source data
> contains special values that we hope Trafodion can handle automatically:
>
> Automatically remove '\r' when '\r\n' is used as the DOS-format line
> delimiter.
>
> Do not raise a SQL error; instead, convert bad data to NULL automatically,
> but still log it to the error log files when required, so the change is
> not silent and the action stays traceable.
>
> Allow '\n' in a data field even when '\n' is the line terminator.
>
> Automatically truncate overflowing strings and log them to the error log
> file, to keep them traceable.
>
> When the source data has the 'issues' above, we currently have to run a
> special 'data cleaning' process before loading: convert DOS format to Unix
> format, then find and remove the bad data. However, products like Hive can
> handle such 'bad' data as described above. So it would be helpful if
> Trafodion introduced a special mode that provides the same 'tolerance'
> during bulk load, for users who can confirm these are the desired
> conversions and want to skip the extra 'data cleaning' step, especially
> when data is shared between Trafodion and other products like Hive.
>
> I will file a JIRA if there are no objections here, and any suggestions or
> ideas are welcome!
>
> Thanks,
> Ming
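P.S. Here is a minimal sketch of the quote handling proposed above, assuming '|' as the field delimiter and a doubled quote ("") as an escaped quote. It is illustrative only, not Trafodion's actual scanner; in a real loader the same quote state would also decide where a record ends.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of quote-aware field splitting: inside a quoted
// field, field and record delimiters are taken literally, and a doubled
// quote ("") stands for one literal quote character.
public class QuotedFieldSplitter {
    public static List<String> splitRecord(String record, char fieldDelim) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // "" inside quotes is an escaped quote; a single
                    // quote ends the quoted section.
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        field.append('"');
                        i++;
                    } else {
                        inQuotes = false;
                    }
                } else {
                    field.append(c); // delimiters are literal inside quotes
                }
            } else if (c == '"' && field.length() == 0) {
                inQuotes = true; // opening quote at the start of a field
            } else if (c == fieldDelim) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        // The example from the mail body: embedded '|', '\n', and "".
        String rec = "a|\"this is a single field with a field delimiter | "
                   + "and a record delimiter \n and a quote \"\" in it\"|b";
        System.out.println(splitRecord(rec, '|'));
    }
}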

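P.P.S. And a minimal sketch of the tolerant conversions from Ming's list: stripping a trailing '\r' left over from DOS-format '\r\n' line endings, and converting an unparseable value to NULL while logging it. The class and method names are hypothetical, and a real implementation would write to the error log file Ming mentions rather than a StringBuilder:

// Illustrative sketch: strip a trailing '\r' so DOS-format lines load
// like Unix lines, and convert bad numeric data to NULL instead of
// raising an error, recording the change so it stays traceable.
public class TolerantConversion {

    // Remove a trailing '\r' from a line read with '\n' as terminator.
    public static String stripCarriageReturn(String line) {
        return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
    }

    // Return the parsed value, or null for bad data; the caller logs
    // the replacement rather than failing the whole load.
    public static Long toLongOrNull(String field, StringBuilder log) {
        try {
            return Long.parseLong(field.trim());
        } catch (NumberFormatException e) {
            log.append("converted bad value '").append(field).append("' to NULL\n");
            return null;
        }
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        System.out.println(stripCarriageReturn("42\r")); // "42"
        System.out.println(toLongOrNull("abc", log));    // null
        System.out.print(log);                           // the trace entry
    }
}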