Hi Owen,
Thanks for the quick response.

Essentially, I have a real-time Avro -> ORC conversion process. I do the
conversion myself using the Java API. If I hit a serialization failure or
similar inside my own code, I push the record to a queue to handle
offline.
However, since I write the data for a single record column vector by column
vector, I want to make sure I don't have partial data from the failed
record still in the vector positions for that failed record.

Here is a small snippet to show what I'm doing. *addToVector* could fail
for any number of reasons, so I track the failed Avro record in a
separate thread, but I want to make sure that, for that vectorPosition, the
other column vectors are reset. Maybe to defaults? Maybe it's a dumb
question, but I can't figure out a smart way to do that, or whether I'm
thinking about the rollback idea correctly. Hopefully that is clear. Thanks Owen!

for (int c = 0; c < batch.numCols; c++) {
  ColumnVector colVector = batch.cols[c];
  final String thisField = orcSchema.getFieldNames().get(c);
  int vectorPosition = batch.size;

  Logger.orcConversionStatus(LOGGER_TRACE_ID, CLASS_LOCATION,
      String.format("Processing field: %s", thisField));
  final TypeDescription type = orcSchema.getChildren().get(c);

  Object fieldValue = record.get(thisField);
  Schema.Field avroField = currSchema.getField(thisField);

  // If this fails on some column X, I want to rollback the data I've
  // written for batch.numCols - X
  addToVector(type, colVector, avroField.schema(), fieldValue, vectorPosition);
}
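
One way to think about the rollback, as a minimal sketch rather than a definitive answer: if batch.size is only advanced after every column of a record has been written, a failed record never becomes visible, and the next record reuses the same vectorPosition. The code below uses hypothetical stand-in classes (Column, Batch) that only mirror the vector/isNull/noNulls/size fields of ORC's LongColumnVector and VectorizedRowBatch so that it runs without the ORC jars; the addToVector signature and the String-based records are also assumptions for illustration. Real column vector types such as bytes or list vectors carry extra state that this sketch does not cover.

```java
import java.util.ArrayList;
import java.util.List;

class RollbackSketch {

    /** Stand-in for an ORC LongColumnVector; only the fields used here. */
    static class Column {
        final long[] vector;
        final boolean[] isNull;
        boolean noNulls = true;

        Column(int capacity) {
            vector = new long[capacity];
            isNull = new boolean[capacity];
        }
    }

    /** Stand-in for VectorizedRowBatch: columns plus a row count. */
    static class Batch {
        final Column[] cols;
        int size = 0;

        Batch(int numCols, int capacity) {
            cols = new Column[numCols];
            for (int c = 0; c < numCols; c++) {
                cols[c] = new Column(capacity);
            }
        }
    }

    /** Hypothetical per-field writer; throws NumberFormatException on bad input. */
    static void addToVector(Column col, String value, int pos) {
        if (value == null) {
            col.isNull[pos] = true;
            col.noNulls = false; // "may contain nulls"; readers must check isNull
            return;
        }
        col.isNull[pos] = false;
        col.vector[pos] = Long.parseLong(value);
    }

    /** Writes one record at batch.size; returns false and rolls back on failure. */
    static boolean appendRecord(Batch batch, String[] record) {
        int pos = batch.size;
        try {
            for (int c = 0; c < batch.cols.length; c++) {
                addToVector(batch.cols[c], record[c], pos);
            }
        } catch (RuntimeException e) {
            // Rollback: clear the slot in every column. batch.size is not
            // advanced, so the next record overwrites this position anyway.
            for (Column col : batch.cols) {
                col.vector[pos] = 0L;
                col.isNull[pos] = false;
            }
            return false;
        }
        batch.size++; // commit the row only after all columns succeeded
        return true;
    }

    /** Converts all records into one batch; failures go to deadLetters. */
    static Batch convert(List<String[]> records, List<String[]> deadLetters) {
        // Assumes every record has up to 3 fields and all fit in one batch.
        Batch batch = new Batch(3, records.size());
        for (String[] record : records) {
            if (!appendRecord(batch, record)) {
                deadLetters.add(record);
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"1", "2", "3"});
        records.add(new String[] {"4", "oops", "6"}); // fails on the 2nd column
        records.add(new String[] {"7", null, "9"});
        List<String[]> dead = new ArrayList<>();
        Batch batch = convert(records, dead);
        System.out.println("rows kept: " + batch.size + ", dead letters: " + dead.size());
    }
}
```

The explicit reset in the catch block is arguably redundant with not advancing size, since the slot is rewritten by the next record, but it keeps stale partial values out of the batch if the failed record happens to be the last one before the batch is flushed.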


On Fri, Sep 11, 2020 at 10:37 AM Owen O'Malley <owen.omal...@gmail.com>
wrote:

> Where is the failure happening? If it is happening in the ORC writer code,
> there isn't a way to do that. Can I ask what kind of exception you are
> hitting? In the column (aka tree) writers, there shouldn't be much that can
> go wrong. It doesn't even write to the file handle, just buffering in
> memory.
>
> If the problem is in your code, you should be able to use the selected
> vector in the VectorizedRowBatch to just select the other rows.
>
> .. Owen
>
> On Fri, Sep 11, 2020 at 7:12 AM Ryan Schachte <c...@ryan-schachte.com>
> wrote:
>
> > I'm writing a streaming application that converts incoming data into ORC in
> > real-time. One thing I'm implementing is a dead-letter queue that still
> > allows me to continue the batch processing even if a single record fails.
> >
> > The caveat to this, is I want to remove the data that has been written thus
> > far if a failure occurs on say the 6th column out of 10 columns. For
> > example:
> >
> > I write the following data:
> >
> > {
> >  firstName: blah1,
> >  lastName: blah2,
> >  otherData: blah3
> > }
> >
> > My question is, if I fail on otherData, I want to "rollback" the data from
> > the column vectors at the current vectorPosition I'm iterating on. Is it as
> > simple as setting colVector.isNull[vectorPosition] to true and setting
> > colVector.noNulls to false? I wanted to originally go into the index for
> > each column vector and override, but I don't see an easy way to do that.
> >
> > Cheers!!
> > Ryan Schachte
> >
>
