[GitHub] [hudi] sydneyhoran commented on issue #8519: [SUPPORT] Deltastreamer AvroDeserializer failing with java.lang.NullPointerException
sydneyhoran commented on issue #8519: URL: https://github.com/apache/hudi/issues/8519#issuecomment-1533514539

The reason I was getting an error on deleting records without a tombstone was that we were testing by starting from a midpoint of a Kafka topic, so I suspect Deltastreamer didn't know what to do with `"d"` messages for records that didn't exist. I had to add a few more logging lines and check the executor logs to find out what was going on. We'll make sure jobs start fresh from the beginning of topics, and turn on `--commit-on-errors` if needed to get around deletes of non-existent records (until there is a possible fix for that).

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
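For anyone following along, the two knobs involved might look roughly like the sketch below (the `auto.offset.reset` Kafka consumer property and the DeltaStreamer `--commit-on-errors` flag are real; the file names and the remaining options are placeholders for your own deployment):

```
# In the source properties file, start from the beginning of the topic so
# delete ("d") events always refer to records the table has already seen:
#   auto.offset.reset=earliest

# And tolerate deletes of records the table never saw:
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --props source.properties \
  --commit-on-errors \
  ...   # remaining source/table options elided
```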
sydneyhoran commented on issue #8519: URL: https://github.com/apache/hudi/issues/8519#issuecomment-1533511853

Thanks to help from Aditya, @rmahindra123 and @nsivabalan, this was the fix that worked for us to filter out tombstones: https://github.com/sydneyhoran/hudi/commit/b864a69e27d50424b6984f28a31c3bd99a025762
sydneyhoran commented on issue #8519: URL: https://github.com/apache/hudi/issues/8519#issuecomment-152991

Thank you @the-other-tim-brown for the explanation/confirmation. That is what we had assumed as well, but with our configuration we are unable to parse events that have `before.*` populated and `op: "d"`; it seems Deltastreamer just treats them as errors.
sydneyhoran commented on issue #8519: URL: https://github.com/apache/hudi/issues/8519#issuecomment-1529939285

When we either turn off tombstones in Debezium or filter them out in DebeziumSource.java, no null/tombstone records come in (which is good). But we still get a "commit failed", and upon further inspection of the log I found that before this error it says `Delta Sync found errors when writing. Errors/Total=14/9282`. It seems that DeltaSync/WriteStatus.java is treating the deletes as errors. When I set `--commit-on-errors` to true, the job is allowed to run, but what happens to those "error" records? Shouldn't they be telling Deltastreamer to delete those records?

```
23/05/01 13:49:26 ERROR org.apache.hudi.utilities.deltastreamer.DeltaSync: Delta Sync found errors when writing. Errors/Total=14/9282
23/05/01 13:49:26 ERROR org.apache.hudi.utilities.deltastreamer.DeltaSync: Printing out the top 100 errors
23/05/01 13:49:26 ERROR org.apache.hudi.utilities.deltastreamer.DeltaSync: Global error :
23/05/01 13:49:26 ERROR org.apache.hudi.utilities.deltastreamer.DeltaSync: Global error :
23/05/01 13:49:26 ERROR org.apache.hudi.utilities.deltastreamer.DeltaSync: Global error :
```
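To make the question concrete, here is a toy model in Python of "commit on errors" semantics (this is not Hudi's actual code; the function and behavior are illustrative): records that fail are counted and skipped rather than applied, so a delete for a never-seen key simply disappears instead of deleting anything.

```python
# Toy model of "commit on errors" semantics. NOT Hudi's implementation;
# just an illustration of why error'd deletes silently disappear.

def sync_batch(table, batch, commit_on_errors):
    """Apply a batch of (key, op) events to a dict-backed 'table'."""
    errors = 0
    for key, op in batch:
        if op == "u":
            table[key] = "updated"
        elif op == "d":
            if key not in table:
                errors += 1  # delete of a non-existent record fails
                continue
            del table[key]
    if errors and not commit_on_errors:
        raise RuntimeError(
            f"Delta Sync found errors when writing. Errors/Total={errors}/{len(batch)}")
    return errors

table = {"a": "v1"}
# Starting mid-topic: a delete arrives for "b", which the table never saw.
batch = [("a", "u"), ("b", "d")]

n = sync_batch(table, batch, commit_on_errors=True)
print(n)      # 1 error record, skipped rather than applied
print(table)  # {'a': 'updated'} -- the delete for 'b' was dropped, not applied
```

With `commit_on_errors=False` the same batch raises instead of committing, which matches the "commit failed" behavior described above.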
sydneyhoran commented on issue #8519: URL: https://github.com/apache/hudi/issues/8519#issuecomment-1524068006

My team is trying to develop a custom Transformer class that can skip over null (tombstone) records from the Postgres Debezium Kafka source to address this. We are attempting something along the lines of:

```java
import java.util.List;
import java.util.Objects;

import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TombstoneTransformer implements Transformer {

  private static final Logger LOG = LoggerFactory.getLogger(TombstoneTransformer.class);

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
      Dataset<Row> rowDataset, TypedProperties properties) {
    LOG.info("TombstoneTransformer " + rowDataset);

    // NullPointerException happens on the following line:
    // List<Row> rowList = rowDataset.collectAsList();

    // NullPointerException happens on the following line:
    // newRowSet.collectAsList().stream().limit(50).forEach(x -> LOG.info(x.toString()));

    // Later in DeltaSync, tombstone records appear to still be present,
    // resulting in a NullPointerException:
    // Dataset<Row> newRowSet = rowDataset.filter("_change_operation_type is not null");

    // Later in DeltaSync, tombstone records appear to still be present,
    // resulting in a NullPointerException:
    Dataset<Row> newRowSet = rowDataset.filter((FilterFunction<Row>) Objects::nonNull);
    return newRowSet;
  }
}
```

However, none of the attempts at filtering the rowDataset get rid of the NullPointerException later in the Deltastreamer ingestion. Moreover, many of the attempts to log/view individual records in rowDataset result in a NullPointerException themselves. So we are wondering if there is something earlier in the code (maybe PostgresDebeziumSource.java, which flattens messages?) that could be allowing malformed Row objects to be passed to custom Transformer classes, which would explain why we cannot read/access the Rows and filter out the ones that are null (tombstone) records. Does anyone have an idea of how to make this class work?
Side note: we also tried SqlQueryBasedTransformer with `SELECT * FROM a WHERE a.id is not null` and it also did not filter out tombstones (we still had the NPE later during ingestion). Could someone explain what is happening with delete operations on `PostgresDebeziumSource` and `PostgresDebeziumAvroPayload`, and why they potentially aren't being handled well? Thanks in advance!
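The symptom described above is consistent with the filter running too late: a Debezium tombstone is a Kafka record whose *value* is null, so if the source tries to flatten/deserialize it before the Transformer runs, the damage is already done. A toy model in Python (not Spark/Java, and not Hudi's actual pipeline; `flatten` and `ingest` are illustrative stand-ins) shows why the drop has to happen at the source, before flattening:

```python
# Toy model of source-side vs transformer-side tombstone filtering.
# NOT Hudi's code: `flatten` stands in for the source's flattening step.

def flatten(value):
    """Stand-in for the source-side flattening of a Debezium envelope."""
    return {"id": value["after"]["id"], "op": value["op"]}

def ingest(records, filter_at_source):
    if filter_at_source:
        # Drop tombstones before they reach the flattening step.
        records = [r for r in records if r is not None]
    return [flatten(r) for r in records]

records = [
    {"op": "u", "after": {"id": 1}},
    None,  # tombstone: a record whose value is null, emitted after a delete
]

print(ingest(records, filter_at_source=True))   # [{'id': 1, 'op': 'u'}]

try:
    ingest(records, filter_at_source=False)
except TypeError as e:
    print("failed downstream:", e)  # the null value blows up during flattening
```

This matches the linked fix in the later comment, which filters tombstones out in the source rather than in a Transformer.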
sydneyhoran commented on issue #8519: URL: https://github.com/apache/hudi/issues/8519#issuecomment-1524065639

Just as an update: we were able to set `tombstones.on.delete` to `false` in a lower environment and still got the following error after a delete op:
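For reference, `tombstones.on.delete` is a Debezium connector property; a minimal sketch of where it lives in a Kafka Connect connector config (the connector name and any connection details are placeholders):

```json
{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tombstones.on.delete": "false"
  }
}
```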