Hi Kabeer,

I have requested some information in the GitHub ticket.

Balaji.V

On Wednesday, August 28, 2019, 10:46:04 AM PDT, Kabeer Ahmed <kab...@linuxmail.org> wrote:

Thanks for the quick response, Vinoth. As I would have thought, there is nothing complex or different about an upsert after a delete. Yes, I can reproduce the issue with the simple example that I wrote in the email.
I have dug into the issue in detail and it seems to be a bug. I have filed it at: https://github.com/apache/incubator-hudi/issues/859

Let me know if more information is required.

Thank you,

On Aug 23 2019, at 1:37 am, Vinoth Chandar <vin...@apache.org> wrote:
> Yes, I was asking about the HUDI storage type.
>
> There is nothing complex about upsert() after delete(). It is almost as if a
> delete() for (2, vinoth) happened in between.
>
> Are you able to repro this literally with this tiny example with 3 records?
> Some things to check:
>
> - This sequence would have created 3 commits. You can look at the commit
>   files and see if the numbers of records updated, inserted, and deleted match
>   expectations.
> - If they do, then you can use spark.read.parquet(.) on the individual
>   parquet files and see what records they actually contain.
>
> This should shed some light on the pattern of failure and when exactly
> (2, vinoth) disappeared.
>
> Alternatively, if you can give a small snippet that reproduces this, we can
> debug from there.
>
> On Thu, Aug 22, 2019 at 3:06 PM Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > And if you meant HUDI storage type, I have left it at the default, COW (Copy
> > On Write).
> >
> > If anyone has tried this, please let me know if you have hit a similar issue.
> > Any experience would be greatly helpful.
> > On Aug 22 2019, at 11:01 pm, Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > > Hi Vinoth - thanks for the quick response.
> > >
> > > I have followed the mail thread for deletes ->
> > > http://mail-archives.apache.org/mod_mbox/hudi-commits/201904.mbox/<155556722511.2660.9583626796839453...@gitbox.apache.org>
> > >
> > > For your convenience, the code that I use is below at the end of the
> > > email.
> > > EmptyHoodieRecord is inserted for the relevant records that need to
> > > be deleted. After the delete, I can query from Hive and confirm that the
> > > rows intended to be deleted are no longer present, and the records not
> > > deleted can be seen in the Hive table via Hive and Presto.
> > >
> > > The issue starts when the upsert is done after a delete.
> > >
> > > The storage type is S3, and I don't think there is any eventual
> > > consistency in play, as the upserted record is visible but the old records
> > > that weren't deleted are not visible.
> > >
> > > And for the sake of completeness, my insert and upsert logic is based on
> > > the code below:
> > > https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L43
> > >
> > > Thanks
> > > Kabeer.
> > >
> > > > /**
> > > >  * Empty payload used for deletions
> > > >  */
> > > > public class EmptyHoodieRecordPayload implements HoodieRecordPayload<EmptyHoodieRecordPayload> {
> > > >
> > > >   public EmptyHoodieRecordPayload(GenericRecord record, Comparable orderingVal) { }
> > > >
> > > >   @Override
> > > >   public EmptyHoodieRecordPayload preCombine(EmptyHoodieRecordPayload another) {
> > > >     return another;
> > > >   }
> > > >
> > > >   @Override
> > > >   public Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) {
> > > >     return Optional.empty();
> > > >   }
> > > >
> > > >   @Override
> > > >   public Optional<IndexedRecord> getInsertValue(Schema schema) {
> > > >     return Optional.empty();
> > > >   }
> > > > }
> > >
> > > ---------- Forwarded Message ---------
> > >
> > > From: Vinoth Chandar <vin...@apache.org>
> > > Subject: Re: Upsert after Delete
> > > Date: Aug 22 2019, at 8:38 pm
> > > To: dev@hudi.apache.org
> > >
> > > That’s interesting.
> > > Can you also share details on storage type and how you
> > > are issuing the deletes and also the table/view (ro, rt) that you are
> > > querying?
> > >
> > > On Thu, Aug 22, 2019 at 9:49 AM Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > > > Hudi experts and Users,
> > > >
> > > > Has anyone attempted an upsert after a delete? Here is a weird thing that
> > > > I have bumped into, and it is a shame that this has come up when someone in
> > > > the team tested this whilst I failed to run this test.
> > > >
> > > > Use case:
> > > >
> > > > Insert data into a table. Say records (1, kabeer | 2, vinoth)
> > > >
> > > > Delete a record (1, kabeer). Data in the table is: (2, vinoth) and it is
> > > > visible via SQL through Presto/Hive.
> > > >
> > > > Upsert a new record into the same table (3, balaji). Query the table and
> > > > the only record that is visible is: (3, balaji). The record (2, vinoth) is not
> > > > displayed in the results.
> > > >
> > > > Any ideas on what could be at play here? Has someone done upsert after
> > > > delete?
> > > >
> > > > Thanks,
> > > > Kabeer
> > > >
> > > > PS: Please note that upsert functionality is well tested, and if we do a (1,
> > > > vinoth) insert followed by an upsert of (2, balaji), both records are
> > > > visible. So something else is at play, and I would appreciate any insight
> > > > that you experts can provide.
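
For readers of the archive, the three-step use case from the original message can be sketched as a toy in-memory model of the expected behaviour. This is not Hudi code: the class UpsertAfterDeleteSketch and its upsert and snapshot methods are hypothetical names made up for illustration. The only idea borrowed from the thread is that an empty payload (compare Optional.empty() in EmptyHoodieRecordPayload) marks a delete.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Toy model of the expected insert/delete/upsert semantics. Hypothetical, not Hudi's implementation. */
class UpsertAfterDeleteSketch {

    /** Latest value per record key; stands in for the current contents of the table. */
    private final Map<Integer, String> table = new TreeMap<>();

    /**
     * Applies one record: a present payload inserts or overwrites the key,
     * an empty payload removes the key (assumed delete convention).
     */
    void upsert(int key, Optional<String> payload) {
        if (payload.isPresent()) {
            table.put(key, payload.get());
        } else {
            table.remove(key);
        }
    }

    /** Returns a sorted copy of the current table contents. */
    Map<Integer, String> snapshot() {
        return new TreeMap<>(table);
    }

    public static void main(String[] args) {
        UpsertAfterDeleteSketch t = new UpsertAfterDeleteSketch();
        t.upsert(1, Optional.of("kabeer"));   // step 1: insert (1, kabeer)
        t.upsert(2, Optional.of("vinoth"));   // step 1: insert (2, vinoth)
        t.upsert(1, Optional.empty());        // step 2: delete (1, kabeer)
        t.upsert(3, Optional.of("balaji"));   // step 3: upsert (3, balaji)
        System.out.println(t.snapshot());     // prints {2=vinoth, 3=balaji}
    }
}
```

Under this model the final snapshot holds both (2, vinoth) and (3, balaji). That is the result the thread expected, not the one reported in the linked issue, where (2, vinoth) was missing after the upsert.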