Re: issue while reading archived commit written by 0.5 version with 0.8 version
Hi Aakash,

Deleting the old commit files should not have much of an impact, since you are unlikely to use them again once they have been archived successfully; indeed, you have already deleted some of the archived files yourself. 😅

However, I went back and dug into the codebase again. A fix was recently merged into master and is expected to ship in 0.9.0; it should solve this problem more cleanly than manual intervention. If you are interested, you can take a look at the fix here: https://github.com/apache/hudi/pull/2677. We will be *skipping* the deserialization of inflight commit files and *only* deserializing completed commit files. As you can see, your problem is caused by archiving 20200715192915.rollback.inflight, which is an inflight commit file. We aren't particularly interested in the content of those inflight files, so we decided to modify the archival logic this way.

Failing to archive the commit files should not impede your usage of Hudi; it will continue to function properly. However, if you do care about keeping your pipeline running cleanly, feel free to build a 0.9.0-SNAPSHOT version and deploy it.

Hope it helps. :)

Best,
Susu

On Thu, Jun 24, 2021 at 12:32 AM aakash aakash wrote: > Hi Susu, > > thanks for the response. Can you please explain whats the impact of > deleting these commit files? > > Thanks! > > On Wed, Jun 23, 2021 at 8:09 AM Susu Dong wrote: > > > Hi Aakash, > > > > I believe there were schema level changes from Hudi 0.5.0 to 0.6.0 > > regarding those commit files. So if you are jumping from 0.5.0 to 0.8.0 > > right away, you will likely experience such an error, i.e. Failed to > > archive commits. You shouldn't need to delete archived files; instead, > you > > should try deleting some, if not all, active commit files under your > > *.hoodie* folder. The reason for that is 0.8.0 is using a new AVRO schema > > to parse your old commit files, so you got the failure. 
Can you try the > > above approach and let us know? Thank you. :) > > > > Best, > > Susu > > > > On Wed, Jun 23, 2021 at 12:21 PM aakash aakash > > wrote: > > > > > Hi, > > > > > > I am trying to use Hudi 0.8 with Spark 3.0 in my prod environment and > > > earlier we were running Hudi 0.5 with Spark 2.4.4. > > > > > > While updating a very old index, I am getting this error : > > > > > > *from the logs it seem its error out while reading this file : > > > hudi/.hoodie/archived/.commits_.archive.119_1-0-1 in s3* > > > > > > 21/06/22 19:18:06 ERROR HoodieTimelineArchiveLog: Failed to archive > > > commits, .commit file: 20200715192915.rollback.inflight > > > java.io.IOException: Not an Avro data file > > > at > org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) > > > at > > > > > > > > > org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:175) > > > at > > > > > > > > > org.apache.hudi.client.utils.MetadataConversionUtils.createMetaWrapper(MetadataConversionUtils.java:84) > > > at > > > > > > > > > org.apache.hudi.table.HoodieTimelineArchiveLog.convertToAvroRecord(HoodieTimelineArchiveLog.java:370) > > > at > > > > > > > > > org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:311) > > > at > > > > > > > > > org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:128) > > > at > > > > > > > > > org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:430) > > > at > > > > > > > > > org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:186) > > > at > > > > > > > > > org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:121) > > > at > > > > > > > > > org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:479) > > > > > > > > > Is this a backward compatibility issue? 
I have deleted a few archive > > files > > > but the problem is persisting so it does not look like a file > corruption > > > issue. > > > > > > Regards, > > > Aakash > > > > > >
Re: Could Hudi Data lake support low latency, high throughput random reads?
Maybe it is just not sane to serve an online request-response service using a data lake as the backend? In general, data lakes have not evolved beyond analytics and ML at this point, i.e. they are optimized for large batch scans. Not to say that this cannot be made possible, but I am skeptical that it will ever be as low-latency as your regular OLTP database. Object store random reads are definitely going to cost ~100ms, like reading from a highly loaded hard drive.

Hudi does support an HFile format, which is more optimized for random reads; we use it to store and serve table metadata. So that path is worth pursuing, if you have the appetite for trying to change the norm here. :) There is probably some work to do to scale it to large amounts of data.

Hope that helps.

Thanks
Vinoth

On Mon, Jun 7, 2021 at 4:04 PM Jialun Liu wrote: > Hey Gary, > > Thanks for your reply! > > This is kinda sad that we are not able to serve the insights to commercial > customers in real time. > > Do we have any best practices/ design patterns to get around the problem in > order to support online service for low latency, high throughput random > reads by any chance? > > Best regards, > Bill > > On Sun, Jun 6, 2021 at 2:19 AM Gary Li wrote: > > > Hi Bill, > > > > Data lake was used for offline analytics workload with minutes latency. > > Data lake(at least for Hudi) doesn't fit for online request-response > > service as you described for now. > > > > Best, > > Gary > > > > On Sun, Jun 6, 2021 at 12:29 PM Jialun Liu > wrote: > > > > > Hey Felix, > > > > > > Thanks for your reply! > > > > > > I briefly researched in Presto, it looks like it is designed to support > > the > > > high concurrency of Big data SQL query. The official doc suggests it > > could > > > process queries in sub-seconds to minutes. > > > https://prestodb.io/ > > > "Presto is targeted at analysts who expect response times ranging from > > > sub-second to minutes." 
> > > > > > However, the doc seems to suggest that it is supposed to be used by > > > analysts running offline queries, and it is not designed to be used as > an > > > OLTP database. > > > https://prestodb.io/docs/current/overview/use-cases.html > > > > > > I am wondering if it is technically possible to use data lake to > support > > > milliseconds latency, high throughput random reads at all today? Am I > > just > > > not thinking in the right direction? Maybe it is just not sane to serve > > > online request-response service using Data lake as backend? > > > > > > Best regards, > > > Bill > > > > > > On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix > > > wrote: > > > > > > > Hi Bill, > > > > > > > > Did you try using Presto (from EMR) to query HUDI tables on S3, and > it > > > > could support real time queries. And you have to partition your data > > > > properly to minimize the amount of data each query has to > scan/process. > > > > > > > > Regards, > > > > Felix K Jose > > > > From: Jialun Liu > > > > Date: Saturday, June 5, 2021 at 3:53 PM > > > > To: dev@hudi.apache.org > > > > Subject: Could Hudi Data lake support low latency, high throughput > > random > > > > reads? > > > > Caution: This e-mail originated from outside of Philips, be careful > for > > > > phishing. > > > > > > > > > > > > Hey guys, > > > > > > > > I am not sure if this is the right forum for this question, if you > know > > > > where this should be directed, appreciated for your help! > > > > > > > > The question is that "Could Hudi Data lake support low latency, high > > > > throughput random reads?". > > > > > > > > I am considering building a data lake that produces auxiliary > > information > > > > for my main service table. Example, say my main service is S3 and I > > want > > > to > > > > produce the S3 object pull count as the auxiliary information. I am > > going > > > > to use Apache Hudi and EMR to process the S3 access log to produce > the > > > pull > > > > count. 
Now, what I don't know is that can data lake support low > > latency, > > > > high throughput random reads for online request-response type of > > service? > > > > This way I could serve this information to customers in real time. > > > > > > > > I could write the auxiliary information, pull count, back to the main > > > > service table, but I personally don't think it is a sustainable > > > > architecture. It would be hard to do independent and agile > development > > > if I > > > > continue to add more derived attributes to the main table. > > > > > > > > Any help would be appreciated! > > > > > > > > Best regards, > > > > Bill
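Vinoth's point about HFile being "more optimized for random reads" comes down to layout: keys are stored sorted together with an index, so a point lookup is a logarithmic search rather than a full scan. A toy sketch of that idea in Python (purely illustrative; this is not the real HFile on-disk format, and the class name is made up):

```python
import bisect

class ToyHFile:
    """Illustrative stand-in for an HFile-style layout: key/value pairs
    kept sorted by key, plus a key index, so a point lookup costs
    O(log n) instead of scanning the whole file."""

    def __init__(self, records):
        # records: iterable of (key, value) pairs; stored sorted by key
        self._pairs = sorted(records)
        self._keys = [k for k, _ in self._pairs]

    def get(self, key):
        # binary search over the sorted key index
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._pairs[i][1]
        return None  # key not present
```

The same sorted-plus-indexed structure is what makes HFile-backed metadata lookups cheaper than scanning columnar files, though on an object store each index/block fetch still pays the network round trip Vinoth mentions.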
Fwd: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st
Hi all,

Looks like this will apply to our site? Any volunteers to help fix this?

Thanks
Vinoth

-- Forwarded message - From: Daniel Gruno Date: Mon, May 31, 2021 at 6:41 AM Subject: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st To: Users

TL;DR: if your project web site is kept in subversion, disregard this email please. If your project web site is using git, and you have not deployed it via .asf.yaml, you MUST switch before July 1st or risk your web site going stale.

Dear Apache projects,

In order to simplify our web site publishing services and improve self-serve for projects and stability of deployments, we will be turning off the old 'gitwcsub' method of publishing git web sites. As of this moment, this involves 120 web sites. All web sites should switch to our self-serve method of publishing via the .asf.yaml meta-file. We aim to turn off gitwcsub around July 1st.

## How to publish via .asf.yaml:

Publishing via .asf.yaml is described at: https://s.apache.org/asfyamlpublishing You can also see an example .asf.yaml with publishing and staging profiles for our own infra web site at: https://github.com/apache/infrastructure-website/blob/asf-site/.asf.yaml

In short, one puts a file called .asf.yaml into the branch that needs to be published as the project's web site, with the following two-line content, in this case assuming the published branch is 'asf-site': publish: whoami: asf-site

It is important to note that the .asf.yaml file MUST be present at the root of the file system in the branch you wish to publish. The 'whoami' parameter acts as a guard, ensuring that only the intended branch is used for publishing.

## Is my project affected by this?

The quickest way to check if you need to switch to a .asf.yaml approach is to check our site source page at https://infra-reports.apache.org/site-source/ - if your site is listed in yellow, you will need to switch. 
This page will also tell you which branch you are currently publishing as your web site. This is (should be) the branch that you must add a .asf.yaml meta file to. The web site source list updates every hour. If your project site appears in green, you are already using .asf.yaml for publishing and do not need to make any changes.

## What happens if we miss the deadline?

If you miss the deadline, don't fret. Your site will of course still remain online as is, but new updates will not appear until you create/edit the .asf.yaml and set up publishing.

## Who do we contact if we have questions?

Please contact us at us...@infra.apache.org if you have any additional questions.

With regards,
Daniel on behalf of ASF Infra.
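Written out as an actual file rather than flattened into one line of email text, the two-line .asf.yaml the notice describes (assuming the published branch is named 'asf-site') would look like:

```yaml
# .asf.yaml, placed at the root of the branch being published
publish:
  whoami: asf-site
```

Note that 'whoami' is nested under 'publish', and its value must match the name of the branch the file lives in; that mismatch check is the guard the notice mentions.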
Re: [HELP] unstable tests in the travis CI
Yes, CI is pretty flaky at the moment. There is a compiled list here: https://issues.apache.org/jira/browse/HUDI-1248. Siva and I are looking into some of this and will try to get everything back to normal again. As for that schema evolution test, I have tried reproducing it a few times, without luck. :/ On Wed, Jun 23, 2021 at 10:17 AM Prashant Wason wrote: > Sure. I will take a look today. I wonder how the CI passed during the > merge. > > > On Wed, Jun 23, 2021 at 7:57 AM pzwpzw > wrote: > > > Hi @Prashant Wason, I found that after the [HUDI-1717]( commit hash: > > 11e64b2db0ddf8f816561f8442b373de15a26d71) has merged yesterday, the test > > case TestHoodieBackedMetadata#testOnlyValidPartitionsAdded will always > > crash: > > > > org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve > > files in partition > > > /var/folders/my/841b2c052038ppns0csrf8g8gn/T/junit3095347769583879437/dataset/p1 > > from metadata > > > > at > > > org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:129) > > at > > > org.apache.hudi.metadata.TestHoodieBackedMetadata.testOnlyValidPartitionsAdded(TestHoodieBackedMetadata.java:210) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > at > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > > > > Can you take a look at this, Thanks~ > > > > > > > > On Jun 23, 2021, at 1:49 PM, Danny Chan wrote: > > > > Hi, fellows, there are two test cases in the travis CI that fails very > > often, which blocks our coding too many times, please, if these tests are > > not stable, can we disable them first ? > > They are annoying ~ > > > > > > TestHoodieBackedMetadata.testOnlyValidPartitionsAdded[1] > > HoodieSparkSqlWriterSuite: schema evolution for ... 
[2] > > > > [1] https://travis-ci.com/github/apache/hudi/jobs/518067391 > > [2] https://travis-ci.com/github/apache/hudi/jobs/518067393 > > > > Best, > > Danny Chan > > > > >
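Before disabling a test as Danny suggests, it can help to put a number on how flaky it actually is by rerunning it in a loop, which also distinguishes a genuinely intermittent test from one that only fails in the CI environment (as with the schema evolution test that could not be reproduced locally). A generic sketch in Python — purely illustrative, since the real suites here are JUnit/ScalaTest, and 'flaky_test' below is a made-up stand-in:

```python
import random

def failure_rate(test_fn, attempts=100):
    """Run a zero-argument test repeatedly; return the observed
    fraction of runs that raised an AssertionError."""
    failures = 0
    for _ in range(attempts):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures / attempts

# Made-up stand-in for a timing-sensitive test: fails roughly 30%
# of the time, deterministically seeded for reproducibility.
_rng = random.Random(42)

def flaky_test():
    assert _rng.random() > 0.3
```

A rate near zero locally but frequent CI failures points at the environment (resources, parallelism, timing); a consistently nonzero rate points at the test itself and justifies quarantining it.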
Re: [HELP] unstable tests in the travis CI
Sure. I will take a look today. I wonder how the CI passed during the merge. On Wed, Jun 23, 2021 at 7:57 AM pzwpzw wrote: > Hi @Prashant Wason, I found that after the [HUDI-1717]( commit hash: > 11e64b2db0ddf8f816561f8442b373de15a26d71) has merged yesterday, the test > case TestHoodieBackedMetadata#testOnlyValidPartitionsAdded will always > crash: > > org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve > files in partition > /var/folders/my/841b2c052038ppns0csrf8g8gn/T/junit3095347769583879437/dataset/p1 > from metadata > > at > org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:129) > at > org.apache.hudi.metadata.TestHoodieBackedMetadata.testOnlyValidPartitionsAdded(TestHoodieBackedMetadata.java:210) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > > Can you take a look at this, Thanks~ > > > > On Jun 23, 2021, at 1:49 PM, Danny Chan wrote: > > Hi, fellows, there are two test cases in the travis CI that fails very > often, which blocks our coding too many times, please, if these tests are > not stable, can we disable them first ? > They are annoying ~ > > > TestHoodieBackedMetadata.testOnlyValidPartitionsAdded[1] > HoodieSparkSqlWriterSuite: schema evolution for ... [2] > > [1] https://travis-ci.com/github/apache/hudi/jobs/518067391 > [2] https://travis-ci.com/github/apache/hudi/jobs/518067393 > > Best, > Danny Chan > >
Re: issue while reading archived commit written by 0.5 version with 0.8 version
Hi Susu, thanks for the response. Can you please explain what's the impact of deleting these commit files? Thanks! On Wed, Jun 23, 2021 at 8:09 AM Susu Dong wrote: > Hi Aakash, > > I believe there were schema level changes from Hudi 0.5.0 to 0.6.0 > regarding those commit files. So if you are jumping from 0.5.0 to 0.8.0 > right away, you will likely experience such an error, i.e. Failed to > archive commits. You shouldn't need to delete archived files; instead, you > should try deleting some, if not all, active commit files under your > *.hoodie* folder. The reason for that is 0.8.0 is using a new AVRO schema > to parse your old commit files, so you got the failure. Can you try the > above approach and let us know? Thank you. :) > > Best, > Susu > > On Wed, Jun 23, 2021 at 12:21 PM aakash aakash > wrote: > > > Hi, > > > > I am trying to use Hudi 0.8 with Spark 3.0 in my prod environment and > > earlier we were running Hudi 0.5 with Spark 2.4.4. > > > > While updating a very old index, I am getting this error : > > > > *from the logs it seem its error out while reading this file : > > hudi/.hoodie/archived/.commits_.archive.119_1-0-1 in s3* > > > > 21/06/22 19:18:06 ERROR HoodieTimelineArchiveLog: Failed to archive > > commits, .commit file: 20200715192915.rollback.inflight > > java.io.IOException: Not an Avro data file > > at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) > > at > > > > > org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:175) > > at > > > > > org.apache.hudi.client.utils.MetadataConversionUtils.createMetaWrapper(MetadataConversionUtils.java:84) > > at > > > > > org.apache.hudi.table.HoodieTimelineArchiveLog.convertToAvroRecord(HoodieTimelineArchiveLog.java:370) > > at > > > > > org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:311) > > at > > > > > 
org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:128) > > at > > > > > org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:430) > > at > > > > > org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:186) > > at > > > > > org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:121) > > at > > > > > org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:479) > > > > > > Is this a backward compatibility issue? I have deleted a few archive > files > > but the problem is persisting so it does not look like a file corruption > > issue. > > > > Regards, > > Aakash > > >
Re: issue while reading archived commit written by 0.5 version with 0.8 version
Hi Aakash, I believe there were schema-level changes to those commit files from Hudi 0.5.0 to 0.6.0. So if you are jumping from 0.5.0 straight to 0.8.0, you will likely experience exactly this error, i.e. "Failed to archive commits". You shouldn't need to delete archived files; instead, you should try deleting some, if not all, active commit files under your *.hoodie* folder. The reason is that 0.8.0 uses a new Avro schema to parse your old commit files, which is why you see the failure. Can you try the above approach and let us know? Thank you. :) Best, Susu On Wed, Jun 23, 2021 at 12:21 PM aakash aakash wrote: > Hi, > > I am trying to use Hudi 0.8 with Spark 3.0 in my prod environment and > earlier we were running Hudi 0.5 with Spark 2.4.4. > > While updating a very old index, I am getting this error : > > *from the logs it seem its error out while reading this file : > hudi/.hoodie/archived/.commits_.archive.119_1-0-1 in s3* > > 21/06/22 19:18:06 ERROR HoodieTimelineArchiveLog: Failed to archive > commits, .commit file: 20200715192915.rollback.inflight > java.io.IOException: Not an Avro data file > at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) > at > > org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:175) > at > > org.apache.hudi.client.utils.MetadataConversionUtils.createMetaWrapper(MetadataConversionUtils.java:84) > at > > org.apache.hudi.table.HoodieTimelineArchiveLog.convertToAvroRecord(HoodieTimelineArchiveLog.java:370) > at > > org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:311) > at > > org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:128) > at > > org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:430) > at > > org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:186) > at > > 
org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:121) > at > > org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:479) > > > Is this a backward compatibility issue? I have deleted a few archive files > but the problem is persisting so it does not look like a file corruption > issue. > > Regards, > Aakash >
Re: [HELP] unstable tests in the travis CI
Hi @Prashant Wason, I found that after [HUDI-1717] (commit hash: 11e64b2db0ddf8f816561f8442b373de15a26d71) was merged yesterday, the test case TestHoodieBackedMetadata#testOnlyValidPartitionsAdded always crashes:

org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve files in partition /var/folders/my/841b2c052038ppns0csrf8g8gn/T/junit3095347769583879437/dataset/p1 from metadata
at org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:129)
at org.apache.hudi.metadata.TestHoodieBackedMetadata.testOnlyValidPartitionsAdded(TestHoodieBackedMetadata.java:210)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)

Can you take a look at this? Thanks~

On Jun 23, 2021, at 1:49 PM, Danny Chan wrote:

Hi, fellows, there are two test cases in the travis CI that fail very often, which blocks our coding too many times. If these tests are not stable, can we disable them first? They are annoying ~

TestHoodieBackedMetadata.testOnlyValidPartitionsAdded [1]
HoodieSparkSqlWriterSuite: schema evolution for ... [2]

[1] https://travis-ci.com/github/apache/hudi/jobs/518067391
[2] https://travis-ci.com/github/apache/hudi/jobs/518067393

Best,
Danny Chan