Re: issue while reading archived commit written by 0.5 version with 0.8 version

2021-06-23 Thread Susu Dong
Hi Aakash,

Deleting the old commit files should not have much of an impact, since you
are unlikely to use them again once they have been archived successfully;
besides, you have already deleted some of the archived files yourself. 😅

However, I went back and dug into the codebase again. A fix has recently been
merged into master and is expected to ship in 0.9.0; it should be a better
solution to this problem than manual intervention.
Specifically, you can take a look at the fix here, if you are interested:
https://github.com/apache/hudi/pull/2677
We will be *skipping* the deserialization of inflight commit files and *only*
deserializing completed commit files. As you can see, your problem is caused
by archiving 20200715192915.rollback.inflight, which is an inflight commit
file. We aren't particularly interested in the content of those inflight
files, so we have decided to modify the archival logic this way.
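
To illustrate the idea only (this is a minimal, self-contained sketch, not
the actual code in the PR above): the archival step filters out inflight
instants up front, so only completed commit metadata is ever deserialized.

import java.util.List;
import java.util.stream.Collectors;

public class ArchiveInflightSkipSketch {

  // Hypothetical, minimal stand-in for a timeline instant.
  static class Instant {
    final String timestamp;
    final String action;     // e.g. "commit", "rollback"
    final boolean completed; // inflight instants are not completed

    Instant(String timestamp, String action, boolean completed) {
      this.timestamp = timestamp;
      this.action = action;
      this.completed = completed;
    }
  }

  // Only completed instants are selected for metadata deserialization;
  // inflight files such as 20200715192915.rollback.inflight are archived
  // without being parsed as Avro, so they can no longer fail the run.
  static List<Instant> instantsToDeserialize(List<Instant> candidates) {
    return candidates.stream()
        .filter(i -> i.completed)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Instant> timeline = List.of(
        new Instant("20200715192915", "rollback", false), // inflight -> skipped
        new Instant("20200716080000", "commit", true));   // completed -> kept
    System.out.println(instantsToDeserialize(timeline).size()); // prints 1
  }
}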

Failure to archive the commit files should not impede your usage of Hudi; it
will continue to function properly. However, if you do care about a clean
running status for your pipeline, feel free to build a 0.9.0-SNAPSHOT version
yourself and try it out. Hope this helps. :)

Best,
Susu


On Thu, Jun 24, 2021 at 12:32 AM aakash aakash 
wrote:

> Hi Susu,
>
> Thanks for the response. Can you please explain what's the impact of
> deleting these commit files?
>
> Thanks!
>
> On Wed, Jun 23, 2021 at 8:09 AM Susu Dong  wrote:
>
> > Hi Aakash,
> >
> > I believe there were schema level changes from Hudi 0.5.0 to 0.6.0
> > regarding those commit files. So if you are jumping from 0.5.0 to 0.8.0
> > right away, you will likely experience such an error, i.e. Failed to
> > archive commits. You shouldn't need to delete archived files; instead,
> you
> > should try deleting some, if not all, active commit files under your
> > *.hoodie* folder. The reason for that is 0.8.0 is using a new AVRO schema
> > to parse your old commit files, so you got the failure. Can you try the
> > above approach and let us know? Thank you. :)
> >
> > Best,
> > Susu
> >
> > On Wed, Jun 23, 2021 at 12:21 PM aakash aakash 
> > wrote:
> >
> > > Hi,
> > >
> > > I am trying to use Hudi 0.8 with Spark 3.0 in my prod environment and
> > > earlier we were running Hudi 0.5 with Spark 2.4.4.
> > >
> > > While updating a very old index, I am getting this error :
> > >
> > > *from the logs it seem its  error out while reading this file :
> > > hudi/.hoodie/archived/.commits_.archive.119_1-0-1 in s3*
> > >
> > > 21/06/22 19:18:06 ERROR HoodieTimelineArchiveLog: Failed to archive
> > > commits, .commit file: 20200715192915.rollback.inflight
> > > java.io.IOException: Not an Avro data file
> > > at
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
> > > at
> > >
> > >
> >
> org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:175)
> > > at
> > >
> > >
> >
> org.apache.hudi.client.utils.MetadataConversionUtils.createMetaWrapper(MetadataConversionUtils.java:84)
> > > at
> > >
> > >
> >
> org.apache.hudi.table.HoodieTimelineArchiveLog.convertToAvroRecord(HoodieTimelineArchiveLog.java:370)
> > > at
> > >
> > >
> >
> org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:311)
> > > at
> > >
> > >
> >
> org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:128)
> > > at
> > >
> > >
> >
> org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:430)
> > > at
> > >
> > >
> >
> org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:186)
> > > at
> > >
> > >
> >
> org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:121)
> > > at
> > >
> > >
> >
> org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:479)
> > >
> > >
> > > Is this a backward compatibility issue? I have deleted a few archive
> > files
> > > but the problem is persisting so it does not look like a file
> corruption
> > > issue.
> > >
> > > Regards,
> > > Aakash
> > >
> >
>


Re: Could Hudi Data lake support low latency, high throughput random reads?

2021-06-23 Thread Vinoth Chandar
> Maybe it is just not sane to serve online request-response service
> using Data lake as backend?
In general, data lakes have not evolved beyond analytics and ML at this
point, i.e. they are optimized for large batch scans.

Not to say that this is impossible, but I am skeptical that it will ever be
as low-latency as your regular OLTP database.
Object store random reads are definitely going to cost ~100 ms, like reading
from a highly loaded hard drive.
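
To make that concrete, here is a back-of-the-envelope sketch (the 100 ms
latency and the 5,000 reads/s target are assumptions purely for
illustration): each synchronous request path completes only ~10 reads/s, so
sustaining thousands of reads per second needs hundreds of concurrent
in-flight reads, which is nothing like a typical OLTP access pattern.

public class RandomReadConcurrencySketch {
  public static void main(String[] args) {
    double readLatencySeconds = 0.100;   // assumed ~100 ms per object-store random read
    double targetReadsPerSecond = 5_000; // hypothetical online-serving target

    // Reads a single synchronous request path can complete per second.
    double readsPerSecondPerSlot = 1.0 / readLatencySeconds; // 10 reads/s

    // Concurrent in-flight reads needed to sustain the target (Little's Law).
    double requiredConcurrency = targetReadsPerSecond * readLatencySeconds; // 500

    System.out.printf("Per-slot throughput: %.0f reads/s%n", readsPerSecondPerSlot);
    System.out.printf("Required concurrency: %.0f in-flight reads%n", requiredConcurrency);
  }
}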

Hudi does support an HFile format, which is better optimized for random
reads. We use it to store and serve table metadata.
So that path is worth pursuing, if you have the appetite for trying to change
the norm here. :)
There is probably some work to do here to scale it for large amounts of data.

Hope that helps.

Thanks
Vinoth

On Mon, Jun 7, 2021 at 4:04 PM Jialun Liu  wrote:

> Hey Gary,
>
> Thanks for your reply!
>
> This is kinda sad that we are not able to serve the insights to commercial
> customers in real time.
>
> Do we have any best practices/ design patterns to get around the problem in
> order to support online service for low latency, high throughput random
> reads by any chance?
>
> Best regards,
> Bill
>
> On Sun, Jun 6, 2021 at 2:19 AM Gary Li  wrote:
>
> > Hi Bill,
> >
> > Data lake was used for offline analytics workload with minutes latency.
> > Data lake(at least for Hudi) doesn't fit for online request-response
> > service as you described for now.
> >
> > Best,
> > Gary
> >
> > On Sun, Jun 6, 2021 at 12:29 PM Jialun Liu 
> wrote:
> >
> > > Hey Felix,
> > >
> > > Thanks for your reply!
> > >
> > > I briefly researched in Presto, it looks like it is designed to support
> > the
> > > high concurrency of Big data SQL query. The official doc suggests it
> > could
> > > process queries in sub-seconds to minutes.
> > > https://prestodb.io/
> > > "Presto is targeted at analysts who expect response times ranging from
> > > sub-second to minutes."
> > >
> > > However, the doc seems to suggest that it is supposed to be used by
> > > analysts running offline queries, and it is not designed to be used as
> an
> > > OLTP database.
> > > https://prestodb.io/docs/current/overview/use-cases.html
> > >
> > > I am wondering if it is technically possible to use data lake to
> support
> > > milliseconds latency, high throughput random reads at all today? Am I
> > just
> > > not thinking in the right direction? Maybe it is just not sane to serve
> > > online request-response service using Data lake as backend?
> > >
> > > Best regards,
> > > Bill
> > >
> > > On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix
> > >  wrote:
> > >
> > > > Hi Bill,
> > > >
> > > > Did you try using Presto (from EMR) to query HUDI tables on S3, and
> it
> > > > could support real time queries. And you have to partition your data
> > > > properly to minimize the amount of data each query has to
> scan/process.
> > > >
> > > > Regards,
> > > > Felix K Jose
> > > > From: Jialun Liu 
> > > > Date: Saturday, June 5, 2021 at 3:53 PM
> > > > To: dev@hudi.apache.org 
> > > > Subject: Could Hudi Data lake support low latency, high throughput
> > random
> > > > reads?
> > > >
> > > >
> > > > Hey guys,
> > > >
> > > > I am not sure if this is the right forum for this question, if you
> know
> > > > where this should be directed, appreciated for your help!
> > > >
> > > > The question is that "Could Hudi Data lake support low latency, high
> > > > throughput random reads?".
> > > >
> > > > I am considering building a data lake that produces auxiliary
> > information
> > > > for my main service table. Example, say my main service is S3 and I
> > want
> > > to
> > > > produce the S3 object pull count as the auxiliary information. I am
> > going
> > > > to use Apache Hudi and EMR to process the S3 access log to produce
> the
> > > pull
> > > > count. Now, what I don't know is that can data lake support low
> > latency,
> > > > high throughput random reads for online request-response type of
> > service?
> > > > This way I could serve this information to customers in real time.
> > > >
> > > > I could write the auxiliary information, pull count, back to the main
> > > > service table, but I personally don't think it is a sustainable
> > > > architecture. It would be hard to do independent and agile
> development
> > > if I
> > > > continue to add more derived attributes to the main table.
> > > >
> > > > Any help would be appreciated!
> > > >
> > > > Best regards,
> > > > Bill
> > > >

Fwd: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st

2021-06-23 Thread Vinoth Chandar
Hi all,

Looks like this will apply to our site? Any volunteers to help fix this?

Thanks
Vinoth

-- Forwarded message -
From: Daniel Gruno 
Date: Mon, May 31, 2021 at 6:41 AM
Subject: [NOTICE] Git web site publishing to be done via .asf.yaml only as
of July 1st
To: Users 


TL;DR: if your project web site is kept in Subversion, please disregard this
email. If your project web site is using git and you have not deployed it
via .asf.yaml, you MUST switch before July 1st or risk your web site going
stale.



Dear Apache projects,
In order to simplify our web site publishing services and improve
self-serve for projects and stability of deployments, we will be turning
off the old 'gitwcsub' method of publishing git web sites. As of this
moment, this involves 120 web sites. All web sites should switch to our
self-serve method of publishing via the .asf.yaml meta-file. We aim to
turn off gitwcsub around July 1st.


## How to publish via .asf.yaml:
Publishing via .asf.yaml is described at:
https://s.apache.org/asfyamlpublishing
You can also see an example .asf.yaml with publishing and staging
profiles for our own infra web site at:
https://github.com/apache/infrastructure-website/blob/asf-site/.asf.yaml

In short, one puts a file called .asf.yaml into the branch that needs to
be published as the project's web site, with the following two-line
content, in this case assuming the published branch is 'asf-site':

publish:
   whoami: asf-site


It is important to note that the .asf.yaml file MUST be present at the
root of the file system in the branch you wish to publish. The 'whoami'
parameter acts as a guard, ensuring that only the intended branch is used
for publishing.


## Is my project affected by this?
The quickest way to check whether you need to switch to the .asf.yaml
approach is to check the site source page at
https://infra-reports.apache.org/site-source/ - if your site is listed
in yellow, you will need to switch. This page will also tell you which
branch you are currently publishing as your web site. This is (should
be) the branch that you must add a .asf.yaml meta file to.

The web site source list updates every hour. If your project site
appears in green, you are already using .asf.yaml for publishing and do
not need to make any changes.


## What happens if we miss the deadline?
If you miss the deadline, don't fret. Your site will of course still
remain online as is, but new updates will not appear till you
create/edit the .asf.yaml and set up publishing.


## Who do we contact if we have questions?
Please contact us at us...@infra.apache.org if you have any additional
questions.


With regards,
Daniel on behalf of ASF Infra.


Re: [HELP] unstable tests in the travis CI

2021-06-23 Thread Vinoth Chandar
Yes, CI is pretty flaky at the moment. There is a compiled list here:
https://issues.apache.org/jira/browse/HUDI-1248

Siva and I are looking into some of this and will try to get everything back
to normal again.

That schema evolution test I have tried reproducing a few times, without
luck. :/
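
If we do decide to temporarily disable a flaky test, a minimal sketch with
JUnit 5 would look like the one below (illustrative only; the class and test
names are placeholders, and this assumes the tests in question are JUnit 5
tests):

import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

public class FlakyTestExample {

  // @Disabled skips the test while keeping it in the codebase; the reason
  // string should point at the tracking ticket so it is not forgotten.
  @Disabled("Flaky in CI, tracked in HUDI-1248; re-enable once stabilized")
  @Test
  public void testOnlyValidPartitionsAdded() {
    // the flaky assertions would live here
  }
}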

On Wed, Jun 23, 2021 at 10:17 AM Prashant Wason 
wrote:

> Sure. I will take a look today. I wonder how the CI passed during the
> merge.
>
>
> On Wed, Jun 23, 2021 at 7:57 AM pzwpzw 
> wrote:
>
> > Hi @Prashant Wason, I found that after the [HUDI-1717]( commit hash:
> > 11e64b2db0ddf8f816561f8442b373de15a26d71)  has merged yesterday, the test
> > case TestHoodieBackedMetadata#testOnlyValidPartitionsAdded will always
> > crash:
> >
> > org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve
> > files in partition
> >
> /var/folders/my/841b2c052038ppns0csrf8g8gn/T/junit3095347769583879437/dataset/p1
> > from metadata
> >
> > at
> >
> org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:129)
> > at
> >
> org.apache.hudi.metadata.TestHoodieBackedMetadata.testOnlyValidPartitionsAdded(TestHoodieBackedMetadata.java:210)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:498)
> >
> > Can you take a look at this,  Thanks~
> >
> >
> >
> > On June 23, 2021, at 1:49 PM, Danny Chan  wrote:
> >
> > Hi fellows, there are two test cases in the Travis CI that fail very
> > often, which blocks our work too many times. If these tests are not
> > stable, can we disable them first?
> > They are annoying ~
> >
> >
> > TestHoodieBackedMetadata.testOnlyValidPartitionsAdded[1]
> > HoodieSparkSqlWriterSuite: schema evolution for ... [2]
> >
> > [1] https://travis-ci.com/github/apache/hudi/jobs/518067391
> > [2] https://travis-ci.com/github/apache/hudi/jobs/518067393
> >
> > Best,
> > Danny Chan
> >
> >
>


Re: [HELP] unstable tests in the travis CI

2021-06-23 Thread Prashant Wason
Sure. I will take a look today. I wonder how the CI passed during the merge.


On Wed, Jun 23, 2021 at 7:57 AM pzwpzw 
wrote:

> Hi @Prashant Wason, I found that after the [HUDI-1717]( commit hash:
> 11e64b2db0ddf8f816561f8442b373de15a26d71)  has merged yesterday, the test
> case TestHoodieBackedMetadata#testOnlyValidPartitionsAdded will always
> crash:
>
> org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve
> files in partition
> /var/folders/my/841b2c052038ppns0csrf8g8gn/T/junit3095347769583879437/dataset/p1
> from metadata
>
> at
> org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:129)
> at
> org.apache.hudi.metadata.TestHoodieBackedMetadata.testOnlyValidPartitionsAdded(TestHoodieBackedMetadata.java:210)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
>
> Can you take a look at this,  Thanks~
>
>
>
> On June 23, 2021, at 1:49 PM, Danny Chan  wrote:
>
> Hi fellows, there are two test cases in the Travis CI that fail very
> often, which blocks our work too many times. If these tests are not
> stable, can we disable them first?
> They are annoying ~
>
>
> TestHoodieBackedMetadata.testOnlyValidPartitionsAdded[1]
> HoodieSparkSqlWriterSuite: schema evolution for ... [2]
>
> [1] https://travis-ci.com/github/apache/hudi/jobs/518067391
> [2] https://travis-ci.com/github/apache/hudi/jobs/518067393
>
> Best,
> Danny Chan
>
>


Re: issue while reading archived commit written by 0.5 version with 0.8 version

2021-06-23 Thread aakash aakash
Hi Susu,

Thanks for the response. Can you please explain what's the impact of
deleting these commit files?

Thanks!

On Wed, Jun 23, 2021 at 8:09 AM Susu Dong  wrote:

> Hi Aakash,
>
> I believe there were schema level changes from Hudi 0.5.0 to 0.6.0
> regarding those commit files. So if you are jumping from 0.5.0 to 0.8.0
> right away, you will likely experience such an error, i.e. Failed to
> archive commits. You shouldn't need to delete archived files; instead, you
> should try deleting some, if not all, active commit files under your
> *.hoodie* folder. The reason for that is 0.8.0 is using a new AVRO schema
> to parse your old commit files, so you got the failure. Can you try the
> above approach and let us know? Thank you. :)
>
> Best,
> Susu
>
> On Wed, Jun 23, 2021 at 12:21 PM aakash aakash 
> wrote:
>
> > Hi,
> >
> > I am trying to use Hudi 0.8 with Spark 3.0 in my prod environment and
> > earlier we were running Hudi 0.5 with Spark 2.4.4.
> >
> > While updating a very old index, I am getting this error :
> >
> > *from the logs it seem its  error out while reading this file :
> > hudi/.hoodie/archived/.commits_.archive.119_1-0-1 in s3*
> >
> > 21/06/22 19:18:06 ERROR HoodieTimelineArchiveLog: Failed to archive
> > commits, .commit file: 20200715192915.rollback.inflight
> > java.io.IOException: Not an Avro data file
> > at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
> > at
> >
> >
> org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:175)
> > at
> >
> >
> org.apache.hudi.client.utils.MetadataConversionUtils.createMetaWrapper(MetadataConversionUtils.java:84)
> > at
> >
> >
> org.apache.hudi.table.HoodieTimelineArchiveLog.convertToAvroRecord(HoodieTimelineArchiveLog.java:370)
> > at
> >
> >
> org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:311)
> > at
> >
> >
> org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:128)
> > at
> >
> >
> org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:430)
> > at
> >
> >
> org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:186)
> > at
> >
> >
> org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:121)
> > at
> >
> >
> org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:479)
> >
> >
> > Is this a backward compatibility issue? I have deleted a few archive
> files
> > but the problem is persisting so it does not look like a file corruption
> > issue.
> >
> > Regards,
> > Aakash
> >
>


Re: issue while reading archived commit written by 0.5 version with 0.8 version

2021-06-23 Thread Susu Dong
Hi Aakash,

I believe there were schema-level changes from Hudi 0.5.0 to 0.6.0 regarding
those commit files. So if you are jumping from 0.5.0 to 0.8.0 right away, you
will likely experience such an error, i.e. "Failed to archive commits". You
shouldn't need to delete archived files; instead, you should try deleting
some, if not all, of the active commit files under your *.hoodie* folder. The
reason is that 0.8.0 uses a new Avro schema to parse your old commit files,
which is why you got the failure. Can you try the above approach and let us
know? Thank you. :)
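
By the way, the "Not an Avro data file" message in your stack trace is simply
what Avro's DataFileReader throws when the file it is handed does not start
with the Avro magic bytes, e.g. an empty inflight marker file. A minimal,
self-contained reproduction (illustrative only, not Hudi code; it assumes
Avro is on the classpath) looks like this:

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class NotAnAvroFileDemo {
  public static void main(String[] args) throws IOException {
    // An empty file stands in for an inflight marker such as
    // 20200715192915.rollback.inflight, which carries no Avro payload.
    File empty = File.createTempFile("20200715192915.rollback", ".inflight");
    empty.deleteOnExit();

    try {
      // DataFileReader checks for the Avro magic bytes at the start of the
      // file and throws when they are missing.
      DataFileReader.openReader(empty, new GenericDatumReader<GenericRecord>());
    } catch (IOException e) {
      System.out.println(e.getMessage()); // prints: Not an Avro data file
    }
  }
}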

Best,
Susu

On Wed, Jun 23, 2021 at 12:21 PM aakash aakash 
wrote:

> Hi,
>
> I am trying to use Hudi 0.8 with Spark 3.0 in my prod environment and
> earlier we were running Hudi 0.5 with Spark 2.4.4.
>
> While updating a very old index, I am getting this error :
>
> *from the logs it seem its  error out while reading this file :
> hudi/.hoodie/archived/.commits_.archive.119_1-0-1 in s3*
>
> 21/06/22 19:18:06 ERROR HoodieTimelineArchiveLog: Failed to archive
> commits, .commit file: 20200715192915.rollback.inflight
> java.io.IOException: Not an Avro data file
> at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
> at
>
> org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:175)
> at
>
> org.apache.hudi.client.utils.MetadataConversionUtils.createMetaWrapper(MetadataConversionUtils.java:84)
> at
>
> org.apache.hudi.table.HoodieTimelineArchiveLog.convertToAvroRecord(HoodieTimelineArchiveLog.java:370)
> at
>
> org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:311)
> at
>
> org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:128)
> at
>
> org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:430)
> at
>
> org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:186)
> at
>
> org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:121)
> at
>
> org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:479)
>
>
> Is this a backward compatibility issue? I have deleted a few archive files
> but the problem is persisting so it does not look like a file corruption
> issue.
>
> Regards,
> Aakash
>


Re: [HELP] unstable tests in the travis CI

2021-06-23 Thread pzwpzw

Hi @Prashant Wason, I found that after [HUDI-1717] (commit hash:
11e64b2db0ddf8f816561f8442b373de15a26d71) was merged yesterday, the test case
TestHoodieBackedMetadata#testOnlyValidPartitionsAdded always crashes:


org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve files
in partition
/var/folders/my/841b2c052038ppns0csrf8g8gn/T/junit3095347769583879437/dataset/p1
from metadata

at 
org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:129)
at 
org.apache.hudi.metadata.TestHoodieBackedMetadata.testOnlyValidPartitionsAdded(TestHoodieBackedMetadata.java:210)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)


Can you take a look at this? Thanks~





On June 23, 2021, at 1:49 PM, Danny Chan  wrote:


Hi fellows, there are two test cases in the Travis CI that fail very
often, which blocks our work too many times. If these tests are not
stable, can we disable them first?
They are annoying ~


TestHoodieBackedMetadata.testOnlyValidPartitionsAdded[1]
HoodieSparkSqlWriterSuite: schema evolution for ... [2]

[1] https://travis-ci.com/github/apache/hudi/jobs/518067391
[2] https://travis-ci.com/github/apache/hudi/jobs/518067393

Best,
Danny Chan