Hi Bill,

Data lakes are built for offline analytics workloads with latency measured in minutes. A data lake (at least with Hudi) is not a fit for the online request-response service you describe, at least for now.
Best,
Gary

On Sun, Jun 6, 2021 at 12:29 PM Jialun Liu <liujialu...@gmail.com> wrote:

> Hey Felix,
>
> Thanks for your reply!
>
> I briefly researched Presto; it looks like it is designed to support
> highly concurrent big-data SQL queries. The official doc suggests it can
> process queries in sub-seconds to minutes.
> https://prestodb.io/
> "Presto is targeted at analysts who expect response times ranging from
> sub-second to minutes."
>
> However, the doc also suggests that it is meant for analysts running
> offline queries, and that it is not designed to be used as an OLTP
> database.
> https://prestodb.io/docs/current/overview/use-cases.html
>
> I am wondering whether it is technically possible today to use a data
> lake to support high-throughput random reads at millisecond latency. Am
> I just not thinking in the right direction? Maybe it is simply not sane
> to serve an online request-response service with a data lake as the
> backend?
>
> Best regards,
> Bill
>
> On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix
> <felix.j...@philips.com.invalid> wrote:
>
> > Hi Bill,
> >
> > Have you tried using Presto (from EMR) to query Hudi tables on S3? It
> > can support real-time queries. You also have to partition your data
> > properly to minimize the amount of data each query has to
> > scan/process.
> >
> > Regards,
> > Felix K Jose
> >
> > From: Jialun Liu <liujialu...@gmail.com>
> > Date: Saturday, June 5, 2021 at 3:53 PM
> > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > Subject: Could Hudi Data lake support low latency, high throughput
> > random reads?
> >
> > Hey guys,
> >
> > I am not sure if this is the right forum for this question; if you
> > know where it should be directed, I'd appreciate your help!
> >
> > The question is: "Could a Hudi data lake support low-latency,
> > high-throughput random reads?"
> >
> > I am considering building a data lake that produces auxiliary
> > information for my main service table. For example, say my main
> > service is S3, and I want to produce the S3 object pull count as the
> > auxiliary information. I am going to use Apache Hudi and EMR to
> > process the S3 access logs to produce the pull count. What I don't
> > know is whether a data lake can support low-latency, high-throughput
> > random reads for an online request-response type of service. That way
> > I could serve this information to customers in real time.
> >
> > I could write the auxiliary information, the pull count, back to the
> > main service table, but I personally don't think that is a
> > sustainable architecture. It would be hard to do independent and
> > agile development if I keep adding more derived attributes to the
> > main table.
> >
> > Any help would be appreciated!
> >
> > Best regards,
> > Bill
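[Editor's note] Felix's suggestion in the thread — partition the Hudi table so that each Presto query scans as little data as possible — can be sketched as a query builder. This is only an illustration: the table name, the partition column `dt`, and the columns `bucket`, `object_key`, and `operation` (with the S3 access-log operation code `REST.GET.OBJECT`) are assumptions, not something prescribed by the thread.

```python
from datetime import date

# Hypothetical Hudi table registered in the Hive metastore; a real setup
# would sync the Hudi table to the metastore and query it through Presto.
TABLE = "hive.logs.s3_access_log"

def pull_count_query(bucket: str, day: date) -> str:
    """Build a Presto query that counts GET requests per object key.

    Filtering on the partition column (assumed here to be `dt`) lets
    Presto prune partitions, so each query scans only one day of logs.
    """
    return (
        "SELECT object_key, count(*) AS pull_count "
        f"FROM {TABLE} "
        f"WHERE dt = DATE '{day.isoformat()}' "  # partition pruning
        f"AND bucket = '{bucket}' "
        "AND operation = 'REST.GET.OBJECT' "
        "GROUP BY object_key"
    )

print(pull_count_query("my-bucket", date(2021, 6, 5)))
```

Even with good partitioning, this remains a scan-and-aggregate path, which is why Presto's own docs frame it as sub-second-to-minutes analytics rather than OLTP.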
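[Editor's note] Gary's answer at the top of the thread points to the usual workaround: keep Hudi/EMR for the offline computation, then publish the finished aggregate to a separate online store that can serve millisecond point lookups. A minimal sketch of that pattern follows, with a plain dict standing in for a real key-value store such as DynamoDB or Redis (my assumption; the thread names no specific store).

```python
from typing import Dict

# Stand-in for a low-latency online key-value store.
online_store: Dict[str, int] = {}

def publish_pull_counts(batch_result: Dict[str, int]) -> None:
    """Export the batch job's output to the online store.

    This would run after each Hudi/EMR refresh of the pull-count table.
    """
    online_store.update(batch_result)

def get_pull_count(object_key: str) -> int:
    """Online request-response path: an O(1) point lookup, no lake scan."""
    return online_store.get(object_key, 0)

# The periodic lake job would produce something like this aggregate:
publish_pull_counts({"reports/2021/06/05.csv": 42, "images/logo.png": 7})
print(get_pull_count("images/logo.png"))  # -> 7
```

This also avoids writing derived attributes back into the main service table: the serving store is owned by the auxiliary pipeline, so new derived attributes can be added independently of the main table's schema.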