Thanks a lot Vinoth! Best regards, Bill
On Sat, Jun 26, 2021 at 9:24 PM Vinoth Chandar <vin...@apache.org> wrote:
> Yes. That's a working approach.
>
> One thing I would like to suggest is using Hudi's incremental queries to
> update DynamoDB, as opposed to exporting in full periodically. Depending on
> how much of your target DynamoDB table changes between loads, it can save
> you cost and time.
>
> On Sat, Jun 26, 2021 at 5:43 PM Jialun Liu <liujialu...@gmail.com> wrote:
>
> > Hey Vinoth,
> >
> > Thanks for your reply!
> >
> > I am actually looking in a different direction at the moment: write the
> > transformed data into an OLTP database, e.g. DynamoDB, and periodically
> > export any data that needs to support low-latency, high-throughput reads.
> >
> > Not sure if this is the right pattern; I would appreciate it if you could
> > point me to any similar architecture that I could study.
> >
> > Best regards,
> > Bill
> >
> > On Wed, Jun 23, 2021 at 3:51 PM Vinoth Chandar <vin...@apache.org> wrote:
> >
> > > >>>> Maybe it is just not sane to serve online request-response service
> > > >>>> using Data lake as backend?
> > > In general, data lakes have not evolved beyond analytics and ML at this
> > > point, i.e. they are optimized for large batch scans.
> > >
> > > Not to say that this cannot be possible, but I am skeptical that it will
> > > ever be as low-latency as your regular OLTP database.
> > > Object store random reads are definitely going to cost ~100ms, like
> > > reading from a highly loaded hard drive.
> > >
> > > Hudi does support an HFile format, which is more optimized for random
> > > reads. We use it to store and serve table metadata.
> > > So that path is worth pursuing, if you have the appetite for trying to
> > > change the norm here. :)
> > > There is probably some work to do here to scale it for large amounts of
> > > data.
> > >
> > > Hope that helps.
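[The incremental-sync pattern Vinoth suggests above could be sketched roughly
as follows. This is a toy illustration only: plain Python dicts stand in for
the Hudi table and the DynamoDB table. In a real job, the pull would be a Hudi
incremental query (reading records committed after the saved checkpoint
instant) and the push a DynamoDB batch write; the field and key names here are
made up for the example.]

```python
# Toy sketch of incremental sync: upsert only rows committed after the last
# checkpoint, instead of re-exporting the whole table each run.

def incremental_sync(hudi_records, checkpoint, dynamo_table):
    """Upsert records committed after `checkpoint`; return the new checkpoint."""
    new_checkpoint = checkpoint
    for rec in hudi_records:
        # Stand-in for a Hudi incremental query's begin-instant filter.
        if rec["_hoodie_commit_time"] > checkpoint:
            # Stand-in for a DynamoDB upsert of just the changed row.
            dynamo_table[rec["key"]] = rec["pull_count"]
            new_checkpoint = max(new_checkpoint, rec["_hoodie_commit_time"])
    return new_checkpoint

records = [
    {"key": "obj-a", "pull_count": 10, "_hoodie_commit_time": "20210626010101"},
    {"key": "obj-b", "pull_count": 3,  "_hoodie_commit_time": "20210626020202"},
]
table = {}
ckpt = incremental_sync(records, "20210626015959", table)
# Only obj-b was committed after the checkpoint, so only obj-b is written.
```

[The saving Vinoth mentions comes from the filter: each run touches only the
rows changed since the last run, not the full table.]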
> > >
> > > Thanks,
> > > Vinoth
> > >
> > > On Mon, Jun 7, 2021 at 4:04 PM Jialun Liu <liujialu...@gmail.com> wrote:
> > >
> > > > Hey Gary,
> > > >
> > > > Thanks for your reply!
> > > >
> > > > It is kind of sad that we are not able to serve the insights to
> > > > commercial customers in real time.
> > > >
> > > > Do we have any best practices or design patterns to get around this
> > > > problem, in order to support an online service with low-latency,
> > > > high-throughput random reads?
> > > >
> > > > Best regards,
> > > > Bill
> > > >
> > > > On Sun, Jun 6, 2021 at 2:19 AM Gary Li <gar...@apache.org> wrote:
> > > >
> > > > > Hi Bill,
> > > > >
> > > > > Data lakes are used for offline analytics workloads with minutes of
> > > > > latency. A data lake (at least Hudi) does not fit the online
> > > > > request-response service you described, for now.
> > > > >
> > > > > Best,
> > > > > Gary
> > > > >
> > > > > On Sun, Jun 6, 2021 at 12:29 PM Jialun Liu <liujialu...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hey Felix,
> > > > > >
> > > > > > Thanks for your reply!
> > > > > >
> > > > > > I briefly researched Presto. It looks like it is designed to
> > > > > > support highly concurrent big data SQL queries; the official doc
> > > > > > suggests it can process queries in sub-seconds to minutes.
> > > > > > https://prestodb.io/
> > > > > > "Presto is targeted at analysts who expect response times ranging
> > > > > > from sub-second to minutes."
> > > > > >
> > > > > > However, the doc seems to suggest that it is meant to be used by
> > > > > > analysts running offline queries, and it is not designed to be
> > > > > > used as an OLTP database.
> > > > > > https://prestodb.io/docs/current/overview/use-cases.html
> > > > > >
> > > > > > I am wondering whether it is technically possible today to use a
> > > > > > data lake to support milliseconds-latency, high-throughput random
> > > > > > reads at all. Am I just not thinking in the right direction?
> > > > > > Maybe it is just not sane to serve an online request-response
> > > > > > service using a data lake as the backend?
> > > > > >
> > > > > > Best regards,
> > > > > > Bill
> > > > > >
> > > > > > On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix
> > > > > > <felix.j...@philips.com.invalid> wrote:
> > > > > >
> > > > > > > Hi Bill,
> > > > > > >
> > > > > > > Did you try using Presto (from EMR) to query Hudi tables on S3?
> > > > > > > It can support real-time queries. And you have to partition
> > > > > > > your data properly to minimize the amount of data each query
> > > > > > > has to scan/process.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Felix K Jose
> > > > > > >
> > > > > > > From: Jialun Liu <liujialu...@gmail.com>
> > > > > > > Date: Saturday, June 5, 2021 at 3:53 PM
> > > > > > > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > > > > > > Subject: Could Hudi Data lake support low latency, high
> > > > > > > throughput random reads?
> > > > > > >
> > > > > > > Hey guys,
> > > > > > >
> > > > > > > I am not sure if this is the right forum for this question; if
> > > > > > > you know where it should be directed, I would appreciate your
> > > > > > > help!
> > > > > > >
> > > > > > > The question is: could a Hudi data lake support low-latency,
> > > > > > > high-throughput random reads?
> > > > > > >
> > > > > > > I am considering building a data lake that produces auxiliary
> > > > > > > information for my main service table.
> > > > > > > For example, say my main service is S3 and I want to produce
> > > > > > > the S3 object pull count as the auxiliary information. I am
> > > > > > > going to use Apache Hudi and EMR to process the S3 access log
> > > > > > > to produce the pull count. Now, what I don't know is: can a
> > > > > > > data lake support low-latency, high-throughput random reads
> > > > > > > for an online request-response type of service? That way I
> > > > > > > could serve this information to customers in real time.
> > > > > > >
> > > > > > > I could write the auxiliary information (the pull count) back
> > > > > > > to the main service table, but I personally don't think that
> > > > > > > is a sustainable architecture. It would be hard to do
> > > > > > > independent and agile development if I continue to add more
> > > > > > > derived attributes to the main table.
> > > > > > >
> > > > > > > Any help would be appreciated!
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Bill
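[The pull-count computation in Bill's original question above can be
illustrated with a small aggregation sketch. The log-record shape is
simplified (real S3 server access logs carry many more fields), and in
practice this aggregation would run as the Hudi/EMR job on the raw logs; the
field names below are assumptions for the example.]

```python
# Toy sketch: aggregate S3 access-log records into a per-object GET count,
# the "auxiliary information" the Hudi table would hold.
from collections import Counter

def pull_counts(log_records):
    """Count GET-object operations per object key."""
    counts = Counter()
    for rec in log_records:
        # REST.GET.OBJECT is the S3 access-log operation name for a GET.
        if rec["operation"] == "REST.GET.OBJECT":
            counts[rec["key"]] += 1
    return dict(counts)

logs = [
    {"key": "photos/1.jpg", "operation": "REST.GET.OBJECT"},
    {"key": "photos/1.jpg", "operation": "REST.GET.OBJECT"},
    {"key": "photos/2.jpg", "operation": "REST.PUT.OBJECT"},
]
counts = pull_counts(logs)
# photos/1.jpg was fetched twice; the PUT on photos/2.jpg is not a pull.
```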
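[Felix's advice earlier in the thread, to partition the data so each query
scans less, comes down to partition pruning: if the table is laid out by a
partition column such as date, an engine like Presto can skip whole
partitions that a query's predicate rules out. The dict below is a toy
stand-in for a partitioned S3 layout (e.g. .../table/date=.../...); the
paths and column names are invented for the illustration.]

```python
# Toy sketch of partition pruning: a query with a date predicate reads only
# the matching partition, not the whole table.

partitions = {
    "date=2021-06-04": [{"key": "obj-a", "pull_count": 7}],
    "date=2021-06-05": [{"key": "obj-a", "pull_count": 9},
                        {"key": "obj-b", "pull_count": 2}],
}

def query(partitions, wanted_date):
    """Scan only the partition matching the predicate."""
    rows = partitions.get(f"date={wanted_date}", [])  # pruning step
    return {r["key"]: r["pull_count"] for r in rows}

result = query(partitions, "2021-06-05")
# Only the date=2021-06-05 partition's rows are scanned and returned.
```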