SAMZA-1905: Added case study for eBay, Optimizely, Redfin and TripAdvisor As per subject
Author: Wei Song <ws...@linkedin.com> Reviewers: Jagadish <jagad...@apache.org> Closes #657 from weisong44/SAMZA-1905 Project: http://git-wip-us.apache.org/repos/asf/samza/repo Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/66d6c5d0 Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/66d6c5d0 Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/66d6c5d0 Branch: refs/heads/master Commit: 66d6c5d09645bcf4dc4051deda3fed31848d0340 Parents: 55989bf Author: Wei Song <ws...@linkedin.com> Authored: Thu Sep 27 08:50:29 2018 -0700 Committer: Jagadish <jvenkatra...@linkedin.com> Committed: Thu Sep 27 08:50:29 2018 -0700 ---------------------------------------------------------------------- docs/_case-studies/ebay.md | 56 +++++++++++++++++++++++ docs/_case-studies/optimizely.md | 42 +++++++++++------ docs/_case-studies/redfin.md | 48 +++++++++++++++++-- docs/_case-studies/tripadvisor.md | 71 +++++++++++++++++++++++++++++ docs/img/case-studies/redfin.svg | 1 + docs/img/case-studies/trip-advisor.svg | 1 + 6 files changed, 203 insertions(+), 16 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/samza/blob/66d6c5d0/docs/_case-studies/ebay.md ---------------------------------------------------------------------- diff --git a/docs/_case-studies/ebay.md b/docs/_case-studies/ebay.md new file mode 100644 index 0000000..3e1d932 --- /dev/null +++ b/docs/_case-studies/ebay.md @@ -0,0 +1,56 @@ +--- +layout: case-study +hide_title: true # so we have control in case-study layout, but can still use page +title: Low Latency Web Scale Fraud Prevention +study_domain: ebay.com +menu_title: eBay +excerpt_separator: <!--more--> +--- +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +Low Latency Web Scale Fraud Prevention + +<!--more--> + +eBay Enterprise is the worldâs largest omni-channel commerce provider with +hundreds millions of units shipped annually, as commerce gets more +convenient and complex, so does fraud. The engineering team at eBay +Enterprise selected Samza as the platform to build the horizontally +scalable, realtime (sub-seconds) and fault tolerant abnormality detection +system. For example, the system computes and evaluates key metrics to +detect abnormal behaviors + +- Transaction velocity (#tnx/day) and change (#tnx/day vs #tnx/day over n days) +- Amount velocity ($tnx/day) and change ($tnx/day vs $tnx/day over n days) + +A wide range of realtime and historical adjunct data from various sources +including people, places, interests, social and connections are ingested +through Kafka, and stored in local RocksDB state store with changelog +enabled for recovery. Incoming transaction data is aggregated using +windowing and then joined with adjunct data stores in multiple stages. +The system generates potential fraud cases for review real time. Finally, +the engineering team at eBay Enterprise has built an OpenTSDB and Grafana +based monitoring system using metrics collected through JMX. + +Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*, +*JMX-metrics* + +More information + +- [https://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends](https://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends) +- [http://ebayenterprise.com/](http://ebayenterprise.com/) http://git-wip-us.apache.org/repos/asf/samza/blob/66d6c5d0/docs/_case-studies/optimizely.md ---------------------------------------------------------------------- diff --git a/docs/_case-studies/optimizely.md b/docs/_case-studies/optimizely.md index 8b18233..6c3e240 100644 --- a/docs/_case-studies/optimizely.md +++ b/docs/_case-studies/optimizely.md @@ -1,7 +1,7 @@ --- layout: case-study hide_title: true # so we have control in case-study layout, but can still use page -title: Real Time Session Aggregation at Optimizely +title: Real Time Session Aggregation study_domain: optimizely.com menu_title: Optimizely excerpt_separator: <!--more--> @@ -23,18 +23,27 @@ excerpt_separator: <!--more--> limitations under the License. --> -Testing the excerpt +Real Time Session Aggregation <!--more--> -Optimizely is a worldâs leading experimentation platform, enabling businesses to deliver continuous experimentation and personalization across websites, mobile apps and connected devices. At Optimizely, billions of events are tracked on a daily basis. Session metrics are among the key metrics provided to their end user in real time. Prior to introducing Samza for realtime computation, the engineering team at Optimizely used HBase to store and serve experimentation data, and Druid for personalization data including session metrics. As business requirements evolved, the Druid-based solution became more and more challenging. +Optimizely is a worldâs leading experimentation platform, enabling businesses to +deliver continuous experimentation and personalization across websites, mobile +apps and connected devices. At Optimizely, billions of events are tracked on a +daily basis. Session metrics are among the key metrics provided to their end user +in real time. Prior to introducing Samza for their realtime computation, the +engineering team at Optimizely built their data-pipeline using a complex +[Lambda architecture] (http://lambda-architecture.net/) leveraging +[Druid and Hbase] (https://medium.com/engineers-optimizely/building-a-scalable-data-pipeline-bfe3f531eb38). +As business requirements evolve, this solution became more and more challenging. -- Long delays in session metrics caused by M/R jobs -- Reprocessing of events due to inability to incrementally update Druid index -- Difficulties in scaling dimensions and cardinality -- Queries expanding long time periods are expensive - -The engineering team at Optimizely decided to move away from Druid and focus on HBase as the store, and introduced stream processing to pre-aggregate and deduplicate session events. They evaluated multiple stream processing platforms and chose Samza as their stream processing platform. In their solution, every session event is tagged with an identifier for up to 30 minutes; upon receiving a session event, the Samza job updates session metadata and aggregates counters for the session that is stored in a local RocksDB state store. At the end of each one-minute window, aggregated session metrics are ingested to HBase. With the new solution +The engineering team at Optimizely decided to move away from Druid and focus on +HBase as the store, and introduced stream processing to pre-aggregate and +deduplicate session events. In their solution, every session event is tagged +with an identifier for up to 30 minutes; upon receiving a session event, the +Samza job updates session metadata and aggregates counters for the session +that is stored in a local RocksDB state store. At the end of each one-minute +window, aggregated session metrics are ingested to HBase. With the new solution - The median query latency was reduced from 40+ ms to 5 ms - Session metrics are now available in realtime @@ -44,15 +53,22 @@ The engineering team at Optimizely decided to move away from Druid and focus on Here is a testimonial from Optimizely -âAt Optimizely, we have built the worldâs leading experimentation platform, which ingests billions of click-stream events a day from millions of visitors for analysis. Apache Samza has been a great asset to Optimizely's Event ingestion pipeline allowing us to perform large scale, real time stream computing such as aggregations (e.g. session computations) and data enrichment on a multiple billion events / day scale. The programming model, durability and the close integration with Apache Kafka fit our needs perfectlyâ said Vignesh Sukumar, Senior Engineering Manager at Optimizelyâ +âAt Optimizely, we have built the worldâs leading experimentation platform, +which ingests billions of click-stream events a day from millions of visitors +for analysis. Apache Samza has been a great asset to Optimizely's Event +ingestion pipeline allowing us to perform large scale, real time stream +computing such as aggregations (e.g. session computations) and data enrichment +on a multiple billion events / day scale. The programming model, durability +and the close integration with Apache Kafka fit our needs perfectlyâ said +Vignesh Sukumar, Senior Engineering Manager at Optimizelyâ -In addition, stream processing is also applied to other use cases such as data enrichment, event stream partitioning and metrics processing at Optimizely. +In addition, stream processing is also applied to other use cases such as +data enrichment, event stream partitioning and metrics processing at Optimizely. -Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*, *Scalability*, *Fault-tolerant* +Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration* More information - [https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-1-aed2051dd7a3](https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-1-aed2051dd7a3) - - [https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-2-b596350a7820](https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-2-b596350a7820) \ No newline at end of file http://git-wip-us.apache.org/repos/asf/samza/blob/66d6c5d0/docs/_case-studies/redfin.md ---------------------------------------------------------------------- diff --git a/docs/_case-studies/redfin.md b/docs/_case-studies/redfin.md index b4e5098..ba03582 100644 --- a/docs/_case-studies/redfin.md +++ b/docs/_case-studies/redfin.md @@ -1,7 +1,7 @@ --- layout: case-study # the layout to use hide_title: true # so we have control in case-study layout, but can still use page -title: Totally awesome use-case of samza by Redfin # title of case study page +title: Realtime Notifications at Redfin study_domain: redfin.com # just the domain, not the protocol menu_title: Redfin # what shows up in the menu excerpt_separator: <!--more--> @@ -23,8 +23,50 @@ excerpt_separator: <!--more--> limitations under the License. --> -Testing the excerpt +Realtime Notifications <!--more--> -Markdown content goes here \ No newline at end of file +Redfin is a leading full-service real estate brokerage that uses modern technology +to help people buy and sell homes. Notification is the critical feature to +communicate with Redfinâs customers, notification includes recommendations, instant +emails, scheduled digests and push notifications. Thousands of emails are delivered +to customers every minute at peak. + +The notification system used to be a monolithic system, which served the company +well. However, as business grew and requirements evolved, it became harder and +harder to maintain and scale. + +![Samza pipeline at Redfin](/img/case-studies/redfin.svg) + +The engineering team at Redfin decided to replace +the existing system with Samza primarily for Samzaâs performance, scalability, +support for stateful processing and Kafka-integration. A multi-stage stream +processing pipeline was developed. At the Identify stage, external events +such as new Listings are identified as candidates for new notification; +then potential recipients of notifications are determined by analyzing data in +events and customer profiles, results are grouped by customer at the end of +each time window at the Match Stage; once recipients and notification outlines are +identified, the Organize stage retrieves adjunct data necessary to appear in each +notification from various data sources by joining them with notification and +customer profiles, results are stored/merged in local RocksDB state store; finally +notifications are formatted at the Format stage and sent to notification + delivery system at the Notify stage. + +With the new notification system + +- The system is more performant and horizontally scalable +- It is now easier to add support for new use cases +- Reduced pressure on other system due to the use of local RocksDB state store +- Processing stages can be scaled individually + +Other engineering teams at Redfin are also using Samza for business metrics +calculation, document processing, event scheduling. + +Key Samza Features: *Stateful processing*, *Windowing*, *Kafka-integration* + +More information + +- [https://www.youtube.com/watch?v=cfy0xjJJf7Y](https://www.youtube.com/watch?v=cfy0xjJJf7Y) +- [https://www.redfin.com/](https://www.redfin.com/) + http://git-wip-us.apache.org/repos/asf/samza/blob/66d6c5d0/docs/_case-studies/tripadvisor.md ---------------------------------------------------------------------- diff --git a/docs/_case-studies/tripadvisor.md b/docs/_case-studies/tripadvisor.md new file mode 100644 index 0000000..1d47ede --- /dev/null +++ b/docs/_case-studies/tripadvisor.md @@ -0,0 +1,71 @@ +--- +layout: case-study +hide_title: true # so we have control in case-study layout, but can still use page +title: Hedwig - Converting Hadoop M/R ETL systems to Stream Processing +study_domain: tripadvisor.com +menu_title: TripAdvisor +excerpt_separator: <!--more--> +--- +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +Hedwig - Converting Hadoop M/R ETL systems to Stream Processing + +<!--more--> + +TripAdvisor is one of the worldâs largest travel website that provides hotel +and restaurant reviews, accommodation bookings and other travel-related +content. It produces and processes billions events processed everyday +including billing records, reports, monitoring events and application +notifications. + +Prior to migrating to Samza, TripAdvisor used Hadoop to ETL its data. Raw +data was rolled up to hourly and daily in a number of stages with joins +and sliding windows applied, session data is then produced from daily data. +About 300 million sessions are produced daily. With this solution, the +engineering team were faced with a few challenges + +- Long lag time to downstream that is business critical +- Difficult to debug and troubleshoot due to scripts, environments, etc. +- Adding more nodes doesnât help to scale + +The engineering team at TripAdvisor decided to replace the Hadoop solution +with a multi-stage Samza pipeline. + +![Samza pipeline at TripAdvisor](/img/case-studies/trip-advisor.svg) + +In the new solution, after raw data is first collected by Flume and ingested +through a Kafka cluster, they are parsed, cleansed and partitioned by the +Lookback Router; then processing logic such as windowing, grouping, joining, +fraud detection are applied by the Session Collector and the Fraud Collector, +RocksDB is used as the local store for intermediate states; finally the Uploader +uploads results to HDFS, ElasticSearch, RedShift and Hive. + +The new solution achieved significant improvements: + +- Processing time is reduced from 3 hours to 1 hour +- Individual stages in the pipeline are scaled independently +- Overall hardware requirement is reduced to â thanks to optimized usage +- Much simpler to debug and test the solution + +Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration* + +More information + +- [https://www.youtube.com/watch?v=KQ5OnL2hMBY](https://www.youtube.com/watch?v=KQ5OnL2hMBY) +- [https://www.tripadvisor.com/](https://www.tripadvisor.com/) + \ No newline at end of file