[3/3] samza git commit: SAMZA-1905: Added case study for eBay, Optimizely, Redfin and TripAdvisor

jagadish Thu, 27 Sep 2018 08:53:28 -0700

SAMZA-1905: Added case study for eBay, Optimizely, Redfin and TripAdvisor

As per subject


Author: Wei Song <ws...@linkedin.com>

Reviewers: Jagadish <jagad...@apache.org>

Closes #657 from weisong44/SAMZA-1905


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/66d6c5d0
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/66d6c5d0
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/66d6c5d0

Branch: refs/heads/master
Commit: 66d6c5d09645bcf4dc4051deda3fed31848d0340
Parents: 55989bf
Author: Wei Song <ws...@linkedin.com>
Authored: Thu Sep 27 08:50:29 2018 -0700
Committer: Jagadish <jvenkatra...@linkedin.com>
Committed: Thu Sep 27 08:50:29 2018 -0700

----------------------------------------------------------------------
 docs/_case-studies/ebay.md             | 56 +++++++++++++++++++++++
 docs/_case-studies/optimizely.md       | 42 +++++++++++------
 docs/_case-studies/redfin.md           | 48 +++++++++++++++++--
 docs/_case-studies/tripadvisor.md      | 71 +++++++++++++++++++++++++++++
 docs/img/case-studies/redfin.svg       |  1 +
 docs/img/case-studies/trip-advisor.svg |  1 +
 6 files changed, 203 insertions(+), 16 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/66d6c5d0/docs/_case-studies/ebay.md
----------------------------------------------------------------------
diff --git a/docs/_case-studies/ebay.md b/docs/_case-studies/ebay.md
new file mode 100644
index 0000000..3e1d932
--- /dev/null
+++ b/docs/_case-studies/ebay.md
@@ -0,0 +1,56 @@
+---
+layout: case-study
+hide_title: true # so we have control in case-study layout, but can still use 
page
+title: Low Latency Web Scale Fraud Prevention
+study_domain: ebay.com
+menu_title: eBay
+excerpt_separator: <!--more-->
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+Low Latency Web Scale Fraud Prevention
+
+<!--more-->
+
+eBay Enterprise is the worldâs largest omni-channel commerce provider with 
+hundreds millions of units shipped annually, as commerce gets more 
+convenient and complex, so does fraud. The engineering team at eBay 
+Enterprise selected Samza as the platform to build the horizontally 
+scalable, realtime (sub-seconds) and fault tolerant abnormality detection 
+system. For example, the system computes and evaluates key metrics to 
+detect abnormal behaviors
+
+-   Transaction velocity (#tnx/day) and change (#tnx/day vs #tnx/day over n 
days)
+-   Amount velocity ($tnx/day) and change ($tnx/day vs $tnx/day over n days)
+
+A wide range of realtime and historical adjunct data from various sources 
+including people, places, interests, social and connections are ingested 
+through Kafka, and stored in local RocksDB state store with changelog 
+enabled for recovery. Incoming transaction data is aggregated using 
+windowing and then joined with adjunct data stores in multiple stages. 
+The system generates potential fraud cases for review real time. Finally, 
+the engineering team at eBay Enterprise has built an OpenTSDB and Grafana 
+based monitoring system using metrics collected through JMX.
+
+Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*,
+*JMX-metrics*
+
+More information
+
+-   
[https://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends](https://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends)
+-   [http://ebayenterprise.com/](http://ebayenterprise.com/)

http://git-wip-us.apache.org/repos/asf/samza/blob/66d6c5d0/docs/_case-studies/optimizely.md
----------------------------------------------------------------------
diff --git a/docs/_case-studies/optimizely.md b/docs/_case-studies/optimizely.md
index 8b18233..6c3e240 100644
--- a/docs/_case-studies/optimizely.md
+++ b/docs/_case-studies/optimizely.md
@@ -1,7 +1,7 @@
 ---
 layout: case-study
 hide_title: true # so we have control in case-study layout, but can still use 
page
-title: Real Time Session Aggregation at Optimizely
+title: Real Time Session Aggregation
 study_domain: optimizely.com
 menu_title: Optimizely
 excerpt_separator: <!--more-->
@@ -23,18 +23,27 @@ excerpt_separator: <!--more-->
    limitations under the License.
 -->
 
-Testing the excerpt
+Real Time Session Aggregation
 
 <!--more-->
 
-Optimizely is a worldâs leading experimentation platform, enabling 
businesses to deliver continuous experimentation and personalization across 
websites, mobile apps and connected devices. At Optimizely, billions of events 
are tracked on a daily basis. Session metrics are among the key metrics 
provided to their end user in real time. Prior to introducing Samza for 
realtime computation, the engineering team at Optimizely used HBase to store 
and serve experimentation data, and Druid for personalization data including 
session metrics. As business requirements evolved, the Druid-based solution 
became more and more challenging.
+Optimizely is a worldâs leading experimentation platform, enabling 
businesses to 
+deliver continuous experimentation and personalization across websites, mobile 
+apps and connected devices. At Optimizely, billions of events are tracked on a 
+daily basis. Session metrics are among the key metrics provided to their end 
user 
+in real time. Prior to introducing Samza for their realtime computation, the 
+engineering team at Optimizely built their data-pipeline using a complex 
+[Lambda architecture] (http://lambda-architecture.net/) leveraging 
+[Druid and Hbase] 
(https://medium.com/engineers-optimizely/building-a-scalable-data-pipeline-bfe3f531eb38).
 
+As business requirements evolve, this solution became more and more 
challenging.
 
--   Long delays in session metrics caused by M/R jobs
--   Reprocessing of events due to inability to incrementally update Druid index
--   Difficulties in scaling dimensions and cardinality
--   Queries expanding long time periods are expensive
-
-The engineering team at Optimizely decided to move away from Druid and focus 
on HBase as the store, and introduced stream processing to pre-aggregate and 
deduplicate session events. They evaluated multiple stream processing platforms 
and chose Samza as their stream processing platform. In their solution, every 
session event is tagged with an identifier for up to 30 minutes; upon receiving 
a session event, the Samza job updates session metadata and aggregates counters 
for the session that is stored in a local RocksDB state store. At the end of 
each one-minute window, aggregated session metrics are ingested to HBase. With 
the new solution
+The engineering team at Optimizely decided to move away from Druid and focus 
on 
+HBase as the store, and introduced stream processing to pre-aggregate and 
+deduplicate session events. In their solution, every session event is tagged 
+with an identifier for up to 30 minutes; upon receiving a session event, the 
+Samza job updates session metadata and aggregates counters for the session 
+that is stored in a local RocksDB state store. At the end of each one-minute 
+window, aggregated session metrics are ingested to HBase. With the new solution
 
 -   The median query latency was reduced from 40+ ms to 5 ms
 -   Session metrics are now available in realtime
@@ -44,15 +53,22 @@ The engineering team at Optimizely decided to move away 
from Druid and focus on
  
 Here is a testimonial from Optimizely
 
-âAt Optimizely, we have built the worldâs leading experimentation 
platform, which ingests billions of click-stream events a day from millions of 
visitors for analysis. Apache Samza has been a great asset to Optimizely's 
Event ingestion pipeline allowing us to perform large scale, real time stream 
computing such as aggregations (e.g. session computations) and data enrichment 
on a multiple billion events / day scale. The programming model, durability and 
the close integration with Apache Kafka fit our needs perfectlyâ said Vignesh 
Sukumar, Senior Engineering Manager at Optimizelyâ
+âAt Optimizely, we have built the worldâs leading experimentation 
platform, 
+which ingests billions of click-stream events a day from millions of visitors 
+for analysis. Apache Samza has been a great asset to Optimizely's Event 
+ingestion pipeline allowing us to perform large scale, real time stream 
+computing such as aggregations (e.g. session computations) and data enrichment 
+on a multiple billion events / day scale. The programming model, durability 
+and the close integration with Apache Kafka fit our needs perfectlyâ said 
+Vignesh Sukumar, Senior Engineering Manager at Optimizelyâ
 
-In addition, stream processing is also applied to other use cases such as data 
enrichment, event stream partitioning and metrics processing at Optimizely.
+In addition, stream processing is also applied to other use cases such as 
+data enrichment, event stream partitioning and metrics processing at 
Optimizely.
 
-Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*, 
*Scalability*, *Fault-tolerant*
+Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*
 
 More information
 
 -   
[https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-1-aed2051dd7a3](https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-1-aed2051dd7a3)
-    
 -   
[https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-2-b596350a7820](https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-2-b596350a7820)
     
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/samza/blob/66d6c5d0/docs/_case-studies/redfin.md
----------------------------------------------------------------------
diff --git a/docs/_case-studies/redfin.md b/docs/_case-studies/redfin.md
index b4e5098..ba03582 100644
--- a/docs/_case-studies/redfin.md
+++ b/docs/_case-studies/redfin.md
@@ -1,7 +1,7 @@
 ---
 layout: case-study # the layout to use
 hide_title: true # so we have control in case-study layout, but can still use 
page
-title: Totally awesome use-case of samza by Redfin # title of case study page
+title: Realtime Notifications at Redfin
 study_domain: redfin.com # just the domain, not the protocol
 menu_title: Redfin # what shows up in the menu
 excerpt_separator: <!--more-->
@@ -23,8 +23,50 @@ excerpt_separator: <!--more-->
    limitations under the License.
 -->
 
-Testing the excerpt
+Realtime Notifications
 
 <!--more-->
 
-Markdown content goes here
\ No newline at end of file
+Redfin is a leading full-service real estate brokerage that uses modern 
technology 
+to help people buy and sell homes. Notification is the critical feature to 
+communicate with Redfinâs customers, notification includes recommendations, 
instant 
+emails, scheduled digests and push notifications. Thousands of emails are 
delivered 
+to customers every minute at peak. 
+
+The notification system used to be a monolithic system, which served the 
company 
+well. However, as business grew and requirements evolved, it became harder and 
+harder to maintain and scale. 
+
+![Samza pipeline at Redfin](/img/case-studies/redfin.svg)
+
+The engineering team at Redfin decided to replace 
+the existing system with Samza primarily for Samzaâs performance, 
scalability, 
+support for stateful processing and Kafka-integration. A multi-stage stream 
+processing pipeline was developed. At the Identify stage, external events 
+such as new Listings are identified as candidates for new notification;
+then potential recipients of notifications are determined by analyzing data in 
+events and customer profiles, results are grouped by customer at the end of 
+each time window at the Match Stage; once recipients and notification outlines 
are 
+identified, the Organize stage retrieves adjunct data necessary to appear in 
each 
+notification from various data sources by joining them with notification and 
+customer profiles, results are stored/merged in local RocksDB state store; 
finally 
+notifications are formatted at the Format stage and sent to notification
+ delivery system at the Notify stage. 
+
+With the new notification system
+
+-   The system is more performant and horizontally scalable
+-   It is now easier to add support for new use cases
+-   Reduced pressure on other system due to the use of local RocksDB state 
store
+-   Processing stages can be scaled individually
+
+Other engineering teams at Redfin are also using Samza for business metrics 
+calculation, document processing, event scheduling.
+
+Key Samza Features: *Stateful processing*, *Windowing*, *Kafka-integration*
+
+More information
+
+-   
[https://www.youtube.com/watch?v=cfy0xjJJf7Y](https://www.youtube.com/watch?v=cfy0xjJJf7Y)
+-   [https://www.redfin.com/](https://www.redfin.com/)
+

http://git-wip-us.apache.org/repos/asf/samza/blob/66d6c5d0/docs/_case-studies/tripadvisor.md
----------------------------------------------------------------------
diff --git a/docs/_case-studies/tripadvisor.md 
b/docs/_case-studies/tripadvisor.md
new file mode 100644
index 0000000..1d47ede
--- /dev/null
+++ b/docs/_case-studies/tripadvisor.md
@@ -0,0 +1,71 @@
+---
+layout: case-study
+hide_title: true # so we have control in case-study layout, but can still use 
page
+title: Hedwig - Converting Hadoop M/R ETL systems to Stream Processing
+study_domain: tripadvisor.com
+menu_title: TripAdvisor
+excerpt_separator: <!--more-->
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+Hedwig - Converting Hadoop M/R ETL systems to Stream Processing
+
+<!--more-->
+
+TripAdvisor is one of the worldâs largest travel website that provides hotel 
+and restaurant reviews, accommodation bookings and other travel-related 
+content. It produces and processes billions events processed everyday 
+including billing records, reports, monitoring events and application 
+notifications.
+
+Prior to migrating to Samza, TripAdvisor used Hadoop to ETL its data. Raw 
+data was rolled up to hourly and daily in a number of stages with joins 
+and sliding windows applied, session data is then produced from daily data. 
+About 300 million sessions are produced daily. With this solution, the 
+engineering team were faced with a few challenges
+  
+-   Long lag time to downstream that is business critical
+-   Difficult to debug and troubleshoot due to scripts, environments, etc.
+-   Adding more nodes doesnât help to scale
+ 
+The engineering team at TripAdvisor decided to replace the Hadoop solution 
+with a multi-stage Samza pipeline. 
+
+![Samza pipeline at TripAdvisor](/img/case-studies/trip-advisor.svg)
+
+In the new solution, after raw data is first collected by Flume and ingested 
+through a Kafka cluster, they are parsed, cleansed and partitioned by the
+Lookback Router; then processing logic such as windowing, grouping, joining, 
+fraud detection are applied by the Session Collector and the Fraud Collector, 
+RocksDB is used as the local store for intermediate states; finally the 
Uploader 
+uploads results to HDFS, ElasticSearch, RedShift and Hive. 
+
+The new solution achieved significant improvements:
+
+-   Processing time is reduced from 3 hours to 1 hour
+-   Individual stages in the pipeline are scaled independently
+-   Overall hardware requirement is reduced to â thanks to optimized usage
+-   Much simpler to debug and test the solution
+ 
+Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*
+
+More information
+
+-   
[https://www.youtube.com/watch?v=KQ5OnL2hMBY](https://www.youtube.com/watch?v=KQ5OnL2hMBY)
+-   [https://www.tripadvisor.com/](https://www.tripadvisor.com/)
+    
\ No newline at end of file

[3/3] samza git commit: SAMZA-1905: Added case study for eBay, Optimizely, Redfin and TripAdvisor

Reply via email to