How PayPal Serves 350 Billion Daily Requests with JunoDB

ByteByteGo Tue, 16 Apr 2024 08:33:08 -0700

View this post on the web at 
https://blog.bytebytego.com/p/how-paypal-serves-350-billion-daily

Stop releasing bugs with fully automated end-to-end test coverage (Sponsored) [
https://substack.com/redirect/42bea5da-03c2-4479-a7e3-2df9a9ec5af5?j=eyJ1IjoiMXBxZnM3In0.Bn7pYfJL44qVpLsyOfEHZjPX-L54Pv8N6hc-GQqMSDU
]
Bugs sneak out when less than 80% of user flows are tested [
https://substack.com/redirect/42bea5da-03c2-4479-a7e3-2df9a9ec5af5?j=eyJ1IjoiMXBxZnM3In0.Bn7pYfJL44qVpLsyOfEHZjPX-L54Pv8N6hc-GQqMSDU
] before shipping. But how do you get that kind of coverage? You either spend
years scaling in-house QA — or you get there in just 4 months with QA Wolf [
https://substack.com/redirect/42bea5da-03c2-4479-a7e3-2df9a9ec5af5?j=eyJ1IjoiMXBxZnM3In0.Bn7pYfJL44qVpLsyOfEHZjPX-L54Pv8N6hc-GQqMSDU
].
How's QA Wolf different?
They don't charge hourly.
They guarantee results.
They provide all of the tooling and (parallel run) infrastructure needed to run
a 15-minute QA cycle.
Have you ever seen a database that fails and comes up again in the blink of an
eye?
PayPal’s JunoDB is a database capable of doing so. As per PayPal’s claim,
JunoDB can run at 6 nines of availability (99.9999%). This comes to just 86.40
milliseconds of downtime per day.
For reference, our average eye blink takes around 100-150 milliseconds.
While the statistics are certainly amazing, it also means that there are many
interesting things to pick up from JunoDB’s architecture and design.
In this post, we will cover the following topics:
JunoDB’s Architecture Breakdown
How JunoDB achieves scalability, availability, performance, and security
Use cases of JunoDB
Key Facts about JunoDB
Before we go further, here are some key facts about JunoDB that can help us
develop a better understanding of it.
JunoDB is a distributed key-value store. Think of a key-value store as a
dictionary where you look up a word (the “key”) to find its definition (the
“value”).
JunoDB leverages a highly concurrent architecture implemented in Go to
efficiently handle hundreds of thousands of connections.
At PayPal, JunoDB serves almost 350 billion daily requests and is used in every
core backend service, including critical functionalities like login, risk
management, and transaction processing.
PayPal primarily uses JunoDB for caching to reduce the load on the main
source-of-truth database. However, there are also other use cases that we will
discuss in a later section.
The diagram shows how JunoDB fits into the overall scheme of things at PayPal.
Why the Need for JunoDB?
One common question surrounding the creation of something like JunoDB is this:
“Why couldn’t PayPal just use something off-the-shelf like Redis?”
The reason is PayPal wanted multi-core support for the database and Redis is
not designed to benefit from multiple CPU cores. It is single-threaded in
nature and utilizes only one core. Typically, you need to launch several Redis
instances to scale out on several cores if needed.
Incidentally, JunoDB started as a single-threaded C++ program and the initial
goal was to use it as an in-memory short TTL data store.
For reference, TTL stands for Time to Live. It specifies the maximum duration a
piece of data should be retained or the maximum time it is considered valid.
However, the goals for JunoDB evolved with time.
First, PayPal wanted JunoDB to work as a persistent data store supporting long
TTLs.
Second, JunoDB was also expected to provide improved data security via on-disk
encryption and TLS in transit by default.
These goals meant that JunoDB had to be CPU-bound rather than memory-bound.
For reference, “memory-bound” and “CPU-bound” refer to different performance
aspects in computer programs. As the name suggests, the performance of
memory-bound programs is limited by the amount of available memory. On the
other hand, CPU-bound programs depend on the processing power of the CPU.
For example, Redis is memory-bound. It primarily stores the data in RAM and
everything about it is optimized for quick in-memory access. The limiting
factor for the performance of Redis is memory rather than CPU.
However, requirements like encryption are CPU-intensive because many
cryptographic algorithms require raw processing power to carry out complex
mathematical calculations.
As a result, PayPal decided to rewrite the earlier version of JunoDB in Go to
make it multi-core friendly and support high concurrency.
The Architecture of JunoDB
The below diagram shows the high-level architecture of JunoDB.
Let’s look at the main components of the overall design.
1 - JunoDB Client Library
The client library is part of the client application and provides an API for
storing and retrieving data via the JunoDB proxy.
It is implemented in several programming languages such as Java, C++, Python,
and Golang to make it easy to use across different application stacks.
For developers, it’s just a matter of picking the library for their respective
programming language and including it in the application to carry out the
various operations.
2 - JunoDB Proxy with Load Balancer
JunoDB utilizes a proxy-based design where the proxy connects to all JunoDB
storage server instances.
This design has a few important advantages:
The complexity of determining which storage server should handle a query is
kept out of the client libraries. Since JunoDB is a distributed data store, the
data is spread across multiple servers. The proxy handles the job of directing
the requests to the correct server.
The proxy is also aware of the JunoDB cluster configuration (such as shard
mappings) stored in the ETCD key-value store.
But can the JunoDB proxy turn into a single point of failure?
To prevent this possibility, the proxy runs on multiple instances downstream to
a load balancer. The load balancer receives incoming requests from the client
applications and routes the requests to the appropriate proxy instance.
3 - JunoDB Storage Servers
The last major component in the JunoDB architecture is the storage servers.
These are instances that accept the operation requests from the proxy and store
data in the memory or persistent storage.
Each storage server is responsible for a set of partitions or shards for an
efficient distribution of data.
Internally, JunoDB uses RocksDB as the storage engine. Using an off-the-shelf
storage engine like RocksDB is common in the database world to avoid building
everything from the ground up. For reference, RocksDB is an embedded key-value
storage engine that is optimized for high read and write throughput.
Key Priorities of JunoDB
Now that we have looked at the overall design and architecture of JunoDB, it’s
time to understand a few key priorities for JunoDB and how it achieves them.
Scalability
Several years ago, PayPal transitioned to a horizontally scalable
microservice-based architecture to support the rapid growth in active customers
and payment rates.
While microservices solve many problems for them, they also have some drawbacks.
One important drawback is the increased number of persistent connections to
key-value stores due to scaling out the application tier. JunoDB handles this
scaling requirement in two primary ways.
1 - Scaling for Client Connections
As discussed earlier, JunoDB uses a proxy-based architecture.
If client connections to the database reach a limit, additional proxies can be
added to support more connections.
There is an acceptable trade-off with latency in this case.
2 - Scaling for Data Volume and Throughput
The second type of scaling requirement is related to the growth in data size.
To ensure efficient storage and data fetching, JunoDB supports partitioning
based on the consistent hashing algorithm. Partitions (or shards) are
distributed to physical storage nodes using a shard map.
Consistent hashing is very useful in this case because when the nodes in a
cluster change due to additions or removals, only a minimal number of shards
require reassignment to different storage nodes.
PayPal uses a fixed number of shards (1024 shards, to be precise), and the
shard map is pre-generated and stored in ETCD storage.
Any change to the shard mapping triggers an automatic data redistribution
process, making it easy to scale your JunoDB cluster depending on the need.
The below diagram shows the process in more detail.
Availability
High availability is critical for PayPal. You can’t have a global payment
platform going down without a big loss of reputation.
However, outages can and will occur due to various reasons such as software
bugs, hardware failures, power outages, and even human error. Failures can lead
to data loss, slow response times, or complete unavailability.
To mitigate these challenges, JunoDB relies on replication and failover
strategies.
1 - Within-Cluster Replication
In a cluster, JunoDB storage nodes are logically organized into a grid. Each
column represents a zone, and each row signifies a storage group.
Data is partitioned into shards and assigned to storage groups. Within a
storage group, each shard is synchronously replicated across various zones
based on the quorum protocol.
The quorum-based protocol is the key to reaching a consensus on a value within
a distributed database. You’ve two quorums:
The Read Quorum: When a client wants to read data, it needs to receive
responses from a certain number of zones (known as the read quorum). This is to
make sure that it gets the most up-to-date data.
The Write Quorum: When the client wants to write data, it must receive
acknowledgment from a certain number of zones to make sure that the data is
written to a majority of the zones.
There are two important rules when it comes to quorum.
The sum of the read quorum and write quorum must be greater than the number of
zones. If that’s not the case, the client may end up reading outdated data. For
example, if there are 5 zones with read quorum as 2 and write quorum as 3, a
client can write data to 3 zones but another client may read from the 2 zones
that have not yet received the updated data.
The write quorum must be more than half the number of zones to prevent two
concurrent write operations on the same key. For example, if there is a JunoDB
cluster with 5 zones and a write quorum of 2, client A may write value X to key
K and is considered successful when 2 zones acknowledge the request. Similarly,
client B may write value Y to the same key K and is also successful when two
different zones acknowledge the request. Ultimately, the data for key K is in
an inconsistent state.
In production, PayPal has a configuration with 5 zones, a read quorum of 3, and
a write quorum of 3.
Lastly, the failover process in JunoDB is automatic and instantaneous without
any need for leader re-election or data redistribution. Proxies can know about
a node failure through a lost connection or a read request that has timed out.
2 - Cross-data center replication
Cross-data center replication is implemented by asynchronously replicating data
between the proxies of each cluster across different data centers.
This is important to make sure that the system continues to operate even if
there’s a catastrophic failure at one data center.
Performance
One of the critical goals of JunoDB is to deliver high performance at scale.
This translates to maintaining single-digit millisecond response times while
providing a great user experience.
The below graphs shared by PayPal show the benchmark results demonstrating
JunoDB’s performance in the case of persistent connections and high throughput.
Security
Being a trusted payment processor, security is paramount for PayPal.
Therefore, it’s no surprise that JunoDB has been designed to secure data both
in transit and at rest.
For transmission security, TLS is enabled between the client and proxy as well
as proxies in different data centers used for replication.
Payload encryption is performed at the client or proxy level to prevent
multiple encryptions of the same data. The ideal approach is to encrypt the
data on the client side but if it’s not done, the proxy figures it out through
a metadata flag and carries out the encryption.
All data received by the storage server and stored in the engine are also
encrypted to maintain security at rest.
A key management module is used to manage certificates for TLS, sessions, and
the distribution of encryption keys to facilitate key rotation,
The below diagram shows JunoDB’s security setup in more detail.
Use Cases of JunoDB
With PayPal having made JunoDB open-source, it’s possible that you can also use
it within your projects.
There are various use cases where JunoDB can help. Let’s look at a few
important ones:
1 - Caching
You can use JunoDB as a temporary cache to store data that doesn’t change
frequently.
Since JunoDB supports both short and long-lived TTLs, you can store data from a
few seconds to a few days. For example, a use case is to store short-lived
tokens in JunoDB instead of fetching them from the database.
Other items you can cache in JunoDB are user preferences, account details, and
API responses.
2 - Idempotency
You can also use JunoDB to implement idempotency.
An operation is idempotent when it produces the same result even when applied
multiple times. With idempotency, repeating the operation is safe and you don’t
need to be worried about things like duplicate payments getting applied.
PayPal uses JunoDB to ensure they don’t process a particular payment multiple
times due to retries. JunoDB’s high availability makes it an ideal data store
to keep track of processing details without overloading the main database.
3 - Counters
Let’s say you’ve certain resources that aren’t available for some reason or
they have an access limit to their usage. For example, these resources can be
database connections, API rate limits, or user authentication attempts.
You can use JunoDB to store counters for these resources and track whether
their usage exceeds the threshold.
4 - Latency Bridging
As we discussed earlier, JunoDB provides fast inter-cluster replication. This
can help you deal with slow replication in a more traditional setup.
For example, in PayPal’s case, they run Oracle in Active-Active mode, but the
replication usually isn’t as fast as they would like for their requirement.
It means there are chances of inconsistent reads if records written in one data
center are not replicated in the second data center and the first data center
goes down.
JunoDB can help bridge the latency where you can write to Data Center A (both
Oracle and JunoDB) and even if it goes down, you can read the updates
consistently from the JunoDB instance in Data Center B.
See the below diagram for a better understanding of this concept.
Conclusion
JunoDB is a distributed key-value store playing a crucial role in various
PayPal applications. It provides efficient data storage for fast access to
reduce the load on costly database solutions.
While doing so, it also fulfills critical requirements such as scalability,
high availability with performance, consistency, and security.
Due to its advantages, PayPal has started using JunoDB in multiple use cases
and patterns. For us, it provides a great opportunity to learn about an
exciting new database system.
References:
Unlocking the Power of JunoDB: PayPal’s Key-Value Store Goes Open-Source [
https://substack.com/redirect/6729a7ef-3797-4602-8ce1-92feeda906b7?j=eyJ1IjoiMXBxZnM3In0.Bn7pYfJL44qVpLsyOfEHZjPX-L54Pv8N6hc-GQqMSDU
]
JunoDB: PayPal Open Sources Key-Value Store [
https://substack.com/redirect/2ca64a99-64d6-41f6-8153-72297c205454?j=eyJ1IjoiMXBxZnM3In0.Bn7pYfJL44qVpLsyOfEHZjPX-L54Pv8N6hc-GQqMSDU
]
Redis Benchmark: Pitfalls and misconceptions [
https://substack.com/redirect/2a036e60-3c22-459b-a159-d5aa6d668864?j=eyJ1IjoiMXBxZnM3In0.Bn7pYfJL44qVpLsyOfEHZjPX-L54Pv8N6hc-GQqMSDU
]
Memory Bound Function [
https://substack.com/redirect/3d722ca1-ed0e-4265-8cc0-878524ba082c?j=eyJ1IjoiMXBxZnM3In0.Bn7pYfJL44qVpLsyOfEHZjPX-L54Pv8N6hc-GQqMSDU
]
High Availability [
https://substack.com/redirect/bf563879-2d71-453a-9131-69b81cefdbc4?j=eyJ1IjoiMXBxZnM3In0.Bn7pYfJL44qVpLsyOfEHZjPX-L54Pv8N6hc-GQqMSDU
]

Unsubscribe
https://substack.com/redirect/2/eyJlIjoiaHR0cHM6Ly9ibG9nLmJ5dGVieXRlZ28uY29tL2FjdGlvbi9kaXNhYmxlX2VtYWlsP3Rva2VuPWV5SjFjMlZ5WDJsa0lqb3hNRE0yT1RBd09EY3NJbkJ2YzNSZmFXUWlPakUwTXpZeU1qTTVOQ3dpYVdGMElqb3hOekV6TWpneE5UYzNMQ0psZUhBaU9qRTNNVFU0TnpNMU56Y3NJbWx6Y3lJNkluQjFZaTA0TVRjeE16SWlMQ0p6ZFdJaU9pSmthWE5oWW14bFgyVnRZV2xzSW4wLnhMYlNPNG1WWDZCS21HMzl6NU9CSWxSN0JjNU1MLW5aVm5TMDJpUXMzd0kmZXhwaXJlcz0zNjVkIiwicCI6MTQzNjIyMzk0LCJzIjo4MTcxMzIsImYiOnRydWUsInUiOjEwMzY5MDA4NywiaWF0IjoxNzEzMjgxNTc3LCJleHAiOjE3MTU4NzM1NzcsImlzcyI6InB1Yi0wIiwic3ViIjoibGluay1yZWRpcmVjdCJ9.vzSHE_iwTtP9npFmnCSxkNWN8kYwgaEJHM4bkcXEfWM?

How PayPal Serves 350 Billion Daily Requests with JunoDB

Reply via email to