I wrote:

> The human brain has 86 billion neurons, all operating simultaneously. In
> other words, complete parallel processing with 86 billion "processors"
> operating simultaneously. ChatGPT tells us she has 175 billion
> parameters in Version 3. I assume each parameter is roughly equivalent to a
> neuron.

Wrong! I got that wrong. Each parameter is roughly equivalent to a neuron,
but each neuron in the human brain has roughly 7,000 synapses:

Each neuron has, on average, about 7,000 synaptic connections with other
neurons. That puts the synapse count in the neighborhood of 600 trillion.
In young children, before synaptic pruning begins in earnest, the estimated
number reaches as high as 1 quadrillion.

https://www.jax.org/news-and-insights/jax-blog/2018/december/600-trillion-synapses-and-alzheimers-disease
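
As a rough check on that arithmetic, here is a minimal sketch that uses only
the figures quoted above. The ratio against GPT-3's parameter count is only
illustrative; it does not assume that a synapse and a parameter are
equivalent.

```python
# Figures taken from the text above; nothing here is measured.
NEURONS = 86_000_000_000             # neurons in the human brain
SYNAPSES_PER_NEURON = 7_000          # average synaptic connections per neuron
GPT3_PARAMETERS = 175_000_000_000    # parameters in the largest GPT-3

# Total synapses implied by the two brain figures.
adult_synapses = NEURONS * SYNAPSES_PER_NEURON

# How many synapses there are for every GPT-3 parameter.
ratio = adult_synapses / GPT3_PARAMETERS

print(f"Estimated synapses: {adult_synapses:,}")
print(f"Synapses per GPT-3 parameter: {ratio:,.0f}")
```

The product comes to about 602 trillion, which matches the "neighborhood of
600 trillion" in the quoted passage.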

ChatGPT tells me that each parameter is stored as a 16- or 32-bit
floating-point number. That is roughly equivalent to having 32 synapses per
neuron. The human neuron can have a much wider range of values. I don't
know how that affects the model, but to me the human neuron looks more like
a quantum computer qubit than a 32-bit floating-point number. Presumably,
brains evolved to the optimum number of neurons and synapses.

So, ChatGPT Ver. 3 was smaller than the human brain in terms of the range
of values the ANN can hold. Granted that comparing it to a brain is
comparing apples to oranges, there is probably some advantage to a much
larger range of values per neuron. A super-brain would need a lot more
synapse equivalents compared to a human brain, so we have a long way to go
before reaching something like a million times 1 quadrillion artificial
synapses. Plus, of course, it would need programming methods not yet invented.

ChatGPT tells me that with the specialized hardware provided by OpenAI, all
175 billion parameters are loaded into RAM at the same time. With other
configurations they are overlaid. The speed of MPP (massively parallel
processing) operation depends on whether the parameters are loaded or
overlaid and, I think more importantly, on whether the operation is done
with specialized hardware (GPUs and Google's TPUs) or emulated with
ordinary computer hardware.

I asked ChatGPT for some of these details. Here is our exchange:


ME: You told me that the largest version of GPT-3 had 175 billion
parameters. Is each parameter stored as 1 byte? Or some other number of
bits?

ChatGPT: No, each parameter in GPT-3 is not stored as 1 byte. Instead, each
parameter is typically stored as a 16-bit or 32-bit floating-point number,
depending on the implementation. This is because floating-point numbers can
represent a wider range of values and provide greater precision than
integers.

For example, GPT-3's 175 billion parameters would require around 700 GB of
memory if each parameter was stored as a 32-bit floating-point number. If
each parameter was stored as a 16-bit floating-point number, it would
require around 350 GB of memory.

It's worth noting that the actual memory usage of GPT-3 may be different
due to various optimizations and compression techniques used by the
implementation.
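
(An aside, not part of the exchange: the memory figures in that answer can
be reproduced with a few lines. This is a minimal sketch that counts only
the raw parameter storage and uses decimal gigabytes, 1 GB = 10^9 bytes;
the function name is just for illustration.)

```python
PARAMETERS = 175_000_000_000  # largest GPT-3, per the exchange above

def model_size_gb(parameters: int, bits_per_parameter: int) -> float:
    """Raw storage for the parameters alone, in decimal gigabytes."""
    total_bytes = parameters * bits_per_parameter // 8
    return total_bytes / 1e9

for bits in (32, 16):
    print(f"{bits}-bit floats: {model_size_gb(PARAMETERS, bits):,.0f} GB")
```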

ME: With the 32-bit floating point number configuration, I assume all 700
GB are online in RAM. Not overlaid. Correct?

ChatGPT: No, it's not necessary for all 700 GB of GPT-3's parameters to be
stored in RAM at the same time. In fact, it's not practical to store such a
large model entirely in memory on most machines.

Most deep learning frameworks use a technique called "memory-mapping" to
allow the model parameters to be loaded and unloaded from disk as needed
during training or inference. Memory-mapping allows the model parameters to
be accessed as if they were in memory, even though they are stored on disk.

In practice, the memory usage of GPT-3 will depend on the specific
implementation and hardware configuration. Some implementations may use
techniques like model parallelism or gradient checkpointing to reduce
memory usage, while others may rely on larger hardware configurations with
more memory.

ME: I meant in the "specialized computing infrastructure provided by
OpenAI" that you described earlier. Is that all on-line?

ChatGPT: Ah, I see what you're asking now. Yes, in the specialized
computing infrastructure provided by OpenAI, the GPT-3 model with 175
billion parameters is fully loaded into the RAM of the computing cluster
during operation. The computing cluster used by OpenAI is composed of many
machines with large amounts of RAM, and the model parameters are
distributed across those machines.

However, it's worth noting that not all of the machines in the cluster will
necessarily have a full copy of the model parameters in memory at all
times. Instead, the model parameters may be partitioned across the machines
and loaded into memory as needed for different parts of the computation.
This approach allows the computing cluster to handle the large memory
requirements of the GPT-3 model while also being scalable and efficient.
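
(An aside, not part of the exchange: a rough sketch of what partitioning
the parameters across machines might look like. The 80 GB of memory per
device is a hypothetical figure chosen for illustration, not anything known
about OpenAI's hardware, and the calculation ignores everything except raw
parameter storage.)

```python
def min_devices(model_bytes: int, device_bytes: int) -> int:
    """Smallest number of devices whose combined memory holds the model."""
    return (model_bytes + device_bytes - 1) // device_bytes  # integer ceiling

def shard_ranges(n_parameters: int, n_devices: int) -> list[tuple[int, int]]:
    """Split parameter indices into contiguous, near-equal [start, end) ranges."""
    base, extra = divmod(n_parameters, n_devices)
    ranges = []
    start = 0
    for i in range(n_devices):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more
        ranges.append((start, start + size))
        start += size
    return ranges

MODEL_BYTES = 700_000_000_000   # 175 billion parameters at 32 bits each
DEVICE_BYTES = 80_000_000_000   # hypothetical memory per device

devices = min_devices(MODEL_BYTES, DEVICE_BYTES)
print(f"At least {devices} devices needed")
print(shard_ranges(175_000_000_000, devices)[:2])  # first two shards
```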

ME: I realize that memory mapping is very rapid with today's peripherals,
so even an overlay configuration will be rapid.

ChatGPT: Yes, memory mapping can be very fast with modern storage devices
and is commonly used to efficiently access large datasets that cannot fit
entirely in memory. With memory mapping, the operating system maps a
portion of a file into the virtual address space of a process, allowing the
process to access the file's contents as if they were in memory.

However, it's important to note that memory-mapped I/O does have some
overhead, such as page faults and disk I/O. Page faults can occur if the
requested memory page is not in physical memory and must be fetched from
disk, while disk I/O can occur if the requested page has not yet been read
into the page cache. These overheads can be minimized through various
techniques such as preloading data into the page cache, using large pages,
and optimizing disk access patterns.

In any case, the actual performance of memory-mapped I/O depends on the
specific hardware and software configuration, as well as the
characteristics of the workload being performed.
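
(An aside, not part of the exchange: a minimal sketch of memory mapping
using Python's standard mmap module. It writes a small file of 32-bit
floats standing in for model parameters, then reads one value through the
mapping without loading the whole file. The file layout and the helper name
read_parameter are made up for illustration; real frameworks use their own
formats.)

```python
import mmap
import os
import struct
import tempfile

FLOAT32 = struct.Struct("<f")   # one little-endian 32-bit float (4 bytes)
N_PARAMETERS = 1000             # tiny stand-in for a real model

# Write the stand-in "parameters" to a temporary file on disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    for i in range(N_PARAMETERS):
        f.write(FLOAT32.pack(i * 0.5))

def read_parameter(path: str, index: int) -> float:
    """Read one parameter through a read-only memory map of the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = index * FLOAT32.size
            # Slicing touches only the pages that hold these 4 bytes.
            (value,) = FLOAT32.unpack(mm[offset:offset + FLOAT32.size])
            return value

value = read_parameter(path, 123)
print(value)
os.remove(path)
```

The operating system fetches pages from disk only when the mapped bytes are
touched, which is the page-fault behavior described in the answer above.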

ME: One more question about this. I read that the "specialized computing
infrastructure provided by OpenAI" is an MPP configuration. Approximately
how many processors are there? How many processors per parameter? I guess
one GPU or TPU is equivalent to one CPU processor in a traditional MPP
configuration.

ChatGPT: I'm sorry, but I don't have access to detailed information about
the specific hardware and configuration of OpenAI's computing
infrastructure, so I can't provide specific numbers about the number of
processors or GPUs/TPUs used per parameter.

However, I can provide some general information about MPP (Massively
Parallel Processing) configurations and their relationship to GPUs/TPUs.

In an MPP configuration, the processing power of a system is distributed
across multiple processors or nodes, allowing for massive parallelism and
high throughput. This is typically achieved through the use of specialized
hardware and software, such as high-speed interconnects and parallel file
systems.

In the context of deep learning, GPUs and TPUs are often used as the
primary processing units in MPP configurations due to their ability to
perform large-scale matrix operations in parallel. GPUs and TPUs are
designed with a large number of processing cores that can perform these
operations simultaneously, making them well-suited for deep learning
workloads.

The number of processors or GPUs/TPUs used per parameter will depend on
various factors, such as the size of the model, the complexity of the
workload, and the hardware configuration. In general, larger models and
more complex workloads will require more processing power per parameter,
while smaller models and simpler workloads may be able to make do with less.

ME: So, a larger model actually needs a higher ratio of GPU or TPU units
per parameter? A model is "larger" in the sense that it does more
processing for each parameter, not that it has more parameters. That is
interesting.

ChatGPT: Yes, you're correct. The number of processing units, whether
CPUs, GPUs or TPUs, required per parameter in a deep learning model depends
not only on the number of parameters but also on the complexity of the
computation required for each parameter.

Larger models, in terms of the computation required for each parameter,
will require more processing units per parameter, as the computation
becomes more computationally intensive. This is especially true for models
that require more complex operations, such as recurrent neural networks,
convolutional neural networks, and transformers like GPT-3.

In general, the number of processing units required per parameter is
proportional to the computational requirements of the model. Therefore,
larger models require a higher ratio of processing units per parameter,
which means that more powerful hardware configurations, such as clusters of
GPUs or TPUs, are needed to efficiently train or inference these models.
