Never mind. I found my stupid mistake: I didn’t reset a variable…this fact had 
escaped me for the past two days.
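
As an illustration of the kind of slip this was (hypothetical code only, not my 
actual benchmark), forgetting to re-zero an accumulator between timed runs is 
enough to make reported throughput look impossible:

#include <stdio.h>

/* Hypothetical un-reset variable, NOT the real benchmark: total_bytes keeps
 * growing across runs, so every run after the first reports bytes it never
 * actually read and the computed GB/s figure explodes. */
int main(void)
{
    long long total_bytes = 0;
    for (int run = 0; run < 3; run++) {
        /* total_bytes = 0;   <-- the easy-to-forget reset */
        for (int i = 0; i < 5000; i++)
            total_bytes += 70LL * 1000 * 1000;   /* pretend 70MB per file */

        double seconds = 150.0;                  /* pretend elapsed time */
        printf("run %d: %.2f GB/s\n",
               run, total_bytes / seconds / 1e9);
    }
    return 0;
}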

From: "Avery, John" <jav...@akamai.com>
Date: Wednesday, December 27, 2017 at 4:20 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Help me understand hadoop caching behavior

I’m writing a program using the C API for Hadoop. I have a 4-node cluster. 
(The cluster was set up according to 
https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm)
Of the 4 nodes, one is the namenode and also a datanode; the others are 
datanodes (one of which is also the secondary namenode).

I’ve already managed to write about 1.5TB of data to the cluster. My issue is 
reading the data back: specifically, it’s too fast. *Way* too fast, and I don’t 
understand how or why. The 1.5TB is stored in the form of about 20,000 files of 
60-80MB each. When I read the files back (7 files in parallel) I get read 
speeds in excess of 75GB/s. Obviously this is DRAM speed, and here’s the 
problem…each of the 4 nodes only has 32GB of RAM, and I’m asking Hadoop to 
re-read over 400GB of data. I am actually using the read-back data, so it isn’t 
the compiler optimizing something out; even with optimization flags turned off, 
it still runs 10x faster than the network/disks attached to this box can 
deliver.

Specifically: 2x10Gb network ports, bonded; maximum network input 2.5GB/s 
(test verified).
16x 4TB hard drives: 2GB/s maximum throughput (test verified, outside of 
Hadoop).

As for how I’m reading my data, hdfsOpenFile(…,O_RDONLY) and hdfsRead().
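
Roughly, each file is read in a loop like the following (a simplified sketch 
rather than my exact code; the buffer size and path are placeholders):

#include <fcntl.h>   /* O_RDONLY */
#include <stdio.h>
#include <stdlib.h>
#include "hdfs.h"    /* libhdfs C API */

/* Read one HDFS file end-to-end; returns bytes read, or -1 on error. */
static long long read_whole_file(hdfsFS fs, const char *path)
{
    hdfsFile f = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
    if (!f) {
        fprintf(stderr, "hdfsOpenFile(%s) failed\n", path);
        return -1;
    }

    const tSize bufsz = 1 << 20;          /* 1MB read buffer (arbitrary) */
    char *buf = malloc(bufsz);
    if (!buf) {
        hdfsCloseFile(fs, f);
        return -1;
    }

    long long total = 0;
    tSize n;
    while ((n = hdfsRead(fs, f, buf, bufsz)) > 0) {
        total += n;                       /* the data gets consumed/verified here */
    }

    free(buf);
    hdfsCloseFile(fs, f);
    return (n < 0) ? -1 : total;
}

int main(void)
{
    /* "default" picks the namenode up from the client configuration. */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) {
        fprintf(stderr, "hdfsConnect failed\n");
        return 1;
    }

    /* Placeholder path, not one of the actual benchmark files. */
    long long got = read_whole_file(fs, "/benchmark/part-00000");
    printf("read %lld bytes\n", got);

    hdfsDisconnect(fs);
    return 0;
}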

So, at best, I should get 4.5GB/s (2.5GB/s network + 2GB/s disk), and that’s in 
a perfect world. But during my tests I see no network traffic and very little 
(~30-70MB/s) disk IO. Yet it manages to return 300GB of unique data to me (the 
data is real, not a pattern, and not something particularly compressible or 
dedupable).

I’m at a complete loss: how is 300GB of data getting sent to me so quickly?! 
I feel like I’m overlooking something trivial…I’m specifically asking for 10x 
the system’s memory (and over 2x the cluster’s memory!) precisely to *prevent* 
caching from polluting my numbers, yet it’s doing something that should be 
impossible. I fully expect to facepalm at the end of this.

Oh, and here’s the really weird part (to me). If I request all 20,000 files, it 
zooms past the 5000 I have cached from my 400GB read test and then slows down 
to a more realistic 2GB/s for the rest of the files. That is, until I re-run 
the program a second time…then it returns a result in something like 35 seconds 
instead of 5 minutes. !!!
