Good stuff! Thanks for sharing! One issue I have found with the
short-circuit reads:

https://issues.apache.org/jira/browse/HBASE-8143

Basically you need to turn the buffer size down. The hdfs property is:

dfs.client.read.shortcircuit.buffer.size

Aaron
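For reference, a minimal hdfs-site.xml sketch pulling together the
short-circuit properties discussed in this thread. The values are
illustrative starting points rather than tuned recommendations, and the
domain socket path is an assumption:

    <!-- Illustrative hdfs-site.xml fragment; values are examples,
         not recommendations. -->
    <configuration>
      <!-- Enable short-circuit local reads; requires a datanode
           domain socket. The path below is an assumed example. -->
      <property>
        <name>dfs.client.read.shortcircuit</name>
        <value>true</value>
      </property>
      <property>
        <name>dfs.domain.socket.path</name>
        <value>/var/lib/hadoop-hdfs/dn_socket</value>
      </property>
      <!-- Per-stream read buffer. The default is 1 MB; HBASE-8143 is
           about many cached streams exhausting memory at that size,
           hence "turning it down" (128 KB here as an example). -->
      <property>
        <name>dfs.client.read.shortcircuit.buffer.size</name>
        <value>131072</value>
      </property>
      <!-- Max number of cached short-circuit streams (default 256). -->
      <property>
        <name>dfs.client.read.shortcircuit.streams.cache.size</name>
        <value>256</value>
      </property>
      <!-- Idle expiry for cached streams, in ms (default 300000). -->
      <property>
        <name>dfs.client.read.shortcircuit.streams.cache.expiry.ms</name>
        <value>300000</value>
      </property>
    </configuration>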
On Mon, Sep 14, 2015 at 6:42 AM, Ravikumar Govindarajan <
[email protected]> wrote:

> Finally we are done with testing with short-circuit reads and the
> SSD_One policy. Summarizing a few crucial points we observed during
> query-runs:
>
> 1. A single read issued by the hadoop-client takes on average
>    0.15-0.25 ms for a 32KB byte-size. Sometimes this could be on the
>    higher side, like 0.6-0.65 ms per read… Actual SSD latencies from
>    iostat were around 0.1 ms, with spikes of 0.6 ms.
>
> 2. The overhead of the hadoop wrapper code involved in SSD-reads is
>    minimal & negligible. However, we tested with a single thread;
>    when multiple threads are involved during queries, hadoop could
>    be a spoiler.
>
> 3. It still makes sense to retain the block-cache. Assuming a bad
>    query makes about 1000 trips to hadoop, time consumed ~=
>    0.15 * 1000 = 150 ms. The block-cache could play a crucial role
>    here. It could also help in resolving multi-threaded accesses.
>
> 4. Segment writes/merges are actually slower than on HDD, maybe
>    because of sequential reads…
>
> Overall, we found good gains, especially for queries using
> short-circuit reads combined with the block-cache.
>
> --
> Ravi
>
> On Wed, Aug 12, 2015 at 6:34 PM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > Our very basic testing with the SSD_One policy works as expected.
> > Now we are moving on to test the efficiency of SSD reads via
> > hadoop.
> >
> > I see numerous params that need to be set up for hadoop
> > short-circuit reads, as documented here…
> >
> > http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.7/bk_system-admin-guide/content/ch_short-circuit-reads-hdfs.html
> >
> > For production workloads, are there any standard configs for blur?
> >
> > Especially the following params:
> >
> > 1. dfs.client.read.shortcircuit.streams.cache.size
> >
> > 2. dfs.client.read.shortcircuit.streams.cache.expiry.ms
> >
> > 3. dfs.client.read.shortcircuit.buffer.size
> >
> > On Tue, Aug 11, 2015 at 6:13 PM, Aaron McCurry <[email protected]>
> > wrote:
> >
> > > That is awesome! Let me know your results when you get a chance.
> > >
> > > Aaron
> > >
> > > On Mon, Aug 10, 2015 at 9:21 AM, Ravikumar Govindarajan <
> > > [email protected]> wrote:
> > >
> > > > Hadoop 2.7.1 is out and now handles mixed storage… A single
> > > > data-node/shard-server can run HDDs & SSDs together…
> > > >
> > > > More about this here…
> > > >
> > > > http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
> > > >
> > > > The policy I looked for was "SSD_One". The first copy of
> > > > index-data, placed on the local machine, will be stored on SSD.
> > > > The second & third copies, stored on other machines, will be on
> > > > HDDs…
> > > >
> > > > This eliminates the need for the mixed setup using RACK1 &
> > > > RACK2 I previously thought of. Hadoop 2.7.1 helps me achieve
> > > > this in a single cluster of machines running data-nodes +
> > > > shard-servers.
> > > >
> > > > Every machine stores the primary copy on SSDs. Writes, searches
> > > > and merges all take advantage of it, while replication can be
> > > > relegated to slower but bigger-capacity HDDs. These HDDs also
> > > > serve as an online backup of the less fault-tolerant SSDs.
> > > >
> > > > We have ported our in-house blur extension to hadoop-2.7.1.
> > > > Will update on test results shortly.
> > > >
> > > > --
> > > > Ravi
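The policy described above is spelled ONE_SSD (also written One_SSD) in
the HDFS ArchivalStorage docs; "SSD_One" in this thread refers to the
same thing. A minimal sketch of applying it, assuming a hypothetical
/blur/tables directory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class ApplyOneSsdPolicy {
      public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the HDFS namenode.
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs instanceof DistributedFileSystem) {
          DistributedFileSystem dfs = (DistributedFileSystem) fs;
          // ONE_SSD: one replica on SSD storage, remaining replicas
          // on DISK -- the behavior described above.
          // "/blur/tables" is a hypothetical table directory.
          dfs.setStoragePolicy(new Path("/blur/tables"), "ONE_SSD");
        }
      }
    }

The CLI equivalent in hadoop 2.7.1 would be along the lines of:

    hdfs storagepolicies -setStoragePolicy -path /blur/tables \
      -policy ONE_SSD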
On Mon, Jun 22, 2015 at 6:18 PM, Aaron McCurry <[email protected]> wrote:

> On Thu, Jun 18, 2015 at 8:55 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > Apologies for resurrecting this thread…
> >
> > One problem with lucene is OS buffer-cache pollution during segment
> > merges, as documented here:
> >
> > http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html
> >
> > This problem could occur in Blur when short-circuit reads are
> > enabled...
>
> True, but Blur deals with this issue by not allowing (by default) the
> merges to affect the Block Cache.
>
> > My take on this…
> >
> > It may be possible to overcome the problem by simply re-directing
> > merge-read requests to a node other than the local node, instead of
> > fancy stuff like O_DIRECT, FADVISE etc...
>
> I have always thought of having the merge occur in a Mapreduce (or
> Yarn) job instead of locally.
>
> > In a mixed setup, this means merge requests need to be diverted to
> > low-end Rack2 machines {running only data-nodes} while
> > short-circuit read requests will continue to be served from
> > high-end Rack1 machines {running both shard-server and data-nodes}.
> >
> > Hadoop 2.x provides a cool read-API, "seekToNewSource". The API
> > documentation says "Seek to given position on a node other than the
> > current node".
> >
> > From blur code, it's just enough if we open a new FSDataInputStream
> > for merge-reads and issue a seekToNewSource call. Once merges are
> > done, it can be closed & discarded…
> >
> > Please let me know your viewpoints on this…
>
> We could do this, but I find that reading the TIM file types over the
> wire during a merge causes a HUGE slowdown in merge performance. The
> fastest way to merge is to copy the TIM files involved in the merge
> locally, run the merge, and then delete them after the fact.
>
> Aaron
>
> > --
> > Ravi
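A minimal sketch of the seekToNewSource idea described above, assuming
a hypothetical segment file path; this only illustrates the API call,
not Blur's actual merge path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MergeReadFromOtherNode {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical segment file; Blur's real layout may differ.
        Path seg = new Path("/blur/tables/t1/shard-0/_42.tim");
        try (FSDataInputStream in = fs.open(seg)) {
          // Re-seek to the same position on a datanode other than the
          // one currently serving the stream; returns false when no
          // alternative source exists (e.g. replication factor 1).
          boolean switched = in.seekToNewSource(0L);
          byte[] buf = new byte[32 * 1024];
          int n = in.read(buf, 0, buf.length);
          System.out.println("switched=" + switched + ", read=" + n);
        }
      }
    }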
On Mon, Mar 9, 2015 at 5:45 PM, Ravikumar Govindarajan <
[email protected]> wrote:

> On Sat, Mar 7, 2015 at 11:00 AM, Aaron McCurry <[email protected]>
> wrote:
>
> > I thought the normal hdfs replica rules were: one local, one on a
> > remote rack, one on that same rack.
>
> Yes. One copy is local & the other two copies are on the same remote
> rack.
>
> > How did you land on your current configuration?
>
> When I was evaluating the disk-budget, we were looking at 6 expensive
> drives per machine. It led me to think about what those 6 drives
> would do & how we could reduce the cost. Then I stumbled on this
> two-rack setup, and now we need only 2 such drives...
>
> Apart from the reduced disk-budget & write-overhead on the cluster,
> it also helps with greater availability, as a rack-failure would be
> recoverable...
>
> --
> Ravi
