That's great news!

Harinder

On 13 Apr 2016 4:08 p.m., "Aaron McCurry" <[email protected]> wrote:
> Also, any chance that all or some of the mods you guys made could be made
> public and possibly folded back into Blur?

On Wednesday, April 13, 2016, Ravikumar Govindarajan
<[email protected]> wrote:

> Finally, our production grid is live...
>
> We have migrated about 20TB of index data from the old grid to Blur
> (about 1/3rd of the overall grid size).
>
> I must say the results we observe are excellent. There is actually zero
> diff between an HDFSDirectory & a normal FileDirectory impl...
>
> We designed a shared multi-user index inspired by Blur's Rows & Records
> approach + a few good ideas from the NoSQL world...
>
> We made many changes to Blur to support our application requirements.
> Listing down a few of them:
>
>    - ADD_RECORD operation to an existing RowId
>    - Online shard creation
>    - Online alias-shard creation (freeze the old shard & send add-docs
>      calls to newer ones)
>    - Externalised BlurPartitioner (DB based)
>    - Customised write-thru caching (cache only important stuff)
>    - Block-cache meta save & load at server start-up
>    - Kafka integration with back-up data-center mirroring in real time
>    - Blur on Hadoop 2.7.x with mixed SSD/HDD storage & short-circuit reads
>    - Partial document update using Tokyo Cabinet
>
> Thanks to the Blur community & especially Aaron for all the help rendered
> & getting our first cut released...
>
> --
> Ravi

On Tue, Sep 15, 2015 at 3:33 PM, Aaron McCurry <[email protected]> wrote:

> Thanks Ravi
>
> Didn't know that. Good to know.

On Tuesday, September 15, 2015, Ravikumar Govindarajan
<[email protected]> wrote:

>> Basically you need to turn the buffer size down. The hdfs property
>> is: dfs.client.read.shortcircuit.buffer.size
>
> Yes, we ran into this issue. We found that SSR takes two paths during a
> read…
>
> 1. readWithoutBounceBuffer
> 2. readWithBounceBuffer
>
> Only path 2, reading with bounce buffers, uses direct byte buffers and
> OOMs, while path 1 reads are normal reads.
>
> To force the use of path 1, we went through the BlockReaderLocal source
> and found that the following conditions need to be met:
>
> a. Skip checksums
> b. Switch off read-ahead
>
> Tweaking hdfs-default.xml for the following configs forces path 1 to be
> used:
>
> 1. dfs.client.cache.readahead = 0
> 2. dfs.bytes-per-checksum = 1
> 3. dfs.checksum.type = NULL
>
> --
> Ravi

On Tue, Sep 15, 2015 at 7:01 AM, Aaron McCurry <[email protected]> wrote:

> Good stuff! Thanks for sharing! One issue I have found with the
> short-circuit reads:
>
> https://issues.apache.org/jira/browse/HBASE-8143
>
> Basically you need to turn the buffer size down. The hdfs property
> is: dfs.client.read.shortcircuit.buffer.size
>
> Aaron
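For reference, the client-side settings described in the two messages above would look roughly like the sketch below if applied through Hadoop's Configuration API (they would normally live in the client's HDFS config file). This is only an illustration of the thread's recipe, not a Blur-provided API; the property names are the ones quoted above, the values mirror what Ravi and Aaron report, and the 128 KB buffer size is purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch only: the client-side HDFS settings the thread describes for pushing
// BlockReaderLocal down the readWithoutBounceBuffer path and shrinking the
// short-circuit read buffer. Values mirror the thread, not recommendations.
public class ShortCircuitReadSettings {
  public static Configuration apply(Configuration conf) {
    // Switch off client read-ahead (condition "b" above).
    conf.setInt("dfs.client.cache.readahead", 0);
    // Checksum-related settings Ravi lists for condition "a" above.
    conf.setInt("dfs.bytes-per-checksum", 1);
    conf.set("dfs.checksum.type", "NULL");
    // Aaron's advice from HBASE-8143: turn the per-stream buffer size down.
    conf.setInt("dfs.client.read.shortcircuit.buffer.size", 128 * 1024);
    return conf;
  }
}
```

Whether disabling read-ahead and checksums is acceptable depends on the workload; the thread treats these as workarounds for the direct-buffer OOMs rather than general tuning advice.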
On Mon, Sep 14, 2015 at 6:42 AM, Ravikumar Govindarajan
<[email protected]> wrote:

> Finally we are done with testing short-circuit reads and the SSD_One
> policy. Summarizing a few crucial points we observed during query runs:
>
> 1. A single read issued by the hadoop client takes on average
>    0.15-0.25 ms for a 32KB read size. Sometimes this could be on the
>    higher side, like 0.6-0.65 ms per read… Actual SSD latencies from
>    iostat were around 0.1 ms, with spikes of 0.6 ms.
>
> 2. The overhead of the hadoop wrapper code involved in SSD reads is very
>    minimal & negligible. However, we tested with a single thread. Maybe
>    when multiple threads are involved during queries, hadoop could be a
>    spoiler.
>
> 3. It still makes sense to retain the block cache. Assuming a bad query
>    makes about 1000 trips to hadoop, time consumed ~= 0.15 * 1000 =
>    150 ms. The block cache could play a crucial role here. It could also
>    help in resolving multi-threaded accesses.
>
> 4. Segment writes/merges are actually slower than on HDD, maybe because
>    of sequential reads…
>
> Overall, we found good gains, especially for queries using short-circuit
> reads combined with the block cache.
>
> --
> Ravi

On Wed, Aug 12, 2015 at 6:34 PM, Ravikumar Govindarajan
<[email protected]> wrote:

> Our very basic testing with the SSD_One policy works as expected. Now we
> are moving on to test the efficiency of SSD reads via hadoop.
>
> I see numerous params that need to be set up for hadoop short-circuit
> reads, as documented here…
>
> http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.7/bk_system-admin-guide/content/ch_short-circuit-reads-hdfs.html
>
> For production workloads, are there any standard configs for blur?
>
> Especially the following params:
>
> 1. dfs.client.read.shortcircuit.streams.cache.size
>
> 2. dfs.client.read.shortcircuit.streams.cache.expiry.ms
>
> 3. dfs.client.read.shortcircuit.buffer.size
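The thread never settles on "standard" values for the three properties Ravi asks about, so the sketch below only shows where those knobs sit, alongside the basic short-circuit-read enablement the linked guide covers. All values are placeholders, the domain-socket path is the usual example from the HDFS documentation, and the comments describe the properties' rough purpose rather than Blur guidance.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of the short-circuit read knobs asked about above.
// Values are illustrative placeholders, not recommendations.
public class ShortCircuitCacheSettings {
  public static Configuration apply(Configuration conf) {
    // Enable short-circuit reads; client and datanodes must agree on the
    // domain socket path (the usual example path from the HDFS docs).
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
    // Roughly: how many short-circuit streams/descriptors a client caches...
    conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 256);
    // ...and how long an unused cache entry is kept, in milliseconds.
    conf.setLong("dfs.client.read.shortcircuit.streams.cache.expiry.ms", 5 * 60 * 1000L);
    // Per-stream buffer size; see Aaron's HBASE-8143 note earlier in the thread.
    conf.setInt("dfs.client.read.shortcircuit.buffer.size", 128 * 1024);
    return conf;
  }
}
```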
On Tue, Aug 11, 2015 at 6:13 PM, Aaron McCurry <[email protected]> wrote:

> That is awesome! Let me know your results when you get a chance.
>
> Aaron

On Mon, Aug 10, 2015 at 9:21 AM, Ravikumar Govindarajan
<[email protected]> wrote:

> Hadoop 2.7.1 is out and now handles mixed storage… A single
> data-node/shard-server can run HDDs & SSDs together…
>
> More about this here…
>
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
>
> The policy I looked for was "SSD_One". The first copy of index data,
> placed on the local machine, will be stored on SSD. The second & third
> copies stored on other machines will be on HDDs…
>
> This eliminates the need for the mixed setup using RACK1 & RACK2 I
> previously thought of. Hadoop 2.7.1 helps me achieve this in a single
> cluster of machines running data-nodes + shard-servers.
>
> Every machine stores the primary copy on SSDs. Writes, searches and
> merges all take advantage of it, while replication can be relegated to
> slower but bigger-capacity HDDs. These HDDs also serve as an online
> backup of the less fault-tolerant SSDs.
>
> We have ported our in-house blur extension to hadoop-2.7.1. Will update
> on test results shortly.
>
> --
> Ravi
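The policy Ravi refers to as "SSD_One" appears to correspond to the One_SSD policy in the ArchivalStorage document linked above (one replica on SSD storage, the remaining replicas on DISK). Applying it to a table's index directory might look roughly like the sketch below; the path is hypothetical, the policy-name string and the cast to DistributedFileSystem are assumptions about running directly against HDFS 2.6+, and this is not part of Blur itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch: tag a (hypothetical) Blur table directory with the ONE_SSD storage
// policy so the first replica lands on SSD and the other replicas on DISK.
public class ApplyOneSsdPolicy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path tableDir = new Path("/blur/tables/my-table"); // hypothetical path
    FileSystem fs = tableDir.getFileSystem(conf);
    if (fs instanceof DistributedFileSystem) {
      ((DistributedFileSystem) fs).setStoragePolicy(tableDir, "ONE_SSD");
    }
  }
}
```

Per the same document, the policy only has an effect if the DataNode data directories are tagged with storage types (e.g. `[SSD]` and `[DISK]` prefixes in dfs.datanode.data.dir).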
On Mon, Jun 22, 2015 at 6:18 PM, Aaron McCurry <[email protected]> wrote:

> On Thu, Jun 18, 2015 at 8:55 AM, Ravikumar Govindarajan
> <[email protected]> wrote:
>
>> Apologize for resurrecting this thread…
>>
>> One problem with lucene is OS buffer-cache pollution during segment
>> merges, as documented here:
>>
>> http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html
>>
>> This problem could occur in Blur when short-circuit reads are enabled...
>
> True, but Blur deals with this issue by not allowing (by default) the
> merges to affect the block cache.
>
>> My take on this…
>>
>> It may be possible to overcome the problem by simply re-directing
>> merge-read requests to a node other than the local node, instead of
>> fancy stuff like O_DIRECT, FADVISE etc...
>
> I have always thought of having the merge occur in a MapReduce (or YARN)
> job instead of locally.
>
>> In a mixed setup, this means merge requests need to be diverted to
>> low-end Rack2 machines {running only data-nodes}, while short-circuit
>> read requests will continue to be served from high-end Rack1 machines
>> {running both shard-server and data-nodes}.
>>
>> Hadoop 2.x provides a cool read API, "seekToNewSource". The API
>> documentation says "Seek to given position on a node other than the
>> current node".
>>
>> From blur code, it's just enough if we open a new FSDataInputStream for
>> merge-reads and issue a seekToNewSource call. Once merges are done, it
>> can be closed & discarded…
>>
>> Please let me know your viewpoints on this…
>>
>> --
>> Ravi
>
> We could do this, but I find that reading the TIM file types over the
> wire during a merge causes a HUGE slowdown in merge performance. The
> fastest way to merge is to copy the TIM files involved in the merge
> locally, run the merge, and then delete them after the fact.
>
> Aaron
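As a rough illustration of the seekToNewSource idea Ravi describes above (not actual Blur code; the segment file name and read position are hypothetical), the merge side would open its own stream and ask the DFS client to switch away from the replica it is currently reading, which for a locally stored block would be the local one:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the proposal: a dedicated input stream for merge reads that is
// nudged off the current replica so merges don't pollute the local OS cache.
public class MergeReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path segmentFile = new Path("/blur/tables/my-table/shard-0/_42.tim"); // hypothetical
    FileSystem fs = segmentFile.getFileSystem(conf);
    try (FSDataInputStream in = fs.open(segmentFile)) {
      long pos = in.getPos();
      // Ask HDFS to serve this position from a node other than the one it is
      // currently reading from; returns false if no other replica is available.
      boolean switched = in.seekToNewSource(pos);
      System.out.println("Moved to a different replica: " + switched);
      // ... hand `in` to the merge reader; close and discard it once the merge finishes.
    }
  }
}
```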
On Mon, Mar 9, 2015 at 5:45 PM, Ravikumar Govindarajan
<[email protected]> wrote:

> On Sat, Mar 7, 2015 at 11:00 AM, Aaron McCurry <[email protected]> wrote:
>
>> I thought the normal hdfs replica rules were one copy local, one on a
>> remote rack, one on the same rack.
>
> Yes. One copy is local & the other two copies are on the same remote
> rack.
>
>> How did you land on your current configuration?
>
> When I was evaluating the disk budget, we were looking at 6 expensive
> drives per machine. It led me to think about what those 6 drives would
> do & how we could reduce the cost. Then I stumbled on this two-rack
> setup, and now we need only 2 such drives...
>
> Apart from the reduced disk budget & write overhead on the cluster, it
> also helps with greater availability, as a rack failure would be
> recoverable...
>
> --
> Ravi
