Also, any chance that all or some of the mods you guys made could be made
public and possibly folded back into Blur?

On Wednesday, April 13, 2016, Ravikumar Govindarajan <
[email protected]> wrote:

> Finally, our production grid is live...
>
> We have migrated about 20 TB of index data from the old grid to Blur (about
> 1/3rd of the overall grid size).
>
> I must say the results we observe are excellent. There is effectively
> zero difference between an HDFSDirectory & a normal FileDirectory impl...
>
> We designed a shared multi-user index inspired by Blur's Rows & Records
> approach plus a few good ideas from the NoSQL world...
>
> We made many changes to Blur to support our application requirements. Listing
> a few of them...
>
>    - ADD_RECORD operation on an existing RowId
>    - Online shard creation
>    - Online alias-shard creation (freeze the old shard & send add-doc calls to
>    the newer ones)
>    - Externalised BlurPartitioner (DB based; a hypothetical sketch follows the list)
>    - Customised write-through caching (cache only the important data)
>    - Block-cache metadata save & load at server start-up
>    - Kafka integration with real-time mirroring to a back-up data-center
>    - Blur on Hadoop 2.7.x with mixed SSD/HDD storage & short-circuit reads
>    - Partial document updates using Tokyo Cabinet
>
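> For the DB-based partitioner item above, the idea is roughly the following. This is a
> purely hypothetical sketch; the class, SQL table and method names are made up for
> illustration and are not Blur's actual BlurPartitioner API:
>
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.PreparedStatement;
>     import java.sql.ResultSet;
>
>     // Hypothetical: resolve rowId -> shard via an external DB instead of hashing.
>     public class DbBackedPartitioner {
>       private final String jdbcUrl;
>
>       public DbBackedPartitioner(String jdbcUrl) { this.jdbcUrl = jdbcUrl; }
>
>       public int shardFor(String table, String rowId) throws Exception {
>         try (Connection c = DriverManager.getConnection(jdbcUrl);
>              PreparedStatement ps = c.prepareStatement(
>                  "SELECT shard_id FROM shard_map WHERE table_name = ? AND row_id = ?")) {
>           ps.setString(1, table);
>           ps.setString(2, rowId);
>           try (ResultSet rs = ps.executeQuery()) {
>             return rs.next() ? rs.getInt(1) : -1;  // -1 = unmapped; caller decides
>           }
>         }
>       }
>     }
>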
> Thanks to the Blur community & especially Aaron for all the help rendered in
> getting our first cut released...
>
> --
> Ravi
>
> On Tue, Sep 15, 2015 at 3:33 PM, Aaron McCurry <[email protected]> wrote:
>
> > Thanks Ravi
> >
> > Didn't know that. Good to know.
> >
> > On Tuesday, September 15, 2015, Ravikumar Govindarajan <[email protected]> wrote:
> >
> > > >
> > > > Basically you need to turn the buffer size down.  The hdfs property
> > > > is: dfs.client.read.shortcircuit.buffer.size
> > >
> > >
> > > Yes, we ran into this issue. We found that short-circuit reads (SSR) take
> > > two paths during a read…
> > >
> > > 1. readWithoutBounceBuffer
> > > 2. readWithBounceBuffer
> > >
> > > Only path 2, i.e. reading with bounce buffers, uses direct byte-buffers
> > > and can OOM, while path-1 reads are normal reads.
> > >
> > > To force the use of path 1, we went through the BlockReaderLocal source and
> > > found that the following conditions need to be met:
> > >
> > > a. Skip checksums
> > > b. Switch off read-ahead
> > >
> > > Tweaking hdfs-default.xml with the following configs forces path 1 to be
> > > used (a sketch of doing the same programmatically follows the list):
> > >
> > > 1. dfs.client.cache.readahead = 0
> > > 2. dfs.bytes-per-checksum = 1
> > > 3. dfs.checksum.type = NULL
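> > >
> > > Roughly the same overrides can also be applied on the client-side Configuration
> > > instead of editing the XML. This is only an illustrative sketch (the class name
> > > is made up), assuming the keys above behave as client-side settings:
> > >
> > >     import org.apache.hadoop.conf.Configuration;
> > >     import org.apache.hadoop.fs.FileSystem;
> > >
> > >     public class ForcePathOneReads {
> > >       public static Configuration build() {
> > >         Configuration conf = new Configuration();
> > >         // Disable read-ahead so BlockReaderLocal can skip the bounce-buffer path.
> > >         conf.setLong("dfs.client.cache.readahead", 0L);
> > >         // Disable checksum verification, the other pre-condition for path 1.
> > >         conf.set("dfs.checksum.type", "NULL");
> > >         conf.setInt("dfs.bytes-per-checksum", 1);
> > >         // Short-circuit reads must still be enabled for any of this to matter.
> > >         conf.setBoolean("dfs.client.read.shortcircuit", true);
> > >         return conf;
> > >       }
> > >
> > >       public static void main(String[] args) throws Exception {
> > >         FileSystem fs = FileSystem.get(build());   // client picks up the overrides
> > >         System.out.println("Using " + fs.getUri());
> > >         fs.close();
> > >       }
> > >     }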
> > >
> > > --
> > > Ravi
> > >
> > > On Tue, Sep 15, 2015 at 7:01 AM, Aaron McCurry <[email protected]> wrote:
> > >
> > > > Good stuff!  Thanks for sharing!  One issue I have found with the
> > > > short-circuit reads:
> > > >
> > > > https://issues.apache.org/jira/browse/HBASE-8143
> > > >
> > > > Basically you need to turn the buffer size down.  The hdfs property
> > > > is: dfs.client.read.shortcircuit.buffer.size
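> > > >
> > > > For example (the value here is only an illustration, not a recommendation;
> > > > each open short-circuit stream allocates direct buffers of roughly this size,
> > > > so a smaller value keeps the total direct-memory footprint down):
> > > >
> > > >     Configuration conf = new Configuration();
> > > >     conf.setInt("dfs.client.read.shortcircuit.buffer.size", 128 * 1024);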
> > > >
> > > > Aaron
> > > >
> > > > On Mon, Sep 14, 2015 at 6:42 AM, Ravikumar Govindarajan <[email protected]> wrote:
> > > >
> > > > > Finally, we are done with testing short-circuit reads and the SSD_One
> > > > > policy. Summarizing a few crucial points we observed during query runs:
> > > > >
> > > > > 1. A single read issued by the hadoop-client takes on average 0.15-0.25
> > > > >     ms for a 32KB read. Sometimes this could be on the higher side,
> > > > >     like 0.6-0.65 ms per read… Actual SSD latencies from iostat were
> > > > >     around 0.1 ms with spikes of 0.6 ms
> > > > >
> > > > > 2. The overhead of the hadoop wrapper code involved in SSD reads is
> > > > >     minimal & negligible. However, we tested with a single thread. Maybe
> > > > >     when multiple threads are involved during queries, hadoop could be
> > > > >     a spoiler
> > > > >
> > > > > 3. It still makes sense to retain the block-cache. Assume a bad query
> > > > >     makes about 1000 trips to hadoop: time consumed ~= 0.15 * 1000 =
> > > > >     150 ms. The block-cache could play a crucial role here. It could also
> > > > >     help in resolving multi-threaded accesses
> > > > >
> > > > > 4. Segment writes/merges are actually slower than on HDD, maybe because
> > > > >     of sequential reads…
> > > > >
> > > > > Overall, we found good gains, especially for queries using short-circuit
> > > > > reads combined with the block-cache.
> > > > >
> > > > > --
> > > > > Ravi
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Aug 12, 2015 at 6:34 PM, Ravikumar Govindarajan <[email protected]> wrote:
> > > > >
> > > > > > Our very basic testing with the SSD_One policy works as expected. Now
> > > > > > we are moving on to testing the efficiency of SSD reads via hadoop...
> > > > > >
> > > > > > I see numerous params that need to be set up for hadoop short-circuit
> > > > > > reads, as documented here…
> > > > > >
> > > > > >
> > > > > > http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.7/bk_system-admin-guide/content/ch_short-circuit-reads-hdfs.html
> > > > > >
> > > > > > For production workloads, are there any standard configs for Blur?
> > > > > >
> > > > > > Especially the following params (an illustrative client-side sketch
> > > > > > follows the list)
> > > > > >
> > > > > > 1. dfs.client.read.shortcircuit.streams.cache.size
> > > > > >
> > > > > > 2. dfs.client.read.shortcircuit.streams.cache.expiry.ms
> > > > > >
> > > > > > 3. dfs.client.read.shortcircuit.buffer.size
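> > > > > >
> > > > > > For illustration only, these can be set on the client Configuration roughly as
> > > > > > below; the values are placeholders, not recommendations, and the class name is
> > > > > > made up:
> > > > > >
> > > > > >     import org.apache.hadoop.conf.Configuration;
> > > > > >
> > > > > >     public class ShortCircuitClientConf {
> > > > > >       public static Configuration build() {
> > > > > >         Configuration conf = new Configuration();
> > > > > >         conf.setBoolean("dfs.client.read.shortcircuit", true);
> > > > > >         // Number of cached short-circuit streams (placeholder value).
> > > > > >         conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 256);
> > > > > >         // Expiry for unused cached streams, in milliseconds (placeholder value).
> > > > > >         conf.setLong("dfs.client.read.shortcircuit.streams.cache.expiry.ms", 300000L);
> > > > > >         // Per-stream buffer size in bytes (placeholder: 128 KB).
> > > > > >         conf.setInt("dfs.client.read.shortcircuit.buffer.size", 128 * 1024);
> > > > > >         return conf;
> > > > > >       }
> > > > > >     }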
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 11, 2015 at 6:13 PM, Aaron McCurry <[email protected]> wrote:
> > > > > >
> > > > > >> That is awesome!  Let me know your results when you get a chance.
> > > > > >>
> > > > > >> Aaron
> > > > > >>
> > > > > >> On Mon, Aug 10, 2015 at 9:21 AM, Ravikumar Govindarajan <[email protected]> wrote:
> > > > > >>
> > > > > >> > Hadoop 2.7.1 is out and now handles mixed storage… A single
> > > > > >> > data-node/shard-server can run HDDs & SSDs together…
> > > > > >> >
> > > > > >> > More about this here…
> > > > > >> >
> > > > > >> >
> > > > > >> > http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
> > > > > >> >
> > > > > >> > The policy I looked for was "SSD_One" (One_SSD in the Hadoop docs). The
> > > > > >> > first copy of index-data, placed on the local machine, will be stored on
> > > > > >> > SSD. The second & third copies, stored on other machines, will be on HDDs…
> > > > > >> >
> > > > > >> > This eliminates the need for the mixed RACK1 & RACK2 setup I previously
> > > > > >> > thought of. Hadoop 2.7.1 helps me achieve this in a single cluster of
> > > > > >> > machines running data-nodes + shard-servers
> > > > > >> >
> > > > > >> > Every machine stores the primary copy on SSD. Writes, searches and merges
> > > > > >> > all take advantage of it, while replication can be relegated to slower but
> > > > > >> > bigger-capacity HDDs. These HDDs also serve as an online backup of the less
> > > > > >> > fault-tolerant SSDs
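> > > > > >> >
> > > > > >> > For illustration, setting the policy on a table directory could look roughly
> > > > > >> > like the sketch below (the path is a made-up placeholder; the policy is named
> > > > > >> > "ONE_SSD" in HDFS, and one replica then lands on SSD while the rest go to DISK):
> > > > > >> >
> > > > > >> >     import org.apache.hadoop.conf.Configuration;
> > > > > >> >     import org.apache.hadoop.fs.FileSystem;
> > > > > >> >     import org.apache.hadoop.fs.Path;
> > > > > >> >     import org.apache.hadoop.hdfs.DistributedFileSystem;
> > > > > >> >
> > > > > >> >     public class SetOneSsdPolicy {
> > > > > >> >       public static void main(String[] args) throws Exception {
> > > > > >> >         Configuration conf = new Configuration();
> > > > > >> >         // Assumes fs.defaultFS points at the HDFS cluster.
> > > > > >> >         DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
> > > > > >> >         // Hypothetical table path used only for illustration.
> > > > > >> >         dfs.setStoragePolicy(new Path("/blur/tables/example"), "ONE_SSD");
> > > > > >> >         dfs.close();
> > > > > >> >       }
> > > > > >> >     }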
> > > > > >> >
> > > > > >> > We have ported our in-house Blur extension to hadoop-2.7.1. Will update
> > > > > >> > with test results shortly
> > > > > >> >
> > > > > >> > --
> > > > > >> > Ravi
> > > > > >> >
> > > > > >> > On Mon, Jun 22, 2015 at 6:18 PM, Aaron McCurry <[email protected]> wrote:
> > > > > >> >
> > > > > >> > > On Thu, Jun 18, 2015 at 8:55 AM, Ravikumar Govindarajan <[email protected]> wrote:
> > > > > >> > >
> > > > > >> > > > Apologies for resurrecting this thread…
> > > > > >> > > >
> > > > > >> > > > One problem with Lucene is OS buffer-cache pollution during segment
> > > > > >> > > > merges, as documented here:
> > > > > >> > > >
> > > > > >> > > > http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html
> > > > > >> > > >
> > > > > >> > > > This problem could occur in Blur, when short-circuit reads are
> > > > > >> > > > enabled...
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > True, but Blur deals with this issue by not allowing (by default) the
> > > > > >> > > merges to affect the Block Cache.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > My take on this…
> > > > > >> > > >
> > > > > >> > > > It may be possible to overcome the problem by simply redirecting
> > > > > >> > > > merge-read requests to a node other than the local node, instead of
> > > > > >> > > > fancy stuff like O_DIRECT, FADVISE etc...
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > I have always thought of having merges occur in a MapReduce (or YARN)
> > > > > >> > > job instead of locally.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > In a mixed setup, this means merge requests need to be diverted to
> > > > > >> > > > low-end Rack2 machines {running only data-nodes}, while short-circuit
> > > > > >> > > > read requests will continue to be served from high-end Rack1 machines
> > > > > >> > > > {running both shard-servers and data-nodes}
> > > > > >> > > >
> > > > > >> > > > Hadoop 2.x provides a cool read API, "seekToNewSource". The API
> > > > > >> > > > documentation says "Seek to given position on a node other than the
> > > > > >> > > > current node"
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > > From the Blur code, it's enough if we open a new FSDataInputStream for
> > > > > >> > > > merge-reads and issue a seekToNewSource call (rough sketch below). Once
> > > > > >> > > > merges are done, it can be closed & discarded…
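> > > > > >> > > >
> > > > > >> > > > Something like the following, purely as an illustration (the segment path,
> > > > > >> > > > buffer size and class name are made up; this is not actual Blur code):
> > > > > >> > > >
> > > > > >> > > >     import org.apache.hadoop.conf.Configuration;
> > > > > >> > > >     import org.apache.hadoop.fs.FSDataInputStream;
> > > > > >> > > >     import org.apache.hadoop.fs.FileSystem;
> > > > > >> > > >     import org.apache.hadoop.fs.Path;
> > > > > >> > > >
> > > > > >> > > >     public class MergeReadAway {
> > > > > >> > > >       public static void main(String[] args) throws Exception {
> > > > > >> > > >         FileSystem fs = FileSystem.get(new Configuration());
> > > > > >> > > >         // Hypothetical segment file that is part of a merge.
> > > > > >> > > >         Path seg = new Path("/blur/tables/example/shard-0/_42.cfs");
> > > > > >> > > >         try (FSDataInputStream in = fs.open(seg)) {
> > > > > >> > > >           // Ask the client to re-resolve the block to a replica on a node
> > > > > >> > > >           // other than the current one, i.e. skip the local copy.
> > > > > >> > > >           boolean switched = in.seekToNewSource(0L);
> > > > > >> > > >           byte[] buf = new byte[64 * 1024];
> > > > > >> > > >           int n = in.read(buf);   // merge reads would continue from here
> > > > > >> > > >           System.out.println("switched=" + switched + " bytes=" + n);
> > > > > >> > > >         }
> > > > > >> > > >         fs.close();
> > > > > >> > > >       }
> > > > > >> > > >     }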
> > > > > >> > > >
> > > > > >> > > > Please let me know your viewpoints on this…
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > We could do this, but I find that reading the TIM file types over the
> > > > > >> > > wire during a merge causes a HUGE slowdown in merge performance. The
> > > > > >> > > fastest way to merge is to copy the TIM files involved in the merge
> > > > > >> > > locally to run the merge, and then delete them after the fact.
> > > > > >> > >
> > > > > >> > > Aaron
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > --
> > > > > >> > > > Ravi
> > > > > >> > > >
> > > > > >> > > > On Mon, Mar 9, 2015 at 5:45 PM, Ravikumar Govindarajan <[email protected]> wrote:
> > > > > >> > > >
> > > > > >> > > > >
> > > > > >> > > > > On Sat, Mar 7, 2015 at 11:00 AM, Aaron McCurry <[email protected]> wrote:
> > > > > >> > > > >
> > > > > >> > > > >>
> > > > > >> > > > >> I thought the normal hdfs replica rules were: one local, one on a
> > > > > >> > > > >> remote rack, and one more on that same remote rack.
> > > > > >> > > > >>
> > > > > >> > > > >
> > > > > >> > > > > Yes. One copy is local & the other two copies are on the same
> > > > > >> > > > > remote rack.
> > > > > >> > > > >
> > > > > >> > > > >> How did you land on your current configuration?
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > When I was evaluating the disk budget, we were looking at 6 expensive
> > > > > >> > > > > drives per machine. It led me to think about what those 6 drives would
> > > > > >> > > > > do & how we could reduce the cost. Then I stumbled on this two-rack
> > > > > >> > > > > setup, and now we need only 2 such drives...
> > > > > >> > > > >
> > > > > >> > > > > Apart from the reduced disk budget & write overhead on the cluster, it
> > > > > >> > > > > also helps with greater availability, as a rack failure would be
> > > > > >> > > > > recoverable...
> > > > > >> > > > >
> > > > > >> > > > > --
> > > > > >> > > > > Ravi
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
