Re: Index Warmup in Blur

Ravikumar Govindarajan Thu, 10 Oct 2013 03:48:35 -0700

I saw this JIRA on humungous rows and got quite confused about the UPDATE_ROW
operation.


https://issues.apache.org/jira/browse/BLUR-220

Let's say I add 2 records to a row, whose existing records number in
hundreds-of-thousands.

Will Blur attempt to first read all these records before adding the
incoming 2 records?

What if we just expose simple record-add/delete on a row, without fetching
the row at all?

It should be quite quick and highly useful, at least for apps already using
Lucene.
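
A minimal sketch of the record-level idea, in plain Java. It is only a model
of the behaviour, not Blur's or Lucene's API: the class AppendOnlyRow and all
of its methods are hypothetical names. Adds are appended and deletes are
recorded as tombstones, so neither mutation reads the records already in the
row; existing records are only visited when the row is actually read.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a row whose mutations never read existing records. */
final class AppendOnlyRow {
    /** One appended record, stamped with the order in which it arrived. */
    private static final class Entry {
        final String recordId;
        final long seq;

        Entry(String recordId, long seq) {
            this.recordId = recordId;
            this.seq = seq;
        }
    }

    private final List<Entry> entries = new ArrayList<>();       // append-only records
    private final Map<String, Long> deletedAt = new HashMap<>(); // recordId -> seq of latest delete
    private long nextSeq = 0;
    private long recordsRead = 0; // how many existing records have been visited

    /** Appends a record; cost does not depend on how many records the row holds. */
    void addRecord(String recordId) {
        entries.add(new Entry(recordId, nextSeq++));
    }

    /** Records a tombstone; existing records are not fetched. */
    void deleteRecord(String recordId) {
        deletedAt.put(recordId, nextSeq++);
    }

    /** Read path: a record is live if it was added after its latest delete, if any. */
    List<String> liveRecordIds() {
        List<String> live = new ArrayList<>();
        for (Entry e : entries) {
            recordsRead++;
            Long deleted = deletedAt.get(e.recordId);
            if (deleted == null || e.seq > deleted) {
                live.add(e.recordId);
            }
        }
        return live;
    }

    long recordsRead() {
        return recordsRead;
    }
}
```

With this shape, adding 2 records to a row that already holds hundreds of
thousands costs two appends, whatever the row size.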

--
Ravi


On Wed, Oct 9, 2013 at 11:27 AM, Ravikumar Govindarajan <
[email protected]> wrote:

> Yes, I think bringing in a mutable file in lucene-index brings its own
> set of problems to handle. Filters, Caches, Scoring, Snapshots/Commits
> etc... will all be affected.
>
> There is a JIRA on writing generations of updatable files, just like
> doc-deletes instead of over-writing a single file.[
> https://issues.apache.org/jira/browse/LUCENE-4258]. But that is still
> in-progress and from what I understand, it could slow searches considerably.
>
> BTW, is it possible to extend BlurPartitioner and load it during start-up?
>
> Also, it would be awesome if Blur supports a per-row auto-complete feature.
>
> --
> Ravi
>
>
> On Sat, Oct 5, 2013 at 2:01 AM, Aaron McCurry <[email protected]> wrote:
>
>> I have thought of one possible problem with this approach.  To date the
>> mindset I have used in all of the Blur internals is that segments are
>> immutable.  This is a fundamental principle that Blur uses and I don't
>> really have any ideas on where to begin checking for when this is a
>> problem.  I know filters are going to be an issue, not sure where else.
>>
>> Not saying that it can't be done, it's just not going to be as clean as I
>> originally thought.
>>
>> Aaron
>>
>>
>> On Fri, Oct 4, 2013 at 4:26 PM, Aaron McCurry <[email protected]> wrote:
>>
>> >
>> >
>> > On Fri, Oct 4, 2013 at 7:15 AM, Ravikumar Govindarajan <
>> > [email protected]> wrote:
>> >
>> >> On a related note, do you think such an approach will fit in Blur?
>> >>
>> >> 1. Store the BDB file in shard-server itself.
>> >>
>> >
>> > Probably not, this would pin the BDB (or whatever the solution would be)
>> > to a specific server.  We will have to sync to HDFS.
>> >
>> >
>> >>
>> >> 2. Apply all incoming partial doc-updates to local BDB file as well as
>> an
>> >>     update-transaction log
>> >>
>> >
>> > Blur already has a write-ahead log as a part of its internals.  It's written
>> > and synced to HDFS.
>> >
>> >
>> >>
>> >> 3. Periodically sync dirty BDB files to HDFS and roll-over the update-
>> >>  transaction log.
>> >
>> >
>> >> Whenever a shard-server goes down, the take-over server can initially
>> sync
>> >> the BDB file from HDFS to local, replay the update-transaction log and
>> >> then
>> >> start serving data
>> >>
>> >
>> > Blur already does this internally, it records the mutates and replays
>> them
>> > if a failure happens before a commit.
>> >
>> > Aaron
>> >
>> >
>> >>
>> >> --
>> >> Ravi
>> >>
>> >>
>> >> On Thu, Oct 3, 2013 at 11:14 PM, Ravikumar Govindarajan <
>> >> [email protected]> wrote:
>> >>
>> >> > The mutate APIs are a good fit for individual cols update. BlurCodec
>> >> will
>> >> > be cool and solve a lot of problems.
>> >> >
>> >> > There are 3 caveats for such a codec
>> >> >
>> >> > 1. Scores for affected queries will be wrong, until segment-merge
>> >> >
>> >> > 2. Responsibility of ordering updates must be on the client.
>> >> >
>> >> > 3. Repeated updates for the same document can either take a
>> generational
>> >> > approach [Lucene-4258] or use a single version of storage [Redis/TC
>> >> etc..],
>> >> > pushing the onus to client, depending on how the Codec shapes up.
>> >> >
>> >> > The former will be semantically correct but really sluggish while the
>> >> > latter will be faster during search
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Oct 3, 2013 at 8:53 PM, Aaron McCurry <[email protected]>
>> >> wrote:
>> >> >
>> >> >> On Thu, Oct 3, 2013 at 11:08 AM, Ravikumar Govindarajan <
>> >> >> [email protected]> wrote:
>> >> >>
>> >> >> > Yeah, you are correct. A BDB file might probably never be ported
>> to
>> >> >> HDFS.
>> >> >> >
>> >> >> > Our daily update frequency comes to about 20% of insertion rate.
>> >> >> >
>> >> >> > Let's say "UPDATE <TABLE> SET COL2=1 WHERE COL1=X".
>> >> >> >
>> >> >> > This update could potentially span across tens of thousands of SQL
>> >> rows
>> >> >> in
>> >> >> > our case, where COL2 is just a boolean flip.
>> >> >> >
>> >> >> > The problem is not with lucene's ability to handle load. Instead
>> it
>> >> is
>> >> >> with
>> >> >> > the consistent load it puts on our content servers to read and
>> >> >> re-tokenize
>> >> >> > such huge rows just for a boolean flip. Another big winner is that
>> >> all
>> >> >> our
>> >> >> > updatable fields are not involved in scoring at all. Just matching
>> >> will
>> >> >> do.
>> >> >> >
>> >> >> > The changes also sit in BDB only till the next segment merge,
>> after
>> >> >> which
>> >> >> > it is cleaned out. There is very little perf hit here for us, as
>> >> users
>> >> >> > don't immediately search after a change.
>> >> >> >
>> >> >> > I am afraid there is no documentation/code/numbers on this
>> currently
>> >> in
>> >> >> > public, as it is still proprietary but is remarkably similar to
>> the
>> >> >> popular
>> >> >> > RedisCodec.
>> >> >> >
>> >> >> > "If you really need partial document updates, there would need to
>> be
>> >> >> > changes
>> >> >> > throughout the entire stack"
>> >> >> >
>> >> >> > You mean, the entire stack of Blur? In case this is possible, can
>> you
>> >> >> give
>> >> >> > me a 10000-ft overview of what you have in mind?
>> >> >> >
>> >> >>
>> >> >> Interesting, now that I think about it.  The situation that you
>> >> describe
>> >> >> is
>> >> >> very interesting, I'm wondering if we came up with something like
>> this
>> >> in
>> >> >> Blur that it would fix our large Row issue.  Or at the very least
>> help
>> >> the
>> >> >> problem.
>> >> >>
>> >> >> https://issues.apache.org/jira/browse/BLUR-220
>> >> >>
>> >> >> Plus the more I think about it, the mutate methods are probably the
>> >> right
>> >> >> implementation for modifying single columns.  So the API of Blur
>> >> probably
>> >> >> wouldn't need to be changed.  Maybe just the way it goes about
>> dealing
>> >> >> with
>> >> >> changes.  I'm thinking maybe we need our own BlurCodec to handle large
>> >> Rows
>> >> >> as well as Record (Document) updates.
>> >> >>
>> >> >> As an aside I constantly am having to refer to Records as Documents,
>> >> this
>> >> >> is why I think we need a rename.
>> >> >>
>> >> >> Aaron
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > --
>> >> >> > Ravi
>> >> >> >
>> >> >> >
>> >> >> > On Thu, Oct 3, 2013 at 5:36 PM, Aaron McCurry <[email protected]
>> >
>> >> >> wrote:
>> >> >> >
>> >> >> > > The biggest issue with this is that the shards (the indexes)
>> >> inside of
>> >> >> > Blur
>> >> >> > > actually move from one server to another.  So to support this
>> >> behavior
>> >> >> > all
>> >> >> > > the indexes are stored in HDFS.  Due to the differences between
>> >> HDFS
>> >> >> and
>> >> >> > > a normal POSIX file system, I highly doubt that the BDB file
>> >> form
>> >> >> in
>> >> >> > > TokyoCabinet can ever be supported.
>> >> >> > >
>> >> >> > > If you really need partial document updates, there would need
>> to be
>> >> >> > changes
>> >> >> > > throughout the entire stack.  I am curious why you need this
>> >> feature?
>> >> >>  Do
>> >> >> > > you have that many updates to the index?  What is the update
>> >> >> frequency?
>> >> >> > >  I'm just curious of what kind of performance you get out of a
>> >> setup
>> >> >> like
>> >> >> > > that?  Since I haven't ever run such a setup I have no idea how
>> to
>> >> >> > compare
>> >> >> > > that kind of system to a base Lucene setup.
>> >> >> > >
>> >> >> > > Could you point me to some code or documentation?  I would like to go
>> >> and
>> >> >> > take a
>> >> >> > > look.
>> >> >> > >
>> >> >> > > Thanks,
>> >> >> > > Aaron
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > > On Thu, Oct 3, 2013 at 7:00 AM, Ravikumar Govindarajan <
>> >> >> > > [email protected]> wrote:
>> >> >> > >
>> >> >> > > > One more help.
>> >> >> > > >
>> >> >> > > > We also maintain a file by name "BDB", just like the "Sample"
>> >> file
>> >> >> for
>> >> >> > > > tracing used by Blur.
>> >> >> > > >
>> >> >> > > > This "BDB" file pertains to TokyoCabinet and is used purely
>> for
>> >> >> > > supporting
>> >> >> > > > partial updates to a document.
>> >> >> > > > All operations on this file rely on local file-paths only,
>> >> through
>> >> >> the
>> >> >> > > use
>> >> >> > > > of native code.
>> >> >> > > > Currently, all update requests are local to the index files
>> and
>> >> it
>> >> >> > > becomes
>> >> >> > > > trivial to support.
>> >> >> > > >
>> >> >> > > > Any pointers on how to take this forward in Blur set-up of
>> >> >> > shard-servers
>> >> >> > > &
>> >> >> > > > controllers?
>> >> >> > > >
>> >> >> > > > --
>> >> >> > > > Ravi
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > On Tue, Oct 1, 2013 at 10:15 PM, Aaron McCurry <
>> >> [email protected]>
>> >> >> > > wrote:
>> >> >> > > >
>> >> >> > > > > You can control the fields to warmup via:
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > >
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_TableDescriptor
>> >> >> > > > >
>> >> >> > > > > The preCacheCols field.  The comment is wrong however, so I
>> >> will
>> >> >> > > create a
>> >> >> > > > > task to correct.  The use of the field is: "family.column"
>> just
>> >> >> like
>> >> >> > > you
>> >> >> > > > > would search.
>> >> >> > > > >
>> >> >> > > > > Aaron
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > On Tue, Oct 1, 2013 at 12:41 PM, Ravikumar Govindarajan <
>> >> >> > > > > [email protected]> wrote:
>> >> >> > > > >
>> >> >> > > > > > Thanks Aaron
>> >> >> > > > > >
>> >> >> > > > > > General sampling and warming is fine and the code is
>> really
>> >> >> concise
>> >> >> > > and
>> >> >> > > > > > clear.
>> >> >> > > > > >
>> >> >> > > > > >  The act of reading
>> >> >> > > > > > brings the data into the block cache and the result is
>> that
>> >> the
>> >> >> > index
>> >> >> > > > is
>> >> >> > > > > > "hot".
>> >> >> > > > > >
>> >> >> > > > > > Will all the terms of a field be read and brought into the
>> >> >> cache?
>> >> >> > If
>> >> >> > > > so,
>> >> >> > > > > > then it has an obvious implication to avoid fields like,
>> say
>> >> >> > > > > > attachment-data from warming up, provided queries don't
>> often
>> >> >> > include
>> >> >> > > > > such
>> >> >> > > > > > fields
>> >> >> > > > > >
>> >> >> > > > > >
>> >> >> > > > > > On Tue, Oct 1, 2013 at 7:58 PM, Aaron McCurry <
>> >> >> [email protected]>
>> >> >> > > > > wrote:
>> >> >> > > > > >
>> >> >> > > > > > > Take a look at this package.
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > >
>> >> >> > > > >
>> >> >> > > >
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>> https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=tree;f=blur-store/src/main/java/org/apache/blur/lucene/warmup;h=f4239b1947965dc7fe8218eaa16e3f39ecffdda0;hb=apache-blur-0.2
>> >> >> > > > > > >
>> >> >> > > > > > > Basically when the warmup process starts (which is
>> >> >> asynchronous
>> >> >> > to
>> >> >> > > > the
>> >> >> > > > > > rest
>> >> >> > > > > > > of the application) it flips a thread local switch to
>> allow
>> >> >> for
>> >> >> > > > tracing
>> >> >> > > > > > of
>> >> >> > > > > > > the file accesses.  The sampler will sample each of the
>> >> >> fields in
>> >> >> > > > each
>> >> >> > > > > > > segment and create a sample file that attempts to detect
>> >> the
>> >> >> > > > boundaries
>> >> >> > > > > > of
>> >> >> > > > > > > each field within each file within each segment.  Then
>> it
>> >> >> stores
>> >> >> > > the
>> >> >> > > > > > sample
>> >> >> > > > > > > info into the directory beside each segment (so that
>> way it
>> >> >> > doesn't
>> >> >> > > > > have
>> >> >> > > > > > to
>> >> >> > > > > > > re-sample the segment).  After the sampling is complete
>> or
>> >> >> > loaded,
>> >> >> > > > the
>> >> >> > > > > > > warmup just reads the binary data from each file.  The
>> act
>> >> of
>> >> >> > > reading
>> >> >> > > > > > > brings the data into the block cache and the result is
>> that
>> >> >> the
>> >> >> > > index
>> >> >> > > > > is
>> >> >> > > > > > > "hot".
>> >> >> > > > > > >
>> >> >> > > > > > > Hope this helps.
>> >> >> > > > > > >
>> >> >> > > > > > > Aaron
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > > > On Tue, Oct 1, 2013 at 10:09 AM, Ravikumar Govindarajan
>> <
>> >> >> > > > > > > [email protected]> wrote:
>> >> >> > > > > > >
>> >> >> > > > > > > > As I understand,
>> >> >> > > > > > > >
>> >> >> > > > > > > > Lucene will store the files in following way
>> per-segment
>> >> >> > > > > > > >
>> >> >> > > > > > > > TIM file
>> >> >> > > > > > > >      Field1 ---> Some byte[]
>> >> >> > > > > > > >      Field2 ---> Some byte[]
>> >> >> > > > > > > >
>> >> >> > > > > > > > TIP file
>> >> >> > > > > > > >      Field1 ---> Some byte[]
>> >> >> > > > > > > >      Field2 ---> Some byte[]
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Blur will "sample" this lucene-file in the following
>> way
>> >> >> > > > > > > >
>> >> >> > > > > > > > Field1 --> <TIM, start-offset>, <TIP, start-offset>,
>> ...
>> >> >> > > > > > > >
>> >> >> > > > > > > > Field 2 --> <TIM, start-offset>, <TIP, start-offset>,
>> ...
>> >> >> > > > > > > >
>> >> >> > > > > > > > Is my understanding correct?
>> >> >> > > > > > > >
>> >> >> > > > > > > > How does Blur warm-up the fields, when it does not
>> know
>> >> the
>> >> >> > > > > > "end-offset"
>> >> >> > > > > > > or
>> >> >> > > > > > > > the "length" for each field to warm?
>> >> >> > > > > > > >
>> >> >> > > > > > > > Will it by default read all Terms of a field?
>> >> >> > > > > > > >
>> >> >> > > > > > > > --
>> >> >> > > > > > > > Ravi
>> >> >> > > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > >
>> >> >> > > > >
>> >> >> > > >
>> >> >> > >
>> >> >> >
>> >> >>
>> >> >
>> >> >
>> >>
>> >
>> >
>>
>
>
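
On the end-offset question in the oldest message quoted above: one way a
sampler could get by with start offsets alone is to treat the next field's
start as the current field's end, and the file length as the end of the last
field. The sketch below shows only that derivation. It is an assumption, not
a description of Blur's sampler: FieldRanges, Range and fromStartOffsets are
hypothetical names, and it presumes each field occupies one contiguous region
of the file with a distinct start offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: turns per-field start offsets into byte ranges. */
final class FieldRanges {
    /** Byte range [start, end) of one field inside one file. */
    static final class Range {
        final String field;
        final long start;
        final long end;

        Range(String field, long start, long end) {
            this.field = field;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Derives [start, end) ranges for the fields of a single file.
     * Assumes contiguous, non-overlapping fields with distinct start offsets.
     */
    static List<Range> fromStartOffsets(Map<String, Long> startOffsets, long fileLength) {
        // Order the fields by where they begin in the file.
        TreeMap<Long, String> byOffset = new TreeMap<>();
        for (Map.Entry<String, Long> e : startOffsets.entrySet()) {
            byOffset.put(e.getValue(), e.getKey());
        }
        List<Range> ranges = new ArrayList<>();
        for (Map.Entry<Long, String> e : byOffset.entrySet()) {
            // The next field's start closes this field; the last field runs to end of file.
            Long next = byOffset.higherKey(e.getKey());
            long end = (next == null) ? fileLength : next;
            ranges.add(new Range(e.getValue(), e.getKey(), end));
        }
        return ranges;
    }
}
```

A warmup pass could then read each range sequentially to pull only the
wanted fields into the block cache.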
