Jason:
        I would really appreciate it if you would stop making false
statements and spreading misinformation. Everyone is entitled to his/her
opinions on technologies, but deliberately spreading misleading and false
information on a distribution list like this is simply unethical, and you
will only end up discrediting yourself.

        Making unsubstantiated comments while being unwilling to put in
any effort is the primary reason you are no longer working at LinkedIn or
on Zoie.

"The problem
with this is, merging in the background becomes really tricky
unless it's performed inside of IndexWriter" - *what does this really mean?
Merging happens regardless in an incremental indexing system. Especially
with high indexing load, segments are created often, merging is crucial.*
"There is the Zoie system which uses the RAMDir
solution, however it's implemented using a customized deleted
doc set based on a bloomfilter backed by an inefficient RB tree
which slows down queries"  -* if you ever spend the time to read the code,
(even when you were working on it), it is just not true. We did have an RB
set for deleted docs, quite a few releases ago, and we changed to a special
type of bloomfilter set backed by a hash int set. You knew this and was part
of the discussion on it, and now saying such a thing is just plain
disappointing.*
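
To make the structure concrete, here is a minimal sketch (these are not
Zoie's actual classes; the hash functions are made up, and the
java.util.HashSet stands in for the primitive int hash set we actually
use) of a bloom filter fronting a deleted-doc set, so the common case, a
doc that is not deleted, is answered by a few bit probes:

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class BloomDeletedSet {
    private final BitSet bloom;
    private final int bloomBits;
    private final Set<Integer> deleted = new HashSet<Integer>();

    public BloomDeletedSet(int bloomBits) {
        this.bloomBits = bloomBits;
        this.bloom = new BitSet(bloomBits);
    }

    // Two cheap hash functions (placeholders for this sketch).
    private int idx1(int docId) {
        return ((docId * 0x9E3779B9) >>> 1) % bloomBits;
    }
    private int idx2(int docId) {
        return ((docId ^ (docId >>> 13)) & 0x7fffffff) % bloomBits;
    }

    public void delete(int docId) {
        bloom.set(idx1(docId));
        bloom.set(idx2(docId));
        deleted.add(docId);
    }

    public boolean isDeleted(int docId) {
        // A bloom miss means "definitely not deleted", so most lookups
        // never touch the exact set.
        if (!bloom.get(idx1(docId)) || !bloom.get(idx2(docId))) {
            return false;
        }
        return deleted.contains(docId);
    }
}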

        Thanks, Jake, for the clarification. Eric, let me know if you
want to know in more detail how we are handling realtime indexing/search
with Zoie here at LinkedIn, in a production environment powering a real
internet company with real traffic.

-John

On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen
<jason.rutherg...@gmail.com> wrote:

> Eric,
>
> Katta doesn't require HDFS, which would be slow to search on,
> though Katta can be used to copy indexes out of HDFS onto local
> servers. The best bet is hardware with SSDs, because merge and
> update latency will decrease greatly and there won't be the
> synchronous IO contention there is with hard drives. Also, IO
> caches get flushed as large merges occur, which means subsequent
> queries may hit the HD and slow down. With SSDs this is much
> less of an issue.
>
> Today near-realtime search (with or without SSDs) comes at a
> price: reduced indexing speed due to continual in-RAM merging.
> People typically hack something together where indexes are held
> in a RAMDir until they are flushed to disk. The problem with
> this is that merging in the background becomes really tricky
> unless it's performed inside IndexWriter (see LUCENE-1313 and
> IW.getReader). There is the Zoie system, which uses the RAMDir
> solution; however, it's implemented using a customized
> deleted-doc set based on a bloom filter backed by an inefficient
> RB tree, which slows down queries. Currently there's always a
> trade-off when trying to build an NRT system.
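>
> For reference, here's a minimal sketch of the IW.getReader
> approach on Lucene 2.9 (the directory path and field names are
> just illustrative):
>
> import java.io.File;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.util.Version;
>
> public class NrtSketch {
>     public static void main(String[] args) throws Exception {
>         FSDirectory dir = FSDirectory.open(new File("/tmp/nrt-index"));
>         IndexWriter writer = new IndexWriter(dir,
>             new StandardAnalyzer(Version.LUCENE_29),
>             IndexWriter.MaxFieldLength.UNLIMITED);
>
>         Document doc = new Document();
>         doc.add(new Field("body", "hello nrt",
>             Field.Store.YES, Field.Index.ANALYZED));
>         writer.addDocument(doc);
>
>         // getReader() flushes the RAM buffer and returns a reader
>         // that sees the new doc without waiting for a full commit;
>         // merging stays inside IndexWriter.
>         IndexReader reader = writer.getReader();
>         IndexSearcher searcher = new IndexSearcher(reader);
>         // ... run queries against searcher ...
>         searcher.close();
>         reader.close();
>         writer.close();
>     }
> }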
>
> Also, there isn't a clear way to replicate segments in realtime,
> so people usually end up analyzing documents on each replicated
> node, which is redundant. A long-term solution here could be a
> distributed transaction log where encoded segments are stored
> and replicated to N nodes.
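>
> Purely as a sketch of that hypothetical design (no such API
> exists in Lucene or Katta today):
>
> // Each indexing node appends encoded segments; replicas replay
> // the log instead of re-analyzing every document.
> public interface SegmentLog {
>     long append(byte[] encodedSegment); // replicated to N nodes
>     byte[] read(long offset);           // replayed on each replica
> }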
>
> Deletes can pile up in segments, so the
> BalancedSegmentMergePolicy could be used to remove them faster
> than LogMergePolicy; however, I haven't tested it, and it may
> try to avoid large segment merges altogether, which IMO is less
> than ideal because query performance soon degrades (similar to
> an unoptimized index).
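>
> For example, continuing with the writer from the earlier sketch
> (BalancedSegmentMergePolicy lives in contrib/misc and its
> constructor signature varies by version, so treat this as a
> sketch, not gospel):
>
> // Swap in the contrib merge policy to reclaim deletes faster;
> // the 2.9-era constructor takes the writer.
> writer.setMergePolicy(new BalancedSegmentMergePolicy(writer));
>
> // Or, as a blunt one-off, merge away accumulated deletes now:
> writer.expungeDeletes();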
>
> Hopefully in the future we can offer searching over
> IndexWriter's RAM buffer, where indexing and search speed would
> be roughly what they are today. That, combined with a way to
> ensure segments don't get flushed out of the IO cache during
> large segment merges, would mean really efficient NRT, even on
> systems with HDs. In the interim, you'd need to play around and
> see what works for your requirements.
>
> -J
>
> On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ean...@business.com> wrote:
> >
> > Does anyone have any recommendations? I've looked at Katta, but it
> > doesn't seem to support realtime searching. It also uses HDFS, which
> > I've heard can be slow. I'm looking to serve 40 GB of indexes and
> > support about 1 million updates per day.
> >
> > Thx
> >
>