We solve problems like this in two ways: by adding queueing, or by adding
concurrent request limits.

Queueing buys you retries for free and can absorb temporary shocks. You can
also get things like priority, backlog monitoring, and manual backlog
grooming. I think logstash already supports this, but I don't know it very
well.
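
To make the idea concrete, here is a rough sketch of a bounded ingest queue
sitting in front of the indexer, with retries and backoff on the consumer
side. Names like send_bulk() are placeholders for whatever bulk-indexing
call you actually use, not a real Elasticsearch client API:

```python
import queue
import time

# Bounded queue: absorbs temporary shocks, but caps the backlog so a
# misbehaving service cannot grow it without limit.
log_queue = queue.Queue(maxsize=10000)

def enqueue(doc):
    """Producer side: block briefly, then drop if the backlog is full."""
    try:
        log_queue.put(doc, timeout=0.1)
        return True
    except queue.Full:
        return False  # backlog full; the caller can count/report the drop

def drain(send_bulk, batch_size=500, max_retries=3):
    """Consumer side: batch queued documents and retry failed bulk requests."""
    batch = []
    while not log_queue.empty() and len(batch) < batch_size:
        batch.append(log_queue.get())
    for attempt in range(max_retries):
        try:
            send_bulk(batch)
            return len(batch)  # number of documents indexed
        except IOError:
            time.sleep(2 ** attempt)  # back off before retrying
    return 0  # gave up; the batch could go to a dead-letter store instead
```

This is where the "retries for free" comes from: a failed bulk request just
leaves the batch in hand to be retried, instead of losing the documents.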

Concurrent request limits are more brutal. You just throw away requests to
index if there are too many in flight. You can make it more granular by
giving each incoming application its own pool and limits. We implement
these using a simple server called poolcounter. You can find it by
searching for WMF poolcounterd.
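
The per-application pool idea can be sketched with a semaphore per client.
This is just the concept, not poolcounterd's actual protocol; the class and
limit here are made up for illustration:

```python
import threading

class RequestLimiter:
    """Per-client concurrent request limits: each incoming application gets
    its own pool, and requests over the limit are simply rejected."""

    def __init__(self, default_limit=10):
        self.default_limit = default_limit
        self.pools = {}
        self.lock = threading.Lock()

    def _pool(self, client):
        # Lazily create one bounded semaphore per client name.
        with self.lock:
            if client not in self.pools:
                self.pools[client] = threading.BoundedSemaphore(self.default_limit)
            return self.pools[client]

    def try_acquire(self, client):
        """True if the client may index now; False means throw the request away."""
        return self._pool(client).acquire(blocking=False)

    def release(self, client):
        """Call after the index request finishes to free the slot."""
        self._pool(client).release()
```

The caller wraps each index request in try_acquire()/release(), so one noisy
service exhausts only its own pool and everyone else keeps indexing.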

Either way, you would have to implement a small application to get these
integrated. Well, maybe someone has already built the queueing one; I don't
know.

Nik
On Dec 13, 2014 11:21 PM, "Konstantin Erman" <kon...@gmail.com> wrote:

> I don't crawl the web; I just collect rather verbose logs from multiple
> private cloud services and try to keep the ES cluster just big enough for
> comfortable searching of those logs. The monitored services are under
> development, and occasionally (because of bugs or specifics of the source
> data) they start to send a torrent of logs orders of magnitude higher than
> usual. When this happens, the ES cluster soon becomes non-responsive and
> drops logs from all services, badly behaving or not.
>
> We cannot afford to keep a cluster sized to handle those peak loads (and
> idling most of the time). What we need instead is some kind of
> denial-of-service prevention logic: when a client goes over its quota of
> logs, it should be blocked, rather than melting the cluster down.
>
> A river plugin looks like overkill to me, especially considering the
> deprecation of rivers.
>
> On Saturday, December 13, 2014 7:33:05 PM UTC-8, BillyEm wrote:
>>
>> Why are you putting business logic of this type in ES? It belongs in your
>> workflow. At the ES indexer level you will have no idea of the source of
>> truth of the questionable content. Unless you're web crawling, which means
>> you're using the wrong search platform altogether imo.
>>
>> On Friday, December 12, 2014 5:11:05 PM UTC-5, Konstantin Erman wrote:
>>>
>>> I noticed that occasionally I need to shield my ES cluster from certain
>>> documents, which are too numerous, too big, or otherwise poison ES.
>>> Usually I can formulate a pretty simple query or set of criteria to detect
>>> those documents, and I'm looking for a way to block them from entering the
>>> index.
>>>
>>> Is there such a pre-indexing filtering mechanism? Maybe Transforms can be
>>> used for that purpose?
>>>
>>> Thank you!
>>> Konstantin
>>>
>>>
>>>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/26556df6-a2a5-495f-bb23-95b5bd0fa63b%40googlegroups.com
> <https://groups.google.com/d/msgid/elasticsearch/26556df6-a2a5-495f-bb23-95b5bd0fa63b%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
