Re: How to diagnose slow queries every 10 minutes exactly?

2015-04-20 Thread Dave Reed
The logs are basically empty. There's activity from when I am creating the 
indexes, but that's about it. Is there a logging level that could be 
increased? I will run hot_threads as soon as I can in the morning and post 
the results.

I have each index set to 1 replica. If it matters, I first import them with 
0 replicas then set it to 1 when they are done. Shard count I will have to 
check and get back to you, but it's whatever the default would be, we 
haven't tweaked that. 
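
In case it's useful, this is roughly how I plan to check it (assuming the 
default HTTP port and a 1.x cluster where the _cat API is available; the 
index name is just an example):

curl 'localhost:9200/_cat/indices?v'
curl 'localhost:9200/_cat/shards/project-index-example?v'

The first lists primary shard and replica counts per index; the second shows 
where each shard copy currently lives.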

I have 200+ indexes because I have 200+ different "projects" represented, 
and each one has its own set of mappings -- mappings which could collide on 
field names. I originally tried having a single index with each project 
represented by a Type instead, but a conversation I had on this forum 
led me away from that, because two different projects may sometimes have 
the same field names but with different types. Most searches (99.9%) are 
done on a per-project basis, and changes to a project can create a need to 
reindex it, so having the segregation is nice, lest I have to reimport the 
entire thing.

In case this is related, I also found that after a document was changed and 
reindexed, searches would sometimes include it and sometimes not. I could 
literally refresh the search over and over again, and it would appear in 
the results roughly 50% of the time: 0 results, then 0, 0, then 1 result, 
1, 1, then 0 again, etc. I was running the search against one of the nodes 
via _head. The cluster health was green during all of this. That was 
surprising and something I was going to investigate more on my own, but 
perhaps these two problems are related.
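
One thing I plan to try, assuming the preference parameter behaves as 
documented, is pinning the exact same search to the primary copy and then to 
a replica to see whether one of them is missing the document (the index name 
and id below are placeholders):

curl 'localhost:9200/project-index-example/_search?preference=_primary' -d '{ "query": { "ids": { "values": ["123"] } } }'
curl 'localhost:9200/project-index-example/_search?preference=_replica' -d '{ "query": { "ids": { "values": ["123"] } } }'

If one copy consistently returns the document and the other doesn't, that 
would explain the roughly 50/50 flapping.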

I'm really hitting the limit of what I know how to troubleshoot with ES, 
hence I am really hoping for help here :) 


On Monday, April 20, 2015 at 10:23:46 PM UTC-7, David Pilato wrote:
>
> Could you run a hot_threads API call when this happens?
> Anything in logs about GC?
>
> BTW 200 indices is a lot for 2 nodes. And how many shards/replicas do you 
> have?
> Why do you need so many indices for 2m docs?
>
>
> David
>
> On 21 Apr 2015, at 01:16, Dave Reed wrote:
>
> I have a 2-node cluster running on some beefy machines. 12g and 16g of 
> heap space. About 2.1 million documents, each relatively small in size, 
> spread across 200 or so indexes. The refresh interval is 0.5s (while I 
> don't need realtime I do need relatively quick refreshes). Documents are 
> continuously modified by the app, so reindex requests trickle in 
> constantly. By trickle I mean maybe a dozen a minute. All index requests 
> are made with _bulk, although a lot of the time there's only 1 in the list.
>
> Searches are very fast -- normally taking 50ms or less.
>
> But oddly, exactly every 10 minutes, searches slow down for a moment. The 
> exact same query that normally takes <50ms takes 9000ms, for example. Any 
> other queries regardless of what they are also take multiple seconds to 
> complete. Once this moment passes, search queries return to normal.
>
> I have a tester I wrote that continuously posts the same query and logs 
> the results, which is how I discovered this pattern.
>
> Here's an excerpt. Notice that query time is great at 3:49:10, then at :11 
> things stop for 10 seconds. At :21 the queued up searches finally come 
> through. The numbers reported are the "took" field from the ES search 
> response. Then things resume as normal. This is true no matter which node I 
> run the search against.
>
> This pattern repeats like this every 10 minutes, to the second, for days 
> now.
>
> 3:49:09, 47
> 3:49:09, 63
> 3:49:10, 31
> 3:49:10, 47
> 3:49:11, 47
> 3:49:11, 62
> 3:49:21, 8456
> 3:49:21, 5460
> 3:49:21, 7457
> 3:49:21, 4509
> 3:49:21, 3510
> 3:49:21, 515
> 3:49:21, 1498
> 3:49:21, 2496
> 3:49:22, 2465
> 3:49:22, 2636
> 3:49:22, 6506
> 3:49:22, 7504
> 3:49:22, 9501
> 3:49:22, 4509
> 3:49:22, 1638
> 3:49:22, 6646
> 3:49:22, 9641
> 3:49:22, 655
> 3:49:22, 3667
> 3:49:22, 31
> 3:49:22, 78
> 3:49:22, 47
> 3:49:23, 47
> 3:49:23, 47
> 3:49:24, 93
>
> I've ruled out any obvious periodical process running on either node in 
> the cluster. It's pretty clean. Disk I/O, CPU, RAM, etc stays pretty 
> consistent and flat even during one of these blips. These are beefy 
> machines like I said, so there's plenty of CPU and RAM available.
>
> Any advice on how I can figure out what ES is waiting for would be 
> helpful. Is there any process ES runs every 10 minutes by default? 
>
> Thanks!
> -Dave
>
>
>

Re: Index Size and Replica Impact

2015-04-20 Thread Norberto Meijome
Replica = 3 means 4 copies of your data (for each shard, 1 primary and 3
replicas).
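
A rough sketch, with a placeholder index name: the disk footprint of an index 
is roughly the primary data size multiplied by (1 + number_of_replicas), and 
the replica count can be changed at any time:

curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index" : { "number_of_replicas" : 2 }
}'

On a 3-node cluster, number_of_replicas: 2 is enough for every node to hold a 
full copy of the data; a replica is never allocated on the same node as its 
primary, so replicas: 3 would leave one copy unassigned.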
On 21/04/2015 7:54 am, "TB"  wrote:

> My index size is currently about 6 GB with replicas set to 1.
> I have a 3-node cluster; in order to utilize the cluster, my understanding
> is that I would have to set the replicas to 3.
> If I do that, would my index size grow to more than 6 GB on each node?


Re: Index Size and Replica Impact

2015-04-20 Thread David Pilato
You don't have to set replicas to 3. It depends on the number of shards you 
have for your index.
If you are using the default (5), then you probably have today something like:

Node 1 : 4 shards
Node 2 : 3 shards
Node 3 : 3 shards

Each shard should be around 600 MB in size (if using all defaults).

What are your exact index settings today?
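
For example, something along these lines would show it (replace the index 
name with yours):

curl 'localhost:9200/yourindex/_settings?pretty'
curl 'localhost:9200/_cat/shards/yourindex?v'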

David

> On 20 Apr 2015, at 23:54, TB wrote:
> 
> My index size is currently about 6 GB with replicas set to 1.
> I have a 3-node cluster; in order to utilize the cluster, my understanding 
> is that I would have to set the replicas to 3.
> If I do that, would my index size grow to more than 6 GB on each node?
> 


Re: Bulk Index from Remote Host

2015-04-20 Thread David Pilato
That's fine, but you need to split your bulk into smaller bulk requests.
Don't send a 10 GB bulk in one call! :)
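
A rough sketch of what I mean, assuming the data is already in bulk format 
and every document takes exactly two lines (action line plus source line), so 
splitting on an even line count keeps the pairs intact; host and file names 
are placeholders:

split -l 10000 bulkdata.json chunk_
for f in chunk_*; do
  # --data-binary keeps the newlines that the _bulk API requires
  curl -s -XPOST 'http://your-es-host:9200/_bulk' --data-binary "@$f"
done

A few thousand documents or a few MB per request is usually a reasonable 
starting point; tune from there.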

David

> On 21 Apr 2015, at 00:40, TB wrote:
> 
> We are planning to bulk insert about 10 GB of data, however we are being forced 
> to do this from a remote host.
> Is this a good practice? And are there any potential issues I should watch 
> out for?
> 
> any advice would be great


Re: How to diagnose slow queries every 10 minutes exactly?

2015-04-20 Thread David Pilato
Could you run a hot_threads API call when this happens?
Anything in logs about GC?
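
Something like this while the slowdown is happening (default port assumed), 
plus the JVM section of node stats to see GC counts and times:

curl 'localhost:9200/_nodes/hot_threads?threads=10&interval=500ms'
curl 'localhost:9200/_nodes/stats/jvm?pretty'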

BTW 200 indices is a lot for 2 nodes. And how many shards/replicas do you have?
Why do you need so many indices for 2m docs?


David

> On 21 Apr 2015, at 01:16, Dave Reed wrote:
> 
> I have a 2-node cluster running on some beefy machines. 12g and 16g of heap 
> space. About 2.1 million documents, each relatively small in size, spread 
> across 200 or so indexes. The refresh interval is 0.5s (while I don't need 
> realtime I do need relatively quick refreshes). Documents are continuously 
> modified by the app, so reindex requests trickle in constantly. By trickle I 
> mean maybe a dozen a minute. All index requests are made with _bulk, although 
> a lot of the time there's only 1 in the list.
> 
> Searches are very fast -- normally taking 50ms or less.
> 
> But oddly, exactly every 10 minutes, searches slow down for a moment. The 
> exact same query that normally takes <50ms takes 9000ms, for example. Any 
> other queries regardless of what they are also take multiple seconds to 
> complete. Once this moment passes, search queries return to normal.
> 
> I have a tester I wrote that continuously posts the same query and logs the 
> results, which is how I discovered this pattern.
> 
> Here's an excerpt. Notice that query time is great at 3:49:10, then at :11 
> things stop for 10 seconds. At :21 the queued up searches finally come 
> through. The numbers reported are the "took" field from the ES search 
> response. Then things resume as normal. This is true no matter which node I 
> run the search against.
> 
> This pattern repeats like this every 10 minutes, to the second, for days now.
> 
> 3:49:09, 47
> 3:49:09, 63
> 3:49:10, 31
> 3:49:10, 47
> 3:49:11, 47
> 3:49:11, 62
> 3:49:21, 8456
> 3:49:21, 5460
> 3:49:21, 7457
> 3:49:21, 4509
> 3:49:21, 3510
> 3:49:21, 515
> 3:49:21, 1498
> 3:49:21, 2496
> 3:49:22, 2465
> 3:49:22, 2636
> 3:49:22, 6506
> 3:49:22, 7504
> 3:49:22, 9501
> 3:49:22, 4509
> 3:49:22, 1638
> 3:49:22, 6646
> 3:49:22, 9641
> 3:49:22, 655
> 3:49:22, 3667
> 3:49:22, 31
> 3:49:22, 78
> 3:49:22, 47
> 3:49:23, 47
> 3:49:23, 47
> 3:49:24, 93
> 
> I've ruled out any obvious periodical process running on either node in the 
> cluster. It's pretty clean. Disk I/O, CPU, RAM, etc stays pretty consistent 
> and flat even during one of these blips. These are beefy machines like I 
> said, so there's plenty of CPU and RAM available.
> 
> Any advice on how I can figure out what ES is waiting for would be helpful. 
> Is there any process ES runs every 10 minutes by default? 
> 
> Thanks!
> -Dave
> 
> 
> 


Re: 2.0 ETA

2015-04-20 Thread Matt Weber
Thanks Adrien!

On Mon, Apr 20, 2015 at 3:38 PM, Adrien Grand  wrote:

> Hi Matt,
>
> We have this meta issue which tracks what remains to be done before we
> release 2.0: https://github.com/elastic/elasticsearch/issues/9970. We
> plan to release as soon as we can but some of these issues are quite
> challenging so it's hard to give you an ETA. It should be a matter of
> months but I can't tell how many yet.
>
> However, even if the GA release might still take time, there will be beta
> releases before as we make progress through this checklist.
>
> Sorry if my answer does not give as much information as you hoped, we are
> all looking forward to this release and items on this checklist are very
> high priorities!
>
>
> On Mon, Apr 20, 2015 at 10:55 PM, Matt Weber  wrote:
>
>> Is there an ETA for 2.0?
>>
>> --
>> Thanks,
>> Matt Weber
>>
>
>
>
> --
> Adrien
>



Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
To focus the question better: how do I whitelist a specific class in the 
Groovy script inside a transform?

On Tuesday, April 21, 2015 at 1:18:03 AM UTC+3, Itai Frenkel wrote:
>
> Hi,
>
> We are having a performance problem in which for each hit, elasticsearch 
> parses the entire _source then generates a new Json with only the requested 
> query _source fields. In order to overcome this issue we would like to use 
> mapping transform script that serializes the requested query fields (which 
> is known in advance) into a doc_value. Does that makes sense?
>
> The actual problem with the transform script is  SecurityException that 
> does not allow using any json serialization mechanism. A binary 
> serialization would also be ok.
>
>
> Itai
>
>



Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
Hi Nik,

When _source: true, the time it takes for the search to complete in 
Elasticsearch is very short. When _source is a list of fields, it is 
significantly slower.
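
To be concrete, these are roughly the two request shapes I am comparing 
(index and field names are placeholders):

curl 'localhost:9200/myindex/_search' -d '{
  "query" : { "match_all" : {} },
  "_source" : true
}'

curl 'localhost:9200/myindex/_search' -d '{
  "query" : { "match_all" : {} },
  "_source" : ["fieldA", "fieldB"]
}'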

Itai

On Tuesday, April 21, 2015 at 3:06:06 AM UTC+3, Nikolas Everett wrote:
>
> Have you profiled it and seen that reading the source is actually the slow 
> part? hot_threads can lie here so I'd go with a profiler or just sigquit or 
> something.
>
> I've got some reasonably big documents and generally don't see that as a 
> problem even under decent load.
>
> I could see an argument for a second source field with the long stuff 
> removed if you see the json decode or the disk read of the source be really 
> slow - but transform doesn't do that.
>
> Nik
>
> On Mon, Apr 20, 2015 at 7:57 PM, Itai Frenkel  > wrote:
>
>> A quick check shows there is no significant performance gain between 
>> doc_value and stored field that is not a doc value. I suppose there are 
>> warm-up and file system caching issues are at play. I do not have that 
>> field in the source since the ETL process at this point does not generate 
>> it. The ETL could be fixed and then it will generate the required field. 
>> However, even then I would still prefer doc_field over _source since I do 
>> not need _source at all. You are right to assume that reading the entire 
>> source parsing it and returning only one field would be fast (since the cpu 
>> is in the json generator I suspect, and not the parser, but that requires 
>> more work).
>>
>>
>> On Tuesday, April 21, 2015 at 2:25:22 AM UTC+3, Itamar Syn-Hershko wrote:
>>>
>>> What if all those fields are collapsed to one, like you suggest, but 
>>> that one field is projected out of _source (think non-indexed json in a 
>>> string field)? do you see a noticable performance gain then?
>>>
>>> What if that field is set to be stored (and loaded using fields, not via 
>>> _source)? what is the performance gain then?
>>>
>>> Fielddata and the doc_values optimization on top of them will not help 
>>> you here, those data structures aren't being used for sending data out, 
>>> only for aggregations and sorting. Also, using fielddata will require 
>>> indexing those fields; it is apparent that you are not looking to be doing 
>>> that.
>>>
>>> --
>>>
>>> Itamar Syn-Hershko
>>> http://code972.com | @synhershko 
>>> Freelance Developer & Consultant
>>> Lucene.NET committer and PMC member
>>>
>>> On Tue, Apr 21, 2015 at 12:14 AM, Itai Frenkel  
>>> wrote:
>>>
 Itamar,

 1. The _source field includes many fields that are only being indexed, 
 and many fields that are only needed as a query search result. _source 
 includes them both.The projection from _source from the query result is 
 too 
 CPU intensive to do during search time for each result, especially if the 
 size is big. 
 2. I agree that adding another NoSQL could solve this problem, however 
 it is currently out of scope, as it would require syncing data with 
 another 
 data store.
 3. Wouldn't a big stored field will bloat the lucene index size? Even 
 if not, isn't non_analyzed fields are destined to be (or already are) 
 doc_fields?

 On Tuesday, April 21, 2015 at 1:36:20 AM UTC+3, Itamar Syn-Hershko 
 wrote:
>
> This is how _source works. doc_values don't make sense in this regard 
> - what you are looking for is using stored fields and have the transform 
> script write to that. Loading stored fields (even one field per hit) may 
> be 
> slower than loading and parsing _source, though.
>
> I'd just put this logic in the indexer, though. It will definitely 
> help with other things as well, such as nasty huge mappings.
>
> Alternatively, find a way to avoid IO completely. How about using ES 
> for search and something like riak for loading the actual data, if IO 
> costs 
> are so noticable?
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko 
> Freelance Developer & Consultant
> Lucene.NET committer and PMC member
>
> On Mon, Apr 20, 2015 at 11:18 PM, Itai Frenkel  
> wrote:
>
>> Hi,
>>
>> We are having a performance problem in which for each hit, 
>> elasticsearch parses the entire _source then generates a new Json with 
>> only 
>> the requested query _source fields. In order to overcome this issue we 
>> would like to use mapping transform script that serializes the requested 
>> query fields (which is known in advance) into a doc_value. Does that 
>> makes 
>> sense?
>>
>> The actual problem with the transform script is  SecurityException 
>> that does not allow using any json serialization mechanism. A binary 
>> serialization would also be ok.
>>
>>
>> Itai
>>

Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Nikolas Everett
Have you profiled it and seen that reading the source is actually the slow
part? hot_threads can lie here so I'd go with a profiler or just sigquit or
something.

I've got some reasonably big documents and generally don't see that as a
problem even under decent load.

I could see an argument for a second source field with the long stuff
removed if you see the JSON decode or the disk read of the source being really
slow - but transform doesn't do that.

Nik

On Mon, Apr 20, 2015 at 7:57 PM, Itai Frenkel  wrote:

> A quick check shows there is no significant performance gain between
> doc_value and stored field that is not a doc value. I suppose there are
> warm-up and file system caching issues are at play. I do not have that
> field in the source since the ETL process at this point does not generate
> it. The ETL could be fixed and then it will generate the required field.
> However, even then I would still prefer doc_field over _source since I do
> not need _source at all. You are right to assume that reading the entire
> source parsing it and returning only one field would be fast (since the cpu
> is in the json generator I suspect, and not the parser, but that requires
> more work).
>
>
> On Tuesday, April 21, 2015 at 2:25:22 AM UTC+3, Itamar Syn-Hershko wrote:
>>
>> What if all those fields are collapsed to one, like you suggest, but that
>> one field is projected out of _source (think non-indexed json in a string
>> field)? do you see a noticable performance gain then?
>>
>> What if that field is set to be stored (and loaded using fields, not via
>> _source)? what is the performance gain then?
>>
>> Fielddata and the doc_values optimization on top of them will not help
>> you here, those data structures aren't being used for sending data out,
>> only for aggregations and sorting. Also, using fielddata will require
>> indexing those fields; it is apparent that you are not looking to be doing
>> that.
>>
>> --
>>
>> Itamar Syn-Hershko
>> http://code972.com | @synhershko 
>> Freelance Developer & Consultant
>> Lucene.NET committer and PMC member
>>
>> On Tue, Apr 21, 2015 at 12:14 AM, Itai Frenkel  wrote:
>>
>>> Itamar,
>>>
>>> 1. The _source field includes many fields that are only being indexed,
>>> and many fields that are only needed as a query search result. _source
>>> includes them both.The projection from _source from the query result is too
>>> CPU intensive to do during search time for each result, especially if the
>>> size is big.
>>> 2. I agree that adding another NoSQL could solve this problem, however
>>> it is currently out of scope, as it would require syncing data with another
>>> data store.
>>> 3. Wouldn't a big stored field will bloat the lucene index size? Even if
>>> not, isn't non_analyzed fields are destined to be (or already are)
>>> doc_fields?
>>>
>>> On Tuesday, April 21, 2015 at 1:36:20 AM UTC+3, Itamar Syn-Hershko wrote:

 This is how _source works. doc_values don't make sense in this regard -
 what you are looking for is using stored fields and have the transform
 script write to that. Loading stored fields (even one field per hit) may be
 slower than loading and parsing _source, though.

 I'd just put this logic in the indexer, though. It will definitely help
 with other things as well, such as nasty huge mappings.

 Alternatively, find a way to avoid IO completely. How about using ES
 for search and something like riak for loading the actual data, if IO costs
 are so noticable?

 --

 Itamar Syn-Hershko
 http://code972.com | @synhershko 
 Freelance Developer & Consultant
 Lucene.NET committer and PMC member

 On Mon, Apr 20, 2015 at 11:18 PM, Itai Frenkel 
 wrote:

> Hi,
>
> We are having a performance problem in which for each hit,
> elasticsearch parses the entire _source then generates a new Json with 
> only
> the requested query _source fields. In order to overcome this issue we
> would like to use mapping transform script that serializes the requested
> query fields (which is known in advance) into a doc_value. Does that makes
> sense?
>
> The actual problem with the transform script is  SecurityException
> that does not allow using any json serialization mechanism. A binary
> serialization would also be ok.
>
>
> Itai
>

Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
A quick check shows no significant performance gain between a doc_value 
field and a stored field that is not a doc_value. I suppose warm-up and 
file-system caching issues are at play. I do not have that field in the 
source, since the ETL process does not currently generate it. The ETL could 
be fixed so that it generates the required field. However, even then I 
would still prefer a doc_value field over _source, since I do not need 
_source at all. You are right to assume that reading the entire source, 
parsing it, and returning only one field would be fast (I suspect the CPU 
time is in the JSON generator, not the parser, but that requires more work).


On Tuesday, April 21, 2015 at 2:25:22 AM UTC+3, Itamar Syn-Hershko wrote:
>
> What if all those fields are collapsed to one, like you suggest, but that 
> one field is projected out of _source (think non-indexed json in a string 
> field)? do you see a noticable performance gain then?
>
> What if that field is set to be stored (and loaded using fields, not via 
> _source)? what is the performance gain then?
>
> Fielddata and the doc_values optimization on top of them will not help you 
> here, those data structures aren't being used for sending data out, only 
> for aggregations and sorting. Also, using fielddata will require indexing 
> those fields; it is apparent that you are not looking to be doing that.
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko 
> Freelance Developer & Consultant
> Lucene.NET committer and PMC member
>
> On Tue, Apr 21, 2015 at 12:14 AM, Itai Frenkel  > wrote:
>
>> Itamar,
>>
>> 1. The _source field includes many fields that are only being indexed, 
>> and many fields that are only needed as a query search result. _source 
>> includes them both.The projection from _source from the query result is too 
>> CPU intensive to do during search time for each result, especially if the 
>> size is big. 
>> 2. I agree that adding another NoSQL could solve this problem, however it 
>> is currently out of scope, as it would require syncing data with another 
>> data store.
>> 3. Wouldn't a big stored field will bloat the lucene index size? Even if 
>> not, isn't non_analyzed fields are destined to be (or already are) 
>> doc_fields?
>>
>> On Tuesday, April 21, 2015 at 1:36:20 AM UTC+3, Itamar Syn-Hershko wrote:
>>>
>>> This is how _source works. doc_values don't make sense in this regard - 
>>> what you are looking for is using stored fields and have the transform 
>>> script write to that. Loading stored fields (even one field per hit) may be 
>>> slower than loading and parsing _source, though.
>>>
>>> I'd just put this logic in the indexer, though. It will definitely help 
>>> with other things as well, such as nasty huge mappings.
>>>
>>> Alternatively, find a way to avoid IO completely. How about using ES for 
>>> search and something like riak for loading the actual data, if IO costs are 
>>> so noticable?
>>>
>>> --
>>>
>>> Itamar Syn-Hershko
>>> http://code972.com | @synhershko 
>>> Freelance Developer & Consultant
>>> Lucene.NET committer and PMC member
>>>
>>> On Mon, Apr 20, 2015 at 11:18 PM, Itai Frenkel  
>>> wrote:
>>>
 Hi,

 We are having a performance problem in which for each hit, 
 elasticsearch parses the entire _source then generates a new Json with 
 only 
 the requested query _source fields. In order to overcome this issue we 
 would like to use mapping transform script that serializes the requested 
 query fields (which is known in advance) into a doc_value. Does that makes 
 sense?

 The actual problem with the transform script is  SecurityException that 
 does not allow using any json serialization mechanism. A binary 
 serialization would also be ok.


 Itai


Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itamar Syn-Hershko
What if all those fields are collapsed to one, like you suggest, but that
one field is projected out of _source (think non-indexed JSON in a string
field)? Do you see a noticeable performance gain then?

What if that field is set to be stored (and loaded using fields, not via
_source)? What is the performance gain then?

Fielddata and the doc_values optimization on top of it will not help you
here; those data structures aren't used for sending data out, only
for aggregations and sorting. Also, using fielddata would require indexing
those fields, and it is apparent that you are not looking to do that.

--

Itamar Syn-Hershko
http://code972.com | @synhershko 
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Tue, Apr 21, 2015 at 12:14 AM, Itai Frenkel  wrote:

> Itamar,
>
> 1. The _source field includes many fields that are only being indexed, and
> many fields that are only needed as a query search result. _source includes
> them both.The projection from _source from the query result is too CPU
> intensive to do during search time for each result, especially if the size
> is big.
> 2. I agree that adding another NoSQL could solve this problem, however it
> is currently out of scope, as it would require syncing data with another
> data store.
> 3. Wouldn't a big stored field will bloat the lucene index size? Even if
> not, isn't non_analyzed fields are destined to be (or already are)
> doc_fields?
>
> On Tuesday, April 21, 2015 at 1:36:20 AM UTC+3, Itamar Syn-Hershko wrote:
>>
>> This is how _source works. doc_values don't make sense in this regard -
>> what you are looking for is using stored fields and have the transform
>> script write to that. Loading stored fields (even one field per hit) may be
>> slower than loading and parsing _source, though.
>>
>> I'd just put this logic in the indexer, though. It will definitely help
>> with other things as well, such as nasty huge mappings.
>>
>> Alternatively, find a way to avoid IO completely. How about using ES for
>> search and something like riak for loading the actual data, if IO costs are
>> so noticable?
>>
>> --
>>
>> Itamar Syn-Hershko
>> http://code972.com | @synhershko 
>> Freelance Developer & Consultant
>> Lucene.NET committer and PMC member
>>
>> On Mon, Apr 20, 2015 at 11:18 PM, Itai Frenkel  wrote:
>>
>>> Hi,
>>>
>>> We are having a performance problem in which for each hit, elasticsearch
>>> parses the entire _source then generates a new Json with only the requested
>>> query _source fields. In order to overcome this issue we would like to use
>>> mapping transform script that serializes the requested query fields (which
>>> is known in advance) into a doc_value. Does that makes sense?
>>>
>>> The actual problem with the transform script is  SecurityException that
>>> does not allow using any json serialization mechanism. A binary
>>> serialization would also be ok.
>>>
>>>
>>> Itai
>>>


Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
Also - does "fielddata": { "loading": "eager" } make sense with 
doc_values in this use case? Would that combination be supported in the 
future?

On Tuesday, April 21, 2015 at 2:14:03 AM UTC+3, Itai Frenkel wrote:
>
> Itamar,
>
> 1. The _source field includes many fields that are only being indexed, and 
> many fields that are only needed as a query search result. _source includes 
> them both.The projection from _source from the query result is too CPU 
> intensive to do during search time for each result, especially if the size 
> is big. 
> 2. I agree that adding another NoSQL could solve this problem, however it 
> is currently out of scope, as it would require syncing data with another 
> data store.
> 3. Wouldn't a big stored field will bloat the lucene index size? Even if 
> not, isn't non_analyzed fields are destined to be (or already are) 
> doc_fields?
>
> On Tuesday, April 21, 2015 at 1:36:20 AM UTC+3, Itamar Syn-Hershko wrote:
>>
>> This is how _source works. doc_values don't make sense in this regard - 
>> what you are looking for is using stored fields and have the transform 
>> script write to that. Loading stored fields (even one field per hit) may be 
>> slower than loading and parsing _source, though.
>>
>> I'd just put this logic in the indexer, though. It will definitely help 
>> with other things as well, such as nasty huge mappings.
>>
>> Alternatively, find a way to avoid IO completely. How about using ES for 
>> search and something like riak for loading the actual data, if IO costs are 
>> so noticable?
>>
>> --
>>
>> Itamar Syn-Hershko
>> http://code972.com | @synhershko 
>> Freelance Developer & Consultant
>> Lucene.NET committer and PMC member
>>
>> On Mon, Apr 20, 2015 at 11:18 PM, Itai Frenkel  wrote:
>>
>>> Hi,
>>>
>>> We are having a performance problem in which for each hit, elasticsearch 
>>> parses the entire _source then generates a new Json with only the requested 
>>> query _source fields. In order to overcome this issue we would like to use 
>>> mapping transform script that serializes the requested query fields (which 
>>> is known in advance) into a doc_value. Does that makes sense?
>>>
>>> The actual problem with the transform script is  SecurityException that 
>>> does not allow using any json serialization mechanism. A binary 
>>> serialization would also be ok.
>>>
>>>
>>> Itai
>>>


How to diagnose slow queries every 10 minutes exactly?

2015-04-20 Thread Dave Reed
I have a 2-node cluster running on some beefy machines. 12g and 16g of heap 
space. About 2.1 million documents, each relatively small in size, spread 
across 200 or so indexes. The refresh interval is 0.5s (while I don't need 
realtime I do need relatively quick refreshes). Documents are continuously 
modified by the app, so reindex requests trickle in constantly. By trickle 
I mean maybe a dozen a minute. All index requests are made with _bulk, 
although a lot of the time there's only 1 in the list.

Searches are very fast -- normally taking 50ms or less.

But oddly, exactly every 10 minutes, searches slow down for a moment. The 
exact same query that normally takes <50ms takes 9000ms, for example. Any 
other queries regardless of what they are also take multiple seconds to 
complete. Once this moment passes, search queries return to normal.

I have a tester I wrote that continuously posts the same query and logs the 
results, which is how I discovered this pattern.

Here's an excerpt. Notice that query time is great at 3:49:10, then at :11 
things stop for 10 seconds. At :21 the queued up searches finally come 
through. The numbers reported are the "took" field from the ES search 
response. Then things resume as normal. This is true no matter which node I 
run the search against.

This pattern repeats like this every 10 minutes, to the second, for days 
now.

3:49:09, 47
3:49:09, 63
3:49:10, 31
3:49:10, 47
3:49:11, 47
3:49:11, 62
3:49:21, 8456
3:49:21, 5460
3:49:21, 7457
3:49:21, 4509
3:49:21, 3510
3:49:21, 515
3:49:21, 1498
3:49:21, 2496
3:49:22, 2465
3:49:22, 2636
3:49:22, 6506
3:49:22, 7504
3:49:22, 9501
3:49:22, 4509
3:49:22, 1638
3:49:22, 6646
3:49:22, 9641
3:49:22, 655
3:49:22, 3667
3:49:22, 31
3:49:22, 78
3:49:22, 47
3:49:23, 47
3:49:23, 47
3:49:24, 93

I've ruled out any obvious periodical process running on either node in the 
cluster. It's pretty clean. Disk I/O, CPU, RAM, etc stays pretty consistent 
and flat even during one of these blips. These are beefy machines like I 
said, so there's plenty of CPU and RAM available.

Any advice on how I can figure out what ES is waiting for would be helpful. 
Is there any process ES runs every 10 minutes by default? 

Thanks!
-Dave





Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
Itamar,

1. The _source field includes many fields that are only being indexed, and 
many fields that are only needed in the query search result; _source includes 
them both. Projecting the requested fields out of _source is too CPU 
intensive to do at search time for each result, especially if the size 
is big. 
2. I agree that adding another NoSQL store could solve this problem, however it 
is currently out of scope, as it would require syncing data with another 
data store.
3. Wouldn't a big stored field bloat the Lucene index size? Even if 
not, aren't non_analyzed fields destined to be (or to already be) 
doc_values fields?

On Tuesday, April 21, 2015 at 1:36:20 AM UTC+3, Itamar Syn-Hershko wrote:
>
> This is how _source works. doc_values don't make sense in this regard - 
> what you are looking for is using stored fields and have the transform 
> script write to that. Loading stored fields (even one field per hit) may be 
> slower than loading and parsing _source, though.
>
> I'd just put this logic in the indexer, though. It will definitely help 
> with other things as well, such as nasty huge mappings.
>
> Alternatively, find a way to avoid IO completely. How about using ES for 
> search and something like riak for loading the actual data, if IO costs are 
> so noticable?
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko 
> Freelance Developer & Consultant
> Lucene.NET committer and PMC member
>
> On Mon, Apr 20, 2015 at 11:18 PM, Itai Frenkel  > wrote:
>
>> Hi,
>>
>> We are having a performance problem in which for each hit, elasticsearch 
>> parses the entire _source then generates a new Json with only the requested 
>> query _source fields. In order to overcome this issue we would like to use 
>> mapping transform script that serializes the requested query fields (which 
>> is known in advance) into a doc_value. Does that makes sense?
>>
>> The actual problem with the transform script is  SecurityException that 
>> does not allow using any json serialization mechanism. A binary 
>> serialization would also be ok.
>>
>>
>> Itai
>>


Bulk Index from Remote Host

2015-04-20 Thread TB
We are planning to bulk insert about 10 GB of data, however we are being 
forced to do this from a remote host.
Is this a good practice? And are there any potential issues I should watch 
out for?

any advice would be great



Re: 2.0 ETA

2015-04-20 Thread Adrien Grand
Hi Matt,

We have this meta issue which tracks what remains to be done before we
release 2.0: https://github.com/elastic/elasticsearch/issues/9970. We plan
to release as soon as we can but some of these issues are quite challenging
so it's hard to give you an ETA. It should be a matter of months but I
can't tell how many yet.

However, even if the GA release might still take time, there will be beta
releases before as we make progress through this checklist.

Sorry if my answer does not give as much information as you hoped, we are
all looking forward to this release and items on this checklist are very
high priorities!


On Mon, Apr 20, 2015 at 10:55 PM, Matt Weber  wrote:

> Is there an ETA for 2.0?
>
> --
> Thanks,
> Matt Weber
>



-- 
Adrien



Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itamar Syn-Hershko
This is how _source works. doc_values don't make sense in this regard --
what you are looking for is using stored fields and having the transform
script write to them. Loading stored fields (even one field per hit) may be
slower than loading and parsing _source, though.

I'd just put this logic in the indexer, though. It will definitely help
with other things as well, such as nasty huge mappings.

Alternatively, find a way to avoid IO completely. How about using ES for
search and something like riak for loading the actual data, if IO costs are
so noticeable?
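
A minimal sketch of the stored-field variant, assuming the pre-serialized 
value is produced by the indexer rather than a transform script (all names 
here are placeholders):

curl -XPUT 'localhost:9200/myindex' -d '{
  "mappings" : {
    "mytype" : {
      "properties" : {
        "summary_json" : { "type" : "string", "index" : "no", "store" : true }
      }
    }
  }
}'

curl 'localhost:9200/myindex/_search' -d '{
  "query" : { "match_all" : {} },
  "fields" : ["summary_json"]
}'

Requesting the field via fields loads only that stored field per hit instead 
of the whole _source.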

--

Itamar Syn-Hershko
http://code972.com | @synhershko 
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Mon, Apr 20, 2015 at 11:18 PM, Itai Frenkel  wrote:

> Hi,
>
> We are having a performance problem in which for each hit, elasticsearch
> parses the entire _source then generates a new Json with only the requested
> query _source fields. In order to overcome this issue we would like to use
> mapping transform script that serializes the requested query fields (which
> is known in advance) into a doc_value. Does that makes sense?
>
> The actual problem with the transform script is  SecurityException that
> does not allow using any json serialization mechanism. A binary
> serialization would also be ok.
>
>
> Itai
>


Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
Hi,

We are having a performance problem in which, for each hit, Elasticsearch 
parses the entire _source and then generates a new JSON document containing 
only the requested _source fields. To overcome this issue we would like to 
use a mapping transform script that serializes the requested query fields 
(which are known in advance) into a doc_value. Does that make sense?

The actual problem with the transform script is a SecurityException that 
does not allow using any JSON serialization mechanism. A binary 
serialization would also be OK.


Itai



Index Size and Replica Impact

2015-04-20 Thread TB
My index size is currently about 6 GB with replicas set to 1.
I have a 3-node cluster; in order to utilize the cluster, my understanding 
is that I would have to set the replicas to 3.
If I do that, would my index size grow to more than 6 GB on each node?



2.0 ETA

2015-04-20 Thread Matt Weber
Is there an ETA for 2.0?

-- 
Thanks,
Matt Weber



Query boost values available in script_score?

2015-04-20 Thread Kevin Reilly
Hi. Are query boost values available in script_score? 
I read the documentation with no success, but perhaps I overlooked something.

Thanks in advance.



Horizontal Bar Chart in Kibana

2015-04-20 Thread Vijay Nagendrarao
Hi,

I need to implement a horizontal bar chart in Kibana 4 and need some help 
with it. Please let me know.

Thanks,
Vijay.C.N



Search Scroll issue

2015-04-20 Thread Shawn Feldman
We are using scroll to do paging.

We are encountering an issue where the last result from the initial search 
appears as the first result in our scroll request.

so: hits[length-1] == nextPageHits[0]

This only seems to occur after we do a large series of writes and searches; 
initially it doesn't occur.

Is this behavior intended or a known issue?  
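
For reference, this is roughly the request sequence we are using (index name 
and sizes are placeholders):

# the initial search opens the scroll context and returns the first page plus a _scroll_id
curl 'localhost:9200/myindex/_search?scroll=1m&size=100' -d '{ "query" : { "match_all" : {} } }'

# each follow-up call sends the most recent _scroll_id as the body and returns the next page
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d 'SCROLL_ID_FROM_PREVIOUS_RESPONSE'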

-shawn



Re: Elasticseach issue with some indicies not populating data

2015-04-20 Thread christian . dahlqvist
Hi,

That sounds like a very large number of shards for a node that size, and 
this is most likely the source of your problems. Each shard in 
Elasticsearch corresponds to a Lucene instance and carries with it a 
certain amount of overhead. You therefore do not want your shards to be too 
small. For logging use cases a common shard size is at least a few GB.

If you are using daily indices and the default 5 shards per index, you may 
want to consider reducing the shard count for each of your indices and/or 
switching to weekly or perhaps monthly indices in order to reduce the number 
of shards created each day and increase the average shard size going 
forward.
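
For example, an index template along these lines (the template name and 
values are only illustrative, and it only affects indices created after the 
template is added) would cut the number of shards created per day; on a 
single node, number_of_replicas: 0 also avoids unassigned replica shards, at 
the cost of redundancy:

curl -XPUT 'localhost:9200/_template/logstash_shards' -d '{
  "template" : "logstash-*",
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0
  }
}'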

In order to get the instance working again you may also need to start 
closing the older indices in order to bring down the number of active 
shards, and/or upgrade the node to get more RAM.
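
Closing can be done per index or with a wildcard, for example (the index 
pattern is just an example):

curl -XPOST 'localhost:9200/logstash-2015.01.*/_close'

Closed indices keep their data on disk but no longer hold shards open in 
memory; they can be reopened later with _open if needed.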

Best regards,

Christian



On Monday, April 20, 2015 at 4:38:53 PM UTC+1, Don Pich wrote:
>
> Hey Christian,
>
> 8 gigs of ram
> -Xms6g -Xmx6g
>
> Don Pich | Jedi Master (aka System Administrator 2) | O: 701-952-5925
>
> 3320 Westrac Drive South, Suite A * Fargo, ND 58103
>
>
> On Mon, Apr 20, 2015 at 10:29 AM,  > wrote:
>
>> Hi,
>>
>> Having read through the thread it sounds like your configuration has been 
>> working in the past. Is that correct?
>>
>> If this is the case I would reiterate David's initial questions about 
>> your node's RAM and heap size as the number of shards look quite large for 
>> a single node. Could you please provide information about this?
>>
>> Best regards,
>>
>> Christian
>>
>>
>>
>> On Sunday, April 19, 2015 at 8:08:05 PM UTC+1, dp...@realtruck.com wrote:
>>>
>>> I am new to elasticsearch and have a problem.  I have 5 indicies.  At 
>>> first all of them were running without issue.  However, over the last 2 
>>> weeks, all but one have stopped generating data.  I have run a tcpdump on 
>>> the logstash server and confirmed that logging packets are getting to the 
>>> server.  I have looked into the servers health.  I have issued the 
>>> following to check on the cluster:
>>>
>>> root@logstash:/# curl -XGET 'localhost:9200/_cluster/health?pretty=true'
>>> {
>>>   "cluster_name" : "es-logstash",
>>>   "status" : "yellow",
>>>   "timed_out" : false,
>>>   "number_of_nodes" : 1,
>>>   "number_of_data_nodes" : 1,
>>>   "active_primary_shards" : 2791,
>>>   "active_shards" : 2791,
>>>   "relocating_shards" : 0,
>>>   "initializing_shards" : 0,
>>>   "unassigned_shards" : 2791
>>> }
>>> root@logstash:/#
>>>
>>>
>>> Can some one please point me in the right direction on troubleshooting 
>>> this?
>>>
>>  -- 
>> You received this message because you are subscribed to a topic in the 
>> Google Groups "elasticsearch" group.
>> To unsubscribe from this topic, visit 
>> https://groups.google.com/d/topic/elasticsearch/0GEaRABjLQY/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to 
>> elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/2a4d7543-b110-499b-a8d3-ccfa19284617%40googlegroups.com
>>  
>> 
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>



how to detect changes in database and automatically adding new row to elasticsearch index

2015-04-20 Thread snosek
 
   
What I've already done:

I connected my hbase with elasticsearch via this tutorial: 

http://lessc0de.github.io/connecting_hbase_to_elasticsearch.html

And I get an index with the hbase table content, but after adding a new row to 
hbase, it is not automatically added to the elasticsearch index. I tried to add 
this line to my conf:

"schedule" : "* 1/5 * ? * *"

and mapping:

"mappings": {
"jdbc" : {
 "_id" : {
 "path" : "ID"
 }
 }
}
 

which assigns _id = ID, and ID has unique value in my hbase table.

It's working well: when I add a new row to hbase it is uploaded to the index in 
less than 5 minutes. But it is not good for performance, because every 5 
minutes it executes the query again and only avoids re-adding the old content 
because _id has to be unique. That is fine for a small db, but I have over 10 
million rows in my hbase table, so my index is working all the time.

Is there any solution or plugin for elasticsearch that automatically detects 
changes in the db and adds only the new rows to the index?

I create index using:

curl -XPUT 'localhost:9200/_river/jdbc/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:phoenix:localhost",
"user" : "",
"password" : "",
"sql" : "select ID, MESSAGE from test",
"schedule" : "* 1/5 * ? * *"
}
}'

Thanks for help.



Re: shingle filter for sub phrase matching

2015-04-20 Thread brian
Did you ever figure this out? I have the same exact issue but using 
different words.  

On Wednesday, July 23, 2014 at 10:37:03 AM UTC-4, Nick Tackes wrote:
>
> I have created a gist with an analyzer that uses filter shingle in 
> attempt to match sub phrases. 
>
> For instance I have entries in the table with discrete phrases like 
>
> EGFR 
> Lung Cancer 
> Lung 
> Cancer 
>
> and I want to match these when searching the phrase 'EGFR related lung 
> cancer 
>
> My expectation is that the multi word matches score higher than the single 
> matches, for instance... 
> 1. Lung Cancer 
> 2. Lung 
> 3. Cancer 
> 4. EGFR 
>
> Additionally, I tried a standard analyzer match but this didn't yield the 
> desired result either. One complicating aspect to this approach is that the 
> min_shingle_size has to be 2 or more. 
>
> How then would I be able to match single words like 'EGFR' or 'Lung'? 
>
> thanks
>
> https://gist.github.com/nicktackes/ffdbf22aba393efc2169.js
>
>
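
For what it's worth, the shingle filter has an output_unigrams option that emits 
the single tokens alongside the shingles, which is one way to keep single-word 
matches like 'EGFR' while still letting multi-word matches score higher; a 
minimal settings sketch (index, analyzer and filter names are placeholders):

curl -XPUT 'localhost:9200/phrases' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "phrase_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "phrase_shingles"]
        }
      }
    }
  }
}'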



Re: Elasticseach issue with some indicies not populating data

2015-04-20 Thread Don Pich
Hey Christian,

8 gigs of ram
-Xms6g -Xmx6g

Don Pich | Jedi Master (aka System Administrator 2) | O: 701-952-5925

3320 Westrac Drive South, Suite A * Fargo, ND 58103

Facebook  | Youtube
| Twitter
 | Google+  |
Instagram  | Linkedin
 | Our Guiding Principles

“If it goes on a truck we got it, if it’s fun we do it” – RealTruck.com


On Mon, Apr 20, 2015 at 10:29 AM, 
wrote:

> Hi,
>
> Having read through the thread it sounds like your configuration has been
> working in the past. Is that correct?
>
> If this is the case I would reiterate David's initial questions about your
> node's RAM and heap size as the number of shards look quite large for a
> single node. Could you please provide information about this?
>
> Best regards,
>
> Christian
>
>
>
> On Sunday, April 19, 2015 at 8:08:05 PM UTC+1, dp...@realtruck.com wrote:
>>
>> I am new to elasticsearch and have a problem.  I have 5 indicies.  At
>> first all of them were running without issue.  However, over the last 2
>> weeks, all but one have stopped generating data.  I have run a tcpdump on
>> the logstash server and confirmed that logging packets are getting to the
>> server.  I have looked into the servers health.  I have issued the
>> following to check on the cluster:
>>
>> root@logstash:/# curl -XGET 'localhost:9200/_cluster/health?pretty=true'
>> {
>>   "cluster_name" : "es-logstash",
>>   "status" : "yellow",
>>   "timed_out" : false,
>>   "number_of_nodes" : 1,
>>   "number_of_data_nodes" : 1,
>>   "active_primary_shards" : 2791,
>>   "active_shards" : 2791,
>>   "relocating_shards" : 0,
>>   "initializing_shards" : 0,
>>   "unassigned_shards" : 2791
>> }
>> root@logstash:/#
>>
>>
>> Can some one please point me in the right direction on troubleshooting
>> this?
>>
>  --
> You received this message because you are subscribed to a topic in the
> Google Groups "elasticsearch" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/elasticsearch/0GEaRABjLQY/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/2a4d7543-b110-499b-a8d3-ccfa19284617%40googlegroups.com
> 
> .
>
> For more options, visit https://groups.google.com/d/optout.
>



Re: Elasticseach issue with some indicies not populating data

2015-04-20 Thread christian . dahlqvist
Hi,

Having read through the thread it sounds like your configuration has been 
working in the past. Is that correct?

If this is the case I would reiterate David's initial questions about your 
node's RAM and heap size, as the number of shards looks quite large for a 
single node. Could you please provide information about this?

Best regards,

Christian



On Sunday, April 19, 2015 at 8:08:05 PM UTC+1, dp...@realtruck.com wrote:
>
> I am new to elasticsearch and have a problem.  I have 5 indicies.  At 
> first all of them were running without issue.  However, over the last 2 
> weeks, all but one have stopped generating data.  I have run a tcpdump on 
> the logstash server and confirmed that logging packets are getting to the 
> server.  I have looked into the servers health.  I have issued the 
> following to check on the cluster:
>
> root@logstash:/# curl -XGET 'localhost:9200/_cluster/health?pretty=true'
> {
>   "cluster_name" : "es-logstash",
>   "status" : "yellow",
>   "timed_out" : false,
>   "number_of_nodes" : 1,
>   "number_of_data_nodes" : 1,
>   "active_primary_shards" : 2791,
>   "active_shards" : 2791,
>   "relocating_shards" : 0,
>   "initializing_shards" : 0,
>   "unassigned_shards" : 2791
> }
> root@logstash:/#
>
>
> Can some one please point me in the right direction on troubleshooting 
> this?
>



find missing documents in an index

2015-04-20 Thread seallison
Is there a way for Elasticsearch to tell me documents that are NOT in an 
index given a set of criteria?

I have a field in my documents that contains a unique numerical id. There 
are some ids that are missing from documents in the index and I want to 
find those ids. For example:

{ "product_id": 1000 }, {"product_id": 1002}, {"product_id": 1004}, 
{"product_id": 1005}, ...

In this example I want to find that documents with a product_id of 1001 and 
1003 are missing from the index. Is this something an aggregation could 
help me identify?
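
One idea, as a sketch only: a histogram aggregation on product_id with interval 1 
and min_doc_count 0 emits empty buckets inside the observed id range, and those 
empty buckets are the missing ids (the index name is a placeholder, and with 
millions of ids you would want to constrain the range with a filter first):

curl -XGET 'localhost:9200/products/_search' -d '{
  "size": 0,
  "aggs": {
    "ids": {
      "histogram": {
        "field": "product_id",
        "interval": 1,
        "min_doc_count": 0
      }
    }
  }
}'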



"Cannot specify a query in the target index and through es.query" when working with ES, Wikipedia River and Hive

2015-04-20 Thread Gordon
Hi,

I've largely got everything set up to integrate ES and Hive. However, when I 
execute a query against the table "wikitable" as defined below, I get the 
error 

*Cannot specify a query in the target index and through es.query*

Versions are ES Hive integration, 2.1.0.Beta3; ES, 1.4.4; and, I'm running 
on the latest Hadoop/Hive (installed last week directly from Apache).

I suspect the error has to do with the definition of the resource for the 
external table

wikipedia_river/page/_search?q

which I found in this excellent article here

http://ryrobes.com/systems/connecting-tableau-to-elasticsearch-read-how-to-query-elasticsearch-with-hive-sql-and-hadoop/

Things may have changed, however, since the article was written. For 
instance, I had to take the es.host property out of the table definition 
and instead make sure I used es.nodes in a set statement in Hive. Things 
like that usually take a bit of digging and experimenting to figure out. 
I've gotten to where there's an attempt to execute the query and would 
appreciate it if anyone could shed some light on how to work beyond this 
point. Thanks!

Hive statements
--

set es.nodes=peace;
set es.port=9201;

DROP TABLE IF EXISTS wikitable;

CREATE EXTERNAL TABLE wikitable (
title string,
redirect_page string )
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'wikipedia_river/page/_search?q=*'
);

select count(distinct title) from wikitable;

--

Full stack trace:

--

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot specify a 
query in the target index and through es.query
at org.elasticsearch.hadoop.rest.Resource.(Resource.java:48)
at 
org.elasticsearch.hadoop.rest.RestRepository.(RestRepository.java:88)
at 
org.elasticsearch.hadoop.rest.RestService.findPartitions(RestService.java:226)
at 
org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:406)
at 
org.elasticsearch.hadoop.hive.EsHiveInputFormat.getSplits(EsHiveInputFormat.java:112)
at 
org.elasticsearch.hadoop.hive.EsHiveInputFormat.getSplits(EsHiveInputFormat.java:51)
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
at 
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:361)
at 
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:571)
at 
org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:624)
at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:616)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:428)
at 
org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:137)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1638)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1397)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1183)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1039)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:207)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
at 
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:754)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce

Re: jdbcRiver rebuilding after restart.

2015-04-20 Thread joergpra...@gmail.com
The column strategy is a community effort; it can manipulate SQL statement
WHERE clauses with a timestamp filter.

I do not have enough knowledge about the column strategy.

You are correct, at node restart, a river does not know from where to
restart. There is no method to resolve this within river logic.
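
On the question quoted below about checking the newest timestamp already indexed
(a Max(CreatedDate) equivalent), a max aggregation can at least report how far the
index got; a minimal sketch (index, type and field names are placeholders taken
from the quoted SQL, and date fields come back as epoch milliseconds):

curl -XGET 'localhost:9200/myindex/mytype/_search' -d '{
  "size": 0,
  "aggs": {
    "latest_created": {
      "max": { "field": "createddate" }
    }
  }
}'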

Jörg


On Mon, Apr 20, 2015 at 2:11 PM, GWired  wrote:

> I can't look at the feeder setup now but I could in the future.
>
> Is my SQL statement incorrect?
>
> Should I be doing something differently?
>
> Does the river not utilize created_at and updated_at in this setup?  I
> don't have a where clause because I thought using the column strategy it
> would take that in to account.
>
> This is an example of what I see in SQL server:
>
> SELECT id as _id, * FROM [MyDBName].[dbo].[MyTableName] WHERE ({fn
> TIMESTAMPDIFF(SQL_TSI_SECOND,@P0,"createddate")} >= 0)
>
> Which when i populate the @P0 with a timestamp it seems to be working fine.
>
> On a restart I'm guessing it doesn't know when to start.
>
> Any way that I can check values in elasticsearch within the column
> strategy?  Such as using Max(CreatedDate) so that it can start there?
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/06e9ce54-8b71-4337-971b-440a5b56f00d%40googlegroups.com
> 
> .
>
> For more options, visit https://groups.google.com/d/optout.
>



SHIELD terms lookup filter : AuthorizationException BUG

2015-04-20 Thread Bert Vermeiren
Hi,

Using:
* ElasticSearch 1.5.1
* SHIELD 1.2

Whenever I use a terms lookup filter in a search query, I get an 
AuthorizationException for the [__es_system_user] user although the actual 
user even has 'admin' role privileges.
This seems like a bug to me, where the terms lookup does not run with the correct 
security context.

This is very easy to reproduce, see gist :

https://gist.github.com/bertvermeiren/c29e0d9ee54bb5b0b73a

Scenario :

# Add user 'admin' with default 'admin' role.
./bin/shield/esusers useradd admin -p admin1 -r admin

# create index.
curl -XPUT 'admin:admin1@localhost:9200/customer'

# create a document on the index
curl -XPUT 'admin:admin1@localhost:9200/customer/external/1' -d '
{
  "name" : "John Doe",
  "token" : "token1"
}'

# create additional index for the "terms lookup" filter functionality
curl -XPUT 'admin:admin1@localhost:9200/tokens'

# create document in 'tokens' index
curl -XPUT 'admin:admin1@localhost:9200/tokens/tokens/1' -d '
{
  "group" : "1",
  "tokens" : ["token1", "token2" ]
}'

# search with a terms lookup filter on the "customer" index, referring to 
the 'tokens' index.

curl -XGET 'admin:admin1@localhost:9200/customer/external/_search' -d '
{
  "query": {
"filtered": {
  "query": {
"match_all": {}
  },
  "filter": {
   "terms": {
"token": {
  "index": "tokens",
  "type": "tokens",
  "id": "1",
  "path": "tokens"
 }
   }
  }
}
  }
}'


=> org.elasticsearch.shield.authz.AuthorizationException: action 
[indices:data/read/get] is unauthorized for user [__es_system_user]



Re: Elasticseach issue with some indicies not populating data

2015-04-20 Thread Don Pich
Thanks David.  I will move over to the logstash mailing list, as I agree that is
where the problem seems to be.

I appreciate your help!!

Don Pich | Jedi Master (aka System Administrator 2) | O: 701-952-5925

3320 Westrac Drive South, Suite A * Fargo, ND 58103

Facebook  | Youtube
| Twitter
 | Google+  |
Instagram  | Linkedin
 | Our Guiding Principles

“If it goes on a truck we got it, if it’s fun we do it” – RealTruck.com


On Mon, Apr 20, 2015 at 9:43 AM, David Pilato  wrote:

> Might be. But you should ask this on the logstash mailing list.
> I think that elasticsearch is working fine here as you did not see any
> trouble in logs.
>
> That said I’d use:
>
>   elasticsearch {
> protocol => "http"
> host => "localhost"
>   }
>
> So using REST port (9200) that is.
>
> You can also add this output to make sure something is meant to be sent in
> elasticsearch:
>
> output {
>   stdout {
> codec => rubydebug
>   }
>   elasticsearch {
> protocol => "http"
> host => "localhost"
>   }
> }
>
>
>
> --
> *David Pilato* - Developer | Evangelist
> *elastic.co *
> @dadoonet  | @elasticsearchfr
>  | @scrutmydocs
> 
>
>
>
>
>
> Le 20 avr. 2015 à 16:38, Don Pich  a écrit :
>
> Thanks for that info.  Again, training wheels...  :-)
>
> So below is my logstash config.  If I do a tcpdump on port 5044, I see all
> of my forwarders communicating with the logstash server.  However, if I do
> a tcpdump on port 9300, I do not see any traffic.  This leads me to believe
> that I have a problem in my output.
>
> input
> {
> lumberjack # comes from logstash-forwarder, we sent ALL formats and
> types through this and control logType and logFormat on the client
> {
># The port to listen on
>port => 5044
>host => "192.168.1.72"
>
># The paths to your ssl cert and key
>ssl_certificate => "/opt/logstash-1.4.2/ssl/certs/lumberjack.crt" #
> new cert needed for latest v of lumberjack-pusher
>ssl_key => "/opt/logstash-1.4.2/ssl/private/lumberjack.key"
> }
>
> tcp
> {
># Remember with nxlog we're automatically converting our windows
> xml to JSON
>ssl_cert => "/opt/logstash-1.4.2/ssl/certs/logstash-forwarder.crt"
>ssl_key  => "/opt/logstash-1.4.2/ssl/private/logstash-forwarder.key"
>ssl_enable => true
>debug=>true
>type => "windowsEventLog"
>port => 3515
>codec => "line"
>add_field=>{"logType"=>"windowsEventLog"}
> }
> tcp
> {
># Remember with nxlog we're automatically converting our windows
> xml to JSON
># used for NFSServer which apparently cannot connect via SSL :(
>type => "windowsEventLog"
>port => 3516
>codec => "line"
>add_field=>{"logType"=>"windowsEventLog"}
> }
>
> }
>
> filter
> {
> if [logFormat] == "nginxLog"
> {
>mutate{add_field => ["receivedAt","%{@timestamp}"]} #preserve when
> we received this
>grok
>{
>   break_on_match => false
>   match =>
> ["message","%{IP:visitor_ip}\|[^|]+\|%{TIMESTAMP_ISO8601:entryDateTime}\|%{URIPATH:url}%{URIPARAM:query_string}?\|%{INT:http_response}\|%{INT:response_length}\|(?[^|]+)\|(?[^|]+)\|%{BASE16FLOAT:request_time}\|%{BASE16FLOAT:upstream_response_time}"]
>   match => ["url","\.(?(?:.(?!\.))+)$"]
>}
>date
>{
>match => ["entryDateTime","ISO8601"]
>remove_field => ["entryDateTime"]
>}
> }
> else if [logFormat] == "exim4"
> {
> mutate{add_field => ["receivedAt","%{@timestamp}"]} #preserve when
> we received this
> grok
> {
> break_on_match => false
> match => ["message","(?[^ ]+ [^ ]+)
> \[(?.*)\] (?.*)"]
> }
> date
> {
> match => ["entryDateTime","-MM-dd HH:mm:ss"]
> }
> }
> else if [logFormat]=="proftpd"
> {
> grok
> {
> break_on_match => false
> match => ["message","(?[^ ]+) (?[^
> ]+) (?[^ ]+) \[(?.*)\] (?\".*\")
> (?[^ ]+) (?\".*\") (?[^ ]+)"]
> add_field => ["receivedAt","%{@timestamp}"] # preserve now
> before date overwrites
> }
> date
> {
> match => ["entryDateTime","dd/MMM/:HH:mm:ss Z"]
> #target => "testDate"
> }
> }
> else if [logFormat] == "debiansyslog"
> {
># linux sysLog
>grok
>{
>break_on_match => false
>match => ["message","(?[a-zA-Z]{3} [ 0-9]+ [^
> 

Re: Elasticseach issue with some indicies not populating data

2015-04-20 Thread David Pilato
Might be. But you should ask this on the logstash mailing list.
I think that elasticsearch is working fine here as you did not see any trouble 
in logs.

That said I’d use:

  elasticsearch {
protocol => "http"
host => "localhost"
  }

So using REST port (9200) that is.

You can also add this output to make sure something is meant to be sent in 
elasticsearch:

output { 
  stdout {
codec => rubydebug
  }
  elasticsearch {
protocol => "http"
host => "localhost"
  }
}



-- 
David Pilato - Developer | Evangelist 
elastic.co
@dadoonet  | @elasticsearchfr 
 | @scrutmydocs 






> Le 20 avr. 2015 à 16:38, Don Pich  a écrit :
> 
> Thanks for that info.  Again, training wheels...  :-)
> 
> So below is my logstash config.  If I do a tcpdump on port 5044, I see all of 
> my forwarders communicating with the logstash server.  However, if I do a 
> tcpdump on port 9300, I do not see any traffic.  This leads me to believe 
> that I have a problem in my output.
> 
> input
> {
> lumberjack # comes from logstash-forwarder, we sent ALL formats and types 
> through this and control logType and logFormat on the client
> {
># The port to listen on
>port => 5044
>host => "192.168.1.72"
> 
># The paths to your ssl cert and key
>ssl_certificate => "/opt/logstash-1.4.2/ssl/certs/lumberjack.crt" # 
> new cert needed for latest v of lumberjack-pusher
>ssl_key => "/opt/logstash-1.4.2/ssl/private/lumberjack.key"
> }
> 
> tcp
> {
># Remember with nxlog we're automatically converting our windows xml 
> to JSON
>ssl_cert => "/opt/logstash-1.4.2/ssl/certs/logstash-forwarder.crt"
>ssl_key  => "/opt/logstash-1.4.2/ssl/private/logstash-forwarder.key"
>ssl_enable => true
>debug=>true
>type => "windowsEventLog"
>port => 3515
>codec => "line"
>add_field=>{"logType"=>"windowsEventLog"}
> }
> tcp
> {
># Remember with nxlog we're automatically converting our windows xml 
> to JSON
># used for NFSServer which apparently cannot connect via SSL :(
>type => "windowsEventLog"
>port => 3516
>codec => "line"
>add_field=>{"logType"=>"windowsEventLog"}
> }
> 
> }
> 
> filter
> {
> if [logFormat] == "nginxLog"
> {
>mutate{add_field => ["receivedAt","%{@timestamp}"]} #preserve when we 
> received this
>grok
>{
>   break_on_match => false
>   match => 
> ["message","%{IP:visitor_ip}\|[^|]+\|%{TIMESTAMP_ISO8601:entryDateTime}\|%{URIPATH:url}%{URIPARAM:query_string}?\|%{INT:http_response}\|%{INT:response_length}\|(?[^|]+)\|(?[^|]+)\|%{BASE16FLOAT:request_time}\|%{BASE16FLOAT:upstream_response_time}"]
>   match => ["url","\.(?(?:.(?!\.))+)$"]
>}
>date
>{
>match => ["entryDateTime","ISO8601"]
>remove_field => ["entryDateTime"]
>}
> }
> else if [logFormat] == "exim4"
> {
> mutate{add_field => ["receivedAt","%{@timestamp}"]} #preserve when we 
> received this
> grok
> {
> break_on_match => false
> match => ["message","(?[^ ]+ [^ ]+) 
> \[(?.*)\] (?.*)"]
> }
> date
> {
> match => ["entryDateTime","-MM-dd HH:mm:ss"]
> }
> }
> else if [logFormat]=="proftpd"
> {
> grok
> {
> break_on_match => false
> match => ["message","(?[^ ]+) (?[^ ]+) 
> (?[^ ]+) \[(?.*)\] (?\".*\") 
> (?[^ ]+) (?\".*\") (?[^ ]+)"]
> add_field => ["receivedAt","%{@timestamp}"] # preserve now before 
> date overwrites
> }
> date
> {
> match => ["entryDateTime","dd/MMM/:HH:mm:ss Z"]
> #target => "testDate"
> }
> }
> else if [logFormat] == "debiansyslog"
> {
># linux sysLog
>grok
>{
>break_on_match => false
>match => ["message","(?[a-zA-Z]{3} [ 0-9]+ [^ ]+) 
> (?[^ ]+) (?[^:]+):(?.*)"]
>add_field => ["receivedAt","%{@timestamp}"] # preserve NOW before 
> date overwrites
>}
>date
>{
>   # Mar  2 02:21:28 primaryweb-wheezy logstash-forwarder[754]: 
> 2015/03/02 02:21:28.607445 Registrar received 348 events
>   match => ["entryDateTime","MMM dd HH:mm:ss","MMM  d HH:mm:ss"] 
> # problems with jodatime and missing leading 0 on days, we can supply 
> multiple patterns :)
>}
> }
> else if [type] == "windowsEventLog"
> {
>   json{ source => "message"  } # set our source to the entire message as 
> its JSON
>   mutate
>   {
>   add_field => ["receivedAt","%{@timestamp}"]
>   }
>   if [SourceModuleName] == "eventlog"
>   {
>  # use the date/time of the entry and not physic

Re: Elasticseach issue with some indicies not populating data

2015-04-20 Thread Don Pich
Also, sanity check:

root@logstash:/var/log/logstash# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination
root@logstash:/var/log/logstash#

Don Pich | Jedi Master (aka System Administrator 2) | O: 701-952-5925

3320 Westrac Drive South, Suite A * Fargo, ND 58103

Facebook  | Youtube
| Twitter
 | Google+  |
Instagram  | Linkedin
 | Our Guiding Principles

“If it goes on a truck we got it, if it’s fun we do it” – RealTruck.com


On Mon, Apr 20, 2015 at 9:38 AM, Don Pich  wrote:

> Thanks for that info.  Again, training wheels...  :-)
>
> So below is my logstash config.  If I do a tcpdump on port 5044, I see all
> of my forwarders communicating with the logstash server.  However, if I do
> a tcpdump on port 9300, I do not see any traffic.  This leads me to believe
> that I have a problem in my output.
>
> input
> {
> lumberjack # comes from logstash-forwarder, we sent ALL formats and
> types through this and control logType and logFormat on the client
> {
># The port to listen on
>port => 5044
>host => "192.168.1.72"
>
># The paths to your ssl cert and key
>ssl_certificate => "/opt/logstash-1.4.2/ssl/certs/lumberjack.crt" #
> new cert needed for latest v of lumberjack-pusher
>ssl_key => "/opt/logstash-1.4.2/ssl/private/lumberjack.key"
> }
>
> tcp
> {
># Remember with nxlog we're automatically converting our windows
> xml to JSON
>ssl_cert => "/opt/logstash-1.4.2/ssl/certs/logstash-forwarder.crt"
>ssl_key  => "/opt/logstash-1.4.2/ssl/private/logstash-forwarder.key"
>ssl_enable => true
>debug=>true
>type => "windowsEventLog"
>port => 3515
>codec => "line"
>add_field=>{"logType"=>"windowsEventLog"}
> }
> tcp
> {
># Remember with nxlog we're automatically converting our windows
> xml to JSON
># used for NFSServer which apparently cannot connect via SSL :(
>type => "windowsEventLog"
>port => 3516
>codec => "line"
>add_field=>{"logType"=>"windowsEventLog"}
> }
>
> }
>
> filter
> {
> if [logFormat] == "nginxLog"
> {
>mutate{add_field => ["receivedAt","%{@timestamp}"]} #preserve when
> we received this
>grok
>{
>   break_on_match => false
>   match =>
> ["message","%{IP:visitor_ip}\|[^|]+\|%{TIMESTAMP_ISO8601:entryDateTime}\|%{URIPATH:url}%{URIPARAM:query_string}?\|%{INT:http_response}\|%{INT:response_length}\|(?[^|]+)\|(?[^|]+)\|%{BASE16FLOAT:request_time}\|%{BASE16FLOAT:upstream_response_time}"]
>   match => ["url","\.(?(?:.(?!\.))+)$"]
>}
>date
>{
>match => ["entryDateTime","ISO8601"]
>remove_field => ["entryDateTime"]
>}
> }
> else if [logFormat] == "exim4"
> {
> mutate{add_field => ["receivedAt","%{@timestamp}"]} #preserve when
> we received this
> grok
> {
> break_on_match => false
> match => ["message","(?[^ ]+ [^ ]+)
> \[(?.*)\] (?.*)"]
> }
> date
> {
> match => ["entryDateTime","-MM-dd HH:mm:ss"]
> }
> }
> else if [logFormat]=="proftpd"
> {
> grok
> {
> break_on_match => false
> match => ["message","(?[^ ]+) (?[^
> ]+) (?[^ ]+) \[(?.*)\] (?\".*\")
> (?[^ ]+) (?\".*\") (?[^ ]+)"]
> add_field => ["receivedAt","%{@timestamp}"] # preserve now
> before date overwrites
> }
> date
> {
> match => ["entryDateTime","dd/MMM/:HH:mm:ss Z"]
> #target => "testDate"
> }
> }
> else if [logFormat] == "debiansyslog"
> {
># linux sysLog
>grok
>{
>break_on_match => false
>match => ["message","(?[a-zA-Z]{3} [ 0-9]+ [^
> ]+) (?[^ ]+) (?[^:]+):(?.*)"]
>add_field => ["receivedAt","%{@timestamp}"] # preserve NOW
> before date overwrites
>}
>date
>{
>   # Mar  2 02:21:28 primaryweb-wheezy logstash-forwarder[754]:
> 2015/03/02 02:21:28.607445 Registrar received 348 events
>   match => ["entryDateTime","MMM dd HH:mm:ss","MMM  d
> HH:mm:ss"] # problems with jodatime and missing leading 0 on days, we can
> supply multiple patterns :)
>}
> }
> else if [type] == "windowsEventLog"
> {
>   json{ source => "message"  } # set our source to the entire message
> as its JSON
>  

Re: Elasticseach issue with some indicies not populating data

2015-04-20 Thread Don Pich
Thanks for that info.  Again, training wheels...  :-)

So below is my logstash config.  If I do a tcpdump on port 5044, I see all
of my forwarders communicating with the logstash server.  However, if I do
a tcpdump on port 9300, I do not see any traffic.  This leads me to believe
that I have a problem in my output.

input
{
lumberjack # comes from logstash-forwarder, we sent ALL formats and
types through this and control logType and logFormat on the client
{
   # The port to listen on
   port => 5044
   host => "192.168.1.72"

   # The paths to your ssl cert and key
   ssl_certificate => "/opt/logstash-1.4.2/ssl/certs/lumberjack.crt" #
new cert needed for latest v of lumberjack-pusher
   ssl_key => "/opt/logstash-1.4.2/ssl/private/lumberjack.key"
}

tcp
{
   # Remember with nxlog we're automatically converting our windows xml
to JSON
   ssl_cert => "/opt/logstash-1.4.2/ssl/certs/logstash-forwarder.crt"
   ssl_key  => "/opt/logstash-1.4.2/ssl/private/logstash-forwarder.key"
   ssl_enable => true
   debug=>true
   type => "windowsEventLog"
   port => 3515
   codec => "line"
   add_field=>{"logType"=>"windowsEventLog"}
}
tcp
{
   # Remember with nxlog we're automatically converting our windows xml
to JSON
   # used for NFSServer which apparently cannot connect via SSL :(
   type => "windowsEventLog"
   port => 3516
   codec => "line"
   add_field=>{"logType"=>"windowsEventLog"}
}

}

filter
{
if [logFormat] == "nginxLog"
{
   mutate{add_field => ["receivedAt","%{@timestamp}"]} #preserve when
we received this
   grok
   {
  break_on_match => false
  match =>
["message","%{IP:visitor_ip}\|[^|]+\|%{TIMESTAMP_ISO8601:entryDateTime}\|%{URIPATH:url}%{URIPARAM:query_string}?\|%{INT:http_response}\|%{INT:response_length}\|(?[^|]+)\|(?[^|]+)\|%{BASE16FLOAT:request_time}\|%{BASE16FLOAT:upstream_response_time}"]
  match => ["url","\.(?(?:.(?!\.))+)$"]
   }
   date
   {
   match => ["entryDateTime","ISO8601"]
   remove_field => ["entryDateTime"]
   }
}
else if [logFormat] == "exim4"
{
mutate{add_field => ["receivedAt","%{@timestamp}"]} #preserve when
we received this
grok
{
break_on_match => false
match => ["message","(?[^ ]+ [^ ]+)
\[(?.*)\] (?.*)"]
}
date
{
match => ["entryDateTime","-MM-dd HH:mm:ss"]
}
}
else if [logFormat]=="proftpd"
{
grok
{
break_on_match => false
match => ["message","(?[^ ]+) (?[^
]+) (?[^ ]+) \[(?.*)\] (?\".*\")
(?[^ ]+) (?\".*\") (?[^ ]+)"]
add_field => ["receivedAt","%{@timestamp}"] # preserve now
before date overwrites
}
date
{
match => ["entryDateTime","dd/MMM/:HH:mm:ss Z"]
#target => "testDate"
}
}
else if [logFormat] == "debiansyslog"
{
   # linux sysLog
   grok
   {
   break_on_match => false
   match => ["message","(?[a-zA-Z]{3} [ 0-9]+ [^ ]+)
(?[^ ]+) (?[^:]+):(?.*)"]
   add_field => ["receivedAt","%{@timestamp}"] # preserve NOW
before date overwrites
   }
   date
   {
  # Mar  2 02:21:28 primaryweb-wheezy logstash-forwarder[754]:
2015/03/02 02:21:28.607445 Registrar received 348 events
  match => ["entryDateTime","MMM dd HH:mm:ss","MMM  d
HH:mm:ss"] # problems with jodatime and missing leading 0 on days, we can
supply multiple patterns :)
   }
}
else if [type] == "windowsEventLog"
{
  json{ source => "message"  } # set our source to the entire message
as its JSON
  mutate
  {
  add_field => ["receivedAt","%{@timestamp}"]
  }
  if [SourceModuleName] == "eventlog"
  {
 # use the date/time of the entry and not physical time so viewing
acts as expected
 date
 {
 match => ["EventTime","-MM-dd HH:mm:ss"]
 }

 # message defaults to the entire message. Since we have json data
for all properties, copy the event message into it instead
 mutate
 {
   replace => [ "message", "%{Message}" ]
 }
 mutate
 {
   remove_field => [ "Message" ]
 }
  }
   }
}
output
{
if [logType] == "webLog"
{
elasticsearch
{
host=>"127.0.0.1"
port=>9300
cluster => "es-logstash"
#node_name => "es-logstash-n1"
index => "logstash-weblog-events-%{+.MM.dd}"
}
}
else if [logType] == "mailLog"
{
elasticsearch
{
host=>"127.0.0.1"
port=>9300
cluster => "es-logstash"
#node_name => "es-logstash-n1"
index => "logstash-mail-events-%{+.MM.dd}"
}
}
else if [t

Re: How to configure max file descriptors on windows OS?

2015-04-20 Thread Xudong You
Thanks Mark!

On Friday, April 17, 2015 at 6:22:24 AM UTC+8, Mark Walkom wrote:
>
> -1 means unbound, ie unlimited.
>
> On 16 April 2015 at 20:54, Xudong You > 
> wrote:
>
>> Anyone knows how to change the max_file_descriptors on windows?
>> I built ES cluster on Windows and got following process information:
>>
>> "max_file_descriptors" : -1,
>> "open_file_descriptors" : -1,
>>
>> What does “-1 mean?
>> Is it possible to change the max file descriptors on windows platform?
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/aa22c565-80f5-4228-8f03-15d1b1e3f150%40googlegroups.com
>>  
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>



Re: creation_date in index setteing

2015-04-20 Thread Prashant Agrawal
We are using version 1.3.0
On Apr 20, 2015 7:38 PM, "Colin Goodheart-Smithe-2 [via Elasticsearch
Users]"  wrote:

> Prashant,
>
> What version of Elasticsearch are you using?
>
> The index creation date added to the index settings API in version 1.4.0
> and will only show for indices created with that version or later (see
> https://github.com/elastic/elasticsearch/pull/7218).
>
> Colin
>
>
> On Monday, April 20, 2015 at 2:23:37 PM UTC+1, Prashy wrote:
>>
>> Hi All,
>>
>> I also require the indexing time to be returned by ES, but when i am
>> firing
>> the query like curl -XGET 
>> 'http://192.168.0.179:9200/16-04-2015-index/_settings'
>>
>> I am not able to get the index_creation time and getting the response as
>> :
>> {"16-04-2015-index":{"settings":{"index":{"uuid":"rHmX564PSnuI8cye4GxA1g","number_of_replicas":"0","number_of_shards":"5","version":{"created":"1030099"}
>>
>>
>> Please let me know how i can get the index creation time for the same.
>>
>> ~Prashant
>>
>>
>>
>> --
>> View this message in context: 
>> http://elasticsearch-users.115913.n3.nabble.com/creation-date-in-index-setteing-tp4073837p4073853.html
>>
>> Sent from the Elasticsearch Users mailing list archive at Nabble.com.
>>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [hidden email]
> .
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/b12a4019-c5ab-4e0f-aaec-0d3ab083da4d%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>
>
>




--
View this message in context: 
http://elasticsearch-users.115913.n3.nabble.com/creation-date-in-index-setteing-tp4073837p4073861.html
Sent from the Elasticsearch Users mailing list archive at Nabble.com.


MongoDB river not copying all of the data from mongoDB to ES

2015-04-20 Thread Ramdev Wudali

Hi:
   I have been successful at creating a river between a MongoDB database 
and an Elasticsearch instance.  
The MongoDB database and the specific collection have 8M+ documents. 
However, when the river is set up and running, 
less than half of the docs are copied/transferred. 

I am using the elasticsearch-river-mongodb-2.0.7 version  with 
Elasticsearch 1.4.4 

Here is a sampling of the trace log messages from ES :

[2015-04-17 12:56:22,045][TRACE][org.elasticsearch.river.mongodb.Indexer] 
Insert operation - id: 553148b8e4b09c4dd2407f92 - contains attachment: false
[2015-04-17 12:56:22,045][TRACE][org.elasticsearch.river.mongodb.Indexer] 
updateBulkRequest for id: [553148b8e4b09c4dd2407fa6], operation: [INSERT]
[2015-04-17 12:56:22,045][TRACE][org.elasticsearch.river.mongodb.Indexer] 
Operation: INSERT - index: twitter - type: one-pct-sane - routing: null - 
parent: null
[2015-04-17 12:56:22,045][TRACE][org.elasticsearch.river.mongodb.Indexer] 
Insert operation - id: 553148b8e4b09c4dd2407fa6 - contains attachment: false
[2015-04-17 12:56:22,045][TRACE][org.elasticsearch.river.mongodb.Indexer] 
updateBulkRequest for id: [553148bae4b09c4dd2407fd5], operation: [INSERT]
[2015-04-17 12:56:22,045][TRACE][org.elasticsearch.river.mongodb.Indexer] 
Operation: INSERT - index: twitter - type: one-pct-sane - routing: null - 
parent: null
[2015-04-17 12:56:22,045][TRACE][org.elasticsearch.river.mongodb.Indexer] 
Insert operation - id: 553148bae4b09c4dd2407fd5 - contains attachment: false
[2015-04-17 12:56:22,045][TRACE][org.elasticsearch.river.mongodb.Indexer] 
updateBulkRequest for id: [553148bae4b09c4dd2407ffe], operation: [INSERT]
[2015-04-17 12:56:22,045][TRACE][org.elasticsearch.river.mongodb.Indexer] 
Operation: INSERT - index: twitter - type: one-pct-sane - routing: null - 
parent: null
[2015-04-17 12:56:22,045][TRACE][org.elasticsearch.river.mongodb.Indexer] 
Insert operation - id: 553148bae4b09c4dd2407ffe - contains attachment: false
[2015-04-17 12:56:22,055][TRACE][org.elasticsearch.river.mongodb.
MongoDBRiver] setLastTimestamp [one_pct_sane] [one-pct-sane.current] [
Timestamp.BSON(ts={ "$ts" : 1429282637 , "$inc" : 2})]
[2015-04-17 12:56:22,095][TRACE][org.elasticsearch.river.mongodb.
MongoDBRiverBulkProcessor] afterBulk - bulk [57498] success [80 items] [59 
ms] total [3952638]
[2015-04-17 12:56:22,217][TRACE][org.elasticsearch.river.mongodb.
MongoDBRiverBulkProcessor] bulkQueueSize [50] - queue [0] - availability [1]
[2015-04-17 12:56:22,217][TRACE][org.elasticsearch.river.mongodb.
MongoDBRiverBulkProcessor] beforeBulk - new bulk [57499] of items [49]
[2015-04-17 12:56:22,259][TRACE][org.elasticsearch.river.mongodb.
MongoDBRiverBulkProcessor] afterBulk - bulk [57499] success [49 items] [42 
ms] total [3952687]
[2015-04-17 12:56:22,387][TRACE][org.elasticsearch.river.mongodb.
MongoDBRiverBulkProcessor] bulkQueueSize [50] - queue [0] - availability [1]
[2015-04-17 12:56:22,387][TRACE][org.elasticsearch.river.mongodb.
MongoDBRiverBulkProcessor] beforeBulk - new bulk [57500] of items [1]
[2015-04-17 12:56:22,389][TRACE][org.elasticsearch.river.mongodb.
MongoDBRiverBulkProcessor] afterBulk - bulk [57500] success [1 items] [2 ms] 
total [3952688]
[2015-04-17 12:56:22,390][INFO ][cluster.metadata ] [Star-Dancer] [
_river] update_mapping [one_pct_sane] (dynamic)
[2015-04-17 13:06:20,497][INFO ][cluster.metadata ] [Star-Dancer] [
_river] update_mapping [one_pct_sane2] (dynamic)
[2015-04-17 13:06:20,513][INFO ][cluster.metadata ] [Star-Dancer] [
_river] update_mapping [one_pct_sane2] (dynamic)
[2015-04-17 13:06:20,523][INFO ][cluster.metadata ] [Star-Dancer] [
_river] update_mapping [one_pct_sane2] (dynamic)
[2015-04-17 13:06:22,394][INFO ][cluster.metadata ] [Star-Dancer] [
_river] update_mapping [one_pct_sane2] (dynamic)




Amongst the several questions I have, these are some :

1. Does the river copy data based on what exists in Oplogs ? (how does the 
river use the oplogs to get the data)


2.  There aren't any obvious errors being shown, and documents do come in. But 
as I mentioned earlier, less than half of the documents in MongoDB 
are being copied over.  
Why would that be?
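
One quick sanity check while digging into this is to compare the document count 
on the Elasticsearch side with the collection count in MongoDB; on the ES side 
(index and type names taken from the trace above):

curl -XGET 'localhost:9200/twitter/one-pct-sane/_count?pretty'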


Thanks for any help/assistance

Ramdev






Re: Distribute Search Results across a specific field / property

2015-04-20 Thread mark
I have a pull request in the works that adds an option for maintaining 
diversity in results: https://github.com/elastic/elasticsearch/pull/10221
This is mainly for the purposes of sample-based aggregations but if used 
with the top_hits aggregation it might give you some of what you need.
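
For the "return 20 results but at most 2 per store" part, a terms aggregation on 
the store field with a top_hits sub-aggregation is one possible sketch (the query 
and field names are placeholders, the store field would need to be not_analyzed, 
and bucket ordering would still need thought since terms orders by document count 
by default):

curl -XGET 'localhost:9200/products/_search' -d '{
  "size": 0,
  "query": { "match": { "name": "lamp" } },
  "aggs": {
    "per_store": {
      "terms": { "field": "store", "size": 10 },
      "aggs": {
        "store_top_hits": {
          "top_hits": { "size": 2 }
        }
      }
    }
  }
}'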

Cheers
Mark



On Monday, April 20, 2015 at 3:18:49 PM UTC+1, Frederik Lipfert wrote:
>
> Hi Guys,
>
> I am using ES to build out the search for an online store. The operator 
> would like to have the results being returned in a way that showcases the 
> varieties of  manufactures he offers. So instead of returning order by 
> score he would like there to be one result from each store on each page, at 
> least on the first few pages.
> The issue is there is one manufactures who makes 80% of the portfolio so 
> it always looks like there is only that one store.
> But if one could "mix" or "distribute" the results to showcase the variety 
> that be pretty cool. 
> I have seen a bunch of online stores do that somehow, although I do not 
> know how.
>
> Is there a way to use ES to somehow do a query like, return 20 results and 
> 2 of each Store?
>
> Thanks for your help.
>
> Frederik
>



Distribute Search Results across a specific field / property

2015-04-20 Thread Frederik Lipfert
Hi Guys,

I am using ES to build out the search for an online store. The operator 
would like the results returned in a way that showcases the variety of 
manufacturers he offers. So instead of returning results ordered by 
score, he would like there to be one result from each store on each page, at 
least on the first few pages.
The issue is that one manufacturer makes 80% of the portfolio, so it 
always looks like there is only that one store.
But if one could "mix" or "distribute" the results to showcase the variety, 
that would be pretty cool. 
I have seen a bunch of online stores do that somehow, although I do not 
know how.

Is there a way to use ES to somehow do a query like, return 20 results and 
2 of each Store?

Thanks for your help.

Frederik



Re: Elasticseach issue with some indicies not populating data

2015-04-20 Thread David Pilato
Having unassigned shards is perfectly fine on a one-node cluster.
The fact that your cluster was yellow does not mean it was not behaving 
correctly.
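
If you do want a single-node cluster to report green anyway, one option is to 
drop replicas to 0; a minimal sketch that applies it to all existing indices:

curl -XPUT 'localhost:9200/_settings' -d '{
  "index": { "number_of_replicas": 0 }
}'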


-- 
David Pilato - Developer | Evangelist 
elastic.co
@dadoonet  | @elasticsearchfr 
 | @scrutmydocs 






> Le 20 avr. 2015 à 15:54, Don Pich  a écrit :
> 
> Hello David,
> 
> I found and this online that made my cluster go 'green'.  
> http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-elasticsearch/
>  
> 
>   I don't know for certain if that was 100% of the problem, but there are no 
> longer unassigned shards.
> 
> root@logstash:/# curl -XGET 'localhost:9200/_cluster/health?pretty=true'
> {
>   "cluster_name" : "es-logstash",
>   "status" : "green",
>   "timed_out" : false,
>   "number_of_nodes" : 2,
>   "number_of_data_nodes" : 2,
>   "active_primary_shards" : 2792,
>   "active_shards" : 5584,
>   "relocating_shards" : 0,
>   "initializing_shards" : 0,
>   "unassigned_shards" : 0
> }
> root@logstash:/#
> 
> However, the root of my problem still exists.  I did restart the forwarders, 
> and TCP dump does show that traffic is indeed hitting the server.  But my 
> indicies folder does not contain fresh data except for one source.
> 
> Don Pich | Jedi Master (aka System Administrator 2) | O: 701-952-5925
> 
> 3320 Westrac Drive South, Suite A * Fargo, ND 58103
> Facebook  | Youtube 
> | Twitter  
> | Google+  | Instagram 
>  | Linkedin 
>  | Our Guiding Principles 
> “If it goes on a truck we 
> got it, if it’s fun we do it” – RealTruck.com 
> 
> On Sun, Apr 19, 2015 at 10:04 PM, David Pilato  > wrote:
> Are you using the same exact JVM version?
> Where do those logs come from? LS ? ES ?
> 
> Could you try the same with a cleaned Elasticsearch ? I mean with no data ?
> My suspicion is that you have too many shards allocated on a single (tiny?) 
> node.
> 
> What is your node size BTW (memory / heap size)?
> 
> David
> 
> Le 19 avr. 2015 à 23:09, Don Pich  > a écrit :
> 
>> Thanks for taking the time to answer David.
>> 
>> Again, got my training wheels on with an ELK stack so I will do my best to 
>> answer.
>> 
>> Here is an example.  The one indecy that is working has a fresh directory 
>> with todays date in the elasticsearch directory.  The ones that are not 
>> working do not have a directory.
>> 
>> Logstash and Elastisearch are running with the logs not generating much 
>> information as far as pointing to any error.
>> 
>> log4j, [2015-04-19T13:41:44.723]  WARN: org.elasticsearch.transport.netty: 
>> [logstash-logstash-3170-2032] Message not fully read (request) for [2] and 
>> action [internal:discovery/zen/unicast_gte_1_4], resetting
>> log4j, [2015-04-19T13:41:49.569]  WARN: org.elasticsearch.transport.netty: 
>> [logstash-logstash-3170-2032] Message not fully read (request) for [5] and 
>> action [internal:discovery/zen/unicast_gte_1_4], resetting
>> log4j, [2015-04-19T13:41:54.572]  WARN: org.elasticsearch.transport.netty: 
>> [logstash-logstash-3170-2032] Message not fully read (request) for [10] and 
>> action [internal:discovery/zen/unicast_gte_1_4], resetting
>> 
>> 
>> 
>> Don Pich | Jedi Master (aka System Administrator 2) | O: 701-952-5925 
>> 
>> 
>> 3320 Westrac Drive South, Suite A * Fargo, ND 58103
>> Facebook  | Youtube 
>> | Twitter 
>>  | Google+  | 
>> Instagram  | Linkedin 
>>  | Our Guiding Principles 
>> “If it goes on a truck we 
>> got it, if it’s fun we do it” – RealTruck.com 
>> 
>> On Sun, Apr 19, 2015 at 2:38 PM, David Pilato > > wrote:
>> From an Elasticsearch point of view, I don't see anything wrong.
>> You have a way too much shards for sure so you might hit OOM exception or 
>> other troubles.
>> 
>> So to answer to your question, check your Elasticsearch logs and if nothing 
>> looks wrong, check logstash.
>> 
>> Just adding that Elasticsearch is not generating data so you probably meant 
>> that logstash stopped generating data, right?
>> 
>> HTH
>> 
>> David
>> 
>> Le 19 avr. 2015 à 21:08, dp...@realtruck.com  a 
>> écrit :
>> 
>>> I am new to elasticsearch and have a problem.  I have 5 indicies.  At first 
>>> all of them were r

Re: creation_date in index setteing

2015-04-20 Thread Colin Goodheart-Smithe
Prashant,

What version of Elasticsearch are you using? 

The index creation date was added to the index settings API in version 1.4.0 
and will only show for indices created with that version or later 
(see https://github.com/elastic/elasticsearch/pull/7218).
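
For an index created on 1.4.0 or later, the value shows up as index.creation_date 
(epoch milliseconds) in the settings; a quick check might look like this, with the 
index name just an example:

curl -XGET 'localhost:9200/myindex/_settings?pretty'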

Colin


On Monday, April 20, 2015 at 2:23:37 PM UTC+1, Prashy wrote:
>
> Hi All, 
>
> I also require the indexing time to be returned by ES, but when i am 
> firing 
> the query like curl -XGET 
> 'http://192.168.0.179:9200/16-04-2015-index/_settings' 
>
> I am not able to get the index_creation time and getting the response as : 
> {"16-04-2015-index":{"settings":{"index":{"uuid":"rHmX564PSnuI8cye4GxA1g","number_of_replicas":"0","number_of_shards":"5","version":{"created":"1030099"}
>  
>
>
> Please let me know how i can get the index creation time for the same. 
>
> ~Prashant 
>
>
>
> -- 
> View this message in context: 
> http://elasticsearch-users.115913.n3.nabble.com/creation-date-in-index-setteing-tp4073837p4073853.html
>  
> Sent from the Elasticsearch Users mailing list archive at Nabble.com. 
>



Re: Elasticseach issue with some indicies not populating data

2015-04-20 Thread Don Pich
Hello David,

I found this online, and it made my cluster go 'green':
http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-elasticsearch/
I don't know for certain if that was 100% of the problem, but there are no
longer unassigned shards.
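
For reference, the usual guard against split brain is
discovery.zen.minimum_master_nodes, set to 2 when there are two master-eligible
nodes. It can be applied dynamically, with the caveat that a two-node cluster
then stops electing a master if either node drops out:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent" : {
    "discovery.zen.minimum_master_nodes" : 2
  }
}'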

root@logstash:/# curl -XGET 'localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "es-logstash",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 2792,
  "active_shards" : 5584,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
root@logstash:/#

However, the root of my problem still exists.  I did restart the
forwarders, and tcpdump does show that traffic is indeed hitting the
server.  But my indices folder does not contain fresh data except for one
source.

Don Pich | Jedi Master (aka System Administrator 2) | O: 701-952-5925

3320 Westrac Drive South, Suite A * Fargo, ND 58103

Facebook  | Youtube
| Twitter
 | Google+  |
Instagram  | Linkedin
 | Our Guiding Principles

“If it goes on a truck we got it, if it’s fun we do it” – RealTruck.com


On Sun, Apr 19, 2015 at 10:04 PM, David Pilato  wrote:

> Are you using the same exact JVM version?
> Where do those logs come from? LS ? ES ?
>
> Could you try the same with a cleaned Elasticsearch ? I mean with no data ?
> My suspicion is that you have too many shards allocated on a single
> (tiny?) node.
>
> What is your node size BTW (memory / heap size)?
>
> David
>
> Le 19 avr. 2015 à 23:09, Don Pich  a écrit :
>
> Thanks for taking the time to answer David.
>
> Again, got my training wheels on with an ELK stack so I will do my best to
> answer.
>
> Here is an example.  The one indecy that is working has a fresh directory
> with todays date in the elasticsearch directory.  The ones that are not
> working do not have a directory.
>
> Logstash and Elastisearch are running with the logs not generating much
> information as far as pointing to any error.
>
> log4j, [2015-04-19T13:41:44.723]  WARN: org.elasticsearch.transport.netty:
> [logstash-logstash-3170-2032] Message not fully read (request) for [2] and
> action [internal:discovery/zen/unicast_gte_1_4], resetting
> log4j, [2015-04-19T13:41:49.569]  WARN: org.elasticsearch.transport.netty:
> [logstash-logstash-3170-2032] Message not fully read (request) for [5] and
> action [internal:discovery/zen/unicast_gte_1_4], resetting
> log4j, [2015-04-19T13:41:54.572]  WARN: org.elasticsearch.transport.netty:
> [logstash-logstash-3170-2032] Message not fully read (request) for [10] and
> action [internal:discovery/zen/unicast_gte_1_4], resetting
>
>
>
> Don Pich | Jedi Master (aka System Administrator 2) | O: 701-952-5925
>
> 3320 Westrac Drive South, Suite A * Fargo, ND 58103
>
> Facebook  | Youtube
> | Twitter
>  | Google+ 
> | Instagram  | Linkedin
>  | Our Guiding Principles
> 
> “If it goes on a truck we got it, if it’s fun we do it” – RealTruck.com
> 
>
> On Sun, Apr 19, 2015 at 2:38 PM, David Pilato  wrote:
>
>> From an Elasticsearch point of view, I don't see anything wrong.
>> You have a way too much shards for sure so you might hit OOM exception or
>> other troubles.
>>
>> So to answer to your question, check your Elasticsearch logs and if
>> nothing looks wrong, check logstash.
>>
>> Just adding that Elasticsearch is not generating data so you probably
>> meant that logstash stopped generating data, right?
>>
>> HTH
>>
>> David
>>
>> Le 19 avr. 2015 à 21:08, dp...@realtruck.com a écrit :
>>
>> I am new to elasticsearch and have a problem.  I have 5 indices.  At
>> first all of them were running without issue.  However, over the last 2
>> weeks, all but one have stopped generating data.  I have run a tcpdump on
>> the logstash server and confirmed that logging packets are getting to the
>> server.  I have looked into the server's health.  I have issued the
>> following to check on the cluster:
>>
>> root@logstash:/# curl -XGET 'localhost:9200/_cluster/health?pretty=true'
>> {
>>   "cluster_name" : "es-logstash",
>>   "status" : "yellow",
>>   "timed_out" : false,
>>   "number_of_nodes" : 1,
>>   "number_of_data_nodes" : 1,
>>   "active_primary_shards" : 2791,
>>   "active_shards" : 2791,
>>   "relocating_shards" : 0,
>>   "initializing_shards" : 0,
>>   "unassigned_shards" : 2791
>> }
>> root@logstash:/#
>>
>>
>> Can someone please point me in the right direction on troubleshooting this?
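
The "Message not fully read ... unicast_gte_1_4" warnings quoted above usually mean
nodes running different Elasticsearch versions are pinging each other during
discovery, so a quick version check is worth doing; a minimal sketch:

curl -XGET 'localhost:9200/_cat/nodes?v&h=host,name,version,jdk'

Every node listed, including any node embedded by the Logstash elasticsearch output
(if you use the node protocol), should ideally report the same Elasticsearch version.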

Elasticsearch puppet module's problem

2015-04-20 Thread Sergey Zemlyanoy
Dear support,

I'm trying to set up ES using this module - 
https://github.com/elastic/puppet-elasticsearch/blob/master/README.md
For the sake of testing I spun up a new VM and tried to apply the default module to it:

node somenode{
include elasticsearch
}

It ends up with the RPM package installed but no service.

Info: Applying configuration version '1429537010'
Notice: /Stage[main]/Elasticsearch::Package/Package[elasticsearch]/ensure: 
created
Info: Computing checksum on file /etc/init.d/elasticsearch
Info: FileBucket got a duplicate file {md5}cbd974e711b257c941dddca3aef79e46
Info: /Stage[main]/Elasticsearch::Config/File[/etc/init.d/elasticsearch]: 
Filebucketed /etc/init.d/elasticsearch to puppet with sum 
cbd974e711b257c941dddca3aef79e46
Notice: 
/Stage[main]/Elasticsearch::Config/File[/etc/init.d/elasticsearch]/ensure: 
removed
Info: Computing checksum on file 
/usr/lib/systemd/system/elasticsearch.service
Info: FileBucket got a duplicate file {md5}649153421e86836314671510160c3798
Notice: 
/Stage[main]/Elasticsearch::Config/File[/usr/lib/systemd/system/elasticsearch.service]/ensure:
 
removed
Notice: 
/Stage[main]/Elasticsearch::Config/File[/usr/share/elasticsearch/lib/spatial4j-0.4.1.jar]/owner:
 
owner changed 'root' to 'elasticsearch'
Notice: 
/Stage[main]/Elasticsearch::Config/File[/usr/share/elasticsearch/lib/spatial4j-0.4.1.jar]/group:
 
group changed 'root' to 'elasticsearch'
Notice: 
/Stage[main]/Elasticsearch::Config/File[/usr/share/elasticsearch/lib/asm-4.1.jar]/owner:
 
owner changed 'root' to 'elasticsearch'
Notice: 
/Stage[main]/Elasticsearch::Config/File[/usr/share/elasticsearch/lib/asm-4.1.jar]/group:
 
group changed 'root' to 'elasticsearch'
Notice: 
/Stage[main]/Elasticsearch::Config/File[/usr/share/elasticsearch/lib/lucene-join-4.10.4.jar]/owner:
 
owner changed 'root' to 'elasticsearch'
Notice: 
/Stage[main]/Elasticsearch::Config/File[/usr/share/elasticsearch/lib/lucene-join-4.10.4.jar]/group:
 
group changed 'root' to 'elasticsearch'

...

Info: Computing checksum on file /etc/elasticsearch/elasticsearch.yml
Info: FileBucket got a duplicate file {md5}08a09998560b7b786eca1e594b004ddc
Info: 
/Stage[main]/Elasticsearch::Config/File[/etc/elasticsearch/elasticsearch.yml]: 
Filebucketed /etc/elasticsearch/elasticsearch.yml to puppet with sum 
08a09998560b7b786eca1e594b004ddc
Notice: 
/Stage[main]/Elasticsearch::Config/File[/etc/elasticsearch/elasticsearch.yml]/ensure:
 
removed
Info: Computing checksum on file /etc/elasticsearch/logging.yml
Info: FileBucket got a duplicate file {md5}c0d21de98dc9a6015943dc47f2aa18f5
Info: 
/Stage[main]/Elasticsearch::Config/File[/etc/elasticsearch/logging.yml]: 
Filebucketed /etc/elasticsearch/logging.yml to puppet with sum 
c0d21de98dc9a6015943dc47f2aa18f5
Notice: 
/Stage[main]/Elasticsearch::Config/File[/etc/elasticsearch/logging.yml]/ensure: 
removed
Notice: 
/Stage[main]/Elasticsearch::Config/File[/usr/share/elasticsearch/lib/jna-4.1.0.jar]/owner:
 
owner changed 'root' to 'elasticsearch'
Notice: 
/Stage[main]/Elasticsearch::Config/File[/usr/share/elasticsearch/lib/jna-4.1.0.jar]/group:
 
group changed 'root' to 'elasticsearch'
Notice: Finished catalog run in 43.57 seconds



Also I noticed that the module doesn't manage the repository by default, so I 
need to include the variables
elasticsearch::manage_repo: true
elasticsearch::repo_version: "1.5"


So what am I doing wrong?

Regards
Sergey
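
For what it's worth, with recent versions of that module the base class only installs
the package; the service is managed per instance, so a minimal sketch (the instance
name 'es-01' is just an example) would be:

node somenode {
  class { 'elasticsearch':
    manage_repo  => true,
    repo_version => '1.5',
  }

  elasticsearch::instance { 'es-01': }
}

Without at least one elasticsearch::instance resource no init script or systemd unit
is left behind, which matches the "removed" lines in the catalog run above.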



Re: creation_date in index setteing

2015-04-20 Thread Prashant Agrawal
Hi All,

I also require the index creation time to be returned by ES, but when I fire
a query like curl -XGET
'http://192.168.0.179:9200/16-04-2015-index/_settings'

I am not able to get the index creation time, and I get the response as:
{"16-04-2015-index":{"settings":{"index":{"uuid":"rHmX564PSnuI8cye4GxA1g","number_of_replicas":"0","number_of_shards":"5","version":{"created":"1030099"}

Please let me know how I can get the index creation time for the same.

~Prashant
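
If it helps: the "created":"1030099" in that response indicates the index was created
by a 1.3.0 node, and as far as I know the creation_date setting is only recorded for
indices created on 1.4.0 or later, so older indices simply do not have it. On a newer
index the same call returns it, roughly like this (the timestamp below is illustrative):

curl -XGET 'http://192.168.0.179:9200/16-04-2015-index/_settings?pretty'
{
  "16-04-2015-index" : {
    "settings" : {
      "index" : {
        "creation_date" : "1429142400000",
        ...
      }
    }
  }
}

Existing indices do not gain the setting retroactively; it only appears on indices
created after the upgrade.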



--
View this message in context: 
http://elasticsearch-users.115913.n3.nabble.com/creation-date-in-index-setteing-tp4073837p4073853.html
Sent from the Elasticsearch Users mailing list archive at Nabble.com.



elasticsearch machine configuration

2015-04-20 Thread guoyiqincn
Hi folks:
I want to know the recommended configuration for a single Elasticsearch node. 
My machine currently has 2 CPUs and 4 GB of memory.



Re: Evaluating Moving to Discourse - Feedback Wanted

2015-04-20 Thread Ivan Brusic
I believe the best developers are cynics.  Never trust someone else's code,
that API, the OS, etc :)

What bothers me about Discourse is that email is an afterthought. They have
not built out that feature yet?  For me, and apparently many others,  email
is the first concern.

The transition is understandable if you want to transition from a closed
system. The other reason, not enough fragmentation, is worrisome. If no one
uses the logstash list,  that is a problem with the site/documentation, not
the mailing list itself. I cringe at the thought of an Elasticsearch forum
with a dozen subforums.

Ivan
On Apr 15, 2015 7:21 PM, "Leslie Hawthorn" 
wrote:

>
>
> On Wed, Apr 15, 2015 at 9:02 AM, Ivan Brusic  wrote:
>
>> I should clarify that I have no issues moving to Discourse, as long as
>> instantaneous email interaction is preserved, just wanted to point out that
>> I see no issues with the mailing lists.
>>
> Understood.
>
>> The question is moot anyways since the change will happen regardless of
>> our inputs.
>>
> Actually, I'm maintaining our Forums pre-launch checklist where there's a
> line item for "don't move forward based on community feedback". I
> respectfully disagree with your assessment that the change will happen
> regardless of input from the community. We asked for feedback for a reason.
> :)
>
>> I hope we can subscribe to Discourse mailing lists without needing an
>> account.
>>
> You'll need an account, but it's a one-time login to set up your
> preferences and then read/interact solely via email.
>
> Cheers,
> LH
>
>> Cheers,
>>
>> Ivan
>> On Apr 13, 2015 7:13 PM, "Leslie Hawthorn" 
>> wrote:
>>
>>> Thanks for your feedback, Ivan.
>>>
>>> There's no plan to remove threads from the forums, so information would
>>> always be archived there as well.
>>>
>>> Does that impact your thoughts on moving to Discourse?
>>>
>>> Folks, please keep the feedback coming!
>>>
>>> Cheers,
>>> LH
>>>
>>> On Sat, Apr 11, 2015 at 12:09 AM, Ivan Brusic  wrote:
>>>
 As one of the oldest and most frequent users (before my sabbatical) of
 the mailing list, I just wanted to say that I never had an issue with it.
 It works. As long as I could continue using only email, I am happy.

 For realtime communication, there is the IRC channel. If prefer the
 mailing list since everything is archived.

 Ivan
  On Apr 2, 2015 5:36 PM, "leslie.hawthorn" 
 wrote:

> Hello everyone,
>
> As we’ve begun to scale up development on three different open source
> projects, we’ve found Google Groups to be a difficult solution for dealing
> with all of our needs for community support. We’ve got multiple mailing
> lists going, which can be confusing for new folks trying to figure out
> where to go to ask a question.
>
> We’ve also found our lists are becoming noisy in the “good problem to
> have” kind of way. As we’ve seen more user adoption, and across such a 
> wide
> variety of use cases, we’re getting widely different types of questions
> asked. For example, I can imagine that folks not using our Python client
> would rather not be distracted with emails about it.
>
> There’s also a few other strikes against Groups as a tool, such as the
> fact that it is no longer a supported product by Google, it provides no 
> API
> hooks and it is not available for users in China.
>
> We’ve evaluated several options and we’re currently considering
> shuttering the elasticsearch-user and logstash-users Google Groups in 
> favor
> of a Discourse forum. You can read more about Discourse at
> http://www.discourse.org
>
> We feel Discourse will allow us to provide a better experience for all
> of our users for a few reasons:
>
> * More fine grained conversation topics = less noise and better
> targeted discussions. e.g. we can offer a forum for each language client,
> individual logstash plugin or for each city to plan user group meetings,
> etc.
>
> * Facilitates discussions that are not generally happening on list
> now, such as best practices by use case or tips from moving to development
> to production
>
> * Easier for folks who are purely end users - and less used to getting
> peer support on a mailing list - to get help when they need it
>
> Obviously, Discourse does not function the exact same way as a mailing
> list - however, email interaction with Discourse is supported and will
> continue to allow you to participate in discussions over email (though
> there are some small issues related to in-line replies. [0])
>
> We’re working with the Discourse team now as part of evaluating this
> transition, and we know they’re working to resolve this particular issue.
> We’re also still determining how Discourse will handle our needs for both
> user and list archive migration, and we’ll know the precise detail

Elasticsearch service often goes down or gets killed

2015-04-20 Thread Sébastien Vassaux
Hello!

My webserver is running Ubuntu 14.10 with Elasticsearch 1.5.0 and Java 
1.7u55.

For some reason, the Elasticsearch service often goes down, resulting in 
my website no longer being available to my users (using FOSElasticaBundle 
with Symfony).

I am using systemctl to restart it automatically, but I would prefer a good 
fix once and for all. I feel the logs I have are not descriptive enough.
Being pretty new to managing a server, I need some help.

Can someone help me figure out the reason for this failure ? What are the 
right files I can output here to better understand the issue ?

Thanks !

My systemctl status gives:

elasticsearch.service - ElasticSearch
   Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled)
   Active: active (running) since Mon 2015-04-20 12:04:24 CEST; 1h 56min 
ago  <- Here it means restarted 1h56 ago. Why did it 
fail in the first place ?
 Main PID: 9120 (java)
   CGroup: /system.slice/elasticsearch.service
   └─9120 /usr/bin/java -Xms256m -Xmx1g -Djava.awt.headless=true 
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC 
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingO...

In my journalctl, I have:

Apr 18 18:56:19 xx.ovh.net sshd[29397]: error: open /dev/tty failed - could 
not set controlling tty: Permission denied
Apr 20 13:52:45 xx.ovh.net sshd[9764]: error: open /dev/tty failed - could 
not set controlling tty: Permission denied
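
With -Xms256m -Xmx1g the JVM has very little headroom, so the kernel OOM killer or a
fatal JVM error is a likely suspect; a minimal sketch of commands for gathering
evidence (paths are the usual defaults and may differ on your box):

# service-specific journal entries around the time of the failure
journalctl -u elasticsearch.service --since "2015-04-20 11:00" --until "2015-04-20 12:10"

# kernel messages about the OOM killer
dmesg | grep -iE 'out of memory|killed process'

# Elasticsearch's own logs and any JVM crash dumps
ls -l /var/log/elasticsearch/ /var/lib/elasticsearch/hs_err_pid*.log

If the OOM killer shows up in dmesg, raising the heap or adding memory is the fix;
if hs_err_pid files exist, they name the JVM-level cause.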



Re: jdbcRiver rebuilding after restart.

2015-04-20 Thread GWired
I can't look at the feeder setup now but I could in the future.

Is my SQL statement incorrect?

Should I be doing something differently?

Does the river not utilize created_at and updated_at in this setup?  I 
don't have a where clause because I thought the column strategy would take 
that into account.

This is an example of what I see in SQL server:

SELECT id as _id, * FROM [MyDBName].[dbo].[MyTableName] WHERE ({fn 
TIMESTAMPDIFF(SQL_TSI_SECOND,@P0,"createddate")} >= 0)  

Which when i populate the @P0 with a timestamp it seems to be working fine.

On a restart I'm guessing it doesn't know when to start.

Any way that I can check values in elasticsearch within the column 
strategy?  Such as using Max(CreatedDate) so that it can start there?
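
For reference, this is roughly the shape of a column-strategy river definition from
the JDBC river README; treat it as a sketch (the connection details are placeholders,
and exactly where the strategy and column names go differs between river versions):

curl -XPUT 'localhost:9200/_river/my_jdbc_river/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "strategy" : "column",
    "url" : "jdbc:sqlserver://dbhost:1433;databaseName=MyDBName",
    "user" : "es_reader",
    "password" : "...",
    "sql" : "select id as _id, * from dbo.MyTableName"
  }
}'

If I remember the README correctly, the column strategy compares against columns
named created_at/updated_at by default and keeps its own last-run timestamp between
runs, so the README for your exact river release is the authoritative reference on
what state survives a restart.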




Re: Access to specific kibana dashboards

2015-04-20 Thread Rubaiyat Islam Sadat
Hi Mark,

Thanks mate. I have marked it as complete and will try this solution.


On Sunday, April 19, 2015 at 4:44:34 AM UTC+2, Mark Walkom wrote:
>
> If you load kibana up you will see it gives you URLs like 
> /dashboard/file/default.json or 
> /dashboard/elasticsearch/dashboardname.json.
>
> Using those paths you can then limit access.
>
> On 17 April 2015 at 15:54, Rubaiyat Islam Sadat  > wrote:
>
>> Thanks Mark for your kind reply. Would you be a bit more specific, as I am 
>> a newbie? I am sorry if I have not been clear enough about what I want to achieve. 
>> As far as I know, Apache-level access control is based on a relative static 
>> path/URL; it won't know the details of how Kibana works. I would like to restrict 
>> access to 'some' of the Kibana dashboards, not all. Is it possible to 
>> achieve this by configuring the Kibana side? If on the Apache side, do I have to 
>> restrict the specific URLs of the Kibana dashboards to a specific group of 
>> people, e.g. as follows:
>>
>> 
>> Order deny,allow
>> deny from all
>> allow from 192.168.
>> allow from 104.113.
>>   
>>
>>   
>> Order deny,allow
>> deny from all
>> allow from 192.168.
>> allow from 104.113.
>>   
>>
>> In this case, for example, if I want to restrict an URL like 
>> http://myESHost:9200/_plugin/kopf/#/!/cluster, what do I have to put 
>> in the Location directive? Sorry if I have asked a very naive question.
>>
>> Thanks again for your time.
>>
>> Cheers!
>> Ruby
>>
>> On Friday, April 17, 2015 at 12:23:50 AM UTC+2, Mark Walkom wrote:
>>>
>>> You could do this with apache/nginx ACLs as KB3 simply loads a path, 
>>> either a file from the server's FS or from ES.
>>>
>>> If you load it up you will see it in the URL.
>>>
>>> On 16 April 2015 at 21:58, Rubaiyat Islam Sadat  
>>> wrote:
>>>
 Hi all,

 As a complete newbie here, I am going to ask you a question which you 
 might find naive (or stupid!). I have a scenario where I would like to 
 restrict access from specific locations (say, IP addresses) to 'specific' 
 dashboards in Kibana. As far as I know, Apache-level access control is 
 based on a relative static path/URL; it won't know the details of how 
 Kibana works. Is there any way/suggestion to control which users can load 
 which dashboards? Or maybe I'm wrong and there is a way to do that. Your 
 suggestions would be really helpful. I am using Kibana 3 and I am not in a 
 position to use Shield.
 position to use Shield.

 Cheers!
 Ruby

 -- 
 You received this message because you are subscribed to the Google 
 Groups "elasticsearch" group.
 To unsubscribe from this group and stop receiving emails from it, send 
 an email to elasticsearc...@googlegroups.com.
 To view this discussion on the web visit 
 https://groups.google.com/d/msgid/elasticsearch/cc652358-4d42-4263-9238-a76f42de5dad%40googlegroups.com
  
 
 .
 For more options, visit https://groups.google.com/d/optout.

>>>
>>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/3ca9c8e6-4861-46dc-9b34-b64931b46869%40googlegroups.com
>>  
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
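
To make the suggestion above concrete, a minimal Apache sketch; the dashboard path
(ops-dashboard.json) is only an example, and the allowed networks are the ones from
the snippet earlier in the thread:

<Location /dashboard/elasticsearch/ops-dashboard.json>
    Order deny,allow
    Deny from all
    Allow from 192.168.
    Allow from 104.113.
</Location>

Each dashboard you want to lock down gets its own Location block keyed on the path
Kibana shows when you load it; anything not matched stays open, so the default
dashboard may need a block of its own.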



Corrupted Index

2015-04-20 Thread Ranjith Venkatesan
Dear all,

We are using ES 1.3.7 for our search application. Some time back we upgraded 
from 0.90.5 to 1.3.7. We have 2 master nodes and 3 data nodes. We are 
getting CorruptedIndexException when shard initialization happens. 
This is the second time we have faced such an issue since the last upgrade. Last 
time, only one shard got corrupted, but now almost 15 to 20 shards are 
corrupted. Each shard has only ~500MB of data.

log trace :

[2015-04-19 22:49:57,552][WARN ][cluster.action.shard ] [Node1] 
>> [138][3] received shard failed for [138][3], node[EkvXNBUOTcuEfWo4SG72bA], 
>> [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, 
>> message [IndexShardGatewayRecoveryException[[138][3] failed to fetch index 
>> version after copying it over]; nested: CorruptIndexException[[138][3] 
>> Corrupted index [corrupted_gb9JvBzdRQKqkhEeaXFEIA] caused by: 
>> CorruptIndexException[checksum failed (hardware problem?) : expected=637c1x 
>> actual=gavi2b 
>> resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@54916f2a)]]; ]]
>
> [2015-04-19 22:49:57,626][WARN ][cluster.action.shard ] [Node1] 
>> [138][3] received shard failed for [138][3], node[Q1eAQgNtSJ2BLlMevzRzcA], 
>> [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, 
>> message [IndexShardGatewayRecoveryException[[138][3] failed recovery]; 
>> nested: EngineCreationFailureException[[138][3] failed to open reader on 
>> writer]; nested: FileNotFoundException[No such file [_1gy_x.del]]; ]]
>
>
>

Thanks in advance

Ranjith Venkatesan 



Re: Cannot read from Elasticsearch using Spark SQL

2015-04-20 Thread Costin Leau

Beta3 works with Spark SQL 1.0 and 1.1.
Spark SQL 1.2 was released after that and broke binary backwards compatibility; however, this has been fixed in the master/dev 
version [1].
Note that Spark SQL 1.3 was released as well and again broke backwards compatibility, this time significantly, hence why 
there are now two versions [2] - make sure to use the one appropriate for your Spark version.


Cheers,

[1] 
http://www.elastic.co/guide/en/elasticsearch/hadoop/master/install.html#download-dev
[2] 
http://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-sql-versions

On 4/20/15 1:17 PM, michele crudele wrote:


I wrote this simple notebook in scala using Elasticsearch Spark adapter:

%AddJar 
file:///tools/elasticsearch-hadoop-2.1.0.Beta3/dist/elasticsearch-spark_2.10-2.1.0.BUILD-SNAPSHOT.jar
%AddJar 
file:///tools/elasticsearch-hadoop-2.1.0.Beta3/dist/elasticsearch-hadoop-2.1.0.BUILD-SNAPSHOT.jar


// Write something to spark/docs index
//
import org.elasticsearch.spark._

val michelangelo = Map("artist" -> "Michelangelo", "bio" -> "Painter, sculptor, 
architect born in Florence in... ")
val leonardo = Map("artist" -> "Leonardo", "bio" -> "Absolute genius; painter, 
inventor born in the little village of
Vinci...")

sc.makeRDD(Seq(michelangelo, leonardo)).saveToEs("spark/docs")

// Search for painters through spark/docs
//
val painters = sc.esRDD("spark/docs", "?q=painter")
println("Number of painters in spark/docs: " + painters.count())

painters.collect().foreach(x => println("ID: " + x._1))

// Try to read using SparkSQL
//
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._

val sql = new SQLContext(sc)

// Here is where I get an exception
//
val docsSql = sql.esRDD("spark/docs")


Name: java.lang.NoSuchMethodError
Message:
org.apache.spark.sql.catalyst.types.StructField.(Ljava/lang/String;Lorg/apache/spark/sql/catalyst/types/DataType;Z)V
StackTrace:
org.elasticsearch.spark.sql.MappingUtils$.org$elasticsearch$spark$sql$MappingUtils$$convertField(MappingUtils.scala:75)
org.elasticsearch.spark.sql.MappingUtils$$anonfun$convertToStruct$1.apply(MappingUtils.scala:54)
org.elasticsearch.spark.sql.MappingUtils$$anonfun$convertToStruct$1.apply(MappingUtils.scala:54)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.elasticsearch.spark.sql.MappingUtils$.convertToStruct(MappingUtils.scala:54)
org.elasticsearch.spark.sql.MappingUtils$.discoverMapping(MappingUtils.scala:47)
org.elasticsearch.spark.sql.EsSparkSQL$.esRDD(EsSparkSQL.scala:27)
org.elasticsearch.spark.sql.EsSparkSQL$.esRDD(EsSparkSQL.scala:23)
org.elasticsearch.spark.sql.package$SQLContextFunctions.esRDD(package.scala:16)
$line460.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:111)
$line460.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:116)
$line460.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:118)
$line460.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$




What am I doing wrong? Thanks
- Michele



--
Costin


Cannot read from Elasticsearch using Spark SQL

2015-04-20 Thread michele crudele

I wrote this simple notebook in scala using Elasticsearch Spark adapter:

%AddJar 
file:///tools/elasticsearch-hadoop-2.1.0.Beta3/dist/elasticsearch-spark_2.10-2.1.0.BUILD-SNAPSHOT.jar
%AddJar 
file:///tools/elasticsearch-hadoop-2.1.0.Beta3/dist/elasticsearch-hadoop-2.1.0.BUILD-SNAPSHOT.jar


// Write something to spark/docs index
//
import org.elasticsearch.spark._

val michelangelo = Map("artist" -> "Michelangelo", "bio" -> "Painter, 
sculptor, architect born in Florence in... ")
val leonardo = Map("artist" -> "Leonardo", "bio" -> "Absolute genius; 
painter, inventor born in the little village of Vinci...")

sc.makeRDD(Seq(michelangelo, leonardo)).saveToEs("spark/docs")

// Search for painters through spark/docs
//
val painters = sc.esRDD("spark/docs", "?q=painter")
println("Number of painters in spark/docs: " + painters.count())

painters.collect().foreach(x => println("ID: " + x._1))

// Try to read using SparkSQL
//
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._

val sql = new SQLContext(sc)

// Here is where I get an exception
//
val docsSql = sql.esRDD("spark/docs") 


Name: java.lang.NoSuchMethodError
Message: 
org.apache.spark.sql.catalyst.types.StructField.(Ljava/lang/String;Lorg/apache/spark/sql/catalyst/types/DataType;Z)V
StackTrace: 
org.elasticsearch.spark.sql.MappingUtils$.org$elasticsearch$spark$sql$MappingUtils$$convertField(MappingUtils.scala:75)
org.elasticsearch.spark.sql.MappingUtils$$anonfun$convertToStruct$1.apply(MappingUtils.scala:54)
org.elasticsearch.spark.sql.MappingUtils$$anonfun$convertToStruct$1.apply(MappingUtils.scala:54)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.elasticsearch.spark.sql.MappingUtils$.convertToStruct(MappingUtils.scala:54)
org.elasticsearch.spark.sql.MappingUtils$.discoverMapping(MappingUtils.scala:47)
org.elasticsearch.spark.sql.EsSparkSQL$.esRDD(EsSparkSQL.scala:27)
org.elasticsearch.spark.sql.EsSparkSQL$.esRDD(EsSparkSQL.scala:23)
org.elasticsearch.spark.sql.package$SQLContextFunctions.esRDD(package.scala:16)
$line460.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:111)
$line460.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:116)
$line460.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:118)
$line460.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$




What am I doing wrong? Thanks
- Michele



KIBANA-4 Flow Chart

2015-04-20 Thread vijay kumar Ks
Hi everyone, I am putting together a flow chart for Kibana 4, which I am using. My 
problem is that since the Kibana home page loads via JavaScript files, I am not able 
to follow such long scripts.
Can anyone help with a flow chart for Kibana 4 (or the flow of Kibana through its 
scripts and HTML)? Thanks in advance.



Creating Snapshot Repository on Windows cluster

2015-04-20 Thread Sam Judson
Hi

I'm having some trouble creating a snapshot repository on a cluster running 
on Windows.

PUT _snapshot/main_backup
{
  "type": "fs",
  "settings": {
"location": "gbr-t-ess-003\\Snapshots\\backup2\\",
"compress": true
  }
}

The above fails with a BlobStoreException: Failed to create directory.

I've set the Share to Everyone Read/Write access to hopefully get this 
working but still no good.

I've tried creating symbolic links on each machine and using local 
directory path, but no luck either.

Anyone got this working?

Sam
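
One thing worth checking: for a shared filesystem repository every node has to be able
to write to the same path, so on Windows that usually means a full UNC path rather
than a share-relative one, and the account the Elasticsearch Windows service runs as
(not just "Everyone") needs modify rights on the share. A hedged sketch:

PUT _snapshot/main_backup
{
  "type": "fs",
  "settings": {
    "location": "\\\\gbr-t-ess-003\\Snapshots\\backup2",
    "compress": true
  }
}

The doubled backslashes are JSON escaping; the actual path used is
\\gbr-t-ess-003\Snapshots\backup2.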



Re: creation_date in index setteing

2015-04-20 Thread tao hiko
Thank you very much Christian.

On Monday, April 20, 2015 at 2:29:29 PM UTC+7, 
christian...@elasticsearch.com wrote:
>
> The creation date is given with millisecond precision. Take away the last 
> 3 digits and your converter gives Fri, 06 Mar 2015 08:44:57 GMT for 
> 1425631497.
>
> Christian
>
>
>
> On Monday, April 20, 2015 at 5:06:40 AM UTC+1, tao hiko wrote:
>>
>> I query setting information of index and found that have creation_date 
>> field but I cannot understand what is value. Can you explain me more?
>>
>>   "settings": {
>>  "index": {
>> "creation_date": "1425631497164",
>> "uuid": "A9ZXMK7zTjSyB_oWpvf7fg",
>> "analysis": {
>>"analyzer": {
>>   "sensible_analyzer": {
>>  "type": "custom",
>>  "filter": [
>> "word_filter",
>> "lowercase"
>>  ],
>>  "tokenizer": "keyword"
>>   }
>>},
>>"filter": {
>>   "word_filter": {
>>  "split_on_numerics": "false",
>>  "type": "word_delimiter",
>>  "preserve_original": "true",
>>  "split_on_case_change": "false",
>>  "generate_number_parts": "false",
>>  "generate_word_parts": "false"
>>   }
>>}
>> },
>> "number_of_replicas": "1",
>> "number_of_shards": "6",
>> "version": {
>>"created": "1040399"
>> }
>>  }
>>   }
>>
>> Thank you,
>> Hiko
>>
>



Re: creation_date in index setteing

2015-04-20 Thread christian . dahlqvist
The creation date is given with millisecond precision. Take away the last 3 
digits and your converter gives Fri, 06 Mar 2015 08:44:57 GMT for 1425631497.
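
For example, dropping the last three digits and feeding the result to date gives the
same answer:

$ date -u -d @1425631497
Fri Mar  6 08:44:57 UTC 2015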

Christian



On Monday, April 20, 2015 at 5:06:40 AM UTC+1, tao hiko wrote:
>
> I query setting information of index and found that have creation_date 
> field but I cannot understand what is value. Can you explain me more?
>
>   "settings": {
>  "index": {
> "creation_date": "1425631497164",
> "uuid": "A9ZXMK7zTjSyB_oWpvf7fg",
> "analysis": {
>"analyzer": {
>   "sensible_analyzer": {
>  "type": "custom",
>  "filter": [
> "word_filter",
> "lowercase"
>  ],
>  "tokenizer": "keyword"
>   }
>},
>"filter": {
>   "word_filter": {
>  "split_on_numerics": "false",
>  "type": "word_delimiter",
>  "preserve_original": "true",
>  "split_on_case_change": "false",
>  "generate_number_parts": "false",
>  "generate_word_parts": "false"
>   }
>}
> },
> "number_of_replicas": "1",
> "number_of_shards": "6",
> "version": {
>"created": "1040399"
> }
>  }
>   }
>
> Thank you,
> Hiko
>
