Re: Latitude -> Lat, Longitude -> lon

2014-07-06 Thread David Pilato
You could maybe try to use script filters and add lat and lon fields on the fly,
or a String representing your point.

See doc: https://github.com/elasticsearch/elasticsearch-river-couchdb
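
For illustration, the river's document-transform hook looks roughly like the sketch below when registering the river; the river name, database, index and the loop are placeholders (the loop is Groovy-style, so adjust it to whatever script language your node has enabled, and check the river's README for the exact option names your version supports):

curl -XPUT 'localhost:9200/_river/my_couch_river/_meta' -d '{
  "type" : "couchdb",
  "couchdb" : {
    "host" : "localhost",
    "port" : 5984,
    "db" : "my_db",
    "script" : "for (item in ctx.doc.items) { item.location.lat = item.location.latitude; item.location.lon = item.location.longitude }"
  },
  "index" : {
    "index" : "my_db",
    "type" : "my_db"
  }
}'

The script runs on each CouchDB document before it is indexed, so the lat/lon fields would already exist by the time the mapping sees them.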

HTH

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


On 7 Jul 2014, at 02:23, Olivier B wrote:

Hi there,

I understand a geo-point can be mapped based on two fields: "lat" and "lon".
However, my fields are named "longitude" and "latitude".
I'm using the river plugin for couchdb and I cannot really rename those fields 
before indexing. And those fields are part of an item in an array:

"items": [
  { 
"item_id" : "abcd",
"location": 
{
  "longitude": 145.7711,
  "latitude": -16.92359
}
  },
  { 
"item_id" : "efgh",
"location": 
{
  "longitude": 149.6611,
  "latitude": -19.94098
}
  }
]

So, any idea how I can rename those fields? And possibly map them to a geo point
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-geo-point-type.html)?

Cheers



Re: Elasticsearch wont start on Ubuntu 14.04

2014-07-06 Thread Mark Walkom
Are you sure? Unless I am misreading this, the OS is picking up v6:

> + for jdir in '$JDK_DIRS'
> + '[' -r /usr/lib/jvm/java-6-openjdk-amd64/bin/java -a -z '' ']'
> + JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
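
If your Java 7 lives outside the paths listed in JDK_DIRS (typical for a manual Oracle install), one option is to pin JAVA_HOME in /etc/default/elasticsearch; judging from the trace, that file is sourced after the JDK_DIRS scan, but verify against your copy of the init script. The path below is only an example:

# /etc/default/elasticsearch
# point this at the JDK 7 that `readlink -f $(which java)` resolves to
JAVA_HOME=/usr/lib/jvm/java-7-oracle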


Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com


On 7 July 2014 12:47, Steven Yue  wrote:

> No, I have 7.
>
> Run 'java -version' shows this:
>
> java version "1.7.0_60"
>
> Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
>
> Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
>
> On Tuesday, July 1, 2014 11:29:05 PM UTC-4, Mark Walkom wrote:
>>
>> You're on Java 6 by the looks of it, ES won't run on anything less than 7.
>>
>> Regards,
>> Mark Walkom
>>
>> Infrastructure Engineer
>> Campaign Monitor
>> email: ma...@campaignmonitor.com
>> web: www.campaignmonitor.com
>>
>>
>> On 1 July 2014 22:57, Steven Yue  wrote:
>>
>>> Hi, Alex
>>>
>>> Below is the output:
>>>
>>> ++ id -u
>>>
>>> + '[' 0 -ne 0 ']'
>>>
>>> + . /lib/lsb/init-functions
>>>
>>> +++ run-parts --lsbsysinit --list /lib/lsb/init-functions.d
>>>
>>> ++ for hook in '$(run-parts --lsbsysinit --list
>>> /lib/lsb/init-functions.d 2>/dev/null)'
>>>
>>> ++ '[' -r /lib/lsb/init-functions.d/20-left-info-blocks ']'
>>>
>>> ++ . /lib/lsb/init-functions.d/20-left-info-blocks
>>>
>>> ++ for hook in '$(run-parts --lsbsysinit --list
>>> /lib/lsb/init-functions.d 2>/dev/null)'
>>>
>>> ++ '[' -r /lib/lsb/init-functions.d/50-ubuntu-logging ']'
>>>
>>> ++ . /lib/lsb/init-functions.d/50-ubuntu-logging
>>>
>>> +++ LOG_DAEMON_MSG=
>>>
>>> ++ FANCYTTY=
>>>
>>> ++ '[' -e /etc/lsb-base-logging.sh ']'
>>>
>>> ++ true
>>>
>>> + '[' -r /etc/default/rcS ']'
>>>
>>> + . /etc/default/rcS
>>>
>>> ++ UTC=yes
>>>
>>> ++ FSCKFIX=no
>>>
>>> + ES_USER=elasticsearch
>>>
>>> + ES_GROUP=elasticsearch
>>>
>>> + JDK_DIRS='/usr/lib/jvm/java-7-oracle /usr/lib/jvm/java-7-openjdk
>>> /usr/lib/jvm/java-7-openjdk-amd64/ /usr/lib/jvm/java-7-openjdk-armhf
>>> /usr/lib/jvm/java-7-openjdk-i386/ /usr/lib/jvm/java-6-sun
>>> /usr/lib/jvm/java-6-openjdk /usr/lib/jvm/java-6-openjdk-amd64
>>> /usr/lib/jvm/java-6-openjdk-armhf /usr/lib/jvm/java-6-openjdk-i386
>>> /usr/lib/jvm/default-java'
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/java-7-oracle/bin/java -a -z '' ']'
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/java-7-openjdk/bin/java -a -z '' ']'
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/java-7-openjdk-amd64//bin/java -a -z '' ']'
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/java-7-openjdk-armhf/bin/java -a -z '' ']'
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/java-7-openjdk-i386//bin/java -a -z '' ']'
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/java-6-sun/bin/java -a -z '' ']'
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/java-6-openjdk/bin/java -a -z '' ']'
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/java-6-openjdk-amd64/bin/java -a -z '' ']'
>>>
>>> + JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/java-6-openjdk-armhf/bin/java -a -z
>>> /usr/lib/jvm/java-6-openjdk-amd64 ']'
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/java-6-openjdk-i386/bin/java -a -z
>>> /usr/lib/jvm/java-6-openjdk-amd64 ']'
>>>
>>> + for jdir in '$JDK_DIRS'
>>>
>>> + '[' -r /usr/lib/jvm/default-java/bin/java -a -z
>>> /usr/lib/jvm/java-6-openjdk-amd64 ']'
>>>
>>> + export JAVA_HOME
>>>
>>> + ES_HOME=/usr/share/elasticsearch
>>>
>>> + MAX_OPEN_FILES=65535
>>>
>>> + LOG_DIR=/var/log/elasticsearch
>>>
>>> + DATA_DIR=/var/lib/elasticsearch
>>>
>>> + WORK_DIR=/tmp/elasticsearch
>>>
>>> + CONF_DIR=/etc/elasticsearch
>>>
>>> + CONF_FILE=/etc/elasticsearch/elasticsearch.yml
>>>
>>> + MAX_MAP_COUNT=262144
>>>
>>> + '[' -f /etc/default/elasticsearch ']'
>>>
>>> + . /etc/default/elasticsearch
>>>
>>> ++ ES_USER=elasticsearch
>>>
>>> ++ ES_GROUP=elasticsearch
>>>
>>> ++ ES_HEAP_SIZE=2g
>>>
>>> ++ MAX_LOCKED_MEMORY=unlimited
>>>
>>> ++ LOG_DIR=/home/log/elasticsearch
>>>
>>> ++ DATA_DIR=/home/data/elasticsearch
>>>
>>> ++ WORK_DIR=/home/tmp/elasticsearch
>>>
>>> ++ CONF_DIR=/etc/elasticsearch
>>>
>>> ++ CONF_FILE=/etc/elasticsearch/elasticsearch.yml
>>>
>>> + PID_FILE=/var/run/elasticsearch.pid
>>>
>>> + DAEMON=/usr/share/elasticsearch/bin/elasticsearch
>>>
>>> + DAEMON_OPTS='-d -p /var/run/elasticsearch.pid -Des.default.config=/etc/
>>> elasticsearch/elasticsearch.yml 
>>> -Des.default.path.home=/usr/share/elasticsearch
>>> -Des.default.path.logs=/home/log/elasticsearch
>>> -Des.default.path.data=/home/data/elasticsearch
>>> -Des.default.path.work=/home/tmp/elasticsearch
>>> -Des.default.path.conf=/etc/elasticsearch'
>>>
>>> + export ES_HEAP_SIZE
>>>
>>> + export ES_HEAP_NEWSIZE
>>>
>>> + export ES_DIRECT_SIZE
>>>
>>> + export ES_JAVA_OPTS
>>>
>>> + test -x /usr/share/elasticsearch/bin/elasticsearch

Re: Memory issues on ES client node

2014-07-06 Thread Venkat Morampudi
It is expected that nodes move huge volumes of data, but what I was wondering is 
why the objects are not being garbage collected. Also, there are 242 
TransportSearchQueryThenFetchAction$AsyncAction instances; I don't think that 
level of concurrency is expected. I couldn't yet find from the code which object 
is holding on to them.

The timeouts that you are referring to - are these between the client node and 
the data nodes, or between the client node and the consumer? Is there anything 
the consumer needs to do to release objects?
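
For what it's worth, a quick way to watch whether in-flight search contexts keep piling up on the client node is the stats APIs (host and port below are placeholders):

curl -s 'localhost:9200/_cat/thread_pool?v'
curl -s 'localhost:9200/_nodes/stats/indices?pretty'

In the second output, search.open_contexts and search.query_current give a rough idea of how many searches are live at a given moment.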

Thanks for your time,
-VM


On Jul 2, 2014, at 7:08 AM, joergpra...@gmail.com wrote:

> I'm not sure but it looks like a node tries to move some GB of document hits 
> around. This might have triggered timeouts at other places (probably with 
> node disconnects) and maybe the GB chunk is not yet GC collected, so you see 
> this in your heap analyzer tool.
> 
> It depends on the search results and search hits you generated if the 
> heaviness of the search result is expected or not, so it would be useful to 
> know more about your queries.
> 
> Jörg
> 
> 
> On Wed, Jul 2, 2014 at 3:29 AM, Venkat Morampudi  
> wrote:
> Thanks for the reply Jörg. I don't have any logs. I will try to enable them, 
> but it would take some time. If there is anything in particular 
> that we need to enable, please let me know.
> 
> -VM
> 
> 
> On Tuesday, July 1, 2014 12:58:21 PM UTC-7, Jörg Prante wrote:
> Do you have anything in your logs, i.e. many disconnects/reconnects?
> 
> Jörg
> 
> 
> On Tue, Jul 1, 2014 at 7:59 PM, Venkat Morampudi  wrote:
> In our Elasticsearch deployment we are seeing random client node crashes due 
> to out of memory exceptions. I got the memory dump from one of the crashes and 
> analysed it using Eclipse Memory Analyzer. I have attached the leak suspect report. 
> Apparently 242 objects of type 
> org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction
>  are holding almost 8gb of memory. I have spent some time on source code but 
> couldn't find anything obvious. 
> 
> 
> I would really appreciate any help with this issue. 
> 
> 
> -VM
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to elasticsearc...@googlegroups.com.
> 
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/elasticsearch/37881ead-70c2-40d8-89b6-a771b2a36bdd%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/elasticsearch/9930fcfd-d2d4-4f62-b8a0-8f1f989069f2%40googlegroups.com.
> 
> For more options, visit https://groups.google.com/d/optout.
> 
> 
> -- 
> You received this message because you are subscribed to a topic in the Google 
> Groups "elasticsearch" group.
> To unsubscribe from this topic, visit 
> https://groups.google.com/d/topic/elasticsearch/EH76o1CIeQQ/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to 
> elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/elasticsearch/CAKdsXoE_Xum%2BU%3D-M-X_R93qbDdOKx-QFS2PFCbxcik-uqtpBbw%40mail.gmail.com.
> For more options, visit https://groups.google.com/d/optout.



Re: Elasticsearch wont start on Ubuntu 14.04

2014-07-06 Thread Steven Yue
No, I have 7.

Run 'java -version' shows this:

java version "1.7.0_60"

Java(TM) SE Runtime Environment (build 1.7.0_60-b19)

Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)

On Tuesday, July 1, 2014 11:29:05 PM UTC-4, Mark Walkom wrote:
>
> You're on Java 6 by the looks of it, ES won't run on anything less than 7.
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com 
> web: www.campaignmonitor.com
>  
>
> On 1 July 2014 22:57, Steven Yue > wrote:
>
>> Hi, Alex
>>
>> Below is the output:
>>
>> ++ id -u
>>
>> + '[' 0 -ne 0 ']'
>>
>> + . /lib/lsb/init-functions
>>
>> +++ run-parts --lsbsysinit --list /lib/lsb/init-functions.d
>>
>> ++ for hook in '$(run-parts --lsbsysinit --list /lib/lsb/init-functions.d 
>> 2>/dev/null)'
>>
>> ++ '[' -r /lib/lsb/init-functions.d/20-left-info-blocks ']'
>>
>> ++ . /lib/lsb/init-functions.d/20-left-info-blocks
>>
>> ++ for hook in '$(run-parts --lsbsysinit --list /lib/lsb/init-functions.d 
>> 2>/dev/null)'
>>
>> ++ '[' -r /lib/lsb/init-functions.d/50-ubuntu-logging ']'
>>
>> ++ . /lib/lsb/init-functions.d/50-ubuntu-logging
>>
>> +++ LOG_DAEMON_MSG=
>>
>> ++ FANCYTTY=
>>
>> ++ '[' -e /etc/lsb-base-logging.sh ']'
>>
>> ++ true
>>
>> + '[' -r /etc/default/rcS ']'
>>
>> + . /etc/default/rcS
>>
>> ++ UTC=yes
>>
>> ++ FSCKFIX=no
>>
>> + ES_USER=elasticsearch
>>
>> + ES_GROUP=elasticsearch
>>
>> + JDK_DIRS='/usr/lib/jvm/java-7-oracle /usr/lib/jvm/java-7-openjdk 
>> /usr/lib/jvm/java-7-openjdk-amd64/ /usr/lib/jvm/java-7-openjdk-armhf 
>> /usr/lib/jvm/java-7-openjdk-i386/ /usr/lib/jvm/java-6-sun 
>> /usr/lib/jvm/java-6-openjdk /usr/lib/jvm/java-6-openjdk-amd64 
>> /usr/lib/jvm/java-6-openjdk-armhf /usr/lib/jvm/java-6-openjdk-i386 
>> /usr/lib/jvm/default-java'
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/java-7-oracle/bin/java -a -z '' ']'
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/java-7-openjdk/bin/java -a -z '' ']'
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/java-7-openjdk-amd64//bin/java -a -z '' ']'
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/java-7-openjdk-armhf/bin/java -a -z '' ']'
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/java-7-openjdk-i386//bin/java -a -z '' ']'
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/java-6-sun/bin/java -a -z '' ']'
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/java-6-openjdk/bin/java -a -z '' ']'
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/java-6-openjdk-amd64/bin/java -a -z '' ']'
>>
>> + JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/java-6-openjdk-armhf/bin/java -a -z 
>> /usr/lib/jvm/java-6-openjdk-amd64 ']'
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/java-6-openjdk-i386/bin/java -a -z 
>> /usr/lib/jvm/java-6-openjdk-amd64 ']'
>>
>> + for jdir in '$JDK_DIRS'
>>
>> + '[' -r /usr/lib/jvm/default-java/bin/java -a -z 
>> /usr/lib/jvm/java-6-openjdk-amd64 ']'
>>
>> + export JAVA_HOME
>>
>> + ES_HOME=/usr/share/elasticsearch
>>
>> + MAX_OPEN_FILES=65535
>>
>> + LOG_DIR=/var/log/elasticsearch
>>
>> + DATA_DIR=/var/lib/elasticsearch
>>
>> + WORK_DIR=/tmp/elasticsearch
>>
>> + CONF_DIR=/etc/elasticsearch
>>
>> + CONF_FILE=/etc/elasticsearch/elasticsearch.yml
>>
>> + MAX_MAP_COUNT=262144
>>
>> + '[' -f /etc/default/elasticsearch ']'
>>
>> + . /etc/default/elasticsearch
>>
>> ++ ES_USER=elasticsearch
>>
>> ++ ES_GROUP=elasticsearch
>>
>> ++ ES_HEAP_SIZE=2g
>>
>> ++ MAX_LOCKED_MEMORY=unlimited
>>
>> ++ LOG_DIR=/home/log/elasticsearch
>>
>> ++ DATA_DIR=/home/data/elasticsearch
>>
>> ++ WORK_DIR=/home/tmp/elasticsearch
>>
>> ++ CONF_DIR=/etc/elasticsearch
>>
>> ++ CONF_FILE=/etc/elasticsearch/elasticsearch.yml
>>
>> + PID_FILE=/var/run/elasticsearch.pid
>>
>> + DAEMON=/usr/share/elasticsearch/bin/elasticsearch
>>
>> + DAEMON_OPTS='-d -p /var/run/elasticsearch.pid 
>> -Des.default.config=/etc/elasticsearch/elasticsearch.yml 
>> -Des.default.path.home=/usr/share/elasticsearch 
>> -Des.default.path.logs=/home/log/elasticsearch 
>> -Des.default.path.data=/home/data/elasticsearch 
>> -Des.default.path.work=/home/tmp/elasticsearch 
>> -Des.default.path.conf=/etc/elasticsearch'
>>
>> + export ES_HEAP_SIZE
>>
>> + export ES_HEAP_NEWSIZE
>>
>> + export ES_DIRECT_SIZE
>>
>> + export ES_JAVA_OPTS
>>
>> + test -x /usr/share/elasticsearch/bin/elasticsearch
>>
>> + case "$1" in
>>
>> + checkJava
>>
>> + '[' -x /usr/lib/jvm/java-6-openjdk-amd64/bin/java ']'
>>
>> + JAVA=/usr/lib/jvm/java-6-openjdk-amd64/bin/java
>>
>> + '[' '!' -x /usr/lib/jvm/java-6-openjdk-amd64/bin/java ']'
>>
>> + '[' -n unlimited -a -z 2g ']'
>>
>> + log_daemon_msg 'Starting Elasticsearch Server'
>>
>> + '[' -z 'Starting Elasticsearch Server' ']'
>>
>> + log_use_fancy_output
>>
>> + TPUT=/usr/bin/tput
>>
>> + EXPR=/usr/bin/expr
>>
>> + '[' -t 1 ']'
>>
>> + '[' xxterm-256color '!=' x ']'
>>
>> + '[' xxterm-256color '!=' xdumb ']'
>>
>> 

Re: excessive merging/small segment sizes

2014-07-06 Thread Kireet Reddy
Just to reiterate, the problematic period is from 07/05 14:45 to 07/06 
02:10. I included a couple hours before and after in the logs.
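
For anyone following along, these are the kinds of knobs being discussed; the index name and values below are only illustrative, and depending on the ES version some merge.policy settings may need to go into the index creation settings rather than a live update:

curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index.refresh_interval" : "30s",
  "index.merge.policy.floor_segment" : "2mb",
  "index.merge.policy.segments_per_tier" : 10,
  "index.merge.policy.max_merge_at_once" : 10
}'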

On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:
>
> They are linked below (node5 is the log of the normal node, node6 is the 
> log of the problematic node). 
>
> I don't think it was doing big merges, otherwise during the high load 
> period, the merges graph line would have had a "floor" > 0, similar to the 
> time period after I disabled refresh. We don't do routing and use mostly 
> default settings. I think the only settings we changed are:
>
> indices.memory.index_buffer_size: 50%
> index.translog.flush_threshold_ops: 5
>
> We are running on a 6 cpu/12 cores machine with a 32GB heap and 96GB of 
> memory with 4 spinning disks. 
>
> node 5 log (normal) 
> node 6 log (high load) 
> 
>
> On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:
>>
>> Can you post the IndexWriter infoStream output?  I can see if anything 
>> stands out.
>>
>> Maybe it was just that this node was doing big merges?  I.e., if you 
>> waited long enough, the other shards would eventually do their big merges 
>> too?
>>
>> Have you changed any default settings, do custom routing, etc.?  Is there 
>> any reason to think that the docs that land on this node are "different" in 
>> any way?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy  wrote:
>>
>>>  From all the information I’ve collected, it seems to be the merging 
>>> activity:
>>>
>>>
>>>1. We capture the cluster stats into graphite and the current merges 
>>>stat seems to be about 10x higher on this node. 
>>>2. The actual node that the problem occurs on has happened on 
>>>different physical machines so a h/w issue seems unlikely. Once the 
>>> problem 
>>>starts it doesn't seem to stop. We have blown away the indices in the 
>>> past 
>>>and started indexing again after enabling more logging/stats. 
>>>3. I've stopped executing queries so the only thing that's happening 
>>>on the cluster is indexing.
>>>4. Last night when the problem was ongoing, I disabled refresh 
>>>(index.refresh_interval = -1) around 2:10am. Within 1 hour, the load 
>>>returned to normal. The merge activity seemed to reduce, it seems like 2 
>>>very long running merges are executing but not much else. 
>>>5. I grepped an hour of logs of the 2 machines for "add merge=", it 
>>>was 540 on the high load node and 420 on a normal node. I pulled out the 
>>>size value from the log message and the merges seemed to be much smaller 
>>> on 
>>>the high load node. 
>>>
>>> I just created the indices a few days ago, so the shards of each index 
>>> are balanced across the nodes. We have external metrics around document 
>>> ingest rate and there was no spike during this time period. 
>>>
>>>
>>>
>>> Thanks
>>> Kireet
>>>
>>>
>>> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>>>
 It's perfectly normal/healthy for many small merges below the floor 
 size to happen.

 I think you should first figure out why this node is different from the 
 others?  Are you sure it's merging CPU cost that's different?

 Mike McCandless

 http://blog.mikemccandless.com


 On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy  wrote:

>  We have a situation where one of the four nodes in our cluster seems 
> to get caught up endlessly merging.  However it seems to be high CPU 
> activity and not I/O constrained. I have enabled the IndexWriter info 
> stream logs, and often times it seems to do merges of quite small 
> segments 
> (100KB) that are much below the floor size (2MB). I suspect this is due 
> to 
> frequent refreshes and/or using lots of threads concurrently to do 
> indexing. Is this true?
>
> My supposition is that this is leading to the merge policy doing lots 
> of merges of very small segments into another small segment which will 
> again require a merge to even reach the floor size. My index has 64 
> segments and 25 are below the floor size. I am wondering if there should 
> be 
> an exception for the maxMergesAtOnce parameter for the first level so 
> that 
> many small segments could be merged at once in this case.
>
> I am considering changing the other parameters (wider tiers, lower 
> floor size, more concurrent merges allowed) but these all seem to have 
> side 
> effects I may not necessarily want. Is there a good solution here?
>  
> -- 
> You received this message because you are subscribed to the Google 
> Groups "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send 
> an email to elastics

Latitude -> Lat, Longitude -> lon

2014-07-06 Thread Olivier B
Hi there,

I understand a geo-point can be mapped based on two fields: "lat" and "lon".
However, my fields are named "longitude" and "latitude".
I'm using the river plugin for couchdb and I cannot really rename those 
fields before indexing. And those fields are part of an item in an array:

"items": [
  { 
"item_id" : "abcd",
"location": 
{
  "longitude": 145.7711,
  "latitude": -16.92359
}
  },
  { 
"item_id" : "efgh",
"location": 
{
  "longitude": 149.6611,
  "latitude": -19.94098
}
  }
]

So, any idea how I can rename those fields? And possibly map them to a geo point
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-geo-point-type.html)?
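
For reference, once the renaming is solved, the target mapping could look something like the sketch below (index and type names are placeholders; "items" may also warrant the nested type, depending on how you query it):

curl -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
  "my_type" : {
    "properties" : {
      "items" : {
        "properties" : {
          "item_id" :  { "type" : "string" },
          "location" : { "type" : "geo_point" }
        }
      }
    }
  }
}'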

Cheers



Re: excessive merging/small segment sizes

2014-07-06 Thread Kireet Reddy
They are linked below (node5 is the log of the normal node, node6 is the 
log of the problematic node). 

I don't think it was doing big merges, otherwise during the high load 
period, the merges graph line would have had a "floor" > 0, similar to the 
time period after I disabled refresh. We don't do routing and use mostly 
default settings. I think the only settings we changed are:

indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 5

We are running on a 6 cpu/12 cores machine with a 32GB heap and 96GB of 
memory with 4 spinning disks. 

node 5 log (normal) 
node 6 log (high load) 

On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:
>
> Can you post the IndexWriter infoStream output?  I can see if anything 
> stands out.
>
> Maybe it was just that this node was doing big merges?  I.e., if you 
> waited long enough, the other shards would eventually do their big merges 
> too?
>
> Have you changed any default settings, do custom routing, etc.?  Is there 
> any reason to think that the docs that land on this node are "different" in 
> any way?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy  > wrote:
>
>>  From all the information I’ve collected, it seems to be the merging 
>> activity:
>>
>>
>>1. We capture the cluster stats into graphite and the current merges 
>>stat seems to be about 10x higher on this node. 
>>2. The actual node that the problem occurs on has happened on 
>>different physical machines so a h/w issue seems unlikely. Once the 
>> problem 
>>starts it doesn't seem to stop. We have blown away the indices in the 
>> past 
>>and started indexing again after enabling more logging/stats. 
>>3. I've stopped executing queries so the only thing that's happening 
>>on the cluster is indexing.
>>4. Last night when the problem was ongoing, I disabled refresh 
>>(index.refresh_interval = -1) around 2:10am. Within 1 hour, the load 
>>returned to normal. The merge activity seemed to reduce, it seems like 2 
>>very long running merges are executing but not much else. 
>>5. I grepped an hour of logs of the 2 machines for "add merge=", it 
>>was 540 on the high load node and 420 on a normal node. I pulled out the 
>>size value from the log message and the merges seemed to be much smaller 
>> on 
>>the high load node. 
>>
>> I just created the indices a few days ago, so the shards of each index 
>> are balanced across the nodes. We have external metrics around document 
>> ingest rate and there was no spike during this time period. 
>>
>>
>>
>> Thanks
>> Kireet
>>
>>
>> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>>
>>> It's perfectly normal/healthy for many small merges below the floor size 
>>> to happen.
>>>
>>> I think you should first figure out why this node is different from the 
>>> others?  Are you sure it's merging CPU cost that's different?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy  wrote:
>>>
  We have a situation where one of the four nodes in our cluster seems 
 to get caught up endlessly merging.  However it seems to be high CPU 
 activity and not I/O constrained. I have enabled the IndexWriter info 
 stream logs, and often times it seems to do merges of quite small segments 
 (100KB) that are much below the floor size (2MB). I suspect this is due to 
 frequent refreshes and/or using lots of threads concurrently to do 
 indexing. Is this true?

 My supposition is that this is leading to the merge policy doing lots 
 of merges of very small segments into another small segment which will 
 again require a merge to even reach the floor size. My index has 64 
 segments and 25 are below the floor size. I am wondering if there should 
 be 
 an exception for the maxMergesAtOnce parameter for the first level so that 
 many small segments could be merged at once in this case.

 I am considering changing the other parameters (wider tiers, lower 
 floor size, more concurrent merges allowed) but these all seem to have 
 side 
 effects I may not necessarily want. Is there a good solution here?
  
 -- 
 You received this message because you are subscribed to the Google 
 Groups "elasticsearch" group.
 To unsubscribe from this group and stop receiving emails from it, send 
 an email to elasticsearc...@googlegroups.com.

 To view this discussion on the web visit https://groups.google.com/d/
 msgid/elasticsearch/0a8db0dc-ae0b-49cb-b29d-e396510bf755%
 40googlegroups.com 
 
 .

Re: excessive merging/small segment sizes

2014-07-06 Thread Michael McCandless
Can you post the IndexWriter infoStream output?  I can see if anything
stands out.

Maybe it was just that this node was doing big merges?  I.e., if you waited
long enough, the other shards would eventually do their big merges too?

Have you changed any default settings, do custom routing, etc.?  Is there
any reason to think that the docs that land on this node are "different" in
any way?

Mike McCandless

http://blog.mikemccandless.com


On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy  wrote:

>  From all the information I’ve collected, it seems to be the merging
> activity:
>
>
>1. We capture the cluster stats into graphite and the current merges
>stat seems to be about 10x higher on this node.
>2. The actual node that the problem occurs on has happened on
>different physical machines so a h/w issue seems unlikely. Once the problem
>starts it doesn't seem to stop. We have blown away the indices in the past
>and started indexing again after enabling more logging/stats.
>3. I've stopped executing queries so the only thing that's happening
>on the cluster is indexing.
>4. Last night when the problem was ongoing, I disabled refresh
>(index.refresh_interval = -1) around 2:10am. Within 1 hour, the load
>returned to normal. The merge activity seemed to reduce, it seems like 2
>very long running merges are executing but not much else.
>5. I grepped an hour of logs of the 2 machines for "add merge=", it
>was 540 on the high load node and 420 on a normal node. I pulled out the
>size value from the log message and the merges seemed to be much smaller on
>the high load node.
>
> I just created the indices a few days ago, so the shards of each index are
> balanced across the nodes. We have external metrics around document ingest
> rate and there was no spike during this time period.
>
>
>
> Thanks
> Kireet
>
>
> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>
>> It's perfectly normal/healthy for many small merges below the floor size
>> to happen.
>>
>> I think you should first figure out why this node is different from the
>> others?  Are you sure it's merging CPU cost that's different?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy  wrote:
>>
>>>  We have a situation where one of the four nodes in our cluster seems
>>> to get caught up endlessly merging.  However it seems to be high CPU
>>> activity and not I/O constrained. I have enabled the IndexWriter info
>>> stream logs, and often times it seems to do merges of quite small segments
>>> (100KB) that are much below the floor size (2MB). I suspect this is due to
>>> frequent refreshes and/or using lots of threads concurrently to do
>>> indexing. Is this true?
>>>
>>> My supposition is that this is leading to the merge policy doing lots of
>>> merges of very small segments into another small segment which will again
>>> require a merge to even reach the floor size. My index has 64 segments and
>>> 25 are below the floor size. I am wondering if there should be an exception
>>> for the maxMergesAtOnce parameter for the first level so that many small
>>> segments could be merged at once in this case.
>>>
>>> I am considering changing the other parameters (wider tiers, lower floor
>>> size, more concurrent merges allowed) but these all seem to have side
>>> effects I may not necessarily want. Is there a good solution here?
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to elasticsearc...@googlegroups.com.
>>>
>>> To view this discussion on the web visit https://groups.google.com/d/
>>> msgid/elasticsearch/0a8db0dc-ae0b-49cb-b29d-e396510bf755%
>>> 40googlegroups.com
>>> 
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/edc22069-8674-41db-ab06-226b05d293aa%40googlegroups.com
> 
> .
>
> For more options, visit https://groups.google.com/d/optout.
>


Re: Search thread pools not released

2014-07-06 Thread joergpra...@gmail.com
Yes, socket appender blocks. Maybe the async appender of log4j can do
better ...

http://ricardozuasti.com/2009/asynchronous-logging-with-log4j/
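
A sketch of what that could look like; note that log4j 1.2's AsyncAppender is only configurable via XML (or programmatically), not via a properties-style config, and "logstash_socket" below stands in for whatever your existing SocketAppender is named:

<appender name="async_logstash" class="org.apache.log4j.AsyncAppender">
  <!-- Blocking=false drops events instead of stalling the caller when the buffer is full -->
  <param name="Blocking" value="false"/>
  <param name="BufferSize" value="512"/>
  <appender-ref ref="logstash_socket"/>
</appender>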

Jörg


On Sun, Jul 6, 2014 at 11:22 PM, Ivan Brusic  wrote:

> Forgot to mention the thread dumps. I have taken them before, but not this
> time. Most of the blocked search thread pools are stuck in log4j.
>
> https://gist.github.com/brusic/fc12536d8e5706ec9c32
>
> I do have a socket appender to logstash (elasticsearch logs in
> elasticsearch!). Let me debug this connection.
>
> --
> Ivan
>
>
> On Sun, Jul 6, 2014 at 1:55 PM, joergpra...@gmail.com <
> joergpra...@gmail.com> wrote:
>
>> Can anything be seen in a thread dump that looks like stray queries?
>> Maybe some facet queries hung while resources went low and never
>> returned?
>>
>> Jörg
>>
>>
>> On Sun, Jul 6, 2014 at 9:59 PM, Ivan Brusic  wrote:
>>
>>> Having an issue on one of my clusters running version 1.1.1 with 8
>>> master/data nodes, unicast, connecting via the Java TransportClient. A few
>>> REST queries are executed via monitoring services.
>>>
>>> Currently there is almost no traffic on this cluster. The few queries
>>> that are currently running are either small test queries or large facet
>>> queries (which are infrequent and the longest runs for 16 seconds). What I
>>> am noticing is that the active search threads on some nodes never decrease
>>> and when it reaches the limit, the entire cluster will stop accepting
>>> requests. The current max is the default (3 x 8).
>>>
>>> http://search06:9200/_cat/thread_pool
>>>
>>> search05 1.1.1.5 0 0 0 0 0 0 19 0 0
>>> search07 1.1.1.7 0 0 0 0 0 0  0 0 0
>>> search08 1.1.1.8 0 0 0 0 0 0  0 0 0
>>> search09 1.1.1.9 0 0 0 0 0 0  0 0 0
>>> search11 1.1.1.11 0 0 0 0 0 0  0 0 0
>>> search06 1.1.1.6 0 0 0 0 0 0  2 0 0
>>> search10 1.1.1.10 0 0 0 0 0 0  0 0 0
>>> search12 1.1.1.12 0 0 0 0 0 0  0 0 0
>>>
>>> In this case, both search05 and search06 have an active thread count
>>> that does not change. If I run a query against search05, the search will
>>> respond quickly and the total number of active search threads does not
>>> increase.
>>>
>>> So I have two related issues:
>>> 1) the active thread count does not decrease
>>> 2) the cluster will not accept requests if one node becomes unstable.
>>>
>>> I have seen the issue intermittently in the past, but the issue has
>>> started again and cluster restarts does not fix the problem. At the log
>>> level, there have been issues with the cluster state not propagating. Not
>>> every node will acknowledge the cluster state ([discovery.zen.publish]
>>> received cluster state version NNN) and the master would log a timeout
>>> (awaiting all nodes to process published state NNN timed out, timeout 30s).
>>> The nodes are fine and I can ping each other with no issues. Currently not
>>> seeing any log errors with the thread pool issue, so perhaps it is a red
>>> herring.
>>>
>>> Cheers,
>>>
>>> Ivan
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to elasticsearch+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCx91LEXP0NxbgC4-mVR27DX%2BuOxyor5cqiM6ie2JExBw%40mail.gmail.com
>>> 
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>  --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH3%2Bxxu-yY_cE3Q-2mVvyzRW%3DTKq2GFJ_rnVSSOj-w%3DbA%40mail.gmail.com
>> 
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQB-%2BGB1U1c8cgxWDFdV_pmE53_kFe-R1C4AYktHbEHmfA%40mail.gmail.com
> 
> .
>
> For more options, visit https://groups.google.com/d/optout.
>


Re: Search thread pools not released

2014-07-06 Thread Ivan Brusic
Forgot to mention the thread dumps. I have taken them before, but not this
time. Most of the blocked search thread pools are stuck in log4j.

https://gist.github.com/brusic/fc12536d8e5706ec9c32

I do have a socket appender to logstash (elasticsearch logs in
elasticsearch!). Let me debug this connection.

-- 
Ivan


On Sun, Jul 6, 2014 at 1:55 PM, joergpra...@gmail.com  wrote:

> Can anything be seen in a thread dump that looks like stray queries?
> Maybe some facet queries hung while resources went low and never
> returned?
>
> Jörg
>
>
> On Sun, Jul 6, 2014 at 9:59 PM, Ivan Brusic  wrote:
>
>> Having an issue on one of my clusters running version 1.1.1 with 8
>> master/data nodes, unicast, connecting via the Java TransportClient. A few
>> REST queries are executed via monitoring services.
>>
>> Currently there is almost no traffic on this cluster. The few queries
>> that are currently running are either small test queries or large facet
>> queries (which are infrequent and the longest runs for 16 seconds). What I
>> am noticing is that the active search threads on some nodes never decrease
>> and when it reaches the limit, the entire cluster will stop accepting
>> requests. The current max is the default (3 x 8).
>>
>> http://search06:9200/_cat/thread_pool
>>
>> search05 1.1.1.5 0 0 0 0 0 0 19 0 0
>> search07 1.1.1.7 0 0 0 0 0 0  0 0 0
>> search08 1.1.1.8 0 0 0 0 0 0  0 0 0
>> search09 1.1.1.9 0 0 0 0 0 0  0 0 0
>> search11 1.1.1.11 0 0 0 0 0 0  0 0 0
>> search06 1.1.1.6 0 0 0 0 0 0  2 0 0
>> search10 1.1.1.10 0 0 0 0 0 0  0 0 0
>> search12 1.1.1.12 0 0 0 0 0 0  0 0 0
>>
>> In this case, both search05 and search06 have an active thread count that
>> does not change. If I run a query against search05, the search will respond
>> quickly and the total number of active search threads does not increase.
>>
>> So I have two related issues:
>> 1) the active thread count does not decrease
>> 2) the cluster will not accept requests if one node becomes unstable.
>>
>> I have seen the issue intermittently in the past, but the issue has
>> started again and cluster restarts does not fix the problem. At the log
>> level, there have been issues with the cluster state not propagating. Not
>> every node will acknowledge the cluster state ([discovery.zen.publish]
>> received cluster state version NNN) and the master would log a timeout
>> (awaiting all nodes to process published state NNN timed out, timeout 30s).
>> The nodes are fine and I can ping each other with no issues. Currently not
>> seeing any log errors with the thread pool issue, so perhaps it is a red
>> herring.
>>
>> Cheers,
>>
>> Ivan
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCx91LEXP0NxbgC4-mVR27DX%2BuOxyor5cqiM6ie2JExBw%40mail.gmail.com
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH3%2Bxxu-yY_cE3Q-2mVvyzRW%3DTKq2GFJ_rnVSSOj-w%3DbA%40mail.gmail.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>



Re: Search thread pools not released

2014-07-06 Thread joergpra...@gmail.com
Can anything be seen in a thread dump that looks like stray queries?
Maybe some facet queries hung while resources went low and never returned?

Jörg


On Sun, Jul 6, 2014 at 9:59 PM, Ivan Brusic  wrote:

> Having an issue on one of my clusters running version 1.1.1 with 8
> master/data nodes, unicast, connecting via the Java TransportClient. A few
> REST queries are executed via monitoring services.
>
> Currently there is almost no traffic on this cluster. The few queries that
> are currently running are either small test queries or large facet queries
> (which are infrequent and the longest runs for 16 seconds). What I am
> noticing is that the active search threads on some nodes never decrease
> and when it reaches the limit, the entire cluster will stop accepting
> requests. The current max is the default (3 x 8).
>
> http://search06:9200/_cat/thread_pool
>
> search05 1.1.1.5 0 0 0 0 0 0 19 0 0
> search07 1.1.1.7 0 0 0 0 0 0  0 0 0
> search08 1.1.1.8 0 0 0 0 0 0  0 0 0
> search09 1.1.1.9 0 0 0 0 0 0  0 0 0
> search11 1.1.1.11 0 0 0 0 0 0  0 0 0
> search06 1.1.1.6 0 0 0 0 0 0  2 0 0
> search10 1.1.1.10 0 0 0 0 0 0  0 0 0
> search12 1.1.1.12 0 0 0 0 0 0  0 0 0
>
> In this case, both search05 and search06 have an active thread count that
> does not change. If I run a query against search05, the search will respond
> quickly and the total number of active search threads does not increase.
>
> So I have two related issues:
> 1) the active thread count does not decrease
> 2) the cluster will not accept requests if one node becomes unstable.
>
> I have seen the issue intermittently in the past, but the issue has
> started again and cluster restarts does not fix the problem. At the log
> level, there have been issues with the cluster state not propagating. Not
> every node will acknowledge the cluster state ([discovery.zen.publish]
> received cluster state version NNN) and the master would log a timeout
> (awaiting all nodes to process published state NNN timed out, timeout 30s).
> The nodes are fine and I can ping each other with no issues. Currently not
> seeing any log errors with the thread pool issue, so perhaps it is a red
> herring.
>
> Cheers,
>
> Ivan
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCx91LEXP0NxbgC4-mVR27DX%2BuOxyor5cqiM6ie2JExBw%40mail.gmail.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>



Re: Custom Plugin for specifying custom filter attributes at query time

2014-07-06 Thread joergpra...@gmail.com
Thanks for being so patient with me :)

I understand now the following: there are 50m documents in an external
DB, from which up to 1m are to be exported in the form of document identifiers
to work as a filter in ES. The idea is to use internal mechanisms like bit
sets. There is no API for manipulating filters in ES on that level; ES
receives the terms and passes them into the Lucene TermFilter class according
to the type of the filter.

What is a bit unclear to me: how is the filter set constructed? I assume it
should be a select statement on the database?

Next, if you have this large set of document identifiers selected, I do not
understand what is the base query you want to apply the filter on? Is there
a user-given query for ES? What does such a query look like? Is it assumed
there are other documents in ES that are related somehow to the 50m
documents? An illustrative example of the steps in the scenario would
really help to understand the data model.

Just some food for thought: it is close to impossible to filter in ES on 1m
unique terms in a single step - the default maximum clause count of a Lucene
query is, for good reason, limited to 1024. A workaround would
be iterating over the 1m terms, executing ~1000 filter queries, and adding up the
results. This takes a long time and may not be the desired solution.
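
A rough sketch of that chunked approach with the 1.x Java API; the index name, field name and pre-built chunks are placeholders, and it assumes the chunks are disjoint so the per-chunk counts can simply be added:

import java.util.List;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;

public class ChunkedIdFilterExample {
    // Count documents matching any of the ids, roughly 1000 ids per request.
    public static long countByIds(Client client, List<List<String>> idChunks) {
        long total = 0;
        for (List<String> chunk : idChunks) {
            SearchResponse resp = client.prepareSearch("docs")
                    .setQuery(QueryBuilders.filteredQuery(
                            QueryBuilders.matchAllQuery(),
                            FilterBuilders.termsFilter("external_id", chunk)))
                    .setSize(0)              // only the hit count is needed here
                    .execute().actionGet();
            total += resp.getHits().getTotalHits();
        }
        return total;
    }
}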

Fortunately, in most situations, it is possible to find more concise
grouping to reduce the 1m document identifiers into fewer ones for more
efficient filtering.

Jörg



On Sun, Jul 6, 2014 at 9:39 PM, 'Sandeep Ramesh Khanzode' via elasticsearch
 wrote:

> Hi,
>
> Appreciate your continued assistance. :) Thanks,
>
> Disclaimer: I am yet to sufficiently understand ES sources so as to depict
> my scenario completely. Some info' below may be conjecture.
>
> I would have a corpus of 50M docs (actually lot more, but for testing now)
> out of which I would have say, upto, 1M DocIds to be used as a filter. This
> set of 1M docs can be different for different use cases, the point being,
> upto 1M docIds can form one logical set of documents for filtering results.
> If I use a simple IdsFilter from ES Java API, I would have to keep adding
> these 1M docs to the List implementation internally, and I have a feeling
> it may not scale very well as they may change per use case and per some
> combinations internal to a single use case also.
>
> As I debug the code, the IdsFilter will be converted to a Lucene filter.
> Lucene filters, on the other hand, operate on a docId bitset type. That
> gels very well with my requirement, since I can scale with BitSets (I
> assume).
>
> If I can find a way to directly plug this BitSet as a Lucene Filter to the
> Lucene search() call bypassing the ES filters using, I dont know, may some
> sort of a plugin, I believe that may support my cause. I assume I may not
> get to use the Filter cache from ES but probably I can cache these BitSets
> for subsequent use.
>
> Please let me know. And thanks!
>
> Thanks,
> Sandeep
>
>
> On Saturday, 5 July 2014 01:40:55 UTC+5:30, Jörg Prante wrote:
>
>> What I understand is a TermsFilter is required
>>
>> http://www.elasticsearch.org/guide/en/elasticsearch/
>> reference/current/query-dsl-terms-filter.html
>>
>> and the source of the terms is a DB. That is no problem. The plan is:
>> fetch the terms from the DB, build the query (either Java API or JSON) and
>> execute it.
>>
>> What I don't understand is the part with the "quick mapping", Lucene, and
>> the doc ids. Lucene doc IDs are not reliable and are not exposed by
>> Elasticsearch; Elasticsearch uses its own document identifiers, which are
>> stable and augmented with info about the index type they belong to, in
>> order to make them unique. But I do not understand why this is important in
>> this context.
>>
>> Elasticsearch API uses query builders and filter builders to build search
>> requests . A "quick mapping" is just fetching the terms from the DB as a
>> string array before this API is called.
>>
>> I also do not understand the role of the number "1M", is this the number
>> of fields, or the number of terms? Is it a total number or a number per
>> query?
>>
>> Did I misunderstand anything more? I am not really sure what is the
>> challenge...
>>
>> Jörg
>>
>>
>>
>> On Fri, Jul 4, 2014 at 8:55 PM, 'Sandeep Ramesh Khanzode' via
>> elasticsearch  wrote:
>>
>>> Hi,
>>>
>>> Just to give some background. I will have a large-ish corpus of more
>>> than 100M documents indexed. The filters that I want to apply will be on a
>>> field that is not indexed. I mean, I prefer to not have them indexed in
>>> ES/Lucene since they will be frequently changing. So, for that, I will be
>>> maintaining them elsewhere, like a DB etc.
>>>
>>> Everytime I have a query, I would want to filter the results by those
>>> fields that are not indexed in Lucene. And I am guessing that number may
>>> well be more than 1M. In that case, I think, since we will maintain some
>>> sort of TermsFilter, it may not 

Re: Use arrays as update parameters with elasticsearch-hadoop-mr

2014-07-06 Thread Costin Leau
Hi James,

Fwiw, I plan to address this bug shortly - as you pointed out, the JSON
array needs to be handled separately before passing its content in.


On Thu, Jul 3, 2014 at 8:58 PM, James Campbell 
wrote:

> I would like to update an existing document that has an array from
> elasticsearch hadoop.
>
> I notice that I can do that from curl directly, for example:
>
> PUT arraydemo/temp/1
> {
>   "counter" : 1,
>   "tags" : [ "I am an array", "With Multiple values" ],
>   "more_tags" : [ "I am a tag" ],
>   "even_more_tags": "I am a tag too!"
> }
>
> GET arraydemo/temp/1
>
> POST arraydemo/temp/1/_update
> {
>   "script" : "tmp = new HashSet(); tmp.addAll(ctx._source.tags); 
> tmp.addAll(new_tags); ctx._source.tags = tmp.toArray()",
>   "params" : {
> "new_tags" : [ "add me", "and me" ]
>   }
> }
>
>
> However, elasticsearch-hadoop appears to be unable to parse array
> parameters, such that an upsert operation from within elasticsearch hadoop
> using the same script and a document with the same JSON for parameters
> fails.
>
> I created an issue on github (elasticsearch hadoop (#223)), but thought I
> should post here for ideas or in case there's a workaround that someone
> might know of.
>
> James Campbell
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/70608cbd-5a4d-424e-b04e-6daee8ac0635%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>



Re: excessive merging/small segment sizes

2014-07-06 Thread Michael McCandless
It's perfectly normal/healthy for many small merges below the floor size to
happen.

I think you should first figure out why this node is different from the
others?  Are you sure it's merging CPU cost that's different?

Mike McCandless

http://blog.mikemccandless.com


On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy  wrote:

> We have a situation where one of the four nodes in our cluster seems to
> get caught up endlessly merging.  However it seems to be high CPU
> activity and not I/O constrained. I have enabled the IndexWriter info
> stream logs, and often times it seems to do merges of quite small segments
> (100KB) that are much below the floor size (2MB). I suspect this is due to
> frequent refreshes and/or using lots of threads concurrently to do
> indexing. Is this true?
>
> My supposition is that this is leading to the merge policy doing lots of
> merges of very small segments into another small segment which will again
> require a merge to even reach the floor size. My index has 64 segments and
> 25 are below the floor size. I am wondering if there should be an exception
> for the maxMergesAtOnce parameter for the first level so that many small
> segments could be merged at once in this case.
>
> I am considering changing the other parameters (wider tiers, lower floor
> size, more concurrent merges allowed) but these all seem to have side
> effects I may not necessarily want. Is there a good solution here?
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/0a8db0dc-ae0b-49cb-b29d-e396510bf755%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>



Re: java.lang.NoSuchFieldError: ALLOW_UNQUOTED_FIELD_NAMES when trying to query elasticsearch using spark

2014-07-06 Thread Costin Leau
Hi,

Glad to see you sorted out the problem. Out of curiosity, what version of
Jackson were you using and what was pulling it in? Can you share your Maven
pom/Gradle build?
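
(For anyone who hits the same conflict: the dependency report is the usual way to see what drags in the old Jackson; the commands below assume a stock Maven or Gradle project.)

mvn dependency:tree -Dincludes=org.codehaus.jackson
gradle dependencies --configuration runtime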


On Sun, Jul 6, 2014 at 10:27 PM, Brian Thomas 
wrote:

> I figured it out, dependency issue in my classpath.  Maven was pulling
> down a very old version of the jackson jar.  I added the following line to
> my dependencies and the error went away:
>
> compile 'org.codehaus.jackson:jackson-mapper-asl:1.9.13'
>
>
> On Friday, July 4, 2014 3:22:30 PM UTC-4, Brian Thomas wrote:
>>
>>  I am trying to test querying elasticsearch using Apache Spark using
>> elasticsearch-hadoop.  I am just trying to do a query to the elasticsearch
>> server and return the count of results.
>>
>> Below is my test class using the Java API:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.io.MapWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.spark.SparkConf;
>> import org.apache.spark.api.java.JavaPairRDD;
>> import org.apache.spark.api.java.JavaSparkContext;
>> import org.apache.spark.serializer.KryoSerializer;
>> import org.elasticsearch.hadoop.mr.EsInputFormat;
>>
>> import scala.Tuple2;
>>
>> public class ElasticsearchSparkQuery{
>>
>> public static int query(String masterUrl, String
>> elasticsearchHostPort) {
>> SparkConf sparkConfig = new SparkConf().setAppName("
>> ESQuery").setMaster(masterUrl);
>> sparkConfig.set("spark.serializer",
>> KryoSerializer.class.getName());
>> JavaSparkContext sparkContext = new JavaSparkContext(sparkConfig);
>>
>> Configuration conf = new Configuration();
>> conf.setBoolean("mapred.map.tasks.speculative.execution", false);
>> conf.setBoolean("mapred.reduce.tasks.speculative.execution",
>> false);
>> conf.set("es.nodes", elasticsearchHostPort);
>> conf.set("es.resource", "media/docs");
>> conf.set("es.query", "?q=*");
>>
>> JavaPairRDD<Text, MapWritable> esRDD =
>> sparkContext.newAPIHadoopRDD(conf, EsInputFormat.class, Text.class,
>> MapWritable.class);
>> return (int) esRDD.count();
>> }
>> }
>>
>>
>> When I try to run this I get the following error:
>>
>>
>> 4/07/04 14:58:07 INFO executor.Executor: Running task ID 0
>> 14/07/04 14:58:07 INFO storage.BlockManager: Found block broadcast_0
>> locally
>> 14/07/04 14:58:07 INFO rdd.NewHadoopRDD: Input split: ShardInputSplit
>> [node=[5UATWUzmTUuNzhmGxXWy_w/S'byll|10.45.71.152:9200],shard=0]
>> 14/07/04 14:58:07 WARN mr.EsInputFormat: Cannot determine task id...
>> 14/07/04 14:58:07 ERROR executor.Executor: Exception in task ID 0
>> java.lang.NoSuchFieldError: ALLOW_UNQUOTED_FIELD_NAMES
>> at org.elasticsearch.hadoop.serialization.json.
>> JacksonJsonParser.<init>(JacksonJsonParser.java:38)
>> at org.elasticsearch.hadoop.serialization.ScrollReader.
>> read(ScrollReader.java:75)
>> at org.elasticsearch.hadoop.rest.RestRepository.scroll(
>> RestRepository.java:267)
>> at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(
>> ScrollQuery.java:75)
>> at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.next(
>> EsInputFormat.java:319)
>> at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.
>> nextKeyValue(EsInputFormat.java:255)
>> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(
>> NewHadoopRDD.scala:122)
>> at org.apache.spark.InterruptibleIterator.hasNext(
>> InterruptibleIterator.scala:39)
>> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1014)
>> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
>> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
>> at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(
>> SparkContext.scala:1080)
>> at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(
>> SparkContext.scala:1080)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
>> scala:111)
>> at org.apache.spark.scheduler.Task.run(Task.scala:51)
>> at org.apache.spark.executor.Executor$TaskRunner.run(
>> Executor.scala:187)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Has anyone run into this issue with the JacksonJsonParser?
>>
>>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/9c2b2f2e-5196-4a72-bfbc-4cd0fda9edf0%40googlegroups.com
> 
> .
>
> For more options, visit https://groups.google.com/d/o

Search thread pools not released

2014-07-06 Thread Ivan Brusic
Having an issue on one of my clusters running version 1.1.1 with 8
master/data nodes, unicast, connecting via the Java TransportClient. A few
REST queries are executed via monitoring services.

Currently there is almost no traffic on this cluster. The few queries that
are currently running are either small test queries or large facet queries
(which are infrequent; the longest runs for 16 seconds). What I am
noticing is that the active search thread count on some nodes never decreases,
and when it reaches the limit, the entire cluster stops accepting
requests. The current max is the default (3 x 8).

http://search06:9200/_cat/thread_pool

search05 1.1.1.5 0 0 0 0 0 0 19 0 0
search07 1.1.1.7 0 0 0 0 0 0  0 0 0
search08 1.1.1.8 0 0 0 0 0 0  0 0 0
search09 1.1.1.9 0 0 0 0 0 0  0 0 0
search11 1.1.1.11 0 0 0 0 0 0  0 0 0
search06 1.1.1.6 0 0 0 0 0 0  2 0 0
search10 1.1.1.10 0 0 0 0 0 0  0 0 0
search12 1.1.1.12 0 0 0 0 0 0  0 0 0

In this case, both search05 and search06 have an active thread count that
does not change. If I run a query against search05, the search will respond
quickly and the total number of active search threads does not increase.
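
For completeness, here is roughly how the same numbers can be pulled over the
Java TransportClient instead of _cat (a quick sketch only; I have not verified
every call against 1.1.1):

import org.elasticsearch.action.admin.cluster.node.stats.NodeStats;
import org.elasticsearch.action.admin.cluster.node.stats.NodesStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.threadpool.ThreadPoolStats;

public class SearchPoolWatcher {
    // Print active/queued/rejected counts of the "search" pool for every node.
    public static void dump(Client client) {
        NodesStatsResponse stats = client.admin().cluster()
                .prepareNodesStats()
                .setThreadPool(true)
                .execute().actionGet();
        for (NodeStats node : stats.getNodes()) {
            for (ThreadPoolStats.Stats pool : node.getThreadPool()) {
                if ("search".equals(pool.getName())) {
                    System.out.printf("%s active=%d queue=%d rejected=%d%n",
                            node.getNode().getName(),
                            pool.getActive(), pool.getQueue(), pool.getRejected());
                }
            }
        }
    }
}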

So I have two related issues:
1) the active thread count does not decrease
2) the cluster will not accept requests if one node becomes unstable.

I have seen the issue intermittently in the past, but it has started
again and cluster restarts do not fix the problem. At the log level,
there have been issues with the cluster state not propagating. Not every
node will acknowledge the cluster state ([discovery.zen.publish]
received cluster state version NNN) and the master would log a timeout
(awaiting all nodes to process published state NNN timed out, timeout 30s).
The nodes are fine and they can ping each other with no issues. I am currently
not seeing any log errors related to the thread pool issue, so perhaps it is a red
herring.

Cheers,

Ivan

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCx91LEXP0NxbgC4-mVR27DX%2BuOxyor5cqiM6ie2JExBw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Custom Plugin for specifying custom filter attributes at query time

2014-07-06 Thread 'Sandeep Ramesh Khanzode' via elasticsearch
Hi,

Appreciate your continued assistance. :) Thanks,

Disclaimer: I have yet to understand the ES sources well enough to describe 
my scenario completely. Some of the info below may be conjecture.

I would have a corpus of 50M docs (actually a lot more, but for testing now) 
out of which I would have, say, up to 1M DocIds to be used as a filter. This 
set of 1M docs can be different for different use cases; the point being, 
up to 1M docIds can form one logical set of documents for filtering results. 
If I use a simple IdsFilter from the ES Java API, I would have to keep adding 
these 1M docs to the List implementation internally, and I have a feeling 
it may not scale very well, as they may change per use case and per some 
combinations internal to a single use case as well.

As I debug the code, the IdsFilter will be converted to a Lucene filter. 
Lucene filters, on the other hand, operate on a docId bitset type. That 
gels very well with my requirement, since I can scale with BitSets (I 
assume).

If I can find a way to directly plug this BitSet into the Lucene search() call 
as a Lucene Filter, bypassing the ES filters, using (I don't know) maybe some 
sort of plugin, I believe that may support my cause. I assume I may not 
get to use the Filter cache from ES, but I can probably cache these BitSets 
for subsequent use. 
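
For reference, this is how I read the terms-filter route you suggested (a
minimal sketch against the 1.x Java API; the index name, field name and id
source are only placeholders, not my real setup):

import java.util.List;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;

public class ExternalTermsFilterSearch {
    // Fetch the ids/terms from the external DB first, then hand them to a
    // cached terms filter wrapped in a filtered query.
    public static SearchResponse search(Client client, List<String> idsFromDb) {
        return client.prepareSearch("my_index")
                .setQuery(QueryBuilders.filteredQuery(
                        QueryBuilders.matchAllQuery(),
                        FilterBuilders.termsFilter("my_id_field",
                                idsFromDb.toArray(new String[0]))
                                .cache(true)))
                .setSize(10)
                .execute().actionGet();
    }
}

My concern remains whether this holds up at ~1M terms per request, which is why
I am asking about a BitSet-based plugin.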

Please let me know. And thanks!

Thanks,
Sandeep


On Saturday, 5 July 2014 01:40:55 UTC+5:30, Jörg Prante wrote:
>
> What I understand is a TermsFilter is required
>
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html
>
> and the source of the terms is a DB. That is no problem. The plan is: 
> fetch the terms from the DB, build the query (either Java API or JSON) and 
> execute it.
>
> What I don't understand is the part with the "quick mapping", Lucene, and 
> the doc ids. Lucene doc IDs are not reliable and are not exposed by 
> Elasticsearch, Elasticsearch uses it's own document identifiers which are 
> stable and augmented with info about the index type they belong to, in 
> order to make them unique. But I do not understand why this is important in 
> this context.
>
> Elasticsearch API uses query builders and filter builders to build search 
> requests . A "quick mapping" is just fetching the terms from the DB as a 
> string array before this API is called.
>
> I also do not understand the role of the number "1M", is this the number 
> of fields, or the number of terms? Is it a total number or a number per 
> query?
>
> Did I misunderstand anything more? I am not really sure what is the 
> challenge...
>
> Jörg
>
>
>
> On Fri, Jul 4, 2014 at 8:55 PM, 'Sandeep Ramesh Khanzode' via 
> elasticsearch > wrote:
>
>> Hi,
>>
>> Just to give some background. I will have a large-ish corpus of more than 
>> 100M documents indexed. The filters that I want to apply will be on a field 
>> that is not indexed. I mean, I prefer to not have them indexed in ES/Lucene 
>> since they will be frequently changing. So, for that, I will be maintaining 
>> them elsewhere, like a DB etc.
>>
>> Every time I have a query, I would want to filter the results by those 
>> fields that are not indexed in Lucene. And I am guessing that number may 
>> well be more than 1M. In that case, I think, since we will maintain some 
>> sort of TermsFilter, it may not scale linearly. What I would want to do, 
>> preferably, is to have a hook inside the ES query, so that I can, at query 
>> time, inject the required filter values. Since the filter values have to be 
>> recognized by Lucene, and I will not be indexing them, I will need to do 
>> some quick mapping to get those fields and map them quickly to some field 
>> in Lucene that I can save in the filter. I am not sure whether we can 
>> access and set Lucene DocIDs in the filter or whether they are even exposed 
>> in ES.
>>
>> Please assist with this query. Thanks,
>>
>> Thanks,
>> Sandeep
>>
>>
>> On Thursday, 3 July 2014 21:33:45 UTC+5:30, Jörg Prante wrote:
>>
>>> Maybe I do not fully understand, but in a client, you can fetch the 
>>> required filter terms from any external source before a JSON query is 
>>> constructed?
>>>
>>> Can you give an example what you want to achieve?
>>>
>>> Jörg
>>>
>>>
>>> On Thu, Jul 3, 2014 at 3:34 PM, 'Sandeep Ramesh Khanzode' via 
>>> elasticsearch  wrote:
>>>
 Hi All,

 I am new to ES and I have the following requirement:
 I need to specify a list of strings as a filter that applies to a 
 specific field in the document. Like what a filter does, but instead of 
 sending them on the query, I would like them to be populated from an 
 external sources, like a DB or something. Can you please guide me to the 
 relevant examples or references to achieve this on v1.1.2? 

 Thanks,
 Sandeep

 -- 
 You received this message because you are subscribed to the Google 
 Groups "elasticsearch" group.
 To unsubscribe from this group and stop receiving ema

Re: java.lang.NoSuchFieldError: ALLOW_UNQUOTED_FIELD_NAMES when trying to query elasticsearch using spark

2014-07-06 Thread Brian Thomas
I figured it out: it was a dependency issue in my classpath. Maven was pulling 
down a very old version of the Jackson jar. I added the following line to my 
dependencies and the error went away:

compile 'org.codehaus.jackson:jackson-mapper-asl:1.9.13'
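
For anyone hitting the same NoSuchFieldError, a quick sketch (nothing more) that
shows which jar actually supplies the old org.codehaus Jackson classes at runtime:

import org.codehaus.jackson.JsonParser;

public class JacksonOnClasspath {
    public static void main(String[] args) {
        // Location of the jar that provided org.codehaus.jackson.JsonParser. If this
        // points at a very old 1.x release, ALLOW_UNQUOTED_FIELD_NAMES is missing and
        // elasticsearch-hadoop fails exactly as in the stack trace quoted below.
        System.out.println(JsonParser.class.getProtectionDomain()
                .getCodeSource().getLocation());
        System.out.println(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES);
    }
}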

On Friday, July 4, 2014 3:22:30 PM UTC-4, Brian Thomas wrote:
>
>  I am trying to test querying elasticsearch using Apache Spark using 
> elasticsearch-hadoop.  I am just trying to do a query to the elasticsearch 
> server and return the count of results.
>
> Below is my test class using the Java API:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.MapWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaPairRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.serializer.KryoSerializer;
> import org.elasticsearch.hadoop.mr.EsInputFormat;
>
> import scala.Tuple2;
>
> public class ElasticsearchSparkQuery{
>
> public static int query(String masterUrl, String 
> elasticsearchHostPort) {
> SparkConf sparkConfig = new 
> SparkConf().setAppName("ESQuery").setMaster(masterUrl);
> sparkConfig.set("spark.serializer", 
> KryoSerializer.class.getName());
> JavaSparkContext sparkContext = new JavaSparkContext(sparkConfig);
>
> Configuration conf = new Configuration();
> conf.setBoolean("mapred.map.tasks.speculative.execution", false);
> conf.setBoolean("mapred.reduce.tasks.speculative.execution", 
> false);
> conf.set("es.nodes", elasticsearchHostPort);
> conf.set("es.resource", "media/docs");
> conf.set("es.query", "?q=*");
>
> JavaPairRDD<Text, MapWritable> esRDD = 
> sparkContext.newAPIHadoopRDD(conf, EsInputFormat.class, Text.class,
> MapWritable.class);
> return (int) esRDD.count();
> }
> }
>
>
> When I try to run this I get the following error:
>
>
> 14/07/04 14:58:07 INFO executor.Executor: Running task ID 0
> 14/07/04 14:58:07 INFO storage.BlockManager: Found block broadcast_0 
> locally
> 14/07/04 14:58:07 INFO rdd.NewHadoopRDD: Input split: ShardInputSplit 
> [node=[5UATWUzmTUuNzhmGxXWy_w/S'byll|10.45.71.152:9200],shard=0]
> 14/07/04 14:58:07 WARN mr.EsInputFormat: Cannot determine task id...
> 14/07/04 14:58:07 ERROR executor.Executor: Exception in task ID 0
> java.lang.NoSuchFieldError: ALLOW_UNQUOTED_FIELD_NAMES
> at 
> org.elasticsearch.hadoop.serialization.json.JacksonJsonParser.<init>(JacksonJsonParser.java:38)
> at 
> org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:75)
> at 
> org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:267)
> at 
> org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:75)
> at 
> org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.next(EsInputFormat.java:319)
> at 
> org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.nextKeyValue(EsInputFormat.java:255)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1014)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Has anyone run into this issue with the JacksonJsonParser?
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/9c2b2f2e-5196-4a72-bfbc-4cd0fda9edf0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Match All query performance

2014-07-06 Thread joergpra...@gmail.com
What you see on the CPU may be the overhead of spinning off tasks to be
executed on the segments; maybe your segment count is high and your index
needs optimizing.

On an optimized index with 3 shards on 3 nodes on Red Hat Linux I see
match_all  times around 20-50ms ("took" field).
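
For example, a minimal sketch of forcing the merge down to one segment per shard
from the Java API (the index name is just a placeholder, and optimize is I/O
heavy, so run it off-peak):

import org.elasticsearch.client.Client;

public class ForceMerge {
    // Merge every shard of the given index down to a single segment.
    public static void optimize(Client client, String index) {
        client.admin().indices().prepareOptimize(index)
                .setMaxNumSegments(1)
                .execute().actionGet();
    }
}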

Jörg


On Sun, Jul 6, 2014 at 6:35 PM, Aaron Mefford  wrote:

> Is there any reason that match all queries would be impacted significantly
> by index size?
>
> It seems that in the absence of any sort, query or other mechanism
> requiring scoring it should just be a matter of fetching the first document
> from a shard.  In practice that does not seem to be the case.  On a cluster
> with more than sufficient ram, registering no noticeable disk io, the
> match_all query is reporting took times of 400-500ms.  The match_all query
> seems to use a significant amount of CPU, and when attempted concurrently
> drives the CPU to 100% with only 30 concurrent requests.  This also puts a
> significant level of context switching on the nodes of the cluster.
>
> The cluster in question is described in this post, though it now has 4
> such nodes and performance has not improved.  Sairam has posted a few times
> about it but each thread has just ended with no direction.
>
> https://groups.google.com/d/msg/elasticsearch/P1o_4bVvECA/lDbCp_rCH_YJ
>
> We were able to make some tweaks to the query with filters and sorts, such
> that it is now significantly faster than the match_all query, took times as
> low as 8 where previously it was 800.
>
> Is there something that I am missing?
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/0e88051f-b3b1-44d1-87e5-26245b4e3ab3%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFXH6w5A%2BaQsV2nBjB%3DjqpzRZpVCcCnnMQLrqSfG0WkEw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Match All query performance

2014-07-06 Thread Aaron Mefford
Is there any reason that match all queries would be impacted significantly 
by index size?

It seems that in the absence of any sort, query or other mechanism 
requiring scoring, it should just be a matter of fetching the first document 
from a shard. In practice that does not seem to be the case. On a cluster 
with more than sufficient RAM, registering no noticeable disk I/O, the 
match_all query is reporting took times of 400-500ms. The match_all query 
seems to use a significant amount of CPU, and when attempted concurrently 
it drives the CPU to 100% with only 30 concurrent requests. It also causes a 
significant amount of context switching on the nodes of the cluster.

The cluster in question is described in this post, though it now has 4 such 
nodes and performance has not improved.  Sairam has posted a few times 
about it but each thread has just ended with no direction.

https://groups.google.com/d/msg/elasticsearch/P1o_4bVvECA/lDbCp_rCH_YJ

We were able to make some tweaks to the query with filters and sorts, such 
that it is now significantly faster than the match_all query, with took times as 
low as 8 where previously they were 800. 

Is there something that I am missing?

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/0e88051f-b3b1-44d1-87e5-26245b4e3ab3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Need some ideas: Getting visits from hits out of logstash index

2014-07-06 Thread Stefan
Ah nice, this looks exactly like what I need. But what about memory 
considerations? The problem with histogram facets was that all related 
data has to be loaded into memory, which is horrible if you want to group 
big data 
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-histogram-facet.html#_memory_considerations_3). 
Do you know how the new aggregation feature works internally?

Am Sonntag, 6. Juli 2014 15:15:50 UTC+2 schrieb Antonio Augusto Santos:
>
> DateHistogram aggregation can generate buckets by timeframe 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
>
> You probably want to aggregate by the page and latter aggregate by time or 
> the oposite, what best suites your needs.
>
> On Sunday, July 6, 2014 9:08:03 AM UTC-3, Stefan wrote:
>>
>> Yes, I'm using kibana as well. Out of kibana i can manually extract this 
>> data, but the problem is that a SQL like "group by domain, ip" is not 
>> really doable on a large index. As far as I know anything with grouping 
>> involved is done internally with facets, which doesn't respect any kind of 
>> time filter.
>>
>> Am Sonntag, 6. Juli 2014 11:28:22 UTC+2 schrieb Mark Walkom:
>>>
>>> Are you using kibana? You should be able to extract this pretty simply 
>>> if you are, if not, check it out.
>>>
>>> Regards,
>>> Mark Walkom
>>>
>>> Infrastructure Engineer
>>> Campaign Monitor
>>> email: ma...@campaignmonitor.com
>>> web: www.campaignmonitor.com
>>>  
>>>
>>> On 6 July 2014 19:12, Stefan Hasenstab  wrote:
>>>
  Problem: 

 I have aggregated accesslog data from different webservers in a large 
 logstash index. My goal is to get the page *visits* out of the 
 accesslog hits.

 A *visit* is defined as following: A visit results out of one or more 
 hits from a single ip address in a specific time frame. Due to different 
 products on the webservers each domain should be considered separately.
  My questions are: 

- Can this problem already be solved with build-in elasticsearch 
features? If *yes*, how?
- If *no*:
   - What kind of plugin would you suggest? 

 My own considerations lead from building a custom filter to retrieve 
 just the data I need, to build a plugin which analyses the accesslog index 
 and put the visit-data into a new index.

 Maybe someone can help me? I appreciate every answer. Thank you for 
 your time!

 -- 
 You received this message because you are subscribed to the Google 
 Groups "elasticsearch" group.
 To unsubscribe from this group and stop receiving emails from it, send 
 an email to elasticsearc...@googlegroups.com.
 To view this discussion on the web visit 
 https://groups.google.com/d/msgid/elasticsearch/1abed157-cdc2-4e0f-b314-a954c20b89f2%40googlegroups.com
  
 
 .
 For more options, visit https://groups.google.com/d/optout.

>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/dac3a78f-579b-42b9-b1b6-c93900a542b4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Need some ideas: Getting visits from hits out of logstash index

2014-07-06 Thread Antonio Augusto Santos
The date histogram aggregation can generate buckets by timeframe: 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html

You probably want to aggregate by the page and then aggregate by time, or 
the opposite, whichever best suits your needs.
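
A minimal sketch of that shape with the Java API (the field names "domain",
"clientip" and "@timestamp" and the 30-minute window are assumptions about your
logstash mapping, the cardinality aggregation needs 1.1+, and distinct IPs per
bucket is only an approximation of "visits"):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogram;

public class VisitsSketch {
    // Buckets hits per domain, then per 30-minute window, then counts distinct
    // client IPs inside each window.
    public static SearchResponse visits(Client client) {
        return client.prepareSearch("logstash-*")
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(0)
                .addAggregation(AggregationBuilders.terms("by_domain").field("domain")
                        .subAggregation(AggregationBuilders.dateHistogram("per_window")
                                .field("@timestamp")
                                .interval(DateHistogram.Interval.minutes(30))
                                .subAggregation(AggregationBuilders.cardinality("unique_ips")
                                        .field("clientip"))))
                .execute().actionGet();
    }
}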

On Sunday, July 6, 2014 9:08:03 AM UTC-3, Stefan wrote:
>
> Yes, I'm using kibana as well. Out of kibana i can manually extract this 
> data, but the problem is that a SQL like "group by domain, ip" is not 
> really doable on a large index. As far as I know anything with grouping 
> involved is done internally with facets, which doesn't respect any kind of 
> time filter.
>
> Am Sonntag, 6. Juli 2014 11:28:22 UTC+2 schrieb Mark Walkom:
>>
>> Are you using kibana? You should be able to extract this pretty simply if 
>> you are, if not, check it out.
>>
>> Regards,
>> Mark Walkom
>>
>> Infrastructure Engineer
>> Campaign Monitor
>> email: ma...@campaignmonitor.com
>> web: www.campaignmonitor.com
>>  
>>
>> On 6 July 2014 19:12, Stefan Hasenstab  wrote:
>>
>>>  Problem: 
>>>
>>> I have aggregated accesslog data from different webservers in a large 
>>> logstash index. My goal is to get the page *visits* out of the 
>>> accesslog hits.
>>>
>>> A *visit* is defined as following: A visit results out of one or more 
>>> hits from a single ip address in a specific time frame. Due to different 
>>> products on the webservers each domain should be considered separately.
>>>  My questions are: 
>>>
>>>- Can this problem already be solved with build-in elasticsearch 
>>>features? If *yes*, how?
>>>- If *no*:
>>>   - What kind of plugin would you suggest? 
>>>
>>> My own considerations lead from building a custom filter to retrieve 
>>> just the data I need, to build a plugin which analyses the accesslog index 
>>> and put the visit-data into a new index.
>>>
>>> Maybe someone can help me? I appreciate every answer. Thank you for your 
>>> time!
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to elasticsearc...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/elasticsearch/1abed157-cdc2-4e0f-b314-a954c20b89f2%40googlegroups.com
>>>  
>>> 
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/6beae3e9-1f11-4e36-983b-42bc1bdb5e42%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Need some ideas: Getting visits from hits out of logstash index

2014-07-06 Thread Stefan
Yes, I'm using kibana as well. Out of kibana I can manually extract this 
data, but the problem is that an SQL-like "group by domain, ip" is not 
really doable on a large index. As far as I know, anything with grouping 
involved is done internally with facets, which don't respect any kind of 
time filter.

Am Sonntag, 6. Juli 2014 11:28:22 UTC+2 schrieb Mark Walkom:
>
> Are you using kibana? You should be able to extract this pretty simply if 
> you are, if not, check it out.
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com 
> web: www.campaignmonitor.com
>  
>
> On 6 July 2014 19:12, Stefan Hasenstab > 
> wrote:
>
>>  Problem: 
>>
>> I have aggregated accesslog data from different webservers in a large 
>> logstash index. My goal is to get the page *visits* out of the accesslog 
>> hits.
>>
>> A *visit* is defined as following: A visit results out of one or more 
>> hits from a single ip address in a specific time frame. Due to different 
>> products on the webservers each domain should be considered separately.
>>  My questions are: 
>>
>>- Can this problem already be solved with build-in elasticsearch 
>>features? If *yes*, how?
>>- If *no*:
>>   - What kind of plugin would you suggest? 
>>
>> My own considerations lead from building a custom filter to retrieve just 
>> the data I need, to build a plugin which analyses the accesslog index and 
>> put the visit-data into a new index.
>>
>> Maybe someone can help me? I appreciate every answer. Thank you for your 
>> time!
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/1abed157-cdc2-4e0f-b314-a954c20b89f2%40googlegroups.com
>>  
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/06924fcf-cd3e-4354-aa66-6e58428a9734%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Need some ideas: Getting visits from hits out of logstash index

2014-07-06 Thread Mark Walkom
Are you using kibana? You should be able to extract this pretty simply if
you are; if not, check it out.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com


On 6 July 2014 19:12, Stefan Hasenstab  wrote:

> Problem:
>
> I have aggregated accesslog data from different webservers in a large
> logstash index. My goal is to get the page *visits* out of the accesslog
> hits.
>
> A *visit* is defined as following: A visit results out of one or more
> hits from a single ip address in a specific time frame. Due to different
> products on the webservers each domain should be considered separately.
> My questions are:
>
>- Can this problem already be solved with build-in elasticsearch
>features? If *yes*, how?
>- If *no*:
>   - What kind of plugin would you suggest?
>
> My own considerations lead from building a custom filter to retrieve just
> the data I need, to build a plugin which analyses the accesslog index and
> put the visit-data into a new index.
>
> Maybe someone can help me? I appreciate every answer. Thank you for your
> time!
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/1abed157-cdc2-4e0f-b314-a954c20b89f2%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAEM624beaqys7wyXm_Ye5v37bPcZ9VROGV%2BSCLGh0MseWVsw9g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Need some ideas: Getting visits from hits out of logstash index

2014-07-06 Thread Stefan Hasenstab
Problem:

I have aggregated accesslog data from different webservers in a large 
logstash index. My goal is to get the page *visits* out of the accesslog 
hits.

A *visit* is defined as follows: a visit results from one or more hits 
from a single IP address in a specific time frame. Due to different 
products on the webservers, each domain should be considered separately.
My questions are:
   
   - Can this problem already be solved with built-in elasticsearch 
   features? If *yes*, how?
   - If *no*:
  - What kind of plugin would you suggest?
   
My own considerations range from building a custom filter to retrieve just 
the data I need, to building a plugin which analyses the accesslog index and 
puts the visit data into a new index.

Maybe someone can help me? I appreciate every answer. Thank you for your 
time!

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/1abed157-cdc2-4e0f-b314-a954c20b89f2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: An Open Source implementation of Google Drive Realtime API

2014-07-06 Thread Mark Walkom
Very, very neat!

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com


On 5 July 2014 19:32, 田传武  wrote:

> Hi all,
>
> I'd like to share an open-source project which implements nearly all
> features of the google drive realtime api. The *Google Drive Realtime API*
>  provides Google
> Docs–style instant collaboration. It lets multiple people edit the same
> data simultaneously.
>
> This project was inspired by Google Wave. The server runs on vert.x, and
> uses ElasticSearch for persistent data store and search engine.
> The project is available on github at
> https://github.com/goodow/realtime-store
>
> You can try out the features of the Realtime API on the *live playground*, or
> get the *android demo app* on google play.
> There is also an *Objective-C client library*, but it is not yet fully tested,
> so please use at your own risk!
>
> Enjoy!
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/820ccf96-21d0-4dd3-abfb-b838759d24bc%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAEM624Yt5Qk%3D5OA5H-Z_eteF2S-RMzJ%2BgeM7sf2FeON%2B_DZX5w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.