date:20140825

You should really ask this on the Logstash list -
https://groups.google.com/forum/#!forum/logstash-users

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com


On 26 August 2014 00:49, Shih-Peng Lin  wrote:

>  I am using LogStash to collect the logs from my service. The volume of
> the data is so large (20GB/day) that I am afraid that some of the data will
> be dropped at peak time.
>
> So I asked question
> 
>  in
> Stack Overflow and decided to add a Redis as a buffer between ELB and
> LogStash to prevent data loss.
>
> However, I am curious about *when will LogStash exceed the queue capacity
> and drop messages?*
>
> Because I've done some experiments and the result shows that LogStash can
> completely process all the data without any loss, e.g., local file (a 20GB
> text file) --> LogStash --> local file, netcat --> LogStash --> local file.
>
> Can someone give me a solid example (or scenario, if any) when LogStash
> eventually drops messages? So I can have a better understanding about why
> we need a buffer in front of it.
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/57ad4bed-de0e-442a-bb40-a7d1079a148d%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAEM624aJdffkGA135wQERfMjYdRZdTdDMXa11NHXOfvLHOpy9w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Building an ERP with Elasticsearch. Am I crazy?

2014-08-25 Thread Raphael Waldmann

Hi, 

First I would like to thanks all of you for Elastic. I am thinking in use 
it in a ERP that I am building. What do you think about this? Am I crazy?

Has someone face this? I really don't think that I am comfy enough to do 
this, change the problems that I already know, for new problems that I 
really don't know how to deal. 

I believe that nosql will prevail over traditional sql, but I don't know if 
I am ready to this task.

So how you think that I should integrate (or not) postgresql with 
ELASTICSEARCH?


Thanks again,


rsw1981

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/dfe531a6-675c-4fd6-b6c7-881ff6c00a97%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kibana server-size integration with R, Perl, and other tools

2014-08-25 Thread Brian

Is there some existing method to integrate processing between the Kibana/
Elasticsearch response JSON and the graphing?

For example, I have a Perl script that can convert an Elasticsearch JSON
response into a CSV, even reversing the response to put the oldest event
first (for gnuplot compatibility). I then have an R script that can accept
a CSV and perform custom statistical analysis from it. It can even
auto-detect the timestamp and ordering and reverse the CSV events (adapting
without change to either an Elasticsearch response as CSV, or a direct CSV
export from Splunk).

I've showed the process to a few people, but all balk outright or else shy
away politely at the thought of going to Kibana's Info button, copying and
pasting the curl-based query, and then running it along with the Perl CSV
conversion script and R processing script from the command line. And I
can't blame them!

It may be that Kibana already has the capability to pipe data through
server-installed commands and scripts, but my lack of Javascript experience
and lack of Kibana internals expertise doesn't seem to help me discover it.

Or perhaps this would be a great new addition to Kibana:

1. Allow a server-side command to be in the middle of the response and the
charting.
2. Deliver the response as a CSV with headers, including the @timestamp
field of course, to the server-side command, along with the appropriate
arguments and options for the particular panel.
3. Document the graphite / graphviz / other format required to display the
plots.

Just a thought.

Brian

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/132cfc20-ea67-42c8-a518-48404593d35d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shards

2014-08-25 Thread Markus Wiesenbacher

Hi folks,

 

I am using a single Node-Cluster (v1.3.2) on my PC, and I was wondering that
there are always 5 shards in the file-system (separate Lucene-indices), no
matter how many I configure in in elasticsearch.yml or programmatically with
Java-API (loadFromSource with JSON-String). Do I missunderstand something?

 

Many thanks!

 

Markus ;)

 

BTW: Here´s my JSON for the settings:

 

{ 
   "analysis":{ ... },
   "settings":{ 
  "index":{ 
 "number_of_replicas":1,
 "number_of_shards":3
  }
   }
}

 

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/009901cfc0af%2448b7ab40%24da2701c0%24%40codefreun.de.
For more options, visit https://groups.google.com/d/optout.

Re: Thousand of shards

That sort of shard count is ok on your cluster as you have 17 nodes :)

Can you give us more details on what sort of hardware you run on, your
java, ES and OS versions and releases?

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com


On 26 August 2014 05:29, Casper Thrane  wrote:

> Hi!
>
> I am new to ES, and the system we are using is setup by an external
> consultant. The cluster is very unstable. I have tried to run this:
> -bash-4.1$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
>
> {
>   "cluster_name" : "elasticsearch",
>   "status" : "green",
>   "timed_out" : false,
>   "number_of_nodes" : 17,
>   "number_of_data_nodes" : 4,
>   "active_primary_shards" : 6238,
>   "active_shards" : 12268,
>   "relocating_shards" : 2,
>   "initializing_shards" : 0,
>   "unassigned_shards" : 0
> }
>
> I have read a lot of places, that thousand of shards is a problem. Does
> the above say enough or should I get more data?
>
> Br
> Casper
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/71037f8b-6283-44ac-a8e7-06dfa1461e2d%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAEM624Z-jUCJ7-p1QfXfV9joS0OHZCoTYxMRFHuc5uqHxvtYtg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: JVM crash on 64 bit SPARC with Elasticsearch 1.2.2 due to unaligned memory access

I captured a WireShark trace of the interaction between ES and Logstash 
1.4.1.  The error occurs even before my data is sent.  Can you try to 
reproduce it on your testbed with this message I captured?

curl -XPUT http://amssc103-mgmt-app2:9200/_template/logstash -d @y

Contests of file 'y":
{  "template" : "logstash-*",  "settings" : {"index.refresh_interval" : 
"5s"  },  "mappings" : {"_default_" : {   "_all" : {"enabled" : 
true},   "dynamic_templates" : [ { "string_fields" : { 
  "match" : "*",   "match_mapping_type" : "string",   
"mapping" : { "type" : "string", "index" : "analyzed", 
"omit_norms" : true,   "fields" : { "raw" : 
{"type": "string", "index" : "not_analyzed", "ignore_above" : 256} 
  }   } }   } ],   "properties" : { 
"@version": { "type": "string", "index": "not_analyzed" }, "geoip" 
 : {   "type" : "object", "dynamic": true, 
"path": "full", "properties" : {   "location" : { 
"type" : "geo_point" } } }   }}  }}



On Monday, August 25, 2014 3:53:18 PM UTC-4, tony@iqor.com wrote:
>
> I have no plugins installed (yet) and only changed "es.logger.level" to 
> DEBUG in logging.yml. 
>
> elasticsearch.yml:
> cluster.name: es-AMS1Cluster
> node.name: "KYLIE1"
> node.rack: amssc2client02
> path.data: /export/home/apontet/elasticsearch/data
> path.work: /export/home/apontet/elasticsearch/work
> path.logs: /export/home/apontet/elasticsearch/logs
> network.host:    <= sanitized line; file contains actual 
> server IP 
> discovery.zen.ping.multicast.enabled: false
> discovery.zen.ping.unicast.hosts: ["s1", "s2", "s3", "s5" , "s6", "s7"]   
> <= Also sanitized
>
> Thanks,
> Tony
>
>
>
>
> On Saturday, August 23, 2014 6:29:40 AM UTC-4, Jörg Prante wrote:
>>
>> I tested a simple "Hello World" document on Elasticsearch 1.3.2 with 
>> Oracle JDK 1.7.0_17 64-bit Server VM, Sparc Solaris 10, default settings.
>>
>> No issues.
>>
>> So I would like to know more about the settings in elasticsearch.yml, the 
>> mappings, and the installed plugins.
>>
>> Jörg
>>
>>
>> On Sat, Aug 23, 2014 at 11:25 AM, joerg...@gmail.com  
>> wrote:
>>
>>> I have some Solaris 10 Sparc V440/V445 servers available and can try to 
>>> reproduce over the weekend.
>>>
>>> Jörg
>>>
>>>
>>> On Sat, Aug 23, 2014 at 4:37 AM, Robert Muir >> > wrote:
>>>
 How big is it? Maybe i can have it anyway? I pulled two ancient 
 ultrasparcs out of my closet to try to debug your issue, but unfortunately 
 they are a pita to work with (dead nvram battery on both, zeroed mac 
 address, etc.) Id still love to get to the bottom of this.
  On Aug 22, 2014 3:59 PM,  wrote:

> Hi Adrien,
> It's a bunch of garbled binary data, basically a dump of the process 
> image.
> Tony
>
>
> On Thursday, August 21, 2014 6:36:12 PM UTC-4, Adrien Grand wrote:
>>
>> Hi Tony,
>>
>> Do you have more information in the core dump file? (cf. the "Core 
>> dump written" line that you pasted)
>>
>>
>> On Thu, Aug 21, 2014 at 7:53 PM,  wrote:
>>
>>> Hello,
>>> I installed ES 1.3.2 on a spare Solaris 11/ T4-4 SPARC server to 
>>> scale out of small x86 machine.  I get a similar exception running ES 
>>> with 
>>> JAVA_OPTS=-d64.  When Logstash 1.4.1 sends the first message I get the 
>>> error below on the ES process:
>>>
>>>
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> #  SIGBUS (0xa) at pc=0x7a9a3d8c, pid=14473, tid=209
>>> #
>>> # JRE version: 7.0_25-b15
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode 
>>> solaris-sparc compressed oops)
>>> # Problematic frame:
>>> # V  [libjvm.so+0xba3d8c]  Unsafe_GetInt+0x158
>>> #
>>> # Core dump written. Default location: 
>>> /export/home/elasticsearch/elasticsearch-1.3.2/core 
>>> or core.14473
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> #   http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>>
>>> ---  T H R E A D  ---
>>>
>>> Current thread (0x000107078000):  JavaThread 
>>> "elasticsearch[KYLIE1][http_server_worker][T#17]{New I/O worker 
>>> #147}" daemon [_thread_in_vm, id=209, stack(0x5b80,
>>> 0x5b84)]
>>>
>>> siginfo:si_signo=SIGBUS: si_errno=0, si_code=1 (BUS_ADRALN), 
>>> si_addr=0x000709cc09e7
>>>
>>>
>>> I can run ES using 32bit java but have to shrink ES_HEAPS_SIZE more 
>>> than I want to.  Any assistance would be appreciated.
>>>
>>> Regards,
>>> Tony
>>>
>>>
>>> On Tuesday, July 22, 2014 5:43:28 AM UTC-4, David Roberts wrote:

 Hello,
>

Native client strictly as client.

2014-08-25 Thread John Smith

Using 1.3.2

Just to be sure...

If using the Native client APIs...

If creating a node client essentially that client becomes a node in the 
cluster and you can also proxy through it (as i see in the logs it's 
actually binds 9300 and 9200)?

If using the transport client then it's strictly a client an no one else 
can connect or it or proxy through it?

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/e1c92141-ce7d-467b-8f2d-b052f41c1a10%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Is it possible to register a RestFilter without creating a plugin?

2014-08-25 Thread Jinyuan Zhou

Thanks,

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/41dab07d-b7f1-4622-8c77-a9d56b19abed%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: JVM crash on 64 bit SPARC with Elasticsearch 1.2.2 due to unaligned memory access

I was able to trim the heap size and, consequently, the core file down to 
about 530m.  

Tony

On Monday, August 25, 2014 3:41:14 PM UTC-4, tony@iqor.com wrote:
>
> It's as big as my ES_HEAP_SIZE parameter, 30g.
>
> Tony
>
> On Friday, August 22, 2014 10:37:39 PM UTC-4, Robert Muir wrote:
>>
>> How big is it? Maybe i can have it anyway? I pulled two ancient 
>> ultrasparcs out of my closet to try to debug your issue, but unfortunately 
>> they are a pita to work with (dead nvram battery on both, zeroed mac 
>> address, etc.) Id still love to get to the bottom of this.
>> On Aug 22, 2014 3:59 PM,  wrote:
>>
>>> Hi Adrien,
>>> It's a bunch of garbled binary data, basically a dump of the process 
>>> image.
>>> Tony
>>>
>>>
>>> On Thursday, August 21, 2014 6:36:12 PM UTC-4, Adrien Grand wrote:

 Hi Tony,

 Do you have more information in the core dump file? (cf. the "Core dump 
 written" line that you pasted)


 On Thu, Aug 21, 2014 at 7:53 PM,  wrote:

> Hello,
> I installed ES 1.3.2 on a spare Solaris 11/ T4-4 SPARC server to scale 
> out of small x86 machine.  I get a similar exception running ES with 
> JAVA_OPTS=-d64.  When Logstash 1.4.1 sends the first message I get the 
> error below on the ES process:
>
>
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGBUS (0xa) at pc=0x7a9a3d8c, pid=14473, tid=209
> #
> # JRE version: 7.0_25-b15
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode 
> solaris-sparc compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0xba3d8c]  Unsafe_GetInt+0x158
> #
> # Core dump written. Default location: 
> /export/home/elasticsearch/elasticsearch-1.3.2/core 
> or core.14473
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.sun.com/bugreport/crash.jsp
> #
>
> ---  T H R E A D  ---
>
> Current thread (0x000107078000):  JavaThread 
> "elasticsearch[KYLIE1][http_server_worker][T#17]{New I/O worker 
> #147}" daemon [_thread_in_vm, id=209, stack(0x5b80,
> 0x5b84)]
>
> siginfo:si_signo=SIGBUS: si_errno=0, si_code=1 (BUS_ADRALN), 
> si_addr=0x000709cc09e7
>
>
> I can run ES using 32bit java but have to shrink ES_HEAPS_SIZE more 
> than I want to.  Any assistance would be appreciated.
>
> Regards,
> Tony
>
>
> On Tuesday, July 22, 2014 5:43:28 AM UTC-4, David Roberts wrote:
>>
>> Hello,
>>
>> After upgrading from Elasticsearch 1.0.1 to 1.2.2 I'm getting JVM 
>> core dumps on Solaris 10 on SPARC.
>>
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  SIGBUS (0xa) at pc=0x7e452d78, pid=15483, tid=263
>> #
>> # JRE version: Java(TM) SE Runtime Environment (7.0_55-b13) (build 
>> 1.7.0_55-b13)
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.55-b03 mixed mode 
>> solaris-sparc compressed oops)
>> # Problematic frame:
>> # V  [libjvm.so+0xc52d78]  Unsafe_GetLong+0x158
>>
>> I'm pretty sure the problem here is that Elasticsearch is making 
>> increasing use of "unsafe" functions in Java, presumably to speed things 
>> up, and some CPUs are more picky than others about memory alignment.  In 
>> particular, x86 will tolerate misaligned memory access whereas SPARC 
>> won't.
>>
>> Somebody has tried to report this to Oracle in the past and 
>> (understandably) Oracle has said that if you're going to use unsafe 
>> functions you need to understand what you're doing: 
>> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8021574
>>
>> A quick grep through the code of the two versions of Elasticsearch 
>> shows that the new use of "unsafe" memory access functions is in the 
>> BytesReference, MurmurHash3 and HyperLogLogPlusPlus classes:
>>
>> bash-3.2$ git checkout v1.0.1
>> Checking out files: 100% (2904/2904), done.
>>
>> bash-3.2$ find . -name '*.java' | xargs grep UnsafeUtils
>> ./src/main/java/org/elasticsearch/common/util/UnsafeUtils.java:public 
>> enum UnsafeUtils {
>> ./src/main/java/org/elasticsearch/search/aggregations/bucket/
>> BytesRefHash.java:if (id == -1L || 
>> UnsafeUtils.equals(key, get(id, spare))) {
>> ./src/main/java/org/elasticsearch/search/aggregations/bucket/
>> BytesRefHash.java:} else if (UnsafeUtils.equals(key, 
>> get(curId, spare))) {
>> ./src/test/java/org/elasticsearch/benchmark/common/util/Byte
>> sRefComparisonsBenchmark.java:import org.elasticsearch.common.util.
>> UnsafeUtils;
>> ./src/test/java/org/elasticsearch/benchmark/common/util/Byte
>> sRefComparisonsBenchmark.java:return 
>> UnsafeUtils.equals(b1, b2);

Re: JVM crash on 64 bit SPARC with Elasticsearch 1.2.2 due to unaligned memory access

I have no plugins installed (yet) and only changed "es.logger.level" to 
DEBUG in logging.yml. 

elasticsearch.yml:
cluster.name: es-AMS1Cluster
node.name: "KYLIE1"
node.rack: amssc2client02
path.data: /export/home/apontet/elasticsearch/data
path.work: /export/home/apontet/elasticsearch/work
path.logs: /export/home/apontet/elasticsearch/logs
network.host:    <= sanitized line; file contains actual 
server IP 
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["s1", "s2", "s3", "s5" , "s6", "s7"]   
<= Also sanitized

Thanks,
Tony




On Saturday, August 23, 2014 6:29:40 AM UTC-4, Jörg Prante wrote:
>
> I tested a simple "Hello World" document on Elasticsearch 1.3.2 with 
> Oracle JDK 1.7.0_17 64-bit Server VM, Sparc Solaris 10, default settings.
>
> No issues.
>
> So I would like to know more about the settings in elasticsearch.yml, the 
> mappings, and the installed plugins.
>
> Jörg
>
>
> On Sat, Aug 23, 2014 at 11:25 AM, joerg...@gmail.com  <
> joerg...@gmail.com > wrote:
>
>> I have some Solaris 10 Sparc V440/V445 servers available and can try to 
>> reproduce over the weekend.
>>
>> Jörg
>>
>>
>> On Sat, Aug 23, 2014 at 4:37 AM, Robert Muir > > wrote:
>>
>>> How big is it? Maybe i can have it anyway? I pulled two ancient 
>>> ultrasparcs out of my closet to try to debug your issue, but unfortunately 
>>> they are a pita to work with (dead nvram battery on both, zeroed mac 
>>> address, etc.) Id still love to get to the bottom of this.
>>>  On Aug 22, 2014 3:59 PM, > wrote:
>>>
 Hi Adrien,
 It's a bunch of garbled binary data, basically a dump of the process 
 image.
 Tony


 On Thursday, August 21, 2014 6:36:12 PM UTC-4, Adrien Grand wrote:
>
> Hi Tony,
>
> Do you have more information in the core dump file? (cf. the "Core 
> dump written" line that you pasted)
>
>
> On Thu, Aug 21, 2014 at 7:53 PM,  wrote:
>
>> Hello,
>> I installed ES 1.3.2 on a spare Solaris 11/ T4-4 SPARC server to 
>> scale out of small x86 machine.  I get a similar exception running ES 
>> with 
>> JAVA_OPTS=-d64.  When Logstash 1.4.1 sends the first message I get the 
>> error below on the ES process:
>>
>>
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  SIGBUS (0xa) at pc=0x7a9a3d8c, pid=14473, tid=209
>> #
>> # JRE version: 7.0_25-b15
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode 
>> solaris-sparc compressed oops)
>> # Problematic frame:
>> # V  [libjvm.so+0xba3d8c]  Unsafe_GetInt+0x158
>> #
>> # Core dump written. Default location: 
>> /export/home/elasticsearch/elasticsearch-1.3.2/core 
>> or core.14473
>> #
>> # If you would like to submit a bug report, please visit:
>> #   http://bugreport.sun.com/bugreport/crash.jsp
>> #
>>
>> ---  T H R E A D  ---
>>
>> Current thread (0x000107078000):  JavaThread 
>> "elasticsearch[KYLIE1][http_server_worker][T#17]{New I/O worker 
>> #147}" daemon [_thread_in_vm, id=209, stack(0x5b80,
>> 0x5b84)]
>>
>> siginfo:si_signo=SIGBUS: si_errno=0, si_code=1 (BUS_ADRALN), 
>> si_addr=0x000709cc09e7
>>
>>
>> I can run ES using 32bit java but have to shrink ES_HEAPS_SIZE more 
>> than I want to.  Any assistance would be appreciated.
>>
>> Regards,
>> Tony
>>
>>
>> On Tuesday, July 22, 2014 5:43:28 AM UTC-4, David Roberts wrote:
>>>
>>> Hello,
>>>
>>> After upgrading from Elasticsearch 1.0.1 to 1.2.2 I'm getting JVM 
>>> core dumps on Solaris 10 on SPARC.
>>>
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> #  SIGBUS (0xa) at pc=0x7e452d78, pid=15483, tid=263
>>> #
>>> # JRE version: Java(TM) SE Runtime Environment (7.0_55-b13) (build 
>>> 1.7.0_55-b13)
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.55-b03 mixed mode 
>>> solaris-sparc compressed oops)
>>> # Problematic frame:
>>> # V  [libjvm.so+0xc52d78]  Unsafe_GetLong+0x158
>>>
>>> I'm pretty sure the problem here is that Elasticsearch is making 
>>> increasing use of "unsafe" functions in Java, presumably to speed 
>>> things 
>>> up, and some CPUs are more picky than others about memory alignment.  
>>> In 
>>> particular, x86 will tolerate misaligned memory access whereas SPARC 
>>> won't.
>>>
>>> Somebody has tried to report this to Oracle in the past and 
>>> (understandably) Oracle has said that if you're going to use unsafe 
>>> functions you need to understand what you're doing: 
>>> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8021574
>>>
>>> A quick grep through the code of the two versions of Elasticsearch 
>>> shows that the new use of "unsa

aggregate on analyzed field

2014-08-25 Thread kti_sk

I am aggregating documents by customer name to find how many documents we 
have per customer.
The aggregates bucketize words in names. For example, if I have customer, 
Tom Cruise, I would get 2 buckets, "Tom" and "Cruise"

How would I treat the analyzed field as not_analyzed  in aggregate query?
I still want the field to remain analyzed so that I can do fulltext search.

thanks

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/ff43bdfa-db39-4894-8cf8-a1f6b0df96ce%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: JVM crash on 64 bit SPARC with Elasticsearch 1.2.2 due to unaligned memory access

2014-08-25 Thread 'Sandeep Ramesh Khanzode' via elasticsearch

It's as big as my ES_HEAP_SIZE parameter, 30g.

Tony

On Friday, August 22, 2014 10:37:39 PM UTC-4, Robert Muir wrote:
>
> How big is it? Maybe i can have it anyway? I pulled two ancient 
> ultrasparcs out of my closet to try to debug your issue, but unfortunately 
> they are a pita to work with (dead nvram battery on both, zeroed mac 
> address, etc.) Id still love to get to the bottom of this.
> On Aug 22, 2014 3:59 PM, > wrote:
>
>> Hi Adrien,
>> It's a bunch of garbled binary data, basically a dump of the process 
>> image.
>> Tony
>>
>>
>> On Thursday, August 21, 2014 6:36:12 PM UTC-4, Adrien Grand wrote:
>>>
>>> Hi Tony,
>>>
>>> Do you have more information in the core dump file? (cf. the "Core dump 
>>> written" line that you pasted)
>>>
>>>
>>> On Thu, Aug 21, 2014 at 7:53 PM,  wrote:
>>>
 Hello,
 I installed ES 1.3.2 on a spare Solaris 11/ T4-4 SPARC server to scale 
 out of small x86 machine.  I get a similar exception running ES with 
 JAVA_OPTS=-d64.  When Logstash 1.4.1 sends the first message I get the 
 error below on the ES process:


 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGBUS (0xa) at pc=0x7a9a3d8c, pid=14473, tid=209
 #
 # JRE version: 7.0_25-b15
 # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode 
 solaris-sparc compressed oops)
 # Problematic frame:
 # V  [libjvm.so+0xba3d8c]  Unsafe_GetInt+0x158
 #
 # Core dump written. Default location: 
 /export/home/elasticsearch/elasticsearch-1.3.2/core 
 or core.14473
 #
 # If you would like to submit a bug report, please visit:
 #   http://bugreport.sun.com/bugreport/crash.jsp
 #

 ---  T H R E A D  ---

 Current thread (0x000107078000):  JavaThread 
 "elasticsearch[KYLIE1][http_server_worker][T#17]{New I/O worker #147}" 
 daemon [_thread_in_vm, id=209, stack(0x5b80,
 0x5b84)]

 siginfo:si_signo=SIGBUS: si_errno=0, si_code=1 (BUS_ADRALN), 
 si_addr=0x000709cc09e7


 I can run ES using 32bit java but have to shrink ES_HEAPS_SIZE more 
 than I want to.  Any assistance would be appreciated.

 Regards,
 Tony


 On Tuesday, July 22, 2014 5:43:28 AM UTC-4, David Roberts wrote:
>
> Hello,
>
> After upgrading from Elasticsearch 1.0.1 to 1.2.2 I'm getting JVM core 
> dumps on Solaris 10 on SPARC.
>
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGBUS (0xa) at pc=0x7e452d78, pid=15483, tid=263
> #
> # JRE version: Java(TM) SE Runtime Environment (7.0_55-b13) (build 
> 1.7.0_55-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.55-b03 mixed mode 
> solaris-sparc compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0xc52d78]  Unsafe_GetLong+0x158
>
> I'm pretty sure the problem here is that Elasticsearch is making 
> increasing use of "unsafe" functions in Java, presumably to speed things 
> up, and some CPUs are more picky than others about memory alignment.  In 
> particular, x86 will tolerate misaligned memory access whereas SPARC 
> won't.
>
> Somebody has tried to report this to Oracle in the past and 
> (understandably) Oracle has said that if you're going to use unsafe 
> functions you need to understand what you're doing: 
> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8021574
>
> A quick grep through the code of the two versions of Elasticsearch 
> shows that the new use of "unsafe" memory access functions is in the 
> BytesReference, MurmurHash3 and HyperLogLogPlusPlus classes:
>
> bash-3.2$ git checkout v1.0.1
> Checking out files: 100% (2904/2904), done.
>
> bash-3.2$ find . -name '*.java' | xargs grep UnsafeUtils
> ./src/main/java/org/elasticsearch/common/util/UnsafeUtils.java:public 
> enum UnsafeUtils {
> ./src/main/java/org/elasticsearch/search/aggregations/bucket/
> BytesRefHash.java:if (id == -1L || 
> UnsafeUtils.equals(key, get(id, spare))) {
> ./src/main/java/org/elasticsearch/search/aggregations/bucket/
> BytesRefHash.java:} else if (UnsafeUtils.equals(key, 
> get(curId, spare))) {
> ./src/test/java/org/elasticsearch/benchmark/common/util/Byte
> sRefComparisonsBenchmark.java:import org.elasticsearch.common.util.
> UnsafeUtils;
> ./src/test/java/org/elasticsearch/benchmark/common/util/Byte
> sRefComparisonsBenchmark.java:return 
> UnsafeUtils.equals(b1, b2);
>
> bash-3.2$ git checkout v1.2.2
> Checking out files: 100% (2220/2220), done.
>
> bash-3.2$ find . -name '*.java' | xargs grep UnsafeUtils
> ./src/main/java/org/elasticsearch/common/bytes/BytesReference.java:import 
> org.elasticsearch.common.util.UnsafeUtils;
>

Thousand of shards

2014-08-25 Thread Casper Thrane

Hi!

I am new to ES, and the system we are using is setup by an external 
consultant. The cluster is very unstable. I have tried to run this:
-bash-4.1$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 17,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 6238,
  "active_shards" : 12268,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

I have read a lot of places, that thousand of shards is a problem. Does the 
above say enough or should I get more data?

Br
Casper

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/71037f8b-6283-44ac-a8e7-06dfa1461e2d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

how to disable default-mapping.json for a new index?

2014-08-25 Thread asanderson

I've got a couple dozen or so indexes for which I've defined 
config/default-mapping.json that includes dynamic_templates and properties 
which works fine; however, I now have a new index for which I do not want 
the default-mapping.json to apply. i.e. I just want to use the default 
out-of-the-box Elasticsearch dynamic mappings.

What's the easiest way to do this without having to define every type in 
this new index?

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/d2aa8f4b-624f-4d76-8cc4-1b522e719777%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Query Visualizer

2014-08-25 Thread Ryan Henszey

Greetings all

Awhile back I wrote a query visualizer to help with debugging large 
programmatically generated queries. Figured I would share it here in case 
anyone else could benefit from it. Its not so much an app as it is just a 
page right now. 

github: https://github.com/henszey/elasticsearch-query-visualizer
demo: http://henszey.github.io/elasticsearch-query-visualizer/

If there is something better out there please let me know.

Ryan

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/acc34fd4-4035-4ab8-a006-32991f251fc3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Can't open file to read checksums

2014-08-25 Thread Casper Thrane

Hi!

We get the following errors, on two of our nodes. And after that our 
cluster doesn't work. I have no idea what it means.

[2014-08-25 17:46:39,323][WARN ][indices.store] 
[p-elasticlog03] Can't open file to read checksums
java.io.FileNotFoundException: No such file [_6cq_es090_0.doc]
at 
org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:173)
at 
org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:144)
at 
org.elasticsearch.index.store.DistributorDirectory.openInput(DistributorDirectory.java:130)
at 
org.elasticsearch.index.store.Store$MetadataSnapshot.checksumFromLuceneFile(Store.java:532)
at 
org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:459)
at 
org.elasticsearch.index.store.Store$MetadataSnapshot.(Store.java:433)
at 
org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:271)
at 
org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:186)
at 
org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:140)
at 
org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61)
at 
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:277)
at 
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:268)
at 
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

Br
Casper

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/5f417878-3e49-478d-90e7-ca5c42734567%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Reduce Number of Segments

2014-08-25 Thread Michael McCandless

Which version of ES are you using?  Versions before 1.2 have a bug that
caused merge throttling to throttle far more than requested such that you
couldn't get any faster than ~8 MB / sec.  See
https://github.com/elasticsearch/elasticsearch/issues/6018

Tiered merge policy is best.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Aug 25, 2014 at 1:08 PM, Chris Decker 
wrote:

> All,
>
> I’m looking for advice on how to reduce the number of segments for my
> indices because in my use case (log analysis), quick searches are more
> important than real-time access to data.  I've turned many of the "knobs"
> available within ES, and read many blog postings, ES documentation, etc.,
> but still feel like there is room for important.
>
> Specific questions I have:
> 1. How can I increase the current merge rate?  According to Elastic HQ, my
> merge rate is 6 MB/s (according to Elastic HQ).  I know I don't have SSDs,
> but with 15k drives it seems like I should be able to get better rates.  I
> tried increasing indices.store.throttle.max_bytes_per_sec from the default
> of 20mb to 40mb in my templates, but I didn't see a noticeable change in
> disk IOps or the merge rate the next day.  Did I do something incorrectly?
>  I'm going to experiment with setting it overall
> with index.store.throttle.max_bytes_per_sec and removing it from my
> templates.
> 2. Should I move away from the default merge policy, or stick with the
> default ("tiered")?
>
> Any advice you have is much appreciated; additional details on my
> situation are below.
>
> 
>
> - I generate 2 indices per day - “high” and “low”.  I usually end up with
> ~ 450 segments for my ‘high’ index (see attached), and another ~ 200
> segments for my ‘low’ index, which I then optimize once I roll-over to the
> next day’s indices.
> - 4 ES servers (soon to be 8).
>   — Each server has:
> 12 Xeon cores running at 2.3 GHz
> 15k drives
> 128 GB of RAM
> 68 GB used for OS / file system machine
> 60 GB used by 2 JVMs
> - Index ~ 750 GB per day; 1.5 TB if you include the replicas
> - Relevant configs:
> TEMPLATE:
>   "index.refresh_interval" : "60s",
>   "index.number_of_replicas" : "1",
>   "index.number_of_shards" : "4",
>   "index.merge.policy.max_merged_segment" : "50g",
>   "index.merge.policy.segments_per_tier" : "5",
>   "index.merge.policy.max_merge_at_once" : “5”,
>   "indices.store.throttle.max_bytes_per_sec" : "40mb".
>
> ELASTICSEARCH.YML:
> indices.memory.index_buffer_size: 30%
>
>
>
> Thanks in advance!,
> Chris
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/002cb4cc-fa2e-43c3-b2d3-29580742c91a%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAD7smReAtdSsxEnJzXH%2BAWxSv6G5_-iQWUdbhzu3__rH4LsTNg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

TimeZone for logging

2014-08-25 Thread IronMan2014

How do I change logging timestamps to EST.









appender:

  console:

type: console

layout:

  type: consolePattern

  conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"



  file:

type: dailyRollingFile

file: ${path.logs}/${cluster.name}.log

datePattern: "'.'-MM-dd"

layout:

  type: pattern

  conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"



  index_search_slow_log_file:

type: dailyRollingFile

file: ${path.logs}/${cluster.name}_index_search_slowlog.log

datePattern: "'.'-MM-dd"

layout:

  type: pattern

  conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"



  index_indexing_slow_log_file:

type: dailyRollingFile

file: ${path.logs}/${cluster.name}_index_indexing_slowlog.log

datePattern: "'.'-MM-dd"

layout:

  type: pattern

  conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"


-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/89bbc204-6b2d-4a7e-b90a-4eecf4c4e5a4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

set connect_timeout of elasticsearch php client

2014-08-25 Thread Niv Penso

Hey,

i want to configure a a small timeout between my elasticsearch php client 
to the my elasticsearch server.

i tried to pass some parameters to the guzzle client but it seems this 
doesn't work.
here is the code:

$params = array();
> $params['hosts'] = $hosts;
> $params['guzzleOptions']['connect_timeout'] = 2.0;
> $params['guzzleOptions']['timeout'] = 2.0;
> $this->elastica_obj = new Elasticsearch\Client($params);


i searched and found that the problem might occured because the timeout is 
set in the cURL layer (that is lower than the guzzle)
(http://stackoverflow.com/questions/20847633/limit-connecting-time-with-guzzle-http-php-client)

i guess i need somehow to set CURLOPT_CONNECTTIMEOUT_MS parameter to the 
value i want (2000ms) but i don't see any good way to pass it through the 
elasticsearch php client.

Does someone knows how to do it?

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/cb7eba1e-aa65-4c7f-996f-877b5f1d2639%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Indexing large number of files each with a huge size

2014-08-25 Thread joergpra...@gmail.com

Can you show the program how you index?

Before tuning heap sizes or batch sizes, it is good to check if the program
works correct.

Jörg


On Mon, Aug 25, 2014 at 7:00 PM, 'Sandeep Ramesh Khanzode' via
elasticsearch  wrote:

> Hi,
>
> I am trying to index documents, each file approx ~10-20 MB. I start seeing
> memory issues if I try to index them all in a multi-threaded environment
> from a single TransportClient on one machine to a single node cluster with
> 32GB ES server. It seems like the memory is an issue on the client as well
> as server side, and I probably understand and expect that :).
>
> I have tried tuning the heap sizes and batch sizes in Bulk APIs. However,
> am I trying to push the limits too much? One thought is to probably stream
> the data so that I do not hold it all in memory. Is it possible? Is this a
> general problem or just that my usage is wrong?
>
> Thanks,
> Sandeep
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/d2612109-b31c-4127-857b-f8aa27fb0aeb%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoG7oByjnRhFoHboLJRRzhdBbsr%2BXC8NO0JU9KP0VEU4HQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Reduce Number of Segments

2014-08-25 Thread Chris Decker

All,

I’m looking for advice on how to reduce the number of segments for my 
indices because in my use case (log analysis), quick searches are more 
important than real-time access to data.  I've turned many of the "knobs" 
available within ES, and read many blog postings, ES documentation, etc., 
but still feel like there is room for important.

Specific questions I have:
1. How can I increase the current merge rate?  According to Elastic HQ, my 
merge rate is 6 MB/s (according to Elastic HQ).  I know I don't have SSDs, 
but with 15k drives it seems like I should be able to get better rates.  I 
tried increasing indices.store.throttle.max_bytes_per_sec from the default 
of 20mb to 40mb in my templates, but I didn't see a noticeable change in 
disk IOps or the merge rate the next day.  Did I do something incorrectly? 
 I'm going to experiment with setting it overall 
with index.store.throttle.max_bytes_per_sec and removing it from my 
templates.
2. Should I move away from the default merge policy, or stick with the 
default ("tiered")?

Any advice you have is much appreciated; additional details on my situation 
are below.



- I generate 2 indices per day - “high” and “low”.  I usually end up with ~ 
450 segments for my ‘high’ index (see attached), and another ~ 200 segments 
for my ‘low’ index, which I then optimize once I roll-over to the next 
day’s indices.
- 4 ES servers (soon to be 8).
  — Each server has:  
12 Xeon cores running at 2.3 GHz
15k drives
128 GB of RAM
68 GB used for OS / file system machine
60 GB used by 2 JVMs
- Index ~ 750 GB per day; 1.5 TB if you include the replicas
- Relevant configs:
TEMPLATE:
  "index.refresh_interval" : "60s",
  "index.number_of_replicas" : "1",
  "index.number_of_shards" : "4",
  "index.merge.policy.max_merged_segment" : "50g",
  "index.merge.policy.segments_per_tier" : "5",
  "index.merge.policy.max_merge_at_once" : “5”,
  "indices.store.throttle.max_bytes_per_sec" : "40mb".

ELASTICSEARCH.YML:
indices.memory.index_buffer_size: 30%



Thanks in advance!,
Chris

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/002cb4cc-fa2e-43c3-b2d3-29580742c91a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
index   shard prirep ipsegment generation docs.count 
docs.deleted size size.memory committed searchable version compound 
high-2014.08.24 0 p  1.2.3.4 _lrk 2820858151130 
   3.9gb55697309 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _10g547237   190555190 
  12.6gb   178323656 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _1l2673950   166506980 
   9.6gb   155076791 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _1suo8404842598150 
   2.4gb40337901 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _1xt39047147413940 
   2.9gb44799302 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _24y59972582418050 
   5.3gb79225058 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _27hs   10302434323200 
   2.3gb33132147 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _2ccn   10931952103010 
   3.6gb49775842 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _2frr   113751   101417040 
   6.5gb97469720 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _2iks   11738821203220 
   1.5gb20871023 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _2jvj   11907126549990 
   1.9gb26059406 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _2nc6   12355859417140 
   4.2gb56767385 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _2oid   12507711019720 
 833.2mb11028090 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _2pbu   12613814358020 
 1gb14298968 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _2qlr   12779123866600 
   1.7gb23440373 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _2r4e   128462 6356460 
 485.2mb 7910210 true  true   4.8 false
high-2014.08.24 0 p  1.2.3.4 _2rlz   12909

Indexing large number of files each with a huge size

Hi,

I am trying to index documents, each file approx ~10-20 MB. I start seeing 
memory issues if I try to index them all in a multi-threaded environment 
from a single TransportClient on one machine to a single node cluster with 
32GB ES server. It seems like the memory is an issue on the client as well 
as server side, and I probably understand and expect that :). 

I have tried tuning the heap sizes and batch sizes in Bulk APIs. However, 
am I trying to push the limits too much? One thought is to probably stream 
the data so that I do not hold it all in memory. Is it possible? Is this a 
general problem or just that my usage is wrong?

Thanks,
Sandeep

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/d2612109-b31c-4127-857b-f8aa27fb0aeb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Json Data not getting parsed when sent to Elasticsearch

2014-08-25 Thread Didjit

bump. Anyone?
Thank you,
Chris

On Sunday, August 24, 2014 10:32:23 AM UTC-4, Didjit wrote:
>
> Pretty simple (below). . I just added to json codec and tried again and 
> received the same results. Thank you!
>
> elasticsearch { 
> host => localhost 
> cluster => cjceswin
> node_name => cjcnode
> codec => json
>  index => "logstash-dwhse-%{+.MM.dd}"
>  workers => 3
> }
>
> }
>
> On Sunday, August 24, 2014 10:11:44 AM UTC-4, moshe zada wrote:
>>
>> what is your logstash configuration?
>> did you tried the json codec 
>> ?
>>
>> On Sunday, August 24, 2014 4:54:08 PM UTC+3, Didjit wrote:
>>>
>>> Hi,
>>>
>>> The following is a debug from Logstash:
>>>
>>> {
>>> "message" => 
>>> "{\"EventTime\":\"2014-08-24T09:44:46-0400\",\"URI\":\"
>>> http://ME/rest/venue/ME/hours/2014-08-24\
>>> ",\"uri_payload\":{\"value\":[{\"open\":\"2014-08-24T13:00:00.000+\",\"close\":\"2014-08-24T23:00:00.000+\",\"isOpen\":true,\"date\":\"2014-08-24\"}],\"Count\":1}}\r",
>>>"@version" => "1",
>>>  "@timestamp" => "2014-08-24T13:44:48.036Z",
>>>"host" => "127.0.0.1:60778",
>>>"type" => "MY_Detail",
>>>   "EventTime" => "2014-08-24T09:44:46-0400",
>>> "URI" => "http://ME/rest/venue/ME//hours/2014-08-24";,
>>> "uri_payload" => {
>>> "value" => [
>>> [0] {
>>>   "open" => "2014-08-24T13:00:00.000+",
>>>  "close" => "2014-08-24T23:00:00.000+",
>>> "isOpen" => true,
>>>   "date" => "2014-08-24"
>>> }
>>> ],
>>> "Count" => 1,
>>> "0" => {}
>>> },
>>>  "MYId" => "ME"
>>> }
>>> ___
>>>
>>> When i look into Elasticsearch, the fields under URI Payload are not 
>>> parsed. It shows:
>>>
>>> uri_payload.value as the field with "
>>> {"open":"2014-08-21T13:00:00.000+","close":"2014-08-21T23:00:00.000+","isOpen":true,"date":"2014-08-21"}"
>>>
>>> How can I get all the parsed values as fields in elasticsearch? In my 
>>> example, fields Open, Close, IsOpen. Initially I thought Logstash was not 
>>> parsing all the json, but looking at the debug it is.
>>>
>>> Thank you,
>>>
>>> Chris
>>>
>>>
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/860ab9c6-1867-43d0-b5da-12660ac7eab0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: inconsistent paging

2014-08-25 Thread Ron Sher

Thanks for the answer and sorry for the duplicate (posted from a different 
source by mistake)

On Monday, August 18, 2014 11:02:47 AM UTC+3, Adrien Grand wrote:
>
> Hi Ron,
>
> The cause of this issue is that Elasticsearch uses Lucene's internal doc 
> IDs as tie-breakers. Internal doc IDs might be completely different 
> across replicas of the same data, so this explains why documents that have 
> the same sort values are not consistently ordered.
>
> There are 2 potential ways to fix that problem:
>  1. Use scroll as David mentionned. It will create a context around your 
> request and will make sure that the same shards will be used for all pages. 
> However, it also gives another warranty, which is that the same 
> point-in-time view on the index will be used for each page, and this is 
> expensive to maintain.
>  2. Use a custom string value as a preference in order to always hit the 
> same shards for a given session[1]. This will help with always hitting 
> the same shards likely to 1. but without adding the additional cost of a 
> scroll.
>
> [1] 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-preference.html
>
>
>
> On Mon, Aug 18, 2014 at 8:02 AM, Ron Sher 
> > wrote:
>
>> Hi,
>>
>> We've noticed a strange behavior in elasticsearch during paging. 
>>
>> In one case we use a paging size of 60 and we have 63 documents. So the 
>> first page is using size 60 and offset 0. The second page is using size 60 
>> and offset 60. What we see is that the result is inconsistent. Meaning, 
>> on the 2nd page, we sometimes get results that were before in the 1st page. 
>>
>> The query we use has an order by some numberic field that has many 
>> documents with the same value (0). 
>> It looks like the ordering between documents according to the same value, 
>> which is 0, isn't consistent. 
>>
>> Did anyone encounter such behavior? Any suggestions on resolving this? 
>>
>> We're using version 1.3.1. 
>>
>> Thanks, 
>> Ron
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/CAKHuyJpcYKepYzh%2BBU2MSD2RQ19zjHYiXgf3anWBL9esq9fkGQ%40mail.gmail.com
>>  
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> -- 
> Adrien Grand
>  

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/277ec3ee-f7bf-4862-a816-efe2937a9609%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Parent/Child query performance in version 1.1.2

2014-08-25 Thread Mark Greene

Hey Clinton,

Thanks for the heads up on what's on the horizon. That definitely sounds 
like a drastic improvement. That being said, my fear here is that even with 
that improvement, this data model (parent/child) doesn't seem to that 
performant with a moderate amount of documents. In order for us to really 
adopt this methodology of using parent/child, we'd expect to see sub 100ms 
performance so long as we were feeding ES with enough RAM. 

My hunch here is there must be some code path that is hit when running on 
more than 1 data node that either doesn't write to the cache or skips it on 
the read and hits the disk. We don't have a ton of load on our data nodes, 
CPU is well under 30% and IOWait is usually under 0.30.

Just to reiterate, when we run the parent/child query on one data node, it 
runs in less than 100ms, when it runs across two data nodes, it's >10s. 
This is being experienced on version 1.1.2 and 1.3.2.

On Monday, August 25, 2014 10:55:15 AM UTC-4, Clinton Gormley wrote:
>
> Something else to note: parent-child now uses global ordinals to make 
> queries 3x faster than they were previously, but global ordinals need to be 
> rebuilt after the index has refreshed (assuming some data has changed).
>
> Currently there is no way to refresh p/c global ordinals "eagerly" (ie 
> during the refresh phase) and so it happens on the first query after a 
> refresh.  1.3.3 and 1.4.0 will include an option to allow eager building of 
> global ordinals which should remove this latency spike: 
> https://github.com/elasticsearch/elasticsearch/issues/7394
>
> You may want to consider increasing the refresh_interval so that global 
> ordinals remain valid for longer.
>
>
> On 25 August 2014 16:48, Mark Greene > 
> wrote:
>
>> Hi Adrien,
>>
>> Thanks for reaching out.
>>
>> We actually were exited to see the performance improvements stated in the 
>> 1.2.0 release notes so we upgraded to 1.3.2. We saw some performance 
>> improvement but it wasn't orders of magnitude and queries are still running 
>> very slow.
>>
>> We also tried your suggestion of using the 'preference=_local' query 
>> param but we didn't see any difference there. Additionally, running the 
>> query 10 times, we saw no improvement in speed.
>>
>> Currently, the only major performance increase we've seen with 
>> parent/child queries is dropping down to 1 data node, at which, we see 
>> queries executing well under the 100ms mark.
>>
>>
>>
>>
>> On Friday, August 22, 2014 6:42:27 PM UTC-4, Adrien Grand wrote:
>>
>>> Hi Mark,
>>>
>>> Given that you had 1 replica in your first setup, it could take several 
>>> queries to warm up the field data cache completely, does the query still 
>>> take 16 seconds to run if you run it eg. 10 times? (3 should be enough, but 
>>> just to be sure)
>>>
>>> Does it change anything if you query elasticsearch with 
>>> preference=_local? This should be equivalent to your single-node setup, so 
>>> it would be interesting to see if that changes something.
>>>
>>> As a side note, you might want to try out a more recent version of 
>>> Elasticsearch since parent/child performance improved quite significantly 
>>> in 1.2.0 because of https://github.com/elasticsearch/elasticsearch/
>>> pull/5846
>>>
>>>
>>>
>>> On Fri, Aug 22, 2014 at 11:15 PM, Mark Greene  
>>> wrote:
>>>
 I wanted to update the list with an interesting piece of information. 
 We found that when we took one of our two data nodes out of the cluster, 
 leaving just one data node with no replicas, the query performance 
 increased dramatically. The queries are now returning in <100ms on 
 subsequent executions which is what we'd expect to see as a result of the 
 data being stored in the field data cache. 

 Is it possible that there is some kind of inefficient code path when a 
 query is spread across primary and replica shards?


 On Thursday, August 21, 2014 3:53:40 PM UTC-4, Mark Greene wrote:
>
> We are experiencing slow parent/child queries even when we run the 
> query a second time and I wanted to know if this is just the limit of 
> this 
> feature within ElasticSearch. According to the ES Docs (
> http://www.elasticsearch.org/guide/en/elasticsearch/guide/c
> urrent/parent-child-performance.html) parent/child queries can be 
> 5-10x slower and consume a lot of memory. 
>
> My impression has been that as long as we give ES enough memory via 
> the field data cache, subsequent queries would be quicker than the first 
> time it is executed. We are seeing the following query take ~16 seconds 
> to 
> complete every time. 
>
>
> {
> "from": 0,
> "size": 100,
> "query": {
> "filtered": {
> "query": {
> "match_all": {}
> },
> "filter": {
> "bool": {
> "must": [
>

Re: Simple howto stunnel for elastcisearch cluster.

2014-08-25 Thread John Smith

And yay native API clients are nodes also, which allows them to become 
proxies. So then you need to stunnel protect them also. Rinse and repeat lol

So...

1- For port 9300 bind to localhost
2- Put stunnel infront of port 9300 and configure all nodes same way to 
have cluster node coms in SSL.
3- Restrict any access to 9300. (clients can become proxy nodes, so if they 
are somewhere external to the ES cluster, then you could connect to them 
unauthenticated/non ssl)
3- a) For port 9200 bind to localhost and put Ngnx as reverse proxy (This 
is straight passthrough)
b) Or use 3rd party plugin like jetty plugin (you have to rely that the 
plugin is doing the right thing and has no bugs, plus plugins are not 
necessarily up to speed with latest ES releases)

It's a bit cumbersome but this secures ES to the max. Also this forces the 
use of HTTP client which you then lose some of the niceties you get with 
native client. (Read more here: https://github.com/searchbox-io/Jest)






On Friday, 22 August 2014 13:47:12 UTC-4, John Smith wrote:
>
> Ok so I think I figured it out and seems to be working ok. Please feel 
> free to publish this or improve upon it etc... Note: client certs have not 
> been tested yet.
>
> Software versions used (though I don't think it matters really)
> Ubuntu 14.04
> JDK 1.8_20
> elasticsearch 1.3.2
> stunnel4
>
> This config is for 2 node config.
>
> 
> NODE 1
> 
>
> Required config changes to elasticsearch.yml
>
> # First bind elasticsearch to localhost (this makes es invisible to the 
> outside world)
> network.bind_host: 127.0.0.1
> transport.tcp.port: 9300
>
> # Since we are going to hide this node from the outside, we have to tell 
> the rest of the nodes how he looks on the outside
> network.publish_host: 
> transport.publish_port: 9700
>
> http.port: 9200
>
> # Disable muslticast
> discovery.zen.ping.multicast.enabled: false
>
> # Since we are hiding all the nodes behind stunnel we also need to proxy 
> es client requests through SSL. 
> # For each additional node add 127.0.0.1:970x where x is incremented by 1 
> I.e: 9702, 9703 etc...
> # Connect to NODE 2
> discovery.zen.ping.unicast.hosts: 127.0.0.1:9701
>
> stunnel.conf on NODE 1
>
> ;Proxy ssl for tcp transport.
> [es-trasnport]
> accept = :9300
> connect = 127.0.0.1:9300
> cert = stunnel.pem
>
> ;Proxy ssl for http
> [es-http]
> accept = :9200
> connect = 127.0.0.1:9200
> cert = stunnel.pem
>
> ;ES clustering does some local discovery.
> ;Since stunnel binds it's own ports, we pick an arbitrary port that is not 
> used by other "systems/protocols"
> ; See the publish settings of elasticsearch.yml above.
> [es-transport-local]
> client = yes
> accept = :9700
> connect = :9300
>
> ; The ssl client tunnel for es to connect ssl to node 2.
> [es-transport-node2]
> client = yes
> accept = 127.0.0.1:9701
> connect = :9301
>
> ;For each additional node increment x by 1, I.e: 9702, 9703 etc...
> [es-transport-nodex]
> client = yes
> accept = 127.0.0.1:970x
> connect = :930x
>
> 
> NODE 2
> 
>
> Required config changes to elasticsearch.yml
>
> # First bind elasticsearch to localhost (this makes es invisible to the 
> outside world)
> network.bind_host: 127.0.0.1
> transport.tcp.port: 9301
>
> # Since we are going to hide this node from the outside, we have to tell 
> the rest of the nodes how he looks on the outside
> network.publish_host: 
> transport.publish_port: 9701
>
> http.port: 9200
>
> # Disable muslticast
> discovery.zen.ping.multicast.enabled: false
>
> # Since we are hiding all the nodes behind stunnel we also need to proxy 
> es client requests through SSL. 
> # For each additional node add 127.0.0.1:970x where x is incremented by 1 
> I.e: 9702, 9703 etc...
> # Connect to NODE 1
> discovery.zen.ping.unicast.hosts: 127.0.0.1:9700
>
> stunnel.conf on NODE 2
>
> ;Proxy ssl for tcp transport.
> [es-trasnport]
> accept = :9301
> connect = 127.0.0.1:9301
> cert = stunnel.pem
>
> ;Proxy ssl for http
> [es-http]
> accept = :9200
> connect = 127.0.0.1:9200
> cert = stunnel.pem
>
> ;ES clustering does some local discovery.
> ;Since stunnel binds it's own ports, we pick an arbitrary port that is not 
> used by other "systems/protocols"
> ; See the publish settings of elasticsearch.yml above.
> [es-transport-local]
> client = yes
> accept = :9701
> connect = :9301
>
>
> ; The ssl client tunnel for es to connect ssl to node 1.
> [es-transport-node1]
> client = yes
> accept = 127.0.0.1:9700
> connect = :9300
>
> ;For each additional node increment x by 1, I.e: 9702, 9703 etc...
> [es-transport-nodex]
> client = yes
> accept = 127.0.0.1:970x
> connect = :930x
>
>
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsu

collect fields into the hash

2014-08-25 Thread vitaly

I have the following index:
{
   "message" => "Thu Jun 05 08:00:00 2014 RID 978a1861-1401973200416 
URL .  ",
  "@version" => "1",
"@timestamp" => "2014-08-22T15:46:22.729Z",
  "host" => "",
"kw" => "Ready Mix Concrete",
  "town" => "Zephyrhills",
 "state" => "FL",
"ip" => "63.251.207.54",
   "src" => "comlocal5"
}
{
   "message" => "Thu Jun 05 08:00:00 2014 RID 978a1861-1401973200435 
URL  .  ",
  "@version" => "1",
"@timestamp" => "2014-08-22T15:46:22.729Z",
  "host" => "",
"kw" => "video",
  "town" => "Norfolk",
 "state" => "VA",
"ip" => "216.54.94.2",
   "src" => "Lsxppc21128"
}
For simplicity only 2 documents.

I want to get hash with field "kw" as a key and frequency as a value. 
In this case it will be
hash{"Ready Mix Concrete} => 1
hash{video} => 1

I know that I should possibly use aggregates, but it did not work for me:
>curl -XGET 'http://localhost:9200/_search?search_type=count' -d 
'{"aggregations":{"terms":{"field":"kw"}}}'

{"took":24,"timed_out":false,"_shards":{"total":10,"successful":10,"failed":0},"hits":{"total":4,"max_score":0.0,"hits":[]}}

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/173839c3-62ad-41dd-927b-99628d114a63%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Parent/Child query performance in version 1.1.2

2014-08-25 Thread Clinton Gormley

Something else to note: parent-child now uses global ordinals to make
queries 3x faster than they were previously, but global ordinals need to be
rebuilt after the index has refreshed (assuming some data has changed).

Currently there is no way to refresh p/c global ordinals "eagerly" (ie
during the refresh phase) and so it happens on the first query after a
refresh.  1.3.3 and 1.4.0 will include an option to allow eager building of
global ordinals which should remove this latency spike:
https://github.com/elasticsearch/elasticsearch/issues/7394

You may want to consider increasing the refresh_interval so that global
ordinals remain valid for longer.

On 25 August 2014 16:48, Mark Greene  wrote:

> Hi Adrien,
>
> Thanks for reaching out.
>
> We actually were exited to see the performance improvements stated in the
> 1.2.0 release notes so we upgraded to 1.3.2. We saw some performance
> improvement but it wasn't orders of magnitude and queries are still running
> very slow.
>
> We also tried your suggestion of using the 'preference=_local' query param
> but we didn't see any difference there. Additionally, running the query 10
> times, we saw no improvement in speed.
>
> Currently, the only major performance increase we've seen with
> parent/child queries is dropping down to 1 data node, at which, we see
> queries executing well under the 100ms mark.
>
>
>
>
> On Friday, August 22, 2014 6:42:27 PM UTC-4, Adrien Grand wrote:
>
>> Hi Mark,
>>
>> Given that you had 1 replica in your first setup, it could take several
>> queries to warm up the field data cache completely, does the query still
>> take 16 seconds to run if you run it eg. 10 times? (3 should be enough, but
>> just to be sure)
>>
>> Does it change anything if you query elasticsearch with
>> preference=_local? This should be equivalent to your single-node setup, so
>> it would be interesting to see if that changes something.
>>
>> As a side note, you might want to try out a more recent version of
>> Elasticsearch since parent/child performance improved quite significantly
>> in 1.2.0 because of https://github.com/elasticsearch/elasticsearch/
>> pull/5846
>>
>>
>>
>> On Fri, Aug 22, 2014 at 11:15 PM, Mark Greene  wrote:
>>
>>> I wanted to update the list with an interesting piece of information. We
>>> found that when we took one of our two data nodes out of the cluster,
>>> leaving just one data node with no replicas, the query performance
>>> increased dramatically. The queries are now returning in <100ms on
>>> subsequent executions which is what we'd expect to see as a result of the
>>> data being stored in the field data cache.
>>>
>>> Is it possible that there is some kind of inefficient code path when a
>>> query is spread across primary and replica shards?
>>>
>>>
>>> On Thursday, August 21, 2014 3:53:40 PM UTC-4, Mark Greene wrote:

 We are experiencing slow parent/child queries even when we run the
 query a second time and I wanted to know if this is just the limit of this
 feature within ElasticSearch. According to the ES Docs (
 http://www.elasticsearch.org/guide/en/elasticsearch/guide/c
 urrent/parent-child-performance.html) parent/child queries can be
 5-10x slower and consume a lot of memory.

 My impression has been that as long as we give ES enough memory via the
 field data cache, subsequent queries would be quicker than the first time
 it is executed. We are seeing the following query take ~16 seconds to
 complete every time.

 {
 "from": 0,
 "size": 100,
 "query": {
 "filtered": {
 "query": {
 "match_all": {}
 },
 "filter": {
 "bool": {
 "must": [
 {
 "term": {
 "oid": 61
 }
 },
 {
 "has_child": {
 "type": "social",
 "query": {
 "bool": {
 "should": [
 {
 "term": {
 "engagement.type":
 "like"
 }
 },
 {
 "term": {

 "content.remote_id": "20697868961_10152270678178962"
 }
 }
 ]
 }
 }
>

When will LogStash exceed the queue capacity and drop messages?

2014-08-25 Thread Shih-Peng Lin



I am using LogStash to collect the logs from my service. The volume of the 
data is so large (20GB/day) that I am afraid that some of the data will be 
dropped at peak time.

So I asked question 

 in 
Stack Overflow and decided to add a Redis as a buffer between ELB and 
LogStash to prevent data loss.

However, I am curious about *when will LogStash exceed the queue capacity 
and drop messages?*

Because I've done some experiments and the result shows that LogStash can 
completely process all the data without any loss, e.g., local file (a 20GB 
text file) --> LogStash --> local file, netcat --> LogStash --> local file.

Can someone give me a solid example (or scenario, if any) when LogStash 
eventually drops messages? So I can have a better understanding about why 
we need a buffer in front of it.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/57ad4bed-de0e-442a-bb40-a7d1079a148d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Parent/Child query performance in version 1.1.2

2014-08-25 Thread Mark Greene

Hi Adrien,

Thanks for reaching out.

We actually were exited to see the performance improvements stated in the 
1.2.0 release notes so we upgraded to 1.3.2. We saw some performance 
improvement but it wasn't orders of magnitude and queries are still running 
very slow.

We also tried your suggestion of using the 'preference=_local' query param 
but we didn't see any difference there. Additionally, running the query 10 
times, we saw no improvement in speed.

Currently, the only major performance increase we've seen with parent/child 
queries is dropping down to 1 data node, at which, we see queries executing 
well under the 100ms mark.




On Friday, August 22, 2014 6:42:27 PM UTC-4, Adrien Grand wrote:
>
> Hi Mark,
>
> Given that you had 1 replica in your first setup, it could take several 
> queries to warm up the field data cache completely, does the query still 
> take 16 seconds to run if you run it eg. 10 times? (3 should be enough, but 
> just to be sure)
>
> Does it change anything if you query elasticsearch with preference=_local? 
> This should be equivalent to your single-node setup, so it would be 
> interesting to see if that changes something.
>
> As a side note, you might want to try out a more recent version of 
> Elasticsearch since parent/child performance improved quite significantly 
> in 1.2.0 because of 
> https://github.com/elasticsearch/elasticsearch/pull/5846
>
>
>
> On Fri, Aug 22, 2014 at 11:15 PM, Mark Greene  > wrote:
>
>> I wanted to update the list with an interesting piece of information. We 
>> found that when we took one of our two data nodes out of the cluster, 
>> leaving just one data node with no replicas, the query performance 
>> increased dramatically. The queries are now returning in <100ms on 
>> subsequent executions which is what we'd expect to see as a result of the 
>> data being stored in the field data cache. 
>>
>> Is it possible that there is some kind of inefficient code path when a 
>> query is spread across primary and replica shards?
>>
>>
>> On Thursday, August 21, 2014 3:53:40 PM UTC-4, Mark Greene wrote:
>>>
>>> We are experiencing slow parent/child queries even when we run the query 
>>> a second time and I wanted to know if this is just the limit of this 
>>> feature within ElasticSearch. According to the ES Docs (
>>> http://www.elasticsearch.org/guide/en/elasticsearch/guide/
>>> current/parent-child-performance.html) parent/child queries can be 
>>> 5-10x slower and consume a lot of memory. 
>>>
>>> My impression has been that as long as we give ES enough memory via the 
>>> field data cache, subsequent queries would be quicker than the first time 
>>> it is executed. We are seeing the following query take ~16 seconds to 
>>> complete every time. 
>>>
>>>
>>> {
>>> "from": 0,
>>> "size": 100,
>>> "query": {
>>> "filtered": {
>>> "query": {
>>> "match_all": {}
>>> },
>>> "filter": {
>>> "bool": {
>>> "must": [
>>> {
>>> "term": {
>>> "oid": 61
>>> }
>>> },
>>> {
>>> "has_child": {
>>> "type": "social",
>>> "query": {
>>> "bool": {
>>> "should": [
>>> {
>>> "term": {
>>> "engagement.type": 
>>> "like"
>>> }
>>> },
>>> {
>>> "term": {
>>> "content.remote_id": 
>>> "20697868961_10152270678178962"
>>> }
>>> }
>>> ]
>>> }
>>> }
>>> }
>>> }
>>> ]
>>> }
>>> }
>>> }
>>> },
>>> "fields": "id",
>>> "sort": [
>>> {
>>> "_score": {}
>>> },
>>> {
>>> "id": {
>>> "order": "asc"
>>> }
>>> }
>>> ]
>>> }
>>>
>>>
>>> The index (which has 5 shards with 1 replica shard) we are testing this 
>>> on has 2.2 million parent documents and 1.1 million child documents.
>>>
>>> We are running our two data nodes on r3.2xlarge's which have 8 CPU's, 
>>> 60GB of RAM, and SSD.
>>>
>>> Our ES data nodes have 30G of heap and the field data cache is only 
>>> consuming around ~3G

ElasticSearch-mapper-attachement plugin

2014-08-25 Thread Santosh B

Hi,
As i was exploring the mapper-attachement plugin. I realized that all the 
documents content has to be converted int base64 encoding.? is there any 
way not to do that...?
Sily question..? WHY..? bcoz of memory...? 

And when viewing in Kibana in TABLE PANEL format it shows in base64 
encoding in but in terms panel it shows proper text..am I missing on 
something?
Any help greatl appreciated...

Thanks,
Santosh B


-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/83dbc7e6-20db-49e0-85d0-76c736f97e37%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Node is trying to talk to itslef

2014-08-25 Thread Eugene Strokin

Hello,
I have an old cluster of ES 0.20.1, and it worked fine until recently one 
node got disconnected for unknown reason (probably network failure), and 
after restart it tries to join master but sends request to itself and 
failing with such message:

[2014-08-25 13:59:11,768][INFO ][transport] [es-5] 
bound_address {inet[/10.128.14.228:9300]}, publish_address 
{inet[/10.128.14.228:9300]}
[2014-08-25 13:59:14,980][INFO ][discovery.zen] [es-5] failed 
to send join request to master [[van Dyne, 
Janet][GhTyDsKjReWbQydbsmTSSw][inet[/10.128.14.228:9300]]], reason 
[org.elasticsearch.transport.RemoteTransportException: 
[es-5][inet[/10.128.14.228:9300]][discovery/zen/join]; 
org.elasticsearch.ElasticSearchIllegalStateException: Node 
[[es-5][fbR1Fd2sS4KnA4U5nWMqtw][inet[/10.128.14.228:9300]]] not master for 
join request from 
[[es-5][fbR1Fd2sS4KnA4U5nWMqtw][inet[/10.128.14.228:9300


Note, that bound_address and publish_address are set, and the configured 
node name is [es-5] now (it was set recently as well). But the node calls [van 
Dyne, Janet] which I believe is the old node's name, with the same IP as it 
is.
If anyway to heal this situation without full cluster restart?
Thank you,
Eugene

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/6e29db00-2202-43c6-ab78-5c9f3bed41bb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: One large index vs. many smaller indexes

2014-08-25 Thread Chris Neal

Thanks Adrien!

Very much appreciate your time and help.

Chris


On Mon, Aug 25, 2014 at 3:44 AM, Adrien Grand <
adrien.gr...@elasticsearch.com> wrote:

> I meant tens of shards per node. So if you have N nodes with I indices
> which have S shards and R replicas, that would be (I * S * (1 + R)) / N.
>
> One shard per node is optimal but doesn't allows for growth: if you add
> one more node, you cannot spread the indexing work load, that is why it is
> common to have a few shards per node in order to allow elasticsearch to
> spread the load in case you would introduce a new node in your cluster to
> improve your cluster capacity.
>
>
> On Mon, Aug 25, 2014 at 12:07 AM, Chris Neal 
> wrote:
>
>> Adrien,
>>
>> Thanks so much for the response.  It was very helpful.  I will check out
>> those links on capacity planning for sure.
>>
>> One followup question.  You mention that tens of shards per node would be
>> ok.  Are you meaning tens of shards from tens of indexes?  Or tens of
>> shards for a single index?  Right now I have two servers configured with
>> the index getting 2 shards (one per server), and 1 replica (per server).
>>
>> Chris
>>
>>
>> On Fri, Aug 22, 2014 at 5:58 PM, Adrien Grand <
>> adrien.gr...@elasticsearch.com> wrote:
>>
>>> Hi Chris,
>>>
>>> Usually, the problem is not that much in terms of indices but shards,
>>> which are the physical units of data storage (an index being a logical view
>>> over several shards).
>>>
>>> Something to beware of is that shards typically have some constant
>>> overhead (disk space, file descriptors, memory usage) that does not depend
>>> on the amount of data that they store. Although it would be ok to have up
>>> to a few tens of shards per nodes, you should avoid to have eg. thousands
>>> of shards per node.
>>>
>>> if you plan on always adding a filter for a specific application in your
>>> search requests, then splitting by application makes sense since this will
>>> make the filter useless at search time, you will just need to query the
>>> application-specific index. On the other hand if you don't filter by
>>> application, then splitting data by yourself into smaller indices would be
>>> pretty equivalent to storing everything in a single index with a higher
>>> number of shards.
>>>
>>> You might want to check out the following resources that talk about
>>> capacity planning:
>>>  - http://www.elasticsearch.org/videos/big-data-search-and-analytics/
>>>  -
>>> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/capacity-planning.html
>>>
>>>
>>>
>>> On Fri, Aug 22, 2014 at 9:08 PM, Chris Neal 
>>> wrote:
>>>
 Hi all,

 As the subject says, I'm wondering about index size vs. number of
 indexes.

 I'm indexing many application log files, currently with an index by day
 for all logs, which will make a very large index.  For just a few
 applications in Development, the index is 55GB a day (across 2 servers).
  In prod with all applications, it will be "much more than that".  1TB a
 day maybe?

 I'm wondering if there is value in splitting the indexes by day and by
 application, which would produce more indexes per day, but they would be
 smaller, vs. value in having a single, mammoth index by day alone.

 Is it just a resource question?  If I have enough RAM/disk/CPU to
 support a "mammoth" index, then I'm fine?  Or are there other reasons to
 (or to not) split up indexes?

 Very much appreciate your time.
 Chris

 --
 You received this message because you are subscribed to the Google
 Groups "elasticsearch" group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to elasticsearch+unsubscr...@googlegroups.com.
 To view this discussion on the web visit
 https://groups.google.com/d/msgid/elasticsearch/CAND3DphfsYx0LW0M-yvLWGauRSzVWG0etaBkiTrN7zVafq7tMA%40mail.gmail.com
 
 .
 For more options, visit https://groups.google.com/d/optout.

>>>
>>>
>>>
>>> --
>>> Adrien Grand
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to elasticsearch+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j5i7AAnasMYZgR83aTXvELan%3DkR6OLvGYKfs9d5Subi4A%40mail.gmail.com
>>> 
>>> .
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>  --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To un

Re: How to index Office files? .txt and .pdf are working...

2014-08-25 Thread Dirk Bauer

Hi David,

thx for your help, but it's still not working.

What I did:

The query 

{
  "query": {
"match": {
  "_all": "test"
}
  }
}

delivers all my indexed document (also the '.doc / *.docx files) and I can 
see the base64 stuff in the file.file field.
So this looks good to me.

Then I went to ..\config\logging.yml and added under the "logger:" section 
an entry for 

1st attempt: "org.apache.plugin.mapper.attachments: TRACE"
2nd attempt: "org.apache.tika: TRACE"

After shutdown of ES, restart, deleting the existing index and reindexing 
of my test documents there was no additional entry from the mapper plug or 
tika in the log.
ES is logging fine...

logger:
  # log action execution errors for easier debugging
  action: DEBUG
  # reduce the logging for aws, too much is logged under the default INFO
  com.amazonaws: WARN

  # gateway
  #gateway: DEBUG
  #index.gateway: DEBUG

  # peer shard recovery
  #indices.recovery: DEBUG

  # discovery
  #discovery: TRACE

  index.search.slowlog: TRACE, index_search_slow_log_file
  index.indexing.slowlog: TRACE, index_indexing_slow_log_file

  # DBA: Enabled logger for plugin mapper.attachments
  org.apache.plugin.mapper.attachments: TRACE


The next idea was that maybe the mapping plugin is missing some files for 
parsing for Office documents?
In the plug-in folder I can see *.jar files for 

rome-0.9.jar
tagsoup-1.2.1.jar
tika-core-1.5.jar
tika-parsers-1.5.jar
vorbis-java-core-0.1.jar
vorbis-java-core-0.1-tests.jar
vorbis-java-tika-0.1.jar
xercesImpl-2.8.1.jar
xml-apis-1.3.03.jar
xmpcore-5.1.2.jar
xz-1.2.jar
apache-mime4j-core-0.7.2.jar
apache-mime4j-dom-0.7.2.jar
asm-debug-all-4.1.jar
aspectjrt-1.6.11.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
boilerpipe-1.1.0.jar
commons-compress-1.5.jar
commons-logging-1.1.1.jar
elasticsearch-mapper-attachments-2.3.1.jar
fontbox-1.8.4.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
isoparser-1.0-RC-1.jar
jdom-1.0.jar
jempbox-1.8.4.jar
jhighlight-1.0.jar
juniversalchardet-1.0.3.jar
metadata-extractor-2.6.2.jar
netcdf-4.2-min.jar
pdfbox-1.8.4.jar


Not sure but here you will find additional files "poi*.jar" that should be 
responsible to parse the office files:

http://mvnrepository.com/artifact/org.apache.tika/tika-parsers/1.5


The following files were downloaded to the plugin folder but the documents 
are still not parsed...

poi-3.10-beta2.jar
poi-ooxml-3.10-beta2.jar
poi-scratchpad-3.10-beta2.jar

The last check was to make sure the word document are not corruped. A 
colleage of mine has checked a test file with

java -jar tika-app-1.5.jar –g


and the output was fine for the document.


So, anyone some more ideas??


Thanks

Dirk

Am Montag, 25. August 2014 10:56:54 UTC+2 schrieb David Pilato:
>
> From my experience, this should work. Indexing Word docs should work as 
> Tika support office docs.
>
> Not sure what you are doing wrong. Try to send a match all query and ask 
> for field file.file.
>
> Also, you could set mapper plugin to TRACE mode in logging.yml and see if 
> it tells something interesting.
>
> HTH
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> Le 25 août 2014 à 09:05, Dirk Bauer > a 
> écrit :
>
> Hi,
>
> using elasticsearch-1.3.2 with 
>
> Plug-in
> -
> name: mapper-attachments
> version: 2.3.1
> description: Adds the attachment type allowing to parse difference 
> attachment formats
> jvm: true
> site: false
>
> on Windows 8 for evaluation purpose.
>
> JVM 
> -
> version: 1.7.0_67
> vm_name: Java HotSpot(TM) Client VM
> vm_version: 24.65-b04
> vm_vendor: Oracle Corporation
>
>
> I have created the following mapping:
>
> {
> myIndex: {
> mappings: {
> dokument: {
> properties: {
> created: {
> type: date
> format: dateOptionalTime
> }
> description: {
> type: string
> }
> file: {
> type: attachment
> path: full
> fields: {
> file: {
> type: string
> store: true
> term_vector: with_positions_offsets
> }
> author: {
> type: string
> }
> title: {
> type: string
> }
> name: {
> type: string
> }
> date: {
> type: date
> format: dateOptionalTime
> }
> keywords: {
> type: string
> }
> content_type: {
> type: string
> }
> content_length: {
> type: integer
> }
> language: {
> type: string
> }
> }
> }
> id: {
> type: string
> }
> title: {
> type: string
> }
> }
> }
> }
> }
> }
>
> Because I like to use ES from C#/.NET I have created a little C# app that 
> reads a file as base64 encodes stream from hard drive and put the document 
> to the index of ES. I'm working with this POST request:
>
> {
>   "id": "8dbf1d73-44d1-4e20-aa35-13b18ddf5057",
>   "title": "Test",
>   "description": "Test Description",
>   "created": "2014-01-20T19:04:20.1019885+01:00",
>   "file": {
> "_content_type": "application/pdf",
> "_name": "Test.pdf",
> "content": "---my base64 stuff here---"
>   }
> }
>
> and send it as index command to ES like this:
>
> myIndex/dokument/8dbf1d73-44d1-4e20-aa35-13b18ddf5057?refresh=true
>
> After that I query ES with this request:
>
> {
>   "fi

Re: Completion mapping type throws a misleading error on null value

2014-08-25 Thread Gérald Quintana

Replacing null by " " (espace) does the job.

Gérald

Le lundi 25 août 2014 13:51:12 UTC+2, Gérald Quintana a écrit :
>
> Hello,
>
> I am experiencing the same problem: I am indexing a field (libelleVoie) 
> with a mapping of type completion, but this field is sometimes null, and 
> then I get this error:
>
> [2014-08-25 13:16:49,500][DEBUG][action.bulk  ] 
> [gqa-es-node-1] [fiche_immeuble][4] failed to execute bulk item (index) 
> index {[fiche_immeuble][terrain][020002], 
> source[{"id":350,"numero":"020002","numeroVoie":null,"libelleVoie":null,"commune":{"codePostal":"02130","codeInsee":"02127","libelle":"BRUYERES
>  
> SUR FERE"}}]}
> org.elasticsearch.index.mapper.MapperParsingException: failed to parse
> at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:536)
> at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:462)
> at 
> org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:394)
> at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:413)
>
> at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:155)
> at 
> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:534)
> at 
> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:433)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: 
> Unknown field name[commune], must be one of [context, payload, input, 
> weight, output]
> at 
> org.elasticsearch.index.mapper.core.CompletionFieldMapper.parse(CompletionFieldMapper.java:256)
> at 
> org.elasticsearch.index.mapper.core.AbstractFieldMapper$MultiFields.parse(AbstractFieldMapper.java:927)
> at 
> org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:422)
> at 
> org.elasticsearch.index.mapper.object.ObjectMapper.serializeNullValue(ObjectMapper.java:526)
> at 
> org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:486)
> at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:515)
> ... 9 more
>
> Is there a worlaround (using ES 1.2.2)?
>
> Gérald

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/c61f3b19-0d99-4bf3-b72a-a1cf0bc67534%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: stuck thread problem?

2014-08-25 Thread Patrick Proniewski

On 25 août 2014, at 13:51, Mark Walkom  wrote:

> (You should really set Xms and Xmx to be the same.)

Ok, I'll do this next time I restart.


> But it's not faulty, it's probably just GC which should be visible in the
> logs. How much data do you have in your "cluster"?


NAME   USED  AVAIL  REFER  MOUNTPOINT
zdata/elasticsearch   4.07G  1.01T  4.07G  /zdata/elasticsearch

It's only daily indices (logstash)
Is GC supposed to last more than 3 days? And I can't find any reference to a 
garbage collecting in /var/log/elasticsearch/*.



> On 25 August 2014 21:19, Patrick Proniewski 
> wrote:
> 
>> Hello,
>> 
>> I'm running an ELK install for few months now, and few weeks ago I've
>> noticed a strange behavior: ES had some kind of stuck thread consuming
>> 20-70% of a CPU core. It remained unnoticed for days. Then I've restarted
>> ES, and it all came back to normal, until it started again 2 weeks later to
>> consume CPU doing nothing. New restart, and 2 weeks later same problem.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/6D4EDFB2-21CB-46DB-B765-CD6FC8C74D45%40patpro.net.
For more options, visit https://groups.google.com/d/optout.

Re: stuck thread problem?

(You should really set Xms and Xmx to be the same.)

But it's not faulty, it's probably just GC which should be visible in the
logs. How much data do you have in your "cluster"?

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com


On 25 August 2014 21:19, Patrick Proniewski 
wrote:

> Hello,
>
> I'm running an ELK install for few months now, and few weeks ago I've
> noticed a strange behavior: ES had some kind of stuck thread consuming
> 20-70% of a CPU core. It remained unnoticed for days. Then I've restarted
> ES, and it all came back to normal, until it started again 2 weeks later to
> consume CPU doing nothing. New restart, and 2 weeks later same problem.
>
> I'm running:
>
> ES 1.1.0 on FreeBSD 9.2-RELEASE amd64.
>
> /usr/local/openjdk7/bin/java -Des.pidfile=/var/run/elasticsearch.pid
> -Djava.net.preferIPv4Stack=true -server -Xms1g -Xmx2g -Xss256k
> -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch
> -Des.config=/usr/local/etc/elasticsearch/elasticsearch.yml -cp
> /usr/local/lib/elasticsearch/elasticsearch-1.1.0.jar:/usr/local/lib/elasticsearch/*:/usr/local/lib/elasticsearch/sigar/*
> org.elasticsearch.bootstrap.Elasticsearch
>
> # java -version
> openjdk version "1.7.0_51"
> OpenJDK Runtime Environment (build 1.7.0_51-b13)
> OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
>
> Here is a sample top output showing the faulty thread:
>
> last pid: 37009;  load averages:  0.79,  0.73,  0.71   up
> 53+18:11:28  11:48:59
> 932 processes: 9 running, 901 sleeping, 22 waiting
> CPU: 10.8% user,  0.0% nice,  0.2% system,  0.0% interrupt, 89.0% idle
> Mem: 4139M Active, 1129M Inact, 8962M Wired, 1596M Free
> ARC: 5116M Total, 2135M MFU, 2227M MRU, 12M Anon, 133M Header, 609M Other
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  PRI NICE   SIZERES STATE   C   TIME   WCPU COMMAND
>../..
> 74417 elasticsearch  330  2791M  2058M uwait   5 153:33 23.97%
> java{java}<--
> 74417 elasticsearch  270  2791M  2058M uwait   7  43:02  7.96%
> java{java}
> 74417 elasticsearch  270  2791M  2058M uwait   6  43:02  7.96%
> java{java}
> 74417 elasticsearch  220  2791M  2058M uwait   1   7:32  2.20%
> java{java}
> 74417 elasticsearch  220  2791M  2058M uwait   5   8:26  1.76%
> java{java}
> 74417 elasticsearch  220  2791M  2058M uwait   7   8:25  1.76%
> java{java}
> 74417 elasticsearch  220  2791M  2058M uwait   5   8:25  1.76%
> java{java}
> 74417 elasticsearch  220  2791M  2058M uwait   5   8:26  1.66%
> java{java}
> 74417 elasticsearch  220  2791M  2058M uwait   7   8:26  1.66%
> java{java}
> 74417 elasticsearch  220  2791M  2058M uwait   4   8:25  1.66%
> java{java}
> 74417 elasticsearch  220  2791M  2058M uwait   6   8:25  1.66%
> java{java}
> 74417 elasticsearch  220  2791M  2058M uwait   1   8:25  1.66%
> java{java}
>
> Nothing to be found in  log files...
>
> Any idea?
>
> Patrick
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/7C4F5EC0-BA5D-4D8C-AB39-EEA4250956DC%40patpro.net
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAEM624Z2AiKZPs2EoQDSqAvYLVzAwJvwC%3DiFV4G4f6OGPcPOCA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: Completion mapping type throws a misleading error on null value

2014-08-25 Thread Gérald Quintana

Hello,

I am experiencing the same problem: I am indexing a field (libelleVoie) 
with a mapping of type completion, but this field is sometimes null, and 
then I get this error:

[2014-08-25 13:16:49,500][DEBUG][action.bulk  ] [gqa-es-node-1] 
[fiche_immeuble][4] failed to execute bulk item (index) index 
{[fiche_immeuble][terrain][020002], 
source[{"id":350,"numero":"020002","numeroVoie":null,"libelleVoie":null,"commune":{"codePostal":"02130","codeInsee":"02127","libelle":"BRUYERES
 
SUR FERE"}}]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse
at 
org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:536)
at 
org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:462)
at 
org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:394)
at 
org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:413)

at 
org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:155)
at 
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:534)
at 
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:433)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: Unknown 
field name[commune], must be one of [context, payload, input, weight, 
output]
at 
org.elasticsearch.index.mapper.core.CompletionFieldMapper.parse(CompletionFieldMapper.java:256)
at 
org.elasticsearch.index.mapper.core.AbstractFieldMapper$MultiFields.parse(AbstractFieldMapper.java:927)
at 
org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:422)
at 
org.elasticsearch.index.mapper.object.ObjectMapper.serializeNullValue(ObjectMapper.java:526)
at 
org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:486)
at 
org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:515)
... 9 more

Is there a worlaround (using ES 1.2.2)?

Gérald

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/104f8954-cebf-4def-aff1-56342ca0b80b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Query pre-processing before execution?

2014-08-25 Thread Otis Gospodnetic

Hi,

On Monday, August 25, 2014 11:40:53 AM UTC+2, Jörg Prante wrote:
>
> I do not fully understand what an external filter service is but I 
> remember such a question before. It does not matter where the filter terms 
> come from, you can set up your application, and add filter terms at ES 
> language level from there. This is the most flexible and scalable approach.
>

I think by "your application" you mean the client making the call to ES to 
execute a query, right?
If yes, I agree.  But that requires this client application to do all this 
work.  What if one wants to alter the query on the server/ES-side without 
the client having to do the work?

It is not feasible to build a long string of fileld1:foo AND field2:bar AND 
> field3:test. You should really use the ES Java API or the DSL to build 
> filters, not Lucene Query language.
>
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html
>

I *think* the Lucene query-like string was just "pseudo-query" to avoid 
typing in the JSON variant of that. 

A special case is where a service already provides Lucene bitsets and this 
> should be processed in ES. I have not enough imagination how this can work 
> at all, regarding the distributed nature of ES, but you can plug Lucene 
> filters (bitsets) via the filter API into ES queries, see 
> org.elasticsearch.index.query.TermsFilterParser 
> for an example how ES transforms JSON filter terms to Lucene filters.
>

Thanks Jörg, will look into that!

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
 

> On Mon, Aug 25, 2014 at 10:58 AM, Pawel > 
> wrote:
>
>> Hi Joerg,
>> You are right about analyzer. I have also a little different case or 
>> maybe I missed something (and analyzer-way can also handle my case).
>>
>> I'd like to process a query and add additional filter to each of queries. 
>> To build this filter external service should be queried to fetch additional 
>> data and use this data to build proper filter. Filter can be build for a 
>> few fields: for example fileld1:foo AND field2:bar AND field3:test. Do you 
>> have any suggestions?
>>
>> --
>> Paweł
>>
>> On Thu, Aug 21, 2014 at 10:31 AM, joerg...@gmail.com  <
>> joerg...@gmail.com > wrote:
>>
>>> I would rather use the analyzer/token filter machinery of Lucene for 
>>> search/index extensions, plugging this into ES is a breeze. 
>>>
>>> If you want field specific mangling, I would use the field mapper to 
>>> create a new field type. There, you have read access to the whole 
>>> (immutable) document source and you can pre-process the field input data in 
>>> the given document context before indexing.
>>>
>>> Jörg
>>>
>>>
>>> On Wed, Aug 20, 2014 at 12:00 PM, Otis Gospodnetic <
>>> otis.gos...@gmail.com > wrote:
>>>
 Hi,

 What is the best way to pre-process a query a bit before ES executes 
 it? (e.g. I want to shingle the query string a bit and expand/rewrite a 
 query before letting ES execute it)

 I can create a custom Rest Action and a new API endpoint, but I'd 
 prefer to hide custom query pre-processing behind the standard ES query 
 API.

 Is there any way to do that?

 Thanks,
 Otis
 --
 Elasticsearch Performance Monitoring * Log Management * Search Analytics
 http://sematext.com/



-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/d1f3ea20-1118-4044-92d6-f96fa0f83cd7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

stuck thread problem?

2014-08-25 Thread Patrick Proniewski

Hello,

I'm running an ELK install for few months now, and few weeks ago I've noticed a 
strange behavior: ES had some kind of stuck thread consuming 20-70% of a CPU 
core. It remained unnoticed for days. Then I've restarted ES, and it all came 
back to normal, until it started again 2 weeks later to consume CPU doing 
nothing. New restart, and 2 weeks later same problem.

I'm running: 

ES 1.1.0 on FreeBSD 9.2-RELEASE amd64.

/usr/local/openjdk7/bin/java -Des.pidfile=/var/run/elasticsearch.pid 
-Djava.net.preferIPv4Stack=true -server -Xms1g -Xmx2g -Xss256k 
-Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC 
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly 
-XX:+HeapDumpOnOutOfMemoryError -Delasticsearch 
-Des.config=/usr/local/etc/elasticsearch/elasticsearch.yml -cp 
/usr/local/lib/elasticsearch/elasticsearch-1.1.0.jar:/usr/local/lib/elasticsearch/*:/usr/local/lib/elasticsearch/sigar/*
 org.elasticsearch.bootstrap.Elasticsearch

# java -version
openjdk version "1.7.0_51"
OpenJDK Runtime Environment (build 1.7.0_51-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Here is a sample top output showing the faulty thread:

last pid: 37009;  load averages:  0.79,  0.73,  0.71   up 
53+18:11:28  11:48:59
932 processes: 9 running, 901 sleeping, 22 waiting
CPU: 10.8% user,  0.0% nice,  0.2% system,  0.0% interrupt, 89.0% idle
Mem: 4139M Active, 1129M Inact, 8962M Wired, 1596M Free
ARC: 5116M Total, 2135M MFU, 2227M MRU, 12M Anon, 133M Header, 609M Other
Swap: 4096M Total, 4096M Free

  PID USERNAME  PRI NICE   SIZERES STATE   C   TIME   WCPU COMMAND
   ../..
74417 elasticsearch  330  2791M  2058M uwait   5 153:33 23.97% java{java}   
 <--
74417 elasticsearch  270  2791M  2058M uwait   7  43:02  7.96% java{java}
74417 elasticsearch  270  2791M  2058M uwait   6  43:02  7.96% java{java}
74417 elasticsearch  220  2791M  2058M uwait   1   7:32  2.20% java{java}
74417 elasticsearch  220  2791M  2058M uwait   5   8:26  1.76% java{java}
74417 elasticsearch  220  2791M  2058M uwait   7   8:25  1.76% java{java}
74417 elasticsearch  220  2791M  2058M uwait   5   8:25  1.76% java{java}
74417 elasticsearch  220  2791M  2058M uwait   5   8:26  1.66% java{java}
74417 elasticsearch  220  2791M  2058M uwait   7   8:26  1.66% java{java}
74417 elasticsearch  220  2791M  2058M uwait   4   8:25  1.66% java{java}
74417 elasticsearch  220  2791M  2058M uwait   6   8:25  1.66% java{java}
74417 elasticsearch  220  2791M  2058M uwait   1   8:25  1.66% java{java}

Nothing to be found in  log files...

Any idea?

Patrick

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/7C4F5EC0-BA5D-4D8C-AB39-EEA4250956DC%40patpro.net.
For more options, visit https://groups.google.com/d/optout.

Re: Topics/Entities with relevancy scores and searching

2014-08-25 Thread Clinton Gormley

On 24 August 2014 19:46, Scott Decker  wrote:

> Have you done this? any concerns to performance with this sort of scoring,
> or, it is just as fast if you were doing base lucene scoring if we override
> the score function and just use our own?
> -- we will of course try it and run our own performance tests, just
> looking to see if you all ready have any insights.
>

I haven't benchmarked it myself.  Obviously accessing payloads is slower
than not, and some further work could be done on the scripting side to
cache some term statistics lookups, but I don't know how performance will
compare to doing this natively.

Would be interested in your feedback

clint

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAPt3XKQi%3DLMo83S6w-LZyrGz%3DD3gHPf0B1ZbU-EGkS6p9c9jPA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: Error running ES DSL in hadoop mapreduce

2014-08-25 Thread Sona Samad

One more update on the issue:

I tried changing the query using 'sum'

{
  "size":0,
  "aggs": {
"group_by_BodyPart": {
  "terms": {
"field": "body_part",
"size": 5,
 "order" : { "examcount" : "desc" }   
},
  "aggs" : {
"examcount" : { "sum" : { "field" : "ExamRowKey" } }
}
}
  }
}

But both these queries returns entire search result records to Mapper 
method, inspite of giving "size": 5 in query string.


Thanks,
Sona


On Monday, August 25, 2014 3:17:03 PM UTC+5:30, Sona Samad wrote:
>
> Thanks Adrien.
>
> The ExamRowKey and body_part are Strings uploaded from csv file using 
> LogStash to ElasticSearch.
>
> - how reproducible is it? Ie. if you run this query 10 times, how many of 
> these queries will write such lines to the logs?
> This query returns the error each time its run in the cluster. 
> Other simple queries are returning values. Eg:
>{"query":
>{"term":
> {"ExamRowKey":"4090741090"}
> }
> }
>
> - is it common that several queries will be executing at the same time on 
> your elasticsearch cluster?
> For testing purpose, I was running only the above specified query.
>
> - are there other exception in your logs that happen approximately at the 
> same time?
> No, the stack trace I have posted is the only errors I got today after 
> running the query.
>
> Thanks,
> Sona
>
> On Monday, August 25, 2014 1:34:26 PM UTC+5:30, Adrien Grand wrote:
>>
>> Thanks Sona,
>>
>> This stack trace indicates a bug in the cardinality aggregation. I just 
>> opened an issue for it: 
>> https://github.com/elasticsearch/elasticsearch/issues/7429
>>
>> In order to help me understand/reproduce this bug, could you please 
>> provide the mappings of your ExamRowKey and body_part fields? Also answers 
>> to the questions below would help me understand better what is happening:
>>  - how reproducible is it? Ie. if you run this query 10 times, how many 
>> of these queries will write such lines to the logs?
>>  - is it common that several queries will be executing at the same time 
>> on your elasticsearch cluster?
>>  - are there other exception in your logs that happen approximately at 
>> the same time?
>>
>> Thanks!
>>
>>
>>
>> On Mon, Aug 25, 2014 at 6:10 AM, Sona Samad  wrote:
>>
>>> Hi Adrien,
>>>  
>>> My elasticsearch version is :  elasticsearch-1.2.1 
>>>  
>>> The Maven dependency for hadoop:
>>>  
>>> 
>>>   org.elasticsearch
>>>   elasticsearch-hadoop-mr
>>>   2.0.1
>>>  
>>>  
>>>  
>>> The full stack trace is given below:
>>>  
>>> [2014-08-25 09:31:58,892][DEBUG][action.search.type   ] [Thane 
>>> Ector] [mr][4], node[1ZbXSvkKQC-kDvgMXuC8iQ], [P], s[STARTED]: Failed to 
>>> execute [org.elasticsearch.action.search.SearchRequest@6ed78f6d]
>>> org.elasticsearch.search.query.QueryPhaseExecutionException: [mr][4]: 
>>> query[ConstantScore(cache(_type:logs))],from[0],size[50]: Query Failed 
>>> [Failed to execute main query]
>>>
>>>  at 
>>> org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
>>>  at 
>>> org.elasticsearch.search.SearchService.executeScan(SearchService.java:215)
>>>  at 
>>> org.elasticsearch.search.action.SearchServiceTransportAction$19.call(SearchServiceTransportAction.java:444)
>>>  at 
>>> org.elasticsearch.search.action.SearchServiceTransportAction$19.call(SearchServiceTransportAction.java:441)
>>>  at 
>>> org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
>>>  at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>  at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>  at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 97
>>>  at 
>>> org.elasticsearch.common.util.BigArrays$IntArrayWrapper.set(BigArrays.java:185)
>>>  at 
>>> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus$Hashset.values(HyperLogLogPlusPlus.java:499)
>>>  at 
>>> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.upgradeToHll(HyperLogLogPlusPlus.java:307)
>>>  at 
>>> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collectLcEncoded(HyperLogLogPlusPlus.java:245)
>>>  at 
>>> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collectLc(HyperLogLogPlusPlus.java:239)
>>>  at 
>>> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collect(HyperLogLogPlusPlus.java:231)
>>>  at 
>>> org.elasticsearch.search.aggregations.metrics.cardinality.CardinalityAggregator$DirectCollector.collect(CardinalityAggregator.java:204)
>>>  at 
>>> org.elasticsearch.search.aggregations.metrics.cardinality.CardinalityAggregator.collect(CardinalityAggregator.java:118)
>>>  at 
>>> org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectBucketNoCounts(BucketsAggregato

Re: Elastic search dynamic number of replicas from Java API

2014-08-25 Thread joergpra...@gmail.com

An example of a server-side cluster state listener is in JDBC river plugin

https://github.com/jprante/elasticsearch-river-jdbc/blob/master/src/main/java/org/xbib/elasticsearch/action/river/jdbc/state/RiverStateService.java

I use it to augment the cluster state with river state info.

Jörg


On Fri, Aug 22, 2014 at 10:08 AM, 'Sandeep Ramesh Khanzode' via
elasticsearch  wrote:

> Hi Jorg,
>
> Can you please give a server-side or client-side example of using
> CLusterStateListener?
> Do I have to use a plugin. if so, which module do I register/override?
> If not, do I have to use a Node Client (not a TransportClient), and
> retrieve the ClusterService somehow and then register?
>
> Thanks
> Sandeep
>
>
> On Thursday, 10 July 2014 22:25:51 UTC+5:30, Jörg Prante wrote:
>
>> On the client side, you can't use cluster state listener, it is for nodes
>> that have access to a local copy of the master cluster state. Clients must
>> execute an action to ask for cluster state, and with the current transport
>> request/response cycle, they must poll for new events ...
>>
>> Jörg
>>
>>
>> On Thu, Jul 10, 2014 at 6:38 PM, Ivan Brusic  wrote:
>>
>>> Jörg, have you actually implemented your own ClusterStateListener? I
>>> never had much success. Tried using that interface or even
>>> PublishClusterStateAction.NewClusterStateListener, but either I could
>>> not configure successfully the module (the former) or received no events
>>> (the latter). Implemented on the client side, not as a plugin.
>>>
>>> Cheers,
>>>
>>> Ivan
>>>
>>>
>>>
>>> On Wed, Jul 9, 2014 at 4:21 PM, joerg...@gmail.com 
>>> wrote:
>>>

 4. Yes. Use org.elasticsearch.cluster.ClusterStateListener

  --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to elasticsearc...@googlegroups.com.
>>> To view this discussion on the web visit https://groups.google.com/d/
>>> msgid/elasticsearch/CALY%3DcQBB%3DW_qG9E7i-
>>> sEc6HZeMskxKgbqzaKgqzSQ26sjgT5%2BQ%40mail.gmail.com
>>> 
>>> .
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/35f6b64e-3787-4891-a3a8-518dfd7638e3%40googlegroups.com
> 
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEcFAp443apYZfesyF7%2Bkj-KPMiimObDQ5wOeFuQiUHpw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: Error running ES DSL in hadoop mapreduce

2014-08-25 Thread Sona Samad

Thanks Adrien.

The ExamRowKey and body_part are Strings uploaded from csv file using 
LogStash to ElasticSearch.

- how reproducible is it? Ie. if you run this query 10 times, how many of 
these queries will write such lines to the logs?
This query returns the error each time its run in the cluster. 
Other simple queries are returning values. Eg:
   {"query":
   {"term":
{"ExamRowKey":"4090741090"}
}
}

- is it common that several queries will be executing at the same time on 
your elasticsearch cluster?
For testing purpose, I was running only the above specified query.

- are there other exception in your logs that happen approximately at the 
same time?
No, the stack trace I have posted is the only errors I got today after 
running the query.

Thanks,
Sona

On Monday, August 25, 2014 1:34:26 PM UTC+5:30, Adrien Grand wrote:
>
> Thanks Sona,
>
> This stack trace indicates a bug in the cardinality aggregation. I just 
> opened an issue for it: 
> https://github.com/elasticsearch/elasticsearch/issues/7429
>
> In order to help me understand/reproduce this bug, could you please 
> provide the mappings of your ExamRowKey and body_part fields? Also answers 
> to the questions below would help me understand better what is happening:
>  - how reproducible is it? Ie. if you run this query 10 times, how many of 
> these queries will write such lines to the logs?
>  - is it common that several queries will be executing at the same time on 
> your elasticsearch cluster?
>  - are there other exception in your logs that happen approximately at the 
> same time?
>
> Thanks!
>
>
>
> On Mon, Aug 25, 2014 at 6:10 AM, Sona Samad  > wrote:
>
>> Hi Adrien,
>>  
>> My elasticsearch version is :  elasticsearch-1.2.1 
>>  
>> The Maven dependency for hadoop:
>>  
>> 
>>   org.elasticsearch
>>   elasticsearch-hadoop-mr
>>   2.0.1
>>  
>>  
>>  
>> The full stack trace is given below:
>>  
>> [2014-08-25 09:31:58,892][DEBUG][action.search.type   ] [Thane Ector] 
>> [mr][4], node[1ZbXSvkKQC-kDvgMXuC8iQ], [P], s[STARTED]: Failed to execute 
>> [org.elasticsearch.action.search.SearchRequest@6ed78f6d]
>> org.elasticsearch.search.query.QueryPhaseExecutionException: [mr][4]: 
>> query[ConstantScore(cache(_type:logs))],from[0],size[50]: Query Failed 
>> [Failed to execute main query]
>>
>>  at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
>>  at 
>> org.elasticsearch.search.SearchService.executeScan(SearchService.java:215)
>>  at 
>> org.elasticsearch.search.action.SearchServiceTransportAction$19.call(SearchServiceTransportAction.java:444)
>>  at 
>> org.elasticsearch.search.action.SearchServiceTransportAction$19.call(SearchServiceTransportAction.java:441)
>>  at 
>> org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 97
>>  at 
>> org.elasticsearch.common.util.BigArrays$IntArrayWrapper.set(BigArrays.java:185)
>>  at 
>> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus$Hashset.values(HyperLogLogPlusPlus.java:499)
>>  at 
>> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.upgradeToHll(HyperLogLogPlusPlus.java:307)
>>  at 
>> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collectLcEncoded(HyperLogLogPlusPlus.java:245)
>>  at 
>> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collectLc(HyperLogLogPlusPlus.java:239)
>>  at 
>> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collect(HyperLogLogPlusPlus.java:231)
>>  at 
>> org.elasticsearch.search.aggregations.metrics.cardinality.CardinalityAggregator$DirectCollector.collect(CardinalityAggregator.java:204)
>>  at 
>> org.elasticsearch.search.aggregations.metrics.cardinality.CardinalityAggregator.collect(CardinalityAggregator.java:118)
>>  at 
>> org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectBucketNoCounts(BucketsAggregator.java:74)
>>  at 
>> org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectExistingBucket(BucketsAggregator.java:63)
>>  at 
>> org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.collect(GlobalOrdinalsStringTermsAggregator.java:98)
>>  at 
>> org.elasticsearch.search.aggregations.AggregationPhase$AggregationsCollector.collect(AggregationPhase.java:157)
>>  at 
>> org.elasticsearch.common.lucene.MultiCollector.collect(MultiCollector.java:60)
>>  at 
>> org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:193)
>>  at 
>> org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
>>  at org.apach

Re: Query pre-processing before execution?

2014-08-25 Thread joergpra...@gmail.com

I do not fully understand what an external filter service is but I remember
such a question before. It does not matter where the filter terms come
from, you can set up your application, and add filter terms at ES language
level from there. This is the most flexible and scalable approach.

It is not feasible to build a long string of fileld1:foo AND field2:bar AND
field3:test. You should really use the ES Java API or the DSL to build
filters, not Lucene Query language.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html

A special case is where a service already provides Lucene bitsets and this
should be processed in ES. I have not enough imagination how this can work
at all, regarding the distributed nature of ES, but you can plug Lucene
filters (bitsets) via the filter API into ES queries, see
org.elasticsearch.index.query.TermsFilterParser
for an example how ES transforms JSON filter terms to Lucene filters.

Jörg




On Mon, Aug 25, 2014 at 10:58 AM, Pawel  wrote:

> Hi Joerg,
> You are right about analyzer. I have also a little different case or maybe
> I missed something (and analyzer-way can also handle my case).
>
> I'd like to process a query and add additional filter to each of queries.
> To build this filter external service should be queried to fetch additional
> data and use this data to build proper filter. Filter can be build for a
> few fields: for example fileld1:foo AND field2:bar AND field3:test. Do you
> have any suggestions?
>
> --
> Paweł
>
> On Thu, Aug 21, 2014 at 10:31 AM, joergpra...@gmail.com <
> joergpra...@gmail.com> wrote:
>
>> I would rather use the analyzer/token filter machinery of Lucene for
>> search/index extensions, plugging this into ES is a breeze.
>>
>> If you want field specific mangling, I would use the field mapper to
>> create a new field type. There, you have read access to the whole
>> (immutable) document source and you can pre-process the field input data in
>> the given document context before indexing.
>>
>> Jörg
>>
>>
>> On Wed, Aug 20, 2014 at 12:00 PM, Otis Gospodnetic <
>> otis.gospodne...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> What is the best way to pre-process a query a bit before ES executes it?
>>> (e.g. I want to shingle the query string a bit and expand/rewrite a query
>>> before letting ES execute it)
>>>
>>> I can create a custom Rest Action and a new API endpoint, but I'd prefer
>>> to hide custom query pre-processing behind the standard ES query API.
>>>
>>> Is there any way to do that?
>>>
>>> Thanks,
>>> Otis
>>> --
>>> Elasticsearch Performance Monitoring * Log Management * Search Analytics
>>> http://sematext.com/
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to elasticsearch+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/e99854af-65e1-42c0-9dab-e384c8c281e6%40googlegroups.com
>>> 
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>  --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFzggLcxBa0axOcyZ2v5CxG8v9NmquFcmHp%2ByiS8Z0F4w%40mail.gmail.com
>> 
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAF9ZkbMx70HZ2rOyU%3D%3DQh63j9wYVvjju%2BJMZsO1DQxKnZjN8Xw%40mail.gmail.com
> 
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGzwd%2BrEkCq3SG%3Dap2UjLezS1xDtShcnxfxpfinnwrDLw%40mail.gmail.com.
For more options, visi

Re: Sustainable way to regularly purge deleted docs

I left some comments inline:

On Sat, Aug 23, 2014 at 5:08 PM, Jonathan Foy  wrote:

> I was a bit surprised to see the number of deleted docs grow so large, but
> I won't rule out my having something setup wrong.  Non-default merge
> settings are below, by all means let me know if I've done something stupid:
>
> indices.store.throttle.type: none
> index.merge.policy.reclaim_deletes_weight: 6.0
> index.merge.policy.max_merge_at_once: 5
> index.merge.policy.segments_per_tier: 5
> index.merge.policy.max_merged_segment: 2gb
> index.merge.scheduler.max_thread_count: 3
>

These settings don't look particularly bad, but merge policy tuning is
quite hard and I tend to refrain myself from trying to modify the default
parameters. If you're interested you can have a look at
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
to get a sense of the challenges of a good merge policy.

> I make extensive use of nested documents, and to a smaller degree child
> docs.  Right now things are hovering around 15% deleted after a cleanup on
> Wednesday.  I've also cleaned up my mappings a lot since I saw the 45%
> deleted number (less redundant data, broke some things off into child docs
> to maintain separately), but it was up to 30% this last weekend.  When I've
> looked in the past when I saw the 40+% numbers, the segments in the largest
> tier (2 GB) would sometimes have up to 50+% deleted docs in them, the
> smaller segments all seemed pretty contained, which I guess makes sense as
> they didn't stick around for nearly as long.
>
> As for where the memory is spent, according to ElasticHQ, right now on one
> server I have a 20 GB heap (out of 30.5, which I know is above the 50%
> suggested, just trying to get things to work), I'm using 90% as follows:
>
> Field cache: 5.9 GB
> Filter cache: 4.0 GB (I had reduced this before the last restart, but
> forgot to make the changes permanent.  I do use a lot of filters though, so
> would like to be able to use the cache).
> ID cache: 3.5 GB
>

If you need to get some memory back, you can decrease the size of your
filter cache (uncached filters happen to be quite fast already!) to eg. 1GB
in combination with opting out for caching filters in your queries
(typically  term filters are cached by default although they don't really
need, you can quite safely turn caching off on them, especially if there is
no particular reason that they would be reused across queries).

> Node stats "Segments: memory_in_bytes": 6.65 GB (I'm not exactly sure how
> this one contributes to the total heap number).
>

This is the amount of memory that is used by the index itself. It mostly
loads some small data-structures in memory in order to make search fast. I
said mostly because there is one that can be quite large: the bloom filters
that are loaded to save disk seeks when doing primary-key lookups. We
recently made good improvements that make this bloom filter not necessary
anymore and in 1.4 it will be disabled by default:
https://github.com/elasticsearch/elasticsearch/pull/6959

You can already unload it by setting the `index.codec.bloom.load` setting
to false (it's a live setting so no need to restart or reopen the index),
note that this might however hurt indexing speed.

> As for the disk-based "doc values", I don't know how I have not come
> across them thus far, but that sounds quite promising.  I'm a little late
> in the game to be changing everything yet again, but it may be a good idea
> regardless, and is definitely something I'll read more about and consider
> going forward.  Thank you for bringing it to my attention.
>
> Anyway, my current plan, since I'm running in AWS and have the
> flexibility, is just to add another r3.xlarge node to the cluster over the
> weekend, try the deleted-doc purge, and then pull the node back out after
> moving all shards off of it.  I'm hoping this will allow me to clean things
> up with extra horsepower, but not increase costs too much throughout the
> week.
>
> Thanks for you input, it's very much appreciated.
>
>
>
> On Friday, August 22, 2014 7:14:18 PM UTC-4, Adrien Grand wrote:
>
>> Hi Jonathan,
>>
>> The default merge policy is already supposed to merge quite aggressively
>> segments that contain lots of deleted documents so it is a bit surprising
>> that you can see that many numbers of deleted documents, even with merge
>> throttling disabled.
>>
>> You mention having memory pressure because of the number of documents in
>> your index, do you know what causes this memory pressure? In case it is due
>> to field data maybe you could consider storing field data on disk? (what we
>> call "doc values")
>>
>>
>>
>> On Fri, Aug 22, 2014 at 5:27 AM, Jonathan Foy  wrote:
>>
>>> Hello
>>>
>>> I'm in the process of putting a two-node Elasticsearch cluster (1.1.2)
>>> into production, but I'm having a bit of trouble keeping it stable enough
>>> for comfort.  Specifically, I'm trying to figure out the best way to keep
>>> the number

Re: Query pre-processing before execution?

2014-08-25 Thread Pawel

Hi Joerg,
You are right about analyzer. I have also a little different case or maybe
I missed something (and analyzer-way can also handle my case).

I'd like to process a query and add additional filter to each of queries.
To build this filter external service should be queried to fetch additional
data and use this data to build proper filter. Filter can be build for a
few fields: for example fileld1:foo AND field2:bar AND field3:test. Do you
have any suggestions?

--
Paweł

On Thu, Aug 21, 2014 at 10:31 AM, joergpra...@gmail.com <
joergpra...@gmail.com> wrote:

> I would rather use the analyzer/token filter machinery of Lucene for
> search/index extensions, plugging this into ES is a breeze.
>
> If you want field specific mangling, I would use the field mapper to
> create a new field type. There, you have read access to the whole
> (immutable) document source and you can pre-process the field input data in
> the given document context before indexing.
>
> Jörg
>
>
> On Wed, Aug 20, 2014 at 12:00 PM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
>> Hi,
>>
>> What is the best way to pre-process a query a bit before ES executes it?
>> (e.g. I want to shingle the query string a bit and expand/rewrite a query
>> before letting ES execute it)
>>
>> I can create a custom Rest Action and a new API endpoint, but I'd prefer
>> to hide custom query pre-processing behind the standard ES query API.
>>
>> Is there any way to do that?
>>
>> Thanks,
>> Otis
>> --
>> Elasticsearch Performance Monitoring * Log Management * Search Analytics
>> http://sematext.com/
>>
>>  --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/e99854af-65e1-42c0-9dab-e384c8c281e6%40googlegroups.com
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFzggLcxBa0axOcyZ2v5CxG8v9NmquFcmHp%2ByiS8Z0F4w%40mail.gmail.com
> 
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAF9ZkbMx70HZ2rOyU%3D%3DQh63j9wYVvjju%2BJMZsO1DQxKnZjN8Xw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: How to index Office files? .txt and .pdf are working...

2014-08-25 Thread David Pilato

>From my experience, this should work. Indexing Word docs should work as Tika 
>support office docs.

Not sure what you are doing wrong. Try to send a match all query and ask for 
field file.file.

Also, you could set mapper plugin to TRACE mode in logging.yml and see if it 
tells something interesting.

HTH

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

> Le 25 août 2014 à 09:05, Dirk Bauer  a écrit :
> 
> Hi,
> 
> using elasticsearch-1.3.2 with 
> 
> Plug-in
> -
> name: mapper-attachments
> version: 2.3.1
> description: Adds the attachment type allowing to parse difference attachment 
> formats
> jvm: true
> site: false
> 
> on Windows 8 for evaluation purpose.
> 
> JVM 
> -
> version: 1.7.0_67
> vm_name: Java HotSpot(TM) Client VM
> vm_version: 24.65-b04
> vm_vendor: Oracle Corporation
> 
> 
> I have created the following mapping:
> 
> {
> myIndex: {
> mappings: {
> dokument: {
> properties: {
> created: {
> type: date
> format: dateOptionalTime
> }
> description: {
> type: string
> }
> file: {
> type: attachment
> path: full
> fields: {
> file: {
> type: string
> store: true
> term_vector: with_positions_offsets
> }
> author: {
> type: string
> }
> title: {
> type: string
> }
> name: {
> type: string
> }
> date: {
> type: date
> format: dateOptionalTime
> }
> keywords: {
> type: string
> }
> content_type: {
> type: string
> }
> content_length: {
> type: integer
> }
> language: {
> type: string
> }
> }
> }
> id: {
> type: string
> }
> title: {
> type: string
> }
> }
> }
> }
> }
> }
> 
> Because I like to use ES from C#/.NET I have created a little C# app that 
> reads a file as base64 encodes stream from hard drive and put the document to 
> the index of ES. I'm working with this POST request:
> 
> {
>   "id": "8dbf1d73-44d1-4e20-aa35-13b18ddf5057",
>   "title": "Test",
>   "description": "Test Description",
>   "created": "2014-01-20T19:04:20.1019885+01:00",
>   "file": {
> "_content_type": "application/pdf",
> "_name": "Test.pdf",
> "content": "---my base64 stuff here---"
>   }
> }
> 
> and send it as index command to ES like this:
> 
> myIndex/dokument/8dbf1d73-44d1-4e20-aa35-13b18ddf5057?refresh=true
> 
> After that I query ES with this request:
> 
> {
>   "fields": [],
>   "query": {
> "match": {
>   "file": "test"
> }
>   },
>   "highlight": {
> "fields": {
>   "file": {}
> }
>   }
> }
> 
> If my input is a *.pdf or *.txt file everything works as expected. The 
> content of the document was recognized by the mapper-attachments plug-in and 
> the results with my string "test" that I'm looking for are highlighted.
> 
> I have searched for hours now to find a solution to do the same with 
> Microsoft Office documents but I'm not able to get it to work. ES does not 
> send any error message during adding the documents but I'm not able to find 
> the content of my office documents.
> Can anyone please help me an give me an sample how to index a *.doc, *.docx, 
> *.xls, *.xlsx etc.?
> 
> I have tried to give ES a hint for the content-type / mime type based on this 
> link http://filext.com/faq/office_mime_types.php but this makes no change. 
> 
> Thanks in advance!
> Dirk
> 
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/elasticsearch/2a2d0406-4177-431f-ba33-8766a1ce4a07%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/0E15D457-3CB3-4FAE-8559-B92008463C81%40pilato.fr.
For more options, visit https://groups.google.com/d/optout.

Re: Boost the first word in a multi-word query

2014-08-25 Thread Jérémy

Hum I didn't notice the change of behavior of the + sign. I prefer the how
"query string" handle that.

Is there a way to have a "must be present" operator for "simple string
query"?

Cheers,
Jeremy


On Mon, Aug 25, 2014 at 9:33 AM, Jérémy  wrote:

> Thank you so much for the warning, I was about to make that mistake ;-)
>
>
> On Mon, Aug 25, 2014 at 5:23 AM, vineeth mohan 
> wrote:
>
>> Hello Jeremy ,
>>
>> Just a word of caution.
>> Its not reccomonded to expose query_string to end user.
>> Rather another version of it used instead -
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html
>>
>> Thanks
>> Vineeth
>>
>>
>> On Mon, Aug 25, 2014 at 12:57 AM, Jérémy  wrote:
>>
>>> Thanks Vineeth, I can certainly build something with the query string :-)
>>>
>>>
>>> On Fri, Aug 22, 2014 at 8:50 PM, vineeth mohan <
>>> vm.vineethmo...@gmail.com> wrote:
>>>
 Hello Jeremy ,

 You can try query_string then.

 Query as "Brown^2 dog"


 http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-dsl-query-string-query

 Thanks
Vineeth


 On Sat, Aug 23, 2014 at 12:11 AM, Jérémy  wrote:

> Thanks for your answer!
>
> Unfortunately the phrase query is not enough, because I still want to
> keep words optional. In my understanding, the phrase query requires all 
> the
> words of the query to be present.
>
> Cheers,
> Jeremy
>
>
> On Fri, Aug 22, 2014 at 8:20 PM, vineeth mohan <
> vm.vineethmo...@gmail.com> wrote:
>
>> Hello Jeremy ,
>>
>> I feel what you are looking for is a phrase query . It takes into
>> consideration the order of words -
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#_phrase
>>
>> Thanks
>>   Vineeth
>>
>>
>> On Fri, Aug 22, 2014 at 3:28 PM, Jeremy  wrote:
>>
>>> In case of a multi-word query, is there a way to boost the first
>>> terms of the query?
>>>
>>> For example, in the following query:
>>> GET /my_index/my_type/_search
>>> {
>>> "query": {
>>> "match": {
>>> "title": "BROWN DOG!"
>>> }
>>> }
>>> }
>>>
>>> "Brown" should be prioritized over "dog", therefore searching for
>>> "brown dog" will not return the same scores as searching for "dog 
>>> brown".
>>> I'm ideally looking for a solution which work with N words and put
>>> weight accordingly the number of words.
>>>
>>> Regards,
>>> Jeremy
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it,
>>> send an email to elasticsearch+unsubscr...@googlegroups.com.
>>>
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/a53f5752-3da0-41de-b970-f84573b8f5a3%40googlegroups.com
>>> 
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>  --
>> You received this message because you are subscribed to a topic in
>> the Google Groups "elasticsearch" group.
>> To unsubscribe from this topic, visit
>> https://groups.google.com/d/topic/elasticsearch/ojEtydA4zAw/unsubscribe
>> .
>> To unsubscribe from this group and all its topics, send an email to
>> elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/CAGdPd5%3D51EiC_SmiDXD0k2Yj0YacnvXVzaqUOshdkD81HFpgsA%40mail.gmail.com
>> 
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>  --
> You received this message because you are subscribed to the Google
> Groups "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to elasticsearch+unsubscr...@googlegroups.com.
>  To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAGNSLEwjxRwLgfHAmNWxoGa0BX5ZSEtk6J0QFBvWCBcW8wX42Q%40mail.gmail.com
> 
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

  --
 You received this message because you are subscribed to a topic in the

Re: One large index vs. many smaller indexes

I meant tens of shards per node. So if you have N nodes with I indices
which have S shards and R replicas, that would be (I * S * (1 + R)) / N.

One shard per node is optimal but doesn't allows for growth: if you add one
more node, you cannot spread the indexing work load, that is why it is
common to have a few shards per node in order to allow elasticsearch to
spread the load in case you would introduce a new node in your cluster to
improve your cluster capacity.


On Mon, Aug 25, 2014 at 12:07 AM, Chris Neal 
wrote:

> Adrien,
>
> Thanks so much for the response.  It was very helpful.  I will check out
> those links on capacity planning for sure.
>
> One followup question.  You mention that tens of shards per node would be
> ok.  Are you meaning tens of shards from tens of indexes?  Or tens of
> shards for a single index?  Right now I have two servers configured with
> the index getting 2 shards (one per server), and 1 replica (per server).
>
> Chris
>
>
> On Fri, Aug 22, 2014 at 5:58 PM, Adrien Grand <
> adrien.gr...@elasticsearch.com> wrote:
>
>> Hi Chris,
>>
>> Usually, the problem is not that much in terms of indices but shards,
>> which are the physical units of data storage (an index being a logical view
>> over several shards).
>>
>> Something to beware of is that shards typically have some constant
>> overhead (disk space, file descriptors, memory usage) that does not depend
>> on the amount of data that they store. Although it would be ok to have up
>> to a few tens of shards per nodes, you should avoid to have eg. thousands
>> of shards per node.
>>
>> if you plan on always adding a filter for a specific application in your
>> search requests, then splitting by application makes sense since this will
>> make the filter useless at search time, you will just need to query the
>> application-specific index. On the other hand if you don't filter by
>> application, then splitting data by yourself into smaller indices would be
>> pretty equivalent to storing everything in a single index with a higher
>> number of shards.
>>
>> You might want to check out the following resources that talk about
>> capacity planning:
>>  - http://www.elasticsearch.org/videos/big-data-search-and-analytics/
>>  -
>> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/capacity-planning.html
>>
>>
>>
>> On Fri, Aug 22, 2014 at 9:08 PM, Chris Neal 
>> wrote:
>>
>>> Hi all,
>>>
>>> As the subject says, I'm wondering about index size vs. number of
>>> indexes.
>>>
>>> I'm indexing many application log files, currently with an index by day
>>> for all logs, which will make a very large index.  For just a few
>>> applications in Development, the index is 55GB a day (across 2 servers).
>>>  In prod with all applications, it will be "much more than that".  1TB a
>>> day maybe?
>>>
>>> I'm wondering if there is value in splitting the indexes by day and by
>>> application, which would produce more indexes per day, but they would be
>>> smaller, vs. value in having a single, mammoth index by day alone.
>>>
>>> Is it just a resource question?  If I have enough RAM/disk/CPU to
>>> support a "mammoth" index, then I'm fine?  Or are there other reasons to
>>> (or to not) split up indexes?
>>>
>>> Very much appreciate your time.
>>> Chris
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to elasticsearch+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/CAND3DphfsYx0LW0M-yvLWGauRSzVWG0etaBkiTrN7zVafq7tMA%40mail.gmail.com
>>> 
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>>
>> --
>> Adrien Grand
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j5i7AAnasMYZgR83aTXvELan%3DkR6OLvGYKfs9d5Subi4A%40mail.gmail.com
>> 
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAND3Dph9Z1My%2B2%2BQ-NM-sWNn2vT1qktDi6%2BmR-b9rFN-Xc-_pw%40ma

Re: DOS attack Elasticsearch with Mappings

Hi Joshua,

Was the issue tied to the byte size of the mappings or the fact that they
contained lots of fields? I'm asking because there was a performance
inefficiency in versions < 1.3.0 that caused every field introduction to
perform in quadratic time[1]. It probably doesn't solve your problem but
I'm wondering if it could be related.

[1] https://github.com/elasticsearch/elasticsearch/pull/6707



On Mon, Aug 25, 2014 at 5:40 AM, Joshua Montgomery 
wrote:

> So you can modify the dynamic mapping setting to be off or strict. But
> that means everything that goes into the cluster would have to have a
> review process which is very time consuming. The primary goal of our system
> is provide a general purpose backend for search that users can have access
> to immediately. If we had to turn off dynamic mapping, we couldn't offer
> this primary goal. Maybe a potential solution is to have a setting which
> limits the number of fields that could be indexed for a type?
>
>
> On Sunday, August 24, 2014 8:21:51 PM UTC-7, Nikolas Everett wrote:
>
>> If the cluster is that open to users I don't think it'd be easy to
>> prevent a malicious user from intentionally DOSing it. But in this case I
>> think you could make the default for all fields be non-dynamic. That way
>> users have to intentionally send all mapping updates. It'd prevent this
>> short of unintentional DOS.
>>
>> I think this is a setting that you can change and I think that it would
>> only effect new indexes but I admit to not having done it and going from a
>> vague memory of seeing a setting somewhere.
>>
>> Nik
>> On Aug 24, 2014 11:08 PM, "Joshua Montgomery"  wrote:
>>
>>> So an Elasticsearch clusters I help run had an interesting issue last
>>> week around mappings and I wanted to get the communities thoughts about how
>>> to handle it.
>>>
>>> *Issue:*
>>> Our cluster one morning went into utter chaos for no apparent reason. We
>>> had nodes dropping constantly (master and data type nodes) and lots of
>>> network exceptions in a our log files. The cluster kept going red from all
>>> the dropped nodes and the cluster was totally unresponsive to external
>>> commands.
>>>
>>> *Some Backgound:*
>>> Our cluster is fairly open to our users, meaning they can index what
>>> ever they want without needing approval (this may have to change based on
>>> what happened). The content stored is usually generated from .Net objects
>>> and serialized using the Netwonsoft json serializer.
>>>
>>> *Cause:*
>>> After 6hrs of investigation while trying to get our cluster stable, this
>>> is what we found:
>>>
>>> We had a new document type (around 30,000 documents) indexed into the
>>> cluster over a 1 hour window containing the .Net equivalent of a dictionary
>>> in json format. When a dictionary is serialized to json, it ends up with a
>>> json object containing a list of properties and values. The current
>>> behavior of Elasticsearch is to generate a mapping definition for each
>>> field name in a json object. So when you serialize a dictionary, it means
>>> every 'key' in the dictionary gets its own mapping definition. It turns out
>>> this can lead to nasty consequences when indexed in Elasticsearch...
>>>
>>> Essentially, every document contained its own list of unique keys which
>>> resulted in Elasticsearch generating mapping definitions for all the keys.
>>> We found this out by noticing that the json type with the dictionary
>>> continuously kept having is mappings updated (based on the master node log
>>> files). The continual updating of the mappings (which is part of the
>>> overall state file) caused the master nodes to lock up on the updates,
>>> effectively stopping all other cluster operations. The state file upon
>>> further investigation was over 70MB large by the time we ended up stopping
>>> the cluster. Stopping the cluster was the only way to stop updates to the
>>> mappings. The large mapping file we suspect was one of the major reasons
>>> for nodes dropping; connections would timeout during the large file copy
>>> (i'm assuming the state is passed around the nodes in the cluster).
>>>
>>> *Solution:*
>>> As previously mentioned we had to stop the cluster. We then had to make
>>> sure that all indexing operations were stopped. Upon restarting the cluster
>>> we deleted all documents of the poisonous document type (which took a
>>> while). This resulted is a much smaller state file and a stable cluster.
>>>
>>> *Prevention:*
>>> So this is my real question for the community, what is the correct
>>> action for preventing this in the future (or does it already exist). We
>>> could obviously start more closely reviewing what goes into our cluster,
>>> but should there be a feature in Elasticsearch to prevent this (assuming it
>>> doesn't already exist)? I'm assuming that there are a number of users who
>>> have clusters where they don't review everything that goes into their
>>> cluster. So would it make sense to have Elasticsearch provide som

Re: Curator and disable shard allocation

2014-08-25 Thread Klaus Kleber

Hi,

yes i found that article but as soon as i start the HDD-node, elasticsearch 
starts to reallocate shards to it and thus if you search sth. in Kibana it 
will also searched from the slow HDD´s which we want to avoid.

We want that the whole index lies on the SSD until we reallocate it using 
curator

Am Montag, 25. August 2014 09:42:16 UTC+2 schrieb Mark Walkom:
>
> You don't want that, take a look at 
> https://github.com/elasticsearch/curator/wiki/Routing-Allocation
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com 
> web: www.campaignmonitor.com
>
>
> On 25 August 2014 17:38, Klaus Kleber > 
> wrote:
>
>> Hey Guys,
>>
>> We have set up a ES-cluster on a very powerful machine, where 1 node have 
>> the SSD assigned and the other nodes the slow HDD´s.
>>
>> We want to reallocate the indices after 7 days to the HDD´s which is very 
>> easy using the curator-tool.
>>
>> Logstash is configured to send the indexed events to port 9200 on which 
>> the SSD-ES-Node is listen too.
>>
>> But we elasticsearch rebalances the index between the SSD and the HDD 
>> node, making the performance advantage of the SSD useless.
>>
>> So i´ve search and found the cluster.routing.allocation.enable:none 
>> parameter 
>> for elasticsearch.
>>
>> I´ve I now disable routing allocation for the indexes, will curator still 
>> work?
>>
>> Thanks
>>  
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/ec7ecac0-d524-404b-aad9-5048f6cc61a4%40googlegroups.com
>>  
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/364d5bfe-c00e-4a49-b682-34f03ead0bea%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Need some advice to build a log central.

2014-08-25 Thread Sang Dang

Hi Vineeth Mohan,
My log central will contain 2 type of log, one is for log to debug/monitor, 
other is for stats.
I have 2 ways to achieve it:

#1 , I use only ES, it's ok to log for debug/monitor (using kibana). 
To do stats, I will build some extra api (base on 
filter/facet/agregration...)

#2, I use ES as external data storage, and write data to ES use Apache Hive 
(https://github.com/elasticsearch/elasticsearch-hadoop#apache-hive)
this approach will help me alot in doing stats, but I don't know whether 
it's good for logging other info ( to debug/monitor purpose).

I really appreciate your help :)

Best Regards.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/20091177-e41d-45be-bd0a-c535d7c65871%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Error running ES DSL in hadoop mapreduce

Thanks Sona,

This stack trace indicates a bug in the cardinality aggregation. I just
opened an issue for it:
https://github.com/elasticsearch/elasticsearch/issues/7429

In order to help me understand/reproduce this bug, could you please provide
the mappings of your ExamRowKey and body_part fields? Also answers to the
questions below would help me understand better what is happening:
 - how reproducible is it? Ie. if you run this query 10 times, how many of
these queries will write such lines to the logs?
 - is it common that several queries will be executing at the same time on
your elasticsearch cluster?
 - are there other exception in your logs that happen approximately at the
same time?

Thanks!



On Mon, Aug 25, 2014 at 6:10 AM, Sona Samad  wrote:

> Hi Adrien,
>
> My elasticsearch version is :  elasticsearch-1.2.1
>
> The Maven dependency for hadoop:
>
> 
>   org.elasticsearch
>   elasticsearch-hadoop-mr
>   2.0.1
> 
>
>
> The full stack trace is given below:
>
> [2014-08-25 09:31:58,892][DEBUG][action.search.type   ] [Thane Ector]
> [mr][4], node[1ZbXSvkKQC-kDvgMXuC8iQ], [P], s[STARTED]: Failed to execute
> [org.elasticsearch.action.search.SearchRequest@6ed78f6d]
> org.elasticsearch.search.query.QueryPhaseExecutionException: [mr][4]:
> query[ConstantScore(cache(_type:logs))],from[0],size[50]: Query Failed
> [Failed to execute main query]
>
>  at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
>  at
> org.elasticsearch.search.SearchService.executeScan(SearchService.java:215)
>  at
> org.elasticsearch.search.action.SearchServiceTransportAction$19.call(SearchServiceTransportAction.java:444)
>  at
> org.elasticsearch.search.action.SearchServiceTransportAction$19.call(SearchServiceTransportAction.java:441)
>  at
> org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 97
>  at
> org.elasticsearch.common.util.BigArrays$IntArrayWrapper.set(BigArrays.java:185)
>  at
> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus$Hashset.values(HyperLogLogPlusPlus.java:499)
>  at
> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.upgradeToHll(HyperLogLogPlusPlus.java:307)
>  at
> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collectLcEncoded(HyperLogLogPlusPlus.java:245)
>  at
> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collectLc(HyperLogLogPlusPlus.java:239)
>  at
> org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collect(HyperLogLogPlusPlus.java:231)
>  at
> org.elasticsearch.search.aggregations.metrics.cardinality.CardinalityAggregator$DirectCollector.collect(CardinalityAggregator.java:204)
>  at
> org.elasticsearch.search.aggregations.metrics.cardinality.CardinalityAggregator.collect(CardinalityAggregator.java:118)
>  at
> org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectBucketNoCounts(BucketsAggregator.java:74)
>  at
> org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectExistingBucket(BucketsAggregator.java:63)
>  at
> org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.collect(GlobalOrdinalsStringTermsAggregator.java:98)
>  at
> org.elasticsearch.search.aggregations.AggregationPhase$AggregationsCollector.collect(AggregationPhase.java:157)
>  at
> org.elasticsearch.common.lucene.MultiCollector.collect(MultiCollector.java:60)
>  at
> org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:193)
>  at
> org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
>  at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
>  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
>  at
> org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:175)
>  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
>  at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:116)
>  ... 7 more
> [2014-08-25 09:31:58,894][DEBUG][action.search.type   ] [Thane Ector]
> All shards failed for phase: [init_scan]
>
> Thanks,
> Sona
>
>
> On Friday, August 22, 2014 5:07:33 PM UTC+5:30, Sona Samad wrote:
>
>> Hi,
>>
>> I was trying to run the below query from hadoop mapreduce:
>>
>> {
>>  "aggs": {
>> "group_by_body_part": {
>>   "terms": {
>> "field": "body_part",
>> "size": 5,
>> "order" : { "examcount" : "desc" }
>> },
>>   "aggs": {
>> "examcount": {
>>   "cardinality": {
>> "field": "ExamRowKey"
>>   }
>> }
>>   }
>> }
>>   }
>> }
>>
>> The query is re

Re: Curator and disable shard allocation