Re: Treat "Dot" as a normal character in query_string query

2014-07-17 Thread Curt Hu
Yeah, thanks.

The results are kind of strange to me.
What I am doing is just a simple query_string query against a specific
field; let's call it "domain".
If I run the query_string "www.google.com", I get about 10k results, which
looks good, since the "domain" field in those results is always
"www.google.com" (I have not checked all 10k results, just a random sample).
If I change it to a bogus string like "www.abcdefg.com", I get 0 hits, also
good. But if I search for something else like "www.loyal3.com", things get
strange: I get back about 40k results; in some of them the domain field is
"www.loyal3.com", but others are completely unrelated, like
"eatmywords365.com".

So, how does the query_string query treat the dot? Or is something else
wrong?
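
For what it's worth, this smells like an analysis issue rather than a
query_string quirk (an assumption, since the mapping wasn't posted): the
standard analyzer keeps "www.google.com" as one token, but it splits at a
dot that sits between a digit and a letter, so "www.loyal3.com" becomes
"www.loyal3" and "com", and the default OR operator then matches every
domain containing "com". You can check with the analyze API, and map the
field as not_analyzed (in a new index; "myindex"/"mytype" below are
placeholders) to make the dot an ordinary character:

curl 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'www.loyal3.com'
# if this returns the tokens "www.loyal3" and "com", that explains
# the unrelated hits

curl -XPUT 'localhost:9200/myindex' -d '{
  "mappings": { "mytype": { "properties": {
    "domain": { "type": "string", "index": "not_analyzed" }
  } } }
}'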





Re: What is the correct way to convert SearchHit.getSortValues/SortValues object to String

2014-07-17 Thread navneet.bits

Has anyone used the getSortValues()/sortValues() Java APIs before? The
return type of this API is an object array, and converting these objects to
strings doesn't always work.

Take this use case for example: if you have applied "filtered" sorting in
your query and some of the results don't satisfy the filter criteria, the
sorted value should be null. Instead of null, the getSortValues()/
sortValues() API returns some junk value.

Also, if I check the type of the object returned by this API, it is
org.elasticsearch.common.text.StringAndBytesText.

My question is: why does this API return an object array and not a String
array?
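
In the meantime, a defensive conversion works (a sketch; it assumes string
sort values implement org.elasticsearch.common.text.Text, which
StringAndBytesText does, and that hit is a SearchHit):

import org.elasticsearch.common.text.Text;

for (Object value : hit.getSortValues()) {
    final String s;
    if (value == null) {
        s = null;                     // no sort value for this hit
    } else if (value instanceof Text) {
        s = ((Text) value).string();  // string sort fields
    } else {
        s = value.toString();         // numeric and other types
    }
    System.out.println(s);
}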





What is the correct way to convert SearchHit.getSortValues/SortValues object to String

2014-07-17 Thread navneet.bits
The following code returns garbage for some values:

for (SearchHit result : output) {
    // result.getSortValues() returns an Object[]
    for (Object value : result.getSortValues()) {
        System.out.println(value.toString());
    }
}





Re: architecture and performance question on searching small subsets of documents

2014-07-17 Thread Nikolas Everett
Look at routing. It will help by limiting searches to the shard that holds
the user's data. Beyond that, you can generally trust the caching on
filters to make this kind of use case quick; at least that is what I've
seen on the mailing list.
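
A sketch of the idea (index, type and field names are made up, not from the
question below):

# index each document with the owner's id as the routing value
curl -XPUT 'localhost:9200/docs/doc/1?routing=user42' -d '
{ "owner_ids": ["user42"], "body": "some text" }'

# the search only hits the shard that routing value maps to; the filter
# still narrows to the user, and it gets cached
curl -XPOST 'localhost:9200/docs/doc/_search?routing=user42' -d '
{ "query": { "filtered": {
    "query":  { "match": { "body": "some text" } },
    "filter": { "term":  { "owner_ids": "user42" } }
} } }'

One caveat for the multi-owner case described below: a document has a single
routing value, so documents shared by several users would need to be indexed
once per owner for routed searches to find them.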
On Jul 15, 2014 9:27 AM, "Mike Topper"  wrote:

> Hello,
>
> I'm new to elasticsearch, so this might be a stupid question, but I'd love
> some input before I get started creating my elasticsearch cluster.
>
> Basically I will be indexing documents with a few fields (the documents
> are pretty small in size). There are ~90 million documents total.
>
> On the search side of things, each search will be limited to the small
> subset of documents that the user doing the search owns.
>
> My initial thought was to just have one large index for all documents,
> with a multi-value field holding the user ids of each user that owns a
> given document. Then, when searching across the index, I would apply a
> filter query to limit by that user id. My only concern is that this might
> mean slow query times, because you are always having to filter down by
> user id from a large data set to a very small subset (on average a user
> probably owns fewer than 1k documents).
>
> The other option I had is that I could create an index for each user and
> just index their documents into it, but this would duplicate a massive
> amount of data and just seems hacky.
>
> Any suggestions?
>
> Thanks,
> Mike
>



Re: Optimizing a query that matches a large number of documents

2014-07-17 Thread David K Smith
Hi Ivan,

Thanks for your response. A couple of questions:

1. Is it possible to use a field in a native script without loading it into
the cache?

2. I've considered using a rescore query and moving the script_score to the
rescore phase, leaving only the item_id term match in the query phase. With
a short window size that would lose a fair bit of accuracy, right? Are
there other ways of restructuring the query that I can take advantage of?

3. (unrelated to the query) What is the most performant way to map fields
that will be accessed from a native script for further calculations?
- analyzed but not stored
- not analyzed but stored
- not analyzed, stored as doc values

Thanks
David
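
For anyone following along, a native script along the lines Ivan suggested
is small. A sketch against the ES 1.x API; the class names and the
registration name here are made up, and the script simply surfaces the
pre-computed match_score field as the document score:

import java.util.Map;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.index.fielddata.ScriptDocValues;
import org.elasticsearch.script.AbstractDoubleSearchScript;
import org.elasticsearch.script.ExecutableScript;
import org.elasticsearch.script.NativeScriptFactory;

public class MatchScoreScript extends AbstractDoubleSearchScript {
    @Override
    public double runAsDouble() {
        // match_score is mapped as integer, so doc values come back as Longs
        return ((ScriptDocValues.Longs) doc().get("match_score")).getValue();
    }
}

public class MatchScoreScriptFactory implements NativeScriptFactory {
    @Override
    public ExecutableScript newScript(@Nullable Map<String, Object> params) {
        return new MatchScoreScript();
    }
}

// registered from a plugin, e.g.:
//   scriptModule.registerScript("match_score_native", MatchScoreScriptFactory.class);
// and then used in the query as:
//   "script_score": { "script": "match_score_native", "lang": "native" }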

> On Jul 14, 2014, at 1:14 PM, Ivan Brusic  wrote:
> 
> Since the script is executed against lots of matched documents, perhaps 
> converting it into a native Java script (not Javascript) would provide a 
> performance boost.
> 
> Note that using fields in scripts will force their values to be loaded into 
> the cache.
> 
> -- 
> Ivan
> 
> 
>> On Sun, Jul 13, 2014 at 8:54 AM, David Smith  wrote:
>> Hi, 
>> 
>> I'm trying to optimize this query which takes 5-10s to run. This query is 
>> run repeated for different (pretty much all) users via an offline 
>> process/script daily. The index it is run against has about 4 billion 
>> documents, each query matches approximately 500k documents in that index but 
>> I only need the top 25 results. 
>> 
>> Mapping:
>>
>> "_all":    { "enabled": false },
>> "_source": { "enabled": false },
>> "properties": {
>>     "user_id":     { "type": "long",    "store": "no" },
>>     "item_id":     { "type": "long",    "store": "no" },
>>     "match_score": { "type": "integer", "store": "no" },
>>     ... other fields not used by this query ...
>> }
>> 
>> Query:
>>
>> {
>>     "size": 25,
>>     "min_score": 50.0,
>>     "query": {
>>         "function_score": {
>>             "filter": {
>>                 "bool": {
>>                     "must": {
>>                         "term": { "item_id": 8342743, "_cache": false }
>>                     },
>>                     "must_not": {
>>                         "term": { "user_id": 10434531, "_cache": false }
>>                     }
>>                 }
>>             },
>>             "functions": [
>>                 { "script_score": { "script": "doc['match_score'].value" } }
>>             ],
>>             "score_mode": "sum",
>>             "boost_mode": "replace"
>>         }
>>     }
>> }
>> 
>> The reasoning for the current format is given below:
>>
>> - match_score is used as the document score. It's a pre-calculated field
>>   kept in the index. I'm using function_score here to get these benefits:
>>   I can return match_score as the document score; I can make use of the
>>   default sorting that happens on document score and thereby avoid needing
>>   to sort on this field; and I can make use of min_score to act as a range
>>   filter on this field. I did find this approach to be faster than adding
>>   a range filter on this field to the bool clause, then sorting by this
>>   field and then returning its value using fields: [ "match_score" ].
>> - Caching is turned off on item_id and user_id as this query is run
>>   repeatedly for different user and item combinations, but never for the
>>   same user/item within a given 24-hour period. I'm trying to avoid
>>   wasting precious cache space on data that won't be used for another day.
>> - Routing cannot be used here as this index is built using user_id as the
>>   routing value, but we're going across users in this particular query, as
>>   we're trying to find users who have this item_id as a match.
>>
>> How can I optimize the query (or perhaps the mapping, and re-index) so
>> that it runs much faster? It's a simple enough query that I should be able
>> to get it to run quite fast, under 100ms, without any problem. I assume
>> the reason it takes so long is that it matches way too many documents.
>> What's a good strategy for this?
>> 
>> David

Re: Treat "Dot" as a normal character in query_string query

2014-07-17 Thread vineeth mohan
Hello Curt ,

I believe you are chasing the wrong solution.
What you need is probably a change to the analyzer rather than to the
search query. Can you paste the output you are seeing?

Thanks
   Vineeth


On Fri, Jul 18, 2014 at 11:26 AM, Curt Hu  wrote:

> How can I treat the Dot '.' as the normal character in the query_string,
> as I
> want to search "www.google.com" as the whole string in the query_string,
> the
> current results for me are so strange..
>
>
>



Re: Python version for curator

2014-07-17 Thread Honza Král
Hi Brian,

you seem to have hit an issue we have had with curator; there are some
solutions and workarounds on the github issue:

https://github.com/elasticsearch/curator/issues/77

hope this helps,

Honza

On Thu, Jul 17, 2014 at 6:22 AM, Brian  wrote:
> No joy:
>
> $ pip install elasticsearch
> Requirement already satisfied (use --upgrade to upgrade): elasticsearch in
> /usr/lib/python2.6/site-packages
> Cleaning up...
>
> $ curator --help
> Traceback (most recent call last):
>   File "/usr/bin/curator", line 5, in <module>
>     from pkg_resources import load_entry_point
>   File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 2655, in <module>
>     working_set.require(__requires__)
>   File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 648, in require
>     needed = self.resolve(parse_requirements(requirements))
>   File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 546, in resolve
>     raise DistributionNotFound(req)
> pkg_resources.DistributionNotFound: elasticsearch>=1.0.0,<2.0.0
>
> $ uname -a
> Linux elktest 2.6.32-431.17.1.el6.x86_64 #1 SMP Wed May 7 23:32:49 UTC 2014
> x86_64 x86_64 x86_64 GNU/Linux
>
> Brian
>



Re: percolator throughput decreases as time passes

2014-07-17 Thread Seungjin Lee
Not really; the number of queries was the same throughout the process lifecycle.


2014-07-16 19:04 GMT+09:00 Martijn v Groningen <
martijn.v.gronin...@gmail.com>:

> Do the amount of registered percolate queries also increase?
>
>
> On 15 July 2014 12:02, Seungjin Lee  wrote:
>
>>
>> [attached chart: percolator throughput over time]
>> hi all,
>>
>> we use elasticsearch with storm, continuously making percolation request.
>>
>> as you see above, percolator throughput decreases as time passes.
>>
>> but we are not seeing any other problematic statistics, except that CPU
>> usage also decreases as throughput decreases.
>>
>> can you guess any reason for this? we are using es v1.1.0
>>
>> sincerely,
>>
>
>
>
> --
> Kind regards,
>
> Martijn van Groningen
>



Treat "Dot" as a normal character in query_string query

2014-07-17 Thread Curt Hu
How can I treat the dot '.' as a normal character in a query_string query?
I want to search for "www.google.com" as a whole string, but the current
results I get are very strange.





Re: Python package index in elasticsearch

2014-07-17 Thread Honza Král
Nice!

Have you looked at Warehouse (0)? It's a similar effort by the pypa
initiative, also using elasticsearch.

Honza

0 - https://github.com/pypa/warehouse

On Fri, Jul 18, 2014 at 6:58 AM, Maciej Dziardziel  wrote:
> Hi
>
> Being frustrated with speed and inflexibility of pip search, I played with
> elasticsearch and set  up my own index.
> Maybe someone will find it useful too.
>
> Site:http://pypisearch.linuxcoder.co.uk
>
> Code:  https://github.com/Fiedzia/pypisearch
>
> Full lucene syntax is allowed.
>
> Note: indexing is in progress, but over 60% of pypi packages are there,
> it should get to 100% within max few hours.
>
> Having the basics working, I hope to polish it soon.
>
> Maciej Dziardziel
> fied...@gmail.com
>



Re: too many open files

2014-07-17 Thread Andrew Selden

This is a fairly common problem and not necessarily specific to
Elasticsearch, and it is simple to solve: on Linux you can increase the
operating system's max file descriptor limit, and other Unix-like operating
systems have the same concept. You can find out how to do this for your
specific distribution with a little googling on "linux max file descriptor".

Cheers.
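
Beyond raising the limit, it is also worth checking that descriptors are not
being leaked: judging by the snippet quoted below, a new TransportClient may
be created per bolt, and each one holds sockets and selectors until close()
is called. A sketch of sharing one client per worker JVM (the class name is
illustrative, not from the original post):

import java.util.List;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public final class SharedClient {
    private static TransportClient client;

    public static synchronized TransportClient get(
            Settings settings, List<InetSocketTransportAddress> addresses) {
        if (client == null) {
            client = new TransportClient(settings);
            for (InetSocketTransportAddress address : addresses) {
                client.addTransportAddress(address);
            }
            // release the sockets/selectors when the JVM exits
            Runtime.getRuntime().addShutdownHook(new Thread() {
                @Override public void run() { client.close(); }
            });
        }
        return client;
    }
}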


On Jul 17, 2014, at 9:40 PM, Seungjin Lee  wrote:

> hello, I'm using elasticsearch with storm, Java TransportClient.
> 
> I have total 128 threads across machines which communicate with elasticsearch 
> cluster.
> 
> From time to time, error below occurs
> 
> 
> 
> org.elasticsearch.common.netty.channel.ChannelException: Failed to create a selector.
>   at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.openSelector(AbstractNioSelector.java:343)
>   at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.<init>(AbstractNioSelector.java:100)
>   at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.<init>(AbstractNioWorker.java:52)
>   at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.<init>(NioWorker.java:45)
>   at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.createWorker(NioWorkerPool.java:45)
>   at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.createWorker(NioWorkerPool.java:28)
>   at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorkerPool.newWorker(AbstractNioWorkerPool.java:143)
>   at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorkerPool.init(AbstractNioWorkerPool.java:81)
>   at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.<init>(NioWorkerPool.java:39)
>   at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.<init>(NioWorkerPool.java:33)
>   at org.elasticsearch.transport.netty.NettyTransport.doStart(NettyTransport.java:254)
>   at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
>   at org.elasticsearch.transport.TransportService.doStart(TransportService.java:92)
>   at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
>   at org.elasticsearch.client.transport.TransportClient.<init>(TransportClient.java:189)
>   at org.elasticsearch.client.transport.TransportClient.<init>(TransportClient.java:125)
>   at com.naver.labs.nelo2.notifier.utils.ElasticSearchUtil.prepareElasticSearch(ElasticSearchUtil.java:30)
>   at com.naver.labs.nelo2.notifier.bolt.PercolatorBolt.prepare(PercolatorBolt.java:48)
>   at backtype.storm.topology.BasicBoltExecutor.prepare(BasicBoltExecutor.java:43)
>   at backtype.storm.daemon.executor$fn__5641$fn__5653.invoke(executor.clj:690)
>   at backtype.storm.util$async_loop$fn__457.invoke(util.clj:429)
>   at clojure.lang.AFn.run(AFn.java:24)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: too many open files
>   at sun.nio.ch.IOUtil.makePipe(Native Method)
>   at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:65)
>   at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
>   at java.nio.channels.Selector.open(Selector.java:227)
>   at org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.open(SelectorUtil.java:63)
>   at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.openSelector(AbstractNioSelector.java:341)
>   ... 22 more
> 
> 
> The code itself is very simple.
>
> I get the client as follows:
>
> Settings settings = ImmutableSettings.settingsBuilder()
>         .put("cluster.name", clusterName)
>         .put("client.transport.sniff", "true")
>         .build();
> List<InetSocketTransportAddress> transportAddressList =
>         new ArrayList<InetSocketTransportAddress>();
> for (String host : ESHost) {
>     transportAddressList.add(new InetSocketTransportAddress(host, ESPort));
> }
> return new TransportClient(settings)
>         .addTransportAddresses(transportAddressList.toArray(
>                 new InetSocketTransportAddress[transportAddressList.size()]));
>
> and for each execution, it percolates as follows:
>
> return client.preparePercolate()
>         .setIndices(indexName)
>         .setDocumentType(projectName)
>         .setPercolateDoc(docBuilder().setDoc(log))
>         .setRouting(projectName)
>         .setPercolateFilter(
>                 FilterBuilders.termFilter("projects", projectName).cache(true))
>         .execute().actionGet();
>
> The ES cluster consists of 5 machines with almost default settings.
> 
> What can be the cause of this problem?
> 

too many open files

2014-07-17 Thread Seungjin Lee
Hello, I'm using elasticsearch with storm, via the Java TransportClient.

I have a total of 128 threads across machines which communicate with the
elasticsearch cluster.

From time to time, the error below occurs:



org.elasticsearch.common.netty.channel.ChannelException: Failed to create a selector.
  at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.openSelector(AbstractNioSelector.java:343)
  at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.<init>(AbstractNioSelector.java:100)
  at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.<init>(AbstractNioWorker.java:52)
  at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.<init>(NioWorker.java:45)
  at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.createWorker(NioWorkerPool.java:45)
  at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.createWorker(NioWorkerPool.java:28)
  at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorkerPool.newWorker(AbstractNioWorkerPool.java:143)
  at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorkerPool.init(AbstractNioWorkerPool.java:81)
  at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.<init>(NioWorkerPool.java:39)
  at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.<init>(NioWorkerPool.java:33)
  at org.elasticsearch.transport.netty.NettyTransport.doStart(NettyTransport.java:254)
  at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
  at org.elasticsearch.transport.TransportService.doStart(TransportService.java:92)
  at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
  at org.elasticsearch.client.transport.TransportClient.<init>(TransportClient.java:189)
  at org.elasticsearch.client.transport.TransportClient.<init>(TransportClient.java:125)
  at com.naver.labs.nelo2.notifier.utils.ElasticSearchUtil.prepareElasticSearch(ElasticSearchUtil.java:30)
  at com.naver.labs.nelo2.notifier.bolt.PercolatorBolt.prepare(PercolatorBolt.java:48)
  at backtype.storm.topology.BasicBoltExecutor.prepare(BasicBoltExecutor.java:43)
  at backtype.storm.daemon.executor$fn__5641$fn__5653.invoke(executor.clj:690)
  at backtype.storm.util$async_loop$fn__457.invoke(util.clj:429)
  at clojure.lang.AFn.run(AFn.java:24)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: too many open files
  at sun.nio.ch.IOUtil.makePipe(Native Method)
  at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:65)
  at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
  at java.nio.channels.Selector.open(Selector.java:227)
  at org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.open(SelectorUtil.java:63)
  at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.openSelector(AbstractNioSelector.java:341)
  ... 22 more


The code itself is very simple.

I get the client as follows:

Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", clusterName)
        .put("client.transport.sniff", "true")
        .build();
List<InetSocketTransportAddress> transportAddressList =
        new ArrayList<InetSocketTransportAddress>();
for (String host : ESHost) {
    transportAddressList.add(new InetSocketTransportAddress(host, ESPort));
}
return new TransportClient(settings)
        .addTransportAddresses(transportAddressList.toArray(
                new InetSocketTransportAddress[transportAddressList.size()]));

and for each execution, it percolates as follows:

return client.preparePercolate()
        .setIndices(indexName)
        .setDocumentType(projectName)
        .setPercolateDoc(docBuilder().setDoc(log))
        .setRouting(projectName)
        .setPercolateFilter(
                FilterBuilders.termFilter("projects", projectName).cache(true))
        .execute().actionGet();

The ES cluster consists of 5 machines with almost default settings.

What can be the cause of this problem?



Re: Garbage collection pauses causing cluster to get unresponsive

2014-07-17 Thread Srinath C
Hi Michael,
   Did you get a chance to look at the hot_threads and iostat output?
   I also tried EBS Provisioned SSD with 4000 IOPS, and with that I was able
to ingest only around 30K per second, after which there are
EsRejectedExecutionExceptions. There were 4 elasticsearch instances of type
c3.2xlarge. CPU utilization was around 650% (out of 800). The iostat output
on the instances looks like this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   1.660.000.140.150.04   98.01

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvdep17.8636.95   266.05 3923782825424
xvdf  0.03 0.20 0.00   2146  8
xvdg  0.03 0.21 0.07   2178736
*xvdj 52.53 0.33  2693.62   3506   28605624*


On an instance store SSD I can go up to 48K per second, with occasional
occurrences of EsRejectedExecutionException. Do you think I should try
storage-optimized instances like i2.xlarge or i2.2xlarge to handle this
kind of load?

Regards,
Srinath.
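
(A client-side option when bursts like these cannot be avoided is to
throttle the bulk submissions themselves. A sketch with the ES 1.x
BulkProcessor, assuming an existing Client named client:

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;

BulkProcessor bulkProcessor = BulkProcessor.builder(client,
        new BulkProcessor.Listener() {
            @Override public void beforeBulk(long id, BulkRequest request) {}
            @Override public void afterBulk(long id, BulkRequest request,
                                            BulkResponse response) {}
            @Override public void afterBulk(long id, BulkRequest request,
                                            Throwable failure) {
                // EsRejectedExecutionException surfaces here; back off and retry
            }
        })
        .setBulkActions(5000)        // documents per bulk request
        .setConcurrentRequests(1)    // in-flight bulks; raise carefully
        .build();

Lowering the number of concurrent requests trades ingest rate for fewer
rejections.)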






On Wed, Jul 16, 2014 at 5:57 PM, Srinath C  wrote:

> Hi Michael,
>You were right. Its the IO that was the bottleneck. The data was being
> written into a standard EBS device - no provisioned IOPS.
>
>After redirecting data into the local instance store SSD storage, I was
> able to get to a rate of around 50-55K without any EsRejectExceptions. The
> CPU utilization too is not too high - around 200 - 400%. I have attached
> the hot_threads output with this email. After running for around 1.5 hrs I
> could see a lot of EsRejectedExecutionException for certain periods of time.
>
> std_ebs_all_fine.txt - when using standard EBS. Around 25K docs per
> second. No EsRejectedExecutionExceptions.
> std_ebs_bulk_rejects.txt - when using standard EBS. Around 28K docs per
> second. No EsRejectedExecutionExceptions.
>
> instance_ssd_40K.txt - when using instance store SSD. Around 40K docs per
> second. No EsRejectedExecutionExceptions.
> instance_ssd_60K_few_rejects.txt - when using instance store SSD. Around
> 60K docs per second. Some  EsRejectedExecutionExceptions were seen.
> instance_ssd_60K_lot_of_rejects.txt - when using instance store SSD.
> Around 60K docs per second. A lot of  EsRejectedExecutionExceptions were
> seen.
>
>Also attaching the iostat output for these instances.
>
> Regards,
> Srinath.
>
>
>
>
> On Wed, Jul 16, 2014 at 3:34 PM, joergpra...@gmail.com <
> joergpra...@gmail.com> wrote:
>
>> Adding to this recommendations, I would suggest running iostat tool to
>> monitor for any suspicious "%iowait" states while
>> ESRejectedExecutionExceptions do arise.
>>
>> Jörg
>>
>>
>> On Wed, Jul 16, 2014 at 11:53 AM, Michael McCandless <
>> m...@elasticsearch.com> wrote:
>>
>>> Where is the index stored in your EC2 instances?  It's it just an EBS
>>> attached storage (magnetic or SSDs?  provisioned IOPs or the default).
>>>
>>> Maybe try putting the index on the SSD instance storage instead?  I
>>> realize this is not a long term solution (limited storage, and it's cleared
>>> on reboot), but it would be a simple test to see if the IO limitations of
>>> EBS is the bottleneck here.
>>>
>>> Can you capture the hot threads output when you're at 200% CPU after
>>> indexing for a while?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Wed, Jul 16, 2014 at 3:03 AM, Srinath C  wrote:
>>>
 Hi Joe/Michael,
I tried all your suggestions and found a remarkable difference in
 the way elasticsearch is able to handle the bulk indexing.
Right now, I'm able to ingest at the rate of 25K per second with the
 same setup. But occasionally there are still some
 EsRejectedExecutionException being raised. The CPUUtilization on the
 elasticsearch nodes is so low (around 200% on an 8 core system) that it
 seems that something else is wrong. I have also tried to increase
 queue_size but it just delays the EsRejectedExecutionException.

 Any more suggestions on how to handle this?

 *Current setup*: 4 c3.2xlarge instances of ES 1.2.2.
 *Current Configurations*:
 index.codec.bloom.load: false
 index.compound_format: false
 index.compound_on_flush: false
 index.merge.policy.max_merge_at_once: 4
 index.merge.policy.max_merge_at_once_explicit: 4
 index.merge.policy.max_merged_segment: 1gb
 index.merge.policy.segments_per_tier: 4
 index.merge.policy.type: tiered
 index.merge.scheduler.max_thread_count: 4
 index.merge.scheduler.type: concurrent
 index.refresh_interval: 10s
 index.translog.flush_threshold_ops: 5
 index.translog.interval: 10s
 index.warmer.enabled: false
 indices.memory.index_buffer_size: 50%
 indices.store.throttle.type: none





 On Tue, Jul 15, 2014 at 6:24 PM, Srinath C  wrote:

> Thanks Joe, Michael and all.

Re: Cluster interface

2014-07-17 Thread Mark Walkom
ES needs direct access to the interface for the instance, so NAT won't work.
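
If the nodes have multiple interfaces, you can pin both binding and
publishing in elasticsearch.yml; a sketch (the interface name and addresses
are examples):

network.host: _eth1:ipv4_
# or, separately:
# network.bind_host: _eth1:ipv4_
# network.publish_host: 192.168.122.10

# with NAT in play, multicast discovery tends to fail; unicast is safer:
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["192.168.122.10", "192.168.122.11"]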

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com


On 18 July 2014 03:39,  wrote:

> I've set up three kvm guests with elasticsearch in a cluster. iptables
> kills the cluster, but I'd like to maintain some security, so I've set up
> the main host with eth0 bridged so I can get to it, and eth1 on the
> default NAT interface. The other two hosts use eth1 on the same NAT
> interface. The idea is that I can (hopefully) use iptables on eth0 for the
> main host, and let elasticsearch cluster on eth1 (from main) and the other
> two on eth0 without iptables. It's not working as I'd expect it to. Is
> there a way to configure clustering on specific interfaces?
>
> Thanks,
> Avery
>



Re: Kibana with (non-basic) User Authentication

2014-07-17 Thread Mark Walkom
There are a few such wrappers around that community members have written.
Have a search through the archives here and you may get some ideas and even
code to leverage.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com


On 18 July 2014 04:46, Stephan Buys  wrote:

> Hi all, I'm exploring options to enable user management and authentication
> for Kibana. The idea is to have a nice looking (not just basic auth)
> authentication screen for users of a web-based monitoring solution that we
> are developing (powered by Elasticsearch). We're not a big group, so I'm
> looking for lightweight solutions, and trying to stay as close to the
> standard packages.
>
> I've come up with the following possible solutions, but would greatly
> appreciate any pointers.
>
> 1) Add authentication to a thin Django or Flask project (throw some
> bootstrap into the mix) and use it as a reverse proxy for urls. Should give
> me session management.
> 2) Use something like CAS (http://www.jasig.org/cas)
> 3) Use nginx with Basic Auth :(
>
> Cheers,
> Stephan
>



Re: Bloom filter codec?

2014-07-17 Thread Adrien Grand
On Thu, Jul 17, 2014 at 10:37 PM, Nikolas Everett  wrote:

> Thanks for replying.  I've been looking to reduce my IO.  Pushing
> everything into an all field is really going to be the biggest thing, I
> think, but I was wondering about the bloom filters.  It doesn't sound worth
> it.  It feels like everything but the default codec is pretty unlikely to
> be useful?
>

Indeed, the default codec tries to make sensible trade-offs and would be
the most useful in most cases.

-- 
Adrien Grand



Python package index in elasticsearch

2014-07-17 Thread Maciej Dziardziel
Hi

Being frustrated with the speed and inflexibility of pip search, I played
with elasticsearch and set up my own index.
Maybe someone will find it useful too.

Site:http://pypisearch.linuxcoder.co.uk

Code:  https://github.com/Fiedzia/pypisearch

Full lucene syntax is allowed.

Note: indexing is still in progress, but over 60% of pypi packages are
there; it should reach 100% within a few hours at most.

Having the basics working, I hope to polish it soon.

Maciej Dziardziel
fied...@gmail.com



How to figure out field type?

2014-07-17 Thread Adrian
I've added some data to my ES.

JSON format:

{
"doc":{
  "site" : "marriage.com",
  "name" : "amount-active-users",

  "daily" : {
"dataX": [1,2,3],
"dataY": [1388538061, 1388624461, 1388710861],
"startDate":1388538061,
"endDate":1388710861
  }

}
}

If you look at the dataX field, it's an array. ES interprets it as an array
of longs.

Now when I add another JSON doc with dataX containing doubles, I'm not sure
how, on the Java side, to know whether the input comes back as
ScriptDocValues.Doubles or ScriptDocValues.Longs.
I would like the data to be interpreted as doubles all the time.

This is the terrible code I use to cast, but it's not working after adding
the doubles to the dataX field:

List<Double> dataXTimeSeries2Long =
        ((ScriptDocValues.Doubles) doc().get(rootPathDataX)).getValues();
List<Double> dataXTimeSeries2 = new ArrayList<Double>();
// reverse the time series so that it matches the order of the series
// retrieved via client() - yes, those are reversed
for (int i = dataXTimeSeries2Long.size() - 1; i > -1; i--) {
    // using toString is slow
    dataXTimeSeries2.add(
            Double.parseDouble(dataXTimeSeries2Long.get(i).toString()));
}

This code fails with a class cast exception:
ClassCastException[org.elasticsearch.index.fielddata.ScriptDocValues$Longs
cannot be cast to
org.elasticsearch.index.fielddata.ScriptDocValues$Doubles]
I can do this in Scala; Java is behind.

Help!
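
One way out (an assumption about the root cause): dynamic mapping sees
[1,2,3] first and maps dataX as long, so the script is handed
ScriptDocValues.Longs for those segments. Mapping dataX explicitly as
double up front makes the script side uniform; a sketch, with "myindex" and
"mytype" as placeholder names:

curl -XPUT 'localhost:9200/myindex' -d '{
  "mappings": { "mytype": { "properties": {
    "daily": { "properties": {
      "dataX": { "type": "double" },
      "dataY": { "type": "long" }
    } }
  } } }
}'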



Question regarding Shard Distribution while adding a replica to cluster

2014-07-17 Thread Rahul Sharma
Hi,

My ES Version -- 0.90.2

1) I used to have a 3-node ES cluster with 5-shard indices. I had around
1000 shards (all primary, as it was running with 0 replicas).
2) Then I changed the replica count to 1 and added 3 more nodes, hoping the
shards would get evenly balanced, each with a replica.

However, what I found is that all new indices are created with replicas (so
10 shards per index in total), but the old indices stayed as-is. They did
not get replicated.

So my new nodes are fairly empty, whereas the shards on the old nodes keep
growing as new indices are created.


*How can I re-balance the cluster so that my shards are evenly balanced and
the old indices also get replicated?*

I am also facing another problem: while I index new data, the cluster
suddenly drops a few shards to unassigned and goes red. It picks up and
reassigns them on its own, but this keeps happening pretty often. Not sure
what could be the cause. Would appreciate your guidance here.

Thanks
Rahul
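
In case it helps others who land here: a replica count set in the config
file only applies to indices created afterwards; existing indices have to
be updated through the settings API, for example:

curl -XPUT 'localhost:9200/_settings' -d '
{ "index": { "number_of_replicas": 1 } }'

This applies to all indices at once; use a specific index name in place of
_settings' prefix to update just one.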



Re: Bloom filter codec?

2014-07-17 Thread Nikolas Everett
Thanks for replying.  I've been looking to reduce my IO.  Pushing
everything into an all field is really going to be the biggest thing, I
think, but I was wondering about the bloom filters.  It doesn't sound worth
it.  It feels like everything but the default codec is pretty unlikely to
be useful?


On Thu, Jul 17, 2014 at 4:31 PM, Adrien Grand <
adrien.gr...@elasticsearch.com> wrote:

> Hi Nik,
>
> The trade-off is not easy indeed. First, the default terms dictionary can
> already save some disk seeks. By storing the prefixes of the terms that are
> in the terms dictionary in a FST in memory, it can avoid going to disk when
> the term that you are looking up cannot match this FST. A bloom filter
> might save a few additional disk seeks but as you said, it's pretty
> intensive memory-wise and sometimes that is memory that would just be
> better spent on the filesystem cache.
>
>
>
> On Thu, Jul 17, 2014 at 4:25 PM, Nikolas Everett 
> wrote:
>
>> Has anyone had success adding a bloom filter to the codec for any of
>> their fields?
>>
>>
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings
>>
>> I imagine it'd help reduce IO from (non multi-term) queries that
>> frequently don't match.  Like if you have a field that is very specific and
>> useful for searching but very rarely matches anything.
>>
>> It looks like the cost is in the range of 10 bits of heap per term per
>> segment for a false positive probability around 1%.  Meaning it'd be pretty
>> high if the index had lots of terms - especially if they were in many
>> segments.  But it'd be about 10 bits per value if the values were mostly
>> unique.
>>
>> Nik
>>
>
>
>
> --
> Adrien Grand
>



Re: Bloom filter codec?

2014-07-17 Thread Adrien Grand
Hi Nik,

The trade-off is not easy indeed. First, the default terms dictionary can
already save some disk seeks. By storing the prefixes of the terms that are
in the terms dictionary in a FST in memory, it can avoid going to disk when
the term that you are looking up cannot match this FST. A bloom filter
might save a few additional disk seeks but as you said, it's pretty
intensive memory-wise and sometimes that is memory that would just be
better spent on the filesystem cache.
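
For anyone who still wants to experiment, the bloom postings format is
enabled per field in the mapping (a sketch against the 1.x syntax from the
page linked below; the field name is an example):

"properties": {
    "uuid": {
        "type": "string",
        "index": "not_analyzed",
        "postings_format": "bloom_default"
    }
}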



On Thu, Jul 17, 2014 at 4:25 PM, Nikolas Everett  wrote:

> Has anyone had success adding a bloom filter to the codec for any of their
> fields?
>
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings
>
> I imagine it'd help reduce IO from (non multi-term) queries that
> frequently don't match.  Like if you have a field that is very specific and
> useful for searching but very rarely matches anything.
>
> It looks like the cost is in the range of 10 bits of heap per term per
> segment for a false positive probability around 1%.  Meaning it'd be pretty
> high if the index had lots of terms - especially if they were in many
> segments.  But it'd be about 10 bits per value if the values were mostly
> unique.
>
> Nik
>



-- 
Adrien Grand



Kibana with (non-basic) User Authentication

2014-07-17 Thread Stephan Buys
Hi all, I'm exploring options to enable user management and authentication 
for Kibana. The idea is to have a nice looking (not just basic auth) 
authentication screen for users of a web-based monitoring solution that we 
are developing (powered by Elasticsearch). We're not a big group, so I'm 
looking for lightweight solutions, and trying to stay as close to the 
standard packages.

I've come up with the following possible solutions, but would greatly 
appreciate any pointers.

1) Add authentication to a thin Django or Flask project (throw some 
bootstrap into the mix) and use it as a reverse proxy for urls. Should give 
me session management.
2) Use something like CAS (http://www.jasig.org/cas)
3) Use nginx with Basic Auth :(

Cheers,
Stephan



Re: Need to import lots of lucene libraries to run example code

2014-07-17 Thread joergpra...@gmail.com
Use the Maven dependencies of the ES jar in the Maven repo to let the IDE
build and run your code.

If you want to run your code, you have to include all jars under the "lib"
folder in ES_HOME into your classpath. Maven knows about these dependencies
automatically.

Jörg
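
For example, this single Maven dependency (version matching the jar
mentioned below) pulls in the Lucene jars transitively:

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>1.2.2</version>
</dependency>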


On Thu, Jul 17, 2014 at 6:52 PM, Morris Chang  wrote:

> Hi,
>
> I am using java to learn elasticsearch API. I tried to start from the
> index api with the example code on website Resource page:
>
> Node node = nodeBuilder().node();
> Client client = node.client();
>
>
> IndexResponse response = client.prepareIndex("twitter", "tweet",
> "1")
> .setSource(jsonBuilder()
> .startObject()
> .field("user", "kimchy")
> .field("postDate", new Date())
> .field("message", "trying out
> Elasticsearch")
> .endObject()
>   )
> .execute()
> .actionGet();
>
>
> My problem is that importing only the elasticsearch-1.2.2.jar is not
> enough, and I need to go download .jar files like
>
> org/apache/lucene/search/vectorhighlight/SimpleFieldFragList
> org/apache/lucene/index/memory/MemoryIndex.
>
> to solve the java.lang.NoClassDefFoundError.
>
> Is there a better efficient way to import all these required class files?
> (instead of google all the jar files and add into eclipse libraries)
>
>
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/c688d658-0211-44f8-87f2-06ec9e4fe619%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>



Re: ElasticSearch Hadoop

2014-07-17 Thread Costin Leau

On 7/17/14 8:38 PM, James Cook wrote:

I've read through much of the documentation for es-hadoop, but I might be 
coming away with some misunderstandings.

The setup docs for elasticsearch for apache hadoop (es-hadoop) uses the word 
/interact/ which is a bit vague.

Elasticsearch for Apache Hadoop is an open-source, stand-alone, 
self-contained, small library that allows Hadoop
jobs (whether using Map/Reduce or libraries built upon it such as Hive, Pig 
or Cascading) to interact with
Elasticsearch. Data flows bi-directionaly so that applications can leverage 
transparently the Elasticsearch engine
capabilities to significantly enrich their capabilities and increase the 
performance.


So, does this mean I have a separate Hadoop instance (potentially built upon 
HDFS or AWS EMR) and I can query data using
either the elasticsearch (REST/Java/etc) or hadoop (Hive, Pig, Cascading) 
environments?



I'm not sure I understand your question. es-hadoop allows Hadoop jobs
(whether they are written in Cascading/Hive/Pig/MR) to read from and write
to (that is, "interact" with) Elasticsearch easily. es-hadoop provides
native APIs to the aforementioned libraries and underneath takes care of
the boilerplate work (conversion to/from JSON, communicating with ES,
handling failures), plus adds some optimizations such as using Hadoop
multi-node/parallel tasks.
Note that it is entirely possible to do all of this yourself (if you don't
want to, or cannot, use it).

Cheers,
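
As a concrete illustration of those native APIs, the Hive integration looks
roughly like this (a sketch; the table, index and type names are made up):

CREATE EXTERNAL TABLE articles (id BIGINT, title STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'articles/article');

-- reads and writes on this table flow to/from the ES index
SELECT title FROM articles WHERE id = 1;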






--
Costin



Re: Trading index performance for search performance

2014-07-17 Thread jnortey
Thanks to both of you for the advice. Unfortunately, setting daily indexing
times isn't an option for us; however, I think I have a good idea of what
we should try next.
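
(For the archive: the operation referred to below as "the optimize query"
is the optimize API; on an index that is no longer being written to it can
be run down to a single segment, e.g.:

curl -XPOST 'localhost:9200/myindex/_optimize?max_num_segments=1'

Running it against an index that is still receiving writes is expensive and
generally discouraged.)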

On Thursday, July 17, 2014 10:56:31 AM UTC-5, jnortey wrote:
>
> At the moment, we're able to bulk index data at a rate faster than we
> actually need. Indexing is not as important to us as being able to quickly
> search for data. Once we start reaching ~30 million documents indexed, we
> start to see performance decreasing in ours search queries. What are the 
> best techniques for sacrificing indexing time in order to improve search 
> performance?
>
>
> A bit more info:
>
> - We have the resources to improve our hardware (memory, CPU, etc) but 
> we'd like to maximize the improvements that can be made programmatically or 
> using properties before going for hardware increases.
>
> - Our searches make very heavy uses of faceting and aggregations.
>
> - When we run the optimize query, we see *significant* improvements in 
> our search times (between 50% and 80% improvements), but as documented, 
> this is usually a pretty expensive operation. Is there a way to sacrifice 
> indexing time in order to have Elasticsearch index the data more 
> efficiently? (I guess sort of mimicking the optimization behavior at index 
> time)
>



Re: counting unique root objects on nested aggregations

2014-07-17 Thread Kallin Nagelberg
I realized this could be simplified by simply leaving out the 'value_count'
aggregation within the reverse_nested, as that information is already
provided by the included 'doc_count'. I guess it can't be simplified much
beyond this.

Would it be worth including this information by default when doing a nested
agg (the doc_count on the reverse_nested)? It seems pretty useful, but I'm
not sure about the performance implications of always doing it. An option
on nested aggs to return the doc_count of the parent would be a
nice-to-have for sure!
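
For the archive, the agg shape described above looks roughly like this (the
doc_count of the reverse_nested bucket is the number of root products per
reseller name):

"aggs": { "resellers": {
    "nested": { "path": "resellers" },
    "aggs": { "names": {
        "terms": { "field": "resellers.name" },
        "aggs": { "products": { "reverse_nested": {} } }
    } }
} }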


On Thu, Jul 17, 2014 at 11:46 AM, Kallin Nagelberg <
kallin.nagelb...@gmail.com> wrote:

> I'm trying to build a query to aggregate on some fields in a nested
> document, but instead of returning the count of the nested documents for
> each aggregation, I'd like to know the number of root objects.
>
> IE.,
>
> I have a mapping like (from the docs):
>
> "product" : {
> "properties" : {
> "resellers" : {
> "type" : "nested"
> "properties" : {
> "name" : { "type" : "string" },
> "price" : { "type" : "double" }
> }
> }
> }
> }
>
>
> Now let's say I want to know how many products have each reseller name.
> That's not straight forward as far as I can tell.
>
> If I do an agg like:
>
> "aggs": {
> "resellers": {
>   "nested": {
> "path": "resellers"
>   },
>   "aggs": {
> "names": {
>   "terms": {
> "field": "resellers.name"
>  ...
>
>
> I'll get back something like ( assuming that each product has many
> resellers):
>
> hits: 20,
> aggregations: {
>   resellers: {
> doc_count: 100,
> names : {
>buckets: [
> {
> key:  'name1'
> doc_count: 50
> },
> {
> key: 'name2',
> doc_count: 50
> }
> ]
> }}}
>
>
> So, its aggregating on the nested objects, ie there are 100 reseller
> nested docs, and in them 50 have name1, 50 have name2.
>
> What I'm interested in though is how many products have resellers with
> name 1 and name 2.
>
> IE, it should say something like,
>
> - 15 products have a reseller w/ name name1
> - 10 products have a reseller w/ name name2
>
> It looks like I can do this by putting a reverse nested aggregation below
> my names agg, then do a value_count aggregation on the ID of the root
> object. This seems kind of round about and I wonder if I'm missing an
> easier way. Any suggestions would be appreciated !
>
> Thanks,
> -Kal
>
>
>
>
>
>
>
>
>
>
>  --
> You received this message because you are subscribed to a topic in the
> Google Groups "elasticsearch" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/elasticsearch/5f9iHPo5-Ps/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/1e895c15-accc-4638-b19c-ec1d263ca53b%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAC7UURF8%3D_1-J4LyCADUOnOQ82Bc5dt_%3DYTAmAYV9isXXt6UJg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Cluster interface

2014-07-17 Thread avery . rozar
I've set up three KVM guests with Elasticsearch in a cluster. iptables kills 
the cluster, but I'd like to maintain some security, so I've set up the 
main host with eth0 bridged so I can get to it, and eth1 on the default NAT 
interface. The other two hosts use eth1 on the same NAT interface. The idea 
is that I can (hopefully) use iptables on eth0 for the main host, and let 
Elasticsearch cluster on eth1 (from main) and the other two on eth0 without 
iptables. It's not working as I'd expect it to. Is there a way to cluster on 
specific interfaces?

Thanks,
Avery

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/b8cd8755-ff2c-4925-8af9-b3061e05365e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


ElasticSearch Hadoop

2014-07-17 Thread James Cook
I've read through much of the documentation for es-hadoop, but I might be 
coming away with some misunderstandings.

The setup docs for Elasticsearch for Apache Hadoop (es-hadoop) use the 
word *interact*, which is a bit vague.

Elasticsearch for Apache Hadoop is an open-source, stand-alone, 
> self-contained, small library that allows Hadoop jobs (whether using 
> Map/Reduce or libraries built upon it such as Hive, Pig or Cascading) to 
> interact with Elasticsearch. Data flows bi-directionaly so that 
> applications can leverage transparently the Elasticsearch engine 
> capabilities to significantly enrich their capabilities and increase the 
> performance.


So, does this mean I have a separate Hadoop instance (potentially built 
upon HDFS or AWS EMR) and I can query data using either the elasticsearch 
(REST/Java/etc) or hadoop (Hive, Pig, Cascading) environments?



 

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/c12c35a0-b9ee-461e-8e81-12910dd06894%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Using script fields in Kibana

2014-07-17 Thread Darby Sager
Gal, I too would appreciate seeing your solution.  Thank you!

On Tuesday, April 1, 2014 9:27:42 AM UTC-7, Gal Zolkover wrote:
>
> Ok, thank you, I'm up for the challenge 😀

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/7e7347fe-dc75-4c97-933c-eddb06484e8c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Need to import lots of lucene libraries to run example code

2014-07-17 Thread Morris Chang
Hi,

I am using Java to learn the Elasticsearch API. I tried to start with the index 
API using the example code on the website's Resources page:

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

import java.util.Date;

import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;

// Start an embedded node and get a client from it
Node node = nodeBuilder().node();
Client client = node.client();

IndexResponse response = client.prepareIndex("twitter", "tweet", "1")
        .setSource(jsonBuilder()
                .startObject()
                .field("user", "kimchy")
                .field("postDate", new Date())
                .field("message", "trying out Elasticsearch")
                .endObject())
        .execute()
        .actionGet();


My problem is that importing only the elasticsearch-1.2.2.jar is not 
enough, and I need to go download .jar files like

org/apache/lucene/search/vectorhighlight/SimpleFieldFragList
org/apache/lucene/index/memory/MemoryIndex.

to solve the java.lang.NoClassDefFoundError.

Is there a more efficient way to import all these required class files 
(instead of googling all the jar files and adding them to the Eclipse libraries)?




-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/c688d658-0211-44f8-87f2-06ec9e4fe619%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Trading index performance for search performance

2014-07-17 Thread joergpra...@gmail.com
The 30m docs may have characteristics (volume, term freqs, mappings) such that
ES limits are reached within your specific configuration. This is hard to
guess without knowing more facts.

Beside improving merge configuration, you might be able to sacrifice
indexing time by assigning limited daily indexing time windows to your
clients.

The indexing process can then be divided into steps:

- connect to cluster
- create index with n shards and replica level 0
- create mappings
- disable refresh rate
- start bulk index
- stop bulk index
- optimize to segment num 1
- enable refresh rate
- add replica levels in order to handle maximum search workload
- invoke warmers
- disconnect from cluster

After the clients have completed indexing, you have a fully optimized
cluster, on which you can put full search load with aggregations etc. with
the highest performance, but while searching you should keep the indexing
silent (or set it even to read only).
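
A rough curl sketch of that sequence, with a hypothetical index name and
placeholder values for the shard/replica counts and refresh settings:

curl -XPUT 'localhost:9200/myindex' -d '{
  "settings": { "number_of_shards": 6, "number_of_replicas": 0, "refresh_interval": "-1" }
}'
# ... create mappings, run the bulk indexing ...
curl -XPOST 'localhost:9200/myindex/_optimize?max_num_segments=1'
curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index": { "refresh_interval": "1s", "number_of_replicas": 2 }
}'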

You do not need to scale vertically by adding hardware to the existing
servers. Scaling horizontally by adding nodes on more servers for the
replicas is the method ES was designed for. Adding nodes will drastically
improve the search capabilities with regard to facets/aggregations.

Jörg


On Thu, Jul 17, 2014 at 5:56 PM, jnortey  wrote:

> At the moment, we're able to bulk index data at a rate faster than we
> actually need. Indexing is not as important to us as being able to quickly
> search for data. Once we start reaching ~30 million documents indexed, we
> start to see performance decreasing in our search queries. What are the
> best techniques for sacrificing indexing time in order to improve search
> performance?
>
>
> A bit more info:
>
> - We have the resources to improve our hardware (memory, CPU, etc) but
> we'd like to maximize the improvements that can be made programmatically or
> using properties before going for hardware increases.
>
> - Our searches make very heavy uses of faceting and aggregations.
>
> - When we run the optimize query, we see *significant* improvements in
> our search times (between 50% and 80% improvements), but as documented,
> this is usually a pretty expensive operation. Is there a way to sacrifice
> indexing time in order to have Elasticsearch index the data more
> efficiently? (I guess sort of mimicking the optimization behavior at index
> time)
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/0e134001-9a55-40c5-a8fc-4c1485a3e6fc%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHWfvjUc5KLUn9HpBpbmjo%3DEeKEQJ6iGMcqHZVCTafV0g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Trading index performance for search performance

2014-07-17 Thread Nikolas Everett
It might be useful to fiddle with the merge configuration to try to end up
with fewer segments.  That'll reduce the IO cost of the underlying Lucene
operations that filter your query before the aggregations.  One win is to
make sure you aren't oversubscribing.  So if you are going for maximum
speed, have one shard per server.  Maybe one less.  If you are going for
maximum throughput (like, total queries) then have fewer total copies of
the data than you do servers.  So if you have 5 shards with 2 replicas,
you'd need at least 15 servers.
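
For the merge side, a hedged sketch of the kind of settings I mean (index
name and values are illustrative only, not recommendations, and this assumes
these merge-policy settings are dynamically updatable in your version):

curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index.merge.policy.segments_per_tier": 5,
  "index.merge.policy.max_merged_segment": "10gb"
}'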


Cutting down the number of segments really helped my throughput, but it might
have been because my workload is different.

Nik


On Thu, Jul 17, 2014 at 11:56 AM, jnortey  wrote:

> At the moment, we're able to bulk index data at a rate faster than we
> actually need. Indexing is not as important to us as being able to quickly
> search for data. Once we start reaching ~30 million documents indexed, we
> start to see performance decreasing in our search queries. What are the
> best techniques for sacrificing indexing time in order to improve search
> performance?
>
>
> A bit more info:
>
> - We have the resources to improve our hardware (memory, CPU, etc) but
> we'd like to maximize the improvements that can be made programmatically or
> using properties before going for hardware increases.
>
> - Our searches make very heavy uses of faceting and aggregations.
>
> - When we run the optimize query, we see *significant* improvements in
> our search times (between 50% and 80% improvements), but as documented,
> this is usually a pretty expensive operation. Is there a way to sacrifice
> indexing time in order to have Elasticsearch index the data more
> efficiently? (I guess sort of mimicking the optimization behavior at index
> time)
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/0e134001-9a55-40c5-a8fc-4c1485a3e6fc%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAPmjWd1WEWDomC20wvP0nja_dtEwDtmNFTfa5fp0AOeirShowA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Trading index performance for search performance

2014-07-17 Thread jnortey
At the moment, we're able to bulk index data at a rate faster than we 
actually need. Indexing is not as important to us as being able to quickly 
search for data. Once we start reaching ~30 million documents indexed, we 
start to see performance decreasing in our search queries. What are the 
best techniques for sacrificing indexing time in order to improve search 
performance?


A bit more info:

- We have the resources to improve our hardware (memory, CPU, etc) but we'd 
like to maximize the improvements that can be made programmatically or 
using properties before going for hardware increases.

- Our searches make very heavy uses of faceting and aggregations.

- When we run the optimize query, we see *significant* improvements in our 
search times (between 50% and 80% improvements), but as documented, this is 
usually a pretty expensive operation. Is there a way to sacrifice indexing 
time in order to have Elasticsearch index the data more efficiently? (I 
guess sort of mimicking the optimization behavior at index time)

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/0e134001-9a55-40c5-a8fc-4c1485a3e6fc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


counting unique root objects on nested aggregations

2014-07-17 Thread Kallin Nagelberg
I'm trying to build a query to aggregate on some fields in a nested 
document, but instead of returning the count of the nested documents for 
each aggregation, I'd like to know the number of root objects.

IE.,

I have a mapping like (from the docs):

"product" : {
"properties" : {
"resellers" : { 
"type" : "nested"
"properties" : {
"name" : { "type" : "string" },
"price" : { "type" : "double" }
}
}
}
}


Now let's say I want to know how many products have each reseller name. 
That's not straight forward as far as I can tell.

If I do an agg like:

"aggs": {
"resellers": {
  "nested": {
"path": "resellers"
  },
  "aggs": {
"names": {
  "terms": {
"field": "resellers.name"
 ...


I'll get back something like ( assuming that each product has many 
resellers):

hits: 20,
aggregations: {
  resellers: {
doc_count: 100,
names : {
   buckets: [
{
key:  'name1'
doc_count: 50 
},
{
key: 'name2',
doc_count: 50
}
]
}}}


So, its aggregating on the nested objects, ie there are 100 reseller nested 
docs, and in them 50 have name1, 50 have name2.

What I'm interested in though is how many products have resellers with name 
1 and name 2.

IE, it should say something like,

- 15 products have a reseller w/ name name1
- 10 products have a reseller w/ name name2

It looks like I can do this by putting a reverse nested aggregation below 
my names agg, then do a value_count aggregation on the ID of the root 
object. This seems kind of round about and I wonder if I'm missing an 
easier way. Any suggestions would be appreciated !

Thanks,
-Kal










-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/1e895c15-accc-4638-b19c-ec1d263ca53b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


How can I break down and diagnose this query error resulting in a NumberFormatException?

2014-07-17 Thread Ryan V
I just converted our ES cluster from 0.90.12 to 1.1.1 and our app from 
NEST 0.12 to 1.0.0-rc1, and have had a really difficult time fixing all the 
breaking changes.  I'm stuck on the following error.  It occurs when I 
execute what is a rather complex search query:

{
[my_index][0]: 
RemoteTransportException[[search3.localdomain][inet[/172.31.xx.xx:9300]][search/phase/query]];
 nested: QueryPhaseExecutionException[[my_index][0]: 
query[filtered(ConstantScore(cache(BooleanFilter(description.contains:[* TO 
*])))^0.5 ConstantScore(cache(BooleanFilter(name.contains:[* TO *])))^0.8 
ConstantScore(cache(BooleanFilter(description.snowball:[* TO *]))) 
ConstantScore(cache(BooleanFilter(name.snowball:[* TO *])))^2.0 
ConstantScore(cache(BooleanFilter(stateName:[* TO *])))^0.5 
ConstantScore(*:*)^0.3)->cache(_type:mysearchtype)],from[0],size[100]: 
Query Failed [Failed to execute main query]];
 nested: ElasticsearchException[java.lang.NumberFormatException: empty 
String];
 nested: UncheckedExecutionException[java.lang.NumberFormatException: empty 
String];
 nested: NumberFormatException[empty String]; 
}


I'm just not quite sure how to debug that.  The exact same query works on 
my 0.90.12 server but not on my 1.1.1 server. Here's the query (sanitized 
a little):

{
"from": 0,
"size": 100,
"sort": {
"_score": {
"order": "desc"
}
},
"facets": {
"cityshortName": {
"terms": {
"field": "cityshortName",
"size": 50,
"order": "count"
},
"facet_filter": {
"and": {
"filters": [{
"and": {
"filters": [{}, {}, {}, {
"fquery": {
"query": {
"query_string": {
"query": "Starbucks",
"default_field": 
"companyshortName",
"analyzer": "whitespace"
}
}
}
}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}]
}
}, {}]
}
}
},
"amenities.shortName": {
"terms": {
"field": "amenities.shortName",
"size": 50,
"order": "count"
},
"facet_filter": {
"and": {
"filters": [{
"and": {
"filters": [{}, {}, {}, {
"fquery": {
"query": {
"query_string": {
"query": "Starbucks",
"default_field": 
"companyshortName",
"analyzer": "whitespace"
}
}
}
}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}]
}
}, {}]
}
}
},
"spaceEventTypes.shortName": {
"terms": {
"field": "spaceEventTypes.shortName",
"size": 50,
"order": "count"
},
"facet_filter": {
"and": {
"filters": [{
"and": {
"filters": [{}, {}, {}, {
"fquery": {
"query": {
"query_string": {
"query": "Starbucks",
"default_field": 
"companyshortName",
"analyzer": "whitespace"
}
}
}
}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}]
}
}, {}]
}
}
},
"spaceType.shortName": {
"terms": {
"field": "spaceType.shortName",
"size": 50,
"order": "count"
},
"facet_filter": {
"and": {
"filters": [{
"and": {
"filters": [{}, {}, {}, {
"fquery": {
"query": {
"query_string": {
"query": "Starbuck

Re: Random sort with a seed changes after updating a doc

2014-07-17 Thread Arny
I'm facing the same issue.
Is there no way to choose what the seed actually gets applied to for the random 
score calculation?
Or just let it pick the _uid, which never changes.
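
For what it's worth, here is an untested sketch of a workaround that derives
the score from _uid yourself via script_score instead of random_score (it
assumes dynamic scripting is enabled and that doc['_uid'] is accessible from
scripts; the seed is just an example value):

{
  "query": {
    "function_score": {
      "script_score": {
        "script": "(doc['_uid'].value + seed).hashCode()",
        "params": { "seed": "773372" }
      }
    }
  },
  "sort": [ { "_score": "desc" } ]
}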

On Tuesday, February 11, 2014 1:00:57 AM UTC+1, Brandon Williams wrote:
>
> I'm using random_score to perform a search with some random sorting, 
> something as simple as this:
>
> {
>   "fields": [
> "id"
>   ],
>   "query": {
> "function_score": {
>   "random_score": {
> "seed": 773372
>   }
> }
>   },
>   "sort": [
> {
>   "_score": "desc"
> }
>   ]
> }
>
> As soon as I update a doc in the index its position will change with this 
> sort. I took a diff of the before and after result set, and with a set of 
> about 180 documents I noticed 8 documents switched places after the update.
>
> If I insert a doc instead of updating a doc then it works as expected. The 
> new doc is inserted into a random spot, but in a consistent manner.
>
> Is this the expected behavior?
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/428e6570-93ca-4d46-940c-450dc4841b69%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: High memory usage on dedicated master nodes

2014-07-17 Thread David K Smith
They are dedicated masters and no queries are going through them. 

smonasco, that's it, I believe. It's ParNew for young gen. I made a mistake in 
our puppet configs and gave the same amount of young-generation memory (Xmn) to 
both data nodes and master nodes, even though master nodes only have 1/4 of the 
memory data nodes have. So all 2G of memory in these master nodes is in the 
young generation. 

> On Jul 17, 2014, at 12:07 AM, smonasco  wrote:
> 
> Maybe I'm missing something, but if you give Java max memory it will use it, if 
> only to store garbage.  The trough after a garbage collection is usually more 
> indicative of what is actually in use.
> 
> This looks like a CMS/parnew setup.  Parnew is fast and low blocking, but 
> leaves stuff behind.  CMS blocks but is really good at taking out the garbage.
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/elasticsearch/0e7c621e-3e85-4bc3-ac0f-a1971cb097f6%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/A5405B23-A1AC-4ABC-B357-81B3977A5E39%40gmail.com.
For more options, visit https://groups.google.com/d/optout.


Bloom filter codec?

2014-07-17 Thread Nikolas Everett
Has anyone had success adding a bloom filter to the codec for any of their
fields?

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings

I imagine it'd help reduce IO from (non multi-term) queries that frequently
don't match.  Like if you have a field that is very specific and useful for
searching but very rarely matches anything.

It looks like the cost is in the range of 10 bits of heap per term per
segment for a false positive probability around 1%.  Meaning it'd be pretty
high if the index had lots of terms - especially if they were in many
segments.  But it'd be about 10 bits per value if the values were mostly
unique.
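
For reference, the mapping hook from the codec page linked above looks
roughly like this (index, type and field names are hypothetical; it
presumably makes the most sense on a mostly-unique, not_analyzed field):

curl -XPUT 'localhost:9200/myindex' -d '{
  "mappings": {
    "mytype": {
      "properties": {
        "serial_number": {
          "type": "string",
          "index": "not_analyzed",
          "postings_format": "bloom_default"
        }
      }
    }
  }
}'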

Nik

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAPmjWd3X11bwogWi9oFTYFzzO6%2BdnvsOqcEFWG_dB5c%2Boy%3D4Fw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


IP geolocation without Logstash

2014-07-17 Thread Justin Koehler
I'm working on a system to record usage data for an application that 
submits its data to an ES cluster. I would like to record the location of 
each data point based on IP geolocation. I found the Logstash plugin that 
uses the GeoIP databases, but I was unable to find any solutions built for 
just Elasticsearch. Has anybody done something like this before?

In addition, it would be convenient to extract the IP of the point itself 
from the "X-Forwarded-For" header of the incoming data point. Is there a 
way to access these headers when the point is received by Elasticsearch?

Thanks in advance for any help.

Justin

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/e7cb0010-103c-4ff7-8cd7-f5da5188f9bc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Does insert order matter for date range queries

2014-07-17 Thread John Smith
Thanks

On Tuesday, 15 July 2014 11:49:56 UTC-4, Nikolas Everett wrote:
>
> I don't believe it matters, no.
>
>
> On Tue, Jul 15, 2014 at 11:47 AM, John Smith  > wrote:
>
>> Say I insert a few documents that have my own "date" field (NOT the ES 
>> insert stamp) but not inserted in order of that specific date field.
>>
>> {
>> ...
>>  "DateMoved": "2014-12-31..."
>> }
>>
>> {
>> ...
>>  "DateMoved": "2013-12-31..."
>> }
>>
>> {
>> ...
>>  "DateMoved": "2014-12-25..."
>> }
>>
>> {
>> ...
>>  "DateMoved": "2012-12-25..."
>> }
>>
>> {
>> ...
>>  "DateMoved": "2013-12-25..."
>> }
>>
>>
>>
>> And so on...
>>
>> If i wanted to do a range query by DateMoved (For all documents in 
>> 2013-12) would it affect the speed of the query?
>>
>> I have been testing my query and seems to be running ok. But just double 
>> checking to see there's no caveats.
>>  
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/cca69f29-b674-4505-9837-712e86ed59d2%40googlegroups.com
>>  
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/2c764769-e24e-4134-8c2f-d88333024724%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: No effect refresh_interval

2014-07-17 Thread Michael McCandless
OK, thanks for bringing closure.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Jul 17, 2014 at 9:02 AM, Marek Dabrowski 
wrote:

> Hello
>
> I found the reason for my problems.
> Index refresh when using the Perl client depends on the parameters "max_count"
> and "max_size" for $e->bulk_helper.
> The values of these parameters determine when a refresh is done on the index.
>
> Thanks for the help.
>
> Regards
>
>
> W dniu czwartek, 17 lipca 2014 09:59:55 UTC+2 użytkownik Marek Dabrowski
> napisał:
>
>> Hello Mike
>>
>> My ES version is 1.2.1
>> I checked the utilization of my cluster nodes. Common values for all nodes are:
>> java process cpu utilization: < 6%
>> os load: < 1
>> io stat: < 15kB/s write
>>
>> I checked the indexing process with 2 methods:
>> a) indexing native json data (13GB split into 100MB chunks)
>> time for i in /tmp/SMT* ; do echo $i; curl -s -XPOST
>> h3:9200/smt_20140501_bulk_json_refresh_600/num/_bulk --data-binary @$i ;
>> rm -f $i; done
>>
>> b) indexing csv data by use perl script
>>
>> my $e = Search::Elasticsearch->new(
>>nodes => [
>>'h3:9200',
>>]
>>);
>>
>>
>> my $bulk = $e->bulk_helper(
>> index => $idx_name,
>> type  => $idx_type,
>> max_count => 1
>> );
>>
>> open(my $DATA, '<', $data_file) or die $!;
>> while(<$DATA>) {
>> chomp;
>>
>> my @data = split(',', $_);
>> $bulk->index({ source => {
>> p0  => $data[0],
>> p1  => $data[1],
>> p2  => $data[2],
>> p3  => $data[3],
>> p4  => $data[4],
>> p5  => $data[5],
>> p6  => $data[6],
>> p7  => $data[7],
>> p8  => $data[8],
>> p9  => $data[9],
>> p10 => $data[10],
>> p11 => $data[11]
>> }});
>>
>> }
>> close($DATA);
>> $bulk->flush;
>>
>> Setting refresh_interval to 600s in both cases has no effect. Data are
>> available immediately. I expected (per the ES documentation) that new data
>> would only become available after 10 minutes and that, consequently, the
>> indexing process would be quicker, but it isn't.
>>
>> Regards
>>
>> W dniu środa, 16 lipca 2014 16:52:31 UTC+2 użytkownik Michael McCandless
>> napisał:
>>>
>>> Which ES version are you using?  You should use the latest (soon to be
>>> 1.3): there have been a number of bulk-indexing improvements recently.
>>>
>>> Are you using the bulk API with multiple/async client threads?  Are you
>>> saturating either CPU or IO in your cluster (so that the test is really a
>>> full cluster capacity test)?
>>>
>>> Also, the relationship between refresh_interval and indexing performance
>>> is tricky: it turns out, -1 is often a poor choice, because it means your
>>> bulk indexing threads are sometimes tied up flushing segments when with
>>> refreshing enabled, it's a separate thread that does that.  So a refresh of
>>> 5s is maybe a good choice.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Wed, Jul 16, 2014 at 6:51 AM, Marek Dabrowski 
>>> wrote:
>>>
 Hello

 My configuration is:
 6 nodes Elasticsearch cluster
 OS: Centos 6.5
 JVM: 1.7.0_25

 The cluster is working fine. I can index data, query, etc. Now I'm running a
 test on a package of about ~50mln docs (~13GB). I would like to get better
 indexing performance. To reach this target I changed the refresh_interval
 parameter. I tested 1s, -1 and 600s. The time for indexing data is the
 same. I checked the configuration (_settings) for the index and the value
 for refresh_interval is ok (has the proper value), eg:

 {
   "smt_20140501_10_20g_norefresh" : {
 "settings" : {
   "index" : {
 "uuid" : "q3imiZGQTDasQUuMWS8oiw",
 "number_of_replicas" : "1",
 "number_of_shards" : "6",
 "refresh_interval" : "600s",
 "version" : {
   "created" : "1020199"
 }
   }
 }
   }
 }



 Creating the index, setting refresh_interval and loading are done on the same
 cluster node. Before each test, the index is deleted and created again with
 the new value of refresh_interval. All cluster nodes log information that
 the parameter has been changed, eg:
 [2014-07-16 11:24:09,813][INFO ][index.shard.service  ] [h6]
 [smt_20140501_10_20g_norefresh][1] updating refresh_interval from
 [1s] to [-1]
 or
 [2014-07-16 11:32:32,928][INFO ][index.shard.service  ] [h6]
 [smt_20140501_10_20g_norefresh][1] updating refresh_interval from
 [1s] to [10m]

 After starting the test, new data is available immediately and the indexing
 time is the same in the 3 cases. I 

Re: No effect refresh_interval

2014-07-17 Thread Marek Dabrowski
Hello

I found the reason for my problems.
Index refresh when using the Perl client depends on the parameters "max_count" 
and "max_size" for $e->bulk_helper.
The values of these parameters determine when a refresh is done on the index.

Thanks for the help.

Regards


W dniu czwartek, 17 lipca 2014 09:59:55 UTC+2 użytkownik Marek Dabrowski 
napisał:
>
> Hello Mike
>
> My ES version is 1.2.1
> I checked the utilization of my cluster nodes. Common values for all nodes are:
> java process cpu utilization: < 6%
> os load: < 1
> io stat: < 15kB/s write
>
> I checked the indexing process with 2 methods:
> a) indexing native json data (13GB split into 100MB chunks)
> time for i in /tmp/SMT* ; do echo $i; curl -s -XPOST 
> h3:9200/smt_20140501_bulk_json_refresh_600/num/_bulk --data-binary @$i ; rm 
> -f $i; done
>
> b) indexing csv data by use perl script
>
> my $e = Search::Elasticsearch->new(
>nodes => [
>'h3:9200',
>]   
>);  
>
>
> my $bulk = $e->bulk_helper(
> index => $idx_name,
> type  => $idx_type,
> max_count => 1
> );
>
> open(my $DATA, '<', $data_file) or die $!; 
> while(<$DATA>) {
> chomp;
>
> my @data = split(',', $_);
> $bulk->index({ source => {  
> p0  => $data[0], 
> p1  => $data[1],
> p2  => $data[2],
> p3  => $data[3],
> p4  => $data[4],
> p5  => $data[5],
> p6  => $data[6],
> p7  => $data[7],
> p8  => $data[8],
> p9  => $data[9],
> p10 => $data[10],
> p11 => $data[11]
> }});
>
> }
> close($DATA);
> $bulk->flush;
>
> Setting refresh_interval to 600s in both cases has no effect. Data are 
> available immediately. I expected (per the ES documentation) that new data 
> would only become available after 10 minutes and that, consequently, the 
> indexing process would be quicker, but it isn't.
>
> Regards
>
> W dniu środa, 16 lipca 2014 16:52:31 UTC+2 użytkownik Michael McCandless 
> napisał:
>>
>> Which ES version are you using?  You should use the latest (soon to be 
>> 1.3): there have been a number of bulk-indexing improvements recently.
>>
>> Are you using the bulk API with multiple/async client threads?  Are you 
>> saturating either CPU or IO in your cluster (so that the test is really a 
>> full cluster capacity test)?
>>
>> Also, the relationship between refresh_interval and indexing performance 
>> is tricky: it turns out, -1 is often a poor choice, because it means your 
>> bulk indexing threads are sometimes tied up flushing segments when with 
>> refreshing enabled, it's a separate thread that does that.  So a refresh of 
>> 5s is maybe a good choice.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Jul 16, 2014 at 6:51 AM, Marek Dabrowski  
>> wrote:
>>
>>> Hello
>>>
>>> My configuration is:
>>> 6 nodes Elasticsearch cluster
>>> OS: Centos 6.5
>>> JVM: 1.7.0_25
>>>
>>> The cluster is working fine. I can index data, query, etc. Now I'm running a 
>>> test on a package of about ~50mln docs (~13GB). I would like to get better 
>>> indexing performance. To reach this target I changed the refresh_interval 
>>> parameter. I tested 1s, -1 and 600s. The time for indexing data is the 
>>> same. I checked the configuration (_settings) for the index and the value 
>>> for refresh_interval is ok (has the proper value), eg:
>>>
>>> {
>>>   "smt_20140501_10_20g_norefresh" : {
>>> "settings" : {
>>>   "index" : {
>>> "uuid" : "q3imiZGQTDasQUuMWS8oiw",
>>> "number_of_replicas" : "1",
>>> "number_of_shards" : "6",
>>> "refresh_interval" : "600s",
>>> "version" : {
>>>   "created" : "1020199"
>>> }
>>>   }
>>> }
>>>   }
>>> }
>>>
>>>
>>>
>>> Creating the index, setting refresh_interval and loading are done on the same 
>>> cluster node. Before each test, the index is deleted and created again with 
>>> the new value of refresh_interval. All cluster nodes log information that 
>>> the parameter has been changed, eg:
>>> [2014-07-16 11:24:09,813][INFO ][index.shard.service  ] [h6] 
>>> [smt_20140501_10_20g_norefresh][1] updating refresh_interval from [1s] 
>>> to [-1]
>>> or
>>> [2014-07-16 11:32:32,928][INFO ][index.shard.service  ] [h6] 
>>> [smt_20140501_10_20g_norefresh][1] updating refresh_interval from [1s] 
>>> to [10m]
>>>
>>> After starting the test, new data is available immediately and the indexing 
>>> time is the same in the 3 cases. I don't know where the failure is. Does 
>>> somebody know what is going on?
>>>
>>> Regards
>>> Marek
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop rece

Re: Cross Fields w/ Fuzziness

2014-07-17 Thread Elliott Bradshaw
I realize that this post is getting a little old, but does the community 
have any feedback on the feasibility of this?

On Friday, May 16, 2014 10:21:53 AM UTC-4, Tom wrote:
>
> +1 fuzziness would be great when using cross_fields
>
> Am Mittwoch, 7. Mai 2014 22:00:25 UTC+2 schrieb Ryan Tanner:
>>
>> Any update to this?
>>
>> On Monday, April 7, 2014 7:59:54 AM UTC-6, Elliott Bradshaw wrote:
>>>
>>> Hi Elasticsearch,
>>>
>>> I've been playing with the new cross_fields multi match type, and I've 
>>> got to say that I love it.  It's a great way to search complex data without 
>>> doing a lot of memory killing denormalization.  That said, is there any 
>>> plan to implement a fuzziness option with this type?  That would certainly 
>>> be very valuable.
>>>
>>> - Elliott
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/6170beb4-36d5-4323-93a1-14a612f601fa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: How many tcp connections should ES/logstash generate ?

2014-07-17 Thread Bastien Chong
My issue is fixed by creating and dropping a daily index.

The "resource temporarily unavailable" was due to the 1024 maximum processes 
for the elasticsearch user. By not deleting per range, the number of processes 
decreased by 10x, and I also increased the ulimit for nproc.
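
For anyone finding this later: dropping a whole daily index is a single call
(assuming hypothetical daily names like myindex-2014.07.16), versus a
delete-by-query that has to find and mark every matching document:

curl -XDELETE 'localhost:9200/myindex-2014.07.16'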

Thanks all for your help.

On Wednesday, July 16, 2014 7:03:40 PM UTC-4, Mark Walkom wrote:
>
> If you are using daily indexes then don't even bother running the delete, 
> just drop the index when the next day rolls around.
>
> "Resource temporarily unavailable" could indicate you may need to 
> increase the ulimit for the user, did you set this in 
> /etc/default/elasticsearch?
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com 
> web: www.campaignmonitor.com
>
>
> On 17 July 2014 03:07, Bastien Chong > 
> wrote:
>
>> Thanks for your input. I'm running ES as another user; I still had root 
>> access.
>>
>> I will refactor and create an index per day, and every 30 sec I'll simply 
>> delete yesterday's index. I'm hoping this greatly reduces the number of 
>> threads.
>>
>>
>> On Wednesday, July 16, 2014 12:02:33 PM UTC-4, Jörg Prante wrote:
>>
>>> First, you should always run ES under another user with least-possible 
>>> privileges, so you can login, even if ES is running out of process space. 
>>> (There are more security related issues that everyone should care about, I 
>>> leave them out here)
>>>
>>> Second, it is not intended that ES runs so many processes. On the other 
>>> hand, ES does not refuse to execute plenty of threads when retrying hard 
>>> to recover from network-related problems. Maybe you can see what the threads 
>>> are doing by executing a "hot threads" command 
>>>
>>> http://www.elasticsearch.org/guide/en/elasticsearch/
>>> reference/current/cluster-nodes-hot-threads.html
>>>
>>> Third, you run every 30 secs a command to "delete by query" with a range 
>>> of many days. That does not seem to make sense. You should always take care 
>>> to complete such queries before continuing, they can take very long time (I 
>>> mean hours). They put a burden on your system. Set up daily indices, this 
>>> is much more efficient, deletions by day are a matter of seconds then.
>>>
>>> Jörg
>>>
>>>
>>>
>>>
>>> On Wed, Jul 16, 2014 at 4:30 PM, Bastien Chong  
>>> wrote:
>>>
 http://serverfault.com/questions/412114/cannot-
 switch-ssh-to-specific-user-su-cannot-set-user-id-resource-temporaril

 Looks like I have the same issue; is it normal that ES spawns that many 
 processes, over 1000?


 On Wednesday, July 16, 2014 9:23:45 AM UTC-4, Bastien Chong wrote:
>
> I'm not sure how to find the answer to that; I use the default settings in 
> ES. The cluster is composed of 2 read/write nodes and a read-only node.
> There is 1 Logstash instance that simply output 2 type of data to ES. 
> Nothing fancy.
>
> I need to delete documents older than a day, for this particular 
> thing, I can't create a daily index. Is there a better way ?
>
> I'm using an EC2 m3.large instance, ES has 1.5GB of heap.
>
> It seems like I'm hitting an OS limit, I can't "su - elasticsearch" : 
>
> su: /bin/bash: Resource temporarily unavailable
>
> Stopping elasticsearch fix this issue, so this is directly linked. 
>
>> -bash-4.1$ ulimit -a
>> core file size  (blocks, -c) 0
>> data seg size   (kbytes, -d) unlimited
>> scheduling priority (-e) 0
>> file size   (blocks, -f) unlimited
>> pending signals (-i) 29841
>> max locked memory   (kbytes, -l) unlimited
>> max memory size (kbytes, -m) unlimited
>> open files  (-n) 65536
>> pipe size(512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> real-time priority  (-r) 0
>> stack size  (kbytes, -s) 8192
>> cpu time   (seconds, -t) unlimited
>> max user processes  (-u) 1024
>> virtual memory  (kbytes, -v) unlimited
>> file locks  (-x) unlimited
>>
>
>
>
>
> On Tuesday, July 15, 2014 6:35:22 PM UTC-4, Mark Walkom wrote:
>>
>> It'd depend on your config I'd guess, in particular how many 
>> workers/threads you have and what ES output you are using in LS.
>>
>> Why are you cleaning an index like this anyway? It seems horribly 
>> inefficient.
>> Basically the error is "OutOfMemoryError", which means you've run 
>> out of heap for the operation to complete. What are the specs for your 
>> node, how much heap does ES have?
>>
>> Regards,
>> Mark Walkom
>>
>> Infrastructure Engineer
>> Campaign Monitor
>> email: ma...@campaignmonitor.com
>> web: www.campaignmonitor.com
>>
>>
>> On 16 July 2014 00:43, Bastien Chong  wrote:
>>
>

Re: Unique/Distinct values from elasticsearch query

2014-07-17 Thread jigish thakar
Hey Soumya,
I needed exactly the same in my implementation, and I'm hardly 2 days old with 
Elasticsearch. 

Can you please post the code snippet you used for the mapping and then to fetch 
unique values?
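
From the multi_field mapping quoted below, a rough, untested sketch of the
kind of request I'm after (the facet name and size are arbitrary); a terms
facet on the untouched sub-field returns each distinct value, with counts
that can be ignored:

{
  "size": 0,
  "facets": {
    "unique_values": {
      "terms": { "field": "value.value_untouched", "size": 100 }
    }
  }
}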

Thanks in advance.


On Thursday, February 20, 2014 12:39:08 PM UTC+5:30, soumya sengupta wrote:
>
> Thanks, that worked
>
>
> On Wed, Feb 19, 2014 at 5:41 PM, aash dhariya  > wrote:
>
>> You can set the type of the field as multi_field. For example:
>>
>>
>> "value" : {
>> "type" : "multi_field",
>> "fields" : {
>>   "value" : {
>> "type" : "string",
>> "analyzer" : "custom_ngram_analyzer"
>>   },
>>   "value_untouched" : {
>> "type" : "string",
>> "index" : "not_analyzed",
>>   }
>> }
>>   }
>>
>> Now, you can compute the facets for value.value_untouched and for search 
>> you can simple use value.
>>
>> To learn more about multi_field, you can visit:
>>
>>
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
>>
>>
>> On Wed, Feb 19, 2014 at 5:02 PM, soumya sengupta > > wrote:
>>
>>> I have got an ngram-analyzer for the field which is preventing me from 
>>> using facets.
>>> What options do I have ?
>>>
>>>
>>> On Wednesday, February 19, 2014 4:36:13 PM UTC+5:30, geeky_sh wrote:
>>>
 You could use facet query to get all the unique values for a particular 
 field. Though you will get the counts too.


 On Wed, Feb 19, 2014 at 3:21 PM, soumya sengupta  
 wrote:

> How to get unique or distinct values from elastic search query ?
> I want to get all the unique vales and not their total count.
>  
> -- 
> You received this message because you are subscribed to the Google 
> Groups "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send 
> an email to elasticsearc...@googlegroups.com.
>
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/elasticsearch/00a45aff-6d67-40ae-8c54-e00481011bce%
> 40googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>



 -- 
 Thanks, 
 Aash
  
>>>  -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to elasticsearc...@googlegroups.com .
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/elasticsearch/d798c2f3-c460-45f1-a83a-fcb5180f40c9%40googlegroups.com
>>> .
>>>
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>
>>
>>
>> -- 
>> Thanks, 
>> Aash
>>  
>> -- 
>> You received this message because you are subscribed to a topic in the 
>> Google Groups "elasticsearch" group.
>> To unsubscribe from this topic, visit 
>> https://groups.google.com/d/topic/elasticsearch/VL0Zn9kXzbk/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to 
>> elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/CAG7ZG_fgm2DV1paXe_OiH7fcKPe0M3bYA8Jr7VMHqxiqZFtmNA%40mail.gmail.com
>> .
>>
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>
>
> -- 
> Soumya Sen Gupta
>  

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/acb62c3d-00d7-4f7a-9b3e-08d64aeb82bf%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [ERROR][bootstrap] {1.2.2}: Initialization Failed ... - NullPointerException[null]

2014-07-17 Thread joergpra...@gmail.com
Are you trying to execute Elasticsearch from a non-executable file system?

Jörg


On Thu, Jul 17, 2014 at 9:45 AM, vjbangis  wrote:

> Does anyone experience below
>
> elasticsearch]$ tail -f /var/log/elasticsearch/lastikman.log
> [2014-07-17 05:37:01,470][INFO ][node ] [Karma]
> version[1.2.2], pid[12325], build[9902f08/2014-07-09T12:02:32Z]
> [2014-07-17 05:37:01,471][INFO ][node ] [Karma]
> initializing ...
> [2014-07-17 05:37:01,479][ERROR][bootstrap] {1.2.2}:
> Initialization Failed ...
> - NullPointerException[null]
> [2014-07-17 07:42:08,916][INFO ][node ] [Valerie
> Cooper] version[1.2.2], pid[19368], build[9902f08/2014-07-09T12:02:32Z]
> [2014-07-17 07:42:08,916][INFO ][node ] [Valerie
> Cooper] initializing ...
> [2014-07-17 07:42:08,926][ERROR][bootstrap] {1.2.2}:
> Initialization Failed ...
> - NullPointerException[null]
>
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/c6c968b4-aeb4-41d3-96d5-608ea4b22444%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHr%2BFmneowduEco%3DYqQRNZYuAN0S9N535fM4WwD_Y8OTA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [Hadoop] : Parsing error in MR integration

2014-07-17 Thread Costin Leau

Hi,

Looked again at your code sample and your configuration is incorrect. For some reason you are using 
FileInput/OutputFormat to set the input and output; since you are using
es-hadoop you need to specify only the input and not the output. Moreover, in your case you are not using the input, so 
potentially you can remove that as well.


Did you set es.resource for es-hadoop? I don't see it set anywhere; though, since no exception was raised, you 
probably configured it somewhere.


I've tried replicating the problem but I can't - the writables are properly converted into JSON. Can you please enable 
logging [1] and report back? Additionally, make sure
you are using the latest build, since the error message is different and should give you more information (what field is 
being extracted and from where)...


Cheers,

[1] 
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/logging.html

On 7/15/14 7:19 PM, Aurélien V wrote:

Hi Costin,

Thanks for the support. Well, I'm still experiencing this issue, and for now I 
see no obvious reason for it. My only guess is about environment stuff, and I'm 
trying to clean Maven dependencies and environment variables and to test 
version compatibility. For the moment nothing has worked.

About the constant, it was a test to ensure my data wasn't corrupted in some 
way. So I'm pretty sure the exception gives no clue about the real issue.


I'll keep you in touch in case I discover the reason; it may interest someone 
after all.
Aurelien


2014-07-14 18:48 GMT+03:00 Costin Leau :

Hi,

Nothing jumps out from your configuration. The error indicates that the 
values passed to es-hadoop cannot be processed for some reason. Which is 
more surprising considering your Mapper writes some constants to the output.
I've pushed some improvements to the 2.x branch which better explain the 
conditions in which the error appears - you can either build the jar 
yourself [1] and test it out or wait for the nightly build to publish the 
artifact [2].

Cheers,

[1] https://github.com/elasticsearch/elasticsearch-hadoop/tree/2.x

[2] http://build.elasticsearch.com/view/Hadoop/job/es-hadoop-nightly-2x/



On 7/14/14 1:19 PM, Aurélien wrote:

Hi,

I can't sort that! I'm using Hadoop CDH3u6, and trying to get ES to index 
my data. I tried with raw JSON and MapWritable;
I always get the same kind of errors:

java.lang.Exception: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: [org.elasticsearch.hadoop.serialization.field.MapWritableFieldExtractor@35b5f7bd] cannot extract value from object [org.apache.hadoop.io.MapWritable@11c757a1]
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:349)
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: [org.elasticsearch.hadoop.serialization.field.MapWritableFieldExtractor@35b5f7bd] cannot extract value from object [org.apache.hadoop.io.MapWritable@11c757a1]
  at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk$FieldWriter.write(TemplatedBulk.java:49)
  at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.writeTemplate(TemplatedBulk.java:101)
  at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.write(TemplatedBulk.java:77)
  at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:130)
  at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:161)
  at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
  at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
  at my.jobs.index.IndexMapper.map(IndexMapper.java:27)
  at my.jobs.index.IndexMapper.map(IndexMapper.java:19)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:648)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
  at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:218)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolEx

Elasticsearch query SQL Server LAG function analog

2014-07-17 Thread Dmitriy Shilonosov
Hi,

I am looking for an analog of the SQL Server LAG/LEAD functions in 
Elasticsearch.

Assume I have a list of documents in result set found by particular 
criteria. The result set is also ordered in some order.

I know the id of one of the documents in that result set and I need to find 
next and/or previous document in the same result set.

SQL Server 2012 and above have LAG/LEAD functions to get the next/previous row 
in the recordset, so I'm wondering if there is such functionality in 
Elasticsearch.

Could you please point me to the corresponding documentation/examples?
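
To illustrate what I'm after, here is an untested sketch of how I imagine it
could be emulated with a seek-style query (the field name and value are made
up): take the known document's sort value, then fetch one document past it in
the same order; "gt"/"asc" would give the next document (LEAD) and
"lt"/"desc" the previous one (LAG):

{
  "size": 1,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "range": { "created_at": { "gt": "2014-07-01T00:00:00" } } }
    }
  },
  "sort": [ { "created_at": "asc" } ]
}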

Here is my question on Stack Overflow, in case you'd like to share knowledge there: 
http://stackoverflow.com/questions/24779002/elasticsearch-query-sql-server-lag-function-analog

Thanks.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/9321fae0-7b55-4c85-bae5-49875859974c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [ERROR][bootstrap] {1.2.2}: Initialization Failed ... - NullPointerException[null]

2014-07-17 Thread vineeth mohan
Hi,

Can you enable debug mode in the log config and paste the debug log
here?

Thanks
 Vineeth


On Thu, Jul 17, 2014 at 1:15 PM, vjbangis  wrote:

> Does anyone experience below
>
> elasticsearch]$ tail -f /var/log/elasticsearch/lastikman.log
> [2014-07-17 05:37:01,470][INFO ][node ] [Karma]
> version[1.2.2], pid[12325], build[9902f08/2014-07-09T12:02:32Z]
> [2014-07-17 05:37:01,471][INFO ][node ] [Karma]
> initializing ...
> [2014-07-17 05:37:01,479][ERROR][bootstrap] {1.2.2}:
> Initialization Failed ...
> - NullPointerException[null]
> [2014-07-17 07:42:08,916][INFO ][node ] [Valerie
> Cooper] version[1.2.2], pid[19368], build[9902f08/2014-07-09T12:02:32Z]
> [2014-07-17 07:42:08,916][INFO ][node ] [Valerie
> Cooper] initializing ...
> [2014-07-17 07:42:08,926][ERROR][bootstrap] {1.2.2}:
> Initialization Failed ...
> - NullPointerException[null]
>
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/c6c968b4-aeb4-41d3-96d5-608ea4b22444%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAGdPd5%3DfmdFdzf-YCWh1G0ffD9bzyM26zSrUpyZHWOSZdihcUg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: No effect refresh_interval

2014-07-17 Thread Marek Dabrowski
Hello Mike

My ES version is 1.2.1
I checked the utilization of my cluster nodes. Common values for all nodes are:
java process cpu utilization: < 6%
os load: < 1
io stat: < 15kB/s write

I checked the indexing process with 2 methods:
a) indexing native json data (13GB split into 100MB chunks)
time for i in /tmp/SMT* ; do echo $i; curl -s -XPOST 
h3:9200/smt_20140501_bulk_json_refresh_600/num/_bulk --data-binary @$i ; rm 
-f $i; done

b) indexing csv data with a perl script

use Search::Elasticsearch;

my $e = Search::Elasticsearch->new(
    nodes => [
        'h3:9200',
    ]
);


my $bulk = $e->bulk_helper(
index => $idx_name,
type  => $idx_type,
max_count => 1
);

open(my $DATA, '<', $data_file) or die $!; 
while(<$DATA>) {
chomp;

my @data = split(',', $_);
$bulk->index({ source => {  
p0  => $data[0], 
p1  => $data[1],
p2  => $data[2],
p3  => $data[3],
p4  => $data[4],
p5  => $data[5],
p6  => $data[6],
p7  => $data[7],
p8  => $data[8],
p9  => $data[9],
p10 => $data[10],
p11 => $data[11]
}});

}
close($DATA);
$bulk->flush;

Setting refresh_interval to 600s in both cases has no effect. Data are 
available immediately. I expected (per the ES documentation) that new data 
would only become available after 10 minutes and that, consequently, the 
indexing process would be quicker, but it isn't.
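
One way to sanity-check visibility independently of the Perl client (the
index name is illustrative): run a count, force a refresh, then count again;
with a long refresh_interval the first count should lag behind the second
while indexing:

curl 'h3:9200/smt_test/_count'
curl -XPOST 'h3:9200/smt_test/_refresh'
curl 'h3:9200/smt_test/_count'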

Regards

W dniu środa, 16 lipca 2014 16:52:31 UTC+2 użytkownik Michael McCandless 
napisał:
>
> Which ES version are you using?  You should use the latest (soon to be 
> 1.3): there have been a number of bulk-indexing improvements recently.
>
> Are you using the bulk API with multiple/async client threads?  Are you 
> saturating either CPU or IO in your cluster (so that the test is really a 
> full cluster capacity test)?
>
> Also, the relationship between refresh_interval and indexing performance 
> is tricky: it turns out -1 is often a poor choice, because it means your 
> bulk-indexing threads are sometimes tied up flushing segments, whereas with 
> refreshing enabled a separate thread does that.  So a refresh of 5s may be 
> a good choice.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Jul 16, 2014 at 6:51 AM, Marek Dabrowski wrote:
>
>> Hello
>>
>> My configuration is:
>> 6 nodes Elasticsearch cluster
>> OS: Centos 6.5
>> JVM: 1.7.0_25
>>
>> The cluster is working fine. I can index data, query, etc. Now I'm running 
>> a test on a package of about ~50 million docs (~13 GB). I would like to get 
>> better performance when indexing data. To achieve this, I changed the 
>> refresh_interval parameter. I tested 1s, -1 and 600s. The indexing time is 
>> the same. I checked the index configuration (_settings) and the value of 
>> refresh_interval is correct, e.g.:
>>
>> {
>>   "smt_20140501_10_20g_norefresh" : {
>> "settings" : {
>>   "index" : {
>> "uuid" : "q3imiZGQTDasQUuMWS8oiw",
>> "number_of_replicas" : "1",
>> "number_of_shards" : "6",
>> "refresh_interval" : "600s",
>> "version" : {
>>   "created" : "1020199"
>> }
>>   }
>> }
>>   }
>> }
>>
>>
>>
>> Creating the index, setting refresh_interval and loading are all done on 
>> the same cluster node. The index is deleted and created again before each 
>> new test with a new value of refresh_interval. All cluster nodes log that 
>> the parameter has been changed, e.g.:
>> [2014-07-16 11:24:09,813][INFO ][index.shard.service  ] [h6] [smt_20140501_10_20g_norefresh][1] updating refresh_interval from [1s] to [-1]
>> or
>> [2014-07-16 11:32:32,928][INFO ][index.shard.service  ] [h6] [smt_20140501_10_20g_norefresh][1] updating refresh_interval from [1s] to [10m]
>>
>> After starting the test, new data is available immediately, and the 
>> indexing time is the same in all 3 cases. I don't know where the failure 
>> is. Does anybody know what is going on?
>>
>> Regards
>> Marek
>>
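
A minimal sketch of the 5s refresh suggested in the reply above, applied dynamically through the standard update-settings API (index name taken from this thread):

curl -XPUT 'h3:9200/smt_20140501_bulk_json_refresh_600/_settings' -d '{
  "index" : { "refresh_interval" : "5s" }
}'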


Re: How to Upgrade from 1.1.1 to 1.2.2 in a Windows environment (as a Windows service)

2014-07-17 Thread Costin Leau

Hi,

Remove the old service (service remove), then install it again using the new 
path.

Going forward you might want to look into using file-system links (which 
Windows Vista+ supports) so that you can make an alias to the folder, install 
the service for it, and reuse that across installs. That is, install under 
c:\elasticsearch\current (which can point to 1.1.0, then 1.1.1, 1.2.2, etc...) 
while your service points to \current.

Of course, you need to check whether each version introduces changes into the 
init/stop script (it does happen, though rarely) and use that.
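
A concrete sketch of that layout (the version numbers and the service id are examples; run from an elevated prompt if needed):

rem remove the service registered against the old path
C:\Elasticsearch\elasticsearch-1.1.0\bin\service remove node-01
rem repoint the alias at the new version (rmdir removes only the junction, not the target)
rmdir C:\Elasticsearch\current
mklink /J C:\Elasticsearch\current C:\Elasticsearch\elasticsearch-1.2.2
rem reinstall the service against the stable alias
C:\Elasticsearch\current\bin\service install node-01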

Cheers,

On 7/17/14 10:02 AM, Wesley Creteur wrote:

Hi,

I'm a recent user of elasticsearch and was wondering what steps I should take 
to upgrade to a newer version of elasticsearch on my Windows Server 2012?

I've installed elasticsearch as a running service on Windows with the following 
commands: (3 nodes)

C:\Elasticsearch\elasticsearch-1.1.0\bin\service install node-01
C:\Elasticsearch\elasticsearch-1.1.0\bin\service install node-02
C:\Elasticsearch\elasticsearch-1.1.0\bin\service install node-03

Best regards



--
Costin



[ERROR][bootstrap] {1.2.2}: Initialization Failed ... - NullPointerException[null]

2014-07-17 Thread vjbangis
Does anyone experience the error below?

elasticsearch]$ tail -f /var/log/elasticsearch/lastikman.log
[2014-07-17 05:37:01,470][INFO ][node ] [Karma] version[1.2.2], pid[12325], build[9902f08/2014-07-09T12:02:32Z]
[2014-07-17 05:37:01,471][INFO ][node ] [Karma] initializing ...
[2014-07-17 05:37:01,479][ERROR][bootstrap] {1.2.2}: Initialization Failed ...
- NullPointerException[null]
[2014-07-17 07:42:08,916][INFO ][node ] [Valerie Cooper] version[1.2.2], pid[19368], build[9902f08/2014-07-09T12:02:32Z]
[2014-07-17 07:42:08,916][INFO ][node ] [Valerie Cooper] initializing ...
[2014-07-17 07:42:08,926][ERROR][bootstrap] {1.2.2}: Initialization Failed ...
- NullPointerException[null]





How to Upgrade from 1.1.1 to 1.2.2 in a Windows environment (as a Windows service)

2014-07-17 Thread Wesley Creteur
Hi,

I'm a recent user of elasticsearch and was wondering what steps I should 
take to upgrade to a newer version of elasticsearch on my Windows Server 
2012?

I've installed elasticsearch as a running service on Windows with the 
following commands: (3 nodes)

C:\Elasticsearch\elasticsearch-1.1.0\bin\service install node-01
C:\Elasticsearch\elasticsearch-1.1.0\bin\service install node-02
C:\Elasticsearch\elasticsearch-1.1.0\bin\service install node-03

Best regards
