Re: Doc Values vs Field Data Questions

2015-05-22 Thread Robert Muir
On Fri, May 22, 2015 at 10:02 AM, Matt Traynham wrote: > Thanks for the clarification Adrien. If that's the case, is there such a > flag that can enable them by default for all fields (excluding non-analyzed > strings; using ~1.4.3 here)? > > Also, do you guys have more performance metrics on us

Re: Elasticsearch is not able to search for Non-English text present in PDF type of attachment

2015-05-12 Thread Robert Muir
It's your PDF (and the font being used plays a role in this case). PDFs encode glyphs (display order), not characters (logical order). Usually the distinction is not important, but for complex writing systems it matters. Open your PDF in Acrobat and highlight the word in question, and do "copy/pas

Re: ES/Lucene eating up entire memory!

2015-03-29 Thread Robert Muir
Do you know what virtual memory is? You have terabytes of it. On Sun, Mar 29, 2015 at 4:22 PM, Yogesh wrote: > Hi, > > I have a single node ES setup (50GB memory, 500GB disk, 4 cores) and I run > the Twitter river on it. I've set the ES_HEAP_SIZE to 5g. However, when I > do "top", the ES process

Re: [MergeException[java.lang.NullPointerException] and All shards failed for phase: [query_fetch]

2015-02-05 Thread Robert Muir
e of > the above exceptions? > > Thank you, > > Cindy > > On Thursday, 5 February 2015 19:25:49 UTC-5, Robert Muir wrote: >> >> Are you using an IBM JDK? Don't do that :) >> >> On Thu, Feb 5, 2015 at 7:01 PM, 'Cindy' via elasticsearch >>

Re: [MergeException[java.lang.NullPointerException] and All shards failed for phase: [query_fetch]

2015-02-05 Thread Robert Muir
Are you using an IBM JDK? Don't do that :) On Thu, Feb 5, 2015 at 7:01 PM, 'Cindy' via elasticsearch wrote: > Hello, > > My environment has 1 linux server installed elasticsearch 1.4.2 with default > settings from rpm package. I use TransportClient to send requests to > elasticsearch. I recently

Re: Elasticsearch continuously crashes

2015-01-11 Thread Robert Muir
Hi, this looks like a compiler bug. If you can actually reproduce it, you can report it to http://bugs.java.com/ Otherwise, as far as working around it, you can try an EA release of a newer JVM (https://jdk8.java.net/download.html), or exclude the method in question from compilation with an -XX:Co

Re: Index corruption when upload large number of documents (4billion+)

2015-01-09 Thread Robert Muir
Why did you snip the stack trace? Can you provide all the information? On Thu, Jan 8, 2015 at 10:37 PM, Darshat wrote: > Hi, > We have a 98 node cluster of ES with each node 32GB RAM. 16GB is reserved > for ES via config file. The index has 98 shards with 2 replicas. > > On this cluster we are lo

Re: ES upgrade 0.20.6 to 1.3.4 -> CorruptIndexException

2014-12-30 Thread Robert Muir
Yes. Again, use the latest version (1.4.x). It's very simple. On Tue, Dec 30, 2014 at 8:53 AM, Georgeta Boanea wrote: > The Lucene bug is referring to 3.0-3.3 versions, Elasticsearch 0.20.6 is > using Lucene 3.6, is it the same bug? > > > On Tuesday, December 30, 2014 2:08:48 P

Re: ES upgrade 0.20.6 to 1.3.4 -> CorruptIndexException

2014-12-30 Thread Robert Muir
This bug occurs because you are upgrading to an old version of elasticsearch (1.3.4). Try the latest version where the bug is fixed: https://issues.apache.org/jira/browse/LUCENE-5975 On Fri, Dec 19, 2014 at 5:40 AM, Georgeta Boanea wrote: > Hi All, > > After upgrading from ES 0.20.6 to 1.3.4 the

Re: Custom _source compression / compaction to reduce disk usage

2014-12-15 Thread Robert Muir
If you want to do such experiments, it will be hard to do with ES, since you would have to plumb a ton of code to even get the results. Instead I write Lucene code to test these things out. This also makes the benchmark fast since I don't "index" anything so there is no real flushing or merging goi

Re: Custom _source compression / compaction to reduce disk usage

2014-12-15 Thread Robert Muir
On Mon, Dec 15, 2014 at 12:53 PM, Eran Duchan wrote: > Seems to me that an uncompressed binary representation ala protobuf would be > smaller than compressed JSON given our schema. > > If that were to prove correct, is it possible to do this? I don't expect > ElasticSearch to do anything except al

Re: Custom _source compression / compaction to reduce disk usage

2014-12-15 Thread Robert Muir
I don't understand what you hope to benefit from it. Using a binary encoding isn't going to improve the compression here really... I have done tests with it. On Mon, Dec 15, 2014 at 12:12 PM, Eran Duchan wrote: > Got it, thanks. > Any insight on a custom _source? Is this doable? > > -- > You rece

Re: Custom _source compression / compaction to reduce disk usage

2014-12-15 Thread Robert Muir
On Mon, Dec 15, 2014 at 11:49 AM, Eran Duchan wrote: > On Monday, December 15, 2014 5:44:38 PM UTC+2, Robert Muir wrote: >> >> That is not the case, blocks of documents are compressed together: > > > Thanks, Robert. > > I unscientifically swam around the code piv

Re: Custom _source compression / compaction to reduce disk usage

2014-12-15 Thread Robert Muir
On Mon, Dec 15, 2014 at 9:20 AM, Eran Duchan wrote: > I understand that _source is compressed, but I assume every document is > compressed separately (our small documents don't benefit from that). That is not the case, blocks of documents are compressed together: https://lucene.apache.org/core/4
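
A rough way to see why compressing blocks of documents together helps small documents is to compare the two approaches on toy data. The sketch below is only an illustration: zlib stands in for whatever codec the stored-fields format actually uses, and the document schema, block size, and function names are made up for the example.

```python
import json
import zlib


def make_docs(n):
    """Build n small, similar JSON documents (hypothetical schema)."""
    return [
        json.dumps(
            {"user_id": i, "status": "active", "country": "US", "score": i % 7}
        ).encode("utf-8")
        for i in range(n)
    ]


def size_compressed_separately(docs):
    """Total size when every document is compressed on its own."""
    return sum(len(zlib.compress(d)) for d in docs)


def size_compressed_as_block(docs):
    """Size when all documents are concatenated and compressed once."""
    return len(zlib.compress(b"\n".join(docs)))


docs = make_docs(200)
raw = sum(len(d) for d in docs)
separate = size_compressed_separately(docs)
block = size_compressed_as_block(docs)
print("raw bytes:", raw)
print("compressed one by one:", separate)
print("compressed as one block:", block)
```

Small documents share most of their field names and structure, so a compressor that sees many of them at once can reuse that redundancy, while per-document compression cannot.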

Re: Corrupted Shard on Recovery

2014-11-11 Thread Robert Muir
ecommended way of upgrading? One minor version at a time or can > I do a rolling upgrade to 1.3.5? > > Thanks! > Christoph > > Am Dienstag, 11. November 2014 19:38:55 UTC+1 schrieb Robert Muir: >> >> The error says "local checksum OK"... what version of elasticsea

Re: Corrupted Shard on Recovery

2014-11-11 Thread Robert Muir
The error says "local checksum OK"... what version of elasticsearch are you running? If it's before 1.3.2, please read this: http://www.elasticsearch.org/blog/elasticsearch-1-3-2-released/ On Wed, Sep 3, 2014 at 12:58 AM, David Kleiner wrote: > Greetings, > > I tried to overcome slowly recoverin

Re: CorruptIndexException when trying to replicate one shard of a new index

2014-10-22 Thread Robert Muir
ednesday, October 22, 2014 5:26:42 PM UTC-4, Robert Muir wrote: >> >> Can you try the workaround mentioned here: >> http://www.elasticsearch.org/blog/elasticsearch-1-3-2-released/ >> >> and see if it works? If the compression issue is the problem, you can >> re-enab

Re: CorruptIndexException when trying to replicate one shard of a new index

2014-10-22 Thread Robert Muir
Can you try the workaround mentioned here: http://www.elasticsearch.org/blog/elasticsearch-1-3-2-released/ and see if it works? If the compression issue is the problem, you can re-enable compression, just upgrade to at least 1.3.2 which has the fix. On Wed, Oct 22, 2014 at 4:57 PM, Nate Folkert

Re: Elasticsearch version upgrade issue -- CorruptIndexException

2014-09-23 Thread Robert Muir
This is a bug in Lucene: https://issues.apache.org/jira/browse/LUCENE-5975 Sorry it took a while, thanks for reporting this! On Tue, Sep 9, 2014 at 7:18 PM, Wei wrote: > Hi All, > > I'm working on an ES upgrade from v0.20.5 to v1.2.1 > I tested in a 2 node cluster, 3 indices, ~4 million docs, 18

Re: Faster sloppy phrase queries

2014-09-08 Thread Robert Muir
On Mon, Sep 8, 2014 at 4:24 PM, Nikolas Everett wrote: > > One thing on my side is that I don't really _need_ phrase queries. I can > play around with the specification a bit so long as I stay sane. I just > need to make documents that contain the terms near each other float to the > top. It'd

Re: JVM crash on 64 bit SPARC with Elasticsearch 1.2.2 due to unaligned memory access

2014-08-22 Thread Robert Muir
How big is it? Maybe I can have it anyway? I pulled two ancient UltraSPARCs out of my closet to try to debug your issue, but unfortunately they are a pita to work with (dead NVRAM battery on both, zeroed MAC address, etc.) I'd still love to get to the bottom of this. On Aug 22, 2014 3:59 PM, wrote:

Re: What is bad of using pulsing postings format?

2014-07-13 Thread Robert Muir
There is not really an advantage to it. The optimization has been incorporated into the default index format of Lucene: https://issues.apache.org/jira/browse/LUCENE-4498 On Sun, Jul 13, 2014 at 10:20 PM, 陳智清 wrote: > From this article > (http://blog.mikemccandless.com/2010/06/lucenes-pulsingcode

Re: Query string query mini-language vs. grammar implementation?

2014-07-10 Thread Robert Muir
On Tue, Jul 8, 2014 at 8:35 PM, x0ne wrote: > Ever since I discovered the mini-language provided through the query string > query > (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html), > I have had a hard time going back to the difficult proces

Re: Trigram-accelerated regex searches

2014-05-22 Thread Robert Muir
On Wed, May 21, 2014 at 6:01 PM, Erik Rose wrote: > I'm trying to move Mozilla's source code search engine (dxr.mozilla.org) > from a custom-written SQLite trigram index to ES. In the current production > incarnation, we support fast regex (and, by extension, wildcard) searches by > extracting tri

Re: Stemmer token filter result is different that it should be

2014-05-20 Thread Robert Muir
When you use the "french" analyzer it uses the Lucene FrenchAnalyzer behind the scenes, which does not use the snowball algorithm. It uses the Savoy stemmer, the same as specifying "light_french" stemmer: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-stemmer-token

Re: Removing unused fields (more Lucene than ES but..)

2014-04-03 Thread Robert Muir
ect of what happened to us for background. > > > On 1 April 2014 18:29, Robert Muir wrote: >> >> On Tue, Apr 1, 2014 at 2:41 AM, Paul Smith wrote: >> > >> > Thanks Robert for the reply, all of that sounds fairly hairy. I did try >> > a >> > ful

Re: ElasticSearch standard Analyzer - problematic case

2014-04-03 Thread Robert Muir
The standard analyzer doesn't really know anything about emails/URLs, it's just implementing the Unicode tokenization rules. There is an extension of it that does know about these things (and tries to keep them as one token)... http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/

Re: "Locale" parameter in query_string query

2014-04-01 Thread Robert Muir
This controls the behavior of the string conversions triggered by lowercase_expanded_terms. For example Turkish/Azeri have different casing characteristics: http://en.wikipedia.org/wiki/Dotted_and_dotless_I On Tue, Apr 1, 2014 at 2:43 AM, Prashant Agrawal wrote: > Any updates on the above query?
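
To make the casing difference concrete, here is a minimal sketch of locale-sensitive lowercasing. Python's `str.lower()` does not take a locale, so the Turkish/Azeri rule is applied by hand; `lower_for_locale` and `TURKIC_LANGS` are hypothetical names for this example and are not part of Elasticsearch or Lucene.

```python
# Languages assumed to use the dotted/dotless I casing rules.
TURKIC_LANGS = {"tr", "az"}


def lower_for_locale(text, lang):
    """Lowercase text, applying Turkish/Azeri I rules when lang asks for it."""
    if lang in TURKIC_LANGS:
        # Capital dotless I lowercases to dotless ı (U+0131), and
        # capital dotted İ (U+0130) lowercases to plain i.
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.lower()


print(lower_for_locale("TITLE", "en"))  # title
print(lower_for_locale("TITLE", "tr"))  # tıtle
```

The same input produces different terms under the two locales, so a lowercased wildcard or prefix term can match in one locale and miss in the other.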

Re: Removing unused fields (more Lucene than ES but..)

2014-04-01 Thread Robert Muir
On Tue, Apr 1, 2014 at 2:41 AM, Paul Smith wrote: > > Thanks Robert for the reply, all of that sounds fairly hairy. I did try a > full optimize of the shard index using Luke, but the residual über-segment > still has the field definitions in it. Are you saying in (1) that the creating > of a new Sh

Re: Removing unused fields (more Lucene than ES but..)

2014-03-31 Thread Robert Muir
It is actually possible in Lucene 4, but there is nothing really convenient set up to do this. You have two choices there: 1. trigger a massive merge (essentially an optimize), by wrapping all readers and calling IndexWriter.addIndexes(Reader...). 2. wrap readers in a custom merge policy and do it

Re: Indexing performance with doc values (particularly with larger number of fields)

2014-03-23 Thread Robert Muir
Would be a nice benchmark to run (and if you find hotspots/slow things to go improve in Lucene...)! The data structures for docvalues are less complex than the data structures for the inverted index. I've enabled docvalues for many fields as you suggest in the past, and in my tests the time for e

Re: Fast vector highlighter does not work with explicit span_near queries

2014-03-22 Thread Robert Muir
On Sat, Mar 22, 2014 at 10:26 AM, Harry Waye wrote: > Thanks Robert, a useful caveat to add to the highlighter docs? Yes, I think so. > > The reason I looked into using the fvh was due to the plain highlighter not > using the correct analyzer if the index analyzer was specified via >

Re: Fast vector highlighter does not work with explicit span_near queries

2014-03-22 Thread Robert Muir
FVH definitely doesn't recognize span-near queries. In general, when it comes to the spanquery family the plain highlighter will work better because it has explicit support for those queries. Maybe you want to open an issue? It's not obvious how to fix though, because of how span-near queries can a

Re: Stop filter problem: enablePositionIncrements=false is not supported anymore as of Lucene 4.4 as it can create broken token streams

2013-12-18 Thread Robert Muir
The old disabling of position increments was bogus. For example, a stop filter could remove a token and "move" a synonym from one word to another. So this option conflated two unrelated things: whether or not a "gap" should be introduced when a word is removed, and whether any existing positions (e
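
The "gap" behaviour can be sketched with a toy token stream. This is a simplified model rather than Lucene code: tokens are `(term, position_increment)` pairs, an increment of 0 means "stacked on the previous token" (as a synonym would be), and `stop_filter` and `absolute_positions` are hypothetical helpers written for this example.

```python
STOPWORDS = {"the", "a", "of"}


def stop_filter(tokens, stopwords=STOPWORDS):
    """Drop stopwords but carry their increments over to the next kept token."""
    out = []
    skipped = 0
    for term, inc in tokens:
        if term in stopwords:
            skipped += inc  # remember the hole left by the removed token
            continue
        out.append((term, inc + skipped))
        skipped = 0
    return out


def absolute_positions(tokens):
    """Turn position increments into absolute positions (first token at 0)."""
    pos = -1
    result = []
    for term, inc in tokens:
        pos += inc
        result.append((term, pos))
    return result


# "fast" is a synonym stacked on "quick" (increment 0).
tokens = [("the", 1), ("quick", 1), ("fast", 0), ("fox", 1)]
print(absolute_positions(stop_filter(tokens)))
```

In this model the removed word leaves a hole at position 0, while "fast" keeps its increment of 0 and therefore stays attached to "quick" instead of drifting to a neighbouring word.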