Re: Doc Values vs Field Data Questions

2015-05-22 Thread Robert Muir
On Fri, May 22, 2015 at 10:02 AM, Matt Traynham  wrote:

> Thanks for the clarification Adrien.  If that's the case, is there such a
> flag that can enable them by default for all fields (excluding non-analyzed
> strings; using ~1.4.3 here)?
>
> Also, do you guys have more performance metrics on using Doc Values vs
> FDC?  I've seen the "10-25%" slower value thrown around, but I wanted to
> know what that was tested with (CPU, mem, spinning vs. SSD, etc...) and
> where gains may be had.
>

In my debugging, the current differences are usually the cost of a
predictable branch (a bounds check), coming from ByteBuffer.get().
Fielddata uses simple Java arrays, and today the Java compiler can
optimize away the checks more easily in that case.
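
As a rough illustration (not from the original reply; the names and data are
made up, and this is a sketch rather than a benchmark), here is a minimal Java
comparison of the two access patterns being described:

import java.nio.ByteBuffer;

public class BoundsCheckSketch {
    // Doc-values-like path: every get(i) goes through ByteBuffer's bounds check.
    static long sumBuffer(ByteBuffer buf, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += buf.get(i); // predictable branch on each access
        }
        return sum;
    }

    // Fielddata-like path: a plain array, whose checks the JIT can more easily remove.
    static long sumArray(byte[] arr, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += arr[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        ByteBuffer buf = ByteBuffer.wrap(data);
        System.out.println(sumBuffer(buf, data.length) == sumArray(data, data.length));
    }
}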

But IMO benchmarking here is usually not done correctly: it doesn't
account for the impact of keeping such huge, badly-compressed data in heap
memory, e.g. the impact on GC and the other problems people run into. So I
recommend doing a test with real data and real workloads :)



Re: Elasticsearch is not able to search for Non-English text present in PDF type of attachment

2015-05-12 Thread Robert Muir
It's your PDF (and the font being used plays a role in this case).

PDFs encode glyphs (display order), not characters (logical order).
Usually the distinction is not important, but for complex writing systems
it matters.

Open your PDF in Acrobat, highlight the word in question, and do a
copy/paste: you will see it pastes the same way.
You can also see this bogus mapping clearly if you extract the font data
with FontForge (attached).
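
As an aside (my own illustration, not from the original reply): one quick way
to see what the extraction actually produced is to dump the Unicode code
points of the word as typed versus the word as indexed, using the strings from
the report quoted below.

import java.util.stream.Collectors;

public class CodepointDump {
    public static void main(String[] args) {
        print("expected ", "अधिकार");  // logical order, as typed
        print("extracted", "अधधकार");  // what reportedly ended up in the index
    }

    static void print(String label, String s) {
        // Render each code point as U+XXXX so the reordering/substitution is visible.
        String points = s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
        System.out.println(label + ": " + points);
    }
}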


On Tue, May 12, 2015 at 5:17 AM, Prashant Agrawal <
prashant.agra...@paladion.net> wrote:

> Hi Team,
>
> We are facing an issue when searching non-English text indexed from a PDF
> attachment. Below are the complete details.
>
> 1) I have a PDF document, New_Pdf_issue.pdf, which is attached to this
> mail.
> 2) I created an indexing request along with a mapping, attached as
> pdf_index_issue.sh.
> 3) If you look at the PDF attachment you will find keywords such as
> "अधिकार", but if I search for "अधिकार" I do not get any matching
> documents.
>
> Note: Here is what we observe when we run the following search query:
> {
>   "fields": [
> "SessionAtt.content_type",
> "SessionAtt"
>   ],
>   "query": {
> "bool": {
>   "must": [
> {
>   "query_string": {
> "fields": [
>   "Content",
>   "SessionAtt"
> ],
> "query": "*"
>   }
> }
>   ]
> }
>   }
> }
>
> We observe that the word "अधिकार" has been indexed as "अधधकार".
>
> Can anyone let me know what the issue could be?
>
> ~Prashant
>
> pdf_index_issue.sh
> <
> http://elasticsearch-users.115913.n3.nabble.com/file/n4074717/pdf_index_issue.sh
> >
> New_Pdf_issue.pdf
> <
> http://elasticsearch-users.115913.n3.nabble.com/file/n4074717/New_Pdf_issue.pdf
> >
>
>
>
> --
> View this message in context:
> http://elasticsearch-users.115913.n3.nabble.com/Elasticsearch-is-not-able-to-search-for-Nonnglish-text-present-in-PDF-type-of-attachment-tp4074717.html
> Sent from the Elasticsearch Users mailing list archive at Nabble.com.
>



Re: ES/Lucene eating up entire memory!

2015-03-29 Thread Robert Muir
Do you know what virtual memory is? You have terabytes of it.

On Sun, Mar 29, 2015 at 4:22 PM, Yogesh  wrote:

> Hi,
>
> I have a single-node ES setup (50GB memory, 500GB disk, 4 cores) and I run
> the Twitter river on it. I've set ES_HEAP_SIZE to 5g. However, when I
> do "top", the ES process shows VIRT memory of around 34g, which I assume
> is the max mapped memory. The %MEM, though, always hovers around 10%.
>
> However, within a few days post-reboot, the memory used keeps going up,
> from 10g to almost 50g (as shown in the third line), because of which my
> other DBs start behaving badly, despite the fact that VIRT and %MEM still
> hover around the same 34g and 10% respectively. Below is a snapshot of
> "top".
>
> Please help me understand where my memory is going over time! My one guess
> is that Lucene is eating it up. How do I remedy it?
>
> Thanks-in-advance!
>
>
>
> 
>
>



Re: [MergeException[java.lang.NullPointerException] and All shards failed for phase: [query_fetch]

2015-02-05 Thread Robert Muir
I know this because I have seen this same IBM JDK bug in our tests many times.

I also wrote the code in question. NPE is not possible.
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_2/lucene/core/src/java/org/apache/lucene/codecs/lucene49/Lucene49NormsConsumer.java#L213

This is why we say: don't use the IBM JDK with Lucene.

On Thu, Feb 5, 2015 at 7:39 PM, 'Cindy' via elasticsearch
 wrote:
> Yes, I am using the IBM JDK. How did you detect that? Does it show in the
> log?
>
> I know there is a Lucene IBM JVM bug. Is this related? What is the cause of
> the above exceptions?
>
> Thank you,
>
> Cindy
>
> On Thursday, 5 February 2015 19:25:49 UTC-5, Robert Muir wrote:
>>
>> Are you using an IBM JDK? Don't do that :)
>>
>> On Thu, Feb 5, 2015 at 7:01 PM, 'Cindy' via elasticsearch
>>  wrote:
>> > Hello,
>> >
>> > My environment has 1 linux server installed elasticsearch 1.4.2 with
>> > default
>> > settings from rpm package. I use TransportClient to send requests to
>> > elasticsearch. I recently changed to use rpm package. It has been
>> > working
>> > fine for a couple of day. But today I saw the following errors when I
>> > indexed a data set I had successfully indexed before and tried to query
>> > a
>> > simple word. In the meantime, the server takes much longer time to index
>> > and
>> > delete indices.
>> >
>> >
>> > [2015-02-05 18:01:01,105][WARN ][index.merge.scheduler] [Kkallakku]
>> > [wa_value_index_v1][0] failed to merge
>> > java.lang.NullPointerException
>> > at
>> >
>> > org.apache.lucene.codecs.lucene49.Lucene49NormsConsumer$NormMap.getOrd(Lucene49NormsConsumer.java:249)
>> > at
>> >
>> > org.apache.lucene.codecs.lucene49.Lucene49NormsConsumer.addNumericField(Lucene49NormsConsumer.java:150)
>> > at
>> >
>> > org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:129)
>> > at
>> > org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:255)
>> > at
>> > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:133)
>> > at
>> > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4173)
>> > at
>> > org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3768)
>> > at
>> >
>> > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>> > at
>> >
>> > org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
>> > at
>> >
>> > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>> > [2015-02-05 18:01:01,106][WARN ][index.engine.internal] [Kkallakku]
>> > [wa_value_index_v1][0] failed engine [merge exception]
>> > org.apache.lucene.index.MergePolicy$MergeException:
>> > java.lang.NullPointerException
>> > at
>> >
>> > org.elasticsearch.index.merge.scheduler.ConcurrentMergeSchedulerProvider$CustomConcurrentMergeScheduler.handleMergeException(ConcurrentMergeSchedulerProvider.java:133)
>> > at
>> >
>> > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
>> > Caused by: java.lang.NullPointerException
>> > at
>> >
>> > org.apache.lucene.codecs.lucene49.Lucene49NormsConsumer$NormMap.getOrd(Lucene49NormsConsumer.java:249)
>> > at
>> >
>> > org.apache.lucene.codecs.lucene49.Lucene49NormsConsumer.addNumericField(Lucene49NormsConsumer.java:150)
>> > at
>> >
>> > org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:129)
>> > at
>> > org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:255)
>> > at
>> > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:133)
>> > at
>> > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4173)
>> > at
>> > org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3768)
>> > at
>> >
>> > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>> > at
>> >
>> > org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107

Re: [MergeException[java.lang.NullPointerException] and All shards failed for phase: [query_fetch]

2015-02-05 Thread Robert Muir
Are you using an IBM JDK? Don't do that :)

On Thu, Feb 5, 2015 at 7:01 PM, 'Cindy' via elasticsearch
 wrote:
> Hello,
>
> My environment is one Linux server with Elasticsearch 1.4.2 installed with
> default settings from the RPM package. I use TransportClient to send requests
> to Elasticsearch. I recently changed to using the RPM package. It had been
> working fine for a couple of days. But today I saw the following errors when I
> indexed a data set I had successfully indexed before and tried to query a
> simple word. In the meantime, the server takes much longer to index and
> delete indices.
>
>
> [2015-02-05 18:01:01,105][WARN ][index.merge.scheduler] [Kkallakku]
> [wa_value_index_v1][0] failed to merge
> java.lang.NullPointerException
> at
> org.apache.lucene.codecs.lucene49.Lucene49NormsConsumer$NormMap.getOrd(Lucene49NormsConsumer.java:249)
> at
> org.apache.lucene.codecs.lucene49.Lucene49NormsConsumer.addNumericField(Lucene49NormsConsumer.java:150)
> at
> org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:129)
> at
> org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:255)
> at
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:133)
> at
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4173)
> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3768)
> at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> at
> org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
> at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> [2015-02-05 18:01:01,106][WARN ][index.engine.internal] [Kkallakku]
> [wa_value_index_v1][0] failed engine [merge exception]
> org.apache.lucene.index.MergePolicy$MergeException:
> java.lang.NullPointerException
> at
> org.elasticsearch.index.merge.scheduler.ConcurrentMergeSchedulerProvider$CustomConcurrentMergeScheduler.handleMergeException(ConcurrentMergeSchedulerProvider.java:133)
> at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
> Caused by: java.lang.NullPointerException
> at
> org.apache.lucene.codecs.lucene49.Lucene49NormsConsumer$NormMap.getOrd(Lucene49NormsConsumer.java:249)
> at
> org.apache.lucene.codecs.lucene49.Lucene49NormsConsumer.addNumericField(Lucene49NormsConsumer.java:150)
> at
> org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:129)
> at
> org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:255)
> at
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:133)
> at
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4173)
> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3768)
> at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> at
> org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
> at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> [2015-02-05 18:01:01,369][WARN ][cluster.action.shard ] [Kkallakku]
> [wa_value_index_v1][0] sending failed shard for [wa_value_index_v1][0],
> node[KpdH3su1QSyTx8lekwshAA], [P], s[STARTED], indexUUID
> [LgwNpUo7RRqjnIaVcA2rLA], reason [engine failure, message [merge
> exception][MergeException[java.lang.NullPointerException]; nested:
> NullPointerException; ]]
> [2015-02-05 18:01:01,369][WARN ][cluster.action.shard ] [Kkallakku]
> [wa_value_index_v1][0] received shard failed for [wa_value_index_v1][0],
> node[KpdH3su1QSyTx8lekwshAA], [P], s[STARTED], indexUUID
> [LgwNpUo7RRqjnIaVcA2rLA], reason [engine failure, message [merge
> exception][MergeException[java.lang.NullPointerException]; nested:
> NullPointerException; ]]
>
>
> org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to
> execute phase [query_fetch], all shards failed
> at
> org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:233)
> at
> org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:156)
> at
> org.elasticsearch.action.search.type.TransportSearchQueryAndFetchAction.doExecute(TransportSearchQueryAndFetchAction.java:55)
> at
> org.elasticsearch.action.search.type.TransportSearchQueryAndFetchAction.doExecute(TransportSearchQueryAndFetchAction.java:45)
> at
> org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
> at
> org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:107)
> at
> org.elasticsearch.action.search.TransportSearchAct

Re: Elasticsearch continuously crashes

2015-01-11 Thread Robert Muir
Hi, this looks like a compiler bug. If you can actually reproduce it,
you can report it to http://bugs.java.com/

Otherwise, as far as working around it goes, you can try an EA release of a
newer JVM (https://jdk8.java.net/download.html), or exclude the method
in question from compilation with an -XX:CompileCommand. Unfortunately
the method in question is the "main indexing loop" of Lucene, so there
is likely some performance impact from doing this.

On Sun, Jan 11, 2015 at 6:27 AM, abdo  wrote:
> Hi all
>
> I am using Elasticsearch 1.4.1 to index logs. I run it on a machine with 12 GB
> RAM and 8 cores. I dedicate 6 GB to the heap. I am using Oracle JVM 1.8.0_25.
> ES starts crashing after it reaches 8 million records. Attached is the error
> log, hs_err_pid26939.log.
> 
>
>
>
> --
> View this message in context: 
> http://elasticsearch-users.115913.n3.nabble.com/Elasticsearch-continuously-crashes-tp4068818.html
> Sent from the ElasticSearch Users mailing list archive at Nabble.com.
>



Re: Index corruption when upload large number of documents (4billion+)

2015-01-09 Thread Robert Muir
Why did you snip the stack trace? Can you provide all the information?

On Thu, Jan 8, 2015 at 10:37 PM, Darshat  wrote:
> Hi,
> We have a 98 node cluster of ES with each node 32GB RAM. 16GB is reserved
> for ES via config file. The index has 98 shards with 2 replicas.
>
> On this cluster we are loading a large number of documents (when done it
> would be about 10 billion). In this use case about 40million documents are
> generated per hour and we are pre-loading several days worth of documents to
> prototype how ES will scale, and its query performance.
>
> Right now we are facing problems getting data loaded. Indexing is turned
> off. We use NEST client, with batch size of 10k. To speed up data load, we
> distribute the hourly data to each of the 98 nodes to insert in parallel.
> This worked ok for a few hours till we got 4.5B documents in the cluster.
>
> After that the cluster state went red. The outstanding-tasks CAT API
> shows errors like the ones below. CPU/disk/memory seem to be fine on the nodes.
>
> Why are we getting these errors? Any help is greatly appreciated, since this
> blocks prototyping ES for our use case.
>
> thanks
> Darshat
>
> Sample errors:
>
> source   : shard-failed ([agora_v1][24],
>node[00ihc1ToRiqMDJ1lou1Sig], [R], s[INITIALIZING]),
>reason [Failed to start shard, message
>[RecoveryFailedException[[agora_v1][24]: Recovery
>failed from [Shingen
> Harada][RDAwqX9yRgud9f7YtZAJPg][CH1
>SCH060051438][inet[/10.46.153.84:9300]] into
> [Elfqueen][
>
> 00ihc1ToRiqMDJ1lou1Sig][CH1SCH050053435][inet[/10.46.182
>.106:9300]]]; nested:
> RemoteTransportException[[Shingen
>
> Harada][inet[/10.46.153.84:9300]][internal:index/shard/r
>ecovery/start_recovery]]; nested:
>RecoveryEngineException[[agora_v1][24] Phase[1]
>Execution failed]; nested:
>RecoverFilesRecoveryException[[agora_v1][24] Failed
> to
>transfer [0] files with total size of [0b]]; nested:
> NoS
>
> uchFileException[D:\app\ES.ElasticSearch_v010\elasticsea
>
> rch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1
>\24\index\segments_6r]; ]]
>
>
> AND
>
> source   : shard-failed ([agora_v1][95],
>node[PUsHFCStRaecPA6MuvJV9g], [P], s[INITIALIZING]),
>reason [Failed to start shard, message
>[IndexShardGatewayRecoveryException[[agora_v1][95]
>failed to fetch index version after copying it over];
>nested: CorruptIndexException[[agora_v1][95]
>Preexisting corrupted index
>[corrupted_1wegvS7BSKSbOYQkX9zJSw] caused by:
>CorruptIndexException[Read past EOF while reading
>segment infos]
>EOFException[read past EOF:
> MMapIndexInput(path="D:\
>
> app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-el
>
> asticsearch\nodes\0\indices\agora_v1\95\index\segments_1
>1j")]
>org.apache.lucene.index.CorruptIndexException: Read
>past EOF while reading segment infos
>at
> org.elasticsearch.index.store.Store.readSegmentsI
>nfo(Store.java:127)
>at
> org.elasticsearch.index.store.Store.access$400(St
>ore.java:80)
>at
> org.elasticsearch.index.store.Store$MetadataSnaps
>hot.buildMetadata(Store.java:575)
> ---snip more stack trace-
>
>
>
>
>
>
>
>
> --
> View this message in context: 
> http://elasticsearch-users.115913.n3.nabble.com/Index-corruption-when-upload-large-number-of-documents-4billion-tp4068742.html
> Sent from the ElasticSearch Users mailing list archive at Nabble.com.
>



Re: ES upgrade 0.20.6 to 1.3.4 -> CorruptIndexException

2014-12-30 Thread Robert Muir
Yes. Again, use the latest version (1.4.x). It's very simple.

On Tue, Dec 30, 2014 at 8:53 AM, Georgeta Boanea  wrote:
> The Lucene bug refers to versions 3.0-3.3; Elasticsearch 0.20.6 is
> using Lucene 3.6. Is it the same bug?
>
>
> On Tuesday, December 30, 2014 2:08:48 PM UTC+1, Robert Muir wrote:
>>
>> This bug occurs because you are upgrading to an old version of
>> elasticsearch (1.3.4). Try the latest version where the bug is fixed:
>> https://issues.apache.org/jira/browse/LUCENE-5975
>>
>> On Fri, Dec 19, 2014 at 5:40 AM, Georgeta Boanea  wrote:
>> > Hi All,
>> >
>> > After upgrading from ES 0.20.6 to 1.3.4 the following messages occurred:
>> >
>> > [2014-12-19 10:02:06.714 GMT] WARN ||
>> > elasticsearch[es-node-name][generic][T#14]
>> > org.elasticsearch.cluster.action.shard  [es-node-name] [index-name][3]
>> > sending failed shard for [index-name][3], node[qOTLmb3IQC2COXZh1n9O2w],
>> > [P],
>> > s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard,
>> > message
>> > [IndexShardGatewayRecoveryException[[index-name][3] failed to fetch
>> > index
>> > version after copying it over]; nested:
>> > CorruptIndexException[[index-name][3] Corrupted index
>> > [corrupted_Ackui00SSBi8YXACZGNDkg] caused by: CorruptIndexException[did
>> > not
>> > read all bytes from file: read 112 vs size 113 (resource:
>> >
>> > BufferedChecksumIndexInput(NIOFSIndexInput(path="path/3/index/_uzm_2.del")))]];
>> > ]]
>> >
>> > [2014-12-19 10:02:08.390 GMT] WARN ||
>> > elasticsearch[es-node-name][generic][T#20]
>> > org.elasticsearch.indices.cluster
>> > [es-node-name] [index-name][3] failed to start shard
>> > org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
>> > [index-name][3] failed to fetch index version after copying it over
>> > at
>> >
>> > org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
>> > at
>> >
>> > org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
>> > at
>> >
>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> > at
>> >
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> > at java.lang.Thread.run(Thread.java:745)
>> > Caused by: org.apache.lucene.index.CorruptIndexException:
>> > [index-name][3]
>> > Corrupted index [corrupted_Ackui00SSBi8YXACZGNDkg] caused by:
>> > CorruptIndexException[did not read all bytes from file: read 112 vs size
>> > 113
>> > (resource:
>> >
>> > BufferedChecksumIndexInput(NIOFSIndexInput(path="path/3/index/_uzm_2.del")))]
>> > at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:353)
>> > at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:338)
>> > at
>> >
>> > org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)
>> > ... 4 more
>> >
>> > Shard [3] of the index remains unallocated and the cluster remains in a
>> > RED
>> > state.
>> >
>> > curl -XGET 'http://localhost:48012/_cluster/health?pretty=true'
>> > {
>> >   "cluster_name" : "cluster-name",
>> >   "status" : "red",
>> >   "timed_out" : false,
>> >   "number_of_nodes" : 5,
>> >   "number_of_data_nodes" : 5,
>> >   "active_primary_shards" : 10,
>> >   "active_shards" : 20,
>> >   "relocating_shards" : 0,
>> >   "initializing_shards" : 1,
>> >   "unassigned_shards" : 1
>> > }
>> >
>> > If I do an optimize (curl -XPOST
>> > http://localhost:48012/index-name/_optimize?max_num_segments=1) for the
>> > index before the update, everything is fine. Optimize works just before
>> > the
>> > update, if is done after the update the problem remains the same.
>> >
>> > Any idea why this problem occurs?
>> > Is there another way to avoid this problem? I want to avoid optimize in
>> > case
>> > of large volume of data.
>> >
>> > Thank you,
>> > Georgeta
>> >

Re: ES upgrade 0.20.6 to 1.3.4 -> CorruptIndexException

2014-12-30 Thread Robert Muir
This bug occurs because you are upgrading to an old version of
Elasticsearch (1.3.4). Try the latest version, where the bug is fixed:
https://issues.apache.org/jira/browse/LUCENE-5975

On Fri, Dec 19, 2014 at 5:40 AM, Georgeta Boanea  wrote:
> Hi All,
>
> After upgrading from ES 0.20.6 to 1.3.4 the following messages occurred:
>
> [2014-12-19 10:02:06.714 GMT] WARN ||
> elasticsearch[es-node-name][generic][T#14]
> org.elasticsearch.cluster.action.shard  [es-node-name] [index-name][3]
> sending failed shard for [index-name][3], node[qOTLmb3IQC2COXZh1n9O2w], [P],
> s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, message
> [IndexShardGatewayRecoveryException[[index-name][3] failed to fetch index
> version after copying it over]; nested:
> CorruptIndexException[[index-name][3] Corrupted index
> [corrupted_Ackui00SSBi8YXACZGNDkg] caused by: CorruptIndexException[did not
> read all bytes from file: read 112 vs size 113 (resource:
> BufferedChecksumIndexInput(NIOFSIndexInput(path="path/3/index/_uzm_2.del")))]];
> ]]
>
> [2014-12-19 10:02:08.390 GMT] WARN ||
> elasticsearch[es-node-name][generic][T#20] org.elasticsearch.indices.cluster
> [es-node-name] [index-name][3] failed to start shard
> org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
> [index-name][3] failed to fetch index version after copying it over
> at
> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
> at
> org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.lucene.index.CorruptIndexException: [index-name][3]
> Corrupted index [corrupted_Ackui00SSBi8YXACZGNDkg] caused by:
> CorruptIndexException[did not read all bytes from file: read 112 vs size 113
> (resource:
> BufferedChecksumIndexInput(NIOFSIndexInput(path="path/3/index/_uzm_2.del")))]
> at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:353)
> at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:338)
> at
> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)
> ... 4 more
>
> Shard [3] of the index remains unallocated and the cluster remains in a RED
> state.
>
> curl -XGET 'http://localhost:48012/_cluster/health?pretty=true'
> {
>   "cluster_name" : "cluster-name",
>   "status" : "red",
>   "timed_out" : false,
>   "number_of_nodes" : 5,
>   "number_of_data_nodes" : 5,
>   "active_primary_shards" : 10,
>   "active_shards" : 20,
>   "relocating_shards" : 0,
>   "initializing_shards" : 1,
>   "unassigned_shards" : 1
> }
>
> If I do an optimize (curl -XPOST
> http://localhost:48012/index-name/_optimize?max_num_segments=1) for the
> index before the update, everything is fine. Optimize works just before the
> update; if it is done after the update, the problem remains the same.
>
> Any idea why this problem occurs?
> Is there another way to avoid it? I want to avoid an optimize in the case
> of a large volume of data.
>
> Thank you,
> Georgeta
>



Re: Custom _source compression / compaction to reduce disk usage

2014-12-15 Thread Robert Muir
If you want to do such experiments, it will be hard to do with ES,
since you would have to plumb a ton of code just to get the results.

Instead, I write Lucene code to test these things out. This also makes
the benchmark fast, since I don't "index" anything, so there is no real
flushing or merging going on to make benchmarking more difficult (when
I want to measure the performance of things like merging, I force it to
happen at predictable intervals to keep the index size comparisons valid).
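
For what it's worth, here is a rough sketch of that kind of standalone Lucene
test (my own illustration against the Lucene 4.10-era API; all names and
payloads are made up): index a synthetic stored-field payload, force-merge once
at a predictable point, and compare the resulting directory sizes between runs
that use different payload encodings.

import java.nio.charset.StandardCharsets;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class StoredFieldsSizeSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LATEST, new KeywordAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, iwc)) {
            for (int i = 0; i < 100000; i++) {
                // Substitute the real payloads under test here (e.g. JSON vs. a binary encoding).
                byte[] payload = ("{\"id\":" + i + ",\"v\":\"abc\"}").getBytes(StandardCharsets.UTF_8);
                Document doc = new Document();
                doc.add(new StoredField("_source", payload));
                writer.addDocument(doc);
            }
            writer.forceMerge(1); // merge at a predictable point so size comparisons stay valid
        }
        long total = 0;
        for (String file : dir.listAll()) {
            total += dir.fileLength(file);
        }
        System.out.println("index size in bytes: " + total);
    }
}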

On Mon, Dec 15, 2014 at 1:11 PM, Eran Duchan  wrote:
> Given those stats I totally agree, but this would have to vary given 
> different schemas... That's why I'd like to at least experiment with it. Is 
> this even possible through the public http interface?
>



Re: Custom _source compression / compaction to reduce disk usage

2014-12-15 Thread Robert Muir
On Mon, Dec 15, 2014 at 12:53 PM, Eran Duchan  wrote:
> It seems to me that an uncompressed binary representation à la protobuf would
> be smaller than compressed JSON, given our schema.
>
> If that were to prove correct, is it possible to do this? I don't expect
> Elasticsearch to do anything except allow me to control the contents of
> _source.
>

The question is how much. For example, in my tests it saves 2% with LZ4
and 0.5% with deflate (the larger block size used there makes it
super-not-worth-it).

It's not worth the complexity IMO.



Re: Custom _source compression / compaction to reduce disk usage

2014-12-15 Thread Robert Muir
I don't understand what you hope to gain from it.

Using a binary encoding isn't going to improve the compression here
really... I have done tests with it.

On Mon, Dec 15, 2014 at 12:12 PM, Eran Duchan  wrote:
> Got it, thanks.
> Any insight on a custom _source? Is this doable?
>



Re: Custom _source compression / compaction to reduce disk usage

2014-12-15 Thread Robert Muir
On Mon, Dec 15, 2014 at 11:49 AM, Eran Duchan  wrote:
> On Monday, December 15, 2014 5:44:38 PM UTC+2, Robert Muir wrote:
>>
>> That is not the case, blocks of documents are compressed together:
>
>
> Thanks, Robert.
>
> I unscientifically swam around the code pivoting on this and saw that:
>
> - This isn't tweakable: I can't choose to compress in larger chunks.
> - 2.0.0 will have an option to use deflate for better compression.
>
> So if I can't tweak _source compression, can I shove in a _source of my own, as
> posted originally in (1)?

It's not really tweakable at all before Lucene 5; that's why we added a
higher-compression option. Note this option is not just deflate: it
also uses a larger block size and other internal parameters.

Using a larger block size (64KB) for deflate is really a simple
workaround to get the feature out sooner rather than later, with the idea
that people who choose BEST_COMPRESSION are willing to sacrifice some
retrieval speed.

Increasing the block size hurts retrieval performance and
is not really the best way overall to get better compression when
there is high redundancy across documents. In the future I hope we can
add preset-dictionary support for sharing across blocks.

So the current block size should really be seen as an internal detail.
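
For reference, the higher-compression option lands in the Lucene 5 stored
fields format; below is a rough sketch of how it is selected at the Lucene
level (my own illustration, and the exact class names are an assumption on my
part; this is not something the 1.x releases discussed here expose).

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.codecs.lucene50.Lucene50Codec;
import org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat;
import org.apache.lucene.index.IndexWriterConfig;

public class BestCompressionSketch {
    public static void main(String[] args) {
        // Trade some stored-fields retrieval speed for better compression (deflate, larger blocks).
        IndexWriterConfig iwc = new IndexWriterConfig(new KeywordAnalyzer());
        iwc.setCodec(new Lucene50Codec(Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION));
        System.out.println(iwc.getCodec());
    }
}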



Re: Custom _source compression / compaction to reduce disk usage

2014-12-15 Thread Robert Muir
On Mon, Dec 15, 2014 at 9:20 AM, Eran Duchan  wrote:
> I understand that _source is compressed, but I assume every document is
> compressed separately (our small documents don't benefit from that).

That is not the case; blocks of documents are compressed together:

https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/codecs/lucene41/Lucene41StoredFieldsFormat.html



Re: Corrupted Shard on Recovery

2014-11-11 Thread Robert Muir
First, I would try the workaround mentioned in the article: disable
the compression and see if that fixes the issue.
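
As a concrete sketch of that workaround (my own illustration against the ES 1.x
Java API; the setting name indices.recovery.compress is my reading of the
workaround in that blog post, so verify it, and it can just as well be set in
elasticsearch.yml):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class DisableRecoveryCompression {
    public static void main(String[] args) {
        TransportClient client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        try {
            // Apply the (assumed) dynamic cluster setting to turn off recovery compression.
            client.admin().cluster().prepareUpdateSettings()
                    .setTransientSettings(ImmutableSettings.settingsBuilder()
                            .put("indices.recovery.compress", false)
                            .build())
                    .get();
        } finally {
            client.close();
        }
    }
}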

On Tue, Nov 11, 2014 at 1:42 PM, Christoph Tavan
 wrote:
> I'm running 1.3.1. Thanks a lot for the hint. I will try to upgrade and let
> you know.
>
> What is the recommended way of upgrading? One minor version at a time or can
> I do a rolling upgrade to 1.3.5?
>
> Thanks!
> Christoph
>
> Am Dienstag, 11. November 2014 19:38:55 UTC+1 schrieb Robert Muir:
>>
>> The error says "local checksum OK"... what version of elasticsearch
>> are you running?
>>
>> If its before 1.3.2, please read this:
>> http://www.elasticsearch.org/blog/elasticsearch-1-3-2-released/
>>
>>
>> On Wed, Sep 3, 2014 at 12:58 AM, David Kleiner 
>> wrote:
>> > Greetings,
>> >
>> > I tried to overcome slowly recovering replica set, changed the number of
>> > replicas on index to 0, then to 1, getting this exception:
>> >
>> >
>> > 
>> > [2014-09-02 23:51:59,738][WARN ][indices.recovery ] [Salvador
>> > Dali]
>> > [...-2014.08.29][1] File corruption on recovery name [_40d_es090_0.pos],
>> > length [11345418], checksum [ekoi4m], writtenBy [LUCENE_4_9] local
>> > checksum
>> > OK
>> > org.apache.lucene.index.CorruptIndexException: checksum failed (hardware
>> > problem?) : expected=ekoi4m actual=1pdwf09 (resource=name
>> > [_40d_es090_0.pos], length [11345418], checksum [ekoi4m], writtenBy
>> > [LUCENE_4_9])
>> > at
>> >
>> > org.elasticsearch.index.store.Store$VerifyingIndexOutput.readAndCompareChecksum(Store.java:684)
>> > at
>> >
>> > org.elasticsearch.index.store.Store$VerifyingIndexOutput.writeBytes(Store.java:696)
>> > at
>> >
>> > org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:589)
>> > at
>> >
>> > org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:533)
>> > at
>> >
>> > org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
>> > at
>> >
>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> > at
>> >
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> > at java.lang.Thread.run(Thread.java:745)
>> >
>> >
>> > 
>> >
>> > Any pointers?
>> >
>> >
>



Re: Corrupted Shard on Recovery

2014-11-11 Thread Robert Muir
The error says "local checksum OK"... What version of Elasticsearch
are you running?

If it's before 1.3.2, please read this:
http://www.elasticsearch.org/blog/elasticsearch-1-3-2-released/


On Wed, Sep 3, 2014 at 12:58 AM, David Kleiner  wrote:
> Greetings,
>
> I tried to overcome slowly recovering replica set, changed the number of
> replicas on index to 0, then to 1, getting this exception:
>
> 
> [2014-09-02 23:51:59,738][WARN ][indices.recovery ] [Salvador Dali]
> [...-2014.08.29][1] File corruption on recovery name [_40d_es090_0.pos],
> length [11345418], checksum [ekoi4m], writtenBy [LUCENE_4_9] local checksum
> OK
> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware
> problem?) : expected=ekoi4m actual=1pdwf09 (resource=name
> [_40d_es090_0.pos], length [11345418], checksum [ekoi4m], writtenBy
> [LUCENE_4_9])
> at
> org.elasticsearch.index.store.Store$VerifyingIndexOutput.readAndCompareChecksum(Store.java:684)
> at
> org.elasticsearch.index.store.Store$VerifyingIndexOutput.writeBytes(Store.java:696)
> at
> org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:589)
> at
> org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:533)
> at
> org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> 
>
> Any pointers?
>
>



Re: CorruptIndexException when trying to replicate one shard of a new index

2014-10-22 Thread Robert Muir
Thanks for closing the loop.

On Wed, Oct 22, 2014 at 6:01 PM, Nate Folkert  wrote:
> After disabling compression, I was able to successfully replicate that
> shard, so looks like we're hitting that bug.  I guess we'll have to upgrade!
>
> Thanks!
> - Nate
>
> On Wednesday, October 22, 2014 5:26:42 PM UTC-4, Robert Muir wrote:
>>
>> Can you try the workaround mentioned here:
>> http://www.elasticsearch.org/blog/elasticsearch-1-3-2-released/
>>
>> and see if it works? If the compression issue is the problem, you can
>> re-enable compression, just upgrade to at least 1.3.2 which has the
>> fix.
>>
>>
>> On Wed, Oct 22, 2014 at 4:57 PM, Nate Folkert 
>> wrote:
>> > Created and populated a new index on a 1.3.1 cluster.  Primary shards
>> > work
>> > fine.  Updated the index to create several replicas, and three of the
>> > four
>> > shards replicated, but one shard fails to replicate on any node with the
>> > following error (abbreviated some of the hashes for readability):
>> >
>> >>> [2014-10-22 20:31:54,549][WARN ][index.engine.internal] [NODENAME]
>> >>> [INDEXNAME][2] failed engine [corrupted preexisting index]
>> >>>
>> >>> [2014-10-22 20:31:54,549][WARN ][indices.cluster  ] [NODENAME]
>> >>> [INDEXNAME][2] failed to start shard
>> >>>
>> >>> org.apache.lucene.index.CorruptIndexException: [INDEXNAME][2]
>> >>> Corrupted
>> >>> index [CORRUPTED] caused by: CorruptIndexException[codec footer
>> >>> mismatch:
>> >>> actual footer=1161826848 vs expected footer=-1071082520 (resource:
>> >>> MMapIndexInput(path="DATAPATH/INDEXNAME/2/index/_7cp.fdt"))]
>> >>>
>> >>> at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:343)
>> >>>
>> >>> at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:328)
>> >>>
>> >>> at
>> >>>
>> >>> org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:723)
>> >>>
>> >>> at
>> >>>
>> >>> org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:576)
>> >>>
>> >>> at
>> >>>
>> >>> org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:183)
>> >>>
>> >>> at
>> >>>
>> >>> org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:444)
>> >>>
>> >>> at
>> >>>
>> >>> org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
>> >>>
>> >>> at
>> >>>
>> >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>>
>> >>> at
>> >>>
>> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>>
>> >>> at java.lang.Thread.run(Thread.java:745)
>> >>>
>> >>> [2014-10-22 20:31:54,549][WARN ][cluster.action.shard ] [NODENAME]
>> >>> [INDEXNAME][2] sending failed shard for [INDEXNAME][2], node[NODEID],
>> >>> [R],
>> >>> s[INITIALIZING], indexUUID [INDEXID], reason [Failed to start shard,
>> >>> message
>> >>> [CorruptIndexException[[INDEXNAME][2] Corrupted index [CORRUPTED]
>> >>> caused by:
>> >>> CorruptIndexException[codec footer mismatch: actual footer=1161826848
>> >>> vs
>> >>> expected footer=-1071082520 (resource:
>> >>> MMapIndexInput(path="DATAPATH/INDEXNAME/2/index/_7cp.fdt"))
>> >>>
>> >>> [2014-10-22 20:31:54,550][WARN ][cluster.action.shard ] [NODENAME]
>> >>> [INDEXNAME][2] sending failed shard for [INDEXNAME][2], node[NODEID],
>> >>> [R],
>> >>> s[INITIALIZING], indexUUID [INDEXID], reason [engine failure, message
>> >>> [corrupted preexisting index][CorruptIndexException[[INDEXNAME][2]
>> >>> Corrupted
>> >>> index [CORRUPTED] caused by: CorruptIndexException[codec footer
>> >>> mismatch:
>> >>>

Re: CorruptIndexException when trying to replicate one shard of a new index

2014-10-22 Thread Robert Muir
Can you try the workaround mentioned here:
http://www.elasticsearch.org/blog/elasticsearch-1-3-2-released/

and see if it works? If the compression issue is the problem, you can
re-enable compression; just upgrade to at least 1.3.2, which has the
fix.


On Wed, Oct 22, 2014 at 4:57 PM, Nate Folkert  wrote:
> Created and populated a new index on a 1.3.1 cluster.  Primary shards work
> fine.  Updated the index to create several replicas, and three of the four
> shards replicated, but one shard fails to replicate on any node with the
> following error (abbreviated some of the hashes for readability):
>
>>> [2014-10-22 20:31:54,549][WARN ][index.engine.internal] [NODENAME]
>>> [INDEXNAME][2] failed engine [corrupted preexisting index]
>>>
>>> [2014-10-22 20:31:54,549][WARN ][indices.cluster  ] [NODENAME]
>>> [INDEXNAME][2] failed to start shard
>>>
>>> org.apache.lucene.index.CorruptIndexException: [INDEXNAME][2] Corrupted
>>> index [CORRUPTED] caused by: CorruptIndexException[codec footer mismatch:
>>> actual footer=1161826848 vs expected footer=-1071082520 (resource:
>>> MMapIndexInput(path="DATAPATH/INDEXNAME/2/index/_7cp.fdt"))]
>>>
>>> at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:343)
>>>
>>> at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:328)
>>>
>>> at
>>> org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:723)
>>>
>>> at
>>> org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:576)
>>>
>>> at
>>> org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:183)
>>>
>>> at
>>> org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:444)
>>>
>>> at
>>> org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> [2014-10-22 20:31:54,549][WARN ][cluster.action.shard ] [NODENAME]
>>> [INDEXNAME][2] sending failed shard for [INDEXNAME][2], node[NODEID], [R],
>>> s[INITIALIZING], indexUUID [INDEXID], reason [Failed to start shard, message
>>> [CorruptIndexException[[INDEXNAME][2] Corrupted index [CORRUPTED] caused by:
>>> CorruptIndexException[codec footer mismatch: actual footer=1161826848 vs
>>> expected footer=-1071082520 (resource:
>>> MMapIndexInput(path="DATAPATH/INDEXNAME/2/index/_7cp.fdt"))
>>>
>>> [2014-10-22 20:31:54,550][WARN ][cluster.action.shard ] [NODENAME]
>>> [INDEXNAME][2] sending failed shard for [INDEXNAME][2], node[NODEID], [R],
>>> s[INITIALIZING], indexUUID [INDEXID], reason [engine failure, message
>>> [corrupted preexisting index][CorruptIndexException[[INDEXNAME][2] Corrupted
>>> index [CORRUPTED] caused by: CorruptIndexException[codec footer mismatch:
>>> actual footer=1161826848 vs expected footer=-1071082520 (resource:
>>> MMapIndexInput(path="DATAPATH/INDEXNAME/2/index/_7cp.fdt"))
>
>
> The index is stuck now in a state where the shards try to replicate on one
> set of nodes, hit this failure, and then switch to try to replicate on a
> different set of nodes.  Have been looking around to see if anyone's
> encountered a similar issue but haven't found anything useful yet.  Anybody
> know if this is recoverable or if I should just scrap it and try building a
> new one?
>
> - Nate
>



Re: Elasticsearch version upgrade issue -- CorruptIndexException

2014-09-23 Thread Robert Muir
This is a bug in Lucene: https://issues.apache.org/jira/browse/LUCENE-5975

Sorry it took a while, thanks for reporting this!

On Tue, Sep 9, 2014 at 7:18 PM, Wei  wrote:
> Hi All,
>
> I'm working on an ES upgrade from v0.20.5 to v1.2.1.
> I tested in a 2-node cluster: 3 indices, ~4 million docs, 18G of files, 20
> shards, 1 replica.
> However, after bumping the version and rebooting the cluster, I kept seeing
> that some shards were damaged. The ES log said:
> Caused by: org.apache.lucene.index.CorruptIndexException: did not read all
> bytes from file: read 451 vs size 452 (resource:
> BufferedChecksumIndexInput(MMapIndexInput(path="/18/index/_195c_i.del")))
>
> This badly blocked the version upgrade in my case.
> Could anyone point me to the cause of this issue?
> I'd really appreciate your help!
>
>
> Wei
>



Re: Faster sloppy phrase queries

2014-09-08 Thread Robert Muir
On Mon, Sep 8, 2014 at 4:24 PM, Nikolas Everett  wrote:
>
> One thing on my side is that I don't really _need_ phrase queries.  I can
> play around with the specification a bit so long as I stay sane.  I just
> need to make documents that contain the terms near each other float to the
> top.  It'd be better if it were the exact phrases, but some false positives are
> probably ok.  The phrase query got the job done, but if there is a way to
> cheat, I'm happy to try.

For this purpose, why not stay with small window sizes (e.g. your 64,
or maybe even much smaller)? IMO, terms being present within massively
large windows means nothing. Personally I would consider one much
smaller, like 5. I know there have been experiments/papers around
this, which I can dig up if you need, but I think it's also kind of
intuitive.

This is probably a lot easier than doing anything around speeding up
sloppy phrase scoring.
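
Purely as an illustration of "one much smaller, like 5" (my sketch against the
Lucene 4.x API; the field and terms are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class SmallSlopSketch {
    public static void main(String[] args) {
        PhraseQuery pq = new PhraseQuery();
        pq.setSlop(5); // terms must land within a small window, rather than e.g. 64
        pq.add(new Term("body", "quick"));
        pq.add(new Term("body", "fox"));
        System.out.println(pq); // prints: body:"quick fox"~5
    }
}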



Re: JVM crash on 64 bit SPARC with Elasticsearch 1.2.2 due to unaligned memory access

2014-08-22 Thread Robert Muir
How big is it? Maybe I can have it anyway? I pulled two ancient UltraSPARCs
out of my closet to try to debug your issue, but unfortunately they are a
pain to work with (dead NVRAM battery on both, zeroed MAC address, etc.). I'd
still love to get to the bottom of this.
On Aug 22, 2014 3:59 PM,  wrote:

> Hi Adrien,
> It's a bunch of garbled binary data, basically a dump of the process image.
> Tony
>
>
> On Thursday, August 21, 2014 6:36:12 PM UTC-4, Adrien Grand wrote:
>>
>> Hi Tony,
>>
>> Do you have more information in the core dump file? (cf. the "Core dump
>> written" line that you pasted)
>>
>>
>> On Thu, Aug 21, 2014 at 7:53 PM,  wrote:
>>
>>> Hello,
>>> I installed ES 1.3.2 on a spare Solaris 11 / T4-4 SPARC server to scale
>>> out of a small x86 machine.  I get a similar exception running ES with
>>> JAVA_OPTS=-d64.  When Logstash 1.4.1 sends the first message I get the
>>> error below on the ES process:
>>>
>>>
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> #  SIGBUS (0xa) at pc=0x7a9a3d8c, pid=14473, tid=209
>>> #
>>> # JRE version: 7.0_25-b15
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>> solaris-sparc compressed oops)
>>> # Problematic frame:
>>> # V  [libjvm.so+0xba3d8c]  Unsafe_GetInt+0x158
>>> #
>>> # Core dump written. Default location: 
>>> /export/home/elasticsearch/elasticsearch-1.3.2/core
>>> or core.14473
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> #   http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>>
>>> ---  T H R E A D  ---
>>>
>>> Current thread (0x000107078000):  JavaThread
>>> "elasticsearch[KYLIE1][http_server_worker][T#17]{New I/O worker #147}"
>>> daemon [_thread_in_vm, id=209, stack(0x5b80,
>>> 0x5b84)]
>>>
>>> siginfo:si_signo=SIGBUS: si_errno=0, si_code=1 (BUS_ADRALN),
>>> si_addr=0x000709cc09e7
>>>
>>>
>>> I can run ES using 32-bit Java but have to shrink ES_HEAP_SIZE more than
>>> I want to.  Any assistance would be appreciated.
>>>
>>> Regards,
>>> Tony
>>>
>>>
>>> On Tuesday, July 22, 2014 5:43:28 AM UTC-4, David Roberts wrote:

 Hello,

 After upgrading from Elasticsearch 1.0.1 to 1.2.2 I'm getting JVM core
 dumps on Solaris 10 on SPARC.

 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGBUS (0xa) at pc=0x7e452d78, pid=15483, tid=263
 #
 # JRE version: Java(TM) SE Runtime Environment (7.0_55-b13) (build
 1.7.0_55-b13)
 # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.55-b03 mixed mode
 solaris-sparc compressed oops)
 # Problematic frame:
 # V  [libjvm.so+0xc52d78]  Unsafe_GetLong+0x158

 I'm pretty sure the problem here is that Elasticsearch is making
 increasing use of "unsafe" functions in Java, presumably to speed things
 up, and some CPUs are more picky than others about memory alignment.  In
 particular, x86 will tolerate misaligned memory access whereas SPARC won't.

 Somebody has tried to report this to Oracle in the past and
 (understandably) Oracle has said that if you're going to use unsafe
 functions you need to understand what you're doing:
 http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8021574

 A quick grep through the code of the two versions of Elasticsearch
 shows that the new use of "unsafe" memory access functions is in the
 BytesReference, MurmurHash3 and HyperLogLogPlusPlus classes:

 bash-3.2$ git checkout v1.0.1
 Checking out files: 100% (2904/2904), done.

 bash-3.2$ find . -name '*.java' | xargs grep UnsafeUtils
 ./src/main/java/org/elasticsearch/common/util/UnsafeUtils.java:public enum UnsafeUtils {
 ./src/main/java/org/elasticsearch/search/aggregations/bucket/BytesRefHash.java:if (id == -1L || UnsafeUtils.equals(key, get(id, spare))) {
 ./src/main/java/org/elasticsearch/search/aggregations/bucket/BytesRefHash.java:} else if (UnsafeUtils.equals(key, get(curId, spare))) {
 ./src/test/java/org/elasticsearch/benchmark/common/util/BytesRefComparisonsBenchmark.java:import org.elasticsearch.common.util.UnsafeUtils;
 ./src/test/java/org/elasticsearch/benchmark/common/util/BytesRefComparisonsBenchmark.java:return UnsafeUtils.equals(b1, b2);

 bash-3.2$ git checkout v1.2.2
 Checking out files: 100% (2220/2220), done.

 bash-3.2$ find . -name '*.java' | xargs grep UnsafeUtils
 ./src/main/java/org/elasticsearch/common/bytes/BytesReference.java:import org.elasticsearch.common.util.UnsafeUtils;
 ./src/main/java/org/elasticsearch/common/bytes/BytesReference.java:return UnsafeUtils.equals(a.array(), a.arrayOffset(), b.array(), b.arrayOffset(), a.length());
 ./src/main/java/org/elasticsearch/common/hash/MurmurHash3.java:import org.elasticsearch.com

Re: What is bad of using pulsing postings format?

2014-07-13 Thread Robert Muir
There is not really an advantage to it.

The optimization has been incorporated into the default index format
of Lucene: https://issues.apache.org/jira/browse/LUCENE-4498
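
For reference, this is roughly how a per-field postings format could be
selected in the ES 1.x mapping API (a sketch with made-up index and field
names; the option was deprecated and later removed precisely because the
default codec already covers the low-frequency-term case):

curl -XPUT 'localhost:9200/test' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "id": {
          "type": "string",
          "index": "not_analyzed",
          "postings_format": "pulsing"
        }
      }
    }
  }
}'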

On Sun, Jul 13, 2014 at 10:20 PM, 陳智清  wrote:
> From this article
> (http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html)
> I know pulsing codec saves one disk seek hence introduces performance gain.
> I would like to know what I pay for using pulsing codec? What will happen if
> I give it a high cut-off frequency so that all postings are stored in term
> dictionary?
>
> In other words, instead of the goodness, I would like to know what is the
> drawback of using pulsing postings format?
>
> Thank you.
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/4d4c5ad8-baab-4c81-85a5-dc75095a7f5a%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZXKEDQiBw%3DwmyLPh47bkJh%3DUT5E%2B_yrfS797vgCuADdBg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Query string query mini-language vs. grammar implementation?

2014-07-10 Thread Robert Muir
On Tue, Jul 8, 2014 at 8:35 PM, x0ne  wrote:
> Ever since I discovered the mini-language provided through the query string
> query
> (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html),
> I have had a hard time going back to the difficult process of mapping what
> someone wants to a proper elasticsearch query. As such, I have essentially
> provided users with the ability to create their own query strings and then
> execute them directly against the cluster (10s of millions of documents).
>
> This approach works great until several complex queries are run in a row
> which then appears to send the cluster into an OOM panic. Is there a way to
> put some sanity checks inside of the query string query to avoid insane
> results coming back? Can I limit the number of results loaded onto the heap
> or put into the cache? Have others just rolled their own grammar parsing
> instead of using the mini-language directly?

Have you looked at simple query string?

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html

This one is more limited, but it has a flags parameter that lets you
turn every feature or operator on/off. So you could disable wildcard,
phrase, fuzzy, etc.
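
A minimal sketch (the index and field names are invented) that only keeps
boolean operators and phrases, with prefix and fuzzy expansion turned off:

curl -XPOST 'localhost:9200/test/_search' -d '{
  "query": {
    "simple_query_string": {
      "query": "\"brown fox\" +quick -lazy",
      "fields": ["body"],
      "flags": "AND|OR|NOT|PHRASE"
    }
  }
}'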

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZVu-f9NiOxByFCzm4zaWVZ0y6%2BypaKqripvPHBDdLLiLQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Trigram-accelerated regex searches

2014-05-22 Thread Robert Muir
On Wed, May 21, 2014 at 6:01 PM, Erik Rose  wrote:
> I'm trying to move Mozilla's source code search engine (dxr.mozilla.org)
> from a custom-written SQLite trigram index to ES. In the current production
> incarnation, we support fast regex (and, by extension, wildcard) searches by
> extracting trigrams from the search pattern and paring down the documents to
> those containing said trigrams.

This is definitely a great approach for a database, but it won't work
exactly the same way for an inverted index, because the data structure
is totally different.

In an inverted index, queries like wildcards are slow: they must
iterate over and match all terms in the document collection, then intersect
those postings with the rest of your query. Because the index is inverted,
it works backwards from what you expect, and that's why adding
additional intersections like 'AND' doesn't speed anything up: those
intersections haven't happened yet when the terms are being enumerated.

N-grams can speed up partial matching in general, but the methods to
accomplish this are different: usually the best way to go about it is
to think about analyzing the data in such a way that the queries you
need become as basic as possible.

The first question is whether you really need partial matching at all: I
don't have much knowledge about your use case, but just going from
your example, I would look at wildcards like "*Children*Next*" and ask
whether instead I'd want to ensure my analyzer splits on case changes, and
then see if I could get what I need with a sloppy phrase query.
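
As a sketch of that idea (the index, analyzer, and field names are invented
here): a word_delimiter filter that splits tokens on case changes, plus a
sloppy phrase query at search time:

curl -XPUT 'localhost:9200/code' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "split_on_case": {
          "type": "word_delimiter",
          "split_on_case_change": true
        }
      },
      "analyzer": {
        "code": {
          "tokenizer": "whitespace",
          "filter": ["split_on_case", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "file": {
      "properties": {
        "content": { "type": "string", "analyzer": "code" }
      }
    }
  }
}'

curl -XPOST 'localhost:9200/code/_search' -d '{
  "query": {
    "match_phrase": {
      "content": { "query": "children next", "slop": 10 }
    }
  }
}'

"ChildrenNext" gets indexed as the tokens "children" and "next", so the sloppy
phrase finds it without any leading-wildcard scan over the terms dictionary.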

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZUS40rsAjmzrL_YK6yjgjZRumeQKFVPhVu9bUcW4nN_KA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Stemmer token filter result is different than it should be

2014-05-20 Thread Robert Muir
When you use the "french" analyzer, it uses the Lucene FrenchAnalyzer
behind the scenes, which does not use the Snowball algorithm.

It uses the Savoy stemmer, the same as specifying the "light_french"
stemmer:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html
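
A quick way to see the difference (a throwaway sketch; the index, filter, and
analyzer names are invented) is to define both stemmers side by side and
compare them with the _analyze API:

curl -XPUT 'localhost:9200/stemtest' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "fr_light":    { "type": "stemmer", "name": "light_french" },
        "fr_snowball": { "type": "stemmer", "name": "french" }
      },
      "analyzer": {
        "light_fr":    { "tokenizer": "standard", "filter": ["lowercase", "fr_light"] },
        "snowball_fr": { "tokenizer": "standard", "filter": ["lowercase", "fr_snowball"] }
      }
    }
  }
}'

curl -XGET 'localhost:9200/stemtest/_analyze?analyzer=light_fr' -d 'maintenaient'
curl -XGET 'localhost:9200/stemtest/_analyze?analyzer=snowball_fr' -d 'maintenaient'

The first (the Savoy "light" stemmer) leaves "maintenaient" as is, while the
second (Snowball) reduces it to "mainten".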

On Tue, May 20, 2014 at 8:03 AM,   wrote:
> If I try to analyze the following text
>
> GET _analyze?analyzer=french&text=Ils maintenaient la machine
>
> It results 2 tokens: "maintenaient", "machin".
>
> Elasticsearch applies more options to the default snowball stemming algorithm.
> Without these options, the result for the first token should be "mainten"
> (as documented at
> http://snowball.tartarus.org/algorithms/french/stemmer.html).
>
> What are these additional options that elasticsearch add to the standard
> snowball analyzer?
>
> TIA
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/0522456e-f7c2-4f6b-907c-d4ee9f53b6b9%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZWK5PyE%3DsrOr4SwMe91Ug%3D3BzVNOZkN5z8O-QY8DZHhRg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Removing unused fields (more Lucene than ES but..)

2014-04-03 Thread Robert Muir
Thank you Paul, I added some comments just so the technical challenges
and risks are clear.

It's unfortunately not so easy to fix...

On Thu, Apr 3, 2014 at 7:49 PM, Paul Smith  wrote:
> Thanks for the JIRA link Robert, I've added a comment to it just to share
> the real world aspect of what happened to us for background.
>
>
> On 1 April 2014 18:29, Robert Muir  wrote:
>>
>> On Tue, Apr 1, 2014 at 2:41 AM, Paul Smith  wrote:
>> >
>> > Thanks Robert for the reply, all of that sounds fairly hairy.  I did try
>> > a
>> > full optimize of the shard index using Luke, but the residual
>> > über-segment
>> > still has the field definitions in it.  Are you saying in (1) that the
>> > creation
>> > of a new shard index through a custom call to IndexWriter.addIndexes(..)
>> > would produce a _fully_ optimized index without the fields, and that this is
>> > different from what an Optimize operation through ES would call? More a
>> > technical question now on what the difference is between the Optimize
>> > call
>> > and a manual create-new-index-from-multiple-readers.  (I actually thought
>> > that's what the Optimize does in practical terms, but there's obviously
>> > more
>> > or less going on under the hood under these different code paths).
>> >
>> > We're going the reindex route for now, was just hoping there was some
>> > special trick we could do a little easier than the above. :)
>> >
>>
>> Optimize and normal merging don't "garbage collect" unused fields from
>> fieldinfos:
>>
>> https://issues.apache.org/jira/browse/LUCENE-1761
>>
>> The addIndexes trick is also a forced merge, but it decorates the
>> readers to be merged, lying about and hiding the fields as if they
>> don't exist.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/CAMUKNZW2-FEjA6CChSR3%2Br0GQYAfJ9ZOOhyU565V79QMTrPFWQ%40mail.gmail.com.
>>
>> For more options, visit https://groups.google.com/d/optout.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAHfYWB5dHJ70cxzZuZ21gdeQwN1ckZ40Yu4K%2BmJYexK-i01AVQ%40mail.gmail.com.
>
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZW6bH_M5qcf0qnRUKxvm%3DVjMdLO7RAxCbeWsKMN%3DjDrqA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: ElasticSearch standard Analyzer - problematic case

2014-04-03 Thread Robert Muir
The standard analyzer doesn't really know anything about emails/URLs;
it's just implementing the Unicode tokenization rules.

There is an extension of it that does know about these things (and
tries to keep them as one token)...

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-uaxurlemail-tokenizer.html

Maybe try this one and see if it works better for you?
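
For a quick comparison (a sketch only; the email address here is made up),
the _analyze API shows the difference directly:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty=true' -d 'someone@example.com:test1234'

curl -XGET 'localhost:9200/_analyze?tokenizer=uax_url_email&pretty=true' -d 'someone@example.com:test1234'

The first splits the address apart, while the second keeps
someone@example.com together as a single token of type <EMAIL>.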


On Thu, Apr 3, 2014 at 4:52 AM, Igor Romanov  wrote:
> Hi
>
> I was analyzing some weird analyzer behaviour, and trying to understand why it
> happens and how to fix it.
>
> Here are the tokens I get from the standard analyzer for the text:
> "myem...@email.com:test1234"
>
> curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d
> 'myem...@email.com:test1234'
> {
>   "tokens" : [ {
> "token" : "myemail",
> "start_offset" : 0,
> "end_offset" : 7,
> "type" : "",
> "position" : 1
>   }, {
> "token" : "email.com:test1234",
> "start_offset" : 8,
> "end_offset" : 26,
> "type" : "",
> "position" : 2
>   } ]
> }
>
>
> So the question is why I am getting that as one token: "email.com:test1234"
>
> Why is it not divided into tokens by . and : ?
>
> and what analyzer/tokenizer/filter can I use that can help with it?
>
> Thanks,
> Igor
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/826eb584-3408-404a-b87c-2c44e455bb65%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZWGsks9O5Y5qupAovgn6Vwa3EwVKju9WOeSmW3dQ-hPTA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: "Locale" parameter in query_string query

2014-04-01 Thread Robert Muir
This controls the behavior of the string conversions triggered by
lowercase_expanded_terms.

For example, Turkish and Azeri have different casing characteristics:
http://en.wikipedia.org/wiki/Dotted_and_dotless_I
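
For example (a sketch; the index, field, and query text are invented), a
wildcard query over Turkish data where the dotless I matters:

curl -XPOST 'localhost:9200/test/_search' -d '{
  "query": {
    "query_string": {
      "query": "DİYARBAKIR*",
      "default_field": "city",
      "lowercase_expanded_terms": true,
      "locale": "tr"
    }
  }
}'

With "locale": "tr", lowercasing the expanded wildcard term maps I to ı and
İ to i, lining up with what a Turkish-aware analyzer produced at index time;
with the default locale the expanded term would not match the indexed terms.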

On Tue, Apr 1, 2014 at 2:43 AM, Prashant Agrawal
 wrote:
> Any updates on the above query?
>
>
>
> --
> View this message in context: 
> http://elasticsearch-users.115913.n3.nabble.com/Locale-parameter-in-query-string-query-tp4052983p4053213.html
> Sent from the ElasticSearch Users mailing list archive at Nabble.com.
>
> --
> You received this message because you are subscribed to the Google Groups 
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/elasticsearch/1396334600143-4053213.post%40n3.nabble.com.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZXXSVqqazPWz7%3Deo376ybi1ZJPU42GLFs-5wiUC8B_Wkg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Removing unused fields (more Lucene than ES but..)

2014-04-01 Thread Robert Muir
On Tue, Apr 1, 2014 at 2:41 AM, Paul Smith  wrote:
>
> Thanks Robert for the reply, all of that sounds fairly hairy.  I did try a
> full optimize of the shard index using Luke, but the residual über-segment
> still has the field definitions in it.  Are you saying in (1) that the creation
> of a new shard index through a custom call to IndexWriter.addIndexes(..)
> would produce a _fully_ optimized index without the fields, and that this is
> different from what an Optimize operation through ES would call? More a
> technical question now on what the difference is between the Optimize call
> and a manual create-new-index-from-multiple-readers.  (I actually thought
> that's what the Optimize does in practical terms, but there's obviously more
> or less going on under the hood under these different code paths).
>
> We're going the reindex route for now, was just hoping there was some
> special trick we could do a little easier than the above. :)
>

Optimize and normal merging don't "garbage collect" unused fields from
fieldinfos:

https://issues.apache.org/jira/browse/LUCENE-1761

The addIndexes trick is also a forced merge, but it decorates the
readers to be merged, lying about and hiding the fields as if they
don't exist.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZW2-FEjA6CChSR3%2Br0GQYAfJ9ZOOhyU565V79QMTrPFWQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Removing unused fields (more Lucene than ES but..)

2014-03-31 Thread Robert Muir
It is actually possible in Lucene 4, but there is no really
convenient setup for doing this.

You have two choices there:
1. Trigger a massive merge (essentially an optimize) by wrapping all
readers and calling IndexWriter.addIndexes(Reader...).
2. Wrap readers in a custom merge policy and do it slowly over time.

In both cases you'd use something like
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/index/FieldFilterAtomicReader.java

For Lucene 3 this would be more complicated; I don't think it's
impossible, but unfortunately there is no available code for this case.


On Mon, Mar 31, 2014 at 11:37 PM, Paul Smith  wrote:
> ok, this is more low level Lucene, but in the context of an ElasticSearch
> cluster, is there any way to get an index/shard to optimize away a bunch of
> fields that are no longer used (literally have no term values associated
> with it.
>
> We had an application bug introduced that polluted an index with a very
> large number of fields (25,000 fields... *cough*) , and lets just say things
> weren't well after that.
>
> we've deleted all the rogue records, but the shards still contain the raw
> Lucene Field information (we've inspected these with Luke) and the cluster
> is heavily CPU bound processing "refreshVersionTable" calls that is in a
> large loop a function of the number of fields in the segments.
>
> We've attempted a test optimize of the index using Luke on a single shard,
> but the residual segments post-optimize still contain a large number of
> these fields, all with no values associated with them.
>
> Obviously a reindex would do this, but if there's any other bright ideas
> that are quicker than that (45 million item index we're trying to keep up)
> would be most welcome!
>
> We're on ES 0.19.10 still (lucene 3.6.1).  (you can tell me "upgrade"
> another day please..)
>
> Here's a snapshot picture from the Luke on a single shard from this index.
>
> cheers!
>
> Paul Smith
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAHfYWB5nO%3DDQ50SQ4kgde6JvT%3DgjQ_7FmLbVcXVk5Kiurwme%2Bg%40mail.gmail.com.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZXZNf2y7AXsJFJg7hBOyJmEW%2BOvcNZse1JfQx0XcFyynA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Indexing performance with doc values (particularly with larger number of fields)

2014-03-23 Thread Robert Muir
It would be a nice benchmark to run (and if you find hotspots/slow things,
go improve them in Lucene...)!

The data structures for doc values are less complex than the data
structures for the inverted index.

I've enabled doc values for many fields, as you suggest, in the past, and
in my tests the time for e.g. segment merging was still dominated by
the inverted index (terms dict, postings lists, etc.), as I had all the
fields indexed for search, too. But nothing is free: some of this
stuff is data-dependent, so you have to test.

About the heap, you are right: it's probably best to adjust your heap
accordingly if you are using doc values.
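
For what it's worth, enabling them per field looks like this in a 1.x mapping
(a sketch; the index, type, and field names are invented):

curl -XPUT 'localhost:9200/test' -d '{
  "mappings": {
    "event": {
      "properties": {
        "status":   { "type": "string", "index": "not_analyzed", "doc_values": true },
        "duration": { "type": "long", "doc_values": true }
      }
    }
  }
}'

That way you can turn doc values on only for the fields you actually expect
to sort or aggregate on, and measure the indexing/merge cost field by field.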


On Sun, Mar 23, 2014 at 10:01 PM, Alex at Ikanow  wrote:
> This might be more of a Lucene question, but a quick google didn't throw up
> anything.
>
> Has anyone done/seen any benchmarking on indexing performance (overhead) due
> to using doc values?
>
> I often index quite large JSON objects, with many fields (eg 50), I'm trying
> to get a feel for whether I can just let all of them be doc values on the
> off chance I'll want to aggregate over them, or whether I need to pick
> beforehand which fields will support aggregation.
>
> (A related question: presumably allowing a mix of doc values fields and
> "legacy" fields is a bad idea, because if you use doc values fields you want
> a low max heap so that the file cache has lots of memory available, whereas
> if you use the field cache you need a large heap - is that about right, or
> am i missing something?)
>
> Thanks for any insight!
>
> Alex
> Ikanow
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/0361eda4-ab39-4536-b91a-ccb710921edd%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZVGxGcM_QrFHEXsaa%3DQcH_Er_h1s4LgBQDE0kU7c%2Bi2JQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Fast vector highlighter does not work with explicit span_near queries

2014-03-22 Thread Robert Muir
On Sat, Mar 22, 2014 at 10:26 AM, Harry Waye  wrote:
> Thanks Robert, a useful caveat to add to the highlighter docs?

Yes, I think so.

>
> The reason I looked in to using the fvh was due to the plain highlighter not
> using the correct analyzer if the index analyzer was specified specified via
> a path in the indexed document (_analyzer: {path: ...}).  Looks like an easy
> fix though.
>
> I'll add issues for both.

Thanks!

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZVJjvinZ8-bLNU%2B1T07LQ_u3HXidg0UFB2vCf954E5_og%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Fast vector highlighter does not work with explicit span_near queries

2014-03-22 Thread Robert Muir
FVH definitely doesn't recognize span-near queries. In general, when
it comes to the SpanQuery family, the plain highlighter will work
better because it has explicit support for those queries.

Maybe you want to open an issue? It's not obvious how to fix, though,
because span-near queries can nest arbitrarily, yet the FVH's
design needs to "flatten" the query to a simple list of queries and
phrases.

In the meantime, I'd recommend the plain highlighter.

On Fri, Mar 21, 2014 at 9:01 AM, Harry Waye  wrote:
> FYI this is ES 1.0.1
>
>
> On Friday, March 21, 2014 1:00:33 PM UTC, Harry Waye wrote:
>>
>> I'm trying to use fvh with span_near queries but it appears to be totally
>> broken.  Other query types work, even its query_string equivalent.  Is
>> there anything I am doing incorrectly here?  Or is there a work around that
>> I can employ in the meantime?  Below is a recreation:
>>
>> # Set up index with mappings
>> curl -XPOST localhost:9200/a -d '{
>>   "mappings": {
>> "document": {
>>   "properties": {
>> "text": {
>>   "type": "string",
>>   "term_vector": "with_positions_offsets"
>> }
>>   }
>> }
>>   }
>> }'
>>
>> # Put text to field with positions offsets
>> curl -XPOST localhost:9200/a/document/1 -d '{"text": "a b"}'
>>
>> # Query with fvh highlighter gives no highlight
>> curl -XPOST localhost:9200/a/document/_search -d '{
>>   "query": {
>> "span_near": {
>>   "slop": 0,
>>   "clauses": [{"span_term": {"text": "a"}}, {"span_term": {"text":
>> "b"}}]
>> }
>>   },
>>   "highlight": {"fields": {"text": {"type":"fvh"}}}
>> }'
>>
>> #
>> {"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.22145195,"hits":[{"_index":"a","_type":"document","_id":"1","_score":0.22145195,
>> "_source" : {"text": "a b"}}]}}
>>
>> # Query with plain
>> curl -XPOST localhost:9200/a/document/_search -d '{
>>   "query": {
>> "span_near": {
>>   "slop": 0,
>>   "clauses": [{"span_term": {"text": "a"}}, {"span_term": {"text":
>> "b"}}]
>> }
>>   },
>>   "highlight": {"fields": {"text": {"type":"plain"}}}
>> }'
>>
>> #
>> {"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.22145195,"hits":[{"_index":"a","_type":"document","_id":"1","_score":0.22145195,
>> "_source" : {"text": "a b"},"highlight":{"text":["a
>> b"]}}]}}
>>
>> curl -XPOST localhost:9200/a/document/_search -d '{
>>   "query": {
>> "query_string": {
>>   "query": "\"a b\"~0",
>>   "default_field": "text"
>> }
>>   },
>>   "highlight": {"fields": {"text": {"type":"fvh"}}}
>> }'
>>
>> #
>> {"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.38356602,"hits":[{"_index":"a","_type":"document","_id":"1","_score":0.38356602,
>> "_source" : {"text": "a b"},"highlight":{"text":["a b"]}}]}}
>>
>> # Try a match query
>> curl -XPOST localhost:9200/a/document/_search -d '{
>>   "query": {
>> "match": {
>>   "text": "a b"
>> }
>>   },
>>   "highlight": {"fields": {"text": {"type":"fvh"}}}
>> }'
>>
>> #
>> {"took":14,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.2712221,"hits":[{"_index":"a","_type":"document","_id":"1","_score":0.2712221,
>> "_source" : {"text": "a b"},"highlight":{"text":["a
>> b"]}}]}}
>>
>>
>>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/0b86427d-1033-493b-a874-0411f3b77ec4%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZXJQT%3DHvPON4hTY_QccQh2bxCDXabcAFHOc6wBaHFUHwg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Stop filter problem: enablePositionIncrements=false is not supported anymore as of Lucene 4.4 as it can create broken token streams

2013-12-18 Thread Robert Muir
The old disabling of position increments was bogus.
For example, a stop filter could remove a token and "move" a synonym
from one word to another.

So this option conflated two unrelated things: whether or not a "gap"
should be introduced when a word is removed, and whether any existing
positions (e.g. from synonyms) should be respected.

In my opinion (but I have not thought it over in a while; look at the
issue's age), it's possible to prevent the introduction of gaps while
still respecting existing ones:
https://issues.apache.org/jira/browse/LUCENE-4065
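
A small illustration of the gap that remains (just a sketch using the
_analyze API with built-in filters):

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&filters=lowercase,stop&pretty=true' -d 'the quick fox'

"quick" comes back at position 2 and "fox" at position 3: position 1 is the
hole left by the removed stopword "the", and that hole is what a following
shingle filter fills with its "_" filler tokens.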


On Wed, Dec 18, 2013 at 11:54 PM, Michael Cheremuhin  wrote:
> Hi Jondow,
>
> Is there any progress on the issue?
>
> On Wednesday, September 4, 2013 at 17:16:47 UTC+3, Jondow wrote:
>>
>> Hi Jörg,
>>
>> The problem is that the default is set to true, and with it set to true,
>> my shingle filter results include underscores because of the stop filter in
>> use, which I don't want. Traditionally the way to get rid of this was to set
>> enablePositionIncrements to false in the stop filter. This is no longer
>> possible, hence my predicament. :-(
>>
>> On Wednesday, 4 September 2013 14:49:44 UTC+2, Jörg Prante wrote:
>>>
>>> Drop enable_position_increments parameter or set it to true.
>>>
>>> In shingle filters, you should set min_shingle_size to 2.
>>>
>>> Jörg
>>>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/30411acd-1a9d-4332-a3bf-13e7249d91a8%40googlegroups.com.
>
> For more options, visit https://groups.google.com/groups/opt_out.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAOdYfZVVDM6MjjS5E%2Bx68B8PXOkBRsjeZuRE8831frcS6CR7Fw%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.