Re: Elasticsearch ingest performance

2015-04-23 Thread Michael McCandless
You can try the ideas here too: https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing Mike McCandless On Wed, Apr 22, 2015 at 8:00 PM, Kimbro Staken wrote: > Hello Brian, > > Many things will affect the rate of ingest, the biggest one is making sure > the load gets sprea

Re: refresh_interval:"10s" is better than refresh_interval:"-1"?

2015-04-15 Thread Michael McCandless
On Tue, Apr 14, 2015 at 7:36 AM, Hajime wrote: > Possibly it is IO bound but I don't see too much io wait on CPU or write > activity in iostat. By the way, uses ssd and xfs as file system and default > Directory ( I think it becomes MMapDirectory). > Local SSD (not e.g. Amazon's EBS backed by SSD

Re: refresh_interval:"10s" is better than refresh_interval:"-1"?

2015-04-13 Thread Michael McCandless
; Should I configure something like "*bulk.thread_pool*" size or > "indices.memory.max_shard_index_buffer_size" > ( > https://github.com/elastic/elasticsearch/blob/97559c0614d900a682d01afc241615cf5627fb4c/src/main/java/org/elasticsearch/indices/memory/IndexingMemoryControl

Re: refresh_interval:"10s" is better than refresh_interval:"-1"?

2015-04-13 Thread Michael McCandless
You should see better performance with -1 refresh_interval, because Lucene will flush larger, single segments, causing less merging pressure. Are both of your tests (-1 vs 10s) fully saturating CPU and/or IO on your nodes? If not, then that can explain it: when you have 10s refresh_interval, a se

Re: corrupted shard after optimize

2015-03-24 Thread Michael McCandless
Hmm, not good. Which version of ES? Do you have a full stack trace for the exception? To run CheckIndex you need to add all ES jars to the classpath. It's easiest to just use a wildcard for this, e.g.: java -cp "/path/to/es-install/lib/*" org.apache.lucene.index.CheckIndex ... Make sure you

Re: "now throttling indexing"

2015-03-13 Thread Michael McCandless
That is the right setting to disable store throttling, but even without throttling writes MB/sec for merges, the merges can still fall behind, leading to index throttling. ES does this to protect the health of the index because too many segments will cause all sorts of trouble. What IO system is

Re: Clear deleted docs

2015-03-13 Thread Michael McCandless
Note that only_expunge_deletes=true will only merge the segment away if it has > 10% deleted docs by default, otherwise it leaves the segment as is. See http://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-merge.html on how to change that 10% default. But it's almost always

Re: OOME When large segments merge

2015-03-12 Thread Michael McCandless
Do you have many fields with norms enabled? Mike McCandless On Thu, Mar 12, 2015 at 1:20 PM, Mark Greene wrote: > I've noticed periodically that data nodes in my cluster will run out of > heap space when large segments start merging. I attached a screenshot of > what ma

Re: Missing SegmentInfo files after upgrade question (Issue 7430)

2015-03-05 Thread Michael McCandless
ought. Unfortunately, I did minimal replication and the > other copy was wiped out due to disk failure. Is there a way to run that > index without the bad shard (4 out of 5 still good)? I'm gonna guess no. > > Thanks, > Kris. > > On Thu, Mar 5, 2015 at 11:23 AM, Michael M

Re: Missing SegmentInfo files after upgrade question (Issue 7430)

2015-03-05 Thread Michael McCandless
That one shard is likely hosed. But if you have a good replica of that shard then you may be able to delete the hosed shard and let ES recover from the good one. Or restore from snapshot... Mike McCandless http://blog.mikemccandless.com On Thu, Mar 5, 2015 at 2:13 PM, krispyjala wrote: > Hey all,

Re: elasticsearch Index throttling info message comes in es 1.3.1 version

2015-03-05 Thread Michael McCandless
This means Lucene's segment merges can't keep up. Try increasing or disabling the store level IO throttling: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-store.html Mike McCandless http://blog.mikemccandless.com On Thu, Mar 5, 2015 at 5:53 AM, shanmuthu83

Re: Decreasing Heap Size Results in Better TPS, How can this happen??

2015-02-18 Thread Michael McCandless
Smaller JVM heap means more free RAM for the OS to cache hot pages from your index ... in general you should only give the JVM as much as it needs (will ever need) and a bit more for safety, and give the rest to the OS so it can put hot parts of your index in RAM. Mike McCandless http://blog.mike

Re: Read past EOF exception on .tis and .fdt file

2015-02-18 Thread Michael McCandless
ES has the index.shard.check_on_startup to run CheckIndex on startup of a shard: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html Mike McCandless http://blog.mikemccandless.com On Wed, Feb 18, 2015 at 1:17 PM, Jilles van Gurp wrote: > plus 1 for a less i

Re: Can't get Documents Deleted below 40% - performance issues - help needed

2015-02-01 Thread Michael McCandless
It's normal to see 40-60% deleted docs if you frequently update existing documents. See this recent blog post I wrote for some details: http://www.elasticsearch.org/blog/lucenes-handling-of-deleted-documents/ Mike McCandless http://blog.mikemccandless.com On Sun, Feb 1, 2015 at 3:50 PM, Mark Wa

Re: Confusing results from fuzzy query (1 term, 1 field)

2015-01-27 Thread Michael McCandless
Looks like this was answered on StackOverflow? Mike McCandless http://blog.mikemccandless.com On Mon, Jan 26, 2015 at 7:54 PM, Steve Pearlman wrote: > For a well formatted example, please see: > http://stackoverflow.com/questions/28161480/fuzzy-not-functioning-as-expected-one-term-search-see-e

Re: Better understanding Lucene/Shard overheads

2015-01-24 Thread Michael McCandless
On Fri, Jan 23, 2015 at 8:42 PM, Drew Kutcharian wrote: > Thanks Mike. I’m still a bit unclear on these comments: > > IndexReader requires some RAM for each segment to hold structures like > live docs, terms index, index data structures for doc values fields, and > holds open a number of file des

Re: Better understanding Lucene/Shard overheads

2015-01-23 Thread Michael McCandless
There is definitely a non-trivial per-index cost. From Lucene's standpoint, ES holds an IndexReader (for searching) and IndexWriter (for indexing) open. IndexReader requires some RAM for each segment to hold structures like live docs, terms index, index data structures for doc values fields, and

Re: scrolling and lucene segments

2015-01-16 Thread Michael McCandless
The segments are effectively ref counted, so once the last scroll still using an old (already merged away) segment is deleted, it will be removed. Mike McCandless http://blog.mikemccandless.com On Fri, Jan 16, 2015 at 4:15 AM, Jason Wee wrote: > > http://www.elasticsearch.org/guide/en/elastics

Re: Migrating lucene drill sideways query to elasticsearch

2015-01-16 Thread Michael McCandless
I think you must do separate filters to compute the sideways facet counts. Mike McCandless http://blog.mikemccandless.com On Fri, Jan 16, 2015 at 10:15 AM, Bo Finnerup Madsen wrote: > Hi, > > I am trying to migrate a project from Lucene to elasticsearch, and for the > most part it is a pleasur

Re: Postings highlighter throws exception on some queries

2015-01-14 Thread Michael McCandless
Super, thanks for bringing closure. Mike McCandless http://blog.mikemccandless.com On Wed, Jan 14, 2015 at 9:59 AM, Peter Bowyer wrote: > Hi Michael, > > Thanks for your response - it turned out to be user error. I'd set up the > mappings correctly, but a few records in my bulk import file were

Re: Postings highlighter throws exception on some queries

2015-01-13 Thread Michael McCandless
This looks like a bug: clearly from your mappings, field "content" was indexed with offsets, yet the error message (incorrectly) claims otherwise. Does the bug still happen on the last 1.4.x release (1.4.2)? If you search only on the content/content.english field does the error still happen? (i.

Re: Guaranteed upper bound for near real time search

2015-01-02 Thread Michael McCandless
The 1s refresh_interval means that ES will open (takes some time) and warm (takes some more time) a new NRT reader, and after that reader is done opening, 1s later it will open again. So it's possible in your case it takes 2s to open + warm a new NRT reader (check the node's logs). But 2s is quit

Re: Write amplification and SSD

2014-12-16 Thread Michael McCandless
It means that ES works well with SSDs since Lucene is write-once under the hood, so it is "easy" on the SSDs, vs other approaches which do random writes to different places causing the higher write amplification. But, this is balanced with the fact that Lucene must also periodically merge the segm

Re: slow performance on phrase queries in should clause

2014-12-05 Thread Michael McCandless
It's likely the should is (stupidly) being fully expanded before being AND'd with the must ... but there are improvements here (XBooleanFilter.java) to this in master, are you able to test and see if it's still slow? Mike McCandless http://blog.mikemccandless.com 2014-12-04 19:21 GMT-05:00 Kiree

Re: Good merge settings for interactively maintained index

2014-12-04 Thread Michael McCandless
5:30 AM, Michael McCandless wrote: > 25-40% is definitely "normal" for an index where many docs are being > replaced; I've seen this go up to ~65% before large merges bring it back > down. > > On 2) there may be some improvements we can make to Lucene default > Tiere

Re: Good merge settings for interactively maintained index

2014-12-04 Thread Michael McCandless
25-40% is definitely "normal" for an index where many docs are being replaced; I've seen this go up to ~65% before large merges bring it back down. On 2) there may be some improvements we can make to Lucene default TieredMergePolicy here, to reclaim deletes for the "too large" segments ... I'll ha

Re: Odd behavior of bulk loading speed - good riddle?

2014-11-24 Thread Michael McCandless
Which version of ES? This is probably not related to the slowdown, but when using scripts for updating docs, it's best to keep the script constant, and use params for the changing values (all the $vars in your PHP script). This means ES will compile the script once and reuse that, vs paying compi
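The advice above (keep the script text constant and pass the changing values as params) can be sketched as follows. This is a minimal illustration, not the poster's code: the script body, the `counter` field, the index and type names, and the exact layout of the bulk update lines are all assumptions based on 1.x-era documentation and should be checked against the version in use.

```python
import json

# The script text is identical for every request, so it only needs to be
# compiled once; only the params differ per document.
# The script body and the "counter" field are hypothetical.
SCRIPT = "ctx._source.counter += delta"


def bulk_update_lines(index, doc_type, updates):
    """Build a newline-delimited bulk body of scripted updates.

    `updates` is an iterable of (doc_id, delta) pairs. The action/payload
    layout is an assumption about the 1.x bulk format.
    """
    lines = []
    for doc_id, delta in updates:
        action = {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}
        payload = {"script": SCRIPT, "params": {"delta": delta}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(payload))
    return "\n".join(lines) + "\n"


body = bulk_update_lines("logs", "event", [("1", 3), ("2", 7)])
print(body)
```

The contrast is with interpolating each value into the script string itself, which produces a different script per document and forces a compile each time.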

Re: How can we get all ID's which are generated by Elastic Search for each record while using bulk insert ?

2014-11-17 Thread Michael McCandless
The bulk response tells you the id assigned to each indexed doc. See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html Mike McCandless http://blog.mikemccandless.com On Mon, Nov 17, 2014 at 1:56 PM, Subbarao Kondragunta < subbu2perso...@gmail.com> wrote: > Ho

Re: _all field getting corrupted, no mapping changes possible anymore

2014-11-07 Thread Michael McCandless
On Fri, Nov 7, 2014 at 6:41 AM, wrote: > thank you for your fast reply. > > this actually happened with 1.4.0Beta1 so I am not sure if it's the same > issue. > Sorry, what I mean is that this issue, which adds checking for mapping conflicts in the _all field and was fixed in 1.4.0Beta1, causes t

Re: TokenStream contract violation: close() call missing

2014-11-07 Thread Michael McCandless
On Thu, Nov 6, 2014 at 3:05 PM, Richard Tier wrote: > Thanks for reply. > > The autocomplete analyzer: > > { > 'analysis': { > 'analyzer': { > 'autocomplete': { >'type': 'custom', >'tokenizer': 'standard', >'filter'

Re: _all field getting corrupted, no mapping changes possible anymore

2014-11-07 Thread Michael McCandless
Hmm this is likely due to https://github.com/elasticsearch/elasticsearch/pull/7377 (fixed in 1.4.0Beta1) which was done to prevent conflicting mapping changes to the _all field. What change are you trying to make, that hits this error? Is there a stack trace? Mike McCandless http://blog.mikem

Re: TokenStream contract violation: close() call missing

2014-11-06 Thread Michael McCandless
Hmm, not good. What does your "autocomplete" analyzer look like? Can you post the full stack trace? Mike McCandless http://blog.mikemccandless.com On Wed, Nov 5, 2014 at 7:05 PM, Richard Tier wrote: > An internal error happens when I do a "suggest" query. I get "TokenStream > contract violat

Re: refresh thread consumes CPU resource when changing refresh_interval to -1

2014-10-15 Thread Michael McCandless
OK this will be fixed in the next ES release: https://github.com/elasticsearch/elasticsearch/pull/8087 Thank you for reporting this. Mike McCandless http://blog.mikemccandless.com On Wed, Oct 15, 2014 at 2:40 AM, Shinsuke Sugaya wrote: > Hi, > > I encountered a problem in InternalIndexShard#E

Re: refresh thread consumes CPU resource when changing refresh_interval to -1

2014-10-15 Thread Michael McCandless
I agree, this looks like a bug. I'll open an issue ... thank you for reporting! Mike McCandless http://blog.mikemccandless.com On Wed, Oct 15, 2014 at 2:40 AM, Shinsuke Sugaya wrote: > Hi, > > I encountered a problem in InternalIndexShard#EngineRefresher. > The problem is, 1 core consumed 100

Re: Indexing is becoming slow, what to look for?

2014-10-02 Thread Michael McCandless
On Tue, Sep 9, 2014 at 6:55 AM, Thomas wrote: > By setting this parameter, some additional questions of mine have been > generated: > > By setting indices.memory.index_buffer_size to a specific node and not to > all nodes of the cluster, will this configuration be taken into account > from all no

Re: Indexing is being throttled

2014-09-18 Thread Michael McCandless
Try disabling merge IO throttling, especially if your index is on SSD/s. (It's on by default at a paltry 20 MB/sec). Merge IO throttling causes merges to run slowly which eventually causes them to back up enough to the point where indexing must be throttled... Also see the recent post about tuni
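As a rough sketch of what disabling merge IO throttling could look like, the helper below builds a cluster settings body. The setting name `indices.store.throttle.type` and the value `none` are not given in this thread; they are an assumption recalled from 1.x-era documentation, and the helper name is hypothetical.

```python
import json


def store_throttle_body(throttle_type="none", persistent=False):
    """Build a cluster-settings update body for store-level IO throttling.

    The setting name is an assumption (1.x-era docs); verify it against the
    documentation for the ES version actually deployed.
    """
    scope = "persistent" if persistent else "transient"
    return json.dumps({scope: {"indices.store.throttle.type": throttle_type}})


# Body intended for something like: PUT /_cluster/settings
print(store_throttle_body())
```

A transient setting is used by default here so that the change does not survive a full cluster restart, which seems the safer choice while experimenting.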

Re: Purge the deleted documents on disk

2014-09-15 Thread Michael McCandless
By default Lucene/ES will only merge away the segment if it has "enough" deletes, where "enough" defaults to 10% of the segment. The setting is index.merge.policy.expunge_deletes_allowed ... so you can change that if you want to. However I would strongly advise not worrying about this: merging is
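The threshold described above can be illustrated with a small calculation. This is only a sketch of the rule as stated in the message (a segment qualifies when its deleted-doc percentage exceeds the allowed value, 10% by default); the function name, the segment tuples, and the strict "greater than" comparison are assumptions for illustration, not ES or Lucene code.

```python
def exceeds_expunge_threshold(max_doc, deleted_docs, allowed_pct=10.0):
    """Return True if a segment has more than `allowed_pct` percent deletes.

    `max_doc` counts all docs in the segment, including deleted ones.
    `allowed_pct` stands in for index.merge.policy.expunge_deletes_allowed.
    """
    if max_doc <= 0:
        return False
    pct_deleted = 100.0 * deleted_docs / max_doc
    return pct_deleted > allowed_pct


# Hypothetical segments: (name, max_doc, deleted_docs)
segments = [("seg_a", 1000, 50), ("seg_b", 1000, 100), ("seg_c", 1000, 250)]
eligible = [name for name, max_doc, deleted in segments
            if exceeds_expunge_threshold(max_doc, deleted)]
print(eligible)
```

Under these assumptions a segment sitting at exactly 10% deletes is left alone, and only segments above the threshold are candidates.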

Re: Indexing is becoming slow, what to look for?

2014-09-05 Thread Michael McCandless
Maybe index throttling is happening (ES would say so in the logs) because your merging is falling behind? Do you throttle IO for merges (it's throttled at a paltry 20 MB/sec by default)? What does hot threads report? How about top/iostat? We just got a blog post out about improving indexing thr

Re: Reduce Number of Segments

2014-08-28 Thread Michael McCandless
On Thu, Aug 28, 2014 at 3:25 PM, Chris Decker wrote: > Mike, > > :) > > I upgraded to 1.3.2 yesterday mid-afternoon. So far things feel much > snappier, but I wiped my ‘data’ directory so ES has less to search (though > most of my queries only go back 1 day anyways; I go back 3 days on Monday’s

Re: Reduce Number of Segments

2014-08-28 Thread Michael McCandless
elp the situation, but I want to make sure > I’m taking full advantage of my resources. > > > > Thanks, > Chris > > > From: Michael McCandless > Reply: elasticsearch@googlegroups.com > > > Date: August 26, 2014 at 4:27:31 PM > To: elasticsearch@goog

Re: indices.memory.index_buffer_size

2014-08-26 Thread Michael McCandless
See also https://github.com/elasticsearch/elasticsearch/pull/7440 (will be in 1.4.0) which returns the actual RAM buffer size assigned to that shard by the "little dance". Mike McCandless http://blog.mikemccandless.com On Tue, Aug 26, 2014 at 2:15 PM, Nikolas Everett wrote: > I just looked at

Re: Reduce Number of Segments

2014-08-26 Thread Michael McCandless
: > Mike, > > Thanks for the response. > > I'm running ES 1.2.1. It appears the issue that you reported / corrected > was included with ES 1.2.0. > > *Any other ideas / suggestions? *Were the settings that I posted sane? > > > Thanks!, > Chris > > On Mo

Re: Reduce Number of Segments

2014-08-25 Thread Michael McCandless
Which version of ES are you using? Versions before 1.2 have a bug that caused merge throttling to throttle far more than requested such that you couldn't get any faster than ~8 MB / sec. See https://github.com/elasticsearch/elasticsearch/issues/6018 Tiered merge policy is best. Mike McCandless

Re: Optimization Questions

2014-08-19 Thread Michael McCandless
You could turn on TRACE logging for the "lucene.iw" component. This will give tons of details about what merges are being done. Normally, if there are no writes going to the index at the same time, an optimize with max_num_segments=1 really should get down to 1 segment in the end ... not sure why

Re: bulk indexing - optimal refresh_interval

2014-07-29 Thread Michael McCandless
Disabling refresh (-1) is a good choice if you are fully maximizing your cluster's CPU/IO resources (using enough bulk client threads or async requests). In that case it should give faster indexing throughput than 30s refresh. But if you are not saturating the cluster's resources, then a refresh
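A minimal sketch of toggling refresh around a saturating bulk load is shown below. It only builds the settings bodies; the helper name, the placeholder index path in the comments, and the "1s" value restored afterwards are assumptions for illustration.

```python
import json


def refresh_settings_body(interval):
    """Build the JSON body for an index settings update of refresh_interval.

    `interval` is a string such as "-1" (refresh disabled) or "30s".
    """
    return json.dumps({"index": {"refresh_interval": interval}})


# Body to send (e.g. PUT /<index>/_settings) before the bulk load starts.
disable_refresh = refresh_settings_body("-1")

# Body to send once the load is done; "1s" is an assumed value to restore.
restore_refresh = refresh_settings_body("1s")

print(disable_refresh)
print(restore_refresh)
```

Whether "-1" actually wins over a periodic refresh depends, as the message says, on whether the bulk clients are saturating the cluster's CPU/IO.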

Re: Garbage collection pauses causing cluster to get unresponsive

2014-07-18 Thread Michael McCandless
On Fri, Jul 18, 2014 at 9:26 AM, Srinath C wrote: > Yes Michael, the instance store SSD are faring much better than the EBS > ones. > In your EBS tests, were those SSDs attached via EBS? Or magnetic? > There are 7-9 clients each using one bulk processor with concurrent > requests of 4 each. D

Re: Garbage collection pauses causing cluster to get unresponsive

2014-07-18 Thread Michael McCandless
ound 60K docs per second. A lot of EsRejectedExecutionExceptions were >> seen. >> >>Also attaching the iostat output for these instances. >> >> Regards, >> Srinath. >> >> >> >> >> On Wed, Jul 16, 2014 at 3:34 PM, joergpra...@

Re: No efect refresh_interval

2014-07-17 Thread Michael McCandless
p10 => $data[10], >> p11 => $data[11] >> }}); >> >> } >> close($DATA); >> $bulk->flush; >> >> Setting refresh_interval to 600s in both cases has no effect. Data are >> availa

Re: No efect refresh_interval

2014-07-16 Thread Michael McCandless
Which ES version are you using? You should use the latest (soon to be 1.3): there have been a number of bulk-indexing improvements recently. Are you using the bulk API with multiple/async client threads? Are you saturating either CPU or IO in your cluster (so that the test is really a full clust

Re: Garbage collection pauses causing cluster to get unresponsive

2014-07-16 Thread Michael McCandless
Michael and all. Really appreciate you help. >> I'll try out as per your suggestions and run the tests. Will post back on >> my progress. >> >> >> >> On Tue, Jul 15, 2014 at 3:17 PM, Michael McCandless < >> m...@elasticsearch.com> wrote: >>

Re: architecture and performance question on searching small subsets of documents

2014-07-16 Thread Michael McCandless
Try the filter approach first and only if performance isn't good enough, look into other approaches. Lucene is quite fast at intersecting filters with large postings lists these days... Separate index per user is not only wasteful, because of the duplicated content, but will consume substantially

Re: excessive merging/small segment sizes

2014-07-15 Thread Michael McCandless
On Sun, Jul 13, 2014 at 5:32 AM, Michael McCandless wrote: For the index close, I didn't issue any command, elasticsearch seemed to >> do that on its own. The code is in IndexingMemoryController. The triggering >> event seems to be the ram buffer size change, this t

Re: Garbage collection pauses causing cluster to get unresponsive

2014-07-15 Thread Michael McCandless
First off, upgrade ES to the latest (1.2.2) release; there have been a number of bulk indexing improvements since 1.1. Second, disable merge IO throttling. Third, use the default settings, but increase index.refresh_interval to perhaps 5s, and set index.translog.flush_threshold_ops to maybe 5

Re: Keep the number of segments to 5

2014-07-14 Thread Michael McCandless
Also, optimize is an incredibly costly (CPU, IO) operation. Really, it should only be done when you know the index will no longer change, e.g. when the daily log index is done being written. Mike McCandless http://blog.mikemccandless.com On Sun, Jul 13, 2014 at 9:26 AM, Itamar Syn-Hershko wro

Re: excessive merging/small segment sizes

2014-07-14 Thread Michael McCandless
On Mon, Jul 14, 2014 at 12:06 AM, Kireet Reddy wrote: > We did the test with ES still running and indexing data, ES still > running/not indexing, and ES stopped. All three showed the poor i/o rate. > Then after a few minutes, the copy i/o rate somehow increased again. It was > really strange. We

Re: excessive merging/small segment sizes

2014-07-13 Thread Michael McCandless
On Fri, Jul 11, 2014 at 7:35 PM, Kireet Reddy wrote: > The problem reappeared. We did some tests today around copying a large > file on the nodes to test i/o throughput. On the loaded node, the copy was > really slow, maybe 30x slower. So it seems your suspicion around something > external interf

Re: Reading and writing the same document too fast --> data loss

2014-07-13 Thread Michael McCandless
You don't need to add your own external versions; just use ES's internal versions (starts at 1 when you create the doc, and increments each time it's updated). You know the correct version because you retrieved the current doc first from ES, which returns its current version. Then you make your c
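The read-then-write-with-version pattern described above can be sketched with an in-memory stand-in. Nothing here talks to ES: `InMemoryStore`, `VersionConflict`, and `update_with_retry` are hypothetical names, and the store only mimics the behaviour stated in the message (versions start at 1 and increment on each update; a write carrying a stale version is rejected).

```python
class VersionConflict(Exception):
    """Raised when a write carries a version that is no longer current."""


class InMemoryStore:
    """Toy stand-in for a versioned document store (not an ES client)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        current = self._docs.get(doc_id)
        current_version = current[0] if current else 0
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                "expected %s, current %s" % (expected_version, current_version))
        new_version = current_version + 1  # 1 on create, then increments
        self._docs[doc_id] = (new_version, dict(source))
        return new_version


def update_with_retry(store, doc_id, change, max_attempts=5):
    """Re-read, re-apply `change`, and retry whenever the version is stale."""
    for _ in range(max_attempts):
        version, source = store.get(doc_id)
        updated = change(dict(source))
        try:
            return store.index(doc_id, updated, expected_version=version)
        except VersionConflict:
            continue
    raise VersionConflict("gave up after %d attempts" % max_attempts)


store = InMemoryStore()
store.index("doc1", {"count": 0})
update_with_retry(store, "doc1", lambda d: {**d, "count": d["count"] + 1})
print(store.get("doc1"))
```

The point of the retry loop is that a conflicting writer never silently overwrites another writer's change; it re-reads the latest document and applies its change on top.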

Re: Reading and writing the same document too fast --> data loss

2014-07-11 Thread Michael McCandless
Maybe you need to use versioning, to ensure the 3rd write doesn't undo (overwrite) the changes of the 2nd write? See http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/optimistic-concurrency-control.html Mike McCandless http://blog.mikemccandless.com On Fri, Jul 11, 2014 at 6:24

Re: excessive merging/small segment sizes

2014-07-10 Thread Michael McCandless
Another question: have you disabled merge throttling? And, which version of ES are you using? Mike McCandless http://blog.mikemccandless.com On Thu, Jul 10, 2014 at 5:49 AM, Michael McCandless wrote: > Indeed the hot threads on node5 didn't reveal anything unexpected: they >

Re: excessive merging/small segment sizes

2014-07-10 Thread Michael McCandless
wrote: > Sorry, here it is: > > https://www.dropbox.com/sh/3s6m0bhz4eshi6m/AAABnRXFCLrCne-GLG1zvQP3a > > Also a couple of graphs of the memory usage. > > > On Wednesday, July 9, 2014 2:10:49 PM UTC-7, Michael McCandless wrote: > >> Hmm link doesn't see

Re: excessive merging/small segment sizes

2014-07-09 Thread Michael McCandless
h? >> >> It seems strange to me that this would only happen on one node while we >> have replica set to at least 1 for all our indices. It seems like the >> problems should happen on a couple nodes simultaneously. >> >> --Kireet >> >> >> On Monday,

Re: excessive merging/small segment sizes

2014-07-07 Thread Michael McCandless
r documents can vary greatly in size, they average a couple KB >but can rarely be several MB. >2. we do use language analysis plugins, perhaps one of these is acting >up? >3. We eagerly load one field into the field data cache. But the cache > size is ok and the ove

Re: excessive merging/small segment sizes

2014-07-07 Thread Michael McCandless
machine with a 32GB heap and 96GB of >> memory with 4 spinning disks. >> >> node 5 log (normal) <https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip> >> node 6 log (high load) >> <https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip> >> >> On Sunday, July 6

Re: excessive merging/small segment sizes

2014-07-06 Thread Michael McCandless
ays ago, so the shards of each index are > balanced across the nodes. We have external metrics around document ingest > rate and there was no spike during this time period. > > > > Thanks > Kireet > > > On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote: >

Re: excessive merging/small segment sizes

2014-07-06 Thread Michael McCandless
It's perfectly normal/healthy for many small merges below the floor size to happen. I think you should first figure out why this node is different from the others? Are you sure it's merging CPU cost that's different? Mike McCandless http://blog.mikemccandless.com On Sat, Jul 5, 2014 at 9:51 P

Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-25 Thread Michael McCandless
Some responses below: On Tue, Jun 24, 2014 at 7:04 PM, Cindy Hsin wrote: > Looks like the memory usage increased a lot with 10k fields with these two > parameter disabled. > > Based on the experiment we have done, looks like ES have abnormal memory > usage and performance degradation when number

Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-23 Thread Michael McCandless
Hi Cindy, There isn't a hard limit on the number of fields Lucene supports; it's more that per-field there is highish heap used, added CPU/IO cost for merging, etc. It's just not a well tested usage of Lucene, not something the developers focus on optimizing, etc. Partitioning by _type won't chan

Re: Bulk inserting is slow

2014-06-23 Thread Michael McCandless
You'll actually get better indexing performance if you leave refresh enabled, maybe at 5s. This is because ES has a separate refresh thread which will do the flushing, instead of having your bulk indexing threads do it when RAM is full, effectively giving you one more thread of concurrency. Mike McCa

Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-21 Thread Michael McCandless
On Fri, Jun 20, 2014 at 8:00 PM, Cindy Hsin wrote: > Hi, Mike: > > Since both ES and Solr uses Lucene, do you know why we only see big ingest > performance degradation with ES but not Solr? > I'm not sure why: clearly something is slow with ES as you add more and more fields. I think it has to

Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-18 Thread Michael McCandless
On Wed, Jun 18, 2014 at 2:38 AM, Maco Ma wrote: > I tried your script with setting iwc.setRAMBufferSizeMB(4)/ and 48G > heap size. The speed can be around 430 docs/sec before the first flush and > the final speed is 350 docs/sec. Not sure what configuration Solr uses and > its ingestion speed

Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-17 Thread Michael McCandless
I tested roughly your Scenario 2 (100K unique fields, 100 fields per document) with a straight Lucene test (attached, but not sure if the list strips attachments). Net/net I see ~100 docs/sec with one thread ... which is very slow. Lucene stores quite a lot for each unique indexed field name and

Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-17 Thread Michael McCandless
Hi, Could you post the scripts you linked to (new_ES_config.sh, new_ES_ingest_threads.pl, new_Solr_ingest_threads.pl) inlined? I can't download them from where you linked. Optimizing every 10 seconds or 10 minutes is really not a good idea in general, but I guess if you're doing the same with ES

Re: Elasticsearch/Lucene Delete space reuse? recovery?

2014-06-05 Thread Michael McCandless
The default merge policy in Lucene (TieredMergePolicy) has a bias towards segments with more deletes, so it is "trying" to merge those ones away. You can increase this bias by setting index.reclaim_deletes_weight (see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modu

Re: Unable to create mapping and settings using Java API

2014-05-02 Thread Michael McCandless
Hmm, I'm able to create an index and its mappings/settings with a single JSON request to http://localhost:9200/. What settings are you trying to set? Mike http://blog.mikemccandless.com On Thu, May 1, 2014 at 5:10 PM, Amit Soni wrote: > hello everyone - I have settings and mapping defined in

Re: Elasticsearch on java7u55 ?

2014-04-18 Thread Michael McCandless
1.7u55 should be safe for ElasticSearch; we just put out a blog post about this: http://www.elasticsearch.org/blog/java-1-7u55-safe-use-elasticsearch-lucene/ And I'll fix the nightly Lucene benchmarks to use u55 too! I should NOT have been using u40: it's not safe. Mike http://blog.mikemccand