Re: Improving Bulk Indexing

2014-02-04 Thread joergpra...@gmail.com
Yes, SSDs will improve overall performance considerably. Disk drives are the slowest part of the chain, so this will help. With no more low IOPS, the load on the CPU drops significantly (fewer I/O waits). More RAM will not help that much. In fact, more RAM will slow down persisting, it increases

Re: Improving Bulk Indexing

2014-02-04 Thread joergpra...@gmail.com
My use case is bibliographic data indexing of academic and public libraries. There are ~100m records from various sources that I regularly extract, transform into JSON-LD, and load into Elasticsearch. Some are files, some are fetched by JDBC. I have six 32-core servers in our place, organized in 2

Re: Improving Bulk Indexing

2014-02-04 Thread ZenMaster80
Good to know, I will keep this in mind, even though I will try to go for SSDs, as I have personally had great success with them in the past! When you say 10-12 MB/sec, is this with doc parsing/processing or just ES index time? For my humble test on a quad-core laptop, I am pushing 6 MB/sec with

Re: Improving Bulk Indexing

2014-02-04 Thread joergpra...@gmail.com
SSD is the best you can do for the persistence layer. I have such an ES 4xSSD RAID0 server at home, with a sustained write I/O rate of 800 MB/sec. The servers for my day job are some years old, from a time when a few TB of SSD cost a fortune. The higher the write rate and IOPS capacity of the drives are, the

Re: Improving Bulk Indexing

2014-02-03 Thread ZenMaster80
Jörg, just so I understand this: if I were to index 100 MB worth of data in total with chunk volumes of 5 MB each, this means I have to index 20 times. If I were to set the bulk size to 20 MB, I would have to index 5 times. This is a small data size; imagine I have millions of documents. Are you

Re: Improving Bulk Indexing

2014-02-03 Thread joergpra...@gmail.com
Not sure if I understand. If I had to index a pile of documents, say 15M, I would build bulk requests of 1000 documents, where each doc is ~1K on average, so I end up at ~1MB per request. I would not care about differing doc sizes, as they even out over the total amount. Then I send this bulk request over the wire.
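
A minimal sketch of the batching described above, assuming documents are plain Python dicts and that the request body follows the newline-delimited format of the Elasticsearch bulk API (one action line, then one source line per document). The function name `bulk_bodies`, the index name `books`, and the sample documents are made up for illustration. Older Elasticsearch releases may also expect a `_type` in the action line or in the URL. Sending the body over HTTP is left out on purpose.

```python
import json
from itertools import islice


def bulk_bodies(docs, index_name, docs_per_request=1000):
    """Yield one newline-delimited bulk request body per chunk of documents.

    docs_per_request=1000 mirrors the figure from the message above;
    with ~1 KB documents each body comes out at roughly 1 MB.
    """
    it = iter(docs)
    while True:
        chunk = list(islice(it, docs_per_request))
        if not chunk:
            return
        lines = []
        for doc in chunk:
            # Action line: tells the bulk endpoint to index the next line.
            lines.append(json.dumps({"index": {"_index": index_name}}))
            # Source line: the document itself, serialized on a single line.
            lines.append(json.dumps(doc))
        # The bulk format requires a trailing newline after the last line.
        yield "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data: 2500 small records.
    sample = ({"id": i, "title": "record %d" % i} for i in range(2500))
    for n, body in enumerate(bulk_bodies(sample, "books")):
        print("request", n, "size in bytes:", len(body.encode("utf-8")))
```

Chunking by document count rather than by byte size keeps the client simple. As noted above, differing document sizes even out over the total amount.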

Re: Improving Bulk Indexing

2014-02-03 Thread ZenMaster80
Thanks again for clarifying this, I think I understand it now. What I was referring to in my prior posts was the difference between setting 1000 documents vs 1 documents. I was thinking that a bigger chunk volume would produce fewer over-the-wire index requests, but I understand your

Re: Improving Bulk Indexing

2014-02-02 Thread joergpra...@gmail.com
What is this default JVM limit of 64 MB? Elasticsearch uses a 1 GB heap by default, not 64 MB. Maybe you have an extra JVM for your bulk client that uses 64 MB? That is far too little. Use a 4-6 GB heap if your machine allows it. Note, JVM 7 of OpenJDK/Oracle, which is recommended, uses 25% of your