The code looks okay, so it might just be the full volume that is in the way.
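
A quick way to confirm is to check the free space on the data partition and what
the node reports itself, for example (assuming the default packaged data
directory and the _cat API that ships with 1.x):

df -h /var/lib/elasticsearch
curl 'http://192.168.133.131:9200/_cat/allocation?v'

The second command should show disk used and available per node.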

Jörg

On Tue, Sep 9, 2014 at 8:44 PM, Joshua P <jpetersen...@gmail.com> wrote:

> This is the code I've been using to index:
>
> I'm going to try to fix the running-out-of-space issue and then try
> slimming down the settings. Thank you.
>
> import java.io.IOException;
> import java.util.List;
>
> import com.fasterxml.jackson.databind.ObjectMapper;
>
> import org.elasticsearch.action.bulk.BulkItemResponse;
> import org.elasticsearch.action.bulk.BulkProcessor;
> import org.elasticsearch.action.bulk.BulkRequest;
> import org.elasticsearch.action.bulk.BulkResponse;
> import org.elasticsearch.action.index.IndexRequest;
> import org.elasticsearch.client.Client;
> import org.elasticsearch.client.transport.TransportClient;
> import org.elasticsearch.common.settings.ImmutableSettings;
> import org.elasticsearch.common.settings.Settings;
> import org.elasticsearch.common.transport.InetSocketTransportAddress;
>
> // (logging imports and the imports for our own DBConnection,
> // PropertyGeneralInfoRow and geocode classes are omitted here)
>
> public class Indexer {
>
>     private static final Logger logger = LogManager.getLogger("ESBulkUploader");
>
>     public static void main(String[] args) throws IOException, NoSuchFieldException {
>
>         DBConnection dbConn = new DBConnection("");
>
>         String query = "SELECT TOP 300000 * FROM vw_PropertyGeneralInfo WHERE Country_id = 1 ORDER BY Property_id DESC";
>
>         System.out.println("getting data");
>         List<PropertyGeneralInfoRow> pgiTable = dbConn.ExecuteQueryWithoutParameters(query);
>         System.out.println("got data");
>
>         ObjectMapper mapper = new ObjectMapper();
>
>         // connect to the cluster over the transport protocol
>         Settings settings = ImmutableSettings.settingsBuilder()
>                 .put("cluster.name", "property_transaction_data").build();
>
>         Client client = new TransportClient(settings)
>                 .addTransportAddress(new InetSocketTransportAddress("192.168.133.131", 9300));
>
>         BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
>             @Override
>             public void beforeBulk(long executionId, BulkRequest request) {
>                 System.out.println("About to index " + request.numberOfActions()
>                         + " records of size " + request.estimatedSizeInBytes() + ".");
>             }
>
>             @Override
>             public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
>                 if (response.hasFailures()) {
>                     for (BulkItemResponse item : response.getItems()) {
>                         BulkItemResponse.Failure failure = item.getFailure();
>                         if (failure != null) {
>                             System.out.println(failure.getId() + " -- " + failure.getStatus().name()
>                                     + " -- " + failure.getMessage() + " -- " + failure.getType());
>                         }
>                     }
>                 }
>
>                 System.out.println("Successfully indexed " + request.numberOfActions()
>                         + " records in " + response.getTook() + ".");
>             }
>
>             @Override
>             public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
>                 System.out.println("failure somewhere on " + request.toString());
>                 failure.printStackTrace();
>                 logger.warn("failure on " + request.toString());
>             }
>         }).setBulkActions(500).setConcurrentRequests(1).build();
>
>         for (int i = 0; i < pgiTable.size(); i++) {
>             // prep location field
>             PropertyGeneralInfoRow pgiRow = pgiTable.get(i);
>
>             Double[] location = {pgiRow.getLon_dbl(), pgiRow.getLat_dbl()};
>
>             geocode geocode = new geocode();
>             geocode.setLocation(location);
>             pgiRow.setGeocode(geocode);
>
>             // prep full address string
>             pgiRow.setFulladdressstring(pgiRow.getPropertykey_tx() + ", "
>                     + pgiRow.getCity_tx() + ", " + pgiRow.getStateprov_cd() + ", "
>                     + pgiRow.getCountry_tx() + ", " + pgiRow.getPostalcode_tx());
>
>             String jsonRow = mapper.writeValueAsString(pgiRow);
>
>             if (jsonRow != null && !jsonRow.isEmpty() && !jsonRow.equals("{}")) {
>                 bulkProcessor.add(new IndexRequest("rcapropertydata", "rcaproperty").source(jsonRow.getBytes()));
>                 // bulkProcessor.add(client.prepareIndex("rcapropertydata", "rcaproperty").setSource(jsonRow));
>             } else {
>                 // don't add null strings..
>                 try {
>                     System.out.println(pgiRow.toString());
>                 } catch (Exception e) {
>                     System.out.println("Some error in the toString() method...");
>                 }
>                 System.out.println("Some json output was null. -- " + pgiRow.getProperty_id().toString());
>             }
>         }
>
>         bulkProcessor.flush();
>         bulkProcessor.close();
>     }
> }
>
>
>
> On Tuesday, September 9, 2014 1:57:54 PM UTC-4, Jörg Prante wrote:
>>
>> Check the path.data setting in config/elasticsearch.yml
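>>
>> For example, to move the data onto a partition with more room, something like
>> this in elasticsearch.yml (the path is just a placeholder):
>>
>> path.data: /path/to/larger/partition/elasticsearch
>>
>> or, with the packaged install, the DATA_DIR line in /etc/default/elasticsearch.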
>>
>> Jörg
>>
>> On Tue, Sep 9, 2014 at 7:50 PM, Joshua P <jpeter...@gmail.com> wrote:
>>
>>> Just reran the indexer and found this error coming up. I'm running out
>>> of disk space on the partition ES wants to write to.
>>>
>>> F38KqHhnRDWtiJCss5Wz0g -- INTERNAL_SERVER_ERROR --
>>> TranslogException[[index_type][0] Failed to write operation
>>> [org.elasticsearch.index.translog.Translog$Create@6f1f6b1e]]; nested:
>>> IOException[No space left on device];  -- index_type
>>>
>>> Where would I change the write location? Which config file?
>>>
>>> On Tuesday, September 9, 2014 1:28:21 PM UTC-4, Joshua P wrote:
>>>>
>>>> Hi Jörg,
>>>>
>>>> Can you elaborate on what you mean when you say I still need more fine tuning?
>>>>
>>>> I've upped the heap size to 4g (in both places I mentioned before
>>>> because it's not clear to me which one ES actually uses). I haven't tried
>>>> to index again yet.
>>>> Other than throttling my indexing, what are some other things I need to
>>>> be thinking about?
>>>>
>>>> On Tuesday, September 9, 2014 12:53:35 PM UTC-4, Jörg Prante wrote:
>>>>>
>>>>> Set ES_HEAP_SIZE to at least 1 GB. For smaller heaps like 512m and
>>>>> indexing around 1 million docs, you need some more fine tuning, which is
>>>>> complicated. Your machine is fine for setting the heap to 4 GB, which is 50% of
>>>>> the 8 GB RAM.
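>>>>>
>>>>> With the Debian/Ubuntu package that would be roughly this line in
>>>>> /etc/default/elasticsearch, followed by a restart of the service:
>>>>>
>>>>> ES_HEAP_SIZE=4g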
>>>>>
>>>>> Jörg
>>>>>
>>>>> On Tue, Sep 9, 2014 at 5:39 PM, Joshua P <jpeter...@gmail.com> wrote:
>>>>>
>>>>>> Here is /etc/default/elasticsearch
>>>>>>
>>>>>> # Run Elasticsearch as this user ID and group ID
>>>>>> #ES_USER=elasticsearch
>>>>>> #ES_GROUP=elasticsearch
>>>>>>
>>>>>> # Heap Size (defaults to 256m min, 1g max)
>>>>>> ES_HEAP_SIZE=512m
>>>>>>
>>>>>> # Heap new generation
>>>>>> #ES_HEAP_NEWSIZE=
>>>>>>
>>>>>> # max direct memory
>>>>>> #ES_DIRECT_SIZE=
>>>>>>
>>>>>> # Maximum number of open files, defaults to 65535.
>>>>>> MAX_OPEN_FILES=65535
>>>>>>
>>>>>> # Maximum locked memory size. Set to "unlimited" if you use the
>>>>>> # bootstrap.mlockall option in elasticsearch.yml. You must also set
>>>>>> # ES_HEAP_SIZE.
>>>>>> MAX_LOCKED_MEMORY=unlimited
>>>>>>
>>>>>> # Maximum number of VMA (Virtual Memory Areas) a process can own
>>>>>> #MAX_MAP_COUNT=262144
>>>>>>
>>>>>> # Elasticsearch log directory
>>>>>> #LOG_DIR=/var/log/elasticsearch
>>>>>>
>>>>>> # Elasticsearch data directory
>>>>>> #DATA_DIR=/var/lib/elasticsearch
>>>>>>
>>>>>> # Elasticsearch work directory
>>>>>> #WORK_DIR=/tmp/elasticsearch
>>>>>>
>>>>>> # Elasticsearch configuration directory
>>>>>> #CONF_DIR=/etc/elasticsearch
>>>>>>
>>>>>> # Elasticsearch configuration file (elasticsearch.yml)
>>>>>> #CONF_FILE=/etc/elasticsearch/elasticsearch.yml
>>>>>>
>>>>>> # Additional Java OPTS
>>>>>> #ES_JAVA_OPTS=
>>>>>>
>>>>>> # Configure restart on package upgrade (true, every other setting
>>>>>> will lead to not restarting)
>>>>>> #RESTART_ON_UPGRADE=true
>>>>>>
>>>>>> I also see the same setting in /etc/init.d/elasticsearch. Do you know
>>>>>> which file takes priority? And what a good size would be?
>>>>>>
>>>>>> On Tuesday, September 9, 2014 11:32:19 AM UTC-4, vineeth mohan wrote:
>>>>>>>
>>>>>>> Hello Joshua ,
>>>>>>>
>>>>>>> I am not sure which variable you are referring to in the memory
>>>>>>> settings in the config file; please paste the comment and the config.
>>>>>>> I usually change the config from the init.d script.
>>>>>>>
>>>>>>> The best approach would be to bulk index, say, 10,000 documents in sync mode,
>>>>>>> wait until everything is indexed, and then proceed to the next batch.
>>>>>>> I am not sure about the Java API, but a while back I used to curl the
>>>>>>> thread pool stats API to see how many requests were rejected.
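>>>>>>>
>>>>>>> Roughly (untested, from memory) the sync-mode batching with the Java API would
>>>>>>> look something like this, reusing the client, mapper, and index/type names from
>>>>>>> your code, with an arbitrary batch size of 10,000:
>>>>>>>
>>>>>>> BulkRequestBuilder bulk = client.prepareBulk();
>>>>>>> for (PropertyGeneralInfoRow row : pgiTable) {
>>>>>>>     bulk.add(client.prepareIndex("rcapropertydata", "rcaproperty")
>>>>>>>             .setSource(mapper.writeValueAsString(row)));
>>>>>>>     if (bulk.numberOfActions() >= 10000) {
>>>>>>>         // blocks until this batch has been indexed
>>>>>>>         BulkResponse resp = bulk.execute().actionGet();
>>>>>>>         if (resp.hasFailures()) {
>>>>>>>             System.out.println(resp.buildFailureMessage());
>>>>>>>         }
>>>>>>>         bulk = client.prepareBulk(); // start the next batch
>>>>>>>     }
>>>>>>> }
>>>>>>> if (bulk.numberOfActions() > 0) {
>>>>>>>     bulk.execute().actionGet(); // flush the remainder
>>>>>>> }
>>>>>>>
>>>>>>> And the stats check I mean is along these lines (look at the "rejected"
>>>>>>> counters under "bulk" and "index"):
>>>>>>>
>>>>>>> curl 'http://localhost:9200/_nodes/stats/thread_pool?pretty'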
>>>>>>>
>>>>>>> Thanks
>>>>>>>           Vineeth
>>>>>>>
>>>>>>> On Tue, Sep 9, 2014 at 8:58 PM, Joshua P <jpeter...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You also said you wouldn't recommend indexing that much information
>>>>>>>> at once. How would you suggest breaking it up and what status should I
>>>>>>>> look for before doing another batch? I have to come up with some process
>>>>>>>> that is repeatable and mostly automated.
>>>>>>>>
>>>>>>>> On Tuesday, September 9, 2014 11:12:59 AM UTC-4, Joshua P wrote:
>>>>>>>>>
>>>>>>>>> Thanks for the reply, Vineeth!
>>>>>>>>>
>>>>>>>>> What's a practical heap size? I've seen some people saying they
>>>>>>>>> set it to 30gb, but this confuses me because in the
>>>>>>>>> /etc/default/elasticsearch file the comment suggests the max is only 1gb.
>>>>>>>>>
>>>>>>>>> I'll look into the threadpool issue. Is there a Java API for
>>>>>>>>> monitoring cluster node health? Can you point me at an example or
>>>>>>>>> give me a link to that?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Tuesday, September 9, 2014 10:52:35 AM UTC-4, vineeth mohan
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello Joshua,
>>>>>>>>>>
>>>>>>>>>> I have a feeling this has something to do with the threadpool.
>>>>>>>>>> There is a limit on the number of documents that can be queued for indexing.
>>>>>>>>>>
>>>>>>>>>> Try increasing the size of the threadpool queue for index and bulk to
>>>>>>>>>> a large number.
>>>>>>>>>> Also, through the cluster nodes API for threadpools, you can see if any
>>>>>>>>>> requests have failed.
>>>>>>>>>> Monitor this API for requests that fail due to the large volume.
>>>>>>>>>>
>>>>>>>>>> Threadpool - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html
>>>>>>>>>> Threadpool stats - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
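>>>>>>>>>>
>>>>>>>>>> For example, something like this in elasticsearch.yml (the numbers are only
>>>>>>>>>> illustrative), followed by a node restart:
>>>>>>>>>>
>>>>>>>>>> threadpool.index.queue_size: 500
>>>>>>>>>> threadpool.bulk.queue_size: 500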
>>>>>>>>>>
>>>>>>>>>> Having said that, I wouldn't recommend bulk indexing that much
>>>>>>>>>> information at a time, and 512 MB of heap is not going to help much.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>           Vineeth
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 9, 2014 at 7:48 PM, Joshua P <jpeter...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi there!
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to do a one-time index of about 800,000 records into
>>>>>>>>>>> an instance of Elasticsearch, but I'm having a bit of trouble. It
>>>>>>>>>>> continually fails around 200,000 records. Looking at it in the
>>>>>>>>>>> Elasticsearch Head plugin, my index goes offline and becomes unrecoverable.
>>>>>>>>>>>
>>>>>>>>>>> For now, I have it running on a VM on my personal machine.
>>>>>>>>>>>
>>>>>>>>>>> VM Config:
>>>>>>>>>>> Ubuntu Server 14.04 64-Bit
>>>>>>>>>>> 8 GB RAM
>>>>>>>>>>> 2 Processors
>>>>>>>>>>> 32 GB SSD
>>>>>>>>>>>
>>>>>>>>>>> Java
>>>>>>>>>>> java version "1.7.0_65"
>>>>>>>>>>> OpenJDK Runtime Environment (IcedTea 2.5.1)
>>>>>>>>>>> (7u65-2.5.1-4ubuntu1~0.14.04.2)
>>>>>>>>>>> OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
>>>>>>>>>>>
>>>>>>>>>>> Elasticsearch is using mostly the defaults. This is the output
>>>>>>>>>>> of:
>>>>>>>>>>> curl http://localhost:9200/_nodes/process?pretty
>>>>>>>>>>> {
>>>>>>>>>>>   "cluster_name" : "property_transaction_data",
>>>>>>>>>>>   "nodes" : {
>>>>>>>>>>>     "KlFkO_qgSOKmV_jjj5xeVw" : {
>>>>>>>>>>>       "name" : "Marvin Flumm",
>>>>>>>>>>>       "transport_address" : "inet[/192.168.133.131:9300]",
>>>>>>>>>>>       "host" : "ubuntu-es",
>>>>>>>>>>>       "ip" : "127.0.1.1",
>>>>>>>>>>>       "version" : "1.3.2",
>>>>>>>>>>>       "build" : "dee175d",
>>>>>>>>>>>       "http_address" : "inet[/192.168.133.131:9200]",
>>>>>>>>>>>       "process" : {
>>>>>>>>>>>         "refresh_interval_in_millis" : 1000,
>>>>>>>>>>>         "id" : 1092,
>>>>>>>>>>>         "max_file_descriptors" : 65535,
>>>>>>>>>>>         "mlockall" : true
>>>>>>>>>>>       }
>>>>>>>>>>>     }
>>>>>>>>>>>   }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> I adjusted ES_HEAP_SIZE to 512mb.
>>>>>>>>>>>
>>>>>>>>>>> I'm using the following code to pull data from SQL Server and
>>>>>>>>>>> index it.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>
>>
