This is the code I've been using to index. I'm going to try to fix the running-out-of-space issue first, and then try slimming down the settings. Thank you.
public class Indexer {
    private static final Logger logger = LogManager.getLogger("ESBulkUploader");

    public static void main(String[] args) throws IOException, NoSuchFieldException {
        DBConnection dbConn = new DBConnection("");
        String query = "SELECT TOP 300000 * FROM vw_PropertyGeneralInfo WHERE Country_id = 1 ORDER BY Property_id DESC";

        System.out.println("getting data");
        List<PropertyGeneralInfoRow> pgiTable = dbConn.ExecuteQueryWithoutParameters(query);
        System.out.println("got data");

        ObjectMapper mapper = new ObjectMapper();
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "property_transaction_data").build();
        Client client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("192.168.133.131", 9300));

        BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                System.out.println("About to index " + request.numberOfActions()
                        + " records of size " + request.estimatedSizeInBytes() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    for (BulkItemResponse item : response.getItems()) {
                        BulkItemResponse.Failure failure = item.getFailure();
                        if (failure != null) {
                            System.out.println(failure.getId() + " -- " + failure.getStatus().name()
                                    + " -- " + failure.getMessage() + " -- " + failure.getType());
                        }
                    }
                }
                System.out.println("Successfully indexed " + request.numberOfActions()
                        + " records in " + response.getTook() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                System.out.println("failure somewhere on " + request.toString());
                failure.printStackTrace();
                logger.warn("failure on " + request.toString());
            }
        }).setBulkActions(500).setConcurrentRequests(1).build();

        for (int i = 0; i < pgiTable.size(); i++) {
            // prep location field
            PropertyGeneralInfoRow pgiRow = pgiTable.get(i);
            Double[] location = {pgiRow.getLon_dbl(), pgiRow.getLat_dbl()};
            geocode geocode = new geocode();
            geocode.setLocation(location);
            pgiRow.setGeocode(geocode);

            // prep full address string
            pgiRow.setFulladdressstring(pgiRow.getPropertykey_tx() + ", " + pgiRow.getCity_tx()
                    + ", " + pgiRow.getStateprov_cd() + ", " + pgiRow.getCountry_tx()
                    + ", " + pgiRow.getPostalcode_tx());

            String jsonRow = mapper.writeValueAsString(pgiRow);
            if (jsonRow != null && !jsonRow.isEmpty() && !jsonRow.equals("{}")) {
                bulkProcessor.add(new IndexRequest("rcapropertydata", "rcaproperty").source(jsonRow.getBytes()));
                // bulkProcessor.add(client.prepareIndex("rcapropertydata", "rcaproperty").setSource(jsonRow));
            } else {
                // don't add null strings..
                try {
                    System.out.println(pgiRow.toString());
                } catch (Exception e) {
                    System.out.println("Some error in the toString() method...");
                }
                System.out.println("Some json output was null. -- " + pgiRow.getProperty_id().toString());
            }
        }

        bulkProcessor.flush();
        bulkProcessor.close();
    }
}

On Tuesday, September 9, 2014 1:57:54 PM UTC-4, Jörg Prante wrote:
>
> Check the path.data setting in config/elasticsearch.yml
>
> Jörg
>
> On Tue, Sep 9, 2014 at 7:50 PM, Joshua P <jpeter...@gmail.com> wrote:
>
>> Just reran the indexer and found this error coming up. I'm running out of
>> disk space on the partition ES wants to write to.
>>
>> F38KqHhnRDWtiJCss5Wz0g -- INTERNAL_SERVER_ERROR --
>> TranslogException[[index_type][0] Failed to write operation
>> [org.elasticsearch.index.translog.Translog$Create@6f1f6b1e]]; nested:
>> IOException[No space left on device]; -- index_type
>>
>> Where would I change the write location? Which config file?
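For reference, the path.data change Jörg mentions would look something like this in config/elasticsearch.yml; the target directory here is only an assumed example, pointed at whatever partition has enough free space:

```yaml
# config/elasticsearch.yml
# Move the data directory to a partition with room for the index.
# /mnt/bigdisk/elasticsearch is a hypothetical mount point.
path.data: /mnt/bigdisk/elasticsearch
```

On the Ubuntu package install shown later in this thread, the commented-out DATA_DIR setting in /etc/default/elasticsearch serves the same purpose.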
>>
>> On Tuesday, September 9, 2014 1:28:21 PM UTC-4, Joshua P wrote:
>>>
>>> Hi Jörg,
>>>
>>> Can you elaborate on what you mean when you say I still need more fine
>>> tuning?
>>>
>>> I've upped the heap size to 4g (in both places I mentioned before
>>> because it's not clear to me which one ES actually uses). I haven't tried
>>> to index again yet.
>>> Other than throttling my indexing, what are some other things I need to
>>> be thinking about?
>>>
>>> On Tuesday, September 9, 2014 12:53:35 PM UTC-4, Jörg Prante wrote:
>>>>
>>>> Set ES_HEAP_SIZE to at least 1 GB; for smaller heaps like 512m and
>>>> indexing around 1 million docs, you need some more fine tuning, which is
>>>> complicated. Your machine is fine with the heap set to 4 GB, which is
>>>> 50% of the 8 GB RAM.
>>>>
>>>> Jörg
>>>>
>>>> On Tue, Sep 9, 2014 at 5:39 PM, Joshua P <jpeter...@gmail.com> wrote:
>>>>
>>>>> Here is /etc/default/elasticsearch
>>>>>
>>>>> # Run Elasticsearch as this user ID and group ID
>>>>> #ES_USER=elasticsearch
>>>>> #ES_GROUP=elasticsearch
>>>>>
>>>>> # Heap Size (defaults to 256m min, 1g max)
>>>>> ES_HEAP_SIZE=512m
>>>>>
>>>>> # Heap new generation
>>>>> #ES_HEAP_NEWSIZE=
>>>>>
>>>>> # max direct memory
>>>>> #ES_DIRECT_SIZE=
>>>>>
>>>>> # Maximum number of open files, defaults to 65535.
>>>>> MAX_OPEN_FILES=65535
>>>>>
>>>>> # Maximum locked memory size. Set to "unlimited" if you use the
>>>>> # bootstrap.mlockall option in elasticsearch.yml. You must also set
>>>>> # ES_HEAP_SIZE.
>>>>> MAX_LOCKED_MEMORY=unlimited
>>>>>
>>>>> # Maximum number of VMA (Virtual Memory Areas) a process can own
>>>>> #MAX_MAP_COUNT=262144
>>>>>
>>>>> # Elasticsearch log directory
>>>>> #LOG_DIR=/var/log/elasticsearch
>>>>>
>>>>> # Elasticsearch data directory
>>>>> #DATA_DIR=/var/lib/elasticsearch
>>>>>
>>>>> # Elasticsearch work directory
>>>>> #WORK_DIR=/tmp/elasticsearch
>>>>>
>>>>> # Elasticsearch configuration directory
>>>>> #CONF_DIR=/etc/elasticsearch
>>>>>
>>>>> # Elasticsearch configuration file (elasticsearch.yml)
>>>>> #CONF_FILE=/etc/elasticsearch/elasticsearch.yml
>>>>>
>>>>> # Additional Java OPTS
>>>>> #ES_JAVA_OPTS=
>>>>>
>>>>> # Configure restart on package upgrade (true, every other setting will
>>>>> # lead to not restarting)
>>>>> #RESTART_ON_UPGRADE=true
>>>>>
>>>>> I also see the same setting in /etc/init.d/elasticsearch. Do you know
>>>>> which file takes priority? And what would a good size be?
>>>>>
>>>>> On Tuesday, September 9, 2014 11:32:19 AM UTC-4, vineeth mohan wrote:
>>>>>>
>>>>>> Hello Joshua,
>>>>>>
>>>>>> I am not sure which variable you are referring to for the memory
>>>>>> settings in the config file; please paste the comment and config.
>>>>>> I usually change the config from the init.d script.
>>>>>>
>>>>>> The best approach would be to bulk index, say, 10,000 feeds in sync
>>>>>> mode, wait until everything is indexed, and then proceed to the next
>>>>>> batch.
>>>>>> I am not sure about the Java API, but a while back I used to curl the
>>>>>> stats API and see how many requests were rejected.
>>>>>>
>>>>>> Thanks
>>>>>> Vineeth
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 8:58 PM, Joshua P <jpeter...@gmail.com> wrote:
>>>>>>
>>>>>>> You also said you wouldn't recommend indexing that much information
>>>>>>> at once. How would you suggest breaking it up, and what status should
>>>>>>> I look for before doing another batch? I have to come up with some
>>>>>>> process that is repeatable and mostly automated.
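The batch-and-wait approach Vineeth describes can be sketched in plain Java. This is only a sketch: `BatchedIndexing` and `partition` are hypothetical names, and the synchronous bulk call itself is left as a comment because it depends on the client setup (on the 1.x Java API it would be a `client.bulk(...).actionGet()` per batch, checking the response before continuing).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split the rows into fixed-size batches and index each one
// synchronously before moving on, so the bulk thread pool queue on the
// node never piles up.
public class BatchedIndexing {

    // Split rows into consecutive batches of at most batchSize elements.
    public static <T> List<List<T>> partition(List<T> rows, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            batches.add(new ArrayList<>(rows.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 25000; i++) rows.add(i);

        for (List<Integer> batch : partition(rows, 10000)) {
            // Build a bulk request from this batch, execute it
            // synchronously, and check the response for failures
            // (or rejected executions) before continuing.
            System.out.println("Would index a batch of " + batch.size() + " rows");
        }
    }
}
```

Waiting on each response before sending the next batch is what keeps this from overrunning the `bulk` thread pool queue that Vineeth mentions below.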
>>>>>>>
>>>>>>> On Tuesday, September 9, 2014 11:12:59 AM UTC-4, Joshua P wrote:
>>>>>>>>
>>>>>>>> Thanks for the reply, Vineeth!
>>>>>>>>
>>>>>>>> What's a practical heap size? I've seen some people saying they set
>>>>>>>> it to 30gb, but this confuses me because in the
>>>>>>>> /etc/default/elasticsearch file the comment suggests the max is only
>>>>>>>> 1gb?
>>>>>>>>
>>>>>>>> I'll look into the threadpool issue. Is there a Java API for
>>>>>>>> monitoring cluster node health? Can you point me at an example or
>>>>>>>> give me a link to that?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Tuesday, September 9, 2014 10:52:35 AM UTC-4, vineeth mohan wrote:
>>>>>>>>>
>>>>>>>>> Hello Joshua,
>>>>>>>>>
>>>>>>>>> I have a feeling this has something to do with the threadpool.
>>>>>>>>> There is a limit on the number of feeds that can be queued for
>>>>>>>>> indexing.
>>>>>>>>>
>>>>>>>>> Try increasing the size of the threadpool queue for index and bulk
>>>>>>>>> to a large number.
>>>>>>>>> Also, through the cluster nodes API for the threadpool, you can see
>>>>>>>>> if any request has failed.
>>>>>>>>> Monitor this API for any failed requests due to large volume.
>>>>>>>>>
>>>>>>>>> Threadpool - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html
>>>>>>>>> Threadpool stats - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
>>>>>>>>>
>>>>>>>>> Having said that, I won't recommend bulk indexing that much
>>>>>>>>> information at a time, and 512 MB is not going to help much.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Vineeth
>>>>>>>>>
>>>>>>>>> On Tue, Sep 9, 2014 at 7:48 PM, Joshua P <jpeter...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi there!
>>>>>>>>>>
>>>>>>>>>> I'm trying to do a one-time index of about 800,000 records into
>>>>>>>>>> an instance of Elasticsearch, but I'm having a bit of trouble. It
>>>>>>>>>> continually fails around 200,000 records.
Looking at it in the
>>>>>>>>>> Elasticsearch Head plugin, my index goes offline and becomes
>>>>>>>>>> unrecoverable.
>>>>>>>>>>
>>>>>>>>>> For now, I have it running on a VM on my personal machine.
>>>>>>>>>>
>>>>>>>>>> VM Config:
>>>>>>>>>> Ubuntu Server 14.04 64-Bit
>>>>>>>>>> 8 GB RAM
>>>>>>>>>> 2 Processors
>>>>>>>>>> 32 GB SSD
>>>>>>>>>>
>>>>>>>>>> Java
>>>>>>>>>> java version "1.7.0_65"
>>>>>>>>>> OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.14.04.2)
>>>>>>>>>> OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
>>>>>>>>>>
>>>>>>>>>> Elasticsearch is using mostly the defaults. This is the output of:
>>>>>>>>>> curl http://localhost:9200/_nodes/process?pretty
>>>>>>>>>> {
>>>>>>>>>>   "cluster_name" : "property_transaction_data",
>>>>>>>>>>   "nodes" : {
>>>>>>>>>>     "KlFkO_qgSOKmV_jjj5xeVw" : {
>>>>>>>>>>       "name" : "Marvin Flumm",
>>>>>>>>>>       "transport_address" : "inet[/192.168.133.131:9300]",
>>>>>>>>>>       "host" : "ubuntu-es",
>>>>>>>>>>       "ip" : "127.0.1.1",
>>>>>>>>>>       "version" : "1.3.2",
>>>>>>>>>>       "build" : "dee175d",
>>>>>>>>>>       "http_address" : "inet[/192.168.133.131:9200]",
>>>>>>>>>>       "process" : {
>>>>>>>>>>         "refresh_interval_in_millis" : 1000,
>>>>>>>>>>         "id" : 1092,
>>>>>>>>>>         "max_file_descriptors" : 65535,
>>>>>>>>>>         "mlockall" : true
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> I adjusted ES_HEAP_SIZE to 512mb.
>>>>>>>>>>
>>>>>>>>>> I'm using the following code to pull data from SQL Server and
>>>>>>>>>> index it.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>> Google Groups "elasticsearch" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>> https://groups.google.com/d/msgid/elasticsearch/f94f96d4-8c3f-462f-bdcf-df717cbc6269%40googlegroups.com.
>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.