I can confirm this behavior. It never happens when sending JSON docs one by one, but it shows up sporadically when sending batches.
Like if solr/jetty drops a couple of documents out of the batch.

Regards

> On 21 Jul 2015, at 21:38, Vineeth Dasaraju <vineeth.ii...@gmail.com> wrote:
>
> Hi,
>
> Thank you, Erick, for your inputs. I tried creating batches of 1000 objects
> and indexing them to solr. The performance is way better than before, but I
> find that the number of indexed documents shown in the dashboard is lower
> than the number of documents that I actually indexed through solrj. My code
> is as follows:
>
> private static String SOLR_SERVER_URL = "http://localhost:8983/solr/newcore";
> private static String JSON_FILE_PATH = "/home/vineeth/week1_fixed.json";
> private static JSONParser parser = new JSONParser();
> private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
>
> public static void main(String[] args) throws IOException,
>         SolrServerException, ParseException {
>     File file = new File(JSON_FILE_PATH);
>     Scanner scn = new Scanner(file, "UTF-8");
>     JSONObject object;
>     int i = 0;
>     Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
>     while (scn.hasNext()) {
>         object = (JSONObject) parser.parse(scn.nextLine());
>         SolrInputDocument doc = indexJSON(object);
>         batch.add(doc);
>         if (i % 1000 == 0) {
>             System.out.println("Indexed " + (i + 1) + " objects.");
>             solr.add(batch);
>             batch = new ArrayList<SolrInputDocument>();
>         }
>         i++;
>     }
>     solr.add(batch);
>     solr.commit();
>     System.out.println("Indexed " + (i + 1) + " objects.");
> }
>
> public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
>         ParseException, IOException, SolrServerException {
>     Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
>
>     SolrInputDocument mainEvent = new SolrInputDocument();
>     mainEvent.addField("id", generateID());
>     mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
>     mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
>     mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
>     mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
>     mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
>     mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
>
>     Object obj = parser.parse(jsonOBJ.get("User").toString());
>     JSONObject userObj = (JSONObject) obj;
>
>     SolrInputDocument childUserEvent = new SolrInputDocument();
>     childUserEvent.addField("id", generateID());
>     childUserEvent.addField("User", userObj.get("User"));
>
>     obj = parser.parse(jsonOBJ.get("EventDescription").toString());
>     JSONObject eventdescriptionObj = (JSONObject) obj;
>
>     SolrInputDocument childEventDescEvent = new SolrInputDocument();
>     childEventDescEvent.addField("id", generateID());
>     childEventDescEvent.addField("EventApplicationName",
>             eventdescriptionObj.get("EventApplicationName"));
>     childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));
>
>     obj = JSONValue.parse(eventdescriptionObj.get("Information").toString());
>     JSONArray informationArray = (JSONArray) obj;
>
>     for (int i = 0; i < informationArray.size(); i++) {
>         JSONObject domain = (JSONObject) informationArray.get(i);
>
>         SolrInputDocument domainDoc = new SolrInputDocument();
>         domainDoc.addField("id", generateID());
>         domainDoc.addField("domainName", domain.get("domainName"));
>
>         String s = domain.get("columns").toString();
>         obj = JSONValue.parse(s);
>         JSONArray ColumnsArray = (JSONArray) obj;
>
>         SolrInputDocument columnsDoc = new SolrInputDocument();
>         columnsDoc.addField("id", generateID());
>
>         for (int j = 0; j < ColumnsArray.size(); j++) {
>             JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
>             SolrInputDocument columnDoc = new SolrInputDocument();
>             columnDoc.addField("id", generateID());
>             columnDoc.addField("movieName", ColumnsObj.get("movieName"));
>             columnsDoc.addChildDocument(columnDoc);
>         }
>         domainDoc.addChildDocument(columnsDoc);
>         childEventDescEvent.addChildDocument(domainDoc);
>     }
>
>     mainEvent.addChildDocument(childEventDescEvent);
>     mainEvent.addChildDocument(childUserEvent);
>     return mainEvent;
> }
>
> I would be grateful if you could let me know what I am missing.
>
> On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> First thing is it looks like you're only sending one document at a
>> time, perhaps with child objects. This is not optimal at all. I
>> usually batch my docs up in groups of 1,000, and there is anecdotal
>> evidence that there may (depending on the docs) be some gains above
>> that number. Gotta balance the batch size against how big the docs
>> are, of course.
>>
>> Assuming that you really are calling this method for one doc (and
>> children) at a time, the far bigger problem than calling server.add
>> for each parent/children is that you're then calling solr.commit()
>> every time. This is an anti-pattern. Generally, let the autoCommit
>> setting in solrconfig.xml handle the intermediate commits while the
>> indexing program is running and only issue a commit at the very end
>> of the job, if at all.
>>
>> Best,
>> Erick
>>
>> On Sun, Jul 19, 2015 at 12:08 PM, Vineeth Dasaraju
>> <vineeth.ii...@gmail.com> wrote:
>>> Hi,
>>>
>>> I am trying to index JSON objects (which contain nested JSON objects
>>> and arrays in them) into solr.
>>>
>>> My JSON object looks like the following (this is fake data that I am
>>> using for this example):
>>>
>>> {
>>>     "RawEventMessage": "Lorem ipsum dolor sit amet, consectetur
>>>         adipiscing elit. Aliquam dolor orci, placerat ac pretium a,
>>>         tincidunt consectetur mauris. Etiam sollicitudin sapien id odio
>>>         tempus, non sodales odio iaculis. Donec fringilla diam at
>>>         placerat interdum. Proin vitae arcu non augue facilisis auctor
>>>         id non neque. Integer non nibh sit amet justo facilisis semper
>>>         a vel ligula. Pellentesque commodo vulputate consequat. ",
>>>     "EventUid": "1279706565",
>>>     "TimeOfEvent": "2015-05-01-08-07-13",
>>>     "TimeOfEventUTC": "2015-05-01-01-07-13",
>>>     "EventCollector": "kafka",
>>>     "EventMessageType": "kafka-@column",
>>>     "User": {
>>>         "User": "Lorem ipsum",
>>>         "UserGroup": "Manager",
>>>         "Location": "consectetur adipiscing",
>>>         "Department": "Legal"
>>>     },
>>>     "EventDescription": {
>>>         "EventApplicationName": "",
>>>         "Query": "SELECT * FROM MOVIES",
>>>         "Information": [
>>>             {
>>>                 "domainName": "English",
>>>                 "columns": [
>>>                     { "movieName": "Casablanca", "duration": "154" },
>>>                     { "movieName": "Die Hard", "duration": "127" }
>>>                 ]
>>>             },
>>>             {
>>>                 "domainName": "Hindi",
>>>                 "columns": [
>>>                     { "movieName": "DDLJ", "duration": "176" }
>>>                 ]
>>>             }
>>>         ]
>>>     }
>>> }
>>>
>>> My function for indexing the object is as follows:
>>>
>>> public static void indexJSON(JSONObject jsonOBJ) throws ParseException,
>>>         IOException, SolrServerException {
>>>     Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
>>>
>>>     SolrInputDocument mainEvent = new SolrInputDocument();
>>>     mainEvent.addField("id", generateID());
>>>     mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
>>>     mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
>>>     mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
>>>     mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
>>>     mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
>>>     mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
>>>
>>>     Object obj = parser.parse(jsonOBJ.get("User").toString());
>>>     JSONObject userObj = (JSONObject) obj;
>>>
>>>     SolrInputDocument childUserEvent = new SolrInputDocument();
>>>     childUserEvent.addField("id", generateID());
>>>     childUserEvent.addField("User", userObj.get("User"));
>>>
>>>     obj = parser.parse(jsonOBJ.get("EventDescription").toString());
>>>     JSONObject eventdescriptionObj = (JSONObject) obj;
>>>
>>>     SolrInputDocument childEventDescEvent = new SolrInputDocument();
>>>     childEventDescEvent.addField("id", generateID());
>>>     childEventDescEvent.addField("EventApplicationName",
>>>             eventdescriptionObj.get("EventApplicationName"));
>>>     childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));
>>>
>>>     obj = JSONValue.parse(eventdescriptionObj.get("Information").toString());
>>>     JSONArray informationArray = (JSONArray) obj;
>>>
>>>     for (int i = 0; i < informationArray.size(); i++) {
>>>         JSONObject domain = (JSONObject) informationArray.get(i);
>>>
>>>         SolrInputDocument domainDoc = new SolrInputDocument();
>>>         domainDoc.addField("id", generateID());
>>>         domainDoc.addField("domainName", domain.get("domainName"));
>>>
>>>         String s = domain.get("columns").toString();
>>>         obj = JSONValue.parse(s);
>>>         JSONArray ColumnsArray = (JSONArray) obj;
>>>
>>>         SolrInputDocument columnsDoc = new SolrInputDocument();
>>>         columnsDoc.addField("id", generateID());
>>>
>>>         for (int j = 0; j < ColumnsArray.size(); j++) {
>>>             JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
>>>             SolrInputDocument columnDoc = new SolrInputDocument();
>>>             columnDoc.addField("id", generateID());
>>>             columnDoc.addField("movieName", ColumnsObj.get("movieName"));
>>>             columnsDoc.addChildDocument(columnDoc);
>>>         }
>>>         domainDoc.addChildDocument(columnsDoc);
>>>         childEventDescEvent.addChildDocument(domainDoc);
>>>     }
>>>
>>>     mainEvent.addChildDocument(childEventDescEvent);
>>>     mainEvent.addChildDocument(childUserEvent);
>>>     batch.add(mainEvent);
>>>     solr.add(batch);
>>>     solr.commit();
>>> }
>>>
>>> When I try to index using the above code, I am able to index only 12
>>> objects per second. Is there a faster way to do the indexing? I believe
>>> I am using the json-fast parser, which is one of the fastest parsers
>>> for json.
>>>
>>> Your help will be very valuable to me.
>>>
>>> Thanks,
>>> Vineeth
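For reference, here is a minimal, standalone sketch of the flush-on-size batching pattern Erick describes. Note that the `if (i % 1000 == 0)` test in the thread's loop also fires at i == 0, sending a batch of one document; flushing on the batch's size avoids that. The `BatchIndexer` class and its `Consumer` callback are hypothetical stand-ins, not SolrJ API, so the logic can run without a server; in real code the callback would be `solr.add(batch)`, with a single `solr.commit()` after the final flush (or none, if autoCommit handles it).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper sketching the recommended pattern: accumulate docs,
// flush every batchSize, flush the remainder once at the end, commit once.
class BatchIndexer<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher; // stands in for solr.add(batch)
    private List<T> batch = new ArrayList<>();
    private long total = 0;

    BatchIndexer(int batchSize, Consumer<List<T>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    void add(T doc) {
        batch.add(doc);
        total++;
        // Flush on the batch's size, not on a counter modulo, so the first
        // flush happens after batchSize docs rather than after one.
        if (batch.size() == batchSize) {
            flush();
        }
    }

    void finish() {
        if (!batch.isEmpty()) {
            flush(); // send the final partial batch
        }
        // In real code, issue the single solr.commit() here (if at all).
    }

    long total() { return total; }

    private void flush() {
        flusher.accept(batch);
        batch = new ArrayList<>();
    }
}
```

With a batch size of 1000, feeding 2500 documents through `add` and then calling `finish` produces three flushes of 1000, 1000, and 500 documents, and every document is sent exactly once.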