I can confirm this behavior. I see it when sending JSON docs in batches: it
never happens when sending them one by one, but it is sporadic with batches.

It looks as if Solr/Jetty drops a couple of documents out of the batch.
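
Before blaming the server, it may be worth ruling out the client-side count. In the quoted loop below, the flush fires on i%1000==0 before i is incremented, so the first batch holds a single document and the final message prints i+1, which is one more than the number of lines read. Here is a minimal sketch of batching that flushes on the batch size and returns the exact number of items handed over. BatchSender and the sink callback are hypothetical names used only for illustration; in the real program the sink would wrap solr.add(batch), which is left out here so the sketch runs without SolrJ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch only: BatchSender and its sink parameter are hypothetical, not SolrJ API. */
final class BatchSender {

    /**
     * Hands items to sink in batches of at most batchSize and returns how many
     * items were handed over in total. The sink stands in for solr.add(batch).
     */
    static <T> int sendInBatches(Iterable<T> items, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int sent = 0;
        for (T item : items) {
            batch.add(item);
            // Flush on the batch size itself rather than on a separate loop counter.
            if (batch.size() == batchSize) {
                sink.accept(batch);
                sent += batch.size();
                batch = new ArrayList<>(batchSize);
            }
        }
        // Flush the final partial batch, if any.
        if (!batch.isEmpty()) {
            sink.accept(batch);
            sent += batch.size();
        }
        return sent;
    }
}
```

The returned total can then be compared with what the dashboard reports after the final commit. Since solr.add throws checked exceptions, a real sink would have to catch and rethrow them as unchecked ones.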


> On 21 Jul 2015, at 21:38, Vineeth Dasaraju <vineeth.ii...@gmail.com> wrote:
> Hi,
> Thank you, Erick, for your input. I tried creating batches of 1000 objects
> and indexing them into Solr. The performance is far better than before, but I
> find that the number of indexed documents shown in the dashboard is
> lower than the number of documents that I actually indexed through
> SolrJ. My code is as follows:
> private static String SOLR_SERVER_URL = "http://localhost:8983/solr/newcore";
> private static String JSON_FILE_PATH = "/home/vineeth/week1_fixed.json";
> private static JSONParser parser = new JSONParser();
> private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
> public static void main(String[] args) throws IOException,
> SolrServerException, ParseException {
>        File file = new File(JSON_FILE_PATH);
>        Scanner scn=new Scanner(file,"UTF-8");
>        JSONObject object;
>        int i = 0;
>        Collection<SolrInputDocument> batch = new
> ArrayList<SolrInputDocument>();
>        while(scn.hasNext()){
>            object= (JSONObject) parser.parse(scn.nextLine());
>            SolrInputDocument doc = indexJSON(object);
>            batch.add(doc);
>            if(i%1000==0){
>                System.out.println("Indexed " + (i+1) + " objects." );
>                solr.add(batch);
>                batch = new ArrayList<SolrInputDocument>();
>            }
>            i++;
>        }
>        solr.add(batch);
>        solr.commit();
>        System.out.println("Indexed " + (i+1) + " objects." );
> }
> public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
> ParseException, IOException, SolrServerException {
>    Collection<SolrInputDocument> batch = new
> ArrayList<SolrInputDocument>();
>    SolrInputDocument mainEvent = new SolrInputDocument();
>    mainEvent.addField("id", generateID());
>    mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
>    mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
>    mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
>    mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
>    mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
>    mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
>    Object obj = parser.parse(jsonOBJ.get("User").toString());
>    JSONObject userObj = (JSONObject) obj;
>    SolrInputDocument childUserEvent = new SolrInputDocument();
>    childUserEvent.addField("id", generateID());
>    childUserEvent.addField("User", userObj.get("User"));
>    obj = parser.parse(jsonOBJ.get("EventDescription").toString());
>    JSONObject eventdescriptionObj = (JSONObject) obj;
>    SolrInputDocument childEventDescEvent = new SolrInputDocument();
>    childEventDescEvent.addField("id", generateID());
>    childEventDescEvent.addField("EventApplicationName",
> eventdescriptionObj.get("EventApplicationName"));
>    childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));
>    obj= JSONValue.parse(eventdescriptionObj.get("Information").toString());
>    JSONArray informationArray = (JSONArray) obj;
>    for(int i = 0; i<informationArray.size(); i++){
>        JSONObject domain = (JSONObject) informationArray.get(i);
>        SolrInputDocument domainDoc = new SolrInputDocument();
>        domainDoc.addField("id", generateID());
>        domainDoc.addField("domainName", domain.get("domainName"));
>        String s = domain.get("columns").toString();
>        obj= JSONValue.parse(s);
>        JSONArray ColumnsArray = (JSONArray) obj;
>        SolrInputDocument columnsDoc = new SolrInputDocument();
>        columnsDoc.addField("id", generateID());
>        for(int j = 0; j<ColumnsArray.size(); j++){
>            JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
>            SolrInputDocument columnDoc = new SolrInputDocument();
>            columnDoc.addField("id", generateID());
>            columnDoc.addField("movieName", ColumnsObj.get("movieName"));
>            columnsDoc.addChildDocument(columnDoc);
>        }
>        domainDoc.addChildDocument(columnsDoc);
>        childEventDescEvent.addChildDocument(domainDoc);
>    }
>    mainEvent.addChildDocument(childEventDescEvent);
>    mainEvent.addChildDocument(childUserEvent);
>    return mainEvent;
> }
> I would be grateful if you could let me know what I am missing.
> On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>> First thing is it looks like you're only sending one document at a
>> time, perhaps with child objects. This is not optimal at all. I
>> usually batch my docs up in groups of 1,000, and there is anecdotal
>> evidence that there may (depending on the docs) be some gains above
>> that number. Gotta balance the batch size off against how big the docs
>> are of course.
>> Assuming that you really are calling this method for one doc (and
>> children) at a time, the far bigger problem other than calling
>> server.add for each parent/children is that you're then calling
>> solr.commit() every time. This is an anti-pattern. Generally, let the
>> autoCommit setting in solrconfig.xml handle the intermediate commits
>> while the indexing program is running and only issue a commit at the
>> very end of the job if at all.
>> Best,
>> Erick
>> On Sun, Jul 19, 2015 at 12:08 PM, Vineeth Dasaraju
>> <vineeth.ii...@gmail.com> wrote:
>>> Hi,
>>> I am trying to index JSON objects (which contain nested JSON objects and
>>> Arrays in them) into solr.
>>> My JSON Object looks like the following (This is fake data that I am
>>> using for this example):
>>> {
>>>    "RawEventMessage": "Lorem ipsum dolor sit amet, consectetur adipiscing
>>> elit. Aliquam dolor orci, placerat ac pretium a, tincidunt consectetur
>>> mauris. Etiam sollicitudin sapien id odio tempus, non sodales odio iaculis.
>>> Donec fringilla diam at placerat interdum. Proin vitae arcu non augue
>>> facilisis auctor id non neque. Integer non nibh sit amet justo facilisis
>>> semper a vel ligula. Pellentesque commodo vulputate consequat. ",
>>>    "EventUid": "1279706565",
>>>    "TimeOfEvent": "2015-05-01-08-07-13",
>>>    "TimeOfEventUTC": "2015-05-01-01-07-13",
>>>    "EventCollector": "kafka",
>>>    "EventMessageType": "kafka-@column",
>>>    "User": {
>>>        "User": "Lorem ipsum",
>>>        "UserGroup": "Manager",
>>>        "Location": "consectetur adipiscing",
>>>        "Department": "Legal"
>>>    },
>>>    "EventDescription": {
>>>        "EventApplicationName": "",
>>>        "Query": "SELECT * FROM MOVIES",
>>>        "Information": [
>>>            {
>>>                "domainName": "English",
>>>                "columns": [
>>>                    {
>>>                        "movieName": "Casablanca",
>>>                        "duration": "154"
>>>                    },
>>>                    {
>>>                        "movieName": "Die Hard",
>>>                        "duration": "127"
>>>                    }
>>>                ]
>>>            },
>>>            {
>>>                "domainName": "Hindi",
>>>                "columns": [
>>>                    {
>>>                        "movieName": "DDLJ",
>>>                        "duration": "176"
>>>                    }
>>>                ]
>>>            }
>>>        ]
>>>    }
>>> }
>>> My function for indexing the object is as follows:
>>> public static void indexJSON(JSONObject jsonOBJ) throws ParseException,
>>> IOException, SolrServerException {
>>>    Collection<SolrInputDocument> batch = new
>>> ArrayList<SolrInputDocument>();
>>>    SolrInputDocument mainEvent = new SolrInputDocument();
>>>    mainEvent.addField("id", generateID());
>>>    mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
>>>    mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
>>>    mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
>>>    mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
>>>    mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
>>>    mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
>>>    Object obj = parser.parse(jsonOBJ.get("User").toString());
>>>    JSONObject userObj = (JSONObject) obj;
>>>    SolrInputDocument childUserEvent = new SolrInputDocument();
>>>    childUserEvent.addField("id", generateID());
>>>    childUserEvent.addField("User", userObj.get("User"));
>>>    obj = parser.parse(jsonOBJ.get("EventDescription").toString());
>>>    JSONObject eventdescriptionObj = (JSONObject) obj;
>>>    SolrInputDocument childEventDescEvent = new SolrInputDocument();
>>>    childEventDescEvent.addField("id", generateID());
>>>    childEventDescEvent.addField("EventApplicationName",
>>> eventdescriptionObj.get("EventApplicationName"));
>>>    childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));
>>>    obj= JSONValue.parse(eventdescriptionObj.get("Information").toString());
>>>    JSONArray informationArray = (JSONArray) obj;
>>>    for(int i = 0; i<informationArray.size(); i++){
>>>        JSONObject domain = (JSONObject) informationArray.get(i);
>>>        SolrInputDocument domainDoc = new SolrInputDocument();
>>>        domainDoc.addField("id", generateID());
>>>        domainDoc.addField("domainName", domain.get("domainName"));
>>>        String s = domain.get("columns").toString();
>>>        obj= JSONValue.parse(s);
>>>        JSONArray ColumnsArray = (JSONArray) obj;
>>>        SolrInputDocument columnsDoc = new SolrInputDocument();
>>>        columnsDoc.addField("id", generateID());
>>>        for(int j = 0; j<ColumnsArray.size(); j++){
>>>            JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
>>>            SolrInputDocument columnDoc = new SolrInputDocument();
>>>            columnDoc.addField("id", generateID());
>>>            columnDoc.addField("movieName", ColumnsObj.get("movieName"));
>>>            columnsDoc.addChildDocument(columnDoc);
>>>        }
>>>        domainDoc.addChildDocument(columnsDoc);
>>>        childEventDescEvent.addChildDocument(domainDoc);
>>>    }
>>>    mainEvent.addChildDocument(childEventDescEvent);
>>>    mainEvent.addChildDocument(childUserEvent);
>>>    batch.add(mainEvent);
>>>    solr.add(batch);
>>>    solr.commit();
>>> }
>>> When I try to index using the above code, I am able to index only 12
>>> objects per second. Is there a faster way to do the indexing? I believe I
>>> am using the json-fast parser, which is one of the fastest parsers for
>>> JSON.
>>> Your help will be very valuable to me.
>>> Thanks,
>>> Vineeth
