Are you making sure that every document has a unique ID? Index into an
empty Solr, then compare maxDoc with numDocs. If they differ (maxDoc is
higher), some of your documents have been deleted, which on a fresh
index means some were overwritten by later documents reusing the same
ID.
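
For example, a sketch using SolrJ's LukeRequest against the core from
the code below (the same numbers appear on the core dashboard and in
the /admin/luke response):

    SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/newcore");
    LukeRequest luke = new LukeRequest();
    luke.setNumTerms(0); // index-level counters only, no per-field stats
    LukeResponse rsp = luke.process(solr);
    NamedList<Object> index = rsp.getIndexInfo();
    // maxDoc also counts deleted-but-not-yet-merged docs; numDocs does not
    System.out.println("numDocs = " + index.get("numDocs")
        + ", maxDoc = " + index.get("maxDoc"));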

That might be a place to look.
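
If generateID() in the code below can ever hand out the same value
twice (a millisecond timestamp in a tight batch loop, say), a later
document silently replaces the earlier one and numDocs drops. A
collision-proof sketch, as a hypothetical drop-in:

    private static String generateID() {
        // java.util.UUID is effectively collision-free, so no two
        // documents can end up sharing an id
        return UUID.randomUUID().toString();
    }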

Upayavira

On Tue, Jul 21, 2015, at 09:24 PM, solr.user.1...@gmail.com wrote:
> I can confirm this behavior: I have seen it when sending JSON docs in
> batches. It never happens when sending one by one, but happens
> sporadically when sending batches.
> 
> It is as if Solr/Jetty drops a couple of documents out of the batch.
> 
> Regards
> 
> > On 21 Jul 2015, at 21:38, Vineeth Dasaraju <vineeth.ii...@gmail.com> wrote:
> > 
> > Hi,
> > 
> > Thank you, Erick, for your inputs. I tried creating batches of 1000
> > objects and indexing them to Solr. The performance is way better than
> > before, but I find that the number of indexed documents shown in the
> > dashboard is lower than the number of documents I actually indexed
> > through SolrJ. My code is as follows:
> > 
> > private static String SOLR_SERVER_URL = "http://localhost:8983/solr/newcore";
> > private static String JSON_FILE_PATH = "/home/vineeth/week1_fixed.json";
> > private static JSONParser parser = new JSONParser();
> > private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
> > 
> > public static void main(String[] args) throws IOException,
> > SolrServerException, ParseException {
> >     File file = new File(JSON_FILE_PATH);
> >     Scanner scn = new Scanner(file, "UTF-8");
> >     int i = 0;
> >     Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
> >     while (scn.hasNextLine()) { // hasNextLine(), since we read by line
> >         JSONObject object = (JSONObject) parser.parse(scn.nextLine());
> >         SolrInputDocument doc = indexJSON(object);
> >         batch.add(doc);
> >         i++;
> >         // check the batch size, not i: "i % 1000 == 0" also fired at
> >         // i == 0 and shipped a batch holding a single document
> >         if (batch.size() == 1000) {
> >             System.out.println("Indexed " + i + " objects.");
> >             solr.add(batch);
> >             batch = new ArrayList<SolrInputDocument>();
> >         }
> >     }
> >     if (!batch.isEmpty()) {
> >         solr.add(batch); // the final partial batch
> >     }
> >     solr.commit();
> >     System.out.println("Indexed " + i + " objects."); // i, not i + 1
> > }
> > 
> > public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
> > ParseException, IOException, SolrServerException {
> > 
> >    SolrInputDocument mainEvent = new SolrInputDocument();
> >    mainEvent.addField("id", generateID());
> >    mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
> >    mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
> >    mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
> >    mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
> >    mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
> >    mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
> > 
> >    Object obj = parser.parse(jsonOBJ.get("User").toString());
> >    JSONObject userObj = (JSONObject) obj;
> > 
> >    SolrInputDocument childUserEvent = new SolrInputDocument();
> >    childUserEvent.addField("id", generateID());
> >    childUserEvent.addField("User", userObj.get("User"));
> > 
> >    obj = parser.parse(jsonOBJ.get("EventDescription").toString());
> >    JSONObject eventdescriptionObj = (JSONObject) obj;
> > 
> >    SolrInputDocument childEventDescEvent = new SolrInputDocument();
> >    childEventDescEvent.addField("id", generateID());
> >    childEventDescEvent.addField("EventApplicationName",
> > eventdescriptionObj.get("EventApplicationName"));
> >    childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));
> > 
> >    obj= JSONValue.parse(eventdescriptionObj.get("Information").toString());
> >    JSONArray informationArray = (JSONArray) obj;
> > 
> >    for(int i = 0; i<informationArray.size(); i++){
> >        JSONObject domain = (JSONObject) informationArray.get(i);
> > 
> >        SolrInputDocument domainDoc = new SolrInputDocument();
> >        domainDoc.addField("id", generateID());
> >        domainDoc.addField("domainName", domain.get("domainName"));
> > 
> >        String s = domain.get("columns").toString();
> >        obj= JSONValue.parse(s);
> >        JSONArray ColumnsArray = (JSONArray) obj;
> > 
> >        SolrInputDocument columnsDoc = new SolrInputDocument();
> >        columnsDoc.addField("id", generateID());
> > 
> >        for(int j = 0; j<ColumnsArray.size(); j++){
> >            JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
> >            SolrInputDocument columnDoc = new SolrInputDocument();
> >            columnDoc.addField("id", generateID());
> >            columnDoc.addField("movieName", ColumnsObj.get("movieName"));
> >            columnsDoc.addChildDocument(columnDoc);
> >        }
> >        domainDoc.addChildDocument(columnsDoc);
> >        childEventDescEvent.addChildDocument(domainDoc);
> >    }
> > 
> >    mainEvent.addChildDocument(childEventDescEvent);
> >    mainEvent.addChildDocument(childUserEvent);
> >    return mainEvent;
> > }
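> > 
> > After each solr.add(batch) I also check the response (a sketch;
> > SolrJ's UpdateResponse reports a status code, where 0 means success):
> > 
> >     UpdateResponse rsp = solr.add(batch);
> >     if (rsp.getStatus() != 0) {
> >         System.err.println("Batch add returned status " + rsp.getStatus());
> >     }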
> > 
> > I would be grateful if you could let me know what I am missing.
> > 
> > On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> > 
> >> First thing is it looks like you're only sending one document at a
> >> time, perhaps with child objects. This is not optimal at all. I
> >> usually batch my docs up in groups of 1,000, and there is anecdotal
> >> evidence that there may (depending on the docs) be some gains above
> >> that number. Gotta balance the batch size against how big the docs
> >> are, of course.
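> >> 
> >> Something like this (just a sketch; "client" and "docs" stand in for
> >> whatever you already have):
> >> 
> >> Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
> >> for (SolrInputDocument doc : docs) {
> >>     batch.add(doc);
> >>     if (batch.size() == 1000) {  // ship full batches, not single docs
> >>         client.add(batch);
> >>         batch.clear();
> >>     }
> >> }
> >> if (!batch.isEmpty()) {
> >>     client.add(batch);           // and the final partial batch
> >> }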
> >> 
> >> Assuming that you really are calling this method for one doc (and
> >> children) at a time, an even bigger problem than calling server.add
> >> for each parent/children is that you're also calling solr.commit()
> >> every time. This is an anti-pattern. Generally, let the
> >> autoCommit setting in solrconfig.xml handle the intermediate commits
> >> while the indexing program is running and only issue a commit at the
> >> very end of the job if at all.
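> >> 
> >> For example, in solrconfig.xml (a sketch; the 60-second interval is
> >> only illustrative, and with openSearcher=false it's the final
> >> explicit commit, or a soft commit, that makes docs visible):
> >> 
> >> <autoCommit>
> >>   <maxTime>60000</maxTime>           <!-- hard commit at most every 60s -->
> >>   <openSearcher>false</openSearcher> <!-- durability only, no new searcher -->
> >> </autoCommit>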
> >> 
> >> Best,
> >> Erick
> >> 
> >> On Sun, Jul 19, 2015 at 12:08 PM, Vineeth Dasaraju
> >> <vineeth.ii...@gmail.com> wrote:
> >>> Hi,
> >>> 
> >>> I am trying to index JSON objects (which contain nested JSON objects
> >>> and arrays in them) into Solr.
> >>> 
> >>> My JSON object looks like the following (this is fake data that I am
> >>> using for this example):
> >>> 
> >>> {
> >>>    "RawEventMessage": "Lorem ipsum dolor sit amet, consectetur
> >>> adipiscing elit. Aliquam dolor orci, placerat ac pretium a, tincidunt
> >>> consectetur mauris. Etiam sollicitudin sapien id odio tempus, non
> >>> sodales odio iaculis. Donec fringilla diam at placerat interdum.
> >>> Proin vitae arcu non augue facilisis auctor id non neque. Integer non
> >>> nibh sit amet justo facilisis semper a vel ligula. Pellentesque
> >>> commodo vulputate consequat. ",
> >>>    "EventUid": "1279706565",
> >>>    "TimeOfEvent": "2015-05-01-08-07-13",
> >>>    "TimeOfEventUTC": "2015-05-01-01-07-13",
> >>>    "EventCollector": "kafka",
> >>>    "EventMessageType": "kafka-@column",
> >>>    "User": {
> >>>        "User": "Lorem ipsum",
> >>>        "UserGroup": "Manager",
> >>>        "Location": "consectetur adipiscing",
> >>>        "Department": "Legal"
> >>>    },
> >>>    "EventDescription": {
> >>>        "EventApplicationName": "",
> >>>        "Query": "SELECT * FROM MOVIES",
> >>>        "Information": [
> >>>            {
> >>>                "domainName": "English",
> >>>                "columns": [
> >>>                    {
> >>>                        "movieName": "Casablanca",
> >>>                        "duration": "154"
> >>>                    },
> >>>                    {
> >>>                        "movieName": "Die Hard",
> >>>                        "duration": "127"
> >>>                    }
> >>>                ]
> >>>            },
> >>>            {
> >>>                "domainName": "Hindi",
> >>>                "columns": [
> >>>                    {
> >>>                        "movieName": "DDLJ",
> >>>                        "duration": "176"
> >>>                    }
> >>>                ]
> >>>            }
> >>>        ]
> >>>    }
> >>> }
> >>> 
> >>> 
> >>> 
> >>> My function for indexing the object is as follows:
> >>> 
> >>> public static void indexJSON(JSONObject jsonOBJ) throws ParseException,
> >>> IOException, SolrServerException {
> >>>    Collection<SolrInputDocument> batch = new
> >>> ArrayList<SolrInputDocument>();
> >>> 
> >>>    SolrInputDocument mainEvent = new SolrInputDocument();
> >>>    mainEvent.addField("id", generateID());
> >>>    mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
> >>>    mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
> >>>    mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
> >>>    mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
> >>>    mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
> >>>    mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
> >>> 
> >>>    Object obj = parser.parse(jsonOBJ.get("User").toString());
> >>>    JSONObject userObj = (JSONObject) obj;
> >>> 
> >>>    SolrInputDocument childUserEvent = new SolrInputDocument();
> >>>    childUserEvent.addField("id", generateID());
> >>>    childUserEvent.addField("User", userObj.get("User"));
> >>> 
> >>>    obj = parser.parse(jsonOBJ.get("EventDescription").toString());
> >>>    JSONObject eventdescriptionObj = (JSONObject) obj;
> >>> 
> >>>    SolrInputDocument childEventDescEvent = new SolrInputDocument();
> >>>    childEventDescEvent.addField("id", generateID());
> >>>    childEventDescEvent.addField("EventApplicationName",
> >>>            eventdescriptionObj.get("EventApplicationName"));
> >>>    childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));
> >>> 
> >>>    obj = JSONValue.parse(eventdescriptionObj.get("Information").toString());
> >>>    JSONArray informationArray = (JSONArray) obj;
> >>> 
> >>>    for(int i = 0; i<informationArray.size(); i++){
> >>>        JSONObject domain = (JSONObject) informationArray.get(i);
> >>> 
> >>>        SolrInputDocument domainDoc = new SolrInputDocument();
> >>>        domainDoc.addField("id", generateID());
> >>>        domainDoc.addField("domainName", domain.get("domainName"));
> >>> 
> >>>        String s = domain.get("columns").toString();
> >>>        obj= JSONValue.parse(s);
> >>>        JSONArray ColumnsArray = (JSONArray) obj;
> >>> 
> >>>        SolrInputDocument columnsDoc = new SolrInputDocument();
> >>>        columnsDoc.addField("id", generateID());
> >>> 
> >>>        for(int j = 0; j<ColumnsArray.size(); j++){
> >>>            JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
> >>>            SolrInputDocument columnDoc = new SolrInputDocument();
> >>>            columnDoc.addField("id", generateID());
> >>>            columnDoc.addField("movieName", ColumnsObj.get("movieName"));
> >>>            columnsDoc.addChildDocument(columnDoc);
> >>>        }
> >>>        domainDoc.addChildDocument(columnsDoc);
> >>>        childEventDescEvent.addChildDocument(domainDoc);
> >>>    }
> >>> 
> >>>    mainEvent.addChildDocument(childEventDescEvent);
> >>>    mainEvent.addChildDocument(childUserEvent);
> >>>    batch.add(mainEvent);
> >>>    solr.add(batch);
> >>>    solr.commit();
> >>> }
> >>> 
> >>> When I try to index using the above code, I am able to index only 12
> >>> objects per second. Is there a faster way to do the indexing? I believe
> >>> I am using the json-fast parser, which is one of the fastest parsers
> >>> for JSON.
> >>> 
> >>> Your help will be very valuable to me.
> >>> 
> >>> Thanks,
> >>> Vineeth
> >> 
