I am new to Apache Solr and have been struggling with indexing some JSON files.
I have several TB of Twitter data in JSON format that I am having trouble posting/indexing. I am trying to use schemaless mode so I don't have to add 200+ fields manually.

1. The first issue is that none of the records are wrapped in '[' and ']'. The data looks like this:

{ "created_at": "Sun Apr 19 23:45:45 +0000 2015","id": 5.899379634353e+17, "id_str": "589937963435302912",<truncated for mailing list>}

To validate that the schemaless portion was working, I took a single tweet and trimmed it down to the bare minimum. The missing brackets in the original appear to be a problem: when I tried to process just a small portion of one record, I had to wrap it in [ ] (I assume to make it an array) for it to index correctly, like the following:

[{ "created_at": "Sun Apr 19 23:45:45 +0000 2015","id": 5.899379634353e+17, "id_str": "589937963435302912",<truncated for mailing list>}]

Is there a way around this? I don't want to preprocess TBs of JSON data in this format just to add '[', ',' and ']' around all of the records.

2. The second issue is that some of the fields have null values, e.g. "in_reply_to_status_id": null. I think I found a way to resolve this by manually adding the field with the "strings" type, but if I miss one, the file gets rejected. Is there something I can add to the schemaless configuration so that null fields are picked up and treated as strings automatically? Or is there a better way to handle this?

3. The last issue is, I think, the most difficult one: dealing with "nested" or "children" fields in my JSON data. The data looks like this: https://gist.github.com/gnip/764239. Is there any way to index this information, preferably automatically (schemaless method), without having to flatten all of my data?

Thanks.
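
P.S. For reference, the preprocessing I am hoping to avoid would look roughly like the sketch below. It is only an illustration under my own assumptions, not something I have run against the full data set: the function names (iter_json_objects, flatten, to_solr_batch), the dotted field names such as "user.screen_name", and the sample values are all made up for this example. It reads JSON objects that sit back to back without brackets, drops null-valued fields, flattens nested objects, and emits one JSON array that could be posted as a batch.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from text holding objects back to back."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so step over
        # whitespace (and any stray commas) between records ourselves.
        while pos < end and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def flatten(record, prefix="", sep="."):
    """Flatten nested dicts into dotted keys and drop null values.

    Lists are passed through unchanged; lists of objects would still
    need separate handling.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if value is None:
            continue  # skip nulls so no type has to be guessed for them
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def to_solr_batch(text):
    """Return a JSON array string built from the unbracketed records."""
    return json.dumps([flatten(obj) for obj in iter_json_objects(text)])


if __name__ == "__main__":
    # Made-up sample: two records with no surrounding brackets.
    sample = (
        '{"id_str": "1", "in_reply_to_status_id": null, '
        '"user": {"screen_name": "example"}}\n'
        '{"id_str": "2"}'
    )
    print(to_solr_batch(sample))
```

Doing this over several TB means reading and rewriting every file, which is why I would prefer a way to handle it on the Solr side.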