I am new to Apache Solr and have been struggling with indexing some JSON files.

I have several TB of Twitter data in JSON format that I am having trouble 
posting/indexing. I am trying to use schemaless mode so I don't have to add 
200+ record fields manually.

1.

The first issue is that none of the records are wrapped in '[' and ']'. 
They look like this:

 { "created_at": "Sun Apr 19 23:45:45 +0000 2015","id": 5.899379634353e+17, 
"id_str": "589937963435302912",<truncated for mailing list>}


Just to validate that the schemaless portion was working, I used a single 
"tweet" and trimmed it down to the bare minimum. The brackets missing from the 
original appear to be a problem: when I tried to process just a small portion 
of one record, I had to wrap it in [ ] (I assume to make it an array) for it to 
index correctly, like the following:

[{ "created_at": "Sun Apr 19 23:45:45 +0000 2015","id": 5.899379634353e+17, 
"id_str": "589937963435302912",<truncated for mailing list>}]

Is there a way around this? I don't want to preprocess the TBs of JSON data 
in this format just to add '[', ',' and ']' around all of the records.
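
For reference, this is roughly the on-the-fly wrapping I would otherwise have 
to script myself. It is only a minimal sketch: it assumes each record sits on 
its own line in the source files (I have not verified that for all of them), 
and the names iter_records and json_array_batches are made up for this 
example. It produces JSON array strings in batches and leaves the actual 
posting to Solr out.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-empty line (assumption: one record per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def json_array_batches(records: Iterable[dict], size: int = 1000) -> Iterator[str]:
    """Group records into chunks and serialize each chunk as one JSON array."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        # json.dumps on a list adds the '[', ',' and ']' that the raw files lack.
        yield json.dumps(chunk)
```

Passing an open file handle to iter_records would stream the file line by 
line, so nothing has to be rewritten on disk, but it is still an extra step I 
would rather avoid.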

2. 

The second issue is some of the fields have null values. 
e.g. "in_reply_to_status_id": null,

I think I found a way to resolve this by manually adding the field as a 
"strings" type, but if I miss one it will kick the file out. Is there something 
I could add to the schemaless configuration to have it pick up null fields and 
treat them as strings automatically? Or is there a better way to handle this?
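
The only fallback I can think of is stripping the nulls on my side before 
posting, along the lines of the sketch below. The function name drop_nulls is 
made up for this example, and it assumes that simply omitting a null field is 
acceptable for my data.

```python
from typing import Any


def drop_nulls(value: Any) -> Any:
    """Recursively remove null (None) entries from dicts and lists."""
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value if v is not None]
    # Scalars (strings, numbers, booleans) pass through unchanged.
    return value
```

With this, a record containing "in_reply_to_status_id": null would be sent 
without that key at all.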


3. 
The last issue is, I think, the most difficult one: dealing with "nested" 
or "children" fields in my JSON data.

The data looks like this: https://gist.github.com/gnip/764239. Is there any way 
to index this information, preferably automatically (schemaless method), without 
having to flatten all of my data?
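
To be clear about what I mean by flattening, this is the kind of preprocessing 
I am hoping to avoid. It is a rough sketch only: the dotted field names and 
the function name flatten are my own choices for illustration, and values 
from lists of child objects are simply collected into one multi-valued field.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested objects into a single level with dotted field names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse and carry the parent name as a prefix.
            flat.update(flatten(value, name, sep))
        elif isinstance(value, list) and any(isinstance(v, dict) for v in value):
            # List of child objects: gather each child field into a list.
            for item in value:
                if isinstance(item, dict):
                    for sub_name, sub_value in flatten(item, name, sep).items():
                        bucket = flat.setdefault(sub_name, [])
                        if isinstance(sub_value, list):
                            bucket.extend(sub_value)
                        else:
                            bucket.append(sub_value)
                else:
                    flat.setdefault(name, []).append(item)
        else:
            # Scalars and lists of scalars are kept as they are.
            flat[name] = value
    return flat
```

This loses the parent/child grouping between values, which is part of why I 
would prefer a way to index the nested structure directly.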

Thanks.
