On Mon, Sep 21, 2015, at 02:53 AM, Kevin Vasko wrote:
> I am new to Apache Solr and have been struggling with indexing some JSON
> files.
>
> I have several TB of twitter data in JSON format that I am having trouble
> posting/indexing. I am trying to use a schemaless schema so I don't have
> to add 200+ record fields manually.
>
> 1.
>
> The first issue is none of the records have '[' or ']' wrapped around the
> records. So it looks like this:
>
> { "created_at": "Sun Apr 19 23:45:45 +0000 2015","id":
> 5.899379634353e+17, "id_str": "589937963435302912",<truncated for
> mailing list>}
>
>
> Just to validate the schemaless portion was working I used a single
> "tweet" and trimmed it down to bare minimum. The brackets not being in
> the original appears to be a problem, as when I tried to process just a
> small portion of one record it required me to wrap the row in a [ ] (I
> assume to make it an array) to index correctly. Like the following:
>
> [{ "created_at": "Sun Apr 19 23:45:45 +0000 2015","id":
> 5.899379634353e+17, "id_str": "589937963435302912",<truncated for mailing
> list>}]
>
> Is there a way around this? I didn't want to preprocess the TB's of JSON
> data that is in this format to add '[', ',' and ']' around all of the
> data.
>
> 2.
>
> The second issue is some of the fields have null values.
> e.g. "in_reply_to_status_id": null,
>
> I think I figured a way to resolve this by manually adding the field as a
> "strings" type but if I miss one it will kick the file out. Just wanted
> to see if there was something I could add to the schemaless configuration
> to have it pick up null fields and replace them as strings automatically?
> Or is there a better way to handle this?
>
>
> 3.
> The last issue is, I think, my most difficult issue. Which is dealing with
> "nested" or "children" fields in my JSON data.
>
> The data looks like this. https://gist.github.com/gnip/764239. Is there
> any way to index this information, preferably automatically (schemaless
> method), without having to flatten all of my data?

1. Solr is designed to handle large amounts of content. You don't want
to be pushing documents one at a time, as you will be wasting huge
amounts of effort needlessly. Therefore, Solr assumes that when it
receives JSON, it will be in an array of documents. IIRC, when you post
an object {}, it will be considered a partial update instruction.

2. Don't rely upon the schemaless setup. Define your schema - you can't
actually live without one. Relying upon the data to work it out for you
is fraught with risk. Whether you define it via HTTP calls, or via
editing an XML file, is up to you. Just don't rely upon it correctly
guessing.

Also, when you have a 'null', the equivalent in Solr is to omit the
field. There is typically no concept in Solr for storing a null value.

3. Look at block joins, they may well help. But remember a Lucene index
is currently largely flat - you won't get anything like the versatility
out of it that you would from a relational database (in relation to
nested structures) as that isn't what it was designed for.

Really, you're gonna want to identify what you want OUT of your data,
and then identify a data structure that will allow you to achieve it.
You cannot assume that there is a standard way of doing it that will
support every use-case.

Upayavira
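
For points 1 and 2, a small streaming preprocessor may be enough to avoid rewriting the files on disk. What follows is only a sketch under assumptions not stated in the thread: that each input line holds exactly one complete JSON object, and that a batch size of 1000 is reasonable. The helper names drop_nulls and batches are made up for illustration and are not part of Solr.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def drop_nulls(value: Any) -> Any:
    """Recursively remove JSON nulls, since the Solr equivalent of
    null is simply to omit the field."""
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value if v is not None]
    return value


def batches(lines: Iterable[str], size: int = 1000) -> Iterator[List[Dict[str, Any]]]:
    """Group one-object-per-line JSON (an assumption about the input)
    into lists of documents, so each list serializes to a JSON array."""
    batch: List[Dict[str, Any]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        batch.append(drop_nulls(json.loads(line)))
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly short, batch
```

Each yielded batch can be passed to json.dumps() to get a '[ ... ]' array body for a single update request, so many documents go out per request instead of one at a time. Nested objects are left as they are here; how to model them is still the decision described in point 3.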