On Mon, Sep 21, 2015, at 02:53 AM, Kevin Vasko wrote:
> I am new to Apache Solr and have been struggling with indexing some JSON
> files.
> 
> I have several TB of twitter data in JSON format that I am having trouble
> posting/indexing. I am trying to use a schemaless schema so I don't have
> to add 200+ record fields manually.
> 
> 1.
> 
> The first issue is none of the records have '[' or ']' wrapped around the
> records. So it looks like this:
> 
>  { "created_at": "Sun Apr 19 23:45:45 +0000 2015","id":
>  5.899379634353e+17, "id_str": "589937963435302912",<truncated for
>  mailing list>}
> 
> 
> Just to validate the schemaless portion was working I used a single
> "tweet" and trimmed it down to the bare minimum. The brackets not being in
> the original appears to be a problem: when I tried to process just a
> small portion of one record, I had to wrap the row in [ ] (I
> assume to make it an array) for it to index correctly, like the following:
> 
> [{ "created_at": "Sun Apr 19 23:45:45 +0000 2015","id":
> 5.899379634353e+17, "id_str": "589937963435302912",<truncated for mailing
> list>}]
> 
> Is there a way around this? I didn't want to preprocess the TBs of JSON
> data that is in this format to add '[', ',' and ']' around all of the
> data.
> 
> 2. 
> 
> The second issue is some of the fields have null values. 
> e.g. "in_reply_to_status_id": null,
> 
> I think I figured out a way to resolve this by manually adding the field as a
> "strings" type, but if I miss one it will kick the file out. Just wanted
> to see if there was something I could add to the schemaless configuration
> to have it pick up null fields and treat them as strings automatically?
> Or is there a better way to handle this?
> 
> 
> 3. 
> The last issue is, I think, my most difficult one: dealing with
> "nested" or "children" fields in my JSON data.
> 
> The data looks like this: https://gist.github.com/gnip/764239. Is there
> any way to index this information, preferably automatically (schemaless
> method), without having to flatten all of my data?


1. Solr is designed to handle large amounts of content. You don't want
to be pushing documents one at a time, as you will be wasting huge
amounts of effort needlessly. Therefore, Solr assumes that when it
receives JSON, it will be in an array of documents. IIRC, when you post
an object {}, it will be considered a partial update instruction.
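
As a rough illustration of batching, here is a minimal Python sketch that
groups records into JSON arrays before they are posted. It assumes the
files hold one JSON object per line, which the original post does not
confirm; the function name `batches` and the sample records are
hypothetical.

```python
import io
import json
from typing import Iterable, Iterator, List


def batches(lines: Iterable[str], size: int) -> Iterator[str]:
    """Yield JSON array strings, each wrapping up to `size` documents."""
    batch: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        batch.append(json.loads(line))
        if len(batch) == size:
            yield json.dumps(batch)
            batch = []
    if batch:
        yield json.dumps(batch)  # final, possibly smaller, batch


# Hypothetical sample input: one object per line, no enclosing brackets.
sample = io.StringIO(
    '{"id_str": "1", "text": "a"}\n'
    '{"id_str": "2", "text": "b"}\n'
    '{"id_str": "3", "text": "c"}\n'
)
payloads = list(batches(sample, size=2))
```

Each payload is then a complete array that could be sent to the update
handler as one request, so the source files themselves never need to be
rewritten. A real batch size would likely be in the hundreds or
thousands rather than 2.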

2. Don't rely upon the schemaless setup. Define your schema - you can't
actually live without one. Relying upon the data to work it out for you
is fraught with risk. Whether you define it via HTTP calls, or via
editing an XML file, is up to you. Just don't rely upon it correctly
guessing.
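
If you go the HTTP route, the request body for the Schema API is a JSON
command object. The sketch below only builds such a body; it assumes the
"add-field" command of the Schema API, and the field type names used
("string", "strings") are assumptions that must match types defined in
your own configuration. The helper name `add_field_command` is
hypothetical.

```python
import json
from typing import Dict, List


def add_field_command(fields: Dict[str, str]) -> str:
    """Build an "add-field" request body from a name -> type mapping."""
    commands: List[dict] = [
        {"name": name, "type": field_type, "stored": True}
        for name, field_type in fields.items()
    ]
    return json.dumps({"add-field": commands})


# Field names taken from the question; the types are assumptions.
payload = add_field_command(
    {"id_str": "string", "in_reply_to_status_id": "strings"}
)
```

The resulting string would be posted to the collection's schema endpoint
with a JSON content type. Keeping the name-to-type mapping in one place
also gives you an explicit record of what the schema is supposed to be.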

Also, when you have a 'null', the equivalent in Solr is to omit the
field. There is typically no concept in Solr for storing a null value.
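
Omitting the field can be done on the client side before posting. A
minimal sketch, assuming the records are parsed with Python's json
module (so JSON null becomes None); the helper name `drop_nulls` and the
"user" sub-object in the sample are illustrative only.

```python
from typing import Any


def drop_nulls(value: Any) -> Any:
    """Recursively remove None values from dicts and lists."""
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value if v is not None]
    return value


# Illustrative record; only "in_reply_to_status_id" comes from the question.
tweet = {
    "id_str": "589937963435302912",
    "in_reply_to_status_id": None,
    "user": {"name": "x", "location": None},
}
cleaned = drop_nulls(tweet)
```

After this step the null-valued keys are simply absent, so no field type
has to be guessed for them.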

3. Look at block joins; they may well help. But remember a Lucene index
is currently largely flat - you won't get anything like the versatility
out of it that you would from a relational database (in relation to
nested structures) as that isn't what it was designed for. Really,
you're gonna want to identify what you want OUT of your data, and then
identify a data structure that will allow you to achieve it. You cannot
assume that there is a standard way of doing it that will support every
use-case.
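
One possible data structure, if block joins turn out not to fit, is to
flatten each nested object into prefixed field names on the client side.
This is a sketch of that alternative, not of block joins; the helper
name `flatten`, the "." separator and the sample field names are all
assumptions.

```python
from typing import Any, Dict


def flatten(doc: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Collapse nested dicts into a single level of prefixed keys."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # scalars and lists are kept as they are
    return flat


# Illustrative nested record; the sub-fields are hypothetical.
nested = {
    "id_str": "589937963435302912",
    "user": {"screen_name": "example", "geo": {"enabled": False}},
}
flat = flatten(nested)
```

Lists are passed through unchanged here. A list of plain values could
map onto a multi-valued field, but a list of objects loses the grouping
of its members when flattened, which is the kind of case where you need
to decide what you want out of the data first.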

Upayavira 
