Hi,

I am trying to fit the following data in Solr to support flexible queries
and would like to get your input on the same. I have data about users say:

contentID (assume uuid),
platform (eg. website, mobile etc),
softwareVersion (eg. sw1.1, sw2.5, ..etc),
regionId (eg. us144, uk123, etc..)
....

and few more other such fields. This data is partially pre aggregated (read
Hadoop jobs): so let’s assume for "contentID = uuid123 and platform =
mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in
format:

timestamp  pre-aggregated data [ uniques, total]
 Jan 15    [ 12, 4]
 Jan 14    [ 4, 3]
 Jan 13    [ 8, 7]
 ...        ...

And then I also have less granular data say "contentID = uuid123 and
platform = mobile and softwareVersion = ANY and regionId = ANY (These
values will be more than above table since granularity is reduced)

timestamp : pre-aggregated data [uniques, total]
 Jan 15    [ 100, 40]
 Jan 14    [ 45, 30]
 ...           ...

I'll get queries like "contentID = uuid123 and platform = mobile" , give
sum of 'uniques' for Jan15 - Jan13 or for "contentID=uuid123 and
platform=mobile and softwareVersion=sw1.2", give sum of 'total' for Jan15 -
Jan01.

I was thinking of simple schema where documents will be like (first example
above):

{
  "contentID": "uuid12349789",
  "platform" : "mobile",
  "softwareVersion": "sw1.2",
  "regionId": "ANY",
  "ts" : "2017-01-15T01:01:21Z",
  "unique": 12,
  "total": 4
}

second example from above:

{
  "contentID": "uuid12349789",
  "platform" : "mobile",
  "softwareVersion": "ANY",
  "regionId": "ANY",
  "ts" : "2017-01-15T01:01:21Z",
  "unique": 100,
  "total": 40
}

Possible optimization:

{
  "contentID": "uuid12349789",
  "platform.mobile.softwareVersion.sw1.2.region.us12" : {
      "unique": 12,
      "total": 4
  },
 "platform.mobile.softwareVersion.sw1.2.region.ANY" : {
      "unique": 100,
      "total": 40
  },
  "ts" : "2017-01-15T01:01:21Z"
  }

Challenges: Number of such rows is very large and it'll grow exponentially
with every new field - For instance if I go with above suggested schema,
I'll end up storing a new document for each combination of
contentID,platform,softwareVersion,regionId. Now if we throw in another
field to this document, number of combinations increase exponentially.I
have more than a billion such combination rows already.

I am hoping to find advice by experts if

   1. Multiple such fields can be fit in same document for different 'ts'
   such that range queries are possible on it.
   2. time range (ts) can be fit in same document as a list(?) (to reduce
   number of rows). I know multivalued fields don't support complex data
   types, but if anything else can be done with the data/schema to reduce
   query time and number of rows.

The number of these rows are very large, for sure more than 1billion (if we
go with the schema I was suggesting). What schema would you suggest for
this that'll fit query requirements?

FYI: All queries will be exact match on fields (no partial or tokenized),
so no analysis on fields is necessary. And almost all queries are range
queries.

Thanks,

KP

Reply via email to