Re: Solr schema design: fitting time-series data

2017-01-17 Thread map reduced
That's a good point Alex, about indexed vs stored. Since all my queries are
exact match, I can just set them stored=false to save space. I believe
that helps: since there are billions of rows, it'll hopefully save
quite a lot of space.
But nothing can be done about squeezing dates into the same document, right?

Thanks for the reply.


On Tue, Jan 17, 2017 at 10:50 AM, Alexandre Rafalovitch 
wrote:

> On 16 January 2017 at 00:54, map reduced  wrote:
> > some way to squeeze timestamps into a single
> > document so that it doesn't increase the number of documents by a lot
> > and I am still able to range query on 'ts'.
>
> Would DateRangeField be useful here?
> https://cwiki.apache.org/confluence/display/solr/Working+with+Dates
>
> Also, if the fields are indexed and not stored, more records with the
> same values is not such a big deal because they effectively just add
> more indexes from the token tables. Of course, I am not sure whether
> this advice scales as much as your specific use case requires, but it
> is just something to keep in mind.
>
> Regards,
>Alex.
>
>
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>


Re: Solr schema design: fitting time-series data

2017-01-17 Thread Alexandre Rafalovitch
On 16 January 2017 at 00:54, map reduced  wrote:
> some way to squeeze timestamps into a single
> document so that it doesn't increase the number of documents by a lot and I
> am still able to range query on 'ts'.

Would DateRangeField be useful here?
https://cwiki.apache.org/confluence/display/solr/Working+with+Dates
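A sketch of what a range query on such a field could look like, assuming a DateRangeField (or plain date field) named 'ts' and a hypothetical collection 'stats'; built with Python's stdlib only:

```python
from urllib.parse import urlencode

# Exact-match filters plus a range filter on 'ts'. With DateRangeField,
# truncated dates like 2017-01-13 are legal endpoints per the docs above.
params = {
    "q": "*:*",
    "fq": [
        "contentID:uuid123",
        "platform:mobile",
        "ts:[2017-01-13 TO 2017-01-15]",
    ],
    "rows": 0,
}
query_string = urlencode(params, doseq=True)
print("http://localhost:8983/solr/stats/select?" + query_string)
```

DateRangeField also accepts truncated single values (e.g. ts:2017-01) as query terms; verify the exact syntax against the Working with Dates page above.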

Also, if the fields are indexed and not stored, more records with the
same values is not such a big deal because they effectively just add
more indexes from the token tables. Of course, I am not sure whether
this advice scales as much as your specific use case requires, but it
is just something to keep in mind.

Regards,
   Alex.



http://www.solr-start.com/ - Resources for Solr users, new and experienced


Re: Solr schema design: fitting time-series data

2017-01-17 Thread map reduced
Does anyone have any idea?

On Sun, Jan 15, 2017 at 9:54 PM, map reduced  wrote:

> I may have used the wrong terminology; by complex types I meant non-primitive
> types. Multivalued can be conceptualized as a list of values, for instance
> in your example myint = [32, 77], which you can possibly analyze and
> query upon. What I was trying to ask is whether a complex type can be
> multi-valued, or something along those lines that can be supported by range
> queries.
>
> For instance: Below rows will have to be individual docs in Solr (in my
> knowledge) -  If I want to range query from ts=Jan 12 to ts=Jan 15 give me
> sum of 'unique' where 'contentId=1,product=mobile'
>
> contentId=1,product=mobile  ts=Jan15  total=12,unique=5
> contentId=1,product=mobile  ts=Jan14  total=10,unique=3
> contentId=1,product=mobile  ts=Jan13  total=15,unique=2
> contentId=1,product=mobile  ts=Jan12  total=17,unique=4
> ..
>
> This increases number of documents in Solr by a lot. Only if there was a
> way to do something like:
>
> {
>   contentId=1
>   product=mobile
>   ts = [
>     { time = Jan15, total = 12, unique = 15 },
>     { time = Jan16, total = 10, unique = 3 },
>     ..
>     ..
>   ]
> }
>
> Of course the above isn't allowed, but some way to squeeze timestamps into a
> single document so that it doesn't increase the number of documents by a lot
> and I am still able to range query on 'ts'.
>
> For some (combination of fields) rows the timestamps may go up to the last
> 3-6 months!
>
> Let me know if I am still being unclear.
>
> On Sun, Jan 15, 2017 at 8:04 PM, Erick Erickson 
> wrote:
>
>> bq: I know multivalued fields don't support complex data types
>>
>> Not sure what you're talking about here. multiValued actually has
>> nothing to do with data types. You can have text fields which
>> are analyzed and produce multiple tokens and are multiValued.
>> You can have primitive types (string, int/long/float/double,
>> boolean etc) that are multiValued, or they can be single valued.
>>
>> All "multiValued" means is that the _input_ can have the same field
>> repeated, i.e.
>> <doc>
>>   <field name="mytext">some stuff</field>
>>   <field name="mytext">more stuff</field>
>>   <field name="myint">77</field>
>> </doc>
>>
>> This doc would fail if mytext or myint were multiValued=false but
>> succeed if multiValued=true at index time.
>>
>> There are some subtleties with text (analyzed) multivalued fields having
>> to do with token offsets, but that's not germane.
>>
>> Does that change your problem? Your document could have a dozen
>> timestamps
>>
>> However, there isn't a good way to query across multiple multivalued
>> fields
>> in parallel. That is, a doc like
>>
>> myint=1
>> myint=2
>> myint=3
>> mylong=4
>> mylong=5
>> mylong=6
>>
>> there's no good way to say "only match this document if myint=1 AND
>> mylong=4 AND they_are_both_in_the_same_position".
>>
>> That is, asking for myint=1 AND mylong=6 would match the above. Is
>> that what you're
>> wondering about?
>>
>> --
>> I expect you're really asking to do the second above, in which case you
>> might
>> want to look at StreamingExpressions and/or ParallelSQL in Solr 6.x
>>
>> Best,
>> Erick
>>
>> On Sun, Jan 15, 2017 at 7:31 PM, map reduced  wrote:
>> > Hi,
>> >
>> > I am trying to fit the following data in Solr to support flexible
>> > queries and would like to get your input on the same. I have data about
>> > users say:
>> >
>> > contentID (assume uuid),
>> > platform (eg. website, mobile etc),
>> > softwareVersion (eg. sw1.1, sw2.5, ..etc),
>> > regionId (eg. us144, uk123, etc..)
>> > 
>> >
>> > and a few more such fields. This data is partially pre-aggregated (read:
>> > Hadoop jobs): so let’s assume for "contentID = uuid123 and platform =
>> > mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in this
>> > format:
>> >
>> > timestamp   pre-aggregated data [uniques, total]
>> > Jan 15      [12, 4]
>> > Jan 14      [4, 3]
>> > Jan 13      [8, 7]
>> > ...         ...
>> >
>> > And then I also have less granular data, say "contentID = uuid123 and
>> > platform = mobile and softwareVersion = ANY and regionId = ANY" (these
>> > values will be more than the above table since granularity is reduced):
>> >
>> > timestamp   pre-aggregated data [uniques, total]
>> > Jan 15      [100, 40]
>> > Jan 14      [45, 30]
>> > ...         ...
>> >
>> > I'll get queries like "contentID = uuid123 and platform = mobile", give
>> > sum of 'uniques' for Jan15 - Jan13; or for "contentID=uuid123 and
>> > platform=mobile and softwareVersion=sw1.2", give sum of 'total' for
>> > Jan15 - Jan01.
>> >
>> > I was thinking of a simple schema where documents will be like (first
>> > example above):
>> >
>> > {
>> >   "contentID": "uuid12349789",
>> >   "platform" : "mobile",
>> >   "softwareVersion": "sw1.2",
>> >   "regionId": "ANY",
>> >   "ts" : "2017-01-15T01:01:21Z",
>> >   "unique": 12,
>> >   "total": 4
>> > }
>> >
>> > second example from above:
>> >
>> > {
>> >   "contentID": "uuid12349789",
>> >   "platform" : "mobile",
>> >   "so

Re: Solr schema design: fitting time-series data

2017-01-15 Thread map reduced
I may have used the wrong terminology; by complex types I meant non-primitive
types. Multivalued can be conceptualized as a list of values, for instance
in your example myint = [32, 77], which you can possibly analyze and
query upon. What I was trying to ask is whether a complex type can be
multi-valued, or something along those lines that can be supported by range
queries.

For instance: Below rows will have to be individual docs in Solr (in my
knowledge) -  If I want to range query from ts=Jan 12 to ts=Jan 15 give me
sum of 'unique' where 'contentId=1,product=mobile'

contentId=1,product=mobile  ts=Jan15  total=12,unique=5
contentId=1,product=mobile  ts=Jan14  total=10,unique=3
contentId=1,product=mobile  ts=Jan13  total=15,unique=2
contentId=1,product=mobile  ts=Jan12  total=17,unique=4
..
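The "sum of 'unique' over a ts range" query described above maps fairly directly onto Solr's JSON Facet API while keeping one document per row. A sketch, with a hypothetical collection name 'stats' and full ISO timestamps assumed for 'ts':

```python
import json
from urllib.parse import urlencode

# One document per (contentId, product, ts) row, as in the thread.
# Filter to the exact-match dimensions plus the ts range, return no rows,
# and let the JSON Facet API compute the sum of 'unique' over the matches.
params = {
    "q": "*:*",
    "fq": [
        "contentId:1",
        "product:mobile",
        "ts:[2017-01-12T00:00:00Z TO 2017-01-15T23:59:59Z]",
    ],
    "rows": 0,
    "json.facet": json.dumps({"sum_unique": "sum(unique)"}),
}
print("http://localhost:8983/solr/stats/select?" + urlencode(params, doseq=True))
```

The response would carry the aggregate under "facets"; this doesn't reduce the document count, but it does answer the query in one request.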

This increases number of documents in Solr by a lot. Only if there was a
way to do something like:

{
  contentId=1
  product=mobile
  ts = [
    { time = Jan15, total = 12, unique = 15 },
    { time = Jan16, total = 10, unique = 3 },
    ..
    ..
  ]
}

Of course the above isn't allowed, but some way to squeeze timestamps into a
single document so that it doesn't increase the number of documents by a lot
and I am still able to range query on 'ts'.
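One possibility the thread doesn't raise: Solr's nested child documents can group the per-timestamp entries under one parent, queryable with block-join syntax such as {!parent which="doc_type:parent"}. Note that this only restructures the data; Lucene still stores one document per child, so it may not shrink the index. A sketch of such a payload (field names and the 'doc_type' discriminator are assumptions):

```python
import json

# Parent doc carries the invariant dimensions; each child carries one
# day's stats. 'doc_type' is a hypothetical discriminator field used by
# block-join queries like: {!parent which="doc_type:parent"}ts:[A TO B]
parent = {
    "id": "content1-mobile",
    "doc_type": "parent",
    "contentId": "1",
    "product": "mobile",
    "_childDocuments_": [
        {"id": "content1-mobile-0115", "doc_type": "child",
         "ts": "2017-01-15T00:00:00Z", "total": 12, "unique": 5},
        {"id": "content1-mobile-0114", "doc_type": "child",
         "ts": "2017-01-14T00:00:00Z", "total": 10, "unique": 3},
    ],
}
payload = json.dumps([parent], indent=2)
print(payload)  # POST to /update; Solr indexes parent and children as a block
```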

For some (combination of fields) rows the timestamps may go up to the last
3-6 months!

Let me know if I am still being unclear.

On Sun, Jan 15, 2017 at 8:04 PM, Erick Erickson 
wrote:

> bq: I know multivalued fields don't support complex data types
>
> Not sure what you're talking about here. multiValued actually has
> nothing to do with data types. You can have text fields which
> are analyzed and produce multiple tokens and are multiValued.
> You can have primitive types (string, int/long/float/double,
> boolean etc) that are multiValued, or they can be single valued.
>
> All "multiValued" means is that the _input_ can have the same field
> repeated, i.e.
> <doc>
>   <field name="mytext">some stuff</field>
>   <field name="mytext">more stuff</field>
>   <field name="myint">77</field>
> </doc>
>
> This doc would fail if mytext or myint were multiValued=false but
> succeed if multiValued=true at index time.
>
> There are some subtleties with text (analyzed) multivalued fields having
> to do with token offsets, but that's not germane.
>
> Does that change your problem? Your document could have a dozen
> timestamps
>
> However, there isn't a good way to query across multiple multivalued fields
> in parallel. That is, a doc like
>
> myint=1
> myint=2
> myint=3
> mylong=4
> mylong=5
> mylong=6
>
> there's no good way to say "only match this document if myint=1 AND
> mylong=4 AND they_are_both_in_the_same_position".
>
> That is, asking for myint=1 AND mylong=6 would match the above. Is
> that what you're
> wondering about?
>
> --
> I expect you're really asking to do the second above, in which case you
> might
> want to look at StreamingExpressions and/or ParallelSQL in Solr 6.x
>
> Best,
> Erick
>
> On Sun, Jan 15, 2017 at 7:31 PM, map reduced  wrote:
> > Hi,
> >
> > I am trying to fit the following data in Solr to support flexible queries
> > and would like to get your input on the same. I have data about users
> > say:
> >
> > contentID (assume uuid),
> > platform (eg. website, mobile etc),
> > softwareVersion (eg. sw1.1, sw2.5, ..etc),
> > regionId (eg. us144, uk123, etc..)
> > 
> >
> > and a few more such fields. This data is partially pre-aggregated (read:
> > Hadoop jobs): so let’s assume for "contentID = uuid123 and platform =
> > mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in this
> > format:
> >
> > timestamp   pre-aggregated data [uniques, total]
> > Jan 15      [12, 4]
> > Jan 14      [4, 3]
> > Jan 13      [8, 7]
> > ...         ...
> >
> > And then I also have less granular data, say "contentID = uuid123 and
> > platform = mobile and softwareVersion = ANY and regionId = ANY" (these
> > values will be more than the above table since granularity is reduced):
> >
> > timestamp   pre-aggregated data [uniques, total]
> > Jan 15      [100, 40]
> > Jan 14      [45, 30]
> > ...         ...
> >
> > I'll get queries like "contentID = uuid123 and platform = mobile", give
> > sum of 'uniques' for Jan15 - Jan13; or for "contentID=uuid123 and
> > platform=mobile and softwareVersion=sw1.2", give sum of 'total' for
> > Jan15 - Jan01.
> >
> > I was thinking of a simple schema where documents will be like (first
> > example above):
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform" : "mobile",
> >   "softwareVersion": "sw1.2",
> >   "regionId": "ANY",
> >   "ts" : "2017-01-15T01:01:21Z",
> >   "unique": 12,
> >   "total": 4
> > }
> >
> > second example from above:
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform" : "mobile",
> >   "softwareVersion": "ANY",
> >   "regionId": "ANY",
> >   "ts" : "2017-01-15T01:01:21Z",
> >   "unique": 100,
> >   "total": 40
> > }
> >
> > Possible optimization:
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform.mobile.softwareVersion.sw1.2.region.us12" : {
> >   "unique": 12,
>

Re: Solr schema design: fitting time-series data

2017-01-15 Thread Erick Erickson
bq: I know multivalued fields don't support complex data types

Not sure what you're talking about here. multiValued actually has
nothing to do with data types. You can have text fields which
are analyzed and produce multiple tokens and are multiValued.
You can have primitive types (string, int/long/float/double,
boolean etc) that are multiValued, or they can be single valued.

All "multiValued" means is that the _input_ can have the same field
repeated, i.e.
<doc>
  <field name="mytext">some stuff</field>
  <field name="mytext">more stuff</field>
  <field name="myint">77</field>
</doc>

This doc would fail if mytext or myint were multiValued=false but
succeed if multiValued=true at index time.
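The same point in Solr's JSON update format (rather than XML): a multiValued field is simply sent as a JSON list. A minimal sketch using Erick's field names (the 'id' value is made up):

```python
import json

# Solr's JSON update format expresses a repeated (multiValued) field as a
# JSON list; single-valued fields stay scalars.
doc = {
    "id": "doc1",
    "mytext": ["some stuff", "more stuff"],  # needs multiValued=true
    "myint": 77,
}
payload = json.dumps([doc])
print(payload)  # POST to /update with Content-Type: application/json
```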

There are some subtleties with text (analyzed) multivalued fields having
to do with token offsets, but that's not germane.

Does that change your problem? Your document could have a dozen
timestamps

However, there isn't a good way to query across multiple multivalued fields
in parallel. That is, a doc like

myint=1
myint=2
myint=3
mylong=4
mylong=5
mylong=6

there's no good way to say "only match this document if myint=1 AND
mylong=4 AND they_are_both_in_the_same_position".

That is, asking for myint=1 AND mylong=6 would match the above. Is
that what you're
wondering about?
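This cross-match behavior is easy to model outside Solr. A toy sketch in plain Python (no Solr involved) of why myint=1 AND mylong=6 matches the doc above:

```python
# Toy model of multiValued matching: a doc matches field:value if the value
# appears ANYWHERE in that field's list -- positions are never paired up.
doc = {"myint": [1, 2, 3], "mylong": [4, 5, 6]}

def matches(doc, **terms):
    """AND together one term per field, position-blind, like Solr does."""
    return all(value in doc[field] for field, value in terms.items())

# 1 and 6 come from different "rows" of the original data, yet the
# single flattened document still matches both terms:
print(matches(doc, myint=1, mylong=6))  # True: the cross-match problem
print(matches(doc, myint=1, mylong=7))  # False
```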

--
I expect you're really asking to do the second above, in which case you might
want to look at StreamingExpressions and/or ParallelSQL in Solr 6.x
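For what it's worth, a ParallelSQL request in Solr 6.x goes to the /sql handler as a 'stmt' parameter; a rough sketch using this thread's field names (the collection name 'stats' is hypothetical, and the exact SQL dialect support should be verified against the Solr 6 docs):

```python
from urllib.parse import urlencode

# Hypothetical collection 'stats'. aggregationMode=facet pushes the
# aggregation down into Solr's JSON Facet engine rather than streaming
# raw tuples out.
sql = ("SELECT contentID, sum(total) FROM stats "
       "WHERE platform = 'mobile' AND softwareVersion = 'sw1.2' "
       "GROUP BY contentID")
encoded = urlencode({"stmt": sql, "aggregationMode": "facet"})
print("http://localhost:8983/solr/stats/sql?" + encoded)
```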

Best,
Erick

On Sun, Jan 15, 2017 at 7:31 PM, map reduced  wrote:
> Hi,
>
> I am trying to fit the following data in Solr to support flexible queries
> and would like to get your input on the same. I have data about users say:
>
> contentID (assume uuid),
> platform (eg. website, mobile etc),
> softwareVersion (eg. sw1.1, sw2.5, ..etc),
> regionId (eg. us144, uk123, etc..)
> 
>
> and a few more such fields. This data is partially pre-aggregated (read:
> Hadoop jobs): so let’s assume for "contentID = uuid123 and platform =
> mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in this
> format:
>
> timestamp   pre-aggregated data [uniques, total]
> Jan 15      [12, 4]
> Jan 14      [4, 3]
> Jan 13      [8, 7]
> ...         ...
>
> And then I also have less granular data, say "contentID = uuid123 and
> platform = mobile and softwareVersion = ANY and regionId = ANY" (these
> values will be more than the above table since granularity is reduced):
>
> timestamp   pre-aggregated data [uniques, total]
> Jan 15      [100, 40]
> Jan 14      [45, 30]
> ...         ...
>
> I'll get queries like "contentID = uuid123 and platform = mobile", give
> sum of 'uniques' for Jan15 - Jan13; or for "contentID=uuid123 and
> platform=mobile and softwareVersion=sw1.2", give sum of 'total' for
> Jan15 - Jan01.
>
> I was thinking of a simple schema where documents will be like (first
> example above):
>
> {
>   "contentID": "uuid12349789",
>   "platform" : "mobile",
>   "softwareVersion": "sw1.2",
>   "regionId": "ANY",
>   "ts" : "2017-01-15T01:01:21Z",
>   "unique": 12,
>   "total": 4
> }
>
> second example from above:
>
> {
>   "contentID": "uuid12349789",
>   "platform" : "mobile",
>   "softwareVersion": "ANY",
>   "regionId": "ANY",
>   "ts" : "2017-01-15T01:01:21Z",
>   "unique": 100,
>   "total": 40
> }
>
> Possible optimization:
>
> {
>   "contentID": "uuid12349789",
>   "platform.mobile.softwareVersion.sw1.2.region.us12" : {
>   "unique": 12,
>   "total": 4
>   },
>  "platform.mobile.softwareVersion.sw1.2.region.ANY" : {
>   "unique": 100,
>   "total": 40
>   },
>   "ts" : "2017-01-15T01:01:21Z"
>   }
>
> Challenges: The number of such rows is very large and it'll grow
> exponentially with every new field - for instance, if I go with the above
> suggested schema, I'll end up storing a new document for each combination of
> contentID, platform, softwareVersion, regionId. Now if we throw in another
> field to this document, the number of combinations increases exponentially. I
> have more than a billion such combination rows already.
>
> I am hoping to find advice from experts on whether:
>
>    1. Multiple such fields can fit in the same document for different 'ts'
>    values, such that range queries are possible on them.
>    2. A time range (ts) can fit in the same document as a list(?) (to reduce
>    the number of rows). I know multivalued fields don't support complex data
>    types, but perhaps something else can be done with the data/schema to
>    reduce query time and the number of rows.
>
> The number of these rows is very large, for sure more than 1 billion (if we
> go with the schema I was suggesting). What schema would you suggest for
> this that'll fit the query requirements?
>
> FYI: All queries will be exact match on fields (no partial or tokenized),
> so no analysis on fields is necessary. And almost all queries are range
> queries.
>
> Thanks,
>
> KP


Solr schema design: fitting time-series data

2017-01-15 Thread map reduced
Hi,

I am trying to fit the following data in Solr to support flexible queries
and would like to get your input on the same. I have data about users say:

contentID (assume uuid),
platform (eg. website, mobile etc),
softwareVersion (eg. sw1.1, sw2.5, ..etc),
regionId (eg. us144, uk123, etc..)


and a few more such fields. This data is partially pre-aggregated (read:
Hadoop jobs): so let’s assume for "contentID = uuid123 and platform =
mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in this
format:

timestamp   pre-aggregated data [uniques, total]
Jan 15      [12, 4]
Jan 14      [4, 3]
Jan 13      [8, 7]
...         ...

And then I also have less granular data, say "contentID = uuid123 and
platform = mobile and softwareVersion = ANY and regionId = ANY" (these
values will be more than the above table since granularity is reduced):

timestamp   pre-aggregated data [uniques, total]
Jan 15      [100, 40]
Jan 14      [45, 30]
...         ...

I'll get queries like "contentID = uuid123 and platform = mobile", give
sum of 'uniques' for Jan15 - Jan13; or for "contentID=uuid123 and
platform=mobile and softwareVersion=sw1.2", give sum of 'total' for
Jan15 - Jan01.

I was thinking of a simple schema where documents will be like (first
example above):

{
  "contentID": "uuid12349789",
  "platform" : "mobile",
  "softwareVersion": "sw1.2",
  "regionId": "ANY",
  "ts" : "2017-01-15T01:01:21Z",
  "unique": 12,
  "total": 4
}

second example from above:

{
  "contentID": "uuid12349789",
  "platform" : "mobile",
  "softwareVersion": "ANY",
  "regionId": "ANY",
  "ts" : "2017-01-15T01:01:21Z",
  "unique": 100,
  "total": 40
}

Possible optimization:

{
  "contentID": "uuid12349789",
  "platform.mobile.softwareVersion.sw1.2.region.us12" : {
  "unique": 12,
  "total": 4
  },
 "platform.mobile.softwareVersion.sw1.2.region.ANY" : {
  "unique": 100,
  "total": 40
  },
  "ts" : "2017-01-15T01:01:21Z"
  }

Challenges: The number of such rows is very large and it'll grow
exponentially with every new field - for instance, if I go with the above
suggested schema, I'll end up storing a new document for each combination of
contentID, platform, softwareVersion, regionId. Now if we throw in another
field to this document, the number of combinations increases exponentially. I
have more than a billion such combination rows already.

I am hoping to find advice from experts on whether:

   1. Multiple such fields can fit in the same document for different 'ts'
   values, such that range queries are possible on them.
   2. A time range (ts) can fit in the same document as a list(?) (to reduce
   the number of rows). I know multivalued fields don't support complex data
   types, but perhaps something else can be done with the data/schema to
   reduce query time and the number of rows.

The number of these rows is very large, for sure more than 1 billion (if we
go with the schema I was suggesting). What schema would you suggest for
this that'll fit the query requirements?

FYI: All queries will be exact match on fields (no partial or tokenized),
so no analysis on fields is necessary. And almost all queries are range
queries.
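Given those constraints (exact match only, heavy range queries, billions of docs), a hedged sketch of what the schema.xml field definitions might look like: 'string' types skip analysis, stored="false" avoids keeping stored copies, and docValues are assumed here to support stats/faceting (verify the type names and defaults against your Solr version):

```xml
<field name="contentID"       type="string" indexed="true"  stored="false" docValues="true"/>
<field name="platform"        type="string" indexed="true"  stored="false" docValues="true"/>
<field name="softwareVersion" type="string" indexed="true"  stored="false" docValues="true"/>
<field name="regionId"        type="string" indexed="true"  stored="false" docValues="true"/>
<field name="ts"              type="date"   indexed="true"  stored="false" docValues="true"/>
<!-- metrics are never searched on, only aggregated -->
<field name="unique"          type="int"    indexed="false" stored="false" docValues="true"/>
<field name="total"           type="int"    indexed="false" stored="false" docValues="true"/>
```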

Thanks,

KP