Question about a group by user and having sum(field) >= x in a date range

Iulian M Thu, 24 Apr 2014 07:06:41 -0700

Hey guys,

We have a use case which we’re trying to solve using SOLR.
This is the input data, which of course can be manipulated as we desire to
comply with a certain solution:


user_id
day (days since 1970 - eg. 16100; this can be the current day - 60 days at
most).
item1_count (optional - the value is an int value with a range of 1 to 10
000)
item2_count (optional)
…
itemN_count (optional)

An user_id can have at most one document for a given day. Or it can have
none.

Here's and example:
<doc>
<int name="user_id">1234</int>
<int name="date">16100</int>
<int name="1000_count">7</int>
<int name="1500_count">2</int>
<int name="1800_count">8</int>
<int name="1900_count">2</int>
</doc>

<doc>
<int name="user_id">1234</int>
<int name="date">16101</int>
<int name="1001_count">2</int>
<int name="1500_count">4</int>
</doc>


There are a couple of queries we want to be able to make:
1.
SELECT * FROM table WHERE NOW() - day <= Xdays GROUP BY (user_id) HAVING
sum(item1_count) >= 10;
2.
SELECT * FROM table WHERE NOW() - day <= Xdays AND (HAS item1 field) GROUP
BY (user_id)
SELECT * FROM table WHERE NOW() - day <= Xdays AND (HAS item1 field) AND
(HAS item2 field) GROUP BY (user_id)
SELECT * FROM table WHERE NOW() - day <= Xdays AND (HAS item1 field) OR
(HAS item6 field) AND (HAS item7 field) GROUP BY (user_id)

We're just being interested in how many documents are being returned by
these queries. We don't care about the actual returned documents.

We came up with a solution, but it turned out to be inefficient.
The idea was to have one document per user, like so:
user_id
item1_dayX_count
item1_dayY_count
item1_dayZ_count
item2_dayX_count
item2_dayY_count
…

Example:
<doc>
<int name="user_id">1234</int>
<int name="1000_16100_count">7</int>
<int name="1500_16100_count">2</int>
<int name="1800_16100_count">8</int>
<int name="1900_16100_count">2</int>
<int name="1001_16101_count">2</int>
<int name="1500_16101_count">4</int>
</doc>

Then, for this query:
SELECT COUNT(*) FROM table WHERE NOW() - day <= 10days AND HAS item1 field
GROUP BY (user_id)
Suppose current_day is 16100 and item1 is 1500. We do this SOLR query:
1500_16100_count:* AND 1500_16099_count:* AND 1500_16098_count:* … and so
on for all ten days.
We'd then see how many documents match this query, which is just the result
we want to use further.
Now you can image that for 14, 30 or 60 days this query is going to get
pretttty long.

For the other query:
SELECT * FROM table WHERE NOW() - day <= Xdays GROUP BY (user_id) HAVING
sum(item1_count) >= 10;
We have to do some pretty ugly sums in SOLR and again repeated many times.


I've been trying to leverage the StatsComponent and the FieldCollapsing
features in SOLR, but I have not been able to do this for this use case.
Would you please care to help with a solution or direct to some helpful
documentation / similar problems solved?

Thanks!
Julian

Question about a group by user and having sum(field) >= x in a date range

Reply via email to