Hey guys, We have a use case which we’re trying to solve using SOLR. This is the input data, which of course can be manipulated as we desire to comply with a certain solution:
user_id day (days since 1970 - eg. 16100; this can be the current day - 60 days at most). item1_count (optional - the value is an int value with a range of 1 to 10 000) item2_count (optional) … itemN_count (optional) An user_id can have at most one document for a given day. Or it can have none. Here's and example: <doc> <int name="user_id">1234</int> <int name="date">16100</int> <int name="1000_count">7</int> <int name="1500_count">2</int> <int name="1800_count">8</int> <int name="1900_count">2</int> </doc> <doc> <int name="user_id">1234</int> <int name="date">16101</int> <int name="1001_count">2</int> <int name="1500_count">4</int> </doc> There are a couple of queries we want to be able to make: 1. SELECT * FROM table WHERE NOW() - day <= Xdays GROUP BY (user_id) HAVING sum(item1_count) >= 10; 2. SELECT * FROM table WHERE NOW() - day <= Xdays AND (HAS item1 field) GROUP BY (user_id) SELECT * FROM table WHERE NOW() - day <= Xdays AND (HAS item1 field) AND (HAS item2 field) GROUP BY (user_id) SELECT * FROM table WHERE NOW() - day <= Xdays AND (HAS item1 field) OR (HAS item6 field) AND (HAS item7 field) GROUP BY (user_id) We're just being interested in how many documents are being returned by these queries. We don't care about the actual returned documents. We came up with a solution, but it turned out to be inefficient. The idea was to have one document per user, like so: user_id item1_dayX_count item1_dayY_count item1_dayZ_count item2_dayX_count item2_dayY_count … Example: <doc> <int name="user_id">1234</int> <int name="1000_16100_count">7</int> <int name="1500_16100_count">2</int> <int name="1800_16100_count">8</int> <int name="1900_16100_count">2</int> <int name="1001_16101_count">2</int> <int name="1500_16101_count">4</int> </doc> Then, for this query: SELECT COUNT(*) FROM table WHERE NOW() - day <= 10days AND HAS item1 field GROUP BY (user_id) Suppose current_day is 16100 and item1 is 1500. We do this SOLR query: 1500_16100_count:* AND 1500_16099_count:* AND 1500_16098_count:* … and so on for all ten days. We'd then see how many documents match this query, which is just the result we want to use further. Now you can image that for 14, 30 or 60 days this query is going to get pretttty long. For the other query: SELECT * FROM table WHERE NOW() - day <= Xdays GROUP BY (user_id) HAVING sum(item1_count) >= 10; We have to do some pretty ugly sums in SOLR and again repeated many times. I've been trying to leverage the StatsComponent and the FieldCollapsing features in SOLR, but I have not been able to do this for this use case. Would you please care to help with a solution or direct to some helpful documentation / similar problems solved? Thanks! Julian