Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-16 Thread Marcel Ruiz Forns
OK, so I think we have our candidates:
1) PostgreSQL
2) Cassandra
We can speak about this at our next tasking meeting. If someone has more suggestions or comments, we still have a couple of days until then.
Thank you all!
Marcel
On Sat, Jun 13, 2015 at 11:37 AM, Joseph Allemandou < jalleman...@wiki

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-13 Thread Joseph Allemandou
Andrew, Toby, that makes perfect sense. While thinking that the distributed aspect of Impala would handle high availability issues, I very much understand that having a front-end system relying on the analytics cluster is not as good as having a dedicated storage solution. Thanks for the good point

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-12 Thread Toby Negrin
As someone who has run production serving systems on top of Hadoop, I think this is risky. We've had substantial planned and unplanned downtime on the cluster (which is to be expected) and it would be bad for a pageview API to be impacted. -Toby On Fri, Jun 12, 2015 at 9:46 AM, Andrew Otto wrote

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-12 Thread Andrew Otto
> I think we could add Impala in storage technologies to assess.

I think we don’t want to build the pageview API on top of the Analytics Cluster.

> On Jun 12, 2015, at 05:37, Joseph Allemandou wrote:
>
> I think we could add Impala in storage technologies to assess.
> It allows reading / c

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-12 Thread Joseph Allemandou
I think we could add Impala to the storage technologies to assess. It allows reading / computing straight from HDFS and should be fast enough for a not-too-bad user experience. Maybe? On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns wrote: > This thread seems to have paused for 1 or 2 days now. > > So summar

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-11 Thread Marcel Ruiz Forns
This thread seems to have paused for 1 or 2 days now. So summarizing, the following storage technologies have been mentioned:
- PostgreSQL
- MySQL
- Cassandra
- Voldemort
And the following concerns have been raised on using something that:
- We're already familiar with
- Permi

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-10 Thread Marcel Ruiz Forns
If we are going to completely denormalize the data sets for anonymization, and we expect just slice and dice queries to the database, I think we wouldn't take much advantage of a relational DB, because it wouldn't need to aggregate values, slice or dice, all slices and dices would be precomputed, r
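
If every slice and dice is precomputed, a read reduces to a single key lookup, which is why a relational engine adds little here. A minimal sketch of that access pattern, assuming a hypothetical composite key format and made-up values (none of the names or numbers below come from the thread):

```python
# Hypothetical precomputed slices keyed by a composite string.
# "*" stands for "all values" of a dimension; the key format is an assumption.
def make_key(project="*", access="*", agent="*"):
    return f"{project}|{access}|{agent}"


# Made-up sample data standing in for the denormalized store.
store = {
    make_key(): 41,
    make_key(project="en.wikipedia"): 34,
    make_key(project="en.wikipedia", access="mobile"): 19,
}


def lookup(store, **dims):
    """Return the precomputed value for a slice, or None if it was never materialized."""
    return store.get(make_key(**dims))
```

The store never aggregates at read time: a slice that was not materialized ahead of time simply is not there, which is the trade-off of full denormalization.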

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Dan Andreescu
On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke wrote: > On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu > wrote: > >> Eric, I think we should allow arbitrary querying on any dimension for >> that first data block. We could pre-aggregate all of those combinations >> pretty easily since the dimensi

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Gabriel Wicke
On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu wrote: > Eric, I think we should allow arbitrary querying on any dimension for that > first data block. We could pre-aggregate all of those combinations pretty > easily since the dimensions have very low cardinality. > Are you thinking about someth

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Dan Andreescu
> > Dan/Kevin: slightly OT, are we aware of any use case related to features > that would be exposing PV data in production? I’ve seen mocks from the > Discovery team with PV data embedded in articles or search interfaces and > I’m not sure what their status is. > We have asks from people to expos

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Dario Taraborelli
Thanks, Gabriel – this is super-helpful. Dan/Kevin: slightly OT, are we aware of any use case related to features that would be exposing PV data in production? I’ve seen mocks from the Discovery team with PV data embedded in articles or search interfaces and I’m not sure what their status is.

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Dan Andreescu
Eric, I think we should allow arbitrary querying on any dimension for that first data block. We could pre-aggregate all of those combinations pretty easily since the dimensions have very low cardinality. For the article-level data, no, we'd want just basic timeseries querying. Thanks Gabriel, if
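
The pre-aggregation of all dimension combinations mentioned in this message could look roughly like the following. This is only a sketch under stated assumptions: the dimension names, the sample rows and the use of `None` for "any value" are hypothetical and not taken from the thread.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical low-cardinality dimensions; the real ones are not listed here.
DIMENSIONS = ("project", "access", "agent")


def build_cube(rows):
    """Sum views for every subset of dimensions; None marks 'any value'."""
    cube = defaultdict(int)
    indices = range(len(DIMENSIONS))
    for row in rows:
        for size in range(len(DIMENSIONS) + 1):
            for kept in combinations(indices, size):
                # Keep the row's value for the chosen dimensions, blank out the rest.
                key = tuple(
                    row[dim] if i in kept else None
                    for i, dim in enumerate(DIMENSIONS)
                )
                cube[key] += row["views"]
    return dict(cube)


# Made-up sample rows for illustration only.
rows = [
    {"project": "en.wikipedia", "access": "desktop", "agent": "user", "views": 15},
    {"project": "en.wikipedia", "access": "mobile", "agent": "user", "views": 19},
    {"project": "de.wikipedia", "access": "desktop", "agent": "spider", "views": 7},
]
cube = build_cube(rows)
```

Each input row contributes to 2^N keys for N dimensions, so this stays cheap only while N and the cardinalities are small, which is the condition the message relies on.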

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Gabriel Wicke
I think Eric's original response got lost, so let me include it below:
>>> dim1, dim2, dim3, ..., dimN, val
>>> a, null, null, ..., null, 15 // pv for dim1=a
>>> a, x, null, ..., null, 34 // pv for dim1=a & dim2=x
>>> a, x, 1, ..., null, 27 // pv f

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Toby Negrin
@otto: I believe that the high throughput bulk input use case will be difficult for a relational db (e.g. postgres) to handle. It will be interesting to see how well cassandra can handle the queries that people want to run. Tradeoffs...

@dario: RestBASE and Cassandra are definitely different thing

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Andrew Otto
> p.s. I will never drink Bud Lite Lime. Like, never.

You have the wrong attitude. Pretend it is beer-soda, not beer. Beer + sprite is yummy! (I’ll let someone else figure out how this advice also applies to the database analogy.)

> On Jun 8, 2015, at 19:52, Dan Andreescu wrote: > >>

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Adam Baso
Does one of these options support both SQL based and REST interfaces with TSV and JSON output with little user setup? I'm thinking of your typical Ubuntu/Mac/Windows user with the ability to import TSV data into a spreadsheet. On Tuesday, June 9, 2015, Dario Taraborelli wrote: > I too would love

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Dario Taraborelli
I too would love to understand if RestBASE can become our default solution for these kinds of data-intensive APIs. Can you guys briefly explain what kind of queries and aggregations would be problematic if we were to go with Cassandra? > On Jun 9, 2015, at 8:39 AM, Oliver Keyes wrote: > > Rememb

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-08 Thread Oliver Keyes
Remember that (as things currently stand) putting the thing on labs means meta-analytics ("how are the cubes being used?") being a pain in the backside to integrate with our existing storage solutions. On 8 June 2015 at 22:52, Dan Andreescu wrote: >> As always, I'd recommend that we go with tech

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-08 Thread Dan Andreescu
> > As always, I'd recommend that we go with tech we are familiar with -- > mysql or cassandra. We have a cassandra committer on staff who would be > able to answer these questions in detail. > > > WMF uses PostGRES for some things, no? Or is that just in labs? > Since this data is meant to be

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-08 Thread Eric Evans
On Mon, Jun 8, 2015 at 7:44 PM, Gabriel Wicke wrote: > (+ Eric) > > On Mon, Jun 8, 2015 at 5:42 PM, Toby Negrin wrote: >> >> As always, I'd recommend that we go with tech we are familiar with -- >> mysql or cassandra. We have a cassandra committer on staff who would be able >> to answer these que

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-08 Thread Andrew Otto
> As always, I'd recommend that we go with tech we are familiar with -- mysql > or cassandra. We have a cassandra committer on staff who would be able to > answer these questions in detail. WMF uses PostGRES for some things, no? Or is that just in labs? > On Jun 8, 2015, at 17:42, Toby N

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-08 Thread Gabriel Wicke
(+ Eric) On Mon, Jun 8, 2015 at 5:42 PM, Toby Negrin wrote: > As always, I'd recommend that we go with tech we are familiar with -- > mysql or cassandra. We have a cassandra committer on staff who would be > able to answer these questions in detail. > > -Toby > > On Mon, Jun 8, 2015 at 4:46 PM,

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-08 Thread Toby Negrin
As always, I'd recommend that we go with tech we are familiar with -- mysql or cassandra. We have a cassandra committer on staff who would be able to answer these questions in detail. -Toby On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns wrote: > *This discussion is intended to be a branch of

[Analytics] [Technical] Pick storage for pageview cubes

2015-06-08 Thread Marcel Ruiz Forns
*This discussion is intended to be a branch of the thread: "[Analytics] Pageview API Status update".* Hi all, We Analytics are trying to *choose a storage technology to keep the pageview data* for analysis. We don't want to get to a final system that covers all our needs yet (there are still thi