Hi,

glad that I can help. May I suggest that I continue with creating use cases and the corresponding types of query profiles:

* Wikipedia Edit History: At first glance, the history is made up of 40 or so tables. I would design some user stories using join-like queries across multiple tables, or whatever they are called in Drill.
* Enron emails: I have not yet had an opportunity to check this data set, but here I would design user stories as if building an email client. This would lead to heavy use of full-text search.

There are some additional data sets I would like to suggest: http://aws.amazon.com/datasets

* Freebase.com: Simulate a visualization that jumps from topic to topic as user stories. This would lead to queries on random and very small row sets.
* Wikipedia Page Traffic Statistics: Simulate a log analysis. Heavy aggregation and date functions on a large number of rows.
* Global Weather Measurements: Design user stories based on geographic and chronological aggregation of climate data to visualize trends.

Regards
Stefan
________________________________________
From: Michael Hausenblas [[email protected]]
Sent: Thursday, 10 January 2013 19:54
To: [email protected]
Subject: Re: Introduction

> Michael Hausenblas is beginning to collect data sets and query examples for
> different plausible use cases ranging from small to large. He should show
> up on the mailing list shortly and you could coordinate with him.

Welcome, Stefan - great to have you on board!

So the idea would be to compile a list of datasets along with typical (interesting) queries formulated in natural language. One thing we need to get this off the ground is the Wiki but I gather Ted is on that ..

Datasets that might be of interest include, but are not restricted to:

* Wikipedia edit history from [1]
* Census data (US, Eurostat, etc.)
* AOL search logs
* Enron emails [2]

Feel free to come up with additional ones as well. I suppose we can continue the discussion (who looks into what) here on the list and once the Wiki is available we can co-ordinate also via it.

Cheers,
Michael

[1] http://en.wikipedia.org/wiki/Wikipedia:Database_download
[2] http://www.cs.cmu.edu/~enron/

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

On 10 Jan 2013, at 10:19, Ted Dunning <[email protected]> wrote:

> Stefan,
>
> One of the key things to do right now is to work on use cases.
>
> Michael Hausenblas is beginning to collect data sets and query examples for
> different plausible use cases ranging from small to large. He should show
> up on the mailing list shortly and you could coordinate with him.
>
> On Thu, Jan 10, 2013 at 5:45 AM, Siprell, Stefan
> <[email protected]> wrote:
>
>> Hi all,
>> I am working for an IT consulting agency in Germany. One of the goals of
>> our team for 2013 is active (as in giving) participation in the open source
>> community and offering our customers cutting-edge analytical tools for
>> large to huge databases. You guys hit the spot!
>>
>> I would like to start offering my personal help (volunteer work for now,
>> later I could pitch in a day or two per week perhaps) in any role which
>> would help. I am a somewhat strong enterprise Java developer, can deal
>> sufficiently well with HTML5 frontends, know most things about build
>> environments and testing and should be able to do some design or
>> documentation.
>>
>> Is there anything I can do?
>>
>> Stefan
>>
