Sorry, bad typo: I have 50GB of data, NOT 500GB ;). And I usually only query a 1 GB subset of this data using Drill.
On Thu, Feb 4, 2016 at 1:04 PM, Peder Jakobsen | gmail <[email protected]> wrote:

> On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht <[email protected]> wrote:
>
>> Is there a reason to create a single file? Typically you may want more
>> files to improve parallel operation on distributed systems like Drill.
>
> Good question. I'm not actually using Drill for "big data". In fact, I
> never deal with "big data", and I'm unlikely to ever do so.
>
> But I do have 500 GB of CSV files spread across about 100 directories.
> They are all part of the same dataset, but this is how it's been organized
> by the government department that released it as an Open Data dump.
>
> Drill saves me the hassle of having to stitch these files together using
> Python or awk. I love being able to just query the files using SQL (so far
> it's slow though, I need to figure out why - 18 seconds for a simple query
> is too much). Data eventually needs to end up on the web to share it with
> other people, and I use crossfilter.js and D3.js for presentation. I need
> fine-grained control over online data presentation, and all BI tools I've
> seen are terrible in this department, e.g. Tableau.
>
> So I need my data in a format that can be read by common web frameworks,
> and that usually implies dealing with a single file that can be uploaded to
> the web server. No need for a database, since I'm just reading a few
> columns from a big flat file.
>
> I run my apps on a low-cost virtual server. I don't have access to
> Java/VirtualBox/MongoDB etc. Nor do I think these things are necessary:
> K.I.S.S.
>
> So this use case may be quite different from many of the more "corporate"
> users, but Drill is so very useful regardless.
