Hi Andries, the trouble is that I run Drill on my desktop machine, and I have no server available to me that is capable of running Drill. Most $10/month hosting accounts do not permit you to run Java apps. For this reason I simply use Drill for "pre-processing" of the files that I eventually use in my simple 50-line Python web app.
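To give a sense of how small the consuming end is, the app does little more than this (a rough sketch; the file name, column names, and sample rows are invented for illustration):

```python
import csv
import io

# Hypothetical 6-column flat file produced by the Drill pre-processing step.
SAMPLE = """region,year,category,vendor,units,amount
East,2015,books,Acme,3,29.97
West,2015,tools,Bolt,1,14.50
"""

def load_columns(fileobj, wanted=("region", "amount")):
    """Pull only the columns the page actually needs from the flat file."""
    reader = csv.DictReader(fileobj)
    return [{k: row[k] for k in wanted} for row in reader]

# In the real app this would be open("data.csv"); here a string stands in.
rows = load_columns(io.StringIO(SAMPLE))
print(rows)
```

That's the whole "database layer" — hence no appetite for running a JVM next to it.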
Even if I could run Drill on my server, this seems like a lot of overhead for something as simple as a flat file with 6 columns.

On Thu, Feb 4, 2016 at 1:31 PM, Andries Engelbrecht <[email protected]> wrote:

> You can create multiple parquet files and have the ability to query them
> all through the Drill SQL interface with minimal overhead.
>
> Creating a single 50GB parquet file is likely not the best option for
> performance; perhaps use Drill partitioning for the parquet files to speed
> up queries and reads in the future, although parquet should be more
> efficient than CSV to store data. You can still limit Drill to a single
> thread to limit memory use for parquet CTAS, and potentially the number of
> files created.
>
> A bit of experimentation may help to find the optimum config for your use
> case.
>
> --Andries
>
>
> > On Feb 4, 2016, at 10:12 AM, Peder Jakobsen | gmail <[email protected]> wrote:
> >
> > Sorry, bad typo: I have 50GB of data, NOT 500GB ;). And I usually only
> > query a 1 GB subset of this data using Drill.
> >
> > On Thu, Feb 4, 2016 at 1:04 PM, Peder Jakobsen | gmail <[email protected]> wrote:
> >
> >> On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht <[email protected]> wrote:
> >>
> >>> Is there a reason to create a single file? Typically you may want more
> >>> files to improve parallel operation on distributed systems like Drill.
> >>
> >> Good question. I'm not actually using Drill for "big data". In fact, I
> >> never deal with "big data", and I'm unlikely ever to do so.
> >>
> >> But I do have 500 GB of CSV files spread across about 100 directories.
> >> They are all part of the same dataset, but this is how it's been
> >> organized by the government department that released it as an Open Data
> >> dump.
> >>
> >> Drill saves me the hassle of having to stitch these files together using
> >> python or awk. I love being able to just query the files using SQL (so
> >> far it's slow though; I need to figure out why - 18 seconds for a simple
> >> query is too much). Data eventually needs to end up on the web to share
> >> it with other people, and I use crossfilter.js and D3.js for
> >> presentation. I need fine-grained control over online data presentation,
> >> and all BI tools I've seen are terrible in this department, e.g. Tableau.
> >>
> >> So I need my data in a format that can be read by common web frameworks,
> >> and that usually implies dealing with a single file that can be uploaded
> >> to the web server. No need for a database, since I'm just reading a few
> >> columns from a big flat file.
> >>
> >> I run my apps on a low-cost virtual server. I don't have access to
> >> java/virtualbox/MongoDB etc. Nor do I think these things are necessary:
> >> K.I.S.S.
> >>
> >> So this use case may be quite different from many of the more
> >> "corporate" users, but Drill is so very useful regardless.
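P.S. In case it's useful to anyone on the list: the pure-Python fallback I'd otherwise reach for to stitch the per-directory CSVs into one flat file is roughly this sketch. It assumes every file shares the same header, and the paths in the demo are throwaway stand-ins for the real tree:

```python
import csv
import os
import tempfile

def stitch_csv_tree(root, out_path):
    """Walk a directory tree of same-schema CSV files and append them
    into one flat file, writing the header only once."""
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # deterministic traversal order
            for name in sorted(filenames):
                if not name.endswith(".csv"):
                    continue
                with open(os.path.join(dirpath, name), newline="") as f:
                    reader = csv.reader(f)
                    header = next(reader)  # skip each file's header row
                    if not header_written:
                        writer.writerow(header)
                        header_written = True
                    writer.writerows(reader)

# Tiny demo with two throwaway "directories" standing in for the real tree.
src = tempfile.mkdtemp()
for sub, text in [("a", "x,y\n1,2\n"), ("b", "x,y\n3,4\n")]:
    os.makedirs(os.path.join(src, sub))
    with open(os.path.join(src, sub, "part.csv"), "w") as f:
        f.write(text)
merged = os.path.join(tempfile.mkdtemp(), "merged.csv")
stitch_csv_tree(src, merged)
print(open(merged).read())
```

Drill's SQL interface just makes this whole step unnecessary for ad-hoc queries.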