It's great to see Paul's greeting in Chinese! I'm also glad to hear about Wang Liang's use case for Drill, and you are welcome to contribute it as a Drill storage plugin.

On Tue, Jul 9, 2019 at 1:00 AM Paul Rogers <[email protected]> wrote:

> 王亮 你好 (Hello, Wang Liang),
>
> Very creative use of Drill! We usually think of Drill as a tool for "big
> data" distributed file systems such as HDFS, MFS and S3. IPFS seems to be
> for storing web content. I like how you've shown that IPFS is, in fact, a
> distributed file system, and made Drill work in this context.
>
> Perhaps data scientists might benefit from Minerva: instead of everyone
> downloading large data sets and doing queries locally, a data scientist
> could instead query the data where it lives on the web. Such a feature
> would be especially useful if the data changes over time.
>
> As Charles mentioned, it would be great if you could offer Minerva changes
> to the Drill project. Most extensions live within the Drill project itself,
> typically in the "contrib" module.
>
> The other choice would be for Minerva to be a separate project or repo
> that can be integrated with Drill. We have often talked about creating a
> true plugin architecture to support such a model, but gaps remain. Minerva
> might be a good reason to fix the gaps.
>
> Thanks,
> - Paul
>
>
> On Saturday, July 6, 2019, 02:31:27 AM PDT, 王亮 <[email protected]>
> wrote:
>
> Hi all,
>
> After reading the excellent book "Learning Apache Drill: Query and Analyze
> Distributed Data Sources with SQL", my classmate and I also wanted to write
> a Drill storage plugin. We found that most DFS and NFS systems are already
> supported by Drill, so we chose a relatively new and promising distributed
> file system, IPFS.
>
> So we built Minerva, a Drill storage plugin that connects IPFS's
> decentralized storage and Drill's flexible query engine. Any data file
> stored on IPFS can be easily accessed from Drill's query interface, just
> like a file stored on a local disk. The basic idea is very simple: run a
> Drill instance alongside the IPFS daemon, and you can connect to other
> users on IPFS who are also using Minerva. If one of the users happens to
> have stored the file you are trying to query, then Drill can send the
> execution plan to that node, which executes the operations locally and
> returns the results. Of course, other users can benefit from your node as
> well, if you are sharing the data they want. If there are enough people
> running Minerva, data sharing and querying can be made distributed and
> more efficient!
>
> The query process is as follows:
> 0. The user inputs an SQL statement, referencing a file on IPFS by its CID;
> 1. The Foreman resolves the CIDs of the "pieces" of the data file, as well
>    as the IPFS providers of these pieces, by querying the DHT of IPFS;
> 2. The Foreman distributes jobs to drillbits running on the providers;
> 3. Drillbits on the providers read data from the pieces of the file on
>    their local disk, perform any necessary relational operations, and
>    return results to the Foreman;
> 4. The Foreman returns the results to the user.
>
> Thanks to the modular design of Drill, we could rather "easily" write this
> storage plugin. The plugin currently supports basic query operations, both
> read and write, but only works with JSON and CSV files. It is not very
> stable for now, and the performance is still poor, mainly because it takes
> too long to do DHT queries on IPFS. We plan to address these problems in
> the future.
>
> If you are interested, we have made a few slides that explain the ideas in
> detail:
> https://www.slideshare.net/BowenDing4/minerva-ipfs-storage-plugin-for-ipfs
>
> Any suggestion is welcome. ^_^
>
> Find the code on GitHub: https://github.com/bdchain/Minerva
>
> Best,
> Wang Liang
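
For readers following the thread, here is a rough sketch of the five-step query process from Wang Liang's message. It is only a toy model and not Minerva's or Drill's actual code: the names `ToyDht`, `Drillbit` and `run_query` are made up for illustration, the DHT is an in-memory dictionary rather than a real IPFS lookup, and the "relational operation" is reduced to a simple row filter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Drillbit:
    """Hypothetical stand-in for a drillbit running next to an IPFS daemon."""
    node_id: str
    local_pieces: Dict[str, List[Row]] = field(default_factory=dict)

    def execute(self, piece_cid: str, predicate: Callable[[Row], bool]) -> List[Row]:
        # Step 3: read the piece from local storage and filter it on this node.
        return [row for row in self.local_pieces[piece_cid] if predicate(row)]


@dataclass
class ToyDht:
    """Hypothetical in-memory replacement for the IPFS DHT lookups."""
    pieces_of: Dict[str, List[str]]      # file CID -> CIDs of its pieces
    providers_of: Dict[str, List[str]]   # piece CID -> ids of nodes holding it


def run_query(file_cid: str,
              predicate: Callable[[Row], bool],
              dht: ToyDht,
              drillbits: Dict[str, Drillbit]) -> List[Row]:
    """Plays the Foreman: resolve pieces, dispatch work, merge results."""
    results: List[Row] = []
    # Step 1: resolve the piece CIDs and their providers through the DHT.
    for piece_cid in dht.pieces_of[file_cid]:
        providers = dht.providers_of.get(piece_cid, [])
        if not providers:
            raise LookupError(f"no provider found for piece {piece_cid}")
        # Step 2: hand the job to a drillbit on a provider (here: the first one).
        drillbit = drillbits[providers[0]]
        # Steps 3 and 4: collect the partial results for the user.
        results.extend(drillbit.execute(piece_cid, predicate))
    return results


# Step 0: the user refers to the file by its CID (placeholder strings here).
dht = ToyDht(
    pieces_of={"file-cid": ["piece-a", "piece-b"]},
    providers_of={"piece-a": ["node-1"], "piece-b": ["node-2"]},
)
drillbits = {
    "node-1": Drillbit("node-1", {"piece-a": [{"id": 1, "v": 10}, {"id": 2, "v": 3}]}),
    "node-2": Drillbit("node-2", {"piece-b": [{"id": 3, "v": 8}]}),
}
rows = run_query("file-cid", lambda r: r["v"] > 5, dht, drillbits)
print(rows)
```

In this sketch the filter runs on the node that holds each piece, so only matching rows travel back to the Foreman. The real plugin would also have to deal with the slow DHT lookups mentioned in the message, which the toy dictionary hides.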
