Hello 王亮,
Very creative use of Drill! We usually think of Drill as a tool for "big data" distributed file systems such as HDFS, MFS and S3, while IPFS seems to be for storing web content. I like how you've shown that IPFS is, in fact, a distributed file system, and made Drill work in this context.

Perhaps data scientists might benefit from Minerva: instead of everyone downloading large data sets and running queries locally, a data scientist could query the data where it lives on the web. Such a feature would be especially useful if the data changes over time.

As Charles mentioned, it would be great if you could offer the Minerva changes to the Drill project. Most extensions live within the Drill project itself, typically in the "contrib" module. The other choice would be for Minerva to be a separate project or repo that can be integrated with Drill. We have often talked about creating a true plugin architecture to support such a model, but gaps remain. Minerva might be a good reason to fix the gaps.

Thanks,
- Paul

On Saturday, July 6, 2019, 02:31:27 AM PDT, 王亮 <wanglian...@gmail.com> wrote:

Hi all,

After reading that excellent book "Learning Apache Drill: Query and Analyze Distributed Data Sources with SQL", my classmate and I also wanted to write a Drill storage plugin. We found that Drill already supports most DFS and NFS systems, so we chose a relatively new and promising distributed file system, IPFS.

So we built Minerva, a Drill storage plugin that connects IPFS's decentralized storage with Drill's flexible query engine. Any data file stored on IPFS can be easily accessed from Drill's query interface, just like a file stored on a local disk.

The basic idea is very simple: run a Drill instance alongside the IPFS daemon, and you can connect to other users on IPFS who are also using Minerva. If one of the users happens to have stored the file you are trying to query, then Drill can send the execution plan to that node, which executes the operations locally and returns the results.
Of course, other users can benefit from your node as well, if you are sharing the data they want. If there are enough people running Minerva, data sharing and querying can be made distributed and more efficient!

The query process is as follows:

0 The user inputs an SQL statement, referencing a file on IPFS by its CID;
1 The Foreman resolves the CIDs of the "pieces" of the data file, as well as the IPFS providers of these pieces, by querying the DHT of IPFS;
2 The Foreman distributes jobs to drillbits running on the providers;
3 Drillbits on the providers read data from the pieces of the file on their local disks, perform any necessary relational operations, and return results to the Foreman;
4 The Foreman returns the results to the user.

Thanks to the modular design of Drill, we could rather "easily" write this storage plugin. The plugin now supports basic query operations, both read and write, but only works with JSON and CSV files. It is not very stable for now, and the performance is still poor, mainly because it takes too long to do DHT queries on IPFS. We will try to address these problems in the future.

If you are interested, we have made a few slides that explain the ideas in detail:
https://www.slideshare.net/BowenDing4/minerva-ipfs-storage-plugin-for-ipfs

Any suggestion is welcome. ^_^

Find the code on GitHub: https://github.com/bdchain/Minerva

Best,
Wang Liang
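
For readers following the thread, the five-step query process in the quoted message can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: `Provider`, `ToyDHT`, `foreman_query`, and the CID strings are hypothetical names invented here, and nothing below calls the real Drill, IPFS, or Minerva APIs. The "relational operation" is reduced to a simple row filter.

```python
from dataclasses import dataclass, field

# Toy stand-ins only: none of these names come from Drill, IPFS, or Minerva.


@dataclass
class Provider:
    """A hypothetical node running both an IPFS daemon and a drillbit."""
    name: str
    local_pieces: dict = field(default_factory=dict)  # piece CID -> list of rows

    def run_fragment(self, piece_cid, predicate):
        # Step 3: read the piece from local storage and filter it locally.
        return [row for row in self.local_pieces[piece_cid] if predicate(row)]


class ToyDHT:
    """In-memory substitute for the IPFS DHT lookups made by the Foreman."""

    def __init__(self):
        self.pieces = {}     # file CID -> list of piece CIDs
        self.providers = {}  # piece CID -> Provider holding that piece

    def publish(self, file_cid, piece_cid, provider, rows):
        provider.local_pieces[piece_cid] = rows
        self.pieces.setdefault(file_cid, []).append(piece_cid)
        self.providers[piece_cid] = provider


def foreman_query(dht, file_cid, predicate):
    results = []
    # Step 1: resolve the pieces of the file and the provider of each piece.
    for piece_cid in dht.pieces[file_cid]:
        provider = dht.providers[piece_cid]
        # Steps 2-3: hand the fragment to the drillbit on that provider.
        results.extend(provider.run_fragment(piece_cid, predicate))
    # Step 4: return the merged results to the user.
    return results


dht = ToyDHT()
node_a, node_b = Provider("node-a"), Provider("node-b")
dht.publish("file-cid", "piece-1", node_a, [{"id": 1, "v": 10}, {"id": 2, "v": 25}])
dht.publish("file-cid", "piece-2", node_b, [{"id": 3, "v": 40}])

# Step 0: the user's query, here "rows of file-cid where v > 20".
rows = foreman_query(dht, "file-cid", lambda r: r["v"] > 20)
print(rows)
```

In this sketch the filter runs on the node that holds each piece, so only matching rows travel back to the Foreman, which is the point of pushing the plan to the providers. The real DHT lookup in step 1 is a network operation, which is where the message says most of the latency comes from.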