Hello 王亮,

Very creative use of Drill! We usually think of Drill as a tool for "big data" 
distributed file systems such as HDFS, MFS and S3. IPFS seems to be for storing 
web content. I like how you've shown that IPFS is, in fact, a distributed file 
system, and made Drill work in this context.

Perhaps data scientists might benefit from Minerva: instead of everyone 
downloading large data sets and doing queries locally, a data scientist could 
instead query the data where it lives on the web. Such a feature would be 
especially useful if the data changes over time.

As Charles mentioned, it would be great if you could offer Minerva changes to 
the Drill project. Most extensions live within the Drill project itself, 
typically in the "contrib" module.

The other choice would be for Minerva to be a separate project or repo that can 
be integrated with Drill. We have often talked about creating a true plugin 
architecture to support such a model, but gaps remain. Minerva might be a good 
reason to fix the gaps.

Thanks,
- Paul

 

    On Saturday, July 6, 2019, 02:31:27 AM PDT, 王亮 <wanglian...@gmail.com> wrote:
 
 Hi all,

After reading that excellent book "Learning Apache Drill: Query and Analyze
Distributed Data Sources with SQL", my classmate and I also wanted to write
a Drill storage plugin. We found that most DFS and NFS are already supported
by Drill, so we chose a relatively new and promising distributed file system,
IPFS.

So we built Minerva, a Drill storage plugin that connects IPFS's
decentralized storage and Drill's flexible query engine. Any data file
stored on IPFS can be easily accessed from Drill's query interface, just
like a file stored on a local disk. The basic idea is very simple: run a
Drill instance alongside the IPFS daemon, and you can connect to other users
on IPFS who are also using Minerva. If one of those users happens to have
stored the file you are trying to query, Drill can send the execution plan
to that node, which executes the operations locally and returns the
results. Of course, other users can benefit from your node as well, if you
are sharing the data they want. If enough people run Minerva, data sharing
and querying can be made distributed and more efficient!

The query process is as follows:
0 The user inputs an SQL statement, referencing a file on IPFS by its CID;
1 The Foreman resolves the CIDs of the "pieces" of the data file, as well
as the IPFS providers of these pieces, by querying the DHT of IPFS;
2 The Foreman distributes jobs to the drillbits running on the providers;
3 Drillbits on the providers read data from the pieces of the file on their
local disk, perform any necessary relational operations, and return results
to the Foreman;
4 The Foreman returns the results to the user.
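
To make the steps above concrete, here is a rough Python sketch of the same
flow. It is only an illustration, not Minerva's actual code: the DHT and the
providers' local disks are replaced by in-memory dictionaries, and every name
in it (PIECES, PROVIDERS, LOCAL_DATA, resolve, assign, run_on_provider,
query) is made up for this example. It also assumes the simplest possible
scheduling rule, namely that each piece goes to the first provider listed.

```python
from collections import defaultdict

# Hypothetical stand-in for the IPFS DHT (not a real IPFS API):
# root CID -> piece CIDs, and piece CID -> nodes that provide the piece.
PIECES = {"QmRoot": ["QmPieceA", "QmPieceB", "QmPieceC"]}
PROVIDERS = {
    "QmPieceA": ["node1"],
    "QmPieceB": ["node2", "node1"],
    "QmPieceC": ["node2"],
}

# Hypothetical local storage of each provider: node -> piece CID -> rows.
LOCAL_DATA = {
    "node1": {
        "QmPieceA": [{"id": 1, "v": 10}, {"id": 2, "v": 25}],
        "QmPieceB": [{"id": 3, "v": 30}, {"id": 4, "v": 5}],
    },
    "node2": {
        "QmPieceB": [{"id": 3, "v": 30}, {"id": 4, "v": 5}],
        "QmPieceC": [{"id": 5, "v": 40}],
    },
}


def resolve(root_cid):
    """Step 1: find the pieces of the file and the providers of each piece."""
    return {piece: PROVIDERS[piece] for piece in PIECES[root_cid]}


def assign(piece_providers):
    """Step 2: give each piece to its first listed provider (an assumption)."""
    jobs = defaultdict(list)
    for piece, nodes in piece_providers.items():
        jobs[nodes[0]].append(piece)
    return dict(jobs)


def run_on_provider(node, pieces, predicate):
    """Step 3: a provider reads its local pieces and filters the rows."""
    rows = []
    for piece in pieces:
        rows.extend(r for r in LOCAL_DATA[node][piece] if predicate(r))
    return rows


def query(root_cid, predicate):
    """Steps 0 to 4 as seen from the Foreman: resolve, distribute, gather."""
    results = []
    for node, pieces in assign(resolve(root_cid)).items():
        results.extend(run_on_provider(node, pieces, predicate))
    return sorted(results, key=lambda r: r["id"])


if __name__ == "__main__":
    # Roughly "SELECT * FROM <file with CID QmRoot> WHERE v > 20".
    for row in query("QmRoot", lambda r: r["v"] > 20):
        print(row)
```

In this toy version the filter runs on the node that holds the piece, so only
matching rows travel back to the Foreman, which is the point of sending the
plan to the data rather than the data to the plan.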

Thanks to the modular design of Drill, we could rather "easily" write this
storage plugin. The plugin currently supports basic query operations, both
read and write, but only works with JSON and CSV files. It is not very
stable yet, and the performance is still poor, mainly because DHT queries
on IPFS take too long. We plan to address these problems in the future.

If you are interested, we have made a few slides that explain the ideas in
detail:
https://www.slideshare.net/BowenDing4/minerva-ipfs-storage-plugin-for-ipfs

Any suggestion is welcome. ^_^

Find the code on GitHub: https://github.com/bdchain/Minerva

Best,
Wang Liang
  
