Hello Apache Accumulo devs!

I wanted to highlight some recent work on building a modern Apache Spark 
connector for Apache Accumulo. The goal is to enable ML capabilities and 
efficient data transfer for data stored in Accumulo. It would be great to get 
feedback from people working with both of these systems to understand how this 
can benefit your use case and whether there are additional features that would 
be valuable. See the release notes below.

Thanks,
Scott Graham
Senior Data Scientist
Microsoft | Azure Global Customer Engineering - AI


MASC Release 1.0.3
MASC 1.0.3, Microsoft's Apache Spark connector for Apache Accumulo, has been 
released. MASC integrates Apache Spark and Apache Accumulo, combining the rich 
Spark machine learning ecosystem with the scalable and secure data storage 
capabilities of Accumulo. This work is publicly available under the Apache 
License 2.0 on GitHub at https://github.com/microsoft/masc. Feedback, 
questions, and contributions are welcome.
Usage
A PySpark-based example is available in the Accumulo-Spark Connector Demo 
Notebook<https://github.com/microsoft/masc/blob/master/connector/examples/AccumuloSparkConnector.ipynb>.
Connector documentation: 
https://github.com/microsoft/masc/blob/master/connector/README.md
JARs available on Maven Central Repository:

  *   https://mvnrepository.com/artifact/com.microsoft.masc/microsoft-accumulo-spark-datasource
  *   https://mvnrepository.com/artifact/com.microsoft.masc/microsoft-accumulo-spark-iterator
Major Features

  *   Simplified Spark DataFrame read/write to Accumulo using DataSource v2 
API<http://shzhangji.com/blog/2018/12/08/spark-datasource-api-v2/>
  *   Speedup of 2-5x over existing approaches for pulling key-value data into 
DataFrame format
  *   Scala and Python support without overhead for moving between languages
  *   Process streaming data from Accumulo without loading it all into Spark 
memory
  *   Pushdown filtering with a flexible expression language 
(JUEL<http://juel.sourceforge.net/>): this lets the user apply logical 
operators and comparisons to reduce the amount of data returned from Accumulo
  *   Column pruning based on selected fields transparently reduces the amount 
of data returned from Accumulo
  *   Server-side inference: this allows the Accumulo nodes to run ML model 
inference using MLeap<https://mleap-docs.combust.ml/>, which increases the 
scalability of AI solutions while keeping the data in Accumulo.
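
To make the pushdown filtering and column pruning items above more concrete, 
here is a toy sketch in plain Python. It does not use MASC, Spark, or Accumulo: 
the `scan` function, the sample entries, and the lambda predicate are all 
hypothetical, and the lambda merely stands in for a JUEL expression. It only 
illustrates the idea of assembling sorted key-value entries into rows, 
filtering them, and returning just the requested columns before anything 
reaches the client.

```python
from itertools import groupby
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical stand-in for Accumulo entries: (row id, column qualifier, value).
# Entries are assumed to arrive sorted by row id, as a sorted key-value store would return them.
Entry = Tuple[str, str, object]


def scan(
    entries: Iterable[Entry],
    columns: List[str],
    predicate: Callable[[Dict[str, object]], bool],
) -> Iterator[Dict[str, object]]:
    """Assemble rows from key-value entries, then filter and prune them 'server side'."""
    for row_id, group in groupby(entries, key=lambda e: e[0]):
        # Pivot all entries that share a row id into one record.
        row = {qualifier: value for _, qualifier, value in group}
        # Filter pushdown: rows failing the predicate are dropped here.
        if predicate(row):
            # Column pruning: only the requested fields are returned.
            yield {"row_id": row_id, **{c: row.get(c) for c in columns}}


entries = [
    ("r1", "age", 34), ("r1", "name", "ada"), ("r1", "score", 0.9),
    ("r2", "age", 17), ("r2", "name", "bob"), ("r2", "score", 0.4),
    ("r3", "age", 52), ("r3", "name", "cy"), ("r3", "score", 0.7),
]

# The predicate reads 'age' and 'score', but only 'name' is returned to the caller.
rows = list(scan(entries, ["name"], lambda r: r["age"] >= 18 and r["score"] > 0.5))
print(rows)
```

Because `scan` is a generator, rows are produced one at a time rather than 
collected up front, which loosely mirrors the streaming item in the list. For 
the actual connector options and filter syntax, see the connector 
documentation linked under Usage.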
Known Issues

  *   [37<https://github.com/microsoft/masc/issues/37>] Support SaveMode when 
writing DataFrames
Contributions
Thanks to the contributors from the Azure Government Customer Engineering and 
Azure Government teams:
Markus Cozowicz, Scott Graham, Jun-Ki Min, Chenhui Hu, Arvind Shyamsundar, Marc 
Parisi, Billie Rinaldi, Anupam Sharma, Tao Wu and Pavandeep Kalra.
