[GitHub] [accumulo-website] ctubbsii commented on a change in pull request #221: MASC blog post

GitBox Tue, 25 Feb 2020 14:38:46 -0800

ctubbsii commented on a change in pull request #221: MASC blog post
URL: https://github.com/apache/accumulo-website/pull/221#discussion_r384167359


 ##########
 File path: _posts/blog/2020-02-11-accumulo-spark-connector.md
 ##########
 @@ -0,0 +1,192 @@
+---
+title: Microsoft MASC, an Apache Spark connector for Apache Accumulo
+author: Markus Cozowicz, Scott Graham
+---
+
+# Overview
+MASC provides an Apache Spark native connector for Apache Accumulo to 
integrate the rich Spark machine learning ecosystem with the scalable and 
secure data storage capabilities of Accumulo. 
+
+## Major Features
+- Simplified Spark DataFrame read/write to Accumulo using DataSource v2 API
+- Speedup of 2-5x over existing approaches for pulling key-value data into 
DataFrame format
+- Scala and Python support without overhead for moving between languages
+- Process streaming data from Accumulo without loading it all into Spark memory
+- Push down filtering with a flexible expression language 
([JUEL](http://juel.sourceforge.net/)): user can define logical operators and 
comparisons to reduce the amount of data returned from Accumulo 
+- Column pruning based on selected fields transparently reduces the amount of 
data returned from Accumulo
+- Server-side inference: ML model inference can run on the Accumulo nodes 
using MLeap, increasing the scalability of AI solutions while keeping data 
in Accumulo
+
+## Use-cases
+There are many scenarios where use of this connector provides advantages; 
below we list a few common use-cases.
+
+**Scenario 1**: A data analyst needs to execute model inference on a large 
amount of data in Accumulo.<br>
+**Benefit**: Instead of transferring all the data to a large Spark cluster to 
score using a Spark model, the model can be exported and pushed down using the 
connector to run on the Accumulo cluster. This can reduce the need for a large 
Spark cluster as well as the amount of data transferred between systems, and 
can improve inference speeds (>2x speedups observed).
+
+**Scenario 2**: A data scientist needs to train a Spark model on a large 
amount of data in Accumulo.<br>
+**Benefit**: Instead of pulling all the data into a large Spark cluster and 
restructuring the format to use Spark MLlib tools, the connector allows for 
data to be streamed into Spark as a DataFrame, reducing time to train and Spark 
cluster size / memory requirements.
+
+**Scenario 3**: A data analyst needs to perform ad hoc analysis on large 
amounts of data stored in Accumulo.<br>
+**Benefit**: Instead of pulling all the data into a large Spark cluster, the 
connector allows for both rows and columns to be pruned using pushdown 
filtering with a flexible expression language.
+
+# Architecture
+The Accumulo-Spark connector is composed of two components:
+
+- Accumulo server-side iterator performs
+  - column pruning
+  - row-based filtering
+  - [MLeap](https://github.com/combust/mleap) ML model inference and
+  - row assembly using [Apache Avro](https://avro.apache.org/)
+- Spark DataSource V2 
+  - determines the number of Spark tasks based on available Accumulo table 
splits
+  - translates Spark filter conditions into a 
[JUEL](http://juel.sourceforge.net/) expression
+  - configures the Accumulo iterator
+  - deserializes the Avro payload
+
+<img class="blog-img-center" src="/images/blog/202002_masc/architecture.svg">
 
 Review comment:
   For consistency, it's probably best to stick with the Kramdown syntax for 
images and class attributes: https://kramdown.gettalong.org/syntax.html#images
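
   A minimal sketch of what that could look like for the line above, using a 
Kramdown inline attribute list placed directly after the image. The alt text 
"MASC architecture" is an assumption for illustration; the original tag has 
none:

```markdown
![MASC architecture](/images/blog/202002_masc/architecture.svg){: .blog-img-center}
```

   The `{: .blog-img-center}` suffix should attach the same class that the 
`class` attribute sets on the raw `<img>` tag.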

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
