Mahout meeting notes  12.6.2019

==============================



A meeting was held today, Friday 6 Dec 2019, to discuss the current 
state of the project, planned releases and a general path forward.

Joe Olson, Andrew Palumbo and Trevor Grant met via Google Hangouts at 10:15 AM.


Early discussion was based around AP and TG’s loose and quickly put together 
agenda and ideas. AP started the unofficial agenda doc <10 mins before the 
meeting start, so the agenda was quick n dirty.


An agreement was made early on by TG and AP to focus on the release, as the 
build is currently working and releases are deploying artifacts for Scala 
2.11 and Scala 2.12, pegged to Java 1.8 and mvn 3.3.9.  A heavy refactoring 
effort was made to fix the build: by revamping some very old poms, reverting 
back to the parent `Apache pom.xml` `release` goal and adding some new 
information to the release master’s `.m2/settings.xml`, we are able to 
release and deploy artifacts with Java 1.8, cross compiled for Scala 2.11 and 
Scala 2.12.


https://repository.apache.org/#nexus-search;gav~org.apache.mahout~mahout-core_2.12~~~~kw,versionexpand

https://repository.apache.org/#nexus-search;gav~org.apache.mahout~mahout-hdfs_2.11~~~~kw,versionexpand


A release board was created:


https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=348


And some minor issues were added.


A decision was made to move the Docker files and certain planned AWS 
infrastructure-as-code (Terraform) slated for the 14.1 release off the central 
repository: the Docker files onto Docker Hub under a newly created mahout 
namespace, and the infrastructure code onto AP’s GitHub.  JO will be handling 
the task of creating the hub.docker.com <https://hub.docker.com/> "Mahout" 
organization and moving the Docker files to that space.  
[MAHOUT-2074<https://issues.apache.org/jira/browse/MAHOUT-2074>]



AP will be creating a mahout-contrib repo on his personal page, to be merged 
in later, with some Terraform code and examples, etc., probably borrowing 
heavily from Pulsar and Spark 
(https://github.com/apache/pulsar/tree/master/deployment).  AP has also begun 
putting some off-project time into this mahout-contrib package, or at least is 
keeping much of that work in the org.apache.mahout namespace.  Some work has 
already been done with NiFi and MiNiFi for an SDR project under the 
org.apache.mahout namespace, which will be available in the mahout-contrib 
package or another standalone package.  [NOT DISCUSSED at the meeting; forgot 
to bring it up.]


AP will fix the `change-scala-version.sh` script [MAHOUT-2080] and will bump 
the Scala version in master over the weekend [MAHOUT-2082], at which point we 
will call a code freeze and attempt to release by next weekend (after cutting 
an RC).



TG was able to get Jenkins running again, building snapshots, fixing 
[MAHOUT-2073<https://issues.apache.org/jira/browse/MAHOUT-2073>].


JO will look into some other projects’ build chains 
[MAHOUT-2076<https://issues.apache.org/jira/browse/MAHOUT-2076>] and consider 
a script to cut down on RC creation and release deployment time by having a 
single script with all release commands, similar to Apache Spark.  (Pulsar was 
discussed as a reference, but the project is moving quickly and they’ve 
refactored their build since I (AP) last looked; in fact, deploying Pulsar is 
as simple as `mvn clean deploy`.)


https://github.com/apache/pulsar/blob/master/.test-infra/jenkins/job_pulsar_release_nightly_snapshot.groovy


Spark and Flink should have good examples, e.g.:

Spark:

https://github.com/apache/spark/blob/master/dev/make-distribution.sh

https://github.com/apache/spark/tree/master/dev/create-release

Flink:

https://github.com/apache/flink/tree/master/tools/releasing
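As a rough illustration of the "single script with all release commands" idea, here is a minimal Python sketch that only assembles and prints the command sequence for the two Scala versions named above. It is not the script tracked in MAHOUT-2076. The argument passed to `change-scala-version.sh`, the use of `mvn clean deploy` for Mahout itself (the notes only say this works for Pulsar), and the helper names `release_commands` and `run_release` are all assumptions made for illustration.

```python
import subprocess

# The two Scala versions the notes say artifacts are cross compiled for.
SCALA_VERSIONS = ("2.11", "2.12")


def release_commands(scala_versions=SCALA_VERSIONS):
    """Return the ordered list of commands for one release pass.

    Assumption: change-scala-version.sh takes the target Scala version as
    its only argument, and a plain `mvn clean deploy` is enough per version.
    """
    commands = []
    for version in scala_versions:
        commands.append(["./change-scala-version.sh", version])
        commands.append(["mvn", "clean", "deploy"])
    return commands


def run_release(execute=False):
    """Print every command; only run them when execute=True."""
    for command in release_commands():
        print(" ".join(command))
        if execute:
            # check=True stops the release at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_release(execute=False)  # dry run: prints the four commands
```

Keeping the command list separate from the code that runs it makes the sequence easy to review before anything is deployed.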



TG will work on Zeppelin integration for some easy mahout-python-ggplot2 
examples.


There was discussion of using the US Census API for data examples.


A long-running issue was resolved: 
[MAHOUT-2023<https://issues.apache.org/jira/browse/MAHOUT-2023>] Broken Scopt 
Classes.


*ISSUE*:  we have no way of testing this; we need @pat to take a look.  With 
the nightly snapshots being built, the current version in master is available 
in Nexus:


[JENKINS] Archiving 
/home/jenkins/jenkins-slave/workspace/mahout-nightly/community/spark-cli-drivers/target/mahout-spark-cli-drivers_2.11-14.1-SNAPSHOT.jar
 to 
org.apache.mahout/mahout-spark-cli-drivers_2.11/14.1-20191206.193308-1/mahout-spark-cli-drivers_2.11-14.1-20191206.193308-1.jar
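The Jenkins line above shows two naming conventions at work: the Scala binary version is appended to the module name (`_2.11`), and the `-SNAPSHOT` qualifier is replaced by a timestamp and build number when the jar is deployed. A small Python sketch of that mapping follows. The function names are hypothetical and exist only for illustration; the inputs are taken from the log line itself.

```python
def artifact_id(module, scala_binary_version):
    """Scala-suffixed artifact ID, e.g. mahout-core_2.12."""
    return f"{module}_{scala_binary_version}"


def deployed_snapshot_jar(module, scala_binary_version, version,
                          timestamp, build_number):
    """File name of a snapshot jar as stored in the repository.

    `version` is the working version ending in -SNAPSHOT; on deploy the
    qualifier is replaced by <timestamp>-<build_number>.
    """
    suffix = "-SNAPSHOT"
    if not version.endswith(suffix):
        raise ValueError("expected a -SNAPSHOT version")
    base = version[: -len(suffix)]
    name = artifact_id(module, scala_binary_version)
    return f"{name}-{base}-{timestamp}-{build_number}.jar"


if __name__ == "__main__":
    # Values copied from the Jenkins archive line above.
    print(deployed_snapshot_jar("mahout-spark-cli-drivers", "2.11",
                                "14.1-SNAPSHOT", "20191206.193308", 1))
```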



We spoke quickly today, and these notes were compiled to the best of my 
recollection.  If I missed anything, please let me know.





Trevor’s Agenda:

  1.  Release # Addressed

     *   Path to release # addressed

     *   Steps -> jira tickets # addressed

     *   Code freeze date # addressed Monday, 9 Dec 2019.

  2.  Other Misc. # discussed and addressed



Andy’s Agenda:


# RELEASE…



  1.  Fix Docker files, # addressed: moving Dockerfiles to Docker Hub and 
infrastructure-as-code to AP github

  2.  Create a Release script for 14.1 # addressed - ticket + assigned

  3.  Fix change-scala-version.sh script. # ticket + assigned.

  4.  Add a terraform script to examples for an asg # addressed


# Wish list (Post 14.1 release)



# in-core matrices backed by off-heap and/or shared memory; tighter coupling 
with GPU, native code, Python, TPUs, FPGAs.


  1.  Arrow backed in core

     *   Arrow is advertising Sparse and Dense Tensors and CSR matrices, and 
already has vectors.

        *   Arrow’s general idea is to have off heap shared memory between OS 
and GPU

     *   Been bit in the ass by them before.. Not all packages are as complete 
as advertised.

     *   https://arrow.apache.org/docs/java/

        *   public final class SparseMatrixIndexCSR could be used as well as 
Tensor<T> class.

  2.  Ability to stream data into in-core matrices’ off-heap buffers from, 
e.g., NiFi.

  3.  https://github.com/apache/incubator-datasketches-memory

     *   http://datasketches.incubator.apache.org/docs/Memory/MemoryPackage.html

     *   Streaming sketch algos:

        *   
https://github.com/DataSketches/DataSketches.github.io/blob/master/docs/pdf/Quantiles_KLL.pdf

        *   others

  4.  Tighter, simpler CUDA integration; if Arrow is mature enough we may have 
access to cuML, etc.

  5.  Working with off Heap memory also makes Python a more viable and not so 
distant possibility.

# ALGOS

  1.  GLMs

  2.  Evolutionary Algos with Spiking Neurons (FINALLY)

  3.  DrmLike[Complex128]

  4.  Currently working on a project for which I need basic streaming 
capabilities.

     *   To be done in https://github.com/andrewpalumbo/mahout-contrib  Or some 
such.

     *   Integrate Apache DataSketches-incubating for streaming sketching and 
analysis.

        *   Streaming SVD-type algorithms

        *   Find eigenvectors as data streams into in-core matrices, which are 
further stacked into DRMs.
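To make the "find eigenvectors as data streams in" item a little more concrete, here is a minimal pure-Python sketch of one well-known streaming approach, Oja's rule, which keeps a running estimate of the top eigenvector of the data covariance and updates it one sample at a time. This is a stand-in chosen for illustration: it is not a Mahout or DataSketches API, and nothing in these notes says it is the algorithm that will be used. The synthetic stream, the learning rate and all function names are assumptions.

```python
import math
import random


def oja_update(w, x, learning_rate=0.01):
    """One step of Oja's rule with explicit renormalisation."""
    projection = sum(wi * xi for wi, xi in zip(w, x))
    updated = [wi + learning_rate * projection * xi for wi, xi in zip(w, x)]
    norm = math.sqrt(sum(u * u for u in updated))
    return [u / norm for u in updated]


def top_eigenvector(stream, dimension, learning_rate=0.01):
    """Consume samples one at a time and return a unit-length estimate."""
    w = [1.0 / math.sqrt(dimension)] * dimension
    for x in stream:
        w = oja_update(w, x, learning_rate)
    return w


def synthetic_stream(n, seed=0):
    """Hypothetical zero-mean data with most variance along the first axis."""
    rng = random.Random(seed)
    for _ in range(n):
        yield (3.0 * rng.gauss(0, 1), 0.3 * rng.gauss(0, 1))


if __name__ == "__main__":
    w = top_eigenvector(synthetic_stream(5000), dimension=2)
    print(w)  # estimate should point (up to sign) along the first axis
```

Only the current estimate is held in memory, which is the property that matters for the streaming use case; the samples themselves are never stored.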




===========================================================

The next meeting will take place Friday 13 Dec 2019 at 10 AM PST.


All are welcome.


Please respond to d...@mahout.apache.org for an invite, and access to next 
week’s agenda.





