roadan commented on a change in pull request #58: documentation for pyspark sdk URL: https://github.com/apache/incubator-amaterasu/pull/58#discussion_r297495190
########## File path: docs/docs/frameworks.md ##########

@@ -41,13 +41,145 @@ Amaterasu supports different processing frameworks to be executed. Amaterasu fra

 # Amaterasu Frameworks

+## Python
+Apache Amaterasu supports the following types of Python workloads:
+
+1. PySpark workloads ([see below](#pyspark))
+
+2. Pandas workloads
+
+3. Generic Python workloads
+
+Each workload type has dedicated support in the Apache Amaterasu SDK.
+The SDK is available on PyPI and can be installed as follows:
+```bash
+pip install apache-amaterasu
+```
+
+Alternatively, you can download the SDK source distribution and install it manually, either via ```easy_install``` or by executing the setup script:
+
+```bash
+wget <link to source distribution>
+tar -xzf apache-amaterasu-0.2.1-incubating.tar.gz
+cd apache-amaterasu-0.2.1-incubating
+python setup.py install
+```
+
+### Action dependencies
+Apache Amaterasu ensures that the Python dependencies an action requires are present on all execution nodes before the action sources are executed.
+
+To define the required dependencies, add a ```requirements.txt``` file to the job repository.
+Currently, only a single global ```requirements.txt``` is supported.
+
+The repository layout below shows where the requirements file must be placed:
+```
+repo
++-- deps/
+|   +-- requirements.txt    <-- This is the place for defining dependencies
++-- env/
+|   +-- dev/
+|   |   +-- job.yaml
+|   |   +-- spark.yaml
+|   +-- test/
+|   |   +-- job.yaml
+|   |   +-- spark.yaml
+|   +-- prod/
+|       +-- job.yaml
+|       +-- spark.yaml
++-- src/
+|   +-- start/
+|   +-- dev/
+|   |   +-- job.yaml
+|   |   +-- spark.yaml
+|   +-- test/
+|   |   +-- job.yaml
+|   |   +-- spark.yaml
+|   +-- prod/
+|       +-- job.yaml
+|       +-- spark.yaml
++-- maki.yaml
+```
+
+When a ```requirements.txt``` file exists, Apache Amaterasu distributes it to the execution containers and installs the dependencies locally in each container.
+
+> **Important** - Your execution nodes need egress connectivity in order for pip to fetch the dependencies
+
+### Pandas
+### Generic Python
+
+
+## Java and JVM programs
+
 ## Apache Spark

 ### Spark Configuration

 ### Scala

 ### PySpark
+Apache Amaterasu is able to deploy PySpark applications and provides configuration and integration

Review comment:
   This section should be made generic about Spark in general and moved up
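To make the documented dependency flow concrete, here is a minimal sketch of how a declared dependency becomes importable inside an action. The file name `src/start/process_data.py` and the `pandas` pin are illustrative assumptions, not part of this PR; the only mechanism the docs describe is that Amaterasu ships `deps/requirements.txt` to each execution container and pip-installs it before the action runs.

```
# deps/requirements.txt -- hypothetical contents, shown for illustration
pandas==0.24.2
```

```python
# src/start/process_data.py -- hypothetical generic Python action.
# Because deps/requirements.txt declares pandas, Amaterasu installs it
# in the execution container before this source file runs, so the
# import below is expected to succeed on every execution node.
import pandas as pd


def main():
    # Stand-in for real action logic: build a tiny frame and aggregate it.
    df = pd.DataFrame({"value": [1, 2, 3]})
    print(df["value"].sum())  # prints 6


if __name__ == "__main__":
    main()
```

Note that this relies on the egress requirement called out above: if the container cannot reach PyPI (or a configured mirror), the pip install step, and therefore the action, will fail.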