roadan commented on a change in pull request #58: documentation for pyspark sdk
URL: https://github.com/apache/incubator-amaterasu/pull/58#discussion_r297473059

##########
File path: docs/docs/config.md
##########
@@ -77,4 +77,46 @@ All frameworks have their own configuration, Apache Amaterasu allows different f
 For more information about specific framework configuration options, look at the [frameworks](frameworks/) section of this documentation.
 ### Datasets
+
+One aspect of maintaining different deployment environments is where and how you get the data required to run the jobs.
+
+To provide an abstraction, each of our SDKs provides a way to load and persist data easily. This functionality is based on prior configuration.
+
+In a job repository, each environment contains a ```datasets.yml``` file. This file contains the configurations of all datasets to be used in the job.
+
+Below is an example of a simple configuration, for a dataset stored as parquet in Amazon S3.
+
+```yaml
+file:
+  - uri: s3a://amaterasu-example/input/random-beers
+    format: parquet
+    name: random-beers
+```
+
+#### Detailed configuration
+Below are the different types of datasets and their corresponding configuration options.
+Do note that different Apache Amaterasu frameworks may have their take on the configurations below.
+##### File
+The following formats are currently supported - JSON, parquet, CSV, ORC.

Review comment:
Currently, the...

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services