roadan commented on a change in pull request #58: documentation for pyspark sdk
URL: https://github.com/apache/incubator-amaterasu/pull/58#discussion_r297473059
 
 

 ##########
 File path: docs/docs/config.md
 ##########
 @@ -77,4 +77,46 @@ All frameworks have their own configuration, Apache Amaterasu allows different f
 For more information about specific framework configuration options, look at the [frameworks](frameworks/) section of this documentation.
 
 ### Datasets 
+
+One aspect of maintaining different deployment environments is where and how you get the data required to run the jobs.
+
+To provide an abstraction, each of our SDKs provides a way to load and persist data easily. This functionality is based on prior configuration.
+
+In a job repository, each environment contains a ```datasets.yml``` file. This file contains the configurations of all datasets to be used in the job.
+
+Below is an example of a simple configuration, for a dataset stored as parquet in Amazon S3.
+
+```yaml
+file:
+  - uri: s3a://amaterasu-example/input/random-beers
+    format: parquet
+    name: random-beers
+```
+
+#### Detailed configuration
+Below are the different types of datasets and their corresponding configuration options.
+Do note that different Apache Amaterasu frameworks may have their take on the configurations below.
+##### File
+The following formats are currently supported - JSON, parquet, CSV, ORC.
 
 Review comment:
   Currently, the...
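
   For reference, the `datasets.yml` structure in the hunk above can be made concrete with a minimal sketch. It assumes the file has already been parsed into a plain dictionary and looks a file dataset up by name, checking its format against the formats listed under "File". The names `DATASETS`, `SUPPORTED_FILE_FORMATS`, and `find_file_dataset` are hypothetical and are not part of the Amaterasu SDK; how the SDK actually resolves datasets is not shown in this diff.

```python
# Parsed form of the datasets.yml example from the diff above.
# Assumption: the YAML maps to a dict with a "file" list of entries.
DATASETS = {
    "file": [
        {
            "uri": "s3a://amaterasu-example/input/random-beers",
            "format": "parquet",
            "name": "random-beers",
        }
    ]
}

# Formats listed in the "File" section of the proposed docs.
SUPPORTED_FILE_FORMATS = {"json", "parquet", "csv", "orc"}


def find_file_dataset(config, name):
    """Return the file dataset entry called `name`, validating its format.

    Hypothetical helper for illustration only.
    """
    for entry in config.get("file", []):
        if entry["name"] == name:
            fmt = entry["format"].lower()
            if fmt not in SUPPORTED_FILE_FORMATS:
                raise ValueError(
                    f"unsupported format {fmt!r} for dataset {name!r}"
                )
            return entry
    raise KeyError(f"no file dataset named {name!r}")


beers = find_file_dataset(DATASETS, "random-beers")
print(beers["uri"], beers["format"])
```

   Keeping the lookup keyed on `name` mirrors the idea in the docs that jobs refer to datasets by name while each environment supplies its own `uri`.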

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
