jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367697988
 
 

 ##########
 File path: docs/development/extensions-core/hdfs.md
 ##########
 @@ -36,49 +36,110 @@ To use this Apache Druid extension, make sure to [include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal user name |empty|
 |`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml`, `hdfs-site.xml`)
+in the Druid classpath. One way to do this is copying all those files under `${DRUID_HOME}/conf/_common`.
+
+If you are using the Hadoop ingestion, set your output directory to be a location on Hadoop and it will work.
 If you want to eagerly authenticate against a secured hadoop/hdfs cluster you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job method that runs `kinit` command periodically.
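
For illustration, a minimal `common.runtime.properties` sketch combining HDFS deep storage with the eager Kerberos authentication described above could look like the following; the storage path and principal are placeholders rather than values taken from this change.

```properties
# Load the HDFS deep storage extension
druid.extensions.loadList=["druid-hdfs-storage"]

# HDFS deep storage (placeholder path)
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments

# Eager Kerberos authentication (placeholder principal; keytab path matches the example above)
druid.hadoop.security.kerberos.principal=druid@EXAMPLE.COM
druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/druid.headlessUser.keytab
```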
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use the AWS S3 or the Google Cloud Storage as the deep storage via HDFS.
+
+#### Configuration for AWS S3
 
-The HDFS extension can also be used for GCS as deep storage.
+To use the AWS S3 as the deep storage, you need to configure `druid.storage.storageDirectory` properly.
 
 |Property|Possible Values|Description|Default|
 |--------|---------------|-----------|-------|
-|`druid.storage.type`|hdfs||Must be set.|
-|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|
+|`druid.storage.type`|hdfs| |Must be set.|
+|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.|
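
For illustration only, the matching deep storage properties with an S3A path could look like this (bucket and prefix are placeholders):

```properties
druid.storage.type=hdfs
# Any s3a:// (or s3n://) URI reachable through the Hadoop S3A/S3N connectors
druid.storage.storageDirectory=s3a://your-bucket/druid/segments
```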
 
-All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in <druid>/lib/ and <druid>/extensions/druid-hdfs-storage/
+You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially the `hadoop-aws.jar` in the Druid classpath.
+Run the below command to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes.
 
-Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
-
-<a name="firehose"></a>
+```bash
+java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"
+cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
+```
 
-## Native batch ingestion
+Finally, you need to add the below properties in the `core-site.xml`.
+For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).
+
+```xml
+<property>
+  <name>fs.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+  <description>The implementation class of the S3A Filesystem</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3A</value>
+  <description>The implementation class of the S3A AbstractFileSystem.</description>
+</property>
+
+<property>
+  <name>fs.s3a.access.key</name>
+  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your access key</value>
+</property>
+
+<property>
+  <name>fs.s3a.secret.key</name>
+  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your secret key</value>
+</property>
+```
 
-This firehose ingests events from a predefined list of files from a Hadoop filesystem.
-This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
-Since each split represents an HDFS file, each worker task of `index_parallel` will read an object.
+#### Configuration for Google Cloud Storage
 
-Sample spec:
+To use the Google cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly.
 
 Review comment:
   Google cloud Storage -> Google Cloud Storage
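
For completeness, a hedged sketch of the Google Cloud Storage variant, reusing the `gs://` example path from the lines this change removes (the GCS connector jar still needs to be on the Druid classpath, as the removed text notes):

```properties
druid.storage.type=hdfs
# gs:// path handled by the GCS connector jar on the classpath
druid.storage.storageDirectory=gs://bucket/example/directory
```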
