[43/50] samza git commit: Cleanup docs for HDFS connector
Cleanup docs for HDFS connector

Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/f8470b1e
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/f8470b1e
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/f8470b1e

Branch: refs/heads/master
Commit: f8470b1ed796d00888e4ba176a0105c9b07b2938
Parents: ed196c7
Author: Jagadish
Authored: Fri Nov 2 17:33:26 2018 -0700
Committer: Jagadish
Committed: Fri Nov 2 17:33:26 2018 -0700

--
 .../documentation/versioned/connectors/hdfs.md | 134 +++
 1 file changed, 50 insertions(+), 84 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/samza/blob/f8470b1e/docs/learn/documentation/versioned/connectors/hdfs.md
--
diff --git a/docs/learn/documentation/versioned/connectors/hdfs.md b/docs/learn/documentation/versioned/connectors/hdfs.md
index 9692d18..9b79f24 100644
--- a/docs/learn/documentation/versioned/connectors/hdfs.md
+++ b/docs/learn/documentation/versioned/connectors/hdfs.md
@@ -21,133 +21,99 @@ title: HDFS Connector

 ## Overview

-Samza applications can read and process data stored in HDFS. Likewise, you can also write processed results to HDFS.
-
-### Environment Requirement
-
-Your job needs to run on the same YARN cluster which hosts the HDFS you want to consume from (or write into).
+The HDFS connector allows your Samza jobs to read data stored in HDFS files. Likewise, you can write processed results to HDFS.
+To interact with HDFS, Samza requires your job to run on the same YARN cluster.

 ## Consuming from HDFS
+### Input Partitioning

-You can configure your Samza job to read from HDFS files with the [HdfsSystemConsumer](https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java). Avro encoded records are supported out of the box and it is easy to extend to support other formats (plain text, csv, json etc). See Event Format section below.
-
-### Partitioning
+Partitioning works at the level of individual directories and files. Each directory is treated as its own stream and each of its files is treated as a _partition_. For example, Samza creates 5 partitions when it's reading from a directory containing 5 files. There is no way to parallelize the consumption when reading from a single file - you can only have one container to process the file.

-Partitioning works at the level of individual directories and files. Each directory is treated as its own stream, while each of its files is treated as a partition. For example, when reading from a directory on HDFS with 10 files, there will be 10 partitions created. This means that you can have up-to 10 containers to process them. If you want to read from a single HDFS file, there is currently no way to break down the consumption - you can only have one container to process the file.

+### Input Event format
+Samza supports avro natively, and it's easy to extend to other serialization formats. Each avro record read from HDFS is wrapped into a message-envelope. The [envelope](../api/javadocs/org/apache/samza/system/IncomingMessageEnvelope.html) contains these 3 fields:

+- The key, which is empty

-### Event format

-Samza's HDFS consumer wraps each avro record read from HDFS into a message-envelope. The [Envelope](../api/javadocs/org/apache/samza/system/IncomingMessageEnvelope.html) contains three fields of interest:

+- The value, which is set to the avro [GenericRecord](https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/generic/GenericRecord.html)

-1. The key, which is empty
-2. The message, which is set to the avro [GenericRecord](https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/generic/GenericRecord.html)
-3. The stream partition, which is set to the name of the HDFS file

+- The partition, which is set to the name of the HDFS file

-To support input formats which are not avro, you can implement the [SingleFileHdfsReader](https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java) interface (example: [AvroFileHdfsReader](https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java))
+To support non-avro input formats, you can implement the [SingleFileHdfsReader](https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java) interface.

-### End of stream support
+### EndOfStream

-While streaming sources like Kafka are unbounded, files on HDFS have finite data and have a notion of end-of-file.
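As an illustration of the event format described in the updated docs, here is a minimal sketch (not part of this commit) of a low-level StreamTask that unpacks these envelopes. The class name, the illustrative avro field "memberId", and the system name "hdfs-input" mentioned in the comments are made-up placeholders; only the Samza and Avro APIs shown are real.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

/**
 * Illustrative only: a low-level StreamTask that consumes avro records read from HDFS.
 * Wiring it up would typically point a system at the HDFS system factory in the job config,
 * e.g. systems.hdfs-input.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory
 * (the system name "hdfs-input" is a placeholder).
 */
public class HdfsAvroPrinterTask implements StreamTask {

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Per the docs above, the key is empty and the message is the deserialized avro GenericRecord.
    GenericRecord record = (GenericRecord) envelope.getMessage();

    // The stream partition corresponds to the HDFS file this record was read from.
    String source = envelope.getSystemStreamPartition().toString();

    // "memberId" is a hypothetical field; substitute fields from your own avro schema.
    Object memberId = record.get("memberId");
    System.out.println("Read record with memberId=" + memberId + " from " + source);
  }
}
```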