[GitHub] samza pull request #45: SAMZA-859: Create a simple join example in hello-sam...

2017-01-25 Thread banecogic
GitHub user banecogic opened a pull request:

https://github.com/apache/samza/pull/45

SAMZA-859: Create a simple join example in hello-samza tutorail (docs)

Documentation for Hello Samza [stream join 
example](https://github.com/apache/samza-hello-samza/pull/9).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/banecogic/samza SAMZA-859

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/samza/pull/45.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #45


commit d3a7b420f6943ef5141932d22b639ccb389296b8
Author: banecogic 
Date:   2017-01-25T09:46:13Z

SAMZA-859: Create a simple join example in hello-samza tutorail (docs)




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-25 Thread Navina Ramesh

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/#review163036
---



Thanks for adding the documentation!


docs/learn/documentation/versioned/hdfs/consumer.md (line 26)


Can you include the diagram from your design document?  Or something 
similar to elaborate how the setup should look like?



docs/learn/documentation/versioned/hdfs/consumer.md (line 42)


Can you add a snippet of the interface here as well? It is easier for the 
user to skim through.



docs/learn/documentation/versioned/hdfs/consumer.md (line 50)


Replace "users" with "user application". 

We provide the capability for the user application to get notified when ...

Rephrase "To do so, simply implement the interface 
[EndOfStreamListenerTask]" as "In order to receieve notications on EndOfStream 
with the task, the user application should simply implement the interface ..."



docs/learn/documentation/versioned/hdfs/consumer.md (line 54)


I think you can skip the "job.properties" file. The readers may easily 
assume there is a separate properties file.



docs/learn/documentation/versioned/hdfs/consumer.md (line 75)


Typo "configs are required"



docs/learn/documentation/versioned/hdfs/consumer.md (line 77)


I don't think you need to mentioned the JIRA that introduced a feature. If 
there is documentation related to security in Samza, you can link to it. You 
can link to the javadoc for SamzaContainerSecurityManager.

You can add a brief description of the feature. For example, the 
SamzaContainer fetches and renews the Kerberos delegation tokens when the job 
is running in a secure environment. User application needs to specify ..



docs/learn/documentation/versioned/hdfs/consumer.md (line 93)


Can you elaborate the "advanced partitioning" feature here  and remove the 
link for design doc? If it helps, you can just copy-and-paste the design doc 
content here and edit it :)



docs/learn/documentation/versioned/hdfs/consumer.md (line 102)


Looks like there are more configurations that are mentioned in the 
configuration table. Can you please add the link to configuration table to 
imply that?



docs/learn/documentation/versioned/hdfs/consumer.md (line 105)


You can add a link to the design doc pdf here , instead of the JIRA link.


- Navina Ramesh


On Jan. 24, 2017, 2:07 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52570/
> ---
> 
> (Updated Jan. 24, 2017, 2:07 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-1025
> https://issues.apache.org/jira/browse/SAMZA-1025
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> documentation for hdfs system consumer
> 
> 
> Diffs
> -
> 
>   docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
>   docs/learn/documentation/versioned/hdfs/producer.md 
> b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
>   docs/learn/documentation/versioned/index.html 
> d0b14ece94341e2cb937cf32db480e69f93303c2 
>   docs/learn/documentation/versioned/jobs/configuration-table.html 
> ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 
> 
> Diff: https://reviews.apache.org/r/52570/diff/
> 
> 
> Testing
> ---
> 
> N/A
> 
> 
> Thanks,
> 
> Hai Lu
> 
>



Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-25 Thread Jagadish Venkatraman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/#review163011
---




docs/learn/documentation/versioned/hdfs/consumer.md (line 22)


Can re-word docs for clarity.
1. Remove `Now` here: s/Now// - "You can configure your Samza jobs to read 
data from HDFS files". 
2. No need to mention that it's implemented in `samza-hdfs`.
3. Substitute: "Implemented in `samza-hdfs`, the `HdfsSystemConsumer` will 
read from HDFS files (`avro` records only for now. /->> "Only avro encoded 
records are supported for now."
4. "very easy to extend to plain text, csv, json, etc. in the future" -> 
How is this supported? A hint on how to do this is useful. If not, we can nuke 
this.
5. Reword: "Deliver messages to your tasks" can be left.



docs/learn/documentation/versioned/hdfs/consumer.md (line 26)


Thanks for calling this out explicitly.



docs/learn/documentation/versioned/hdfs/consumer.md (line 30)


Do you think it can be more concise:
- "The way `HDFSSystemConsumer` does partitioning is basically to treat 
each HDFS file as a partition. Using Kafka as an analogy, a HDFS directory is 
close to the concept of a Kafka topic, and all the files within the directory 
are like topic partitions in the sense of Kafka. "

Reword it to be: (something on the lines of.)
"Partitioning works at the level of individual HDFS files. Each file is 
treated as a stream partition."

- This line can be removed completely. The subsequent examples expands on 
the idea.



docs/learn/documentation/versioned/hdfs/consumer.md (line 34)


Reword:

You can configure upto 10 Samza containers to process these partitions.



docs/learn/documentation/versioned/hdfs/consumer.md (line 40)


1. Not sure we have a noun called `payload` anywhere else in the Samza docs.

2. Can be more precise in wording this.
The received IncomingMessageEnvelope contains three significant fields.
- The key which is empty.
- The message which is set to the avro GenericRecord.
- The stream partition which is set to the name of the HDFS file.

3. Please link the IncomingMessageEnvelope API.



docs/learn/documentation/versioned/hdfs/consumer.md (line 42)


1. Please link the javadocs of SingleFileHDFSReader
2. Suggest re-word:
2.1 one can implement/ you can implement (Since, the rest of the 
documentation is directed at the user)
2.2 The second line on "few methods need to be implemented" can be ignored.



docs/learn/documentation/versioned/hdfs/consumer.md (line 43)


It's not super-obvious how the implementation of the `SingleFileHDFSReader` 
ties to the underlying consumer. Am I missing something?



docs/learn/documentation/versioned/hdfs/consumer.md (line 46)


1. Not sure we want to introduce a new noun called `data stream` in the 
docs.

Suggest re-word to be precise:

While a kafka topic has an unbounded stream of messages, HDFS files are 
bounded and have a notion of EOF.



docs/learn/documentation/versioned/hdfs/consumer.md (line 48)


Reword docs to be concise:

You can choose to implement `EndOfStreamListenerTask` to receive a callback 
when all partitions are at end of stream. When all partitions being processed 
by the task are at end of stream (ie. EOF has been reached for all files), the 
Samza job exits automatically.

Remove `What happens when we finish reading all data from the HDFS files? 
The behavior is that the Samza job will shut itself down once the job is done 
with all files.`



docs/learn/documentation/versioned/hdfs/consumer.md (line 67)


The relationship between whitelist and blacklist was not very obvious to me.

Is the behavior that the whitelist is applied first, and the blacklist is 
applied to the matched files later? (to determine which files are to be 
ignored).



docs/learn/documentation/versioned/hdfs/consumer.md (line 75)


1.fix typo: required
2.For HDFS clusters that have kerberos enabled,...



docs/learn/documentation/versioned/hdfs/consumer.md (line 77)


Suggestions on docs:

1. Not needed to mention the jira.
2. Not needed to mention the class name.

The follo

Re: Review Request 52476: Do not load task store which are older than delete tombstones.

2017-01-25 Thread Prateek Maheshwari

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52476/#review163035
---




samza-core/src/main/scala/org/apache/samza/storage/TaskStorageManager.scala 
(line 97)


"Got default storage ..."



samza-core/src/main/scala/org/apache/samza/storage/TaskStorageManager.scala 
(line 137)


No space before ":" here and elsewhere, including method type annotations 
(e.g. line 169, 185).



samza-core/src/main/scala/org/apache/samza/storage/TaskStorageManager.scala 
(line 152)


We should log which directory we're using for the store here at INFO.


- Prateek Maheshwari


On Oct. 22, 2016, 3:06 p.m., Shanthoosh Venkataraman wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52476/
> ---
> 
> (Updated Oct. 22, 2016, 3:06 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Every local task store is backed up by a kafka changelog topic. Due to log 
> compaction, delete tombstones of the changelog topic have a ttl of 
> delete.retention.ms. Replaying the events from the changelog that has missing 
> delete tombstones, would result in creation of an inconsistent local 
> store(due to the missing of some delete events). This patch deletes the local 
> stores in which difference between current time and last modified time of the 
> offset file is greater than delete.retention.ms during the container startup.
> 
> 
> Diffs
> -
> 
>   samza-core/src/main/java/org/apache/samza/storage/StorageRecovery.java 
> 9329edf7d724f3a0d9235354bb77936f713e3b5f 
>   samza-core/src/main/scala/org/apache/samza/config/StorageConfig.scala 
> be3f1068f7921c9c8f8697577efea6580672813e 
>   samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala 
> 05a996c98075ea8ed3767af666b9beeb1933f2a6 
>   samza-core/src/main/scala/org/apache/samza/storage/TaskStorageManager.scala 
> 0b7bcdda1639eea8239a69c31bdf42558e9077d2 
>   
> samza-core/src/test/scala/org/apache/samza/storage/TestTaskStorageManager.scala
>  4d40f520e54beb643acd8410c772b75e2f6a9162 
>   samza-kafka/src/main/scala/org/apache/samza/config/KafkaConfig.scala 
> 973ab8cfb3d248bec7efe5e338f5e667f097556d 
> 
> Diff: https://reviews.apache.org/r/52476/diff/
> 
> 
> Testing
> ---
> 
> Unit testing and manual testing has been done to verify the functionality.
> 
> 
> Thanks,
> 
> Shanthoosh Venkataraman
> 
>



Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-25 Thread Xinyu Liu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/#review162971
---


Fix it, then Ship it!




Looks good. Have a few minor comments about adding javadoc links. Please test 
the changes locally after finalizing.


docs/learn/documentation/versioned/hdfs/consumer.md (line 22)


Please link HdfsSystemConsumer to java doc.



docs/learn/documentation/versioned/hdfs/consumer.md (line 32)


camel case HdfsSystemConsumer? link to Java doc.



docs/learn/documentation/versioned/hdfs/consumer.md (line 34)


replace 'SystemStreamPartition" with partitions?



docs/learn/documentation/versioned/hdfs/consumer.md (line 40)


java doc link to HdfsSystemConsumer, IncomingMessageEvelope and 
GenericRecord.



docs/learn/documentation/versioned/hdfs/consumer.md (line 42)


please link java doc to SingleFileHdfsReader, instead of putting the file 
name here.


- Xinyu Liu


On Jan. 24, 2017, 2:07 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52570/
> ---
> 
> (Updated Jan. 24, 2017, 2:07 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-1025
> https://issues.apache.org/jira/browse/SAMZA-1025
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> documentation for hdfs system consumer
> 
> 
> Diffs
> -
> 
>   docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
>   docs/learn/documentation/versioned/hdfs/producer.md 
> b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
>   docs/learn/documentation/versioned/index.html 
> d0b14ece94341e2cb937cf32db480e69f93303c2 
>   docs/learn/documentation/versioned/jobs/configuration-table.html 
> ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 
> 
> Diff: https://reviews.apache.org/r/52570/diff/
> 
> 
> Testing
> ---
> 
> N/A
> 
> 
> Thanks,
> 
> Hai Lu
> 
>