Re: Data consistency and check-pointing

2016-01-18 Thread Yi Pan
Hi, Michael,

Your use case sounds much like "customized checkpointing" to me. We have
similar cases at LinkedIn, and the following is the solution we run in
production:
1) disable Samza auto-checkpointing by setting the commit interval
(task.commit.ms) to -1
2) explicitly call TaskCoordinator.commit() in sync with closing the
transaction batch

The above procedure works well and gives the user the ability to control the
checkpoint commit together with the transaction batch. In case of a system
crash between the closing of the transaction batch and the checkpoint commit
(I am assuming this sequence of actions), we would follow at-least-once
semantics and replay the messages from the last commit, so that batch may be
written twice but no data is lost.
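
For illustration, here is a minimal sketch of the ordering described above:
close the transaction batch first, then commit the checkpoint. BatchSink and
Committer are hypothetical stand-ins (not Samza or Hive classes) so that the
example runs on its own. In a real StreamTask the commit step would be the
TaskCoordinator passed to process(), and in recent Samza versions commit()
takes a scope argument such as TaskCoordinator.RequestScope.CURRENT_TASK.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the sink's transaction batch (e.g. Hive Streaming). */
interface BatchSink {
    void write(String message);
    void closeBatch(); // makes everything written so far durable
}

/** Hypothetical stand-in for the single checkpoint call made on the coordinator. */
interface Committer {
    void commit();
}

/** Closes the batch first and commits the checkpoint immediately afterwards. */
class BatchedCheckpointTask {
    private final BatchSink sink;
    private final Committer committer;
    private final int maxMessagesPerBatch;
    private int pending = 0;

    BatchedCheckpointTask(BatchSink sink, Committer committer, int maxMessagesPerBatch) {
        this.sink = sink;
        this.committer = committer;
        this.maxMessagesPerBatch = maxMessagesPerBatch;
    }

    void process(String message) {
        sink.write(message);
        pending++;
        if (pending >= maxMessagesPerBatch) {
            flush();
        }
    }

    /** Could also be called from a timer to enforce a time threshold. */
    void flush() {
        if (pending == 0) {
            return;
        }
        sink.closeBatch();  // 1. data becomes durable in the sink
        committer.commit(); // 2. checkpoint; a crash between 1 and 2 means replay, not loss
        pending = 0;
    }
}

class BatchedCheckpointDemo {
    /** Records the order of sink and checkpoint actions for three messages, batch size 2. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        BatchSink sink = new BatchSink() {
            public void write(String message) { log.add("write " + message); }
            public void closeBatch() { log.add("closeBatch"); }
        };
        BatchedCheckpointTask task =
            new BatchedCheckpointTask(sink, () -> log.add("commit"), 2);
        task.process("m1");
        task.process("m2");
        task.process("m3");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The third message stays in an open batch with no checkpoint covering it, so
after a crash it would simply be read again from the last committed offset.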

Please let us know whether that satisfies your use case.

Thanks!

-Yi

On Sun, Jan 17, 2016 at 11:09 AM, Michael Sklyar wrote:

> Hi,
>
> We have a Samza job reading messages from Kafka and inserting them into
> Hive via the Hive Streaming API. With Hive Streaming we are using
> "TransactionBatch"; closing the transaction batch closes the file on HDFS.
> We close the transaction batch after reaching either a. the maximum number
> of messages per transaction batch or b. a time threshold (for example,
> every 20K messages or every 10 seconds).
>
> It works well, but if the job terminates in the middle of a
> transaction batch we will have data inconsistency in Hive, either:
>
> 1. Duplication: Data that was already inserted into Hive will be processed
> again (since the checkpoint was taken earlier than the latest message
> written to Hive).
>
> 2. Missing Data: Messages that were not committed to Hive yet will not be
> reprocessed (since the checkpoint was written after
>
> What would be the recommended method of synchronizing Hive/HDFS insertion
> with Samza checkpointing? I am thinking of overriding
> *KafkaCheckpointManager* and *KafkaCheckpointManagerFactory* to
> synchronize checkpointing with committing the data to Hive. Is that a good
> idea?
>
> Thanks in advance,
> Michael Sklyar
>


Review Request 42483: Minor doc fixes

2016-01-18 Thread Randall Britten

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42483/
---

Review request for samza.


Bugs: SAMZA-858
https://issues.apache.org/jira/browse/SAMZA-858


Repository: samza


Description
---

Minor documentation fixes


Diffs
-

  docs/learn/documentation/versioned/container/samza-container.md a7236a6 
  docs/learn/documentation/versioned/introduction/concepts.md 25ef5ee 

Diff: https://reviews.apache.org/r/42483/diff/


Testing
---

N/A


Thanks,

Randall Britten



Re: Review Request 42483: Minor doc fixes

2016-01-18 Thread Randall Britten

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42483/
---

(Updated Jan. 19, 2016, 4:16 a.m.)


Review request for samza.


Changes
---

Updated patch to include another minor doc fix.


Bugs: SAMZA-858
https://issues.apache.org/jira/browse/SAMZA-858


Repository: samza


Description
---

Minor documentation fixes


Diffs (updated)
-

  docs/learn/documentation/versioned/container/samza-container.md a7236a6 
  docs/learn/documentation/versioned/introduction/concepts.md 25ef5ee 

Diff: https://reviews.apache.org/r/42483/diff/


Testing
---

N/A


Thanks,

Randall Britten



Re: Review Request 42484: hello-samza 0.10.0 changes for CDH 5.4 distribution

2016-01-18 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42484/#review115105
---



pom.xml (line 146)


Samza 0.10.0 is officially released. Hence, this change should be made in 
the hello-samza master branch, not latest. The latest branch is used to track 
and stay in sync with Samza trunk, which is the under-development branch.


- Yi Pan (Data Infrastructure)


On Jan. 19, 2016, 4:16 a.m., Yi Pan (Data Infrastructure) wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42484/
> ---
> 
> (Updated Jan. 19, 2016, 4:16 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-851
> https://issues.apache.org/jira/browse/SAMZA-851
> 
> 
> Repository: samza-hello-samza
> 
> 
> Description
> ---
> 
> hello-samza 0.10.0 changes for CDH 5.4 distribution
> 
> 
> Diffs
> -
> 
>   pom.xml c1d552f 
> 
> Diff: https://reviews.apache.org/r/42484/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Yi Pan (Data Infrastructure)
> 
>



Re: Review Request 42484: hello-samza 0.10.0 changes for CDH 5.4 distribution

2016-01-18 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42484/#review115107
---



pom.xml (line 289)


We have observed an issue with the YARN 2.6.0 AM client, which would not 
refresh the token (YARN-3103). Hence, we have updated the minimum required YARN 
version for Samza 0.10.0 to YARN 2.6.0. Does CDH 5.4 not have this issue?


- Yi Pan (Data Infrastructure)


On Jan. 19, 2016, 4:16 a.m., Yi Pan (Data Infrastructure) wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42484/
> ---
> 
> (Updated Jan. 19, 2016, 4:16 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-851
> https://issues.apache.org/jira/browse/SAMZA-851
> 
> 
> Repository: samza-hello-samza
> 
> 
> Description
> ---
> 
> hello-samza 0.10.0 changes for CDH 5.4 distribution
> 
> 
> Diffs
> -
> 
>   pom.xml c1d552f 
> 
> Diff: https://reviews.apache.org/r/42484/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Yi Pan (Data Infrastructure)
> 
>



Review Request 42485: SAMZA-857: Missing break in RocksDbOptionsHelper#options()

2016-01-18 Thread Tao Feng

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42485/
---

Review request for samza.


Repository: samza


Description
---

SAMZA-857: Missing break in RocksDbOptionsHelper#options()
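
For readers unfamiliar with this class of bug, here is a generic sketch of
what a missing break does in a Java switch. It is a hypothetical example, not
the actual RocksDbOptionsHelper code: the option names and return values are
made up for illustration.

```java
class MissingBreakDemo {
    /** Buggy: the "snappy" case has no break and falls through into "none". */
    static String buggy(String codec) {
        String result = "DEFAULT";
        switch (codec) {
            case "snappy":
                result = "SNAPPY";
                // missing break: execution continues into the next case
            case "none":
                result = "NONE";
                break;
            default:
                break;
        }
        return result;
    }

    /** Fixed: every case ends with a break, so only the matching case runs. */
    static String fixed(String codec) {
        String result = "DEFAULT";
        switch (codec) {
            case "snappy":
                result = "SNAPPY";
                break;
            case "none":
                result = "NONE";
                break;
            default:
                break;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("buggy(snappy) = " + buggy("snappy"));
        System.out.println("fixed(snappy) = " + fixed("snappy"));
    }
}
```

In the buggy variant the value assigned in the first case is silently
overwritten by the case that follows it.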


Diffs
-

  samza-kv-rocksdb/src/main/java/org/apache/samza/storage/kv/RocksDbOptionsHelper.java e474231f48c2d9ba0c9a73291afcc19b52ce8da1 

Diff: https://reviews.apache.org/r/42485/diff/


Testing
---


Thanks,

Tao Feng