-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20022/#review39588
-----------------------------------------------------------



samza-core/src/main/scala/org/apache/samza/system/SystemConsumers.scala
<https://reviews.apache.org/r/20022/#comment72019>

    Is it worth exposing this property as configurable in the job properties? 
Since SystemConsumers tries to actually keep this number of messages per 
partition in memory, a job that uses large messages or many partitions may find 
itself using a lot of memory. For example, if messages are 100kB each, this 
default will use up to 1 GB per partition.
    
    Or perhaps we can find some way of auto-tuning this parameter based on the 
number of partitions and the average message size, so as to use a fixed amount 
of memory per container for unprocessed messages.



samza-core/src/main/scala/org/apache/samza/system/SystemConsumers.scala
<https://reviews.apache.org/r/20022/#comment72024>

    I would find the documentation more helpful if it didn't describe what the 
code does to this variable (I can find that out by reading the code itself), 
but rather *why* it exists and what problem it solves.



samza-core/src/main/scala/org/apache/samza/system/SystemConsumers.scala
<https://reviews.apache.org/r/20022/#comment72020>

    Not clear whether you mean that maxMsgsPerStreamPartition defaults to 1000, 
or refreshThreshold defaults to 1000.



samza-core/src/main/scala/org/apache/samza/system/SystemConsumers.scala
<https://reviews.apache.org/r/20022/#comment72021>

    At first I didn't understand how this was different from fetchThresholdPct 
(had to read the code to understand that fetchThresholdPct was per 
topic-partition and refreshThreshold was for total unprocessed messages). I 
wonder if this could be made clearer somehow, or even folded into a single 
parameter.



samza-core/src/main/scala/org/apache/samza/system/SystemConsumers.scala
<https://reviews.apache.org/r/20022/#comment72022>

    I think this is a slight change of semantics. The previous computation 
returned the number of messages that had not yet been given to the chooser, 
whereas totalUnprocessedMessages is only decremented after the chooser has 
chosen the message. Don't think that's a problem, just wanted to point it out 
in case it wasn't deliberate.



samza-core/src/main/scala/org/apache/samza/system/SystemConsumers.scala
<https://reviews.apache.org/r/20022/#comment72029>

    I have difficulty convincing myself that this logic is correct -- is it 
always the case that the partition is neededByChooser in this case? It's 
probably correct, but it's very subtle logic that's easy to get wrong. Would it 
be possible to express this in a way that's easier to reason about?
    
    For example, mapping each systemStreamPartition to one of enum { 
IN_CHOOSER, NEEDED_BY_CHOOSER, SKIPPING_CHOOSER } would make clear that each 
partition is in exactly one of those three states at any one time.


- Martin Kleppmann


On April 4, 2014, 1:09 a.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20022/
> -----------------------------------------------------------
> 
> (Updated April 4, 2014, 1:09 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> don't hard code the refresh threshold
> 
> 
> bump default max msgs per stream partition up to 10k
> 
> 
> add an unprocessed message counter
> 
> 
> black list empty system stream partitions
> 
> 
> Diffs
> -----
> 
>   samza-core/src/main/scala/org/apache/samza/system/SystemConsumers.scala 
> bbbacb59866c2374853052f7cc11826552f5fb01 
> 
> Diff: https://reviews.apache.org/r/20022/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>

Reply via email to