Re: Thoughts and obesrvations on Samza

2015-06-30 Thread Jay Kreps
Looks like gmail mangled the code example, it was supposed to look like
this:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:4242");
StreamingConfig config = new StreamingConfig(props);
config.subscribe("test-topic-1", "test-topic-2");
config.processor(ExampleStreamProcessor.class);
config.serialization(new StringSerializer(), new StringDeserializer());
KafkaStreaming container = new KafkaStreaming(config);
container.run();

-Jay

On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps  wrote:

> Hey guys,
>
> This came out of some conversations Chris and I were having around whether
> it would make sense to use Samza as a kind of data ingestion framework for
> Kafka (which ultimately lead to KIP-26 "copycat"). This kind of combined
> with complaints around config and YARN and the discussion around how to
> best do a standalone mode.
>
> So the thought experiment was, given that Samza was basically already
> totally Kafka specific, what if you just embraced that and turned it into
> something less like a heavyweight framework and more like a third Kafka
> client--a kind of "producing consumer" with state management facilities.
> Basically a library. Instead of a complex stream processing framework this
> would actually be a very simple thing, not much more complicated to use or
> operate than a Kafka consumer. As Chris said we thought about it a lot of
> what Samza (and the other stream processing systems were doing) seemed like
> kind of a hangover from MapReduce.
>
> Of course you need to ingest/output data to and from the stream
> processing. But when we actually looked into how that would work, Samza
> isn't really an ideal data ingestion framework for a bunch of reasons. To
> really do that right you need a pretty different internal data model and
> set of apis. So what if you split them and had an api for Kafka
> ingress/egress (copycat AKA KIP-26) and a separate api for Kafka
> transformation (Samza).
>
> This would also allow really embracing the same terminology and
> conventions. One complaint about the current state is that the two systems
> kind of feel bolted on. Terminology like "stream" vs "topic" and different
> config and monitoring systems means you kind of have to learn Kafka's way,
> then learn Samza's slightly different way, then kind of understand how they
> map to each other, which having walked a few people through this is
> surprisingly tricky for folks to get.
>
> Since I have been spending a lot of time on airplanes I hacked up an
> ernest but still somewhat incomplete prototype of what this would look
> like. This is just unceremoniously dumped into Kafka as it required a few
> changes to the new consumer. Here is the code:
>
> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org/apache/kafka/clients/streaming
>
> For the purpose of the prototype I just liberally renamed everything to
> try to align it with Kafka with no regard for compatibility.
>
> To use this would be something like this:
> Properties props = new Properties(); props.put("bootstrap.servers",
> "localhost:4242"); StreamingConfig config = new StreamingConfig(props); 
> config.subscribe("test-topic-1",
> "test-topic-2"); config.processor(ExampleStreamProcessor.class); 
> config.serialization(new
> StringSerializer(), new StringDeserializer()); KafkaStreaming container =
> new KafkaStreaming(config); container.run();
>
> KafkaStreaming is basically the SamzaContainer; StreamProcessor is
> basically StreamTask.
>
> So rather than putting all the class names in a file and then having the
> job assembled by reflection, you just instantiate the container
> programmatically. Work is balanced over however many instances of this are
> alive at any time (i.e. if an instance dies, new tasks are added to the
> existing containers without shutting them down).
>
> We would provide some glue for running this stuff in YARN via Slider,
> Mesos via Marathon, and AWS using some of their tools but from the point of
> view of these frameworks these stream processing jobs are just stateless
> services that can come and go and expand and contract at will. There is no
> more custom scheduler.
>
> Here are some relevant details:
>
>1. It is only ~1300 lines of code, it would get larger if we
>productionized but not vastly larger. We really do get a ton of leverage
>out of Kafka.
>2. Partition management is fully delegated to the new consumer. This
>is nice since now any partition management strategy available to Kafka
>consumer is also available to Samza (and vice versa) and with the exact
>same configs.
>3. It supports state as well as state reuse
>
> Anyhow take a look, hopefully it is thought provoking.
>
> -Jay
>
>
>
> On Tue, Jun 30, 2015 at 6:55 PM, Chris Riccomini 
> wrote:
>
>> Hey all,
>>
>> I have had some discussions with Samza engineers at LinkedIn and Confluent
>> and we came up with a few observations and would like to propose some
>> changes.
>>
>> We'

Re: Thoughts and obesrvations on Samza

2015-06-30 Thread Jay Kreps
Hey guys,

This came out of some conversations Chris and I were having around whether
it would make sense to use Samza as a kind of data ingestion framework for
Kafka (which ultimately lead to KIP-26 "copycat"). This kind of combined
with complaints around config and YARN and the discussion around how to
best do a standalone mode.

So the thought experiment was, given that Samza was basically already
totally Kafka specific, what if you just embraced that and turned it into
something less like a heavyweight framework and more like a third Kafka
client--a kind of "producing consumer" with state management facilities.
Basically a library. Instead of a complex stream processing framework this
would actually be a very simple thing, not much more complicated to use or
operate than a Kafka consumer. As Chris said we thought about it a lot of
what Samza (and the other stream processing systems were doing) seemed like
kind of a hangover from MapReduce.

Of course you need to ingest/output data to and from the stream processing.
But when we actually looked into how that would work, Samza isn't really an
ideal data ingestion framework for a bunch of reasons. To really do that
right you need a pretty different internal data model and set of apis. So
what if you split them and had an api for Kafka ingress/egress (copycat AKA
KIP-26) and a separate api for Kafka transformation (Samza).

This would also allow really embracing the same terminology and
conventions. One complaint about the current state is that the two systems
kind of feel bolted on. Terminology like "stream" vs "topic" and different
config and monitoring systems means you kind of have to learn Kafka's way,
then learn Samza's slightly different way, then kind of understand how they
map to each other, which having walked a few people through this is
surprisingly tricky for folks to get.

Since I have been spending a lot of time on airplanes I hacked up an ernest
but still somewhat incomplete prototype of what this would look like. This
is just unceremoniously dumped into Kafka as it required a few changes to
the new consumer. Here is the code:
https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org/apache/kafka/clients/streaming

For the purpose of the prototype I just liberally renamed everything to try
to align it with Kafka with no regard for compatibility.

To use this would be something like this:
Properties props = new Properties(); props.put("bootstrap.servers",
"localhost:4242"); StreamingConfig config = new
StreamingConfig(props); config.subscribe("test-topic-1",
"test-topic-2"); config.processor(ExampleStreamProcessor.class);
config.serialization(new
StringSerializer(), new StringDeserializer()); KafkaStreaming container =
new KafkaStreaming(config); container.run();

KafkaStreaming is basically the SamzaContainer; StreamProcessor is
basically StreamTask.

So rather than putting all the class names in a file and then having the
job assembled by reflection, you just instantiate the container
programmatically. Work is balanced over however many instances of this are
alive at any time (i.e. if an instance dies, new tasks are added to the
existing containers without shutting them down).

We would provide some glue for running this stuff in YARN via Slider, Mesos
via Marathon, and AWS using some of their tools but from the point of view
of these frameworks these stream processing jobs are just stateless
services that can come and go and expand and contract at will. There is no
more custom scheduler.

Here are some relevant details:

   1. It is only ~1300 lines of code, it would get larger if we
   productionized but not vastly larger. We really do get a ton of leverage
   out of Kafka.
   2. Partition management is fully delegated to the new consumer. This is
   nice since now any partition management strategy available to Kafka
   consumer is also available to Samza (and vice versa) and with the exact
   same configs.
   3. It supports state as well as state reuse

Anyhow take a look, hopefully it is thought provoking.

-Jay



On Tue, Jun 30, 2015 at 6:55 PM, Chris Riccomini 
wrote:

> Hey all,
>
> I have had some discussions with Samza engineers at LinkedIn and Confluent
> and we came up with a few observations and would like to propose some
> changes.
>
> We've observed some things that I want to call out about Samza's design,
> and I'd like to propose some changes.
>
> * Samza is dependent upon a dynamic deployment system.
> * Samza is too pluggable.
> * Samza's SystemConsumer/SystemProducer and Kafka's consumer APIs are
> trying to solve a lot of the same problems.
>
> All three of these issues are related, but I'll address them in order.
>
> Deployment
>
> Samza strongly depends on the use of a dynamic deployment scheduler such as
> YARN, Mesos, etc. When we initially built Samza, we bet that there would be
> one or two winners in this area, and we could support them, and the rest
> would go away. In reality, there are many variat

Re: [VOTE] Apache Samza 0.9.1 RC1

2015-06-30 Thread Chris Riccomini
+1

Verified MD5, and asc signature. Build locally, and all tests pass.

Cheers,
Chris

On Tue, Jun 30, 2015 at 1:26 PM, Milinda Pathirage 
wrote:

> +1 (non-binding)
>
> Verified signature. Tested locally using ./bin/check-all.sh.
>
> Thanks
> Milinda
>
> On Tue, Jun 30, 2015 at 2:10 AM, Yan Fang  wrote:
>
> > +1
> >
> > Verified MD5, Signature.
> >
> > Tested locally.
> >
> > Thanks,
> >
> > Fang, Yan
> > yanfang...@gmail.com
> >
> > On Sun, Jun 28, 2015 at 12:31 PM, Yi Pan  wrote:
> >
> > > Hey all,
> > >
> > > This is a call for a vote on a release of Apache Samza 0.9.1. This is a
> > > bug-fix release against 0.9.0.
> > >
> > > The release candidate can be downloaded from here:
> > >
> > > http://people.apache.org/~nickpan47/samza-0.9.1-rc1/
> > >
> > > The release candidate is signed with pgp key 911402D8, which is
> > > included in the repository's KEYS file:
> > >
> > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=samza.git;a=blob_plain;f=KEYS;hb=6f5bafb6cd93934781161eb6b1868d11ea347c95
> > >
> > > and can also be found on keyservers:
> > >
> > > http://pgp.mit.edu/pks/lookup?op=get&search=0x911402D8
> > >
> > > The git tag is release-0.9.1-rc1 and signed with the same pgp key:
> > >
> > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=samza.git;a=tag;h=e78b9e7f34650538b4bb68b338eb472b98a5709e
> > >
> > > Test binaries have been published to Maven's staging repository, and
> are
> > > available here:
> > >
> https://repository.apache.org/content/repositories/orgapachesamza-1007/
> > >
> > > Note release 0.9.1 is still supporting JDK6 and the binaries were built
> > > with JDK6 without incident.
> > >
> > > 6 critical bugs were resolved for this release:
> > >
> > >
> > >
> >
> https://issues.apache.org/jira/browse/SAMZA-715?jql=project%20%3D%20SAMZA%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20%28Resolved%2C%20Closed%29
> > >
> > > The vote will be open for 72 hours ( end in 12:00pm Wed, 07/01/2015 ).
> > > Please download the release candidate, check the hashes/signature,
> build
> > it
> > > and test it, and then please vote:
> > >
> > > [ ] +1 approve
> > > [ ] +0 no opinion
> > > [ ] -1 disapprove (and reason why)
> > >
> >
>
>
>
> --
> Milinda Pathirage
>
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
>
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org
>


Thoughts and obesrvations on Samza

2015-06-30 Thread Chris Riccomini
Hey all,

I have had some discussions with Samza engineers at LinkedIn and Confluent
and we came up with a few observations and would like to propose some
changes.

We've observed some things that I want to call out about Samza's design,
and I'd like to propose some changes.

* Samza is dependent upon a dynamic deployment system.
* Samza is too pluggable.
* Samza's SystemConsumer/SystemProducer and Kafka's consumer APIs are
trying to solve a lot of the same problems.

All three of these issues are related, but I'll address them in order.

Deployment

Samza strongly depends on the use of a dynamic deployment scheduler such as
YARN, Mesos, etc. When we initially built Samza, we bet that there would be
one or two winners in this area, and we could support them, and the rest
would go away. In reality, there are many variations. Furthermore, many
people still prefer to just start their processors like normal Java
processes, and use traditional deployment scripts such as Fabric, Chef,
Ansible, etc. Forcing a deployment system on users makes the Samza start-up
process really painful for first time users.

Dynamic deployment as a requirement was also a bit of a mis-fire because of
a fundamental misunderstanding between the nature of batch jobs and stream
processing jobs. Early on, we made conscious effort to favor the Hadoop
(Map/Reduce) way of doing things, since it worked and was well understood.
One thing that we missed was that batch jobs have a definite beginning, and
end, and stream processing jobs don't (usually). This leads to a much
simpler scheduling problem for stream processors. You basically just need
to find a place to start the processor, and start it. The way we run grids,
at LinkedIn, there's no concept of a cluster being "full". We always add
more machines. The problem with coupling Samza with a scheduler is that
Samza (as a framework) now has to handle deployment. This pulls in a bunch
of things such as configuration distribution (config stream), shell scrips
(bin/run-job.sh, JobRunner), packaging (all the .tgz stuff), etc.

Another reason for requiring dynamic deployment was to support data
locality. If you want to have locality, you need to put your processors
close to the data they're processing. Upon further investigation, though,
this feature is not that beneficial. There is some good discussion about
some problems with it on SAMZA-335. Again, we took the Map/Reduce path, but
there are some fundamental differences between HDFS and Kafka. HDFS has
blocks, while Kafka has partitions. This leads to less optimization
potential with stream processors on top of Kafka.

This feature is also used as a crutch. Samza doesn't have any built in
fault-tolerance logic. Instead, it depends on the dynamic deployment
scheduling system to handle restarts when a processor dies. This has made
it very difficult to write a standalone Samza container (SAMZA-516).

Pluggability

In some cases pluggability is good, but I think that we've gone too far
with it. Currently, Samza has:

* Pluggable config.
* Pluggable metrics.
* Pluggable deployment systems.
* Pluggable streaming systems (SystemConsumer, SystemProducer, etc).
* Pluggable serdes.
* Pluggable storage engines.
* Pluggable strategies for just about every component (MessageChooser,
SystemStreamPartitionGrouper, ConfigRewriter, etc).

There's probably more that I've forgotten, as well. Some of these are
useful, but some have proven not to be. This all comes at a cost:
complexity. This complexity is making it harder for our users to pick up
and use Samza out of the box. It also makes it difficult for Samza
developers to reason about what the characteristics of the container (since
the characteristics change depending on which plugins are use).

The issues with pluggability are most visible in the System APIs. What
Samza really requires to be functional is Kafka as its transport layer. But
we've conflated two unrelated use cases into one API:

1. Get data into/out of Kafka.
2. Process the data in Kafka.

The current System API supports both of these use cases. The problem is, we
actually want different features for each use case. By papering over these
two use cases, and providing a single API, we've introduced a ton of leaky
abstractions.

For example, what we'd really like in (2) is to have monotonically
increasing longs for offsets (like Kafka). This would be at odds with (1),
though, since different systems have different SCNs/Offsets/UUIDs/vectors.
There was discussion both on the mailing list and the SQL JIRAs about the
need for this.

The same thing holds true for replayability. Kafka allows us to rewind when
we have a failure. Many other systems don't. In some cases, systems return
null for their offsets (e.g. WikipediaSystemConsumer) because they have no
offsets.

Partitioning is another example. Kafka supports partitioning, but many
systems don't. We model this by having a single partition for those
systems. Still, other systems model partitioning dif

Re: [VOTE] Apache Samza 0.9.1 RC1

2015-06-30 Thread Milinda Pathirage
+1 (non-binding)

Verified signature. Tested locally using ./bin/check-all.sh.

Thanks
Milinda

On Tue, Jun 30, 2015 at 2:10 AM, Yan Fang  wrote:

> +1
>
> Verified MD5, Signature.
>
> Tested locally.
>
> Thanks,
>
> Fang, Yan
> yanfang...@gmail.com
>
> On Sun, Jun 28, 2015 at 12:31 PM, Yi Pan  wrote:
>
> > Hey all,
> >
> > This is a call for a vote on a release of Apache Samza 0.9.1. This is a
> > bug-fix release against 0.9.0.
> >
> > The release candidate can be downloaded from here:
> >
> > http://people.apache.org/~nickpan47/samza-0.9.1-rc1/
> >
> > The release candidate is signed with pgp key 911402D8, which is
> > included in the repository's KEYS file:
> >
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=samza.git;a=blob_plain;f=KEYS;hb=6f5bafb6cd93934781161eb6b1868d11ea347c95
> >
> > and can also be found on keyservers:
> >
> > http://pgp.mit.edu/pks/lookup?op=get&search=0x911402D8
> >
> > The git tag is release-0.9.1-rc1 and signed with the same pgp key:
> >
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=samza.git;a=tag;h=e78b9e7f34650538b4bb68b338eb472b98a5709e
> >
> > Test binaries have been published to Maven's staging repository, and are
> > available here:
> > https://repository.apache.org/content/repositories/orgapachesamza-1007/
> >
> > Note release 0.9.1 is still supporting JDK6 and the binaries were built
> > with JDK6 without incident.
> >
> > 6 critical bugs were resolved for this release:
> >
> >
> >
> https://issues.apache.org/jira/browse/SAMZA-715?jql=project%20%3D%20SAMZA%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20%28Resolved%2C%20Closed%29
> >
> > The vote will be open for 72 hours ( end in 12:00pm Wed, 07/01/2015 ).
> > Please download the release candidate, check the hashes/signature, build
> it
> > and test it, and then please vote:
> >
> > [ ] +1 approve
> > [ ] +0 no opinion
> > [ ] -1 disapprove (and reason why)
> >
>



-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org


Re: Review Request 35397: Fix Samza-697

2015-06-30 Thread Guozhang Wang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35397/
---

(Updated June 30, 2015, 6:29 p.m.)


Review request for samza.


Summary (updated)
-

Fix Samza-697


Bugs: SAMZA-697
https://issues.apache.org/jira/browse/SAMZA-697


Repository: samza


Description
---

Address Boris and Yi's comments


Diffs
-

  checkstyle/import-control.xml 3374f0c432e61ac4cda275377604cfd481f0cddf 
  docs/learn/documentation/versioned/jobs/configuration-table.html 
405e2cea4fd1d037cc26b3537f6bb406eded202b 
  samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 
0b3a235b5ab1d6bd60669bfe6023f6b0b4e943d3 
  samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala 
cbacd183420e9d1d72b05693b55a8f0a62d59fc5 
  samza-core/src/main/scala/org/apache/samza/container/TaskInstance.scala 
c5a5ea5dea9a950fc741625238f5bf8b1f362180 
  samza-core/src/main/scala/org/apache/samza/job/JobRunner.scala 
1c178a661e449c6bdfc4ce431aef9bb2d261a6c2 
  samza-core/src/main/scala/org/apache/samza/job/local/ProcessJobFactory.scala 
4fac154709d72ab594485dad93c912b55fb1617e 
  samza-core/src/test/scala/org/apache/samza/container/TestSamzaContainer.scala 
9fb1aa98fcd14397e8a4cb00c67537482e95fa53 
  samza-core/src/test/scala/org/apache/samza/container/TestTaskInstance.scala 
7caad28c9298485753ab861da76793cf925953ed 

Diff: https://reviews.apache.org/r/35397/diff/


Testing
---

unit tests


Thanks,

Guozhang Wang



Re: Review Request 35397: v1

2015-06-30 Thread Guozhang Wang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35397/
---

(Updated June 30, 2015, 6:28 p.m.)


Review request for samza.


Summary (updated)
-

v1


Bugs: SAMZA-697
https://issues.apache.org/jira/browse/SAMZA-697


Repository: samza


Description (updated)
---

Address Boris and Yi's comments


Diffs (updated)
-

  checkstyle/import-control.xml 3374f0c432e61ac4cda275377604cfd481f0cddf 
  docs/learn/documentation/versioned/jobs/configuration-table.html 
405e2cea4fd1d037cc26b3537f6bb406eded202b 
  samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 
0b3a235b5ab1d6bd60669bfe6023f6b0b4e943d3 
  samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala 
cbacd183420e9d1d72b05693b55a8f0a62d59fc5 
  samza-core/src/main/scala/org/apache/samza/container/TaskInstance.scala 
c5a5ea5dea9a950fc741625238f5bf8b1f362180 
  samza-core/src/main/scala/org/apache/samza/job/JobRunner.scala 
1c178a661e449c6bdfc4ce431aef9bb2d261a6c2 
  samza-core/src/main/scala/org/apache/samza/job/local/ProcessJobFactory.scala 
4fac154709d72ab594485dad93c912b55fb1617e 
  samza-core/src/test/scala/org/apache/samza/container/TestSamzaContainer.scala 
9fb1aa98fcd14397e8a4cb00c67537482e95fa53 
  samza-core/src/test/scala/org/apache/samza/container/TestTaskInstance.scala 
7caad28c9298485753ab861da76793cf925953ed 

Diff: https://reviews.apache.org/r/35397/diff/


Testing
---

unit tests


Thanks,

Guozhang Wang



Re: Review Request 35397: Fix SAMZA-697

2015-06-30 Thread Guozhang Wang


> On June 19, 2015, 8:59 a.m., Yi Pan (Data Infrastructure) wrote:
> > docs/learn/documentation/versioned/jobs/configuration-table.html, line 443
> > 
> >
> > Users who extends and implements StreamTask usually have their 
> > implementation classes put in the same package as org.apache.samza.task. 
> > Wouldn't this default blacklist also ignore the user-implemented StreamTask 
> > classes?

Good point. I would expect though that if we cleanup the packaging in the 
future as suggested in building the import checklist, we will still blacklist 
"not samza.api." libraries.


> On June 19, 2015, 8:59 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-core/src/main/java/org/apache/samza/task/TaskClassLoader.java, line 
> > 156
> > 
> >
> > Question: does this findLoadedClass(name) return true if the same 
> > binary name of a class has been loaded by another classloader? Can we add a 
> > test here to make sure?

findLoadedClass() does not delegate to parent, it "returns the class with the 
given the binary name if this loader has been recorded by the Java virtual 
machine as an initiating loader of a class" as quoted in java doc.


> On June 19, 2015, 8:59 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-core/src/main/java/org/apache/samza/task/TaskClassLoader.java, line 71
> > 
> >
> > nit: for unit test, changing it to package default should work.

Not sure I understand clearly, could you elaborate how changing to package 
default can help accessing a private class function?


- Guozhang


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35397/#review88236
---


On June 18, 2015, 6:42 p.m., Guozhang Wang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35397/
> ---
> 
> (Updated June 18, 2015, 6:42 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-697
> https://issues.apache.org/jira/browse/SAMZA-697
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Address Yan's comments
> 
> 
> Diffs
> -
> 
>   checkstyle/import-control.xml 3374f0c432e61ac4cda275377604cfd481f0cddf 
>   docs/learn/documentation/versioned/jobs/configuration-table.html 
> 405e2cea4fd1d037cc26b3537f6bb406eded202b 
>   samza-core/src/main/java/org/apache/samza/task/TaskClassLoader.java 
> PRE-CREATION 
>   samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 
> 0b3a235b5ab1d6bd60669bfe6023f6b0b4e943d3 
>   samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala 
> cbacd183420e9d1d72b05693b55a8f0a62d59fc5 
>   samza-core/src/main/scala/org/apache/samza/container/TaskInstance.scala 
> c5a5ea5dea9a950fc741625238f5bf8b1f362180 
>   samza-core/src/main/scala/org/apache/samza/job/JobRunner.scala 
> 1c178a661e449c6bdfc4ce431aef9bb2d261a6c2 
>   
> samza-core/src/main/scala/org/apache/samza/job/local/ProcessJobFactory.scala 
> 4fac154709d72ab594485dad93c912b55fb1617e 
>   samza-core/src/test/java/org/apache/samza/task/TestTaskClassLoader.java 
> PRE-CREATION 
>   
> samza-core/src/test/scala/org/apache/samza/container/TestSamzaContainer.scala 
> 9fb1aa98fcd14397e8a4cb00c67537482e95fa53 
>   samza-core/src/test/scala/org/apache/samza/container/TestTaskInstance.scala 
> 7caad28c9298485753ab861da76793cf925953ed 
> 
> Diff: https://reviews.apache.org/r/35397/diff/
> 
> 
> Testing
> ---
> 
> unit tests
> 
> 
> Thanks,
> 
> Guozhang Wang
> 
>



Re: Review Request 35933: SAMZA-449 Expose RocksDB statistic

2015-06-30 Thread Gustavo Anatoly F . V . Solís


> On Junho 30, 2015, 3:25 p.m., Yan Fang wrote:
> > samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbStatistic.scala,
> >  line 25
> > 
> >
> > It's better to use the name RocksDbStatisticMetrics to indicate it is a 
> > metrics.

I going to rename the class. Thanks.


> On Junho 30, 2015, 3:25 p.m., Yan Fang wrote:
> > samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbKeyValueStore.scala,
> >  line 150
> > 
> >
> > it should accept the "metrics" (from line 143) as the third parameter.

The third parameter (metrics) will be added.


> On Junho 30, 2015, 3:25 p.m., Yan Fang wrote:
> > samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbKeyValueStore.scala,
> >  line 243
> > 
> >
> > add some doc here, saying something like, calling this method will 
> > expose RocksDB statistic to a metric, etc.

The doc will be written


> On Junho 30, 2015, 3:25 p.m., Yan Fang wrote:
> > samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbStatistic.scala,
> >  line 40
> > 
> >
> > can we use the lower case for the all names of metrics? It's better to 
> > keep the name convension consistent in the Samza.

ok, this is will be changed


- Gustavo Anatoly


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35933/#review89902
---


On Junho 26, 2015, 5:56 p.m., Gustavo Anatoly F. V. Solís wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35933/
> ---
> 
> (Updated Junho 26, 2015, 5:56 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> RocksDB statistic
> 
> 
> Diffs
> -
> 
>   
> samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbKeyValueStore.scala
>  a423f7bd6c43461e051b5fd1f880dd01db785991 
>   
> samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbStatistic.scala
>  PRE-CREATION 
>   
> samza-kv-rocksdb/src/test/scala/org/apache/samza/storage/kv/TestRocksDbKeyValueStore.scala
>  a428a16bc1e9ab4980a6f17db4fd810057d31136 
> 
> Diff: https://reviews.apache.org/r/35933/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Gustavo Anatoly F. V. Solís
> 
>



Re: Review Request 35397: Fix SAMZA-697

2015-06-30 Thread Guozhang Wang


> On June 18, 2015, 11:58 p.m., Boris Shkolnik wrote:
> > samza-core/src/main/java/org/apache/samza/task/TaskClassLoader.java, line 50
> > 
> >
> > Do we have test for this case?

We check this case in TestTaskClassLoader.testConstructUrlsFromClasspath


> On June 18, 2015, 11:58 p.m., Boris Shkolnik wrote:
> > samza-core/src/main/java/org/apache/samza/task/TaskClassLoader.java, line 72
> > 
> >
> > nit. may be if (blacklistClassnames == null) return false.

Wondering what is the difference for these two?


> On June 18, 2015, 11:58 p.m., Boris Shkolnik wrote:
> > samza-core/src/main/java/org/apache/samza/task/TaskClassLoader.java, line 
> > 140
> > 
> >
> > else log an error?

If getResource returns null an exception will usually be thrown.


- Guozhang


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35397/#review88455
---


On June 18, 2015, 6:42 p.m., Guozhang Wang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35397/
> ---
> 
> (Updated June 18, 2015, 6:42 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-697
> https://issues.apache.org/jira/browse/SAMZA-697
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Address Yan's comments
> 
> 
> Diffs
> -
> 
>   checkstyle/import-control.xml 3374f0c432e61ac4cda275377604cfd481f0cddf 
>   docs/learn/documentation/versioned/jobs/configuration-table.html 
> 405e2cea4fd1d037cc26b3537f6bb406eded202b 
>   samza-core/src/main/java/org/apache/samza/task/TaskClassLoader.java 
> PRE-CREATION 
>   samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 
> 0b3a235b5ab1d6bd60669bfe6023f6b0b4e943d3 
>   samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala 
> cbacd183420e9d1d72b05693b55a8f0a62d59fc5 
>   samza-core/src/main/scala/org/apache/samza/container/TaskInstance.scala 
> c5a5ea5dea9a950fc741625238f5bf8b1f362180 
>   samza-core/src/main/scala/org/apache/samza/job/JobRunner.scala 
> 1c178a661e449c6bdfc4ce431aef9bb2d261a6c2 
>   
> samza-core/src/main/scala/org/apache/samza/job/local/ProcessJobFactory.scala 
> 4fac154709d72ab594485dad93c912b55fb1617e 
>   samza-core/src/test/java/org/apache/samza/task/TestTaskClassLoader.java 
> PRE-CREATION 
>   
> samza-core/src/test/scala/org/apache/samza/container/TestSamzaContainer.scala 
> 9fb1aa98fcd14397e8a4cb00c67537482e95fa53 
>   samza-core/src/test/scala/org/apache/samza/container/TestTaskInstance.scala 
> 7caad28c9298485753ab861da76793cf925953ed 
> 
> Diff: https://reviews.apache.org/r/35397/diff/
> 
> 
> Testing
> ---
> 
> unit tests
> 
> 
> Thanks,
> 
> Guozhang Wang
> 
>



Re: Review Request 35397: Fix SAMZA-697

2015-06-30 Thread Guozhang Wang


> On June 17, 2015, 9:03 p.m., Yan Fang wrote:
> > samza-core/src/main/scala/org/apache/samza/container/TaskInstance.scala, 
> > lines 103-111
> > 
> >
> > a little concernted about this. This means we will load the class every 
> > time a message comes. Is this too much? 
> > 
> > My suggestion is to put this code in RunLoop.runs, before the loop 
> > starts. What do you think?
> 
> Guozhang Wang wrote:
> A class will only be loaded once, so I think the overhead should be 
> minimal.
> 
> My concern for putting this code in RunLoop is that some of the logic in 
> RunLoop like consumerMultiplexer.choose are platform code and any of its 
> referenced classes should not be loaded by the task class loader.
> 
> Yan Fang wrote:
> Oh, I guess my concern is more about the time spent in 
> "getContextClassLoader", "setContextClassLoader", etc. They actually are 
> useless after loading the class but called every time a message comes. Maybe 
> we want to optimize it a little.
> 
> Another question is, why do we need to have taskClassLoader here if we 
> already use it to instantiate the XXXStreamTask, which should reslove all the 
> dependencies in the XXXStreamTask? Guess I have some misunderstanding in the 
> classLoader.

After a second thought, there seems no need to set / get context classloader 
any more since we already explicitly use the task class loader for loading the 
enclosing class XXXTask. Will change that.


- Guozhang


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35397/#review88260
---


On June 18, 2015, 6:42 p.m., Guozhang Wang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35397/
> ---
> 
> (Updated June 18, 2015, 6:42 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-697
> https://issues.apache.org/jira/browse/SAMZA-697
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Address Yan's comments
> 
> 
> Diffs
> -
> 
>   checkstyle/import-control.xml 3374f0c432e61ac4cda275377604cfd481f0cddf 
>   docs/learn/documentation/versioned/jobs/configuration-table.html 
> 405e2cea4fd1d037cc26b3537f6bb406eded202b 
>   samza-core/src/main/java/org/apache/samza/task/TaskClassLoader.java 
> PRE-CREATION 
>   samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 
> 0b3a235b5ab1d6bd60669bfe6023f6b0b4e943d3 
>   samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala 
> cbacd183420e9d1d72b05693b55a8f0a62d59fc5 
>   samza-core/src/main/scala/org/apache/samza/container/TaskInstance.scala 
> c5a5ea5dea9a950fc741625238f5bf8b1f362180 
>   samza-core/src/main/scala/org/apache/samza/job/JobRunner.scala 
> 1c178a661e449c6bdfc4ce431aef9bb2d261a6c2 
>   
> samza-core/src/main/scala/org/apache/samza/job/local/ProcessJobFactory.scala 
> 4fac154709d72ab594485dad93c912b55fb1617e 
>   samza-core/src/test/java/org/apache/samza/task/TestTaskClassLoader.java 
> PRE-CREATION 
>   
> samza-core/src/test/scala/org/apache/samza/container/TestSamzaContainer.scala 
> 9fb1aa98fcd14397e8a4cb00c67537482e95fa53 
>   samza-core/src/test/scala/org/apache/samza/container/TestTaskInstance.scala 
> 7caad28c9298485753ab861da76793cf925953ed 
> 
> Diff: https://reviews.apache.org/r/35397/diff/
> 
> 
> Testing
> ---
> 
> unit tests
> 
> 
> Thanks,
> 
> Guozhang Wang
> 
>



Re: Review Request 35933: SAMZA-449 Expose RocksDB statistic

2015-06-30 Thread Yan Fang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35933/#review89913
---



samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbKeyValueStore.scala
 (line 237)


Besides we let users decide when to call the getStatistic() to update the 
metrics, we also can call updateRocksDbStatistic in the flush(). This means, 
everytime we flush the rocksdb, we will update the rocksDbstatistic.



samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbStatistic.scala
 (line 25)


After second thought, I think current approach has a drawback that, 
everytime users call getStatistic, we will generate a new Counter/Gauge. That 
is not ideal.

A better approach is, in this metrics class, to create Counter/Gauge, and 
wrap them together in a method, probabaly called updateRocksDbStatistic(). We 
will just call this updateRocksDbStatistic in RocksDbKeyValueStore.scala to 
"update" the metrics, instead of creating new metrics.


- Yan Fang


On June 26, 2015, 5:56 p.m., Gustavo Anatoly F. V. Solís wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35933/
> ---
> 
> (Updated June 26, 2015, 5:56 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> RocksDB statistic
> 
> 
> Diffs
> -
> 
>   
> samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbKeyValueStore.scala
>  a423f7bd6c43461e051b5fd1f880dd01db785991 
>   
> samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbStatistic.scala
>  PRE-CREATION 
>   
> samza-kv-rocksdb/src/test/scala/org/apache/samza/storage/kv/TestRocksDbKeyValueStore.scala
>  a428a16bc1e9ab4980a6f17db4fd810057d31136 
> 
> Diff: https://reviews.apache.org/r/35933/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Gustavo Anatoly F. V. Solís
> 
>



Re: Review Request 35933: SAMZA-449 Expose RocksDB statistic

2015-06-30 Thread Yan Fang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35933/#review89902
---



samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbKeyValueStore.scala
 (line 150)


it should accept the "metrics" (from line 143) as the third parameter.



samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbKeyValueStore.scala
 (line 243)


add some doc here, saying something like, calling this method will expose 
RocksDB statistic to a metric, etc.



samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbStatistic.scala
 (line 25)


It's better to use the name RocksDbStatisticMetrics to indicate it is a 
metrics.



samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbStatistic.scala
 (line 40)


can we use the lower case for the all names of metrics? It's better to keep 
the name convension consistent in the Samza.


- Yan Fang


On June 26, 2015, 5:56 p.m., Gustavo Anatoly F. V. Solís wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35933/
> ---
> 
> (Updated June 26, 2015, 5:56 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> RocksDB statistic
> 
> 
> Diffs
> -
> 
>   
> samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbKeyValueStore.scala
>  a423f7bd6c43461e051b5fd1f880dd01db785991 
>   
> samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbStatistic.scala
>  PRE-CREATION 
>   
> samza-kv-rocksdb/src/test/scala/org/apache/samza/storage/kv/TestRocksDbKeyValueStore.scala
>  a428a16bc1e9ab4980a6f17db4fd810057d31136 
> 
> Diff: https://reviews.apache.org/r/35933/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Gustavo Anatoly F. V. Solís
> 
>