Re: Question regarding Samza's Kafka consumer

2018-07-10 Thread Thomas Becker
Thanks for your reply, Jagadish. We will certainly do some testing; I was just 
curious whether anyone had tried this or knew what we could expect. The broker 
format change is actually over 2 years old now, so I assumed someone had tried 
it by now ;)

-Tommy

On Mon, 2018-07-09 at 12:22 -0700, Jagadish Venkatraman wrote:

Hi Thomas,

> Has Samza been tested against newer broker versions using the new message
> format, and if so does it have a significant performance impact?

We have not benchmarked Kafka broker performance with the new message format.
Any benchmark may not be reliably reproducible, since there are many variables
(message sizes, compressibility, how "saturated" the brokers are at the instant
you run the benchmark).

I'd suggest some general pointers on this:

   - First, define the metric you're trying to optimize: broker-side
     throughput, Samza's throughput, or broker CPU utilization?
   - Specify what the acceptable value of the metric is for your current
     setup.
   - Then, measure it for your workload. For all you know, the performance
     might be "good enough".
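
As a rough illustration of the three pointers above (pick a metric, fix an
acceptable value, then measure), a minimal Java sketch might look like the
following. The class and method names are hypothetical, the loop is only a
stand-in for a real workload, and the floor of 100,000 messages/sec is an
arbitrary assumption, not a recommendation.

```java
/** Hypothetical helper for checking a measured rate against a target; not part of Samza or Kafka. */
final class ThroughputCheck {

    /** Messages per second over a measured interval. */
    static double messagesPerSecond(long messages, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return messages / (elapsedNanos / 1_000_000_000.0);
    }

    /** True when the measured rate meets the floor chosen for the current setup. */
    static boolean goodEnough(double measuredPerSecond, double acceptablePerSecond) {
        return measuredPerSecond >= acceptablePerSecond;
    }

    public static void main(String[] args) {
        double acceptablePerSecond = 100_000.0; // assumed target; choose your own

        long start = System.nanoTime();
        long processed = 0;
        for (int i = 0; i < 1_000_000; i++) {
            processed++; // stand-in for consuming and processing one message
        }
        long elapsed = Math.max(1, System.nanoTime() - start);

        double rate = messagesPerSecond(processed, elapsed);
        System.out.printf("%.0f msgs/sec, good enough: %b%n",
                rate, goodEnough(rate, acceptablePerSecond));
    }
}
```

Running the same check before and after the format change, on the same
workload, gives a like-for-like comparison even if the absolute numbers are
not reproducible elsewhere.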


> Are there plans to move Samza to the new consumer?

We certainly have plans to move to the "new" consumer for reasons unrelated to
client-side throughput (e.g. SSL, long-term support from the Kafka community).
However, these plans have not gotten enough traction.

Please let me know if there are further questions.

-- Jagadish


On Mon, Jul 9, 2018 at 9:47 AM, Thomas Becker <thomas.bec...@tivo.com> wrote:


Anyone have any input here?

On Mon, 2018-07-02 at 11:50 +, Thomas Becker wrote:

Hey folks,

I have a question regarding potential performance impacts of running
Samza against newer Kafka brokers. We have languished on the old on-disk
message format for Kafka for some time, and want to upgrade to the newer
format, which supports timestamps. Samza currently accounts for quite a
bit of our message consumption, and I am concerned that it will cause a
broker performance hit due to downconversion of messages. I know Samza
uses the old SimpleConsumer internally, which does not support the newer
format. Has Samza been tested against newer broker versions using the new
message format, and if so does it have a significant performance impact?
Are there plans to move Samza to the new consumer?

Regards,

Tommy Becker







This email and any attachments may contain confidential and privileged

material for the sole use of the intended recipient. Any review, copying,

or distribution of this email (or any attachments) by others is prohibited.

If you are not the intended recipient, please contact the sender

immediately and permanently delete this email and any attachments. No

employee or agent of TiVo Inc. is authorized to conclude any binding

agreement on behalf of TiVo Inc. by email. Binding agreements with TiVo

Inc. may only be made by a signed written agreement.








Re: Question regarding Samza's Kafka consumer

2018-07-09 Thread Thomas Becker
Anyone have any input here?

On Mon, 2018-07-02 at 11:50 +, Thomas Becker wrote:

Hey folks,

I have a question regarding potential performance impacts of running
Samza against newer Kafka brokers. We have languished on the old on-disk
message format for Kafka for some time, and want to upgrade to the newer
format, which supports timestamps. Samza currently accounts for quite a
bit of our message consumption, and I am concerned that it will cause a
broker performance hit due to downconversion of messages. I know Samza
uses the old SimpleConsumer internally, which does not support the newer
format. Has Samza been tested against newer broker versions using the new
message format, and if so does it have a significant performance impact?
Are there plans to move Samza to the new consumer?

Regards,

Tommy Becker







Question regarding Samza's Kafka consumer

2018-07-02 Thread Thomas Becker
Hey folks,
I have a question regarding potential performance impacts of running
Samza against newer Kafka brokers. We have languished on the old on-disk
message format for Kafka for some time, and want to upgrade to the newer
format, which supports timestamps. Samza currently accounts for quite a
bit of our message consumption, and I am concerned that it will cause a
broker performance hit due to downconversion of messages. I know Samza
uses the old SimpleConsumer internally, which does not support the newer
format. Has Samza been tested against newer broker versions using the new
message format, and if so does it have a significant performance impact?
Are there plans to move Samza to the new consumer?

Regards,
Tommy Becker





Re: Steps to Upgrading Samza (0.9 to 0.12)

2017-03-30 Thread Thomas Becker
Thanks for the reply Yi, and I apologize if I came off a bit snarky.
I'm not sure I agree with the policy (removing migration code and
wanting people to upgrade seem at odds to me), but minimally I think we
should not assume people are upgrading to each new Samza version. We
have done so when features or fixes warrant, and even then on a per-job
basis, and I would expect this is a common practice.

-Tommy

On Thu, 2017-03-30 at 09:50 -0700, Yi Pan wrote:
> Hi, Thomas,
>
> Sorry to hear that you were hit by the removal of the migration code in
> Samza 0.11. We removed it following a deprecate-then-remove policy that
> spans two versions. We were not aware that people were still using 0.9
> after we released 0.11, and we were not expecting a direct upgrade from
> 0.9 to 0.12. The documentation could capture that better. We are making
> changes to the design proposal process so that it is more transparent and
> open to the whole community, through the newly proposed SEP process.
> Breaking changes of this kind will go through the SEP discuss-vote
> process in the future, which will hopefully capture concerns like these
> earlier.
>
> Best!
>
> -Yi
>
> On Thu, Mar 30, 2017 at 7:45 AM, Thomas Becker wrote:
>
> > Yes, we were burned by this. The changelog mapping will be regenerated
> > instead of migrated and the result will completely hose the job
> > (because the mapping was not generated deterministically in previous
> > versions of Samza). I don't understand why the migration code was
> > removed but it was, and to the best of my knowledge the necessity to
> > not skip version 0.10.0 when upgrading was not documented, let alone
> > enforced.
> >
> > On Mon, 2017-03-27 at 10:07 -0700, Jagadish Venkatraman wrote:
> > >
> > > Good observation Jake!
> > >
> > > The code for migration was removed in Samza 0.11. The migration would
> > > read change-log offsets from the checkpoint topic and write them to
> > > the coordinator stream.
> > >
> > > If you're using change-logged stores, I'd recommend upgrading from
> > > 0.9.1 to 0.10.0 first. Otherwise, you will lose offsets for
> > > change-logged stores.
> > >
> > > I suspect you should be okay for the 0.10.0 to 0.12 upgrade.
> > >
> > > On Mon, Mar 27, 2017 at 9:30 AM, Jacob Maes wrote:
> > >
> > > > As I recall, Samza 0.10 introduced the coordinator stream and there
> > > > was code to do an automatic migration to use that feature. @navina,
> > > > @yi, do you know if that migration code is still in Samza 0.12?
> > > >
> > > > If not, then it's probably better to update from 0.9.1 to 0.10.0
> > > > and then to 0.12.0. I don't think there were any changes requiring
> > > > migration between 0.10 and 0.12, so upgrading directly from 0.10 to
> > > > 0.12 is probably less of an issue.
> > > >
> > > > On Fri, Mar 24, 2017 at 11:05 PM, Jagadish Venkatraman <
> > > > jagadish1...@gmail.com> wrote:
> > > >
> > > > > Hi Xiaochuan,
> > > > >
> > > > > > Do I need to upgrade Kafka and/or YARN?
> > > > >
> > > > > *Yarn version:*
> > > > >
> > > > >    - Samza 0.12 supports Yarn 2.6.1 and 2.7.1.
> > > > >    - If you already have 2.6.0 installed (as you have said), I
> > > > >      believe you will be fine (but I'm not sure).
> > > > >
> > > > > *Kafka version:*
> > > > >
> > > > >    - Samza 0.12 upgraded the version of Kafka to 0.10.
> > > > >    - If your Kafka brokers are on an older version of Kafka, you
> > > > >      should upgrade them to at least 0.10. Kafka clients are
> > > > >      usually incompatible with older versions of brokers.
> > > > >

Re: Steps to Upgrading Samza (0.9 to 0.12)

2017-03-30 Thread Thomas Becker
Yes, we were burned by this. The changelog mapping will be regenerated
instead of migrated and the result will completely hose the job
(because the mapping was not generated deterministically in previous
versions of Samza). I don't understand why the migration code was
removed but it was, and to the best of my knowledge the necessity to
not skip version 0.10.0 when upgrading was not documented, let alone
enforced.
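
To make the "do not skip 0.10.0" rule concrete, here is a minimal Java sketch
of how an upgrade plan could be computed and checked. It is purely
illustrative: the class `UpgradePath`, its methods, and the idea of a single
mandatory migration release are assumptions drawn from this thread, and
nothing here suggests Samza ships such a check.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical upgrade planner; not part of Samza. */
final class UpgradePath {

    /** Release that still contains the checkpoint-to-coordinator-stream migration, per this thread. */
    static final String MIGRATION_RELEASE = "0.10.0";

    /** Compares dotted numeric versions such as "0.9.1" and "0.12"; missing parts count as 0. */
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Returns the versions to deploy, in order, inserting the migration release when it would be skipped. */
    static List<String> plan(String from, String to) {
        List<String> hops = new ArrayList<>();
        if (compare(from, MIGRATION_RELEASE) < 0 && compare(to, MIGRATION_RELEASE) > 0) {
            hops.add(MIGRATION_RELEASE);
        }
        hops.add(to);
        return hops;
    }

    public static void main(String[] args) {
        System.out.println(plan("0.9.1", "0.12.0"));
    }
}
```

A job on 0.9.1 asking for 0.12.0 gets a two-step plan through 0.10.0, while a
job already on 0.10.0 gets a direct upgrade.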

On Mon, 2017-03-27 at 10:07 -0700, Jagadish Venkatraman wrote:
> Good observation Jake!
>
> The code for migration was removed in Samza 0.11. The migration would
> read change-log offsets from the checkpoint topic and write them to the
> coordinator stream.
>
> If you're using change-logged stores, I'd recommend upgrading from
> 0.9.1 to 0.10.0 first. Otherwise, you will lose offsets for
> change-logged stores.
>
> I suspect you should be okay for the 0.10.0 to 0.12 upgrade.
>
> On Mon, Mar 27, 2017 at 9:30 AM, Jacob Maes wrote:
>
> > As I recall, Samza 0.10 introduced the coordinator stream and there
> > was code to do an automatic migration to use that feature. @navina,
> > @yi, do you know if that migration code is still in Samza 0.12?
> >
> > If not, then it's probably better to update from 0.9.1 to 0.10.0 and
> > then to 0.12.0. I don't think there were any changes requiring
> > migration between 0.10 and 0.12, so upgrading directly from 0.10 to
> > 0.12 is probably less of an issue.
> >
> > On Fri, Mar 24, 2017 at 11:05 PM, Jagadish Venkatraman <
> > jagadish1...@gmail.com> wrote:
> >
> > > Hi Xiaochuan,
> > >
> > > > Do I need to upgrade Kafka and/or YARN?
> > >
> > > *Yarn version:*
> > >
> > >    - Samza 0.12 supports Yarn 2.6.1 and 2.7.1.
> > >    - If you already have 2.6.0 installed (as you have said), I
> > >      believe you will be fine (but I'm not sure).
> > >
> > > *Kafka version:*
> > >
> > >    - Samza 0.12 upgraded the version of Kafka to 0.10.
> > >    - If your Kafka brokers are on an older version of Kafka, you
> > >      should upgrade them to at least 0.10. Kafka clients are usually
> > >      incompatible with older versions of brokers.
> > >
> > > *Java version:*
> > >
> > >    - Samza 0.12 binaries are compiled using Java 8. Hence, they
> > >      cannot be run on older versions of the Java runtime.
> > >
> > > > I'm extremely new to Samza in terms of operations aspect. I'm
> > > > not sure what information would be relevant in this case so
> > > > please ask away.
> > >
> > > I'd first start by upgrading the Kafka brokers (assuming you're on
> > > Java 8+ already).
> > > Let us know how the migration goes!
> > >
> > > Thanks,
> > > Jagadish
> > >
> > > On Fri, Mar 24, 2017 at 8:23 PM, XiaoChuan Yu wrote:
> > >
> > > > Hi,
> > > >
> > > > What are the general steps for upgrading Samza from 0.9 to 0.12?
> > > > Do I need to upgrade Kafka and/or YARN?
> > > >
> > > > I don't know how Samza was set up initially, but we currently
> > > > have the following setup:
> > > >
> > > > Samza version: 0.9.1
> > > > YARN version: Hadoop 2.6.0-cdh5.4.8
> > > > Kafka version: 0.9.0.1
> > > >
> > > > I think installation of Kafka and YARN were managed through
> > > > Puppet.
> > > > I'm extremely new to Samza in terms of operations aspect. I'm
> > > > not sure what information would be relevant in this case so
> > > > please ask away.
> > > >
> > > > Thanks,
> > > > Xiaochuan Yu
> > >
> > > --
> > > Jagadish V,
> > > Graduate Student,
> > > Department of Computer Science,
> > > Stanford University
--


Tommy Becker

Senior Software Engineer

O +1 919.460.4747

tivo.com






RE: Coordinator URL always 127.0.0.1

2015-07-30 Thread Thomas Becker
Ok, I thought there was some communication from the container to the AM, it 
sounds like you're saying it's in the other direction only?  Don't containers 
heartbeat to the AM?  Regardless, even if we can't get a better address for the 
AM from YARN, we could at least filter the addresses we get back from the JVM 
to exclude loopbacks.
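
The loopback filtering suggested above could be sketched roughly as follows.
This is only an illustration, not Samza's actual code: the class
`RoutableAddressFinder` and its methods are hypothetical names, and the sketch
assumes that any address that is not loopback, link-local, or wildcard is
reachable from other nodes, which may not hold on every network.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper; not part of Samza or YARN. */
final class RoutableAddressFinder {

    /** Returns the first candidate that is not a loopback, link-local, or wildcard address. */
    static Optional<InetAddress> firstRoutable(List<InetAddress> candidates) {
        for (InetAddress addr : candidates) {
            if (!addr.isLoopbackAddress() && !addr.isLinkLocalAddress() && !addr.isAnyLocalAddress()) {
                return Optional.of(addr);
            }
        }
        return Optional.empty();
    }

    /** Collects the addresses of every interface that is up and is not a loopback interface. */
    static List<InetAddress> localAddresses() throws SocketException {
        List<InetAddress> result = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result; // no interfaces reported by the JVM
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            if (nic.isUp() && !nic.isLoopback()) {
                result.addAll(Collections.list(nic.getInetAddresses()));
            }
        }
        return result;
    }

    public static void main(String[] args) throws SocketException {
        Optional<InetAddress> addr = firstRoutable(localAddresses());
        System.out.println(addr.map(InetAddress::getHostAddress)
                .orElse("no non-loopback address found"));
    }
}
```

Unlike InetAddress.getLocalHost(), this walks the interfaces directly, so it
does not depend on how the host name is mapped in /etc/hosts. On a host with
several interfaces it simply takes the first match, so a real implementation
would probably want a configurable override.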

-Tommy

From: Navina Ramesh [nram...@linkedin.com.INVALID]
Sent: Thursday, July 30, 2015 8:40 PM
To: dev@samza.apache.org
Subject: Re: Coordinator URL always 127.0.0.1

Hi Tommy,
Yi is right. Container start is coordinated by the AppMaster using an
NMClient. Container host name and port is provided by the RM during
allocation.
In Yarn (at least, afaik), when the node joins a cluster, the NM registers
itself with the RM. So, the NM might still be using
getLocalhost.getAddress().

I don't know of any other way to programmatically fetch the machine's
hostname (apart from some hacky shell commands).

Cheers,
Navina

On Thu, Jul 30, 2015 at 5:23 PM, Yi Pan wrote:

> Hi, Tommy,
>
> Yeah, I agree that the current implementation is not bullet-proof to any
> different networking configuration on the host. As for the AM <-> container
> communication, if I am not mistaken, it is through the NMClient and the
> node HTTP address is wrapped within the Container object returned from RM.
> I am not very familiar with that part of source code. Navina may be able to
> help more here.
>
> -Yi
>
> On Thu, Jul 30, 2015 at 4:27 PM, Thomas Becker wrote:
>
> > Hi Yi,
> > Thanks a lot for your reply.  I don't doubt we can get it to work by
> > mucking with the networking configuration, but to me this feels like a
> > workaround, not a solution.  InetAddress.getLocalHost().getHostAddress()
> > is not a reliable way of obtaining an IP that other machines can connect
> > to.  Just today I tested on several Linux distros and it did not work on
> > any of them.  Can we do something more robust here?  How does the
> > container communicate status to the AM?
> >
> > -Tommy
> >
> > From: Yi Pan [nickpa...@gmail.com]
> > Sent: Thursday, July 30, 2015 6:48 PM
> > To: dev@samza.apache.org
> > Subject: Re: Coordinator URL always 127.0.0.1
> >
> > Hi, Tommy,
> >
> > I think that it might be a commonly asked question regarding multiple
> > IPs on a single host. A common trick w/o changing code is (copied from SO:
> > http://stackoverflow.com/questions/2381316/java-inetaddress-getlocalhost-returns-127-0-0-1-how-to-get-real-ip
> > )
> >
> > {code}
> >
> >    1. Find your host name. Type: hostname. For example, you find your
> >       hostname is mycomputer.xzy.com
> >    2. Put your host name in your hosts file, /etc/hosts, such as
> >
> >       10.50.16.136 mycomputer.xzy.com
> >
> > {code}
> >
> > -Yi
> >
> > On Thu, Jul 30, 2015 at 11:35 AM, Tommy Becker wrote:
> >
> > > We are testing some jobs on a YARN grid and noticed they are often not
> > > starting up properly due to being unable to connect to the job
> > > coordinator. After some investigation it seems as if the jobs are
> > > always getting a coordinator URL of http://127.0.0.1:  But my
> > > understanding is that the coordinator runs only in the AM, so I'd
> > > expect these URLs to more often than not be to some other machine.
> > > Looking at the code however, I'm not sure how that would ever happen
> > > since the URL for the coordinator always comes from
> > > InetAddress.getLocalHost().getHostAddress() in
> > > org.apache.samza.coordinator.server.HttpServer#getUrl
> > >
> > > Am I off base here?  Because I don't see how this is ever going to
> > > work in scenarios where the AM is on a different node than the
> > > containers.
> > >
> > > --
> > > Tommy Becker
> > > Senior Software Engineer
> > >
> > > Digitalsmiths
> > > A TiVo Company
> > >
> > > www.digitalsmiths.com
> > > tobec...@tivo.com

RE: Coordinator URL always 127.0.0.1

2015-07-30 Thread Thomas Becker
Hi Yi,
Thanks a lot for your reply.  I don't doubt we can get it to work by mucking 
with the networking configuration, but to me this feels like a workaround, not 
a solution.  InetAddress.getLocalHost().getHostAddress() is not a reliable way 
of obtaining an IP that other machines can connect to.  Just today I tested on 
several Linux distros and it did not work on any of them.  Can we do something 
more robust here?  How does the container communicate status to the AM?

-Tommy


From: Yi Pan [nickpa...@gmail.com]
Sent: Thursday, July 30, 2015 6:48 PM
To: dev@samza.apache.org
Subject: Re: Coordinator URL always 127.0.0.1

Hi, Tommy,

I think that it might be a commonly asked question regarding multiple
IPs on a single host. A common trick w/o changing code is (copied from SO:
http://stackoverflow.com/questions/2381316/java-inetaddress-getlocalhost-returns-127-0-0-1-how-to-get-real-ip
)

{code}

   1. Find your host name. Type: hostname. For example, you find your hostname
      is mycomputer.xzy.com
   2. Put your host name in your hosts file, /etc/hosts, such as

      10.50.16.136 mycomputer.xzy.com

{code}

-Yi

On Thu, Jul 30, 2015 at 11:35 AM, Tommy Becker wrote:

> We are testing some jobs on a YARN grid and noticed they are often not
> starting up properly due to being unable to connect to the job coordinator.
> After some investigation it seems as if the jobs are always getting a
> coordinator URL of http://127.0.0.1:  But my understanding is that
> the coordinator runs only in the AM, so I'd expect these URLs to more often
> than not be to some other machine.  Looking at the code however, I'm not
> sure how that would ever happen since the URL for the coordinator always
> comes from InetAddress.getLocalHost().getHostAddress() in
> org.apache.samza.coordinator.server.HttpServer#getUrl
>
> Am I off base here?  Because I don't see how this is ever going to work in
> scenarios where the AM is on a different node than the containers.
>
> --
> Tommy Becker
> Senior Software Engineer
>
> Digitalsmiths
> A TiVo Company
>
> www.digitalsmiths.com
> tobec...@tivo.com





RE: Thoughts and obesrvations on Samza

2015-07-08 Thread Thomas Becker
From my perspective as a user, I like the direction that's being proposed.  
Like apparently many others, we've found YARN to be the biggest hurdle to 
operationalizing Samza, and it's a questionable fit for our deployment model 
(AWS).  A standalone mode that provides the ability to dynamically start and 
stop additional stream job instances and have the partitioning automagically 
rebalance (which as I understand it is part of what is being proposed) seems 
like a clear win in terms of both dependency reduction and functionality.

Looking at Jay's POC code also excites me about potentially being able to 
utilize Samza as a library.  For all its configurability, one thing Samza does 
not allow is customization of how its various components are instantiated and 
wired together. This inflexibility has required us to make a few unfortunate 
design decisions for the sake of efficiency in our stream jobs.

Finally, after reading through the "CopyCat" framework design, I understand how 
that could take the place of pluggable consumers and producers in Samza.  
Shedding that baggage that probably 95% of users won't use anyway feels like it 
could be a win.

-Tommy


From: Jay Kreps [j...@confluent.io]
Sent: Tuesday, July 07, 2015 2:35 PM
To: dev@samza.apache.org
Subject: Re: Thoughts and obesrvations on Samza

Hey Roger,

I couldn't agree more. We spent a bunch of time talking to people and that
is exactly the stuff we heard time and again. What makes it hard, of
course, is that there is some tension between compatibility with what's
there now and making things better for new users.

I also strongly agree with the importance of multi-language support. We are
talking now about Java, but for application development use cases people
want to work in whatever language they are using elsewhere. I think moving
to a model where Kafka itself does the group membership, lifecycle control,
and partition assignment has the advantage of putting all that complex
stuff behind a clean api that the clients are already going to be
implementing for their consumer, so the added functionality for stream
processing beyond a consumer becomes very minor.

-Jay

On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover wrote:

> Metamorphosis...nice. :)
>
> This has been a great discussion.  As a user of Samza who's recently
> integrated it into a relatively large organization, I just want to add
> support to a few points already made.
>
> The biggest hurdles to adoption of Samza as it currently exists that I've
> experienced are:
> 1) YARN - YARN is overly complex in many environments where Puppet would do
> just fine but it was the only mechanism to get fault tolerance.
> 2) Configuration - I think I like the idea of configuring most of the job
> in code rather than config files.  In general, I think the goal should be
> to make it harder to make mistakes, especially of the kind where the code
> expects something and the config doesn't match.  The current config is
> quite intricate and error-prone.  For example, the application logic may
> depend on bootstrapping a topic but rather than asserting that in the code,
> you have to rely on getting the config right.  Likewise with serdes, the
> Java representations produced by various serdes (JSON, Avro, etc.) are not
> equivalent so you cannot just reconfigure a serde without changing the
> code.   It would be nice for jobs to be able to assert what they expect
> from their input topics in terms of partitioning.  This is getting a little
> off topic but I was even thinking about creating a "Samza config linter"
> that would sanity check a set of configs.  Especially in organizations
> where config is managed by a different team than the application developer,
> it's very hard to avoid config mistakes.
> 3) Java/Scala centric - for many teams (especially DevOps-type folks), the
> pain of the Java toolchain (maven, slow builds, weak command line support,
> configuration over convention) really inhibits productivity.  As more and
> more high-quality clients become available for Kafka, I hope they'll follow
> Samza's model.  Not sure how much it affects the proposals in this thread
> but please consider other languages in the ecosystem as well.  From what
> I've heard, Spark has more Python users than Java/Scala.
> (FYI, we added a Jython wrapper for the Samza API
> https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
> and are working on a Yeoman generator
> https://github.com/Quantiply/generator-rico for Jython/Samza projects to
> alleviate some of the pain)
>
> I also want to underscore Jay's point about improving the user experience.
> That's a very important factor for adoption.  I think the goal should be to
> make Samza as easy to get started with as something like Logstash.
> Logstash is vastly inferior in terms of capabilities to Samza but it's easy
> to get started and that makes a big difference.

RE: Storing sensitive data in the Config

2015-03-09 Thread Thomas Becker
Thanks for the response Chris.  I opened 
https://issues.apache.org/jira/browse/SAMZA-589.  A prefix seems like the 
easiest thing.  Would be nice if the keys still show but the values appear 
masked.
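
For illustration, the prefix idea with visible keys and masked values might
look something like the sketch below. Everything in it is an assumption made
for the example: the `sensitive.` prefix, the mask string, and the
`ConfigMasker` class are hypothetical and are not taken from Samza or from
SAMZA-589.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper; the prefix and mask are illustrative, not Samza API. */
final class ConfigMasker {

    static final String SENSITIVE_PREFIX = "sensitive."; // assumed naming convention
    static final String MASK = "********";

    /** Returns a copy suitable for logs or a UI: every key survives, sensitive values are replaced. */
    static Map<String, String> maskForDisplay(Map<String, String> config) {
        Map<String, String> masked = new TreeMap<>();
        for (Map.Entry<String, String> entry : config.entrySet()) {
            boolean sensitive = entry.getKey().startsWith(SENSITIVE_PREFIX);
            masked.put(entry.getKey(), sensitive ? MASK : entry.getValue());
        }
        return masked;
    }

    public static void main(String[] args) {
        Map<String, String> config = new TreeMap<>();
        config.put("job.name", "my-job");
        config.put("sensitive.db.password", "hunter2");
        System.out.println(maskForDisplay(config));
    }
}
```

The original map is left untouched, so job code that explicitly asks for a
sensitive key still gets the real value; only the copy handed to logging or
the AM UI is masked.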


From: Chris Riccomini [criccom...@apache.org]
Sent: Monday, March 09, 2015 7:27 PM
To: dev@samza.apache.org
Subject: Re: Storing sensitive data in the Config

Hey Tommy,

Yea, this has come up a few times. We don't currently have an answer for
it. The simplest thing to do would be to have a prefix. Any config with the
prefix could be stripped from the AM and logs. Another possibility is to
store the configs in an encrypted way, and have the code decrypt the
configs at runtime.

Can you open a JIRA up to track this? Do you have any other thoughts on the
best way to handle this?

Cheers,
Chris

On Mon, Mar 9, 2015 at 1:00 PM, Tommy Becker wrote:

> We have some sensitive information that we are currently storing in the
> Samza config.  Our ops guys have some concern regarding where the config is
> displayed (e.g. in logs, app master UI, etc).  I'm curious if others have
> had similar concerns and if so what you did about it.  Seems like we might
> be able to use system properties for these things, albeit at a significant
> cost to convenience.  It would be nice if it were possible to mark config
> values as sensitive (perhaps via some sort of naming convention), and have
> such values be retrievable only via explicit get on the key so these sort
> of incidental exposures can't happen.
>
> --
> Tommy Becker
> Senior Software Engineer
>
> Digitalsmiths
> A TiVo Company
>
> www.digitalsmiths.com
> tobec...@tivo.com


