Re: what does the utilization metric in SamzaContainerMetrics show?

2015-06-25 Thread Luis Fernando De Pombo
Hi Shadi,
Thanks for asking. This metric tracks the utilization of the event loop
within a Samza container. The event loop runs on a single thread that is in
charge of reading and writing messages, flushing metrics, checkpointing, and
windowing. It is important to track the utilization (aka "duty cycle") of
any event loop, which is the sum of all the timings (activeMs) divided by
the window length (totalMs).

You are right in that most of the time this value will be close to 1, which
represents complete utilization. However when the event loop starts to have
idle time, this metric will give you an idea of how much headroom you have
before the job will start to seriously fall behind.
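
As a rough illustration of the duty-cycle idea, here is a minimal sketch in
Python rather than Samza's Scala. The fake clock and the run_iteration helper
are made up for this example and are not taken from SamzaContainer. The point
is that utilization only drops below 1 once part of a loop iteration is spent
outside the timed work:

```python
class FakeClock:
    """Deterministic millisecond clock (hypothetical, for illustration only)."""

    def __init__(self):
        self.now_ms = 0

    def __call__(self):
        return self.now_ms

    def advance(self, ms):
        self.now_ms += ms


def run_iteration(clock, busy_ms, idle_ms):
    """One pass of a hypothetical event loop.

    busy_ms lists the durations of timed steps (stand-ins for process,
    window and commit); idle_ms is time spent waiting that is not timed.
    Returns activeMs / totalMs for this pass.
    """
    loop_start = clock()
    active_ms = 0
    for work in busy_ms:
        start = clock()
        clock.advance(work)           # pretend to do the work
        active_ms += clock() - start  # only timed work counts as active
    clock.advance(idle_ms)            # untimed waiting, e.g. no input available
    total_ms = clock() - loop_start
    return active_ms / total_ms if total_ms > 0 else 0.0


clock = FakeClock()
busy = run_iteration(clock, [40, 5, 5], idle_ms=0)    # loop is saturated
light = run_iteration(clock, [10, 5, 5], idle_ms=80)  # loop has headroom
print(busy, light)
```

In the saturated pass every millisecond is accounted for by timed work, which
is the "almost 1" case from the question; the second pass shows the headroom
case.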

I hope that answers your question!


On Thu, Jun 25, 2015 at 6:39 PM, Shadi Noghabi <
snogh...@linkedin.com.invalid> wrote:

> Hi,
>
> I was wondering what this utilization metric in the
> SamzaContainerMetrics shows. I am asking this since in the code it is
> calculated as below:
>
> while (!shutdownNow) {
>   val loopStartTime = clock();
>   process
>   window
>   commit
>   val totalMs = clock() - loopStartTime
>   metrics.utilization.set(activeMs.toFloat/totalMs)
>   activeMs = 0L
> }
>
> Here totalMs is the time from the start of process until commit is done,
> which is equal to the time it takes to run process, window, and commit.
> And the way activeMs is calculated is by summing up the time
> it takes to call process, window and commit, which means these two values
> are going to be almost the same and the utilization is always going to be
> almost 1.
>
> I was just wondering what is the point of doing this?
>
>
>
>


what does the utilization metric in SamzaContainerMetrics show?

2015-06-25 Thread Shadi Noghabi
Hi,

I was wondering what this utilization metric in the SamzaContainerMetrics
shows. I am asking this since in the code it is calculated as below:

while (!shutdownNow) {
  val loopStartTime = clock();
  process
  window
  commit
  val totalMs = clock() - loopStartTime
  metrics.utilization.set(activeMs.toFloat/totalMs)
  activeMs = 0L
}

Here totalMs is the time from the start of process until commit is done,
which is equal to the time it takes to run process, window, and commit.
And the way activeMs is calculated is by summing up the time it takes to call
process, window and commit, which means these two values are going to be almost
the same and the utilization is always going to be almost 1.

I was just wondering what is the point of doing this?





Re: monitoring best practices

2015-06-25 Thread Milinda Pathirage
Hi Christopher,

Recently I did something similar, but for getting performance numbers out
of Samza. I used InfluxDB. I wrote a stream task which consumes the
metrics topic and deployed it as another Samza job. From that job I
pushed metrics into InfluxDB.
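
A minimal sketch of the conversion step in such a job, written in Python for
brevity rather than as a Samza StreamTask. It assumes each message on the
metrics topic is a JSON object with a "header" map (job-name, container-name,
time in milliseconds) and a "metrics" map of group -> metric name -> value;
that layout, the metric names and the tag names are assumptions to check
against real messages. The output follows the general shape of InfluxDB line
protocol, with escaping of special characters left out:

```python
import json


def snapshot_to_lines(message_json):
    """Flatten one metrics snapshot (assumed layout) into line-protocol strings."""
    snapshot = json.loads(message_json)
    header = snapshot.get("header", {})
    job = header.get("job-name", "unknown")
    container = header.get("container-name", "unknown")
    timestamp_ms = header.get("time", 0)

    lines = []
    for group, metrics in sorted(snapshot.get("metrics", {}).items()):
        for name, value in sorted(metrics.items()):
            # Keep numeric gauges/counters only; skip strings, lists, booleans.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            lines.append(
                f"{group},job={job},container={container},metric={name} "
                f"value={float(value)} {timestamp_ms}"
            )
    return lines


# Made-up example message; the metric names are illustrative only.
example = json.dumps({
    "header": {
        "job-name": "my-job",
        "container-name": "samza-container-0",
        "time": 1435276800000,
    },
    "metrics": {
        "org.apache.samza.container.SamzaContainerMetrics": {
            "utilization": 0.25,
            "process-calls": 3,
        }
    },
})
lines = snapshot_to_lines(example)
for line in lines:
    print(line)
```

Because the timestamp here is in milliseconds, the write to the database
would need to declare that precision.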

Thanks
Milinda

On Thu, Jun 25, 2015 at 5:22 PM, Christopher Chamberlin <
chris.chamber...@climate.com> wrote:

> I see in the Samza metrics documentation that there are two basic ways to
> get metrics from Samza to a metrics repository: 1) write a custom
> MetricsReporter to push the metrics directly, perhaps using an
> ExecutorService or similar to perform batching, or 2) consume the metrics
> Kafka queue and push them from there, letting the built-in
> MetricsSnapshotReporter do the batching.
>
> Can anyone running Samza in production provide any insight into which of
> these to prefer?
>
> I'm looking to get my metrics out to Prometheus, probably via a Pushgateway
> endpoint.
>
> I see the pending SAMZA-340 patch to add Graphite support using approach
> #1. I like going directly from the monitored container to the monitoring
> system (fewer moving parts than going via Kafka in method #2), but I'd
> rather not re-implement the batching and other logic in the existing
> MetricsSnapshotReporter.
>
> Thanks.
>



-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org


Re: Review Request 35492: SAMZA-701 : Hello Samza - Port docker setup from hadoop-common

2015-06-25 Thread Yan Fang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35492/#review89451
---



README.md (line 9)


Prefer to use the name "start-docker-env.sh" to make it more specific.



dev-support/docker/Dockerfile (line 29)


Running apt-get update on a line by itself is not recommended.

From https://docs.docker.com/articles/dockerfile_best-practices/ : "Don’t 
do RUN apt-get update on a single line. This will cause caching issues if the 
referenced archive gets updated, which will make your subsequent apt-get 
install fail without comment."


- Yan Fang


On June 16, 2015, 7:36 a.m., Darrell Taylor wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35492/
> ---
> 
> (Updated June 16, 2015, 7:36 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza-hello-samza
> 
> 
> Description
> ---
> 
> Take the really useful docker setup from avro and hadoop-common and make it 
> work for hello samza
> 
> 
> Diffs
> -
> 
>   README.md 4463454 
>   conf/yarn-site.xml 9028590 
>   dev-support/docker/Dockerfile PRE-CREATION 
>   dev-support/docker/hadoop_env_checks.sh PRE-CREATION 
>   start-env.sh PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/35492/diff/
> 
> 
> Testing
> ---
> 
> * Run ./start-env.sh from the top level directory
> * Followed the instructions from "Start a Grid" on this page : 
> http://samza.apache.org/startup/hello-samza/0.8/
> 
> 
> Thanks,
> 
> Darrell Taylor
> 
>



Re: [VOTE] Apache Samza 0.9.1 RC0

2015-06-25 Thread Yan Fang
no objection from me. :)

Thanks,

Fang, Yan
yanfang...@gmail.com

On Thu, Jun 25, 2015 at 4:18 PM, Yi Pan  wrote:

> Hi, all,
>
> I have been preparing for the new 0.9.1 RC1 and it is close to being done. I
> am going to cancel this VOTE if there are no objections.
>
> Thanks!
>

Re: [VOTE] Apache Samza 0.9.1 RC0

2015-06-25 Thread Yi Pan
Hi, all,

I have been preparing for the new 0.9.1 RC1 and it is close to being done. I
am going to cancel this VOTE if there are no objections.

Thanks!

On Mon, Jun 22, 2015 at 5:41 PM, Yan Fang  wrote:

> Hi Yi,
>
> This only publishes the artifacts to the staging repository for testing.
> After completing the vote, you can "release" the artifacts to the public
> repository by clicking the "release" button. :)
>
> Thanks,
>
> Fang, Yan
> yanfang...@gmail.com
>
> On Mon, Jun 22, 2015 at 5:30 PM, Yi Pan  wrote:
>
> > Hi, Yan,
> >
> > Thanks for pointing that out! Actually I saw that last time and had the
> > following question: should we publish the artifacts after the VOTE is
> > completed or together w/ the VOTE?
> > It seems that we want to publish the binary artifacts together w/ the
> > VOTE, right?
> >
> > -Yi
> >
> >
> > On Mon, Jun 22, 2015 at 5:25 PM, Yan Fang  wrote:
> >
> > > Hi Yi Pan,
> > >
> > > " Is there any document regarding to how to publish the maven staging
> > > link? "
> > >
> > >   -- Yes. Check the last part of the
> > > https://github.com/apache/samza/blob/master/RELEASE.md . Not sure if you
> > > have seen this. I should have pointed it out earlier. *_*
> > >
> > > Thanks,
> > >
> > > Fang, Yan
> > > yanfang...@gmail.com
> > >
> > > On Mon, Jun 22, 2015 at 3:47 PM, Yi Pan  wrote:
> > >
> > > > Hi, guys,
> > > >
> > > > I am working on the list of things posted by Yan:
> > > >
> > > > 1. I have difficulty building the 0.9.1 branch. I think this is
> > > > mainly related to SAMZA-721.
> > > >
> > > > This seems to be an invalid case in 0.9.1. We only need the
> > > > joint-compilation option in master.
> > > >
> > > > 2. Also, https://issues.apache.org/jira/browse/SAMZA-712 seems to be
> > > > bothering people as well.
> > > >
> > > > Committed to master and backported to 0.9.1.
> > > >
> > > > 3. https://issues.apache.org/jira/browse/SAMZA-720 is a critical bug we
> > > > need to fix. Have already attached a patch.
> > > >
> > > > Plan to backport to 0.9.1.
> > > >
> > > > 4. There is no maven staging link.
> > > >
> > > > Is there any document regarding how to publish the maven staging link?
> > > >
> > > > On Mon, Jun 22, 2015 at 3:02 PM, Naveen Somasundaram <
> > > > nsomasunda...@linkedin.com.invalid> wrote:
> > > >
> > > > > Hey Yan,
> > > > > SAMZA-721 might be because you checked out master and switched
> > > > > to the 0.9.1 branch, and you still have some files from master which
> > > > > git is not tracking. Can you try a git clean before you build 0.9.1?
> > > > > AFAIK you don't need joint compilation for core in 0.9.1.
> > > > >
> > > > > On Mon, Jun 22, 2015 at 1:25 PM, Roger Hoover <roger.hoo...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Yan,
> > > > > >
> > > > > > I tested the patch locally and it looks good.  Creating a patched
> > > > > > release for myself to test in our environment.  Thanks, again.
> > > > > >
> > > > > > Sent from my iPhone
> > > > > >
> > > > > > > On Jun 22, 2015, at 10:59 AM, Yi Pan wrote:
> > > > > > >
> > > > > > > Hi, Yan,
> > > > > > >
> > > > > > > Thanks a lot for the quick fix on the mentioned bugs. It seems the
> > > > > > > fix for SAMZA-720 is pretty localized and I am OK to push it into
> > > > > > > 0.9.1. I will be working on backporting those changes to 0.9.1 later
> > > > > > > today and fix all the release related issues.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > -Yi
> > > > > > >
> > > > > > > On Mon, Jun 22, 2015 at 10:30 AM, Roger Hoover <roger.hoo...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Yan,
> > > > > > >>
> > > > > > >> You rock.  Thank you so much for the quick fix.  I'm working on
> > > > > > >> building and testing the patch.
> > > > > > >>
> > > > > > >> Cheers,
> > > > > > >>
> > > > > > >> Roger
> > > > > > >>
> > > > > > >>> On Mon, Jun 22, 2015 at 1:09 AM, Yan Fang <yanfang...@gmail.com>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>> Hi guys,
> > > > > > >>>
> > > > > > >>> 1. I have difficulty building the 0.9.1 branch. I think this is
> > > > > > >>> mainly related to SAMZA-721.
> > > > > > >>>
> > > > > > >>> 2. Also, https://issues.apache.org/jira/browse/SAMZA-712 seems to be
> > > > > > >>> bothering people as well.
> > > > > > >>>
> > > > > > >>> 3. https://issues.apache.org/jira/browse/SAMZA-720 is a critical bug
> > > > > > >>> we need to fix. Have already attached a patch.
> > > > > > >>>
> > > > > > >>> 4. There is no maven staging link.
> > > > > > >>>
> > > > > > >>> Thanks,
> > > > > > >>>
> > > > > > >>> Fang, Yan
> > > > > > >>> yanfang...@gmail.com
> > > > > > >>>
> > > > > > >>

monitoring best practices

2015-06-25 Thread Christopher Chamberlin
I see in the Samza metrics documentation that there are two basic ways to
get metrics from Samza to a metrics repository: 1) write a custom
MetricsReporter to push the metrics directly, perhaps using an
ExecutorService or similar to perform batching, or 2) consume the metrics
Kafka queue and push them from there, letting the built-in
MetricsSnapshotReporter do the batching.
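
For approach 1, the batching part could look roughly like the sketch below.
It is written in Python for brevity and is not a Samza MetricsReporter; the
BatchingSink class, its parameters and the send callback are hypothetical
names. It only flushes by size, whereas a real reporter would probably also
flush on a timer (for example from a scheduled executor):

```python
class BatchingSink:
    """Buffers data points and hands them to `send` in batches (illustrative only)."""

    def __init__(self, send, max_batch=100):
        self._send = send          # callable that pushes one batch downstream
        self._max_batch = max_batch
        self._buffer = []

    def add(self, point):
        self._buffer.append(point)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self):
        # Swap the buffer out before sending so a failing send cannot
        # leave half-sent points mixed with new ones.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


sent = []                                   # stands in for the monitoring system
sink = BatchingSink(sent.append, max_batch=3)
for i in range(7):
    sink.add(("some-metric", i))
sink.flush()                                # push the final partial batch
print([len(batch) for batch in sent])
```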

Can anyone running Samza in production provide any insight into which of
these to prefer?

I'm looking to get my metrics out to Prometheus, probably via a Pushgateway
endpoint.

I see the pending SAMZA-340 patch to add Graphite support using approach
#1. I like going directly from the monitored container to the monitoring
system (fewer moving parts than going via Kafka in method #2), but I'd
rather not re-implement the batching and other logic in the existing
MetricsSnapshotReporter.

Thanks.


Re: Hopping and tumbling windows in streaming SQL

2015-06-25 Thread Julian Hyde
Glad you like it. I've filled out the spec with some more examples, see below.

Here are the proposed functions:

* HOP(t, emit, retain)
* HOP(t, emit, retain, align)
* TUMBLE(t, emit)

TUMBLE(t, e) is equivalent to HOP(t, e, e).

HOP(t, e, r) is equivalent to HOP(t, e, r, TIME '00:00:00').

Q1. One hour tumbling window:

SELECT STREAM START(rowtime),
  COUNT(*)
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR)

Emits a row at 01:00 containing rows in [00:00, 01:00);
emits a row at 02:00 containing rows in [01:00, 02:00), etc.

Q2. Same as Q1, expressed using HOP:

SELECT STREAM START(rowtime),
  COUNT(*)
FROM Orders
GROUP BY HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '1' HOUR)

Q3. Hopping window

SELECT STREAM START(rowtime),
  COUNT(*)
FROM Orders
GROUP BY HOP(rowtime,
INTERVAL '30' MINUTE,
INTERVAL '1:45' HOUR TO MINUTE)

Emits a row at 01:00 containing rows in [23:15, 01:00);
emits a row at 01:30 containing rows in [23:45, 01:30), etc.

Q4. Aligned hopping window

SELECT STREAM START(rowtime),
  COUNT(*)
FROM Orders
GROUP BY HOP(rowtime,
INTERVAL '1:30' HOUR TO MINUTE,
INTERVAL '2' HOUR, TIME '0:30')

Emits a row at 00:30 containing rows in [22:30, 00:30);
emits a row at 02:00 containing rows in [00:00, 02:00), etc.

Q5. Aligned tumbling window

TUMBLE does not have an align argument, so you need to use HOP.

SELECT STREAM START(rowtime),
  COUNT(*)
FROM Orders
GROUP BY HOP(rowtime,
INTERVAL '1' HOUR,
INTERVAL '1' HOUR,
TIME '0:30')

Emits a row at 00:30 containing rows in [23:30, 00:30);
emits a row at 01:30 containing rows in [00:30, 01:30), etc.
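
The window boundaries in Q1 to Q5 can be reproduced with a small sketch. It
is in Python, with times as minutes since midnight, and it assumes the
reading suggested by the "Emits a row at ..." lines: a window that is emitted
at time `end` covers [end - retain, end), and window ends fall on
align + k * emit. The function names are illustrative and are not Calcite or
Samza APIs:

```python
def hop_windows(t, emit, retain, align=0):
    """Return the (start, end) windows that a row at time t falls into.

    All arguments are in the same unit (minutes here); emit must be positive.
    """
    # Smallest window end strictly after t that lies on the emit grid.
    end = ((t - align) // emit + 1) * emit + align
    windows = []
    while end - retain <= t:
        windows.append((end - retain, end))
        end += emit
    return windows


def tumble_windows(t, emit):
    """TUMBLE(t, e) is HOP(t, e, e): exactly one window per row."""
    return hop_windows(t, emit, emit)


def hhmm(minutes):
    """Format minutes since midnight as HH:MM, wrapping across days."""
    minutes %= 24 * 60
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def show(windows):
    return [(hhmm(start), hhmm(end)) for start, end in windows]


print(show(tumble_windows(30, 60)))        # Q1, row at 00:30
print(show(hop_windows(10, 90, 120, 30)))  # Q4, row at 00:10
print(show(hop_windows(60, 60, 60, 30)))   # Q5, row at 01:00
```

With these inputs the Q4 row lands in both [22:30, 00:30) and [00:00, 02:00),
and the Q5 row lands only in [00:30, 01:30), matching the intervals listed
above.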

Q6. Decaying average

SELECT STREAM END(rowtime),
  productId,
  SUM(unitPrice * EXP((rowtime - START(rowtime)) SECOND / INTERVAL '1' HOUR))
   / SUM(EXP((rowtime - START(rowtime)) SECOND / INTERVAL '1' HOUR))
FROM Orders
GROUP BY HOP(rowtime,
INTERVAL '1' SECOND,
INTERVAL '1' HOUR),
  productId

Emits a row at 00:00:00 containing rows in [23:00:00, 00:00:00);
emits a row at 00:00:01 containing rows in [23:00:01, 00:00:01).

The expression weighs recent orders more heavily than older orders.
Extending the window from 1 hour to 2 hours or 1 year would have
virtually no effect on the accuracy of the result (but use more memory
and compute).
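
The weighting in Q6 can be checked by hand with a small sketch (Python, times
in seconds; decaying_average and its parameters are illustrative names, not
part of Calcite or Samza). Each row is weighted by EXP of its offset from the
window start divided by one hour, so later rows count for more:

```python
import math


def decaying_average(rows, window_start, scale_seconds=3600.0):
    """Exponentially weighted average of unit prices within one window.

    rows is an iterable of (rowtime_seconds, unit_price). Returns None for
    an empty window.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for rowtime, unit_price in rows:
        weight = math.exp((rowtime - window_start) / scale_seconds)
        weighted_sum += unit_price * weight
        weight_total += weight
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total


# One order at the start of the window and one an hour later.
rows = [(0, 10.0), (3600, 20.0)]
average = decaying_average(rows, window_start=0)
print(average)
```

The plain mean of the two prices would be 15; the weighted result is pulled
towards the more recent price of 20.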

Note that we use START inside an aggregate function (SUM) because it
is a value that is constant for all rows within a sub-total. This
would not be allowed for typical aggregate functions (SUM, COUNT
etc.). START and END behave more like the GROUPING() function in this
regard.

Q7. Non-streaming query

HOP and TUMBLE were devised for a use case that occurs in streaming
SQL, but they can be used in non-streaming queries. For example,

SELECT START(rowtime),
  COUNT(*)
FROM Orders
WHERE rowtime BETWEEN '2015-01-01 00:00:00'
AND '2015-01-18 00:00:00'
GROUP BY TUMBLE(rowtime,
INTERVAL '2' HOUR)

This is similar to Q1 (here with a two-hour window and a filter on rowtime),
but omits the STREAM keyword, so it queries the table containing historical
orders.

Q8. Grouping sets

It should be possible to mix HOP and TUMBLE in with GROUPING SETS but
I haven't devised an example.

While we're on the subject of GROUPING SETS, I should state for the record:
* GROUPING SETS is valid for a streaming query provided that every
grouping set contains a monotonic expression.
* CUBE and ROLLUP are not valid for a streaming query, because they will
produce at least one grouping set that aggregates everything (like
"GROUP BY ()").

Maybe we should allow CUBE and ROLLUP with an understanding that some
levels of aggregation will never complete (because they have no
monotonic expressions) and thus will never be emitted.
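
The rule about grouping sets can be sketched as a simple check. This is
Python, the function names are illustrative, and "monotonic" is reduced to
membership in a given set of column names, which is a simplification of what
a planner would do:

```python
from itertools import combinations


def rollup(columns):
    """Grouping sets produced by ROLLUP: every prefix, down to the empty set."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]


def cube(columns):
    """Grouping sets produced by CUBE: every subset, down to the empty set."""
    return [subset
            for size in range(len(columns), -1, -1)
            for subset in combinations(columns, size)]


def valid_for_streaming(grouping_sets, monotonic_columns):
    """True if every grouping set contains at least one monotonic column."""
    return all(
        any(column in monotonic_columns for column in grouping_set)
        for grouping_set in grouping_sets
    )


monotonic = {"rowtime"}
explicit_sets = [("rowtime", "productId"), ("rowtime",)]

print(valid_for_streaming(explicit_sets, monotonic))
print(rollup(["rowtime", "productId"]))
print(valid_for_streaming(rollup(["rowtime", "productId"]), monotonic))
print(valid_for_streaming(cube(["rowtime", "productId"]), monotonic))
```

ROLLUP and CUBE both include the empty grouping set, which has no monotonic
column, so the check fails for them while the explicit grouping sets pass.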

Julian

On Thu, Jun 25, 2015 at 7:32 AM, Milinda Pathirage
 wrote:
> Hi Julian,
>
> This is a great improvement over the previous hopping window. Thanks for
> thinking through this. I would also like it if we could introduce a TUMBLE
> function with more control over how we define tumbling window size. With the
> current FLOOR based model we have to perform date/time arithmetic to have
> tumbling windows such as 5 minute tumbling windows (maybe there is a better
> way that I don't know). But a TUMBLE function that can specify parameters
> such as window size would be nice. I am +1 for other extensions as well.
>
> Thanks
> Milinda
>
> On Wed, Jun 24, 2015 at 6:32 PM, Julian Hyde  wrote:
>
>> Hi all,
>>
>> Forgive the cross-post. This is for Calcite devs interested in
>> streaming and Samza devs interested in SQL.
>>
>> I've been thinking some more about how to implement hopping and
> >> tumbling windows in streaming SQL. I was previously at a loss to find
> >> a concise syntax that is consistent with SQL semantics, but I have
> >> found a syntax that I think can please everyone.
>>
>> Recall that a hopping window emits a sub-total every X seconds of
>> records that have arrived over the last Y seconds. A tumbling window
>> is a hopping window where X and Y are equal.
>>
>> In https://calcite.incubator.apache.org/docs/stream.html#hopping-windo

Re: Review Request 35445: SAMZA-693: Very basic HDFS Producer service for Samza

2015-06-25 Thread Yan Fang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35445/#review89390
---



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemProducer.scala 
(line 47)


make this configurable as well?



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemProducer.scala 
(line 58)


Agree to make it pluggable.



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemProducer.scala 
(line 62)


I would recommend that all the log msgs follow the same format as other Samza 
code by removing the "s".



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemProducer.scala 
(line 76)


We can have the debug information there.



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemProducer.scala 
(line 81)


same as above



samza-hdfs/src/test/resources/samza-hdfs-test-job.properties (line 1)


Is this used anywhere? We can put it in the testing class since it only has 
one property.


- Yan Fang


On June 14, 2015, 10:17 p.m., Eli Reisman wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35445/
> ---
> 
> (Updated June 14, 2015, 10:17 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> SAMZA-693: Very basic HDFS Producer service for Samza
> 
> 
> Diffs
> -
> 
>   build.gradle a5f54106a822dc91ff82270df27217a8765a0d80 
>   samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
> PRE-CREATION 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
> PRE-CREATION 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala
>  PRE-CREATION 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemProducer.scala
>  PRE-CREATION 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemProducerMetrics.scala
>  PRE-CREATION 
>   
> samza-hdfs/src/test/org/apache/samza/system/hdfs/TestHdfsSystemProducer.scala 
> PRE-CREATION 
>   samza-hdfs/src/test/resources/samza-hdfs-test-job.properties PRE-CREATION 
>   settings.gradle bb07a3b84b14dcef94da1bb166eab6aa3d0026bb 
> 
> Diff: https://reviews.apache.org/r/35445/diff/
> 
> 
> Testing
> ---
> 
> New unit test, but it's fairly rudimentary. Passes "./gradlew test" and 
> "./gradlew check"
> 
> This only supplies an HDFS Producer, and this producer only writes 
> SequenceFiles of ByteWriteables so far. If the patch were accepted as-is, I'd 
> suggest future tickets for a matching HDFS Consumer, and a pluggable set of 
> output formats, configurable via HdfsConfig settings.
> 
> On the upside, this patch has been tested on a real cluster with real data, 
> using several serdes, with good results.
> 
> 
> Thanks,
> 
> Eli Reisman
> 
>



Re: Installing Samza w/o internet connection

2015-06-25 Thread Yi Pan
Hi, Amos,

I assume that you are referring to "preparing the build environment for
Samza source code". As Milinda said, to set up the build environment, you
will need either a) an Internet connection to download the required packages
from Maven, or b) a cached collection of the required packages on your local
machine. The download is only required once, and you will not need the
Internet to build Samza as long as you keep the local cached copy of the
required dependencies.

Thanks!

-Yi

On Thu, Jun 25, 2015 at 6:45 AM, Milinda Pathirage 
wrote:

> Hi Amos,
>
> Can you please let us know what you mean by preparing Samza? Are you
> trying to build a Samza job package? Or Samza from sources?
>
> I don't think it's possible to build from source (both Samza and a Samza job
> package) without internet on a new machine. One thing you can try is to
> build on a machine with internet and copy that machine's maven repository
> to the other machine or copy the relevant packages to the other machine.
>
> Thanks
> Milinda
>
> On Thu, Jun 25, 2015 at 6:45 AM, Amos Mosseri  wrote:
>
> > Hi,
> >
> > I'm trying to evaluate Samza, and  I can't figure out how to install it
> on
> > a computer w/o internet connection.
> >
> > I've already downloaded the samza source, but when I run the gradle build
> > command, I get errors about resolving:
> >
> > * de.obqo.gradle:gradle-lesscss-plugin:1.0-1.3.3
> >
> > * org.gradle.api.plugins:gradle-nexus-plugin:0.7.1
> >
> > Isn't there a simpler way to prepare samza?
> > Maybe a bunch of relevant jars?
> >
> > Thanks in advance,
> > Amos.
> >
>
>
>
> --
> Milinda Pathirage
>
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
>
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org
>


Re: [SAMZA-690] Changelog topic creation should not be in the container code

2015-06-25 Thread Yi Pan
Hi, Robert,

Thanks for digging into this. I am embedding my answers below:


On Thu, Jun 25, 2015 at 7:40 AM, Robert Zuljevic 
wrote:

>  1.   Is the checkpoint topic referred to in the description the
> coordinator stream/topic?
>

In the master branch, the checkpoint topic is deprecated (except for
migration purposes). Checkpoints are sent to the coordinator stream. Hence,
there is no need to change the creation of the checkpoint topic any more.


>  2.   Changelog topic creation is handled by TaskStorageManager class
> which is required by SamzaContainer. Would it be preferable to:
>
> a.   Create TaskStorageManager instance(s) in JobRunner and pass them
> to SamzaContainer
>
> b.  Create changelog stream in JobRunner/JobCoordinator and skip it
> in TaskStorageManager
>
I prefer option b in your proposal above, w/ a slight modification:
1. JobCoordinator now has a changelogManager which reads the changelog
partition to task mapping from the coordinator stream.
2. JobCoordinator also has access to the job config and will be able to
figure out which changelog topics are needed in the job.
3. JobCoordinator start should have an additional step to create all the
changelog topics needed in the job. If a topic already exists, validate
its partition count.
4. In TaskStorageManager, we should just get the changelog topic metadata and
validate that the partitions are correct.
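
Step 3 amounts to "create if missing, otherwise validate". A minimal sketch
of that logic follows, in Python with an in-memory stand-in for the system
admin. InMemoryTopicAdmin, ensure_changelog_topics and the topic names are
all hypothetical and are not Samza classes or methods:

```python
class InMemoryTopicAdmin:
    """Toy replacement for a real topic admin client (illustration only)."""

    def __init__(self, existing=None):
        self.topics = dict(existing or {})  # topic name -> partition count

    def partition_count(self, topic):
        return self.topics.get(topic)       # None if the topic does not exist

    def create(self, topic, partitions):
        self.topics[topic] = partitions


def ensure_changelog_topics(admin, required):
    """Create missing changelog topics and validate existing ones.

    required maps topic name -> expected partition count. Returns the list
    of topics that were created; raises ValueError on a partition mismatch.
    """
    created = []
    for topic, partitions in sorted(required.items()):
        existing = admin.partition_count(topic)
        if existing is None:
            admin.create(topic, partitions)
            created.append(topic)
        elif existing != partitions:
            raise ValueError(
                f"changelog topic {topic} has {existing} partitions, "
                f"expected {partitions}"
            )
    return created


admin = InMemoryTopicAdmin({"store-a-changelog": 4})
created = ensure_changelog_topics(
    admin, {"store-a-changelog": 4, "store-b-changelog": 4}
)
print(created)
```

The container side (step 4) would then only run the validation branch and
never call create.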

Does that sound reasonable?

-Yi




[SAMZA-690] Changelog topic creation should not be in the container code

2015-06-25 Thread Robert Zuljevic
Hello all,

Description of SAMZA-690 (https://issues.apache.org/jira/browse/SAMZA-690) 
ticket says that creation of changelog and checkpoint topics should be moved 
from SamzaContainer class to JobRunner/JobCoordinator class. I investigated for 
a while, but seem to have hit a wall and require some help. I have a few 
questions regarding this task:


1.   Is the checkpoint topic referred to in the description the coordinator 
stream/topic?

2.   Changelog topic creation is handled by TaskStorageManager class which 
is required by SamzaContainer. Would it be preferable to:

a.   Create TaskStorageManager instance(s) in JobRunner and pass them to 
SamzaContainer

b.  Create changelog stream in JobRunner/JobCoordinator and skip it in 
TaskStorageManager



Met vriendelijke groet / Kind regards,
Robert Žuljević
Software Developer
[Title: Levi9 IT Services]

Address: Trifkovicev trg 6, 21000 Novi Sad, Serbia
Tel.: +31 20 6701 947 | +381 21 2155 500
Mobile: +381 64 428 28 46
Skype: robert.zuljevic
Internet: www.levi9.com

Chamber of commerce Levi9 Holding: 34221951
Chamber of commerce Levi9 IT Services BV: 34224746

This e-mail may contain confidential or privileged information. If you are not 
(one of) the intended recipient(s), please notify the sender immediately by 
reply e-mail and delete this message and any attachments permanently without 
retaining a copy. Any review, disclosure, copying, distribution or taking any 
action in reliance on the contents of this e-mail by persons or entities other 
than the intended recipient(s) is strictly prohibited and may be unlawful.
The services of Levi9 are exclusively subject to its general terms and 
conditions. These general terms and conditions can be found on 
www.levi9.com and a copy will be promptly submitted to 
you on your request and free of charge.



Re: Hopping and tumbling windows in streaming SQL

2015-06-25 Thread Milinda Pathirage
Hi Julian,

This is a great improvement over the previous hopping window. Thanks for
thinking through this. I would also like it if we could introduce a TUMBLE
function with more control over how we define tumbling window size. With the
current FLOOR based model we have to perform date/time arithmetic to have
tumbling windows such as 5 minute tumbling windows (maybe there is a better
way that I don't know). But a TUMBLE function that can specify parameters
such as window size would be nice. I am +1 for other extensions as well.

Thanks
Milinda

On Wed, Jun 24, 2015 at 6:32 PM, Julian Hyde  wrote:

> Hi all,
>
> Forgive the cross-post. This is for Calcite devs interested in
> streaming and Samza devs interested in SQL.
>
> I've been thinking some more about how to implement hopping and
> tumbling windows in streaming SQL. I was previously at a loss to find
> a concise syntax that is consistent with SQL semantics, but I have
> found a syntax that I think can please everyone.
>
> Recall that a hopping window emits a sub-total every X seconds of
> records that have arrived over the last Y seconds. A tumbling window
> is a hopping window where X and Y are equal.
>
> In https://calcite.incubator.apache.org/docs/stream.html#hopping-windows
> I give an example, "emit, every hour, the number of each product
> ordered over the past three hours".
>
> That example gives a query in terms of a GROUP BY (in the HourlyTotals
> view) followed by a moving sum. I didn't think that it was possible to
> express using just one GROUP BY, because that would violate one of the
> principles of SQL: that each record entering a GROUP BY contributes to
> precisely one output record.
>
> But I've just realized that the CUBE, ROLLUP and GROUPING SETS
> operators (already in SQL) violate that principle. And if they can do
> it, we can do the same. So we could add another grouping function,
> HOP(t, emit, retain).
>
> The query would look like this:
>
> SELECT STREAM START(rowtime) AS rowtime,
>   productId,
>   SUM(units) AS sumUnits,
>   COUNT(*) AS c
> FROM Orders
> GROUP BY HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '3' HOUR),
>   productId
>
> Much nicer than the one in stream.html!
>
> The "trick" is that the HOP function is returning a list of rowtime
> values. For example, for row 1 {rowtime: '09:33', ...} it will return
> ['09:00', '10:00', '11:00']; for row 2 {rowtime: '10:05', ...} it will
> return ['10:00', '11:00', '12:00']. The system adds each row to
> several sub-totals, and emits each sub-total when it is complete. The
> sub-total for '09:00' will contain only row 1, and will be emitted at
> '10:00'; the sub-total for '10:00' will contain row 1 and row 2, and
> will be emitted at '11:00', and so forth.
>
> Returning multiple values is related to the flatMap function in Spark
> (and earlier selectMany in LINQ) and makes HOP's semantics similar to
> GROUPING SETS and therefore sound.
>
> START is a new aggregate function that returns the lower bound of the
> current sub-total; END similarly.
>
> Note that the "retain" argument does not need to be a whole multiple
> of the "emit" argument. This was a major limitation in the previous
> proposal.
>
> There are some straightforward extensions:
> * Define a TUMBLE function;
> * Add an "align" argument to HOP, to allow windows to start at, say, 5
> minutes past each hour;
> * Apply HOP to windows based on row-counts;
> * Allow user-defined windowing functions that similarly return a list
> of interval start-end points.
>
> Julian
>
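
The behaviour described above for HOP (each row mapped to a list of sub-total labels, then added to every one of them) can be sketched roughly as follows. This is an illustration only, not Calcite's implementation. The function names are hypothetical, times are plain minutes since midnight, and the sketch assumes that a sub-total labelled L is emitted at L + emit and covers the interval [L + emit - retain, L + emit), which is one reading consistent with the 09:33 and 10:05 example.

```python
from collections import defaultdict


def hop(t, emit, retain):
    """Return the labels of every sub-total that a row at time t contributes to.

    Assumption: the sub-total labelled L is emitted at L + emit and covers
    the half-open interval [L + emit - retain, L + emit).
    """
    labels = []
    label = (t // emit) * emit  # label of the next sub-total to be emitted
    while label + emit - retain <= t:
        labels.append(label)
        label += emit
    return labels


def sub_totals(rows, emit, retain):
    """Add each (time, units) row to every sub-total that hop() assigns it to."""
    totals = defaultdict(int)
    for t, units in rows:
        for label in hop(t, emit, retain):
            totals[label] += units
    return dict(totals)


def fmt(minutes):
    """Format minutes since midnight as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


if __name__ == "__main__":
    rows = [(9 * 60 + 33, 1), (10 * 60 + 5, 1)]  # row 1 at 09:33, row 2 at 10:05
    for t, _ in rows:
        print(fmt(t), [fmt(label) for label in hop(t, 60, 180)])
    print({fmt(k): v for k, v in sub_totals(rows, 60, 180).items()})
```

With emit = 60 and retain = 180 this reproduces the two lists from the example. Because the loop condition compares against t directly, retain does not have to be a whole multiple of emit.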



-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org


Re: Installing Samza w/o internet connection

2015-06-25 Thread Milinda Pathirage
Hi Amos,

Can you please let us know what you mean by preparing Samza? Are you
trying to build a Samza job package, or Samza itself from source?

I don't think it's possible to build from source (either Samza or a Samza job
package) on a new machine without internet access. One thing you can try is to
build on a machine with internet access and copy that machine's Maven repository,
or just the relevant packages, to the other machine.
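
As a rough sketch of the copy step, the dependency cache can be packed into one archive on the connected machine and unpacked on the offline one. The helper names below are hypothetical. Which directory actually holds the needed artifacts is an assumption to verify: ~/.m2/repository is Maven's default local repository, while Gradle by default keeps downloaded artifacts under ~/.gradle.

```python
import shutil
import tempfile
from pathlib import Path


def pack_cache(cache_dir: Path, archive_base: Path) -> Path:
    """Archive cache_dir as <archive_base>.tar.gz and return the archive path."""
    return Path(shutil.make_archive(str(archive_base), "gztar",
                                    root_dir=cache_dir.parent,
                                    base_dir=cache_dir.name))


def unpack_cache(archive: Path, target_parent: Path) -> None:
    """Extract the archive so the cache directory reappears under target_parent."""
    target_parent.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(target_parent))


if __name__ == "__main__":
    # Demo on a throwaway directory standing in for a real dependency cache.
    with tempfile.TemporaryDirectory() as tmp:
        online = Path(tmp) / "online" / "repository"
        online.mkdir(parents=True)
        (online / "example.jar").write_text("placeholder")
        archive = pack_cache(online, Path(tmp) / "deps")
        unpack_cache(archive, Path(tmp) / "offline")
        print((Path(tmp) / "offline" / "repository" / "example.jar").exists())
```

On the offline machine the build would then presumably need to run with Gradle's --offline flag so that it does not try to reach the network.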

Thanks
Milinda

On Thu, Jun 25, 2015 at 6:45 AM, Amos Mosseri wrote:

> Hi,
>
> I'm trying to evaluate Samza, and I can't figure out how to install it on
> a computer without an internet connection.
>
> I've already downloaded the Samza source, but when I run the Gradle build
> command, I get errors about resolving:
>
> * de.obqo.gradle:gradle-lesscss-plugin:1.0-1.3.3
>
> * org.gradle.api.plugins:gradle-nexus-plugin:0.7.1
>
> Isn't there a simpler way to prepare Samza?
> Maybe a bunch of relevant jars?
>
> Thanks in advance,
> Amos.
>





Installing Samza w/o internet connection

2015-06-25 Thread Amos Mosseri
Hi,

I'm trying to evaluate Samza, and I can't figure out how to install it on a
computer without an internet connection.

I've already downloaded the Samza source, but when I run the Gradle build
command, I get errors about resolving:

* de.obqo.gradle:gradle-lesscss-plugin:1.0-1.3.3

* org.gradle.api.plugins:gradle-nexus-plugin:0.7.1

Isn't there a simpler way to prepare Samza?
Maybe a bunch of relevant jars?

Thanks in advance,
Amos.