Re: Onboarding new Database Connectors to storm-external

2023-06-25 Thread Stephen Powis via user
I'm happy to take up the Redis spout.  I think I have an outstanding PR
(might be 3 or 4 years old at this point though...) to improve the existing
one, and I also have this implementation I'd be willing to give to the
project after bringing it up to date:
https://github.com/SourceLabOrg/RedisStreams-StormSpout

Let me know what you think,
Stephen

On Sun, Jun 25, 2023 at 8:39 PM 6harat 
wrote:

> Hey,
>
> I have been accumulating a list of popular/trending DBs for which Storm
> connectors are still missing, and preparing a case for getting them
> prioritized and added to the repo based on user interest.
>
> For now, we have the below list of potential candidates (based solely on
> my own experience in the industry)
>
> A. Listed in Top 100 at https://db-engines.com/en/ranking
>
>1. ScyllaDB
>    2. Aerospike
>3. CockroachDB
>4. Ignite
>5. Hazelcast
>
> B. Popular Graph DBs:
>
>1. Neo4j
>2. ArangoDB
>3. OrientDB
>
> C. Trending Vector DBs:
>
>1. Milvus
>2. Weaviate
>
> D. Improvement JIRAs already filed against existing connectors:
>
>1. Mongo Spout: https://issues.apache.org/jira/browse/STORM-3336
>2. Redis Sentinel: https://issues.apache.org/jira/browse/STORM-3410
>3. Cassandra Spout: https://issues.apache.org/jira/browse/STORM-3298
>
>
> Would love to hear thoughts from the community on these. Feel free to
> add other missing databases to the above list which you feel are also worth
> considering, and/or volunteer to contribute to any of the aforementioned
> connectors.
>
> --
> Regards
> 6harat
>


Re: [DISCUSSION] Apache Storm and moving to the Attic

2023-02-01 Thread Stephen Powis via user
Ah yeah, I'm also not opposed to the move to the attic if it's determined to
be most appropriate, but I had a similar experience to Richard's: I
submitted some PRs, asked for comments/thoughts/review, and even offered to
take over ownership of one of the submodules, and got no responses.
Perhaps I just submitted in the wrong forum to get an answer, but it was
definitely a bummer, and I would love to see the project/community get
revived.

- Stephen

On Wed, Feb 1, 2023 at 10:09 PM Richard Zowalla  wrote:

> Hi Aaron,
>
> I am CCing the users@ list as it wasn't included in the initial
> proposal. Perhaps there are people or institutions in the wild who
> want to volunteer or give it a try (still at the ASF level).
>
> Of course this shouldn't prevent a VOTE on moving to the attic; it's just
> an additional attempt to get some more attention (if needed).
>
> Gruß
> Richard
>
>
> Am Dienstag, dem 31.01.2023 um 08:00 -0600 schrieb Aaron Niskode-
> Dossett:
> > Thank you to those who responded. Given the low number of responses,
> > I am going to start a vote on moving to the attic tomorrow.
> >
> > One other note: A vote to move to the attic would not mean the end of
> > Storm; the project can be forked by anyone who wants to continue the
> > work (just as it can be forked today). It would just be outside of the
> > ASF.
> >
> > On Thu, Jan 26, 2023 at 3:12 AM Richard Zowalla 
> > wrote:
> >
> > > Hi all,
> > >
> > > I have also experienced that dev@ activity is quite low. I provided
> > > some PRs and asked questions a few months ago and didn't get any
> > > feedback, although the PRs got merged in the end. I think we need
> > > some sort of community rebuilding / revival if we want to maintain
> > > Storm in a sustainable way (like it is done for TomEE or OpenNLP).
> > >
> > > Nevertheless, as I am also a committer on StormCrawler, which
> > > relies on Storm (obviously) _and_ is used in our research work, I am
> > > also happy to jump in as a volunteer if more people are needed.
> > >
> > > Gruß
> > > Richard
> > >
> > >
> > > Am Mittwoch, dem 25.01.2023 um 22:00 + schrieb Bipin Prasad:
> > > > I am volunteering to take on the role of PMC chair for Storm. Not
> > > > quite sure about the process.
> > > > Thanks, Bipin Prasad
> > > >
> > > >
> > > > Sent from Yahoo Mail for iPhone
> > > >
> > > >
> > > > On Wednesday, January 25, 2023, 4:50 PM, Aaron Niskode-Dossett <
> > > > doss...@gmail.com> wrote:
> > > >
> > > > Hello Storm developer community,
> > > >
> > > > In the past year or so this project has slowed down and the
> > > > Project Management Committee [PMC] has almost no active members.
> > > > The PMC chair resigned in 2022 with due notice and no one has since
> > > > volunteered to assume those duties.  I myself am an inactive PMC
> > > > member.
> > > >
> > > > This suggests to me that it's time to consider moving this
> > > > project to the Attic [1].  I would view this as the natural
> > > > culmination of a very successful Apache project and not as a mark
> > > > of failure.  There are many, many successful and influential
> > > > projects in the attic.
> > > >
> > > > The alternative to moving to the attic would be to reconstitute
> > > > the PMC with *new members* of the development community willing
> > > > to take on the responsibilities of the project and reporting to
> > > > the Apache Software Foundation board.
> > > >
> > > > So, please respond to this thread if you are interested in that
> > > > role.  Please also respond if you believe moving to the attic is
> > > > appropriate.  A vote on moving to the attic may be held on this
> > > > list in the future.
> > > >
> > > > Thank you for your time,
> > > > Aaron
> > > >
> > > > [1]
> > > > https://attic.apache.org/
> > > >
> > > >
> > > >
>
>


Re: How to Throttle in storm

2019-11-24 Thread Stephen Powis
For a JVM-based approach there are lots of examples online, but something like this:
https://www.baeldung.com/httpclient-connection-management
Where you set the maximum number of outbound connections to some limited
number.  If you have more bolts trying to execute requests against a pool
configured with fewer connections, your bolts would block until a
connection became available, essentially throttling?   Note: This breaks
down once you have bolts running on multiple JVMs, as each JVM would have
its own connection pool.
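
A minimal sketch of that idea with Apache HttpClient 4.x (the pool sizes
here are just example values):

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class ThrottledHttpClientFactory {
    // Cap outbound connections; bolt threads sharing this client will
    // block waiting for a free connection, throttling within one JVM.
    public static CloseableHttpClient build(int maxConnections) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(maxConnections);           // cap across all routes
        cm.setDefaultMaxPerRoute(maxConnections); // cap per target host
        return HttpClients.custom().setConnectionManager(cm).build();
    }
}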

For an external proxy service, which would work better when you have multiple
JVMs involved, you could use something like the apache proxy module
<https://stackoverflow.com/questions/14493574/configure-apache-web-server-to-use-a-proxy-server>.
I want to say it's possible to configure Apache to restrict the number of
outbound connections to a specific set of hosts.  There may be better
software to solve this; it's just what jumps to mind for me.

On Mon, Nov 25, 2019 at 4:41 PM Gowtham S  wrote:

> *Could you make use of an outbound HTTP connection pool? (FYI: Our bolts
> span multiple JVMs).*
>
> Is there any tool available for this? Can you please suggest one if so?
>
> With regards,
> Gowtham S
>
>
> On Mon, 25 Nov 2019 at 12:47, Stephen Powis  wrote:
>
>> Could you make use of an outbound http connection pool?  Either via an
>> external proxy service your bolts talk through or some JVM based http
>> connection pool (which might get tricky if your bolts span multiple JVMs).
>>
>> On Mon, Nov 25, 2019 at 3:48 PM Gowtham S 
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> We have certain bolts that invoke a few web services.  However, those
>>> endpoints have limited throughput. So we wanted to find out if there are any
>>> recommendations on how to throttle the calls so that they don't overload
>>> the downstream services.
>>>
>>> Please let me know if there are any hooks available in Storm, what
>>> patterns can be used, and what the best practices/pitfalls are for
>>> using them.
>>>
>>> Thanks and regards,
>>> Gowtham S, MCA
>>> PH: 9597000673
>>>
>>


Re: How to Throttle in storm

2019-11-24 Thread Stephen Powis
Could you make use of an outbound http connection pool?  Either via an
external proxy service your bolts talk through or some JVM based http
connection pool (which might get tricky if your bolts span multiple JVMs).

On Mon, Nov 25, 2019 at 3:48 PM Gowtham S  wrote:

> Hi,
>
>
> We have certain bolts that invoke a few web services.  However, those
> endpoints have limited throughput. So we wanted to find out if there are any
> recommendations on how to throttle the calls so that they don't overload
> the downstream services.
>
> Please let me know if there are any hooks available in Storm, what
> patterns can be used, and what the best practices/pitfalls are for
> using them.
>
> Thanks and regards,
> Gowtham S, MCA
> PH: 9597000673
>


Re: Serialisation Issues Storm 1.2.1

2019-06-11 Thread Stephen Powis
Seems like you've exceeded your JVM's max heap settings.  Maybe review and
increase the limit?

On Tue, Jun 11, 2019 at 5:02 PM Shaik Asifullah 
wrote:

> Hi,
> I am using Storm Trident 1.2.1 and I am facing serialisation issues while
> using ReducerAggregator in my topology. Below is the error I am
> getting, which causes my worker to restart.
>
> java.lang.OutOfMemoryError: Java heap space at
> java.util.Arrays.copyOf(Arrays.java:3236) at
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at
> org.apache.storm.serialization.SerializableSerializer.write(SerializableSerializer.java:38)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) at
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) at
> org.apache.storm.serialization.KryoValuesSerializer.serializeInto(KryoValuesSerializer.java:44)
> at
> org.apache.storm.serialization.KryoTupleSerializer.serialize(KryoTupleSerializer.java:44)
> at
> org.apache.storm.daemon.worker$mk_transfer_fn$transfer_fn__5200.invoke(worker.clj:203)
> at
> org.apache.storm.daemon.executor$start_batch_transfer__GT_worker_handler_BANG_$fn__4882.invoke(executor.clj:314)
> at
> org.apache.storm.disruptor$clojure_handler$reify__4475.onEvent(disruptor.clj:41)
> at
> org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:509)
> at
> org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:487)
> at
> org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:74)
> at
> org.apache.storm.disruptor$consume_loop_STAR_$fn__4492.invoke(disruptor.clj:84)
> at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:484) at
> clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:748)
>
> Thanks,
> Shaik Asifullah
>


Re: Does storm support incremental windowing operation?

2017-12-21 Thread Stephen Powis
Is there a reason to use a Guava cache for aggregation over just a plain old
Map?  Curious, as aggregation is a common use case for us and I never thought
to look to Guava for it.

Thanks!
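
For reference, a sketch of the pattern described below (class name and
eviction values are made up); the eviction settings are the main thing a
plain Map wouldn't give you:

import java.util.Map;
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class TickAggregator {
    // Buffer per-key counts between tick tuples; stale keys are evicted
    // automatically, bounding memory if some keys stop arriving.
    private final Cache<String, Long> buffer = CacheBuilder.newBuilder()
            .maximumSize(100_000)
            .expireAfterWrite(10, TimeUnit.MINUTES)
            .build();

    public void add(String key) {
        Long current = buffer.getIfPresent(key);
        buffer.put(key, current == null ? 1L : current + 1L);
    }

    public void flush() { // call from execute() when a tick tuple arrives
        for (Map.Entry<String, Long> e : buffer.asMap().entrySet()) {
            // emit the aggregate (e.getKey(), e.getValue()) downstream
        }
        buffer.invalidateAll();
    }
}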

On Fri, Dec 22, 2017 at 1:50 PM, Manish Sharma  wrote:

> We added a guava cache in the bolt's execute method to aggregate tuples and 
> wait for the tick signal.
>
> You can control the tick frequency with TOPOLOGY_TICK_TUPLE_FREQ_SECS in the 
> main topology.
>
>
> This is a Spark use case IMO.
>
>
> cheers, /Manish
>
>
>
>
> On Wed, Dec 20, 2017 at 9:09 PM, Jinhua Luo  wrote:
>
>> Hi All,
>>
>> The window may buffer many tuples before evaluation. Does Storm
>> support incremental aggregation over the window, like Flink does?
>>
>
>


Re: Storm 2 features?

2017-12-12 Thread Stephen Powis
I'm not a core contributor, so maybe there's a better source for this info, but
you can always see which tickets are set against the 2.0.0 release in JIRA:
https://issues.apache.org/jira/browse/STORM-2850?jql=project%20%3D%20STORM%20AND%20fixVersion%20%3D%202.0.0

On Wed, Dec 13, 2017 at 3:13 AM, Hannum, Daniel <
daniel_han...@premierinc.com> wrote:

> Is there somewhere that summarizes what Storm 2 will bring to the table?
> I’ve heard vague things like it will be X % faster, but in terms of new
> capabilities, I don’t know what it has. There are things I wish Storm could
> do, like dynamically scaling bolts under load, but I don’t know if Storm 2
> will include any of that.
>
>
>
> Thanks
>
> Dan
>


Re: JVM options

2017-12-12 Thread Stephen Powis
I believe for cluster mode you have these two configuration values:

topology.worker.childopts

and

worker.childopts
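
A quick sketch of the topology-level option (the flag values are just
examples):

import org.apache.storm.Config;

public class JvmOptsExample {
    public static void main(String[] args) {
        Config conf = new Config();
        // Per-topology worker JVM flags, applied on top of the cluster-wide
        // worker.childopts set in each supervisor's storm.yaml.
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g -Xms2g");
        // In local mode there's no separate worker JVM, so pass -Xmx to the
        // JVM running your main class / LocalCluster instead.
    }
}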

On Wed, Dec 13, 2017 at 8:47 AM, Ramin Farajollah (BLOOMBERG/ 731 LEX) <
rfarajol...@bloomberg.net> wrote:

> Hi,
>
> How can we pass along JVM command line options when we run Storm?
> For example -Xmx.
>
> I'd like to know how to do this for both local and cluster modes, please.
>
> Thanks
>


Re: Swing or JavaFX GUI in Storm Bolt

2017-12-11 Thread Stephen Powis
So you want a way to signal into a running topology to alter its behavior?
This may be what you're looking for if so:

https://github.com/ptgoetz/storm-signals



On Mon, Dec 11, 2017 at 5:38 PM, Stepan Urban 
wrote:

> Hi Bobby,
> thanks for the reply. I do not want to ask the user a question about each
> tuple. I want to change some thresholds, and turn on/off processing of some
> parts of the topology. For example, when debugging I want to compute many
> more values than during a normal run. Next, I can define some control
> spout/bolt which can send control tuples. E.g. in the GUI I need to have a
> calendar in which I can set during which days/hours it should process
> specific tuples. This control bolt can then send control tuples to all bolts
> processing tuples. Is this architecture wrong? Can I solve it in a different
> way?
>
> Stepan
>
> 2017-12-06 15:58 GMT+01:00 Bobby Evans :
>
>> Storm is not designed for this type of use case.  Storm is set up for
>> distributed processing on multiple nodes. A GUI running inside a bolt is
>> not something we really ever thought about.  If you need to interact with a
>> user, typically you will have some external state store, like a SQL DB or
>> Zookeeper.  The bolt and the GUI can interact with each other over that,
>> but it should not be blocking.  Asking a user a question about each tuple
>> that needs to be processed is not really going to work in Storm; not sure
>> if you are doing this or not.  What I have seen work is people who want
>> to trigger things in their topology, like failing out of colo A for storing
>> results for the next 5 hours.
>>
>> - Bobby
>>
>>
>> On Wed, Dec 6, 2017 at 6:31 AM Stepan Urban 
>> wrote:
>>
>>> Hello,
>>> is it possible to use java Swing or JavaFX GUI in bolt? I need user
>>> interaction with bolt. In local mode there is no problem but how to solve
>>> it eg. on a single computer in cluster mode?
>>>
>>> Thanks
>>> Stepan
>>>
>>
>


Re: Regarding storm & Kafka Configuration.

2017-11-20 Thread Stephen Powis
1. Parallelism - You can set a maximum of 3, one for each partition in your
topic.  Typically, this will net you the fastest way to get messages out of
Kafka and into your topology, but doing your own testing/benchmarks would
be best to know for sure.
2. How many workers - This probably depends on what kind of work your
topology is doing.  Is it IO bound? Memory Bound? CPU Bound?
3. Max pending - Are you using timeouts/tracking tuples through your
topology?  Typically you want this high enough that your bolts are not
starved for things to work on, but not so high that tuples are queued up
waiting to be processed and time out before they can be worked on.  The
biggest trick here is that your "total tuples in flight" is equal to (Number Of
Spout Instances * Your Configured Max Spout Pending).   For example, if you
set max pending to 1000, and have 3 spout instances, you can have ~3000
tuples in flight.
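
Putting points 1-3 together in code might look like this (a sketch; the
spout class is a placeholder and the numbers are starting points to tune for
your own workload):

import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

public class IotGatewayTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // (1) parallelism hint of 3: one spout executor per partition.
        // builder.setSpout("kafka-spout", new YourKafkaSpout(), 3);

        Config conf = new Config();
        conf.setNumWorkers(3);         // (2) depends on your workload
        conf.setMaxSpoutPending(1000); // (3) ~3000 tuples in flight with 3 spouts
    }
}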

On Tue, Nov 21, 2017 at 12:55 PM, Mahabaleshwar <
mahabaleshwa...@trinitymobility.com> wrote:

> Hi,
>
>
>
> I am using a 3-node Kafka cluster and I have created one topic called
> iot_gateway with 3 partitions & a replication factor of 3. My question is
> about the Storm Kafka spout configuration:
>
> 1.   How much of a parallelism hint should I give?
>
> 2.   How many workers should I give?
>
> 3.   How many max pending messages should I configure?
>
> 4.   How should I maintain the task & partition relation?
>
>
>
> I need your help friends.
>
>
>
> Thanks,
>
> Mahabaleshwar
>
>
>


Re: Apache Storm - help in investigating cause of failures in ~20% of total events

2017-10-24 Thread Stephen Powis
If the Storm UI doesn't show the fails being generated from specific bolts,
but instead only lists them as fails on the spout itself, you're looking
at timed-out tuples.  I'd try lowering your max spout pending and/or
increasing your timeout value.  I'm not entirely sure how those play in if
you aren't anchoring tuples, however.



On Wed, Oct 25, 2017 at 2:32 AM, Ambud Sharma 
wrote:

> Without anchoring, at-least-once semantics are not honored, i.e. if an event
> is lost the Kafka spout doesn't replay it.
>
> On Oct 1, 2017 6:12 AM, "Yovav Waichman" 
> wrote:
>
>> Hi,
>>
>>
>>
>> We have been using Apache Storm for a couple of years, and everything was
>> fine until now.
>>
>> For our spout we are using “storm-kafka-0.9.4.jar”.
>>
>>
>>
>> Lately, we started seeing that our “Failed” number of events has
>> increased dramatically, and currently almost 20% of our total events are
>> marked as Failed.
>>
>>
>>
>> We tried investigating our topology logs, but we came up empty-handed.
>> Checking our DB logs also didn't give us any clue as to heavy load on our
>> system.
>>
>> Moreover, our spout complete latency is 25.996 ms, which should rule out
>> any timeouts that might occur.
>>
>>
>>
>> Lowering our max pending value has produced a negative result.
>>
>> At some point, since we are not using anchoring, we thought about adding
>> anchoring, but we saw that the KafkaSpout handles failures by replaying
>> them, so we were not sure whether to add it or not.
>>
>>
>>
>> It would be helpful if you could direct us to where in the Storm logs we
>> can find the reason for these failures (an exception which is not caught,
>> maybe a timeout), since we are a bit blind at the moment.
>>
>>
>>
>> We would appreciate any help with that.
>>
>>
>>
>> Thanks in advance,
>>
>> Yovav
>>
>


Re: Seek in KafkaSpout

2017-09-28 Thread Stephen Powis
Interesting.   Our original use case was consuming from a
multi-tenant topic, dynamically bypassing specific tenants for a
period of time, and then being able to replay just those tenants' data at a
later point in time without having to redeploy topologies or manually
make changes to the consumer state/spout instances... but what we've
built is fairly flexible and we're guessing it could support lots of use
cases we haven't even considered yet.

Here's a snippet from our GitHub README; we're currently working
towards making it public, hopefully in the next couple of weeks.

> Example use case: Multi-tenant processing
>
> When consuming a multi-tenant commit log you may want to postpone
> processing for one or more tenants. Imagine that a subset of your tenants
> database infrastructure requires downtime for maintenance. Using the
> Kafka-Spout implementation you really only have two options to deal with
> this situation:
>
>1.
>
>You could stop your entire topology for all tenants while the
>maintenance is performed for the small subset of tenants.
>2.
>
>You could filter these tenants out from being processed by your
>topology, then after the maintenance is complete start a separate
>topology/Kafka-Spout instance that somehow knows where to start and stop
>consuming, and by way of filter bolts on the consuming topology re-process
>only the events for the tenants that were previously filtered.
>
> Unfortunately both of these solutions are complicated, error prone and
> downright painful. The alternative is to represent a use case like this
> with a collection of spouts behind a single spout, or what we call
> VirtualSpout instances behind a DynamicSpout that handles the management
> of starting and stopping those VirtualSpout instances.
>
> How does it work?
>
> The DynamicSpout is really a container of many VirtualSpout instances,
> which each handle processing messages from their defined Consumer and
> pass them into Apache Storm as a single stream.
>
> This spout implementation exposes two interfaces for controlling *WHEN*
> and *WHAT* messages from Kafka get skipped and marked for processing at a
> later point in time.
>
> The *Trigger Interface* allows you to hook into the spout so that you
> start and stop *WHEN* messages are delayed from processing, and *WHEN*
> the spout will resume processing messages that it previously delayed.
>
> The *Filter Interface* allows you to define *WHAT* messages the spout
> will mark for delayed processing.
>
> The spout implementation handles the rest for you! It tracks your filter
> criteria as well as offsets within Kafka topics to know where it started
> and stopped filtering. It then uses this metadata to replay only those
> messages which got filtered.
>


On Fri, Sep 29, 2017 at 9:01 AM, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <
mrathb...@bloomberg.net> wrote:

> Basically we are using Zookeeper to coordinate between a producer and
> consumer. When the consumer comes up, it needs a recap from the producer.
> The producer sends this recap to the consumer through Kafka in chunks.
> Ideally we wanted the consumer to be able to jump back to the start of the
> last recap in the queue if the producer is down and the last recap was
> recent. I think we have come up with some other ways around this that don't
> rely on "seek" functionality, but was just wondering if anyone else had
> done something similar already. It seems that the new implementation you
> mentioned would provide this functionality.
>
> From: user@storm.apache.org
> Subject: Re: Seek in KafkaSpout
>
> I'm curious about your use case here.  It seems odd to need to adjust
> it on the fly while a topology is running, or I've misunderstood you!
>
> If you store your consumer state in Zookeeper, you CAN adjust it between
> topology deploys by manually modifying the stored state, and I've done this
> to deal w/ maintenance or service issues to roll back to a specific point
> in time.  Unsure if you're able to do this when consumer state is stored
> within Kafka itself.
>
> As a side note, I've been toying with a Kafka spout implementation that
> allows dynamically consuming arbitrary ranges from topics that is to be
> open sourced here soon.
>
> Stephen
>
> On Fri, Sep 29, 2017 at 8:06 AM, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <
> mrathb...@bloomberg.net> wrote:
>
>> Looking through the documentation, it seems that KafkaSpout does not
>> expose any way to set the offset the spout reads from after the initial
>> poll. This functionality is supported in KafkaConsumer through the seek()
>> method. Am I correct that this isn't supported? Has anyone found a way to
>> mimic the behavior of seek() with KafkaSpout?
>>
>
>
>


Re: Seek in KafkaSpout

2017-09-28 Thread Stephen Powis
I'm curious about your use case here.  It seems odd to need to adjust
it on the fly while a topology is running, or I've misunderstood you!

If you store your consumer state in Zookeeper, you CAN adjust it between
topology deploys by manually modifying the stored state, and I've done this
to deal w/ maintenance or service issues to roll back to a specific point
in time.  Unsure if you're able to do this when consumer state is stored
within Kafka itself.

As a side note, I've been toying with a Kafka spout implementation that
allows dynamically consuming arbitrary ranges from topics that is to be
open sourced here soon.

Stephen
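
For reference, the seek() being discussed is on the raw KafkaConsumer, not
the spout; a bare-bones sketch (broker address, topic, and offset are made
up):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("recap-topic", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 12345L); // jump to a known offset, e.g. start of last recap
        }
    }
}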

On Fri, Sep 29, 2017 at 8:06 AM, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <
mrathb...@bloomberg.net> wrote:

> Looking through the documentation, it seems that KafkaSpout does not
> expose any way to set the offset the spout reads from after the initial
> poll. This functionality is supported in KafkaConsumer through the seek()
> method. Am I correct that this isn't supported? Has anyone found a way to
> mimic the behavior of seek() with KafkaSpout?
>


Difficulties getting log viewer to work in 1.1.1

2017-09-25 Thread Stephen Powis
Hey!  I've been struggling to get logviewer to work in Storm 1.1.1 and was
wondering if anyone could lend a hand with the required configuration and
help me understand where I've gone wrong.

First of all I have the logviewer daemon running on every supervisor host.

Second, I've modified the worker.xml A1 appender as follows:


> <RollingFile name="A1"
>     fileName="/static/path/log/storm/topologies/${sys:logfile.prefix}-${sys:worker.port}.log"
>     filePattern="/static/path/log/storm/topologies/archived/${sys:logfile.prefix}-${sys:worker.port}.%i.gz">
>     <PatternLayout>
>         <pattern>${pattern}</pattern>
>     </PatternLayout>
>     ...
> </RollingFile>

Deployed with each topology I have:
topology.worker.childopts  "-Dlogfile.prefix=topology-name-here"

It seems as though logviewer / Storm UI is unable to locate my log files.  I'm
guessing this is because of how I have the log appender set up?  Is it possible
to have logviewer locate my log files without reverting to the standard
logging path?  Am I missing some configuration option somewhere?

Using the standard log layout of:
"${sys:workers.artifacts}/${sys:storm.id}/${sys:worker.port}/${sys:
logfile.name}" is difficult for us because of how we aggregate our logs and
ensuring that old logs from previous deploys are cleaned up appropriately.

Thanks!
Stephen


Re: Bolt execution at fixed rate

2017-08-16 Thread Stephen Powis
TickTuples can probably do what you're looking for:
http://kitmenke.com/blog/2014/08/04/tick-tuples-within-storm/
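
A minimal sketch (Storm 1.x package names; the 60-second frequency is just
an example):

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class FixedRateBolt extends BaseBasicBolt {
    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple every 60 seconds.
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);
        return conf;
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId())) {
            // Runs roughly every 60s: do the scheduled work here.
        } else {
            // Normal tuple processing.
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}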

On Wed, Aug 16, 2017 at 1:29 PM, Ebrahim Khalil Abbasi <
ebrahim.khalil.abb...@gmail.com> wrote:

> Dear all,
>
> Is there a built-in configuration in Storm to schedule a bolt to execute
> at a fixed rate?  I am using java.util's TimerTask combined with a bolt, but
> I thought there may be a better solution.
>
> Thanks for any help
>
> Ebrahim
>


Re: Storm development : Using dependency injection

2017-02-03 Thread Stephen Powis
Also interested to know if anyone has come up with a clean pattern for this.
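
A sketch of the factory idea from the mail below (all names here are
hypothetical; Storm 1.x package names assumed):

import java.io.Serializable;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class ResolvingBolt extends BaseRichBolt {

    // Hypothetical dependency; mock this interface in unit tests.
    public interface IdResolver { String resolve(String raw); }

    // A serializable factory ships with the bolt; the real (often
    // non-serializable) dependency is only built in prepare() on the worker.
    public interface IdResolverFactory extends Serializable {
        IdResolver create(Map stormConf);
    }

    private final IdResolverFactory factory;
    private transient IdResolver resolver;
    private transient OutputCollector collector;

    public ResolvingBolt(IdResolverFactory factory) { this.factory = factory; }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.resolver = factory.create(conf); // injection happens here
    }

    @Override
    public void execute(Tuple tuple) {
        resolver.resolve(tuple.getString(0));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}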

On Thu, Feb 2, 2017 at 11:37 PM, Bastien DINE 
wrote:

> Hi everyone,
>
>
>
> I’m trying to develop my new topologies using a proper design pattern, to
> achieve:
>
> - Reusability of classes
>
> - Unit testing / at least functional testing
>   - Being able to mock database interactions through interfaces
>
> I worked a lot with PHP & Symfony, which is a great framework for achieving
> those goals using the dependency injection pattern.
>
> I want to apply it to Storm topology development, but here is my problem:
>
> How can I pass dependencies in the constructor (e.g. a Cassandra provider,
> an id resolver, or even an object hydrator)?  The bolts are initialized when
> the “prepare” method is called.
>
> If I’m using a DI framework (like Google Guice), how can I mix it with the
> Storm topology builder?
>
> One idea:
>
> I think I can pass factories to my constructor and instantiate objects in
> the prepare method of my bolt.
>
> But I’m not sure if it’s a good way to do it...
>
> Did anybody ever try this?
>
> Does anyone have some best practices for developing topologies? (regarding
> code engineering and organization)
>
>
>
> Thanks in advance,
>
> Regards
>
> Bastien
>


Re: Zero Worker Bug in Topology on 1.0.2

2016-11-21 Thread Stephen Powis
On deploy do you see any errors/warnings in the Nimbus, Supervisor, or
Worker logs?

On Mon, Nov 21, 2016 at 6:28 PM, Joaquin Menchaca 
wrote:

> I am not sure what is causing this, but it seems that after 70 days of
> running the cluster, it is no longer able to run a topology with any workers
> or tasks.
>
> I tried a sample topology, which used to work, but now it is broken:
>
> storm jar 
> /usr/lib/apache/storm/1.0.2/examples/storm-starter/storm-starter-topologies-1.0.2.jar
> org.apache.storm.starter.ExclamationTopology ExclamationTopology
>
>
> Topology_name         Status   Num_tasks   Num_workers   Uptime_secs
> -------------------------------------------------------------------
> ExclamationTopology   ACTIVE   0           0             9
>
> Any suggestions?
>
> --
>
> 是故勝兵先勝而後求戰,敗兵先戰而後求勝。
>


Re: How do you measure the stability of storm topology in production environment?

2016-10-27 Thread Stephen Powis
We use a statsd metric reporter into a graphite cluster, and have built out
extensive graphs shown in Grafana.  On top of that we use Seyren to do
alerting.  Right now we have alerts on the following:

- Spout lag greater than our defined SLAs
- Null reported spout lag - i.e. if the topology stops reporting metrics (or
just isn't deployed) for a period of time
- Failed tuple percentage, if this exceeds a threshold
- Throughput / number of executes - Our topologies should always be doing
something; they're never completely idle.  If we see throughput drop below a
threshold we'll be alerted.

Hope this helps!  Curious what others monitor/alert on.

Stephen

On Thu, Oct 27, 2016 at 2:49 AM, Chen Junfeng  wrote:

> What metrics will you use to measure it?
>
>
>
>
>
> Regard
>
> Junfeng Chen
>


Re: Storm 1.0.2 errors being reported coming from Nimbus instead of worker hosts

2016-09-10 Thread Stephen Powis
But the topology gets deployed across multiple nodes in the cluster,
and the web UI definitely shows various bolts/tasks running on multiple nodes.

On Fri, Sep 9, 2016 at 2:17 PM, Joaquin Menchaca <jmench...@gobalto.com>
wrote:

> It sounds like you are running the topology in local mode, rather than
> remote mode.
>
> On Fri, Sep 9, 2016 at 6:20 AM, Stephen Powis <spo...@salesforce.com>
> wrote:
>
>> Hey!
>>
>> We've recently been piloting Storm 1.0.2 in a pre-production environment
>> and have noticed something odd.  Whenever an exception is thrown by a
>> running topology, or an error is reported up to the web UI, it always shows
>> as originating from our Nimbus host, which doesn't even run supervisor/any
>> topologies on it.  This makes debugging difficult as it's hard to track down
>> which host the error actually originated from.
>>
>> Has anyone else seen this behavior?  I tried to look through JIRA but didn't
>> see any bugs reported related to this, though I could have overlooked it.
>>
>> Thanks!
>> Stephen
>>
>
>
>
> --
>
> 是故勝兵先勝而後求戰,敗兵先戰而後求勝。
>


Storm 1.0.2 errors being reported coming from Nimbus instead of worker hosts

2016-09-09 Thread Stephen Powis
Hey!

We've recently been piloting Storm 1.0.2 in a pre-production environment
and have noticed something odd.  Whenever an exception is thrown by a
running topology, or an error is reported up to the web UI, it always shows
as originating from our Nimbus host, which doesn't even run supervisor/any
topologies on it.  This makes debugging difficult as it's hard to track down
which host the error actually originated from.

Has anyone else seen this behavior?  I tried to look through JIRA but didn't
see any bugs reported related to this, though I could have overlooked it.

Thanks!
Stephen


Re: Example of worker level metric?

2016-08-05 Thread Stephen Powis
Basically in your bolts prepare() method you'd want to do something like
the following:

// Create new counter metric to count how many XYZ
counterMetric = new CountMetric();
topologyContext.registerMetric("my_count_metric", counterMetric, 60);

where 60 is how often the metric is published, in seconds.


And then later within your bolt when you want to increment the count
you'd just do:

counterMetric.incrBy(1);


Where 1 is however many you want to increment the metric by.
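
Put together in a bolt, that looks something like this (a sketch; Storm 1.x
package names assumed, and you'll also need a metrics consumer registered for
the numbers to go anywhere):

import java.util.Map;
import org.apache.storm.metric.api.CountMetric;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class CountingBolt extends BaseRichBolt {
    private transient CountMetric counterMetric;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        counterMetric = new CountMetric();
        // Publish to the registered metrics consumer(s) every 60 seconds.
        context.registerMetric("my_count_metric", counterMetric, 60);
    }

    @Override
    public void execute(Tuple tuple) {
        counterMetric.incrBy(1); // count one XYZ
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}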

On Fri, Aug 5, 2016 at 10:01 AM, Junguk Cho  wrote:

> Hi,
>
> Is there an example to show how to use worker level metric?
> There is a document (http://storm.apache.org/releases/2.0.0-SNAPSHOT/
> Metrics.html), but I am not clear how to use it.
>
> Thanks,
> Junguk
>


Re: Trident (Opaque Transactional Spout): multiple topologies sourcing from same kafka topic.

2016-07-26 Thread Stephen Powis
Do you have unique consumer Ids for each spout in each topology?
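
For example (a sketch, assuming the old storm-kafka module under Storm 1.x
package names; host and topic names are placeholders), give each topology
its own ids:

import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.kafka.trident.OpaqueTridentKafkaSpout;
import org.apache.storm.kafka.trident.TridentKafkaConfig;
import org.apache.storm.trident.TridentTopology;

public class TopologyA {
    public static TridentTopology build() {
        BrokerHosts hosts = new ZkHosts("zkhost:2181");
        // Third constructor arg is this spout's client id.
        TridentKafkaConfig config =
                new TridentKafkaConfig(hosts, "mytopic", "topology-a");
        OpaqueTridentKafkaSpout spout = new OpaqueTridentKafkaSpout(config);

        TridentTopology topology = new TridentTopology();
        // The stream's txState id (first arg) keys this spout's offset state
        // in Zookeeper; use a different id in topology B so both topologies
        // consume all of A, B and C independently.
        topology.newStream("topology-a-spout", spout);
        return topology;
    }
}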

On Tue, Jul 26, 2016 at 8:40 AM, Amber Kulkarni 
wrote:

> So I want multiple Storm topologies to read from the same Kafka topic.
>
> Suppose the topic contains: A B C (data)
>
> I want both topologies to get A, B, C.
>
> But currently messages are getting distributed (e.g.
> topology 1 is getting A and the other topology is getting B and C).
>
> How do I achieve this in Trident?
>
> I am using OpaqueTridentKafkaSpout.
>
> I think what needs to be done is that both topologies need to specify
> different consumer group ids.
>
> But I could not find how to specify them in Trident (TridentKafkaConfig).
>
> Any ideas?
>
> Regards,
> Amber Kulkarni
>


Re: Storm Shared Memory

2016-07-13 Thread Stephen Powis
You'd need to use some external data store (such as Redis) to maintain
state that exists across multiple JVMs.
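
A bare-bones sketch with the Jedis client (the host, key names, and the
choice of Jedis itself are just examples); each worker JVM talks to the same
Redis, so the "shared map" survives across workers:

import redis.clients.jedis.Jedis;

public class SharedState {
    public static void increment(String field) {
        // In a real bolt you'd open the connection (or a JedisPool) once in
        // prepare() instead of per call.
        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            jedis.hincrBy("tuple-state", field, 1L);
        }
    }

    public static String get(String field) {
        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            return jedis.hget("tuple-state", field);
        }
    }
}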

On Wed, Jul 13, 2016 at 4:22 AM, jimmy tekli  wrote:

> Hello ,
> I developed a java project based on the storm topology that process
> incoming tuples in a certain way.I tested it locally and it worked
> perfectly in the context of streaming tuples of course. My topology is
>  formed of one spout and three bolts.In my code I used Data Structures
> (such as HashMaps,linkedHashmap and TreeMap) to store some information
> concerning the tuple's processing and I declared them static in the
> topology builder class so they could be shared by all the bolts because
> they need to accessed and varied by them frequently. The problem is that
> when i deployed it over a remote cluster these static attributes weren't
> visible (shared ) by the different bolts (I read about it online and I
> concluded because of multiple JVMs). My question is if  there is a certain
> concept concerning "shared memory" in Apache Storm where we can store these
> DataStructures for them to be shared and accessed by all the bolts at any
> given time.
>
> thanks in advance for your help.
>


Re: Usage of G1 Garbage collector

2016-06-03 Thread Stephen Powis
Here are the flags we're using:

-Xms3G -Xmx3G -XX:MaxPermSize=100M  -Xloggc:gc-%ID%.log
-XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime -XX:+DisableExplicitGC
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=50M -XX:+UseCompressedOops
-XX:+AlwaysPreTouch -XX:+UseG1GC
-XX:MaxGCPauseMillis=20 -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=OOMDump-%ID%.log

Here's a link to a GC log analysis (takes several minutes to load):
http://gceasy.io/my-gc-report?p=L2hvbWUvcmFtL3VwbG9hZC9pbnN0YW5jZTEvc2hhcmVkLzIwMTYtNi0zL2djbG9nLnppcC0wLTU3LTE3

This is definitely outside the realm of my knowledge, so don't use this as
any kind of benchmark.  I'm curious if anyone can comment on any suggested
adjustments?  I feel like we have a ton of churn in our young gen, but I
really don't have the experience tuning Java's GC or any examples to compare
it with.

Thanks!


On Fri, Jun 3, 2016 at 3:41 PM, Otis Gospodnetić  wrote:

> Hi,
>
> +1 for G1 for large heaps where you are seeing big GC pauses.  Works well
> for us.
> See:
> https://sematext.com/blog/2013/06/24/g1-cms-java-garbage-collector/
>
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
> On Tue, May 31, 2016 at 5:21 AM, Spico Florin 
> wrote:
>
>> Hello!
>>  I would like to ask the community the following:
>> 1. Are you using the G1 garbage collector for your workers/supervisors
>>  in production?
>> 2. Have you observed any improvement from switching to this GC?
>> 3. What JVM options are you using that are a good fit for you?
>>
>> Thank you in advance.
>>  Regards,
>>  Florin
>>
>>
>


Re: Issue with Storm WebUI - Frame size (X) larger than max length (Y)

2016-05-05 Thread Stephen Powis
That did the trick!  Thanks! :D
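
For anyone else hitting this, here's the change that fixed it (the value is
just an example; it needs to exceed the frame size reported in the error,
1124200 bytes here):

# storm.yaml, followed by a cluster restart
nimbus.thrift.max_buffer_size: 4194304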

On Thu, May 5, 2016 at 2:16 PM, Stephen Powis <spo...@salesforce.com> wrote:

> Thanks!
>
> On Thu, May 5, 2016 at 2:06 PM, Aditya Rajan <aditya.ra...@whizdm.com>
> wrote:
>
>> Hey Stephen,
>>
>>  Increase nimbus.thrift.max_buffer_size in your storm.yaml file and
>> restart the cluster. It should work fine
>>
>> Thanks and Regards
>> Aditya Rajan
>> On May 5, 2016 11:08 PM, "Harsha" <st...@harsha.io> wrote:
>>
>>> HI Stephen,
>>>Can you try setting ui.header.buffer.bytes to higher
>>> value in storm.yaml.
>>> -Harsha
>>>
>>>
>>> On Thu, May 5, 2016, at 10:08 AM, Stephen Powis wrote:
>>>
>>> Hey!
>>>
>>> We've started getting this error frequently when trying to view our
>>> topology details via the webUI.  Does anyone have any idea how to solve
>>> this?  It seems like if we re-deploy the topology the error goes away (for
>>> a while), or if we just wait, eventually the error will go away (to
>>> reappear again later).
>>>
>>>
>>> Thanks!
>>>
>>> Stephen
>>>
>>> Internal Server Error
>>>
>>> org.apache.thrift7.transport.TTransportException: Frame size (1124200) 
>>> larger than max length (1048576)!
>>> at 
>>> org.apache.thrift7.transport.TFramedTransport.readFrame(TFramedTransport.java:137)
>>> at 
>>> org.apache.thrift7.transport.TFramedTransport.read(TFramedTransport.java:101)
>>> at org.apache.thrift7.transport.TTransport.readAll(TTransport.java:86)
>>> at 
>>> org.apache.thrift7.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>>> at 
>>> org.apache.thrift7.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>>> at 
>>> org.apache.thrift7.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>>> at org.apache.thrift7.TServiceClient.receiveBase(TServiceClient.java:69)
>>> at 
>>> backtype.storm.generated.Nimbus$Client.recv_getTopologyInfoWithOpts(Nimbus.java:625)
>>> at 
>>> backtype.storm.generated.Nimbus$Client.getTopologyInfoWithOpts(Nimbus.java:611)
>>> at backtype.storm.ui.core$topology_page.invoke(core.clj:710)
>>> at backtype.storm.ui.core$fn__9644.invoke(core.clj:976)
>>> at 
>>> org.apache.storm.shade.compojure.core$make_route$fn__1829.invoke(core.clj:93)
>>> at 
>>> org.apache.storm.shade.compojure.core$if_route$fn__1817.invoke(core.clj:39)
>>> at 
>>> org.apache.storm.shade.compojure.core$if_method$fn__1810.invoke(core.clj:24)
>>> at 
>>> org.apache.storm.shade.compojure.core$routing$fn__1835.invoke(core.clj:106)
>>> at clojure.core$some.invoke(core.clj:2515)
>>> at org.apache.storm.shade.compojure.core$routing.doInvoke(core.clj:106)
>>> at clojure.lang.RestFn.applyTo(RestFn.java:139)
>>> at clojure.core$apply.invoke(core.clj:626)
>>> at 
>>> org.apache.storm.shade.compojure.core$routes$fn__1839.invoke(core.clj:111)
>>> at 
>>> org.apache.storm.shade.ring.middleware.json$wrap_json_params$fn__8916.invoke(json.clj:56)
>>> at 
>>> org.apache.storm.shade.ring.middleware.multipart_params$wrap_multipart_params$fn__2889.invoke(multipart_params.clj:103)
>>> at 
>>> org.apache.storm.shade.ring.middleware.reload$wrap_reload$fn__3610.invoke(reload.clj:22)
>>> at backtype.storm.ui.core$catch_errors$fn__9709.invoke(core.clj:1061)
>>> at 
>>> org.apache.storm.shade.ring.middleware.keyword_params$wrap_keyword_params$fn__2822.invoke(keyword_params.clj:27)
>>> at 
>>> org.apache.storm.shade.ring.middleware.nested_params$wrap_nested_params$fn__2861.invoke(nested_params.clj:65)
>>> at 
>>> org.apache.storm.shade.ring.middleware.params$wrap_params$fn__2794.invoke(params.clj:55)
>>> at 
>>> org.apache.storm.shade.ring.middleware.multipart_params$wrap_multipart_params$fn__2889.invoke(multipart_params.clj:103)
>>> at 
>>> org.apache.storm.shade.ring.middleware.flash$wrap_flash$fn__3070.invoke(flash.clj:14)
>>> at 
>>> org.apache.storm.shade.ring.middleware.session$wrap_session$fn__3059.invoke(session.clj:43)
>>> at 
>>> org.apache.storm.shade.ring.middleware.cookies$wrap_cookies$fn__2990.invoke(cookies.clj:160)
>>> at 
>>> org.apache.storm.shade.ring.util.servlet$make_service_method$fn__2707.invoke(servlet

Re: Issue with Storm WebUI - Frame size (X) larger than max length (Y)

2016-05-05 Thread Stephen Powis
Thanks!

On Thu, May 5, 2016 at 2:06 PM, Aditya Rajan <aditya.ra...@whizdm.com>
wrote:

> Hey Stephen,
>
>  Increase nimbus.thrift.max_buffer_size in your storm.yaml file and
> restart the cluster. It should work fine
>
> Thanks and Regards
> Aditya Rajan
> On May 5, 2016 11:08 PM, "Harsha" <st...@harsha.io> wrote:
>
>> HI Stephen,
>>Can you try setting ui.header.buffer.bytes to higher value
>> in storm.yaml.
>> -Harsha
>>
>>
>> On Thu, May 5, 2016, at 10:08 AM, Stephen Powis wrote:
>>
>> Hey!
>>
>> We've started getting this error frequently when trying to view our
>> topology details via the webUI.  Does anyone have any idea how to solve
>> this?  It seems like if we re-deploy the topology the error goes away (for
>> a while), or if we just wait, eventually the error will go away (to
>> reappear again later).
>>
>>
>> Thanks!
>>
>> Stephen
>>
>> Internal Server Error
>>
>> org.apache.thrift7.transport.TTransportException: Frame size (1124200) 
>> larger than max length (1048576)!
>>  at 
>> org.apache.thrift7.transport.TFramedTransport.readFrame(TFramedTransport.java:137)
>>  at 
>> org.apache.thrift7.transport.TFramedTransport.read(TFramedTransport.java:101)
>>  at org.apache.thrift7.transport.TTransport.readAll(TTransport.java:86)
>>  at 
>> org.apache.thrift7.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>>  at 
>> org.apache.thrift7.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>>  at 
>> org.apache.thrift7.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>>  at org.apache.thrift7.TServiceClient.receiveBase(TServiceClient.java:69)
>>  at 
>> backtype.storm.generated.Nimbus$Client.recv_getTopologyInfoWithOpts(Nimbus.java:625)
>>  at 
>> backtype.storm.generated.Nimbus$Client.getTopologyInfoWithOpts(Nimbus.java:611)
>>  at backtype.storm.ui.core$topology_page.invoke(core.clj:710)
>>  at backtype.storm.ui.core$fn__9644.invoke(core.clj:976)
>>  at 
>> org.apache.storm.shade.compojure.core$make_route$fn__1829.invoke(core.clj:93)
>>  at 
>> org.apache.storm.shade.compojure.core$if_route$fn__1817.invoke(core.clj:39)
>>  at 
>> org.apache.storm.shade.compojure.core$if_method$fn__1810.invoke(core.clj:24)
>>  at 
>> org.apache.storm.shade.compojure.core$routing$fn__1835.invoke(core.clj:106)
>>  at clojure.core$some.invoke(core.clj:2515)
>>  at org.apache.storm.shade.compojure.core$routing.doInvoke(core.clj:106)
>>  at clojure.lang.RestFn.applyTo(RestFn.java:139)
>>  at clojure.core$apply.invoke(core.clj:626)
>>  at 
>> org.apache.storm.shade.compojure.core$routes$fn__1839.invoke(core.clj:111)
>>  at 
>> org.apache.storm.shade.ring.middleware.json$wrap_json_params$fn__8916.invoke(json.clj:56)
>>  at 
>> org.apache.storm.shade.ring.middleware.multipart_params$wrap_multipart_params$fn__2889.invoke(multipart_params.clj:103)
>>  at 
>> org.apache.storm.shade.ring.middleware.reload$wrap_reload$fn__3610.invoke(reload.clj:22)
>>  at backtype.storm.ui.core$catch_errors$fn__9709.invoke(core.clj:1061)
>>  at 
>> org.apache.storm.shade.ring.middleware.keyword_params$wrap_keyword_params$fn__2822.invoke(keyword_params.clj:27)
>>  at 
>> org.apache.storm.shade.ring.middleware.nested_params$wrap_nested_params$fn__2861.invoke(nested_params.clj:65)
>>  at 
>> org.apache.storm.shade.ring.middleware.params$wrap_params$fn__2794.invoke(params.clj:55)
>>  at 
>> org.apache.storm.shade.ring.middleware.multipart_params$wrap_multipart_params$fn__2889.invoke(multipart_params.clj:103)
>>  at 
>> org.apache.storm.shade.ring.middleware.flash$wrap_flash$fn__3070.invoke(flash.clj:14)
>>  at 
>> org.apache.storm.shade.ring.middleware.session$wrap_session$fn__3059.invoke(session.clj:43)
>>  at 
>> org.apache.storm.shade.ring.middleware.cookies$wrap_cookies$fn__2990.invoke(cookies.clj:160)
>>  at 
>> org.apache.storm.shade.ring.util.servlet$make_service_method$fn__2707.invoke(servlet.clj:127)
>>  at 
>> org.apache.storm.shade.ring.util.servlet$servlet$fn__2711.invoke(servlet.clj:136)
>>  at 
>> org.apache.storm.shade.ring.util.servlet.proxy$javax.servlet.http.HttpServlet$ff19274a.service(Unknown
>>  Source)
>>  at 
>> org.apache.storm.shade.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:654)
>>  at 
>> org

Re: [DISCUSS] Would like to make collective intelligence about Metrics on Storm

2016-05-02 Thread Stephen Powis
Oooh, I'd love this as well!  I really dig the ease of the metric framework
in Storm and having all the metrics go through one centralized config.  But
as the number of Storm hosts and number of tasks grows, I've found that
Graphite/Grafana has a hard time collecting up all the relevant metrics
across a lot of wildcarded keys for things like hostnames and taskIds to
properly display my graphs.

On Sun, May 1, 2016 at 8:17 AM, Kevin Conaway 
wrote:

> One thing I would like to see added (if not already present) is the
> ability to register metrics that are not tied to a component.
>
> As of now, the only non-component metrics are reported by the SystemBolt
> pseudo-component which feels like a work-around.  It reports JVM level
> metrics like GC time, heap size and other things that aren't associated
> with a given component.
>
> It would be great if application developers could expose similar metrics
> like this for things like connection pools and other JVM wide objects that
> aren't unique to a specific component.
>
> I don't think this is possible now, is it?
>
> On Wed, Apr 20, 2016 at 12:29 AM, Jungtaek Lim  wrote:
>
>> Let me start sharing my thought. :)
>>
>> 1. Need to enrich docs about metrics / stats.
>>
>> In fact, I couldn't learn from the docs that topology stats are sampled by
>> default with a sample rate of 0.05 when I was a newbie to Apache
>> Storm. It misled me and had me asking "Why is there a difference
>> between the counts?". I also saw some mails on user@ asking the same
>> question. If we include this in the guide doc, that would be better.
>>
>> And the Metrics documentation page
>>  seems not well
>> written. I think it has appropriate headings but lacks content under each
>> heading.
>> It should be addressed, and introducing some external metrics consumer
>> plugins (like storm-graphite 
>>  from Verisign) would be great, too.
>>
>> 2. Need to increase the sample rate or (ideally) do no sampling at all.
>>
>> Let's postpone considering the performance hit at this time.
>> Ideally, we expect the precision of metrics to get better when we increase
>> the sample rate. It affects non-gauge kinds of metrics, such as counters,
>> latencies, and so on.
>>
>> Btw, I would like to hear opinions on latency since I'm not an
>> expert.
>> Storm provides only average latency, and it's indeed based on the sample
>> rate. Do we feel OK with this? If not, how much would also having
>> percentiles help us?
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Wednesday, April 20, 2016 at 10:55 AM, Jungtaek Lim wrote:
>>
>>> Hi Storm users,
>>>
>>> I'm Jungtaek Lim, committer and PMC member of Apache Storm.
>>>
>>> If you subscribe to the dev@ mailing list, you may have seen that recently
>>> we're addressing the metrics feature in Apache Storm.
>>>
>>> For now, improvements are going forward based on current metrics feature.
>>>
>>> - Improve (Topology) MetricsConsumer
>>> 
>>> - Provide topology metrics in detail (metrics per each stream)
>>> 
>>> - (WIP) Introduce Cluster Metrics Consumer
>>>
>>> As I don't maintain a large cluster myself, I really want to collect
>>> any ideas for improvement, any inconveniences, and use cases of metrics
>>> with community members, so we're on the right way going forward.
>>>
>>> Let's talk!
>>>
>>> Thanks in advance,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>
>
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
>


Re: thread safe output collector

2016-04-29 Thread Stephen Powis
You're probably right: if it's an expensive operation to package your data
into a formatted tuple, it may make more sense for your spout to emit
something simple and have a downstream bolt package it up.

In the situation I was describing, our spout executes a SQL statement to
gather rows that should be emitted as tuples, so the "processing time" of
the spout is more about how fast or slow that query statement ends up
being, and less about converting rows to tuples -- we're actually querying
against somewhere around 100 different databases to find the data.  Doing
that in a single thread with the other spouts seemed not ideal, so that's
why we kicked it off to separate threads.
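
In code, the pattern looks roughly like this (a sketch with Storm 1.x package
names; the queue size and field name are arbitrary):

import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class QueueBackedSpout extends BaseRichSpout {
    private transient LinkedBlockingQueue<Values> queue;
    private transient SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.queue = new LinkedBlockingQueue<>(10_000);
        // The background thread does the slow work (querying databases) and
        // feeds results into the thread-safe queue.
        Thread poller = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                // run queries here, then: queue.offer(new Values(row));
            }
        });
        poller.setDaemon(true);
        poller.start();
    }

    @Override
    public void nextTuple() {
        // Never block here: emit if something is ready, else return at once.
        Values values = queue.poll();
        if (values != null) {
            collector.emit(values);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("row"));
    }
}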

On Fri, Apr 29, 2016 at 8:53 AM, Hart, James W. <jwh...@seic.com> wrote:

> I’m working on a topology that will be similar to this application, so I
> was thinking about this yesterday.
>
> I’m thinking that if there is any significant work to do on messages in
> making them into tuples, shouldn’t the message be emitted and the work be
> done in a bolt?  I don’t think that bolt execute functions have the same
> limitations as spout nextTuple functions.  Now with that said, bolt
> executes should not be long-running computations either, but they can be
> longer than the spout's nextTuple function.
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Thursday, April 28, 2016 11:59 AM
> *To:* user@storm.apache.org
> *Subject:* Re: thread safe output collector
>
>
>
> So the Spout documentation (assuming its correct...) here (
> http://storm.apache.org/releases/current/Concepts.html#spouts) mentions
> this:
>
>
> "The main method on spouts is nextTuple. nextTuple either emits a new
> tuple into the topology or simply returns if there are no new tuples to
> emit. *It is imperative that **nextTuple** does not block for any spout
> implementation, because Storm calls all the spout methods on the same
> thread.*"
>
> When developing a custom spout we interpreted it to mean that any "real
> work" done by a spout should be done in a separate thread, and decided on
> the following pattern which seems some what relevant to what you are trying
> to do in your bolts.
>
> On Spout prepare, we create a concurrent/thread safe queue.  We then
> create a new Thread passing it a reference to our thread safe queue.  This
> thread handles finding new data that needs to be emitted.  When that thread
> finds data, it adds it to the shared queue.  When the spout's nextTuple()
> method is called, it looks for data on the shared queue and emits it.
>
> I imagine doing async processing in a bolt using one or more threads could
> work with a similar pattern.  On prepare you setup your thread(s) with
> references to a shared queue.  The bolt passes work to be completed to the
> thread(s), the thread(s) communicate back to the bolt the result via a
> shared queue.  Add in the concept of tick tuples to ensure your bolt checks
> for completed work on a regular basis?
>
> Is there a better way to do this?
>
>
>
> On Thu, Apr 28, 2016 at 11:22 AM, Julien Nioche <
> lists.digitalpeb...@gmail.com> wrote:
>
> Thanks for the clarification
>
>
>
> On 28 April 2016 at 15:12, P. Taylor Goetz <ptgo...@gmail.com> wrote:
>
> The documentation is wrong. See:
>
>
>
> https://issues.apache.org/jira/browse/STORM-841
>
>
>
> At some point it looks like the change made there got reverted. I will
> reopen it to make sure the documentation is corrected.
>
>
>
> OutputCollector is NOT thread-safe.
>
>
>
> -Taylor
>
>
>
> On Apr 28, 2016, at 9:06 AM, Stephen Powis <spo...@salesforce.com> wrote:
>
>
>
> "Its perfectly fine to launch new threads in bolts that do processing
> asynchronously. OutputCollector
> <http://storm.apache.org/releases/current/javadocs/org/apache/storm/task/OutputCollector.html>
> is thread-safe and can be called at any time."
>
>
>
> From the docs for 0.9.6:
> http://storm.apache.org/releases/0.9.6/Concepts.html#bolts
>
>
>
> On Thu, Apr 28, 2016 at 9:03 AM, P. Taylor Goetz <ptgo...@gmail.com>
> wrote:
>
> IIRC there was discussion about making it thread safe, but I don't believe
> it was implemented.
>
>
>
> -Taylor
>
>
> On Apr 28, 2016, at 3:52 AM, Julien Nioche <lists.digitalpeb...@gmail.com>
> wrote:
>
> Hi Stephen
>
>
>
> I asked the same question in February but did not get a reply
>
>
>
>
> https://mail-archives.apache.org/mod_mbox/storm-user/201602.mbox/%3cca+-fm0urpf3fuerozywpzmxu-kdbgf-zj3wbyr8evsaqjc6...@mail.gmail.com%3E
>
>
>
> Anyone who could confirm this?
>
>
>
> T

Re: thread safe output collector

2016-04-28 Thread Stephen Powis
"Its perfectly fine to launch new threads in bolts that do processing
asynchronously. OutputCollector

is thread-safe and can be called at any time."


>From the docs for 0.9.6:
http://storm.apache.org/releases/0.9.6/Concepts.html#bolts

On Thu, Apr 28, 2016 at 9:03 AM, P. Taylor Goetz  wrote:

> IIRC there was discussion about making it thread safe, but I don't believe
> it was implemented.
>
> -Taylor
>
> On Apr 28, 2016, at 3:52 AM, Julien Nioche 
> wrote:
>
> Hi Stephen
>
> I asked the same question in February but did not get a reply
>
>
> https://mail-archives.apache.org/mod_mbox/storm-user/201602.mbox/%3cca+-fm0urpf3fuerozywpzmxu-kdbgf-zj3wbyr8evsaqjc6...@mail.gmail.com%3E
>
> Anyone who could confirm this?
>
> Thanks
>
> On 27 April 2016 at 14:05, Steven Lewis  wrote:
>
>> I have conflicting information and have not checked personally, but has
>> the output collector finally been made thread-safe for emitting in version
>> 1.0 or 0.10? I know it was a huge problem in 0.9.5 when trying to do
>> threading in a bolt for async future calls and emitting once they return.
>>
>>
>
>
>
> --
>
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble 
>
>


Re: Is ganglia like visualvm ?

2016-04-26 Thread Stephen Powis
There's no real "right" answer to this; it depends on your hardware,
utilization of your hardware, and all sorts of other parameters.

I've never used Ganglia myself, but my naive understanding is you have an
agent that runs on every node you want to monitor.  The agents then "report
back" metrics to a central server which provides the view into the metrics
of all your nodes (i.e. all the graphs and whatnot).

So if it was me, I'd install the Ganglia agent on every node I want to
monitor.  Where to install the central server?  You'll have to read up on
the Ganglia documentation and figure out where that fits best within the
servers you own.  You'll find the best info about Ganglia on their website
docs, or their mailing list.

On Mon, Apr 25, 2016 at 8:35 PM, sam mohel  wrote:

> Can you help please ?
>
>
> On Monday, April 25, 2016, sam mohel  wrote:
>
>> Thanks for helping, it helped me a lot.
>> But how can I set up Ganglia? I mean, I have two machines.
>> Should I install Ganglia on machine 1, which runs Nimbus, or on the
>> worker machine?
>>
>>
>> On Mon, Apr 25, 2016 at 12:29 AM, Manu Zhang 
>> wrote:
>>
>>> Hi Sam,
>>>
>>> Generally, Ganglia gives usage of CPU, Memory, Disk and Network of the
>>> whole cluster while visual VM shows CPU and Memory metrics per JVM.
>>>
>>> If the topology is the only running job on your cluster, Ganglia will
>>> let you know whether your topology has fully utilized the resources of the
>>> cluster and where the bottleneck is. For example, is the CPU abnormally
>>> high for very simple logics ? is the GbE network already full ?
>>>
>>> With visual VM, you might be able to pinpoint the problem. It shows the
>>> CPU condition of each thread at runtime, whether it's running, sleeping or
>>> blocking and the thread can be dumped. Plus, you can also do sampling or
>>> profiling for CPU / memory which can help you to find the hot spots in your
>>> code.
>>>
>>> Hope this helps.
>>>
>>> Regards,
>>> Manu
>>>
>>> On Sun, Apr 24, 2016 at 3:20 PM sam mohel  wrote:
>>>
 Can anybody help ?

 On Saturday, April 23, 2016, sam mohel  wrote:

 > I submitted a topology and want to measure the CPU performance, e.g. by
 > getting a graph.
 >
 > Can I use Ganglia? Or can VisualVM do it?
 >
 > I am using Storm 0.9.6 and it already has visualization for spouts and
 > bolts. How can I understand it well? I don't understand the numbers that
 > appear in the graph, and how can I use it in evaluating the topology.
 >
 > Thanks for any help
 >

>>>
>>


Re: Is ganglia like visualvm ?

2016-04-24 Thread Stephen Powis
I use the following metric reporting plugin to send metrics to graphite,
and use grafana to visualize them.

https://github.com/verisign/storm-graphite
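
In case it saves someone a trip to the README, registration happens in
storm.yaml; from memory it looks roughly like this (the class name and the
option keys are recalled from the project's docs, so double-check them there):

    topology.metrics.consumer.register:
      - class: "com.verisign.storm.metrics.GraphiteMetricsConsumer"
        parallelism.hint: 1
    metrics.graphite.host: "graphite.example.com"
    metrics.graphite.port: "2003"
    metrics.graphite.prefix: "storm.cluster"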

On Sun, Apr 24, 2016 at 3:20 AM, sam mohel  wrote:

> Can anybody help ?
>
> On Saturday, April 23, 2016, sam mohel  wrote:
>
>> I submitted a topology and want to measure CPU performance by
>> getting a graph
>>
>> Can I use ganglia ? Or can visual VM do it ?
>>
>> I am using storm 0.9.6 and it already has visualization for spouts and
>> bolts. How can I understand it well ? I don't understand the numbers that
>> appear in the graph, and how can I use them to evaluate the topology ?
>>
>> Thanks for any help
>>
>


Re: [ANNOUNCE] Apache Storm 1.0 Released

2016-04-12 Thread Stephen Powis
Yay!  Can't wait to give it a whirl!

On Tue, Apr 12, 2016 at 9:41 AM,  wrote:

> It looks beautiful!  Thanks a million!
>
>
> Craig Charleton
> craig.charle...@gmail.com
>
>
> > On Apr 12, 2016, at 9:32 AM, P. Taylor Goetz  wrote:
> >
> > The Apache Storm community is pleased to announce the release of Apache
> Storm version 1.0.0.
> >
> > Storm is a distributed, fault-tolerant, and high-performance realtime
> computation system that provides strong guarantees on the processing of
> data. You can read more about Storm on the project website:
> >
> > http://storm.apache.org
> >
> > Downloads of source and binary distributions are listed in our download
> > section:
> >
> > http://storm.apache.org/downloads.html
> >
> > You can read more about this release in the following blog post:
> >
> > https://storm.apache.org/2016/04/12/storm100-released.html
> >
> > Distribution artifacts are available in Maven Central at the following
> coordinates:
> >
> > groupId: org.apache.storm
> > artifactId: storm-core
> > version: 1.0.0
> >
> > The full list of changes is available here[1]. Please let us know [2] if
> you encounter any problems.
> >
> > Regards,
> >
> > The Apache Storm Team
> >
> > [1]: https://github.com/apache/storm/blob/v1.0.0/CHANGELOG.md
> > [2]: https://issues.apache.org/jira/browse/STORM
>


Re: New Concurrent modification exception's after storm 0.10.0

2016-03-03 Thread Stephen Powis
So after spending another hour staring at the code, I've realized that we
are indeed emitting instances of a List object that is re-used elsewhere in
the bolt (doh!).  I'm guessing something did change in the framework
between 0.9.x and 0.10.x, but it's definitely caught me doing something
stupid :)

Thanks!
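
For anyone who hits the same thing, here's a minimal sketch of the
anti-pattern and the fix (a hypothetical bolt, just to illustrate -- not our
actual code):

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class ReusedListBolt extends BaseRichBolt {
        private OutputCollector collector;
        private final List<Object> reused = new ArrayList<Object>(); // shared across calls

        @Override
        public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            // BUG: the emitted list may still be sitting in the outbound queue
            // waiting to be serialized when the next execute() mutates it.
            reused.clear();
            reused.add(input.getValue(0));
            collector.emit(input, reused);

            // FIX: emit a fresh list per tuple instead:
            // collector.emit(input, new Values(input.getValue(0)));

            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("value"));
        }
    }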

On Thu, Mar 3, 2016 at 1:01 PM, P. Taylor Goetz <ptgo...@gmail.com> wrote:

> Hi Stephen,
>
> Can you provide a stack trace that indicates where this is occurring?
>
> -Taylor
>
>
> > On Mar 2, 2016, at 1:49 PM, Stephen Powis <spo...@salesforce.com> wrote:
> >
> > Hey!
> >
> > Did anything change between storm 0.9.5 and 0.10.0 regarding
> ConcurrentModificationExceptions and how they are detected?  We've had a
> topology running for the last 6 months or so and never saw this exception.
> >
> > After upgrading to Storm 0.10.x which didn't require any changes to our
> topology/bolt/business logic, we're now seeing these intermittently and
> have been struggling to see where we've gone wrong -- We don't seem to be
> modifying values in the emitted tuples anywhere after emitting.
> >
> > Thanks!
> > Stephen
>
>


New Concurrent modification exception's after storm 0.10.0

2016-03-02 Thread Stephen Powis
Hey!

Did anything change between storm 0.9.5 and 0.10.0 regarding
ConcurrentModificationExceptions and how they are detected?  We've had a
topology running for the last 6 months or so and never saw this exception.

After upgrading to Storm 0.10.x which didn't require any changes to our
topology/bolt/business logic, we're now seeing these intermittently and
have been struggling to see where we've gone wrong -- We don't seem to be
modifying values in the emitted tuples anywhere after emitting.

Thanks!
Stephen


Re: Storm 0.10.x defining log file for topology.

2016-03-01 Thread Stephen Powis
So is there no topology-level conf option to alter the target filename?
It seems like the default log4j2 config looks at a system property to
determine it.  Looking at the source code where workers are launched, it
looks like that property is passed in on the command line here, but my
Clojure isn't great:

https://github.com/apache/storm/blob/master/storm-core/src/clj/org/apache/storm/daemon/supervisor.clj#L1203
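
For reference, the stock log4j2/worker.xml appears to consume that property
in its file appender; abbreviated from memory (so double-check against your
install), it looks roughly like:

    <RollingFile name="A1"
                 fileName="${sys:storm.log.dir}/${sys:logfile.name}"
                 filePattern="${sys:storm.log.dir}/${sys:logfile.name}.%i">
      <!-- layout and rollover policies omitted -->
    </RollingFile>

So the knob seems to be either that appender or the logfile.name property
itself -- with Justin's caveat that worker.xml affects every topology on
that supervisor.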



On Mon, Feb 29, 2016 at 11:55 PM, Justin Hopper <justin.hop...@dev9.com>
wrote:

> 0.10.0 uses log4j2. Look under log4j2 directory and your will find a
> worker.xml file or something to that effect. In there you can define the
> output. Keep in mind that this will affect all topologies in that
> supervisor.
>
> On Feb 29, 2016, at 13:52, Stephen Powis <spo...@salesforce.com> wrote:
>
> Hey!
>
> I recently upgraded to Storm 0.10.0 (and log4j as a result).  I've noticed
> that topologies no longer log into worker-.log and instead into a
> file named -worker-.log
>
> I was curious as to if there's a way to control what the name of this log
> file is.  I've tried adding a topology config item named "logfile.name"
> but no luck.  I must be missing something.  Is there a way to hardcode the
> topology Id, or set this logfile to a static value on a per topology basis?
>
> Thanks!
>
>


Storm 0.10.x defining log file for topology.

2016-02-29 Thread Stephen Powis
Hey!

I recently upgraded to Storm 0.10.0 (and log4j as a result).  I've noticed
that topologies no longer log into worker-.log and instead into a
file named -worker-.log

I was curious as to if there's a way to control what the name of this log
file is.  I've tried adding a topology config item named "logfile.name" but
no luck.  I must be missing something.  Is there a way to hardcode the
topology Id, or set this logfile to a static value on a per topology basis?

Thanks!


Re: version 1.0?

2016-02-17 Thread Stephen Powis
I think typically in a production environment you would use the releases
marked as "stable" rather than building from the latest code checked into
source control.  Storm 0.9.x for sure is used by tons of large enterprise
companies -- 0.10.x doesn't seem like a huge departure from 0.9.x as far as
writing topologies goes (in testing, our topologies didn't require changing
any code going from 0.9.x to 0.10.x - we plan on upgrading in the next few
weeks).

Developers can chime in, but from my personal experience with Storm I
wouldn't focus so much on the "1.0" label as on the software being
consistent and stable from release to release.

On Wed, Feb 17, 2016 at 2:23 PM, Maciek Próchniak  wrote:

> Hi,
>
> Is there any timeline for 1.0 release?
> We're evaluating Storm (together with Flink) for our client and it'd be
> great for us to have sliding window support.
> Guess we could use version built from sources for some time - but we still
> need some estimates on 1.0 availability.
>
> thanks,
> maciek
>
>


Storm deploy time lifecycle hook?

2016-02-11 Thread Stephen Powis
Hey!

I have a question about the storm topology lifecycle.  Are there any hooks
in the storm deployment/topology lifecycle?  Basically, it'd be great if
there was a way to run a block of code during deployment of a new topology
that got fired once on each JVM.

The issue we've seen so far is that when deploying a topology, sometimes we
need to set up some kind of shared resource, maybe a connection pool of some
type (JDBC, HTTP, etc.), and it's currently not trivial to do that via the
bolt prepare methods such that it only happens once per JVM.  It's possible,
but easy to get wrong :p

Does something already exist within the storm framework for this that we've
missed?  Is it a case of "you're doing it wrong"?

Thanks!
Stephen
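
P.S. For context, the once-per-JVM setup we're approximating today looks
roughly like this (a rough sketch; buildPool() is a hypothetical factory,
not real code from our topology):

    import javax.sql.DataSource;
    import java.util.Map;

    public final class SharedResources {
        private static volatile DataSource pool;

        private SharedResources() {}

        // Safe to call from every bolt's prepare(); only the first caller
        // in a given worker JVM pays the initialization cost.
        public static DataSource pool(Map stormConf) {
            if (pool == null) {
                synchronized (SharedResources.class) {
                    if (pool == null) {
                        pool = buildPool(stormConf);   // hypothetical factory
                    }
                }
            }
            return pool;
        }

        private static DataSource buildPool(Map stormConf) {
            // construct your JDBC/HTTP/etc. pool from the topology config here
            throw new UnsupportedOperationException("left to the reader");
        }
    }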


Re: How to redeploy the topology

2016-01-19 Thread Stephen Powis
You would need to kill the running topology first.  There is an argument
to the kill command that tells storm how long (in seconds) to wait before
killing the topology.  My understanding of how this works is that when you
issue the kill command, storm simply deactivates the spouts in the topology
so no NEW tuples will be ingested at the top of the topology.  Then, after
your supplied timeout seconds, storm will actually kill the topology.  The
theory being that all of the tuples already being processed within the
topology should finish their processing completely within that time window,
and by the time it gets killed off, the topology is just sitting idle.
Then you deploy your new version, the spouts start ingesting new tuples,
and away you go.
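
For example, assuming a topology named my-topology:

    storm kill my-topology -w 60

where -w is the wait time in seconds before the actual kill happens.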



On Tue, Jan 19, 2016 at 11:49 PM, Noppanit Charassinvichai <
noppani...@gmail.com> wrote:

> Right now I'm using storm jar command to deploy the topology to the storm
> cluster. I have setup Jenkins to deploy the code. However, if I want to
> redeploy again how can I deploy to not interrupt the current streaming
> because I would get the error saying the topology name already exists?
>
> Thanks
>


Re: topology.stats.sample.rate

2016-01-13 Thread Stephen Powis
http://stackoverflow.com/questions/26276096/performace-impact-of-using-setstatssamplerate-topology-stats-sample-rate
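
For reference, the rate can also be set per topology in code; a quick sketch:

    import backtype.storm.Config;

    Config conf = new Config();
    // topology.stats.sample.rate controls what fraction of tuples Storm
    // samples when computing the counts shown in the UI; 0.05 is the default.
    conf.setStatsSampleRate(1.0d); // sample every tuple (accurate, more overhead)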

On Wed, Jan 13, 2016 at 9:51 PM, researcher cs 
wrote:

> what is this properties used for topology.stats.sample.rate ?
>


Re: Storm pipeline design doubt

2016-01-11 Thread Stephen Powis
I think if you want to ensure that writes happen in the same order that
they come in, you'll need to make your pipeline sequential/serial in
nature (i.e. no parallelism), which may defeat the purpose of using storm.
You may be able to be a bit clever and use some other piece of information
as the primary key for the record in your DB such that no matter what order
rows get inserted, they're sorted correctly by that PK, but it's difficult
to say without knowing more about the data.

On Sat, Jan 9, 2016 at 2:24 PM, pradeep s 
wrote:

> Hi ,
> I have a requirement to read CDC messages which are landed in AWS SQS  and
> give the rwas message to S3 and do some processing and write to maria db.
> I have done a sample  spout which reads messages from SQS and two bolts
> which are for S3 and DB. I am using BaseRich Spout and bolt and doing
> reliable emit.
>
> I have a doubt on how the order of the message is guranteed when i write
> the records to database. I want to write in the same order in which the
> records are read from queue.
> Can someone give an example topology code where the order is guranteed
> while using storm processing.
> Regards
> Pradeep S
>


Storm 0.9.6 -> 0.10.0 upgrade docs?

2015-12-03 Thread Stephen Powis
Been poking around on the website and can't seem to find any information
about the upgrade process from 0.9.6 to 0.10.0.  Is this documented
anywhere the changes required in order to upgrade (both on the service
side, and on the development/topology side)?

Thanks
Stephen


Re: Using Storm to parse emails and creates batches

2015-12-01 Thread Stephen Powis
If you are using Storm's guaranteed message processing
<http://storm.apache.org/documentation/Guaranteeing-message-processing.html>
there is no need to 'persist' the collection anywhere other than in
memory, i.e. List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the
third bolt crashes and loses its in-memory collection, then after
Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
<http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS>
the tuples will time out and be replayed through your entire storm topology,
and your collection will be repopulated.

On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <
andreas.kalogeropou...@emc.com> wrote:

> Hello Stephen,
>
>
>
> I think you got it correctly. Thanks a lot for the idea.
>
> If you have seen limitations, please send the disclaimers :). For
> example, how did you handle persistence of this collection ? If the third
> bolt failed while populating the collection (size and time has not been
> reached) we just lost everything, so I need to have a status loopback of
> what was really output. Right ?
>
>
>
> Of course, if you can send me the code of your third bolt (especially the
> collection handling), I’ll be grateful.
>
> In all cases, thanks a lot for your help, even without the code, you
> really give me example advice, and now I can start building something.
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Monday, November 30, 2015 5:55 PM
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> From what I understand from your description, you want bolt 3 to collect
> results from multiple tuples and build a single xml for them.  We've done
> this by essentially doing the following:
>
>
>
> Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
> collection and check the size of the collection.  Once the size of the
> collection exceeds some number, we then process all of the tuples in one
> go, and then ACK all of them after the processing completes.
>
> Building on that, we've implemented an additional constraint on time.  If
> the collection size > N OR if we've waited more than X seconds, process the
> batch.  This way your output won't stall out if your topology has a lull in
> data being ingested.
>
> And then lastly, there's a corner case where say 10 tuples come in and get
> held by our collection but then no other tuples come in for a long period
> of time.  If no tuples enter, that means the size and timeout checks are
> never executed and your bolt will hold onto those tuples for a long time
> (potentially causing timeouts).  To handle this, we made use of tick
> tuples.  Tick tuples essentially allow you to send a special tuple to your
> bolt every Y seconds.  We use that to ensure the time constraint is checked
> on a regular basis (for example, send a tick tuple every 1, 5, or 10 seconds)
>
>
>
> On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <
> andreas.kalogeropou...@emc.com> wrote:
>
> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.   Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
> 2.   Add additional information (based on sender email)
>
> 3.   Create an XML based on this data, to inject in another solution
>
>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, if I have 100 mails per second of original input, I would
> want step 1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
>
>


Re: Using Storm to parse emails and creates batches

2015-12-01 Thread Stephen Powis
I'm not sure I follow/understand your question or what you're trying to do.

On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <
andreas.kalogeropou...@emc.com> wrote:

> You are right. Sorry for making you state the obvious :).
>
>
>
> Last question : If my spout has incoming information that I want to have
> in the same last bolt (the one creating the XML) for deduplication logic,
> what is the best way to achieve this ?  My instinct says to try to work
> with Fields grouping and the correct key (probably conversation since I am
> working with emails).
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 10:27 AM
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> If you are using Storm's guaranteed message processing
> <http://storm.apache.org/documentation/Guaranteeing-message-processing.html>
> there is no need to 'persist' the collection anywhere other than in
> memory, i.e. List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the
> third bolt crashes and loses its in memory collection, after
> Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
> <http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS>
> the tuples will timeout and be replayed thru your entire storm topology and
> your collection will be repopulated
>
>
>
> On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <
> andreas.kalogeropou...@emc.com> wrote:
>
> Hello Stephen,
>
>
>
> I think you got it correctly. Thanks a lot for the idea.
>
> If you have seen limitations, please send the disclaimers :). For
> example, how did you handle persistence of this collection ? If the third
> bolt failed while populating the collection (size and time has not been
> reached) we just lost everything, so I need to have a status loopback of
> what was really output. Right ?
>
>
>
> Of course, if you can send me the code of your third bolt (especially the
> collection handling), I’ll be grateful.
>
> In all cases, thanks a lot for your help, even without the code, you
> really give me example advice, and now I can start building something.
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Monday, November 30, 2015 5:55 PM
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> From what I understand from your description, you want bolt 3 to collect
> results from multiple tuples and build a single xml for them.  We've done
> this by essentially doing the following:
>
>
>
> Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
> collection and check the size of the collection.  Once the size of the
> collection exceeds some number, we then process all of the tuples in one
> go, and then ACK all of them after the processing completes.
>
> Building on that, we've implemented an additional constraint on time.  If
> the collection size > N OR if we've waited more than X seconds, process the
> batch.  This way your output won't stall out if your topology has a lull in
> data being ingested.
>
> And then lastly, there's a corner case where say 10 tuples come in and get
> held by our collection but then no other tuples come in for a long period
> of time.  If no tuples enter, that means the size and timeout checks are
> never executed and your bolt will hold onto those tuples for a long time
> (potentially causing timeouts).  To handle this, we made use of tick
> tuples.  Tick tuples essentially allow you to send a special tuple to your
> bolt every Y seconds.  We use that to ensure the time constraint is checked
> on a regular basis (for example, send a tick tuple every 1, 5, or 10 seconds)
>
>
>
> On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <
> andreas.kalogeropou...@emc.com> wrote:
>
> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.   Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
> 2.   Add additional information (based on sender email)
>
> 3.   Create an XML based on this data, to inject in another solution
>
>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, if I have 100 mails per second of original input, I would
> want step 1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
>
>
>
>


Re: Using Storm to parse emails and creates batches

2015-12-01 Thread Stephen Powis
So you want to eliminate duplicates or make sure that duplicates make it
into the same XML file (third bolt)?

On Tue, Dec 1, 2015 at 7:48 PM, Kalogeropoulos, Andreas <
andreas.kalogeropou...@emc.com> wrote:

> Hello Stephen,
>
>
>
> Imagine that the spout is providing me 300 000 emails per hour.
>
> The first bolt will parse/analyze the information (from, to, cc, subject,
> object, date, hash of attachments, …), and probably will find the same hash
> for some attachments (someone forwarding an email).
>
>
>
> The last bolt will create an XML based on all this information, but if I
> can have the tuples containing the same attachment (based on hash) in the
> same XML, I can actually apply a dedup logic : having multiple lines in my
> xml pointing to the same file
>
>
>
> Does this make more sense ?
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 11:36 AM
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> I'm not sure I follow/understand your question or what you're trying to do.
>
>
>
> On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <
> andreas.kalogeropou...@emc.com> wrote:
>
> You are right. Sorry for making you state the obvious :).
>
>
>
> Last question : If my spout has incoming information that I want to have
> in the same last bolt (the one creating the XML) for deduplication logic,
> what is the best way to achieve this ?  My instinct says to try to work
> with Fields grouping and the correct key (probably conversation since I am
> working with emails).
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 10:27 AM
>
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> If you are using Storm's guaranteed message processing
> <http://storm.apache.org/documentation/Guaranteeing-message-processing.html>
> there is no need to 'persist' the collection anywhere other than in
> memory, i.e. List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the
> third bolt crashes and loses its in memory collection, after
> Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
> <http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS>
> the tuples will timeout and be replayed thru your entire storm topology and
> your collection will be repopulated
>
>
>
> On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <
> andreas.kalogeropou...@emc.com> wrote:
>
> Hello Stephen,
>
>
>
> I think you got it correctly. Thanks a lot for the idea.
>
> If you have seen limitations, please send the disclaimers :). For
> example, how did you handle persistence of this collection ? If the third
> bolt failed while populating the collection (size and time has not been
> reached) we just lost everything, so I need to have a status loopback of
> what was really output. Right ?
>
>
>
> Of course, if you can send me the code of your third bolt (especially the
> collection handling), I’ll be grateful.
>
> In all cases, thanks a lot for your help, even without the code, you
> really give me example advice, and now I can start building something.
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Monday, November 30, 2015 5:55 PM
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> From what I understand from your description, you want bolt 3 to collect
> results from multiple tuples and build a single xml for them.  We've done
> this by essentially doing the following:
>
>
>
> Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
> collection and check the size of the collection.  Once the size of the
> collection exceeds some number, we then process all of the tuples in one
> go, and then ACK all of them after the processing completes.
>
> Building on that, we've implemented an additional constraint on time.  If
> the collection size > N OR if we've waited more than X seconds, process the
> batch.  This way your output won't stall out if your topology has a lull in
> data being ingested.
>
> And then lastly, there's a corner case where say 10 tuples come in and get
> held by our collection but then no other tuples come in for a long period
> of time.  If no tuples enter, that means the size and timeout checks are
> never 

Re: Using Storm to parse emails and creates batches

2015-12-01 Thread Stephen Powis
Yep, sounds like you got it... you'd want to use fields grouping and group
on a field that contains the hash.  Then every tuple with an identical hash
in that field would get sent to the same bolt instance.
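
A quick sketch of the wiring (the spout/bolt class names and the field name
are hypothetical):

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("mail-spout", new MailSpout());              // hypothetical
    builder.setBolt("parse-bolt", new ParseBolt(), 8)             // hypothetical
           .shuffleGrouping("mail-spout");
    // Every tuple with the same value in "attachmentHash" lands on the
    // same xml-bolt instance, so the dedup logic can live entirely there.
    builder.setBolt("xml-bolt", new XmlBolt(), 4)                 // hypothetical
           .fieldsGrouping("parse-bolt", new Fields("attachmentHash"));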

On Tue, Dec 1, 2015 at 8:23 PM, Kalogeropoulos, Andreas <
andreas.kalogeropou...@emc.com> wrote:

> Making sure that duplicates make it in the same XML file (third bolt).
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 11:59 AM
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> So you want to eliminate duplicates or make sure that duplicates make it
> into the same XML file (third bolt)?
>
>
>
> On Tue, Dec 1, 2015 at 7:48 PM, Kalogeropoulos, Andreas <
> andreas.kalogeropou...@emc.com> wrote:
>
> Hello Stephen,
>
>
>
> Imagine that the spout is providing me 300 000 emails per hour.
>
> The first bolt will parse/analyze the information (from, to, cc, subject,
> object, date, hash of attachments, …), and probably will find the same hash
> for some attachments (someone forwarding an email).
>
>
>
> The last bolt will create an XML based on all this information, but if I
> can have the tuples containing the same attachment (based on hash) in the
> same XML, I can actually apply a dedup logic : having multiple lines in my
> xml pointing to the same file
>
>
>
> Does this make more sense ?
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 11:36 AM
>
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> I'm not sure I follow/understand your question or what you're trying to do.
>
>
>
> On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <
> andreas.kalogeropou...@emc.com> wrote:
>
> You are right. Sorry for making you state the obvious :).
>
>
>
> Last question : If my spout has incoming information that I want to have
> in the same last bolt (the one creating the XML) for deduplication logic,
> what is the best way to achieve this ?  My instinct says to try to work
> with Fields grouping and the correct key (probably conversation since I am
> working with emails).
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 10:27 AM
>
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> If you are using Storm's guaranteed message processing
> <http://storm.apache.org/documentation/Guaranteeing-message-processing.html>
> there is no need to 'persist' the collection anywhere other than in
> memory, i.e. List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the
> third bolt crashes and loses its in memory collection, after
> Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
> <http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS>
> the tuples will timeout and be replayed thru your entire storm topology and
> your collection will be repopulated
>
>
>
> On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <
> andreas.kalogeropou...@emc.com> wrote:
>
> Hello Stephen,
>
>
>
> I think you got it correctly. Thanks a lot for the idea.
>
> If you have seen limitations, please send the disclaimers :). For
> example, how did you handle persistence of this collection ? If the third
> bolt failed while populating the collection (size and time has not been
> reached) we just lost everything, so I need to have a status loopback of
> what was really output. Right ?
>
>
>
> Of course, if you can send me the code of your third bolt (especially the
> collection handling), I’ll be grateful.
>
> In all cases, thanks a lot for your help, even without the code, you
> really give me example advice, and now I can start building something.
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Monday, November 30, 2015 5:55 PM
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> From what I understand from your description, you want bolt 3 to collect
> results from multiple tuples and build a single xml for them.  We've done
> this by essentially doing the following:
>
>
>
> Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
> coll

Re: Using Storm to parse emails and creates batches

2015-11-30 Thread Stephen Powis
From what I understand from your description, you want bolt 3 to collect
results from multiple tuples and build a single xml for them.  We've done
this by essentially doing the following:

Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
collection and check the size of the collection.  Once the size of the
collection exceeds some number, we then process all of the tuples in one
go, and then ACK all of them after the processing completes.

Building on that, we've implemented an additional constraint on time.  If
the collection size > N OR if we've waited more than X seconds, process the
batch.  This way your output won't stall out if your topology has a lull in
data being ingested.

And then lastly, there's a corner case where say 10 tuples come in and get
held by our collection but then no other tuples come in for a long period
of time.  If no tuples enter, that means the size and timeout checks are
never executed and your bolt will hold onto those tuples for a long time
(potentially causing timeouts).  To handle this, we made use of tick
tuples.  Tick tuples essentially allow you to send a special tuple to your
bolt every Y seconds.  We use that to ensure the time constraint is checked
on a regular basis (for example, send a tick tuple every 1, 5, or 10 seconds)
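
In case a sketch helps, this is roughly the shape of such a bolt (simplified;
the thresholds and buildXml() are placeholders, not our production code):

    import backtype.storm.Config;
    import backtype.storm.Constants;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BatchingBolt extends BaseRichBolt {
        private static final int MAX_BATCH_SIZE = 1000;  // placeholder threshold
        private static final long MAX_WAIT_MS = 30000L;  // placeholder threshold

        private OutputCollector collector;
        private final List<Tuple> pending = new ArrayList<Tuple>();
        private long oldestMs = 0L;

        @Override
        public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            if (isTick(tuple)) {
                // Tick tuples exist only to trigger the time check.
                if (!pending.isEmpty()
                        && System.currentTimeMillis() - oldestMs >= MAX_WAIT_MS) {
                    flush();
                }
                collector.ack(tuple);
                return;
            }
            if (pending.isEmpty()) {
                oldestMs = System.currentTimeMillis();
            }
            pending.add(tuple);
            if (pending.size() >= MAX_BATCH_SIZE
                    || System.currentTimeMillis() - oldestMs >= MAX_WAIT_MS) {
                flush();
            }
        }

        private void flush() {
            buildXml(pending);          // process the whole batch in one go
            for (Tuple t : pending) {
                collector.ack(t);       // ack only after the batch succeeded
            }
            pending.clear();
        }

        private boolean isTick(Tuple tuple) {
            return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                    && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
        }

        private void buildXml(List<Tuple> batch) { /* left to the reader */ }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            // Ask Storm to deliver a tick tuple to this bolt every 5 seconds.
            Map<String, Object> conf = new HashMap<String, Object>();
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
            return conf;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }
    }

One caveat: Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS needs to comfortably exceed
the max wait, or the tuples being held will start timing out.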

On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <
andreas.kalogeropou...@emc.com> wrote:

> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.   Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
> 2.   Add additional information (based on sender email)
>
> 3.   Create an XML based on this data, to inject in another solution
>
>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, I fi have 100 mails per second in original input, I would
> want to have step1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>


Re: Store previous calculated result

2015-11-09 Thread Stephen Powis
Hey Craig,

Just out of curiosity, how are you interacting with mysql?  Via
hibernate or something else?

Thanks!


On Mon, Nov 9, 2015 at 9:32 PM,   wrote:
> I have been working on a project that requires a lot of calculation and
> retention of values in the bolts and here are some questions/considerations
> that I think will help you:
>
> - You should read this if you already haven't.  I must admit I had to read
> through it many times before I got the concept:
> http://storm.apache.org/documentation/Trident-state.html
>
> - When a bolt goes down, Storm will recover it automatically.  Any of the in
> memory values that have been calculated will be lost unless you persist the
> state using Trident.
>
> - When persisting the state in Trident (saving it somewhere so Storm can
> reconstitute the values when restarting the Bolt) you have to decide how
> accurate the values calculated by the bolt need to be.  This point is not
> discussed in the information that I found on Storm/Trident.  Without writing
> thousands of words, my project required that the values calculated in a
> Trident Bolt never be incorrect (complex financial). So I had to make sure
> that when Storm obtained the Trident state to place into a Bolt for recovery
> from a persistent store, that the values it used must be ACID compliant.
> Therefore, I couldn't use  Cassandra or any other non-ACID compliant
> persistent storage because of the risk (however large or small) of the
> values stored in Cassandra not being completely accurate.  After a lot of
> analysis and lost-sleep, I decided to use MySQL to persist the in-process
> state of any Bolts.  There are some other persistence solutions that will
> scale better than MySQL.  However, MySQL is still in use in huge
> implementations and I estimated that I don't need a solution that can
> process a million events a second but rather one that will process thousands
> of events a second and make sure that, during start-up and recovery, the
> values it uses reflect all the changes to the data.  There are some other
> persistence solutions that are ACID-compliant and say they can process
> faster than MySQL.  MemSQL and VoltDB looked promising.  However, they are
> nowhere near as mature as MySQL and I have a lot of MySQL experience.
>
> I would include more links to articles and git repos but I have to take my
> child to school :-)
>
>
>
> Craig Charleton
> craig.charle...@gmail.com
>
>
> On Nov 7, 2015, at 6:27 AM, Miguel Ángel Fernández Fernández
>  wrote:
>
> In a trident scenario, a realtime operation needs to know the previous
> calculated result.
>
> My current solution is very poor and probably incorrect (a hashmap in
> bolts). Now I'm thinking to incorporate a cache (redis, memcached ...)
>
> However, I suppose that there is a standard solution for this problem in
> Trident (maybe a special state).
>
> What do you think is the best approach?
>
> Thanks for your time


Re: Store previous calculated result

2015-11-09 Thread Stephen Powis
Awesome, sounds like we've come to similar conclusions on several
points.  At our company we prototyped with a single topology and as we
iterate we've been breaking it up into multiple smaller topologies
communicating via kafka topics.  We're using hibernate with c3p0 for
talking to mysql and also aren't passing these entities between bolts
to avoid serialization/session issues that come up with that.

Thanks for the insight :)

On Mon, Nov 9, 2015 at 11:13 PM,  <craig.charle...@gmail.com> wrote:
> Stephen,
>
> I originally looked at using the storm-jdbc external component but very
> quickly realized that it is only available in Storm 0.10.x.  So, I looked
> at the source code for the storm-jdbc and it discussed using
> http://brettwooldridge.github.io/HikariCP/ as a high-performance connection
> pool for MySQL.  I have used JPA with Hibernate and EclipseLink before but I
> thought I would give HikariCP a try.  So, far it works really well.
> However, I haven't deployed it into production yet.
>
> JPA allows you to work with POJOs as entities and can be easier to code.
> However, I wanted to avoid any potential serialization issues that might
> arise in my system because I am already doing
> POJO->Avro->Kafka->Avro->Kryo->Storm.
>
> The way I am interacting with all of it together ( would share the code but
> it aint share-ready yet) is kind of in a pseudo-stateless manner.   I don't
> know it this will make sense but here-goes:
>
> I assume that when a calculation is performed in a bolt that it will not be
> able to persist its state.  So, I persist values at places where I would ack
> a tuple in Storm.  Ultimately, my tuples come from Kafka topics.  Therefore,
> if I don't ack a tuple, it will get replayed in the case of a failure.  In
> some places I have a bolt output its product to a Kafka topic as well as
> write it somewhere in MySQL.  This allows me to break up a big topology into
> smaller topologies that have different performance needs, calculation
> frequencies, and characteristics without losing the speed and scalability.
> (Think inbound data cleaning, filtering, transformation versus complex event
> processing)
>
> I am still working on elements of the whole solution.  However, it all adds
> up to Storm and Kafka are made for each other.  I am leveraging Kafka's
> speed, storage, scalability, etc to help Storm when something goes wrong.
> Storm is awesome but it was built for speed and scalability (which is
> super-awesome).  I just have to remind myself to use it for what I really
> need, which is to spread many processes across many commodity servers.
>
>
>
> Craig Charleton
> craig.charle...@gmail.com
>
>
> On Nov 9, 2015, at 8:36 AM, Stephen Powis <spo...@salesforce.com> wrote:
>
> Hey Craig,
>
> Just out of curiosity, how are you interacting with mysql?  Via
> hibernate or something else?
>
> Thanks!
>
>
> On Mon, Nov 9, 2015 at 9:32 PM,  <craig.charle...@gmail.com> wrote:
>
> I have been working on a project that requires a lot of calculation and
>
> retention of values in the bolts and here are some questions/considerations
>
> that I think will help you:
>
>
> - You should read this if you already haven't.  I must admit I had to read
>
> through it many times before I got the concept:
>
> http://storm.apache.org/documentation/Trident-state.html
>
>
> - When a bolt goes down, Storm will recover it automatically.  Any of the in
>
> memory values that have been calculated will be lost unless you persist the
>
> state using Trident.
>
>
> - When persisting the state in Trident (saving it somewhere so Storm can
>
> reconstitute the values when restarting the Bolt) you have to decide how
>
> accurate the values calculated by the bolt need to be.  This point is not
>
> discussed in the information that I found on Storm/Trident.  Without writing
>
> thousands of words, my project required that the values calculated in a
>
> Trident Bolt never be incorrect (complex financial). So I had to make sure
>
> that when Storm obtained the Trident state to place into a Bolt for recovery
>
> from a persistent store, that the values it used must be ACID compliant.
>
> Therefore, I couldn't use  Cassandra or any other non-ACID compliant
>
> persistent storage because of the risk (however large or small) of the
>
> values stored in Cassandra not being completely accurate.  After a lot of
>
> analysis and lost-sleep, I decided to use MySQL to persist the in-process
>
> state of any Bolts.  There are some other persistence solutions that will
>
> scale better than MySQL.  However, MySQL is still in use in huge
>
> implementations and I estimated that I don't nee

Re: How to test custom Kryo serializer locally?

2015-11-09 Thread Stephen Powis
Sorry I can't answer your question, but if you've written a serializer
that could potentially be used by others I encourage you to add it to
this repository: https://github.com/magro/kryo-serializers
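
One thing you can do entirely outside of Storm is round-trip the serializer
with Kryo directly in a unit test; a sketch (MyEvent / MyEventSerializer are
stand-ins for your own classes):

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;

    public class MyEventSerializerTest {
        public static void main(String[] args) {
            Kryo kryo = new Kryo();
            kryo.register(MyEvent.class, new MyEventSerializer()); // your classes

            MyEvent original = new MyEvent("hello", 42);

            Output out = new Output(4096, -1);
            kryo.writeObject(out, original);
            out.close();

            Input in = new Input(out.toBytes());
            MyEvent copy = kryo.readObject(in, MyEvent.class);
            in.close();

            if (!original.equals(copy)) {
                throw new AssertionError("round-trip mismatch: " + copy);
            }
            System.out.println("round-trip OK");
        }
    }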

On Mon, Nov 9, 2015 at 1:07 PM, Sean Bollin  wrote:
> Hi, we've written a custom Kryo serializer.  As far as I can tell,
> Storm doesn't use serialization if you're running it in local mode.
>
> Yet, we need to be able to test our custom Kryo serializer, preferably 
> locally.
>
> Is it possible to test the serializer locally to ensure that it actually 
> works?


Question about acker behavior and tuple timeouts

2015-11-07 Thread Stephen Powis
So this question may be better suited for the dev mailing list, but
I'll give this a shot.

I'm trying to better understand the acker behavior and how it deals
with tuple time outs.  I've read the documentation on the storm
website, but from my rudimentary knowledge of clojure, it seems like
the docs and the code don't completely align.

The short of it is I'm wondering if the line linked in acker.clj
resets the tuple timeout value every time a bolt acks a tuple:
https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/acker.clj#L65

It seems like for every tuple coming into the acker, it calls
rotatingMap.put() on the instance, which in turn resets its timeout
countdown.

Does this sound correct?
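
For anyone else following along, here's how I read RotatingMap (a simplified
conceptual sketch of mine, not the actual implementation):

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.Map;

    public class RotatingBuckets<K, V> {
        // Keys live in one of N buckets, newest first.  rotate() is driven
        // by a timer; anything still in the oldest bucket when it rotates
        // out is considered expired (timed out).
        private final LinkedList<Map<K, V>> buckets = new LinkedList<Map<K, V>>();

        public RotatingBuckets(int numBuckets) {
            for (int i = 0; i < numBuckets; i++) {
                buckets.addFirst(new HashMap<K, V>());
            }
        }

        // put() always files the key under the NEWEST bucket and removes it
        // from older ones -- so re-putting a key on every ack effectively
        // restarts its expiry countdown.
        public void put(K key, V value) {
            Iterator<Map<K, V>> it = buckets.iterator();
            it.next().put(key, value);      // newest bucket
            while (it.hasNext()) {
                it.next().remove(key);      // forget older copies
            }
        }

        // Called periodically; returns everything that expired.
        public Map<K, V> rotate() {
            Map<K, V> expired = buckets.removeLast();
            buckets.addFirst(new HashMap<K, V>());
            return expired;
        }
    }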


Re: Storm UI - topology visualization info

2015-10-29 Thread Stephen Powis
That page doesn't really say anything about the visualization bit.  I'm
curious too as to the meaning of the % number.  I always assumed it was the
percentage of tuples acked vs tuples emitted, but really I have no idea.

On Thu, Oct 29, 2015 at 5:18 PM, Artem Ervits  wrote:

>
> http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_storm-user-guide/content/deploy-and-manage-apache-storm-topologies.html
> On Oct 27, 2015 12:42 PM, "Adam Meyerowitz (BLOOMBERG/ 731 LEX)" <
> ameyerow...@bloomberg.net> wrote:
>
>> Hi, can someone help me understand the information that's presented in
>> the topology visualization annotations in the Storm UI? For each connection
>> it shows a stream name, then a number then a %. What does the number and %
>> represent?
>>
>> Thanks!
>> Adam
>>
>


Re: Obtain the information of storm's builtin metrics

2015-10-28 Thread Stephen Powis
You could write your own custom metrics consumer perhaps:
https://github.com/apache/storm/blob/master/storm-core/src/jvm/backtype/storm/metric/LoggingMetricsConsumer.java
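
For what it's worth, a consumer can also be registered per topology in code;
a quick sketch using the stock logging consumer:

    import backtype.storm.Config;
    import backtype.storm.metric.LoggingMetricsConsumer;

    Config conf = new Config();
    // Every metrics bucket the workers publish is also handed to this
    // consumer; swap in your own IMetricsConsumer implementation to push
    // the numbers somewhere your program can read them.
    conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);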

On Wed, Oct 28, 2015 at 4:11 PM, #ZHANG SHUHAO# 
wrote:

> Hi,
>
>
>
> Thanks for your help, I will consider this solution, but it would be better
> to retrieve that information directly; for example, I want one of my bolts
> to periodically read that information.
>
> I will try to wrap it into my bolt; this could be a solution but it's not
> so "native". Anyway, I will try it first; thanks a lot for your help
> again.
>
>
>
> *From:* Santosh Pingale [mailto:pingalesant...@gmail.com]
> *Sent:* Wednesday, October 28, 2015 4:00 PM
> *To:* user@storm.apache.org
> *Subject:* Re: Obtain the information of storm's builtin metrics
>
>
>
> What about a REST API which you can consume directly from most of the
> languages.
>
> https://github.com/apache/storm/blob/master/STORM-UI-REST-API.md
>
>
>
>
>
>
>
> On Wed, Oct 28, 2015 at 12:08 PM, #ZHANG SHUHAO# 
> wrote:
>
> Any helps?
>
> Basically, I want my program to read that information instead of looking at
> the storm UI all the time.
>
>
>
> *From:* #ZHANG SHUHAO#
> *Sent:* Sunday, October 25, 2015 12:44 PM
> *To:* 'user@storm.apache.org' 
> *Subject:* Obtain the information of storm's builtin metrics
>
>
>
> Hi,
>
>
>
> I'm wondering if there are any methods to obtain the information (that
> appears in Storm UI) in my program (instead of it only being exposed to the
> Storm UI).
>
> I have checked a few related blogs, but they are mainly discussing how to
> use the metrics API to build our own metric updates in a spout or bolt and
> expose them to, say, a log.
>
> Instead of building my own, I just want to get the information that is
> exposed (internally) to the Storm UI, so I would like to see a simple trick
> to do that. Any hints? Help?
>
>
>
> Thanks.
>
>
>
> Tony.
>
>
>


Re: Spring AMQP Integration with Storm

2015-09-29 Thread Stephen Powis
I'd recommend making use of a spout that others have already built and
battle tested.  A quick google search shows several including this one:
https://github.com/ppat/storm-rabbitmq

Unless you have a very special use case, I'm not sure re-inventing the
wheel is worth the time and effort.

Stephen
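
For reference, most of these spouts follow the same internal pattern -- a
listener thread pushes into an in-memory queue and nextTuple() drains it.
A minimal hand-rolled sketch (names hypothetical), in case it clarifies how
a push-style consumer feeds a pull-style spout:

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    import java.util.Map;
    import java.util.concurrent.LinkedBlockingQueue;

    public class QueueBufferSpout extends BaseRichSpout {
        // The listener thread offers messages here; nextTuple() drains it.
        private transient LinkedBlockingQueue<String> buffer;
        private transient SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
            this.buffer = new LinkedBlockingQueue<String>(10000);
            // Start your consumer here (e.g. wire up your @RabbitListener bean
            // or a plain AMQP consumer) and have it call buffer.offer(body).
        }

        @Override
        public void nextTuple() {
            String msg = buffer.poll();   // must not block
            if (msg != null) {
                collector.emit(new Values(msg), msg); // 2nd arg = message id for acking
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("body"));
        }
    }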

On Tue, Sep 29, 2015 at 1:39 AM, Ankur Garg  wrote:

> Hi,
>
> I want to consume the messages in my Storm Spout from a rabbitMq Queue.
>
> Now , we are using Spring AMQP to send and receive messages from RabbitMq
> asynchronously.
>
> Spring AMQP provides mechanism(either creating a listener or using
> annotation @RabbitListner) to read message from the queue aysnchronously.
>
> The problem is I can have a Listener to read the message from the Queue.
> But how do I send this message to my Storm Spout which is running on storm
> cluster ?
>
> The topology will start a cluster, but in my nextTuple() method of my
> spout , I need to read message from this Queue. Can Spring AMQP be used
> here ?
>
> I have a listener configured to read message from the queue:
>
> @RabbitListener(queues = "queueName")
> public void processMessage(QueueMessage message) {
>
> }
>
> How can the above message received at the listener be sent to my spout
> running on a cluster .
>
> Alternatively , how can a spout's nextTuple() method have this method
> inside it ? Is it possible
>
> Is there any integration there between Spring AMQP and Storm?
>
>
> Thanks
>
> Ankur
>
>
>
>


Re: Starting and stopping storm

2015-09-29 Thread Stephen Powis
I would imagine the safest way would be to elect to deactivate each running
topology, which should make your spouts stop emitting tuples.  You'd wait
for all of the currently processing tuples to finish processing, and then
kill the topology.

If tuples get processed quickly in your topologies, you can effectively do
this by selecting kill and giving it a long enough wait time.  IE --
Telling storm to kill your topology after 30 seconds means it will
deactivate your spouts for 30 seconds, waiting for existing tuples to
finish getting processed, and then kill off the topology.

Then bring down each node, upgrade it, bring it back online and resubmit
your topologies.

On Tue, Sep 29, 2015 at 10:02 AM, Garcia-Contractor, Joseph (CORP) <
joseph.garcia-contrac...@adp.com> wrote:

> I don't think I got my question across right or I am confused.
>
> Let me break this down in a more simple fashion.
>
> I have a Storm Cluster named "The Quiet Storm" ;) here is what it consists
> of:
>
> **
> Server ZK1: Running Zookeeper
> Server ZK2: Running Zookeeper
> Server ZK3: Running Zookeeper
>
> Server N1: SupervisorD running Storm Nimbus
>
> Server S1: SupervisorD running Storm Supervisor with 4 workers.
> Server S2: SupervisorD running Storm Supervisor with 4 workers.
> Server S3: SupervisorD running Storm Supervisor with 4 workers.
> **
>
> Now the "The Quiet Storm" can have 1-n number of topologies running on it.
>
> I need to shut down all the servers in the cluster for maintenance.  What
> is the procedure to do this without doing harm to the currently running
> topologies?
>
> Thank you,
>
> Joe
>
> -Original Message-
> From: Matthias J. Sax [mailto:mj...@apache.org]
> Sent: Monday, September 28, 2015 12:15 PM
> To: user@storm.apache.org
> Subject: Re: Starting and stopping storm
>
> Hi,
>
> as always: it depends. ;)
>
> Storm itself clear ups its own resources just fine. However, if the
> running topology needs to clean-up/release resources before it is shut
> down, Storm is not of any help. Even if there is a Spout/Bolt cleanup()
> method, Storm does not guarantee that it will be called.
>
> Thus, using "storm deactivate" is a good way to achieve proper cleanup.
> However, the topology must provide some code for it, too. On the call to
> Spout.deactivate(), it must emit a special "clean-up" message (that you
> have to design by yourself) that must propagate through the whole topology,
> ie, each bolt must forward this message to all its output streams.
> Furthermore, bolts must do the clean-up if they receive this message.
>
> Long story short: "storm deactivate" before "storm kill" makes only sense
> if the topology requires proper cleanup and if the topology itself can
> react/cleanup properly on Spout.deactivate().
>
> Using "storm activate" in not necessary in any case.
>
> -Matthias
>
>
> On 09/28/2015 05:08 PM, Garcia-Contractor, Joseph (CORP) wrote:
> > Hi all,
> >
> >
> >
> >I am a DevOps guy and I need to implement a storm cluster
> > with the proper start and stop init scripts on a Linux server.  I
> > already went through the documentation and it seems simple enough.  I
> > am using supervisor as my process manager.  I am however having a
> > debate with one of the developers using Storm on the proper way to
> > shutdown Storm and I am hoping that you fine folks can help us out in
> this regard.
> >
> >
> >
> >The developer believes that before you tell supervisor
> > to kill (SIGTERM) the storm workers, supervisor, and nimbus, you must
> > first issue a "storm deactivate topology-name", then tell supervisor
> > to kill all the various processes.  He believes this because he
> > doesn't know if Storm will do an orderly shutdown on SIGTERM and that
> > there is a chance that something will get screwed up.  This also means
> > that when you start storm, after nimbus is up, you need to issue a
> > ""storm activate topology-name".
> >
> >
> >
> >I am of the belief that because of storm's fast fail and
> > because it guarantees data processing, none of that is necessary and
> > that you can just tell supervisor to stop the process.
> >
> >
> >
> >So who is right here?
> >
> > --
> > -- This message and any attachments are intended only for the use of
> > the addressee and may contain information that is privileged and
> > confidential. If the reader of the message is not the intended
> > recipient or an authorized representative of the intended recipient,
> > you are hereby notified that any dissemination of this communication
> > is strictly prohibited. If you have received this communication in
> > error, notify the sender immediately by return email and delete the
> > message and any attachments from your system.
>
> --
> This message and any attachments are intended only for the use of the
> 

Re: Starting and stopping storm

2015-09-29 Thread Stephen Powis
I have no idea what happens if you bring down all of the nodes in the
cluster while the topologies are deactivated.  I'd suggest testing it and
seeing, or maybe someone else can speak up?

Also depending on the version of storm you're upgrading from, there may be
different steps involved that may complicate things.

See release notes around upgrading from 0.8.x to 0.9.0:
https://storm.apache.org/2013/12/08/storm090-released.html#api-compatibility-and-upgrading
for just an example.

Additionally, depending on whether the storm client API changes
significantly between versions, it may require recompiling existing
topology code against the new API version before it can run properly on the
new storm cluster version.  Taking a wild guess... this will probably only
be a problem when upgrading major versions, and less of a concern for minor
version upgrades, but again I don't really know that for sure.


On Tue, Sep 29, 2015 at 1:36 PM, Garcia-Contractor, Joseph (CORP) <
joseph.garcia-contrac...@adp.com> wrote:

> Stephen,
>
>
>
> Thank you for the response!  Helps out a lot.
>
>
>
> So a further question.  And forgive my lack of knowledge here, I am not
> the one using Storm, only deploying and running it, so I don’t understand
> all the reasoning behind why something is done a certain way in Storm.
>
>
>
> Let’s say I have deactivated all the topologies.  Is it necessary to then
> kill the topology?  Could I not just wait a set amount of time to ensure
> the tuples have cleared, say 5 minutes, and then bring down the nodes?
>
>
>
> The reason I ask this is because it is a lot easier to activate the
> topologies after the nodes are back up with a non-interactive script.  I
> would like to avoid using “storm jar” to load the topology because that
> means I need to hard code stuff into my scripts or come up with a separate
> conf file for my script.  See my current code below:
>
>
>
> function deactivate_topos {
>   STORM_TOPO_STATUS=$(storm list | sed -n -e '/^---/,$p' | sed -e '/^---/d' | awk '{print $1 ":" $2}')
>
>   for i in $STORM_TOPO_STATUS
>   do
>     IFS=':' read TOPO_NAME TOPO_STATUS <<< "$i"
>     echo "$TOPO_NAME $TOPO_STATUS"
>     if [ $TOPO_STATUS = 'ACTIVE' ]; then
>       storm deactivate ${TOPO_NAME}
>     fi
>     storm list | sed -n -e '/^---/,$p'
>   done
> }
>
>
> function activate_topos {
>   STORM_TOPO_STATUS=$(storm list | sed -n -e '/^---/,$p' | sed -e '/^---/d' | awk '{print $1 ":" $2}')
>
>   for i in $STORM_TOPO_STATUS
>   do
>     IFS=':' read TOPO_NAME TOPO_STATUS <<< "$i"
>     echo "$TOPO_NAME $TOPO_STATUS"
>     if [ $TOPO_STATUS = 'INACTIVE' ]; then
>       storm activate ${TOPO_NAME}
>     fi
>     storm list | sed -n -e '/^---/,$p'
>   done
> }
>
>
>
> *From:* Stephen Powis [mailto:spo...@salesforce.com]
> *Sent:* Tuesday, September 29, 2015 12:45 PM
>
> *To:* user@storm.apache.org
> *Subject:* Re: Starting and stopping storm
>
>
>
> I would imagine the safest way would be to elect to deactivate each
> running topology, which should make your spouts stop emitting tuples.
> You'd wait for all of the currently processing tuples to finish processing,
> and then kill the topology.
>
> If tuples get processed quickly in your topologies, you can effectively do
> this by selecting kill and giving it a long enough wait time.  IE --
> Telling storm to kill your topology after 30 seconds means it will
> deactivate your spouts for 30 seconds, waiting for existing tuples to
> finish getting processed, and then kill off the topology.
>
> Then bring down each node, upgrade it, bring it back online and resubmit
> your topologies.
>
>
>
> On Tue, Sep 29, 2015 at 10:02 AM, Garcia-Contractor, Joseph (CORP) <
> joseph.garcia-contrac...@adp.com> wrote:
>
> I don't think I got my question across right or I am confused.
>
> Let me break this down in a more simple fashion.
>
> I have a Storm Cluster named "The Quiet Storm" ;) here is what it consists
> of:
>
> **
> Server ZK1: Running Zookeeper
> Server ZK2: R

Re: Netty Reconnect issues on 0.9.3, 0.9.4, 0.9.5

2015-09-13 Thread Stephen Powis
Kashyap -  I see this same issue on 0.9.5

On Sun, Sep 13, 2015 at 9:58 AM, Enno Shioji  wrote:

> There was a change in that area in 0.9.6 (
> https://issues.apache.org/jira/browse/STORM-763), although I'm not sure
> if it will help your issue.
>
>
> On Sun, Sep 13, 2015 at 2:35 PM, Kashyap Mhaisekar 
> wrote:
>
>> Hmm. Thanks for the lead. In the storm UI, the uptime for each executor
>> except the spout shows pretty much consistent values. The spout has crashed
>> for sure, but then it never comes up. Will check this again.
>>
>> But the other question is - Is the Netty reconnects issue solved in
>> 0.9.5? What is your storm version?
>>
>> Thanks
>> Kashyap
>> On Sep 13, 2015 08:04, "Martin Burian"  wrote:
>>
>>> They do restart after a while, yes. But if you don't see any error in
>>> the log, it's weird. I encountered a case of workers not starting because I
>>> configured the worker JVM to expose JMX interface for remote monitoring on
>>> a given port. Other workers on the same machine however could not start as
>>> they failed to bind to the already used port. No error messages whatsoever.
>>> Might any such thing be your case?
>>>
>>> Othervise the cause should be logged somewhere. A worker is definitely
>>> not running, or at least talking to the supervisor. You could try using
>>> less workers to find out when/where the error occurs.
>>>
>>> Martin
>>>
>>> ne 13. 9. 2015 v 13:43 odesílatel Kashyap Mhaisekar 
>>> napsal:
>>>
 All worker logs have the same log. Workers are up. I am using only one
 box with multiple workers to test.
 Workers should be restarted if they fail, right? So ideally, this error
 should be gone in a while...

 Thanks


 Kashyap
 On Sep 13, 2015 05:10, "Martin Burian" 
 wrote:

> When this appears in the worker log, it means that the worker is trying to
> connect to another worker, but the other is not running. What do you see in
> worker-6707.log? Is the other worker running?
> Martin
>
> ne 13. 9. 2015 v 6:06 odesílatel Kashyap Mhaisekar <
> kashya...@gmail.com> napsal:
>
>> Also,
>> Is there a way to switch back to 0mq from Netty? If so, what needs to
>> be done?
>>
>> Thanks
>> kashyap
>>
>> On Sat, Sep 12, 2015 at 10:49 PM, Kashyap Mhaisekar <
>> kashya...@gmail.com> wrote:
>>
>>> Am having Netty-related issues in my storm cluster where the spout
>>> stops consuming after a while. The corresponding worker logs show -
>>> 2015-09-12T23:28:23.391-0400 b.s.m.n.Client [ERROR] connection attempt 26 to Netty-Client-trsttel2pascapp01.vm.itg.corp.us.shldcorp.com/10.2.70.18:6707 failed: java.lang.RuntimeException: Returned channel was actually not established
>>> 2015-09-12T23:28:23.391-0400 b.s.m.n.Client [INFO] connection attempt 27 to Netty-Client-serverstorm1.myorg.com/10.2.70.18:6707 scheduled to run in 392 ms
>>> 2015-09-12T23:28:23.784-0400 b.s.m.n.Client [ERROR] connection attempt 27 to Netty-Client-serverstorm1.myorg.com/10.2.70.18:6707 failed: java.lang.RuntimeException: Returned channel was actually not established
>>>
>>> The corresponding supervisor logs had:
>>> 2015-09-12T23:28:23.018-0400 b.s.d.supervisor [INFO]
>>> 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>> 2015-09-12T23:28:23.518-0400 b.s.d.supervisor [INFO]
>>> 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>> 2015-09-12T23:28:24.019-0400 b.s.d.supervisor [INFO]
>>> 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>>
>>> I had Storm 0.9.3 when this issue occurred and upgraded to 0.9.4 and
>>> then 0.9.5 to seek relief, but the issue still persists. I'm not sure
>>> what else to do. I'm not even sure why this issue occurs or what
>>> triggers it. Any help would be greatly appreciated.
>>>
>>> Thanks
>>> Kashyap
>>>
>>>
>>
>
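The retry/backoff behavior behind those "connection attempt N ... scheduled
to run in X ms" lines is governed by the storm.messaging.netty.* settings,
and the transport itself is pluggable via storm.messaging.transport, which
is also the knob Kashyap asks about for 0mq (though the zeromq transport
additionally needs the native libraries, and its exact class name varies by
version, so treat that route with care). A hedged sketch of the relevant
0.9.x keys, with purely illustrative values; these are normally set
cluster-wide in storm.yaml rather than per topology:

    import backtype.storm.Config;

    Config conf = new Config();
    // Netty is the 0.9.x default transport:
    conf.put(Config.STORM_MESSAGING_TRANSPORT,
            "backtype.storm.messaging.netty.Context");
    // Reconnect attempts and backoff window (illustrative values):
    conf.put(Config.STORM_MESSAGING_NETTY_MAX_RETRIES, 30);
    conf.put(Config.STORM_MESSAGING_NETTY_MIN_SLEEP_MS, 100);  // first retry delay
    conf.put(Config.STORM_MESSAGING_NETTY_MAX_SLEEP_MS, 1000); // backoff ceiling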


Re: Netty Reconnect issues on 0.9.3, 0.9.4, 0.9.5

2015-09-13 Thread Stephen Powis
We ran 0.9.2 previously and didn't see the issue.  We also saw that the
spout would seem to stop, and the entire topology stall out for 5-10 mins
at a time.

On Sun, Sep 13, 2015 at 2:57 PM, Kashyap Mhaisekar <kashya...@gmail.com>
wrote:

> Thanks Steve, Enno, Martin. The only thing the workers had in common was
> the GC logging that I configured; I don't find anything else. After I made
> the changes there, what I also see is that the spout stops consuming and
> there are no worker crashes either. It just stops and nothing happens.
>
> I think it has to do with the number of messages being sent into the
> system. If I keep the message volume low (by adjusting max spout pending),
> then the topology is up for 90 mins and counting. Otherwise, the system
> crashed in 15 mins. What I was expecting was that the topology crashes and
> then restarts, but that is exactly what was not happening.
>
> I tried it on 0.10.0-beta1 too and found the same behavior. The last prod
> version I had was 0.9.0-wip16, which used 0mq. I did not find issues there
> though.
>
> Thanks
> Kashyap
>
> On Sep 13, 2015 15:39, "Stephen Powis" <spo...@salesforce.com> wrote:
>
>> Kashyap -  I see this same issue on 0.9.5
>>
>> On Sun, Sep 13, 2015 at 9:58 AM, Enno Shioji <eshi...@gmail.com> wrote:
>>
>>> There was a change in that area in 0.9.6 (
>>> https://issues.apache.org/jira/browse/STORM-763), although I'm not sure
>>> if it will help your issue.
>>>
>>>
>>> On Sun, Sep 13, 2015 at 2:35 PM, Kashyap Mhaisekar <kashya...@gmail.com>
>>> wrote:
>>>
>>>> Hmm. Thanks for the lead. On storm UI, the uptime for each executor
>>>> except spout shows pretty much consistent values. Spout has crashed for
>>>> sure. But then never comes up. Will check this up again.
>>>>
>>>> But the other question is - Is the Netty reconnects issue solved in
>>>> 0.9.5? What is your storm version?
>>>>
>>>> Thanks
>>>> Kashyap
>>>> On Sep 13, 2015 08:04, "Martin Burian" <martin.buria...@gmail.com>
>>>> wrote:
>>>>
>>>>> They do restart after a while, yes. But if you don't see any error in
>>>>> the log, it's weird. I encountered a case of workers not starting
>>>>> because I had configured the worker JVM to expose a JMX interface for
>>>>> remote monitoring on a given port. Other workers on the same machine,
>>>>> however, could not start, as they failed to bind to the already-used
>>>>> port. No error messages whatsoever. Might any such thing be your case?
>>>>>
>>>>> Otherwise the cause should be logged somewhere. A worker is definitely
>>>>> not running, or at least not talking to the supervisor. You could try
>>>>> using fewer workers to find out when/where the error occurs.
>>>>>
>>>>> Martin
>>>>>
>>>>> On Sun, Sep 13, 2015 at 1:43 PM Kashyap Mhaisekar <
>>>>> kashya...@gmail.com> wrote:
>>>>>
>>>>>> All worker logs have the same log. Workers are up. I am using only
>>>>>> one box with multiple workers to test.
>>>>>> Workers should be restarted if they fail, right? So ideally, this
>>>>>> error should be gone in a while...
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> Kashyap
>>>>>> On Sep 13, 2015 05:10, "Martin Burian" <martin.buria...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> When this appears in a worker log, it means that the worker is trying
>>>>>>> to connect to another worker, but the other is not running. What do
>>>>>>> you see in worker-6707.log? Is the other worker running?
>>>>>>> Martin
>>>>>>>
>>>>>>> On Sun, Sep 13, 2015 at 6:06 AM Kashyap Mhaisekar <
>>>>>>> kashya...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Also,
>>>>>>>> Is there a way to switch back to 0mq from Netty? If so, what needs
>>>>>>>> to be done?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> kashyap
>>>>>>>>
>>>>>>>> On Sat, Sep 12, 2015 at 10:49 PM, Kashyap Mhaisekar &
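The "adjust max spout pending" throttle mentioned above caps how many
emitted tuples may be un-acked per spout task before Storm pauses the spout;
it only takes effect when acking is enabled. A minimal sketch, with purely
illustrative values:

    import backtype.storm.Config;

    Config conf = new Config();
    conf.setMaxSpoutPending(500);   // pause the spout at 500 pending tuples per task
    conf.setMessageTimeoutSecs(60); // replay tuples not fully acked within 60s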

Hibernate + Storm

2015-05-14 Thread Stephen Powis
Hello everyone!

I'm currently toying around with a prototype built on top of Storm and have
been running into some rough going while trying to work with Hibernate and
Storm.  I was hoping to get input on whether this is just a case of me doing
it wrong, or maybe pick up some useful tips.

In my prototype, I have a need to fan out a single tuple to several bolts
which do data retrieval from our database in parallel, which then get
merged back into a single stream.  These data retrieval bolts all find
various Hibernate entities and pass them along to the merge bolt.  We've
written a Kryo serializer that converts the Hibernate entities into
POJOs, which get sent to the merge bolt in tuples.  Once all the tuples get
to the merge bolt, it collects them into a single tuple and passes it
downstream to a bolt which does processing using the entities.

So it looks something like this:

                /---(retrieve bolt a)---\
               /----(retrieve bolt b)----\
---(split bolt)-----(retrieve bolt c)-----(merge bolt)---(processing bolt)
               \----(retrieve bolt d)----/
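For concreteness, here is a minimal sketch of how a fan-out/fan-in like that
can be wired with TopologyBuilder. Pre-1.0 "backtype.storm" packages are
assumed; RequestSpout, SplitBolt, the RetrieveBolt* classes, MergeBolt,
ProcessingBolt, and the "requestId" field are hypothetical stand-ins, not
the original components:

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("source", new RequestSpout());
    builder.setBolt("split", new SplitBolt()).shuffleGrouping("source");

    // Fan out: each retrieve component subscribes to the split bolt's
    // stream, so every component sees every tuple (shuffle only balances
    // load among a single component's own tasks).
    builder.setBolt("retrieve-a", new RetrieveBoltA()).shuffleGrouping("split");
    builder.setBolt("retrieve-b", new RetrieveBoltB()).shuffleGrouping("split");
    builder.setBolt("retrieve-c", new RetrieveBoltC()).shuffleGrouping("split");
    builder.setBolt("retrieve-d", new RetrieveBoltD()).shuffleGrouping("split");

    // Fan in: group on the request id so all pieces of one request land on
    // the same merge task, which can emit a single combined tuple downstream.
    builder.setBolt("merge", new MergeBolt())
            .fieldsGrouping("retrieve-a", new Fields("requestId"))
            .fieldsGrouping("retrieve-b", new Fields("requestId"))
            .fieldsGrouping("retrieve-c", new Fields("requestId"))
            .fieldsGrouping("retrieve-d", new Fields("requestId"));

    builder.setBolt("process", new ProcessingBolt()).shuffleGrouping("merge");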

So we end up detaching the Hibernate entities from the session in order to
serialize them, and then further downstream, when we want to work with the
entities again, we have to reattach them to a new session... this seems
kind of awkward.

Does doing the above make sense?  Has anyone attempted something similar?
Any tips or things we should watch out for?  Basically, I'm looking for any
kind of input on this use case.

Thanks!
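On the serialization piece: a custom Kryo serializer like the one described
can be registered on the topology Config so the detached-entity POJOs are
serialized explicitly instead of falling back to Java serialization. A
hedged sketch against the 0.9.x-era API; EntityPojo and EntityPojoSerializer
are hypothetical names:

    import backtype.storm.Config;

    Config conf = new Config();
    // Register the custom Kryo serializer for the detached-entity POJO
    // (both class names are placeholders; the serializer must extend
    // com.esotericsoftware.kryo.Serializer).
    conf.registerSerialization(EntityPojo.class, EntityPojoSerializer.class);
    // Fail fast on unregistered types rather than silently falling back to
    // (slow) Java serialization.
    conf.setFallBackOnJavaSerialization(false);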