Re: Is there any way for my application code to get notified after it gets deserialized on a worker node and before spouts/bolts are opened/prepared?

2014-06-02 Thread Marc Vaillant
The bolt base classes have a prepare method:

https://storm.incubator.apache.org/apidocs/backtype/storm/topology/base/BaseBasicBolt.html

and the spout base classes have a similar activate method:

https://storm.incubator.apache.org/apidocs/backtype/storm/topology/base/BaseRichSpout.html

Is that sufficient for your needs or were you thinking of something
different?
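If those hooks are enough, one pattern is to have every bolt's prepare() and spout's open() call an idempotent initializer for the shared per-worker state. A plain-Java sketch (the class and method names below are made up for illustration, not part of Storm's API):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical holder for state shared by all spouts/bolts in one worker JVM.
// Each component calls SharedState.ensureInitialized(...) from prepare()/open();
// only the first call in the JVM actually performs the initialization.
public class SharedState {
    private static final AtomicBoolean initialized = new AtomicBoolean(false);
    private static volatile boolean checkInvariants;

    public static void ensureInitialized(boolean enableInvariantChecks) {
        if (initialized.compareAndSet(false, true)) {
            checkInvariants = enableInvariantChecks;  // runs exactly once per JVM
        }
    }

    public static boolean invariantsEnabled() {
        return checkInvariants;
    }
}
```

Since all executors for a topology in one worker share a JVM, whichever prepare()/open() runs first performs the setup and the rest see it already done.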

Marc

On Sun, Jun 01, 2014 at 04:47:03PM -0700, Chris Bedford wrote:
 Hi there -
 
 I would like to set up some state that spouts and bolts share, and I'd like to
 prepare this state when the StormTopology gets 'activated' on a worker.
 
 It would be great if the StormTopology had something like a prepare or open
 method to indicate when it is starting.  I looked but I could find no such
 API.
   Maybe I should submit an enhancement request?
 
 Thanks in advance for your responses,
   -  Chris
 
 
 
 [ if anyone is curious, the shared state tells all my application code
 whether or not to check invariants.  The invariant checking takes additional time,
 so we don't want to do it in production, but during testing/development it
 helps catch bugs. ]
 
 --
 Chris Bedford
 
 Founder & Lead Lackey
 Build Lackey Labs:  http://buildlackey.com
 Go Grails!: http://blog.buildlackey.com
 
 


Re: Interesting Comparison

2014-05-12 Thread Marc Vaillant
To play devil's advocate, if you believe the stream performance gains,
then the 40k will likely pay for itself in needing to deploy a fraction
of the resources for the same throughput.  

On Mon, May 12, 2014 at 09:02:53AM -0400, John Welcher wrote:
 Hi
 
 Streams also costs 40,000 USD, while Storm is free.
 
 John
 
 
 On Mon, May 12, 2014 at 3:49 AM, Klausen Schaefersinho 
 klaus.schaef...@gmail.com wrote:
 
 Hi,
 
 I found some interesting comparison of IBM Stream and Storm:
 
 https://www.ibmdw.net/streamsdev/2014/04/22/streams-apache-storm/
 
 It also includes an interesting comparison between ZeroMQ and Netty
 performance.
 
 
 Cheers,
 
 Klaus
 
 


Re: Doubts on Apache Storm

2014-05-06 Thread Marc Vaillant
On Tue, May 06, 2014 at 03:21:13PM +0530, milind.pa...@polarisft.com wrote:
 
 Hi,
 
   Is Nimbus mandatory for Storm? (Our development env is neither using
   Nimbus nor any other cloud environment)

I think you might have misunderstood Nimbus.  It is a daemon that is
part of Storm, *not* Nimbus from the Nimbus cloud project.
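For context, Nimbus is Storm's master daemon (started with `storm nimbus`), and every cluster needs it alongside ZooKeeper and the supervisors. A minimal storm.yaml for the 0.9.x releases looks roughly like this (host names below are placeholders):

```yaml
# storm.yaml on every node; hosts are placeholders
storm.zookeeper.servers:
  - "zk1.example.com"
nimbus.host: "nimbus.example.com"   # where the Nimbus daemon runs
storm.local.dir: "/var/storm"
```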

 
 (I am new to Apache Storm; it would really help me if any basic
 documentation on Apache Storm were available)
 
 Regards
 Milind Patil | Intellect Liquidity cash management
 8SWS 031 | Silver metropolis | Western express highway | Goregaon (East) |
 Mumbai | 400 063. INDIA
 Board: 91-22-67801500  | 91-22-42029200 | Ext: 1734 Mobile: +91 9920612360
 | mail id: milind.pa...@polarisft.com
 
 
 
 This e-Mail may contain proprietary and confidential information and is sent 
 for the intended recipient(s) only.  If by an addressing or transmission 
 error this mail has been misdirected to you, you are requested to delete this 
 mail immediately. You are also hereby notified that any use, any form of 
 reproduction, dissemination, copying, disclosure, modification, distribution 
 and/or publication of this e-mail message, contents or its attachment other 
 than by its intended recipient/s is strictly prohibited.
 
 Visit us at http://www.polarisFT.com
 


Re: PDF processing use case in storm!!

2014-04-28 Thread Marc Vaillant
I think it's important to know whether or not some form of parallelism
(other than throughput) is required, otherwise a standard webservice
seems sufficient for this use case.

On Mon, Apr 28, 2014 at 07:46:35AM -0400, Andrew Perepelytsya wrote:
 You can build request-response type topologies via DRPC. However, unless we're
 talking about processing numerous PDFs at once - bad fit, IMO.
 
 If there is parallelism required you might be better off with a custom YARN app
 - looks like YAYA makes it tolerable to write.
 
 Andrew
 
 On Apr 28, 2014 2:41 AM, Deepak Sharma deepakmc...@gmail.com wrote:
 
 Hi All,
 Just wanted to check if this can be a valid Storm use case.
 I want to write one simple Storm topology which can read a PDF file, process
 it, and make some changes, like converting it to doc and saving the new file.
 I know this can be easily done in batch mode using Hadoop, but we want to do
 it in real time, i.e. when the user demands it.
 We already do it using some Java API but it takes a lot of time in all
 the conversions.
 Can this be achieved in Storm? If yes, is there any pointer to any examples
 similar to this use case?
 
 
 --
 Thanks
 Deepak
 www.bigdatabig.com
 
 
 
 CONFIDENTIALITY NOTICE
 NOTICE: This message is intended for the use of the individual or entity to
 which it is addressed and may contain information that is confidential,
 privileged and exempt from disclosure under applicable law. If the reader of
 this message is not the intended recipient, you are hereby notified that any
 printing, copying, dissemination, distribution, disclosure or forwarding of
 this communication is strictly prohibited. If you have received this
 communication in error, please contact the sender immediately and delete it
 from your system. Thank You.


Re: PDF processing use case in storm!!

2014-04-28 Thread Marc Vaillant
OK, so why isn't a standard webservice (CGI/FastCGI/mod_php, etc.)
sufficient for that kind of parallelism?  With multiple users making
requests, they will all happen in parallel.

On Mon, Apr 28, 2014 at 07:53:37PM +0530, Deepak Sharma wrote:
 We need parallelism here,
 as a lot of users may be using the service at the same time.  It may be for
 a different file or the same file.
 
 Thanks
 Deepak
 
 
 On Mon, Apr 28, 2014 at 7:41 PM, Marc Vaillant vaill...@animetrics.com 
 wrote:
 
 I think it's important to know whether or not some form of parallelism
 (other than throughput) is required, otherwise a standard webservice
 seems sufficient for this use case.
 
 
 --
 Thanks
 Deepak
 www.bigdatabig.com
 www.keosha.net


One solution to the stdio redirect issue

2014-03-24 Thread Marc Vaillant
I put together a more complete solution to the insidious STDOUT/STDERR
buffer filling issue.  Basically, if STDOUT/STDERR is not
redirected/consumed in cluster mode it will fill the buffer and
eventually take down your topology.  The original thread on this issue
was not migrated to JIRA but related issues can be found here:

https://issues.apache.org/jira/browse/STORM-202?jql=project%20%3D%20STORM%20AND%20text%20~%20%22stdout%22

Early on, Nathan added a redirect to Log4j that will work for Java, but
not for STDOUT/STDERR coming from native code.  So if your spouts and/or
bolts use a native JNI library and that library writes to STDOUT or
STDERR, you will encounter this problem.  I put together a library that
will redirect all (including native) STDOUT and STDERR to Log4j.  We
have been using it now for a couple weeks in our topologies.
https://github.com/animetrics/STDIORedirect 

I hope someone else can find it useful.

Best,
Marc


Re: Wirbelsturm released, 1-click deployments of Storm clusters

2014-03-19 Thread Marc Vaillant
Hi Michael,

Thanks very much for your hard work on this; your Puppet scripts have
been very helpful.  We are having a specific issue with supervision of
ZooKeeper and I wonder if you have encountered something similar or if
we are doing something wrong.  Even with the stopasgroup=true
supervisord option, there still seems to be a problem with orphaned
child processes when the parent process (zookeeper-server script) goes
down from an external event.  Although running supervisorctl stop
zookeeper will take down the zookeeper-server script and its child
processes, issuing a killall zookeeper-server will take down only the
script, leaving the child processes running.  This sends supervisord
into an infinite loop of attempting to restart zookeeper, but failing
because the child processes are still alive and occupying the required
ports.  

A fix we've found (refer to the last answer here:
http://stackoverflow.com/questions/9090683/supervisord-stopping-child-processes)
is to put

trap "kill -- -$$" EXIT

at the top of the zookeeper-server script.  However, it seems like the
stopasgroup=true setting was designed to handle this case.  I know that
stopasgroup was part of the 3.0b2 (05.28.2013) release of supervisord.  We
are using the 3.0 (07.30.2013) release from your RPM
https://github.com/miguno/wirbelsturm-rpm-supervisord so I believe it
should be available.  

Thanks,
Marc


On Mon, Mar 17, 2014 at 09:02:11PM +0100, Michael G. Noll wrote:
 Hi everyone,
 
 I have released a tool called Wirbelsturm
 (https://github.com/miguno/wirbelsturm) that allows you to perform local
 and remote deployments of Storm.  It's also a small way of saying a big
 thank you to the Storm community.
 
 Wirbelsturm uses Vagrant for creating and managing machines, and Puppet
 for provisioning the machines once they're up and running.  You can also
 use Ansible to interact with deployed machines.  Deploying Storm is but
 one example, of course -- you can deploy other software with Wirbelsturm
 as well (e.g. Graphite, Kafka, Redis, ZooKeeper).
 
 I also wrote a quick intro and behind-the-scenes blog post at [1], which
 covers, for instance, the motivation behind building Wirbelsturm and
 lessons learned along the way (read: mistakes made :-P).
 
 Enjoy!
 Michael
 
 
 [1]
 http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet/
 
 


which heartbeat(s) to modify so that debug sessions don't time out?

2014-02-27 Thread Marc Vaillant
I'm trying to debug some native code that runs in a task using gdb.
When I attach to the process, stepping through code runs afoul of one or
more of Storm's 30s heartbeat timeouts, at which point Storm kills the
process and prematurely ends my debugging session.  I'm having trouble
figuring out how to extend this timeout so that I can debug effectively.
I've tried setting supervisor.worker.timeout.secs but that doesn't do it
(I can tell from ps on the worker process that it is indeed set to the
value I've given).  What other timeout do I need to set?
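For reference, these are the heartbeat-related settings I've found in the 0.9.x defaults.yaml that look relevant; I'm listing them as candidates to experiment with, not as a confirmed fix (values below are just illustrative storm.yaml overrides):

```yaml
# storm.yaml overrides while debugging; values are illustrative
supervisor.worker.timeout.secs: 600   # supervisor kills workers that miss heartbeats
nimbus.task.timeout.secs: 600         # nimbus reassigns tasks that miss heartbeats
nimbus.supervisor.timeout.secs: 600   # nimbus considers a supervisor dead
```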

Thanks,
Marc


Can a topology be configured to force a maximum of 1 executor per worker?

2014-02-05 Thread Marc Vaillant
Suppose that you have a bolt whose tasks are not thread safe but you
still want parallelism.  It seems that this could be achieved via
multiprocessing by forcing a maximum of 1 executor per worker.  With
this constraint, if you chose a parallelism hint of 4 (with the default
number of executors) you would get 4 tasks in 4 executors, each running
in a separate worker.  Can this constraint be configured?
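(If it can't, a fallback we're considering is to confine the non-thread-safe calls to a single dedicated thread inside each worker and funnel all executors through it. A plain-Java sketch, with made-up names standing in for the unsafe library:)

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical wrapper that serializes access to a non-thread-safe resource
// by running every call on one dedicated thread.
public class ConfinedResource {
    private final ExecutorService runner = Executors.newSingleThreadExecutor();
    private final StringBuilder state = new StringBuilder(); // stand-in for the unsafe resource

    public Future<Integer> append(String s) {
        return runner.submit(() -> {
            state.append(s);           // only ever touched by the runner thread
            return state.length();
        });
    }

    public void shutdown() {
        runner.shutdown();
    }
}
```

Multiple executor threads can call append() concurrently; the calls are applied one at a time on the runner thread, at the cost of serializing that bolt's work within the worker.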

Thanks,
Marc