Re: multilingual tuples via kafka

2015-05-27 Thread Sergio Fernández
Perfect. Thanks, Taylor. That explains the basics.

So now I'm taking the string and parsing it as JSON. What would be the best
practice for doing that directly in a Scheme?
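
A minimal sketch of a JSON-parsing Scheme, based on Taylor's pointer below
(the field names "val1"/"val2" and the use of the json-simple parser are only
assumptions; any JSON library would do):

import java.nio.charset.StandardCharsets;
import java.util.List;

import org.json.simple.JSONObject;
import org.json.simple.JSONValue;

import backtype.storm.spout.Scheme;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class JsonScheme implements Scheme {

  // Called for every raw Kafka message; turns the byte array into tuple values.
  public List<Object> deserialize(byte[] ser) {
    String json = new String(ser, StandardCharsets.UTF_8);
    JSONObject obj = (JSONObject) JSONValue.parse(json);
    return new Values(obj.get("val1"), obj.get("val2"));
  }

  // Field names for the values emitted above, in the same order.
  public Fields getOutputFields() {
    return new Fields("val1", "val2");
  }
}

It would then be wired into the spout config with something like
spoutConfig.scheme = new SchemeAsMultiScheme(new JsonScheme());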

Cheers,

On Tue, May 26, 2015 at 6:12 PM, P. Taylor Goetz  wrote:

> The data coming from Kafka to the Kafka spout is just a byte array
> containing the raw data. To consume it, you need to define a `Scheme`
> implementation that knows how to parse the byte array to produce tuples.
>
> For example, the `StringScheme` class included in storm-kafka just
> converts the byte array to a string and puts that value in the tuple with
> the key “str”:
>
>
> https://github.com/apache/storm/blob/master/external/storm-kafka/src/jvm/storm/kafka/StringScheme.java
>
> -Taylor
>
> On May 22, 2015, at 11:51 AM, Sergio Fernández  wrote:
>
> Hi,
>
> I'm experimenting with feeding the KafkaSpout from a language other than
> Java, but I guess I have a conceptual error...
>
> From Python I'm sending two values:
>
> producer.send_messages("test", "val1", "val2")
>
> But when from a Java bolt I try to handle it:
>
> execute(Tuple input) {
>   String val1 = input.getString(0);
>   String val2 = input.getString(1);
>   ...
> }
>
> I'm getting an IndexOutOfBoundsException: Index: 1, Size: 1.
>
> I'd appreciate any advice on how to correctly send tuples.
>
> Thanks!
>
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co
>
>
>


-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co


Re: Aeolus 0.1 available

2015-05-27 Thread Manu Zhang
Hi Matthias,

The project looks interesting. Any detailed performance data compared with
latest storm versions (0.9.3 / 0.9.4) ?

Thanks,
Manu Zhang

On Tue, May 26, 2015 at 11:52 PM, Matthias J. Sax <
mj...@informatik.hu-berlin.de> wrote:

> Dear Storm community,
>
> we would like to share our project Aeolus with you. While the project is
> not finished, our first component --- a transparent batching layer ---
> is available now.
>
> Aeolus' batching component is a transparent layer that can increase
> Storm's throughput by an order of magnitude while keeping tuple-by-tuple
> processing semantics. Batching happens transparently to the system and the
> user code. Thus, it can be used without changing existing code.
>
> Aeolus is available under the Apache License 2.0, and we would be happy to
> receive any feedback. If you would like to try it out, you can download
> Aeolus from our
> git repository:
> https://github.com/mjsax/aeolus
>
>
> Happy hacking,
>   Matthias
>
>
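
For readers new to the idea: Aeolus does the batching transparently, but the
effect is roughly what hand-written batching at a bolt boundary would give
you. A minimal, illustrative sketch (this is not Aeolus code; the batch size
and the single-field payload are assumptions):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ManualBatchingBolt extends BaseRichBolt {
  private static final int BATCH_SIZE = 100; // illustrative value
  private OutputCollector collector;
  private List<Object> buffer;

  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    this.buffer = new ArrayList<Object>();
  }

  public void execute(Tuple input) {
    // Collect individual payloads and forward them as one list, cutting down
    // the per-tuple transfer overhead. (Aeolus also handles anchoring and
    // acking properly; this sketch simply acks every input immediately.)
    buffer.add(input.getValue(0));
    if (buffer.size() >= BATCH_SIZE) {
      collector.emit(new Values(new ArrayList<Object>(buffer)));
      buffer.clear();
    }
    collector.ack(input);
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("batch"));
  }
}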


What will Storm do when a supervisor is manually killed?

2015-05-27 Thread Xunyun Liu
Hi Storm fellows, I've got a simple question and would like to have a
quick answer.

Let's say a Storm topology is running on a cluster without any supervision;
at the beginning it behaves properly and has a balanced distribution. But
sometimes errors occur and bring down the supervisor daemon or even the whole
machine. I am wondering what action Storm takes in such situations to
guarantee fault resilience. E.g., when I use "Ctrl+C" to terminate the
supervisor, or manually shut down the machine, to simulate those two
different kinds of crash, I found that Storm automatically allocates the lost
slots to another machine and just keeps running. Is that an implicit
invocation of the rebalance command? It is a transparent way to deal with
supervisor failure, but what if I get the lost machine back after several
minutes of downtime? Is there any out-of-the-box method that could
automatically rebalance the topology again and put the revived supervisor or
machine back to work?

Any answer to this will be greatly appreciated, Thanks. :-)


RE: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-05-27 Thread Rajesh_Kalluri
The previous reply was from March 10th so Bill may have resolved this issue but 
here is some info that can be  useful to other folks in the future.

You can set the JVM flags to do a heapdump on OOM using 
-XX:+HeapDumpOnOutOfMemoryError.

You can then analyze the heapdump in a tool like VisualVM, which has a leak 
detector to tell you where you are spending all your memory.

Make sure you set the worker childopts -Xmx to a reasonable value like 2G so 
that you can comfortably analyze the heapdump.

You can also turn on jmxremote monitoring and use tools like Java Mission 
Control (ships with the JDK) to do flight recordings that give you a wealth of 
information on what's happening in your JVM, thread dumps at periodic times, 
and so on.
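
For example, the same worker JVM options can be set per topology from the
submitting code (the heap size and dump path below are placeholders):

import backtype.storm.Config;

public class HeapDumpConfigExample {
  public static Config buildConf() {
    Config conf = new Config();
    // Keep the heap small enough that the resulting dump is still easy to
    // analyze, and write a heap dump whenever a worker hits OutOfMemoryError.
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
        "-Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp");
    return conf;
  }
}

The cluster-wide equivalent is the worker.childopts setting in storm.yaml.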

From: Binh Nguyen Van [mailto:binhn...@gmail.com]
Sent: Wednesday, May 27, 2015 3:03 PM
To: user@storm.apache.org
Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

Not sure if you fixed the issue, but I think the problem may come from the max 
spout pending.
You are using Trident, and this value is the max number of pending "BATCHES", 
not the number of "tuples". So let's say your topic has 10 partitions, max 
spout pending is set to 10, and the max fetch size is set to 1 MB; then you 
will have 10 * 10 * 1 MB = 100 MB of input data in your topology at any 
moment, and this will blow up your heap really quickly.
I think going with max spout pending set to 1 and then tuning it up is the 
better way to go.

Hope this helps
-Binh

On Tue, Mar 10, 2015 at 3:56 AM, Brunner, Bill <bill.brun...@baml.com> wrote:
Once you’ve profiled your app, you should also play around with different 
garbage collectors.  Considering you’re reaching max heap, I assume your tuples 
are probably pretty large.  If that’s the case and you’re using the CMS garbage 
collector, you’re going to blow out your heap regularly.  I found with large 
tuples and/or memory intensive computations that the old parallel GC works the 
best because it compresses old gen every time it collects… CMS doesn’t and each 
sweep it  tries to jam more into the heap until it can’t any longer and then 
blows up.  There is also a great article by Michael Noll about storm's message 
buffers and how to tweak them depending on your needs.  
http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/



From: Sa Li [mailto:sa.in.v...@gmail.com]
Sent: Monday, March 09, 2015 10:15 PM
To: user@storm.apache.org
Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded


I have not done that yet, not quite familiar with this, but I will try to do 
that tomorrow, thanks.
On Mar 9, 2015 7:10 PM, "Nathan Leung" <ncle...@gmail.com> wrote:
Have you profiled your spout / bolt logic as recommended earlier in this thread?

On Mon, Mar 9, 2015 at 9:49 PM, Sa Li <sa.in.v...@gmail.com> wrote:

You are right, I have already increased the heap in the yaml to 2 G for each 
worker but still have the issue, so I suspect I may be running into some other 
causes (receive/send buffer size?). And in general, before I see the GC 
overhead in the Storm UI, I came across other errors in the worker log as well, 
like Netty connection, null pointer, etc., as I showed in another post.

Thanks
On Mar 9, 2015 5:36 PM, "Nathan Leung" <ncle...@gmail.com> wrote:
I still think you should try running with a larger heap.  :)  Max spout pending 
determines how many tuples can be pending (tuple tree is not fully acked) per 
spout task.  If you have many spout tasks per worker this can be a large amount 
of memory.  It also depends on how big your tuples are.

On Mon, Mar 9, 2015 at 6:14 PM, Sa Li <sa.in.v...@gmail.com> wrote:
Hi, Nathan

We have played around with max spout pending in dev; if we set it to 10, it is 
OK, but if we set it to more than 50, the GC overhead starts to show up. We are 
ultimately writing tuples into a PostgreSQL DB, and the highest speed for 
writing into the DB is around 40K records/minute, which is quite slow; maybe 
that is why tuples accumulate in memory before being dumped into the DB. But I 
think 10 is too small. Does that mean only 10 tuples are allowed in flight?

thanks

AL

On Fri, Mar 6, 2015 at 7:39 PM, Nathan Leung <ncle...@gmail.com> wrote:
I've not modified netty so I can't comment on that.  I would set max spout 
pending; try 1000 at first.  This will limit the number of tuples that you can 
have in flight simultaneously and therefore limit the amount of memory used by 
these tuples and their processing.

On Fri, Mar 6, 2015 at 7:03 PM, Sa Li <sa.in.v...@gmail.com> wrote:
Hi, Nathan

The log size of that Kafka topic is 23515541, and each record is about 3K. I 
checked the yaml file and I don't have max spout pending set, so I assume it 
should be the default: topology.max.spout.pending: null

Should I set it to a certain value? Also I sometimes see 
java.nio.channels.ClosedChannelException: null

SF / East Bay Area Stream Processing Meetup next Thursday (6/4)

2015-05-27 Thread Siva Jagadeesan
http://www.meetup.com/Bay-Area-Stream-Processing/events/219086133/

Thursday, June 4, 2015

6:45 PM
TubeMogul


1250 53rd
St #1
Emeryville, CA

6:45PM to 7:00PM - Socializing

7:00PM to 8:00PM - Talks

8:00PM to 8:30PM - Socializing

Speaker :

*Bill Zhao (from TubeMogul)*

Bill was working as a researcher in the UC Berkeley AMP lab during the
creation of Spark and Tachyon, and worked on improving Spark memory
utilization and Spark-Tachyon integration.  Working at the intersection of
three massive trends (powerful machine learning, cloud computing, and
crowdsourcing), the AMPLab is integrating Algorithms,
Machines, and People to make sense of Big Data.

Topic:

*Introduction to Spark and Tachyon*

Description:

Spark is a fast and general processing engine compatible with Hadoop data.
It can run in Hadoop clusters through YARN or Spark's standalone mode, and
it can process data in HDFS, etc.  It is designed to perform both batch
processing (similar to MapReduce) and newer workloads such as streaming,
interactive queries, and machine learning.  Tachyon is a memory-centric distributed
storage system enabling reliable data sharing at memory-speed across
cluster frameworks, such as Spark and MapReduce.  It achieves high
performance by leveraging lineage information and using memory
aggressively. Tachyon caches working set files in memory, thereby avoiding
going to disk to load datasets that are frequently read. This enables
different jobs/queries and frameworks to access cached files at memory
speed.


Re: Load /Cache Data inside Bolt

2015-05-27 Thread 임정택
If it is a really static thing, you can serialize it and pass it to the Bolt's
constructor.
If you don't want to pay the serialization cost, you can store it in a file
(and include it in the resources of the jar) and load it in the Bolt's
prepare() method.
If you want to keep the jar small, you can use external storage (RDB, HDFS,
NoSQL, and so on).
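
A minimal sketch of the second option, loading a file bundled into the
topology jar from prepare() (the resource name and the key=value format are
assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class StaticDataBolt extends BaseRichBolt {
  private transient Map<String, String> cache; // rebuilt on every worker
  private OutputCollector collector;

  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    this.cache = new HashMap<String, String>();
    // "static-data.txt" is an example resource packaged into the topology jar,
    // one "key=value" pair per line.
    try (InputStream in = getClass().getResourceAsStream("/static-data.txt");
         BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] kv = line.split("=", 2);
        if (kv.length == 2) {
          cache.put(kv[0], kv[1]);
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("Failed to load static data", e);
    }
  }

  public void execute(Tuple input) {
    // Enrich the incoming tuple with the cached data, then ack it.
    String enriched = cache.get(input.getString(0));
    // ... emit or use "enriched" as needed ...
    collector.ack(input);
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // No output fields declared in this sketch.
  }
}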

Hope this helps.

Thanks.
Jungtaek Lim (HeartSaVioR)

On Thursday, May 28, 2015, Ashish Soni wrote:

> I need to load some static data inside Bolts and cached it , any
> recommendation as what is the best way to do it.
>
> Thanks,
>


-- 
Name : 임 정택
Blog : http://www.heartsavior.net / http://dev.heartsavior.net
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior


Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-05-27 Thread Binh Nguyen Van
Not sure if you fixed the issue, but I think the problem may come from the
max spout pending.
You are using Trident, and this value is the max number of pending "BATCHES",
not the number of "tuples". So let's say your topic has 10 partitions, max
spout pending is set to 10, and the max fetch size is set to 1 MB; then you
will have 10 * 10 * 1 MB = 100 MB of input data in your topology at any
moment, and this will blow up your heap really quickly.
I think going with max spout pending set to 1 and then tuning it up is the
better way to go.

Hope this helps
-Binh
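
For anyone looking for the knob itself, it is set on the topology config, e.g.
(the value shown is just the conservative starting point suggested above):

import backtype.storm.Config;

public class MaxSpoutPendingExample {
  public static Config buildConf() {
    Config conf = new Config();
    // With Trident this caps in-flight BATCHES per spout, not individual
    // tuples, so start low and raise it only after watching heap and latency.
    conf.setMaxSpoutPending(1);
    return conf;
  }
}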

On Tue, Mar 10, 2015 at 3:56 AM, Brunner, Bill 
wrote:

>  Once you’ve profiled your app, you should also play around with
> different garbage collectors.  Considering you’re reaching max heap, I
> assume your tuples are probably pretty large.  If that’s the case and
> you’re using the CMS garbage collector, you’re going to blow out your heap
> regularly.  I found with large tuples and/or memory intensive computations
> that the old parallel GC works the best because it compresses old gen every
> time it collects… CMS doesn’t and each sweep it  tries to jam more into the
> heap until it can’t any longer and then blows up.  There is also a great
> article by Michael Noll about storm's message buffers and how to tweak
> them depending on your needs.
> http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
>
>
>
>
>
>
>
> *From:* Sa Li [mailto:sa.in.v...@gmail.com]
> *Sent:* Monday, March 09, 2015 10:15 PM
> *To:* user@storm.apache.org
> *Subject:* Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
>
>
>
> I have not done that yet, not quite familiar with this, but I will try to
> do that tomorrow, thanks.
>
> On Mar 9, 2015 7:10 PM, "Nathan Leung"  wrote:
>
> Have you profiled your spout / bolt logic as recommended earlier in this
> thread?
>
>
>
> On Mon, Mar 9, 2015 at 9:49 PM, Sa Li  wrote:
>
> You are right, I have already increased the heap in the yaml to 2 G for each
> worker but still have the issue, so I suspect I may be running into some
> other causes (receive/send buffer size?). And in general, before I see the
> GC overhead in the Storm UI, I came across other errors in the worker log as
> well, like Netty connection, null pointer, etc., as I showed in another post.
>
> Thanks
>
> On Mar 9, 2015 5:36 PM, "Nathan Leung"  wrote:
>
> I still think you should try running with a larger heap.  :)  Max spout
> pending determines how many tuples can be pending (tuple tree is not fully
> acked) per spout task.  If you have many spout tasks per worker this can be
> a large amount of memory.  It also depends on how big your tuples are.
>
>
>
> On Mon, Mar 9, 2015 at 6:14 PM, Sa Li  wrote:
>
> Hi, Nathan
>
>
>
> We have played around with max spout pending in dev; if we set it to 10, it
> is OK, but if we set it to more than 50, the GC overhead starts to show up.
> We are ultimately writing tuples into a PostgreSQL DB, and the highest speed
> for writing into the DB is around 40K records/minute, which is quite slow;
> maybe that is why tuples accumulate in memory before being dumped into the
> DB. But I think 10 is too small. Does that mean only 10 tuples are allowed
> in flight?
>
>
>
> thanks
>
>
>
> AL
>
>
>
> On Fri, Mar 6, 2015 at 7:39 PM, Nathan Leung  wrote:
>
> I've not modified netty so I can't comment on that.  I would set max spout
> pending; try 1000 at first.  This will limit the number of tuples that you
> can have in flight simultaneously and therefore limit the amount of memory
> used by these tuples and their processing.
>
>
>
> On Fri, Mar 6, 2015 at 7:03 PM, Sa Li  wrote:
>
> Hi, Nathan
>
>
>
> The log size of that Kafka topic is 23515541, and each record is about 3K.
> I checked the yaml file and I don't have max spout pending set, so I assume
> it should be the default: topology.max.spout.pending: null
>
>
>
> Should I set it to a certain value? Also I sometimes see
> java.nio.channels.ClosedChannelException:
> null, or  b.s.d.worker [ERROR] Error on initialization of server mk-worker
>
> does this mean I should add
>
> storm.messaging.netty.server_worker_threads: 1
>
>
>
> storm.messaging.netty.client_worker_threads: 1
>
> storm.messaging.netty.buffer_size: 5242880 #5MB buffer
>
> storm.messaging.netty.max_retries: 30
>
> storm.messaging.netty.max_wait_ms: 1000
>
>
>
> storm.messaging.netty.min_wait_ms: 100
>
> into the yaml, and modify the values?
>
>
>
>
>
> thanks
>
>
>
>
>
>
>
> On Fri, Mar 6, 2015 at 2:22 PM, Nathan Leung  wrote:
>
> How much data do you have in Kafka? How is your max spout pending set? If
> you have a high max spout pending (or if you emit unanchored tuples) you
> could be using up a lot of memory.
>
> On Mar 6, 2015 5:14 PM, "Sa Li"  wrote:
>
> Hi, Nathan
>
>
>
> I have met a strange issue: when I set spoutConf.forceFromStart=true, it
> will quickly run into the GC overhead limit, even though I already increased
> the heap size, but if I remove this setting
> it works fine. I was thinking maybe the KafkaSpout is consuming data much
> fas

RE: Status of running storm on yarn (the yahoo project)

2015-05-27 Thread prasad ch
Hi Nathan,
I want to do real-time computation using Storm. Which one is better, Storm or 
Trident? I need to handle a huge amount of data with exactly-once semantics. 
Please help me.

Thanks!
Date: Wed, 27 May 2015 12:40:43 -0400
Subject: Re: Status of running storm on yarn (the yahoo project)
From: nat...@nathanmarz.com
To: user@storm.apache.org
CC: ev...@yahoo-inc.com; maas...@gmail.com

Mesosphere has official support for Storm on Mesos: 
https://github.com/mesos/storm
On Wed, May 27, 2015 at 11:14 AM,   wrote:
Thanks Bobby, for the detailed answer. So it sounds like it is better not to 
combine Storm with batch workloads at this point (YARN, Mesos or EC2), due to 
the network saturation and timeout threats. Is this behavior also seen in 
other streaming frameworks like Spark Streaming running on YARN?

From: Bobby Evans [mailto:ev...@yahoo-inc.com]
Sent: Wednesday, May 27, 2015 9:07 AM
To: Jeffery Maass; user@storm.apache.org
Subject: Re: Status of running storm on yarn (the yahoo project)

Mesos is very 
similar to YARN.  It is a resource scheduler.  Storm in the past had support 
for mesos, through a separate repo https://github.com/nathanmarz/storm-mesos it 
might still work with the latest versions of storm.  I don't know.  The concept 
here is that there was a special layer installed that would look for when the 
cluster had outstanding requests and not enough resources to meet those 
requests.  It would then request that many resources from mesos, launch 
supervisors on those nodes and let the scheduler do the rest.  It works quite 
well for elasticity at a small scale, or when you have a lot more network 
bandwidth than you need.  The problem is if mesos, or YARN, or open-stack, or 
EC2, or ... collocates your storm topology with some big batch job that 
suddenly saturates the network for a few seconds to a min heartbeats could 
start to time out, traffic would not flow from one worker to another, etc.  For 
some topologies all you do is tune your timeouts so workers don't get shot and 
relaunched too frequently and live with the noise from other stuff happening on 
the network.  For us though we have some very tight SLAs, if the data is 5 
seconds old throw it away I cannot use it any more.   My current goal with 
storm in this area is to have it be aware of the resources that your topology 
is using, the SLAs that it has, its desired budget for resources, how far over 
that budget it is willing to go,  Where it could possibly get other resources 
if needed (i.e. YARN, Mesos, Open Stack), and any other constraints it might 
have.  Storm would then take all of this into account and adjust the scheduling 
of your topology so that it can grow and shrink with the resources it needs to 
meet the SLAs it has, optionally taking some of those resources from other 
systems if needed.  This is still a ways out, but looking at the research that 
is being done in this area it should be doable in the next year or so. - Bobby  
  On Wednesday, May 27, 2015 8:38 AM, Jeffery Maass  wrote: 
I have heard Nathan Marz mention Mesos.How is yarn / storm-yarn / slider-yarn 
different from Mesos?

These are the links I found to Mesos:
https://github.com/mesos/storm
https://github.com/nathanmarz/storm-mesos
http://mesos.apache.org/

Thank you for your time!

+
Jeff Maass
linkedin.com/in/jeffmaass
stackoverflow.com/users/373418/maassql
+

On Wed, May 27, 2015 at 8:28 AM, Bobby Evans  wrote:

storm-yarn was originally done as a proof of 
concept.  We had plans to take it further, but the amount of work required to 
make it production ready on a very heavily used cluster was more then we were 
willing to invest at the time.  Most of that work was around network 
scheduling, isolation and prioritization, mainly in YARN itself.  There has 
been some work looking into this, but nothing much has happened with it.  At 
the same time http://slider.incubator.apache.org/ showed up and is now the 
preferred way to run Storm on YARN.  To get around the networking issues most 
people will tag a subset of their cluster, a few racks, and only schedule storm 
to run on those nodes.  Long term I really would like to revive storm on yarn, 
and integrate it directly into storm.  Giving storm and the scheduler the 
ability to request new resources with specific constraints opens up a lot of 
new possibilities.  If you want to help out, or if anyone else wants to help 
out with this work, I would be very happy to file some JIRA in open source and 
help direct what needs to be done.

- Bobby

On Wednesday, May 27, 2015 4:59 AM, Spico Florin  wrote:

Hello!
I'm interested in running the storm topologies on yarn. I was looking at the
yahoo project https://github.com/yahoo/storm-yarn, and I observed that there
is no activity since 7 months ago. Also, the issues and requests lists are
not updated.
Therefore I have some questions:
1. Is there any plan to evolve this project?
2. Is th

Re: Status of running storm on yarn (the yahoo project)

2015-05-27 Thread Nathan Marz
Mesosphere has official support for Storm on Mesos:
https://github.com/mesos/storm

On Wed, May 27, 2015 at 11:14 AM,  wrote:

>
> Thanks Bobby, for the detailed answer.
>
>
>
> So it sounds like it is better not to combine Storm with batch
> workloads at this point (yarn, mesos or ec2), due to the network saturation
> and timeout threats.
>
>
>
> Is this behavior also seen in other streaming frameworks like spark
> streaming running on YARN.
>
>
>
> *From:* Bobby Evans [mailto:ev...@yahoo-inc.com]
> *Sent:* Wednesday, May 27, 2015 9:07 AM
> *To:* Jeffery Maass; user@storm.apache.org
> *Subject:* Re: Status of running storm on yarn (the yahoo project)
>
>
>
> Mesos is very similar to YARN.  It is a resource scheduler.  Storm in the
> past had support for mesos, through a separate repo
>
>
>
> https://github.com/nathanmarz/storm-mesos
>
>
>
> it might still work with the latest versions of storm.  I don't know.  The
> concept here is that there was a special layer installed that would look
> for when the cluster had outstanding requests and not enough resources to
> meet those requests.  It would then request that many resources from mesos,
> launch supervisors on those nodes and let the scheduler do the rest.  It
> works quite well for elasticity at a small scale, or when you have a lot
> more network bandwidth than you need.  The problem is if mesos, or YARN, or
> open-stack, or EC2, or ... collocates your storm topology with some big
> batch job that suddenly saturates the network for a few seconds to a min
> heartbeats could start to time out, traffic would not flow from one worker
> to another, etc.  For some topologies all you do is tune your timeouts so
> workers don't get shot and relaunched too frequently and live with the
> noise from other stuff happening on the network.  For us though we have
> some very tight SLAs, if the data is 5 seconds old throw it away I cannot
> use it any more.
>
>
>
> My current goal with storm in this area is to have it be aware of the
> resources that your topology is using, the SLAs that it has, its desired
> budget for resources, how far over that budget it is willing to go,  Where
> it could possibly get other resources if needed (i.e. YARN, Mesos, Open
> Stack), and any other constraints it might have.  Storm would then take all
> of this into account and adjust the scheduling of your topology so that it
> can grow and shrink with the resources it needs to meet the SLAs it has,
> optionally taking some of those resources from other systems if needed.
> This is still a ways out, but looking at the research that is being done in
> this area it should be doable in the next year or so.
>
>
>
> - Bobby
>
>
>
>
>
>
>
> On Wednesday, May 27, 2015 8:38 AM, Jeffery Maass 
> wrote:
>
>
>
> I have heard Nathan Marz mention Mesos.
>
> How is yarn / storm-yarn / slider-yarn different from Mesos?
>
> These are the links I found to Mesos:
> https://github.com/mesos/storm
> https://github.com/nathanmarz/storm-mesos
> http://mesos.apache.org/
>
>
> Thank you for your time!
>
> +
> Jeff Maass 
> linkedin.com/in/jeffmaass
> stackoverflow.com/users/373418/maassql
> +
>
>
>
> On Wed, May 27, 2015 at 8:28 AM, Bobby Evans  wrote:
>
> storm-yarn was originally done as a proof of concept.  We had plans to
> take it further, but the amount of work required to make it production
> ready on a very heavily used cluster was more then we were willing to
> invest at the time.  Most of that work was around network scheduling,
> isolation and prioritization, mainly in YARN itself.  There has been some
> work looking into this, but nothing much has happened with it.  At the same
> time http://slider.incubator.apache.org/ showed up and is now the
> preferred way to run Storm on YARN.  To get around the networking issues
> most people will tag a subset of their cluster, a few racks, and only
> schedule storm to run on those nodes.  Long term I really would like to
> revive storm on yarn, and integrate it directly into storm.  Giving storm
> and the scheduler the ability to request new resources with specific
> constraints opens up a lot of new possibilities.  If you want to help out,
> or if anyone else wants to help out with this work, I would be very happy
> to file some JIRA in open source and help direct what needs to be done.
>
> - Bobby
>
>
>
>
>
> On Wednesday, May 27, 2015 4:59 AM, Spico Florin 
> wrote:
>
>
>
> Hello!
>
> I'm interested in running the storm topologies on yarn.
>
> I was looking at the yahoo project https://github.com/yahoo/storm-yarn,
> and I observed that there is no activity since 7 months ago. Also,
> the issues and requests lists are not updated.
>
> Therefore I have some questions:
>
> 1. Is there any plan to evolve this project?
>
> 2. Is there any plan to integrate this project in the main branch?
>
> 3. Is someone using this approach in production ready mode?
>
>
>
> I look forw

Load /Cache Data inside Bolt

2015-05-27 Thread Ashish Soni
I need to load some static data inside Bolts and cached it , any
recommendation as what is the best way to do it.

Thanks,


RE: Status of running storm on yarn (the yahoo project)

2015-05-27 Thread Rajesh_Kalluri
Thanks Bobby, for the detailed answer.

So it sounds like it is better not to combine Storm with batch workloads at 
this point (yarn, mesos or ec2), due to the network saturation and timeout 
threats.

Is this behavior also seen in other streaming frameworks like spark streaming 
running on YARN.

From: Bobby Evans [mailto:ev...@yahoo-inc.com]
Sent: Wednesday, May 27, 2015 9:07 AM
To: Jeffery Maass; user@storm.apache.org
Subject: Re: Status of running storm on yarn (the yahoo project)

Mesos is very similar to YARN.  It is a resource scheduler.  Storm in the past 
had support for mesos, through a separate repo

https://github.com/nathanmarz/storm-mesos

it might still work with the latest versions of storm.  I don't know.  The 
concept here is that there was a special layer installed that would look for 
when the cluster had outstanding requests and not enough resources to meet 
those requests.  It would then request that many resources from mesos, launch 
supervisors on those nodes and let the scheduler do the rest.  It works quite 
well for elasticity at a small scale, or when you have a lot more network 
bandwidth than you need.  The problem is if mesos, or YARN, or open-stack, or 
EC2, or ... collocates your storm topology with some big batch job that 
suddenly saturates the network for a few seconds to a min heartbeats could 
start to time out, traffic would not flow from one worker to another, etc.  For 
some topologies all you do is tune your timeouts so workers don't get shot and 
relaunched too frequently and live with the noise from other stuff happening on 
the network.  For us though we have some very tight SLAs, if the data is 5 
seconds old throw it away I cannot use it any more.

My current goal with storm in this area is to have it be aware of the resources 
that your topology is using, the SLAs that it has, its desired budget for 
resources, how far over that budget it is willing to go,  Where it could 
possibly get other resources if needed (i.e. YARN, Mesos, Open Stack), and any 
other constraints it might have.  Storm would then take all of this into 
account and adjust the scheduling of your topology so that it can grow and 
shrink with the resources it needs to meet the SLAs it has, optionally taking 
some of those resources from other systems if needed.  This is still a ways 
out, but looking at the research that is being done in this area it should be 
doable in the next year or so.

- Bobby



On Wednesday, May 27, 2015 8:38 AM, Jeffery Maass <maas...@gmail.com> wrote:

I have heard Nathan Marz mention Mesos.
How is yarn / storm-yarn / slider-yarn different from Mesos?

These are the links I found to Mesos:
https://github.com/mesos/storm
https://github.com/nathanmarz/storm-mesos
http://mesos.apache.org/

Thank you for your time!

+
Jeff Maass
linkedin.com/in/jeffmaass
stackoverflow.com/users/373418/maassql
+

On Wed, May 27, 2015 at 8:28 AM, Bobby Evans <ev...@yahoo-inc.com> wrote:
storm-yarn was originally done as a proof of concept.  We had plans to take it 
further, but the amount of work required to make it production ready on a very 
heavily used cluster was more then we were willing to invest at the time.  Most 
of that work was around network scheduling, isolation and prioritization, 
mainly in YARN itself.  There has been some work looking into this, but nothing 
much has happened with it.  At the same time 
http://slider.incubator.apache.org/ showed up and is now the preferred way to 
run Storm on YARN.  To get around the networking issues most people will tag a 
subset of their cluster, a few racks, and only schedule storm to run on those 
nodes.  Long term I really would like to revive storm on yarn, and integrate it 
directly into storm.  Giving storm and the scheduler the ability to request new 
resources with specific constraints opens up a lot of new possibilities.  If 
you want to help out, or if anyone else wants to help out with this work, I 
would be very happy to file some JIRA in open source and help direct what needs 
to be done.
- Bobby


On Wednesday, May 27, 2015 4:59 AM, Spico Florin <spicoflo...@gmail.com> wrote:

Hello!
I'm interested in running the storm topologies on yarn.
I was looking at the yahoo project https://github.com/yahoo/storm-yarn, and I 
observed that there is no activity since 7 months ago. Also, the issues 
and requests lists are not updated.
Therefore I have some questions:
1. Is there any plan to evolve this project?
2. Is there any plan to integrate this project in the main branch?
3. Is someone using this approach in production ready mode?

I look forward for your answers.
 Regards,
 Florin










Re: Status of running storm on yarn (the yahoo project)

2015-05-27 Thread P. Taylor Goetz
I also developed a prototype/proof-of-concept (read: duct tape and baling 
twine) for running Storm on YARN.

I took a slightly different approach than Yahoo’s storm-yarn and Slider which 
from a high level allow you to spin up a Storm cluster on top of YARN. In my 
PoC a topology is treated as a single YARN application — you use a specialized 
`storm jar` command to submit a topology and request the resources that will be 
dedicated to that topology. Behind the scenes it spins up the necessary 
resources to run the topology — essentially a Storm cluster where all worker 
slots, resources, etc. are dedicated to a single topology. That approach makes 
it easier to deal with things like multi-tenancy.

The way I did it was to develop YARN-aware implementations of the INimbus and 
ISupervisor interfaces that talked to the YARN resource manager, very similar 
to the approach Nathan took with storm-mesos. The ultimate goal was to 
implement elastic scaling of a topology based on demand, SLAs, etc., similar to 
what Bobby described.

Unfortunately, I haven’t had much time to develop it further, though I hope to 
revive it at some point in the future.

-Taylor

On May 27, 2015, at 10:06 AM, Bobby Evans  wrote:

> Mesos is very similar to YARN.  It is a resource scheduler.  Storm in the 
> past had support for mesos, through a separate repo
> 
> https://github.com/nathanmarz/storm-mesos
> 
> it might still work with the latest versions of storm.  I don't know.  The 
> concept here is that there was a special layer installed that would look for 
> when the cluster had outstanding requests and not enough resources to meet 
> those requests.  It would then request that many resources from mesos, launch 
> supervisors on those nodes and let the scheduler do the rest.  It works quite 
> well for elasticity at a small scale, or when you have a lot more network 
> bandwidth than you need.  The problem is if mesos, or YARN, or open-stack, or 
> EC2, or ... collocates your storm topology with some big batch job that 
> suddenly saturates the network for a few seconds to a min heartbeats could 
> start to time out, traffic would not flow from one worker to another, etc.  
> For some topologies all you do is tune your timeouts so workers don't get 
> shot and relaunched too frequently and live with the noise from other stuff 
> happening on the network.  For us though we have some very tight SLAs, if the 
> data is 5 seconds old throw it away I cannot use it any more.  
> 
> My current goal with storm in this area is to have it be aware of the 
> resources that your topology is using, the SLAs that it has, its desired 
> budget for resources, how far over that budget it is willing to go,  Where it 
> could possibly get other resources if needed (i.e. YARN, Mesos, Open Stack), 
> and any other constraints it might have.  Storm would then take all of this 
> into account and adjust the scheduling of your topology so that it can grow 
> and shrink with the resources it needs to meet the SLAs it has, optionally 
> taking some of those resources from other systems if needed.  This is still a 
> ways out, but looking at the research that is being done in this area it 
> should be doable in the next year or so.
>  
> - Bobby 
> 
> 
> 
>  
> On Wednesday, May 27, 2015 8:38 AM, Jeffery Maass  wrote:
> 
> 
> I have heard Nathan Marz mention Mesos.
> 
> How is yarn / storm-yarn / slider-yarn different from Mesos?
> 
> These are the links I found to Mesos:
> https://github.com/mesos/storm
> https://github.com/nathanmarz/storm-mesos
> http://mesos.apache.org/
> 
> 
> Thank you for your time!
> 
> +
> Jeff Maass
> linkedin.com/in/jeffmaass
> stackoverflow.com/users/373418/maassql
> +
> 
> 
> On Wed, May 27, 2015 at 8:28 AM, Bobby Evans  wrote:
> storm-yarn was originally done as a proof of concept.  We had plans to take 
> it further, but the amount of work required to make it production ready on a 
> very heavily used cluster was more then we were willing to invest at the 
> time.  Most of that work was around network scheduling, isolation and 
> prioritization, mainly in YARN itself.  There has been some work looking into 
> this, but nothing much has happened with it.  At the same time 
> http://slider.incubator.apache.org/ showed up and is now the preferred way to 
> run Storm on YARN.  To get around the networking issues most people will tag 
> a subset of their cluster, a few racks, and only schedule storm to run on 
> those nodes.  Long term I really would like to revive storm on yarn, and 
> integrate it directly into storm.  Giving storm and the scheduler the ability 
> to request new resources with specific constraints opens up a lot of new 
> possibilities.  If you want to help out, or if anyone else wants to help out 
> with this work, I would be very happy to file some JIRA in open source and 
> help direct what needs to be done. 
> - Bobby 
> 
> 
> 
> On Wednesday, May 27, 2015 4:59 AM

Re: Status of running storm on yarn (the yahoo project)

2015-05-27 Thread Bobby Evans
Mesos is very similar to YARN.  It is a resource scheduler.  Storm in the past 
had support for mesos, through a separate repo
 https://github.com/nathanmarz/storm-mesos
it might still work with the latest versions of storm.  I don't know.  The 
concept here is that there was a special layer installed that would look for 
when the cluster had outstanding requests and not enough resources to meet 
those requests.  It would then request that many resources from mesos, launch 
supervisors on those nodes and let the scheduler do the rest.  It works quite 
well for elasticity at a small scale, or when you have a lot more network 
bandwidth than you need.  The problem is if mesos, or YARN, or open-stack, or 
EC2, or ... collocates your storm topology with some big batch job that 
suddenly saturates the network for a few seconds to a min heartbeats could 
start to time out, traffic would not flow from one worker to another, etc.  For 
some topologies all you do is tune your timeouts so workers don't get shot and 
relaunched too frequently and live with the noise from other stuff happening on 
the network.  For us though we have some very tight SLAs, if the data is 5 
seconds old throw it away I cannot use it any more.  

My current goal with storm in this area is to have it be aware of the resources 
that your topology is using, the SLAs that it has, its desired budget for 
resources, how far over that budget it is willing to go,  Where it could 
possibly get other resources if needed (i.e. YARN, Mesos, Open Stack), and any 
other constraints it might have.  Storm would then take all of this into 
account and adjust the scheduling of your topology so that it can grow and 
shrink with the resources it needs to meet the SLAs it has, optionally taking 
some of those resources from other systems if needed.  This is still a ways 
out, but looking at the research that is being done in this area it should be 
doable in the next year or so.
- Bobby



On Wednesday, May 27, 2015 8:38 AM, Jeffery Maass  wrote:


I have heard Nathan Marz mention Mesos.

How is yarn / storm-yarn / slider-yarn different from Mesos?

These are the links I found to Mesos:
https://github.com/mesos/storm
https://github.com/nathanmarz/storm-mesos
http://mesos.apache.org/


Thank you for your time!

+
Jeff Maass
linkedin.com/in/jeffmaass
stackoverflow.com/users/373418/maassql
+


On Wed, May 27, 2015 at 8:28 AM, Bobby Evans  wrote:

storm-yarn was originally done as a proof of concept.  We had plans to take it 
further, but the amount of work required to make it production ready on a very 
heavily used cluster was more then we were willing to invest at the time.  Most 
of that work was around network scheduling, isolation and prioritization, 
mainly in YARN itself.  There has been some work looking into this, but nothing 
much has happened with it.  At the same time 
http://slider.incubator.apache.org/ showed up and is now the preferred way to 
run Storm on YARN.  To get around the networking issues most people will tag a 
subset of their cluster, a few racks, and only schedule storm to run on those 
nodes.  Long term I really would like to revive storm on yarn, and integrate it 
directly into storm.  Giving storm and the scheduler the ability to request new 
resources with specific constraints opens up a lot of new possibilities.  If 
you want to help out, or if anyone else wants to help out with this work, I 
would be very happy to file some JIRA in open source and help direct what needs 
to be done. 
- Bobby



On Wednesday, May 27, 2015 4:59 AM, Spico Florin  wrote:


Hello!
I'm interested in running the storm topologies on yarn. I was looking at the
yahoo project https://github.com/yahoo/storm-yarn, and I observed that there
is no activity since 7 months ago. Also, the issues and requests lists are
not updated.
Therefore I have some questions:
1. Is there any plan to evolve this project?
2. Is there any plan to integrate this project in the main branch?
3. Is someone using this approach in production ready mode?

I look forward for your answers.
Regards,
Florin






   



  

Re: Status of running storm on yarn (the yahoo project)

2015-05-27 Thread Jeffery Maass
I have heard Nathan Marz mention Mesos.

How is yarn / storm-yarn / slider-yarn different from Mesos?

These are the links I found to Mesos:
https://github.com/mesos/storm
https://github.com/nathanmarz/storm-mesos
http://mesos.apache.org/


Thank you for your time!

+
Jeff Maass 
linkedin.com/in/jeffmaass
stackoverflow.com/users/373418/maassql
+


On Wed, May 27, 2015 at 8:28 AM, Bobby Evans  wrote:

> storm-yarn was originally done as a proof of concept.  We had plans to
> take it further, but the amount of work required to make it production
> ready on a very heavily used cluster was more then we were willing to
> invest at the time.  Most of that work was around network scheduling,
> isolation and prioritization, mainly in YARN itself.  There has been some
> work looking into this, but nothing much has happened with it.  At the same
> time http://slider.incubator.apache.org/ showed up and is now the
> preferred way to run Storm on YARN.  To get around the networking issues
> most people will tag a subset of their cluster, a few racks, and only
> schedule storm to run on those nodes.  Long term I really would like to
> revive storm on yarn, and integrate it directly into storm.  Giving storm
> and the scheduler the ability to request new resources with specific
> constraints opens up a lot of new possibilities.  If you want to help out,
> or if anyone else wants to help out with this work, I would be very happy
> to file some JIRA in open source and help direct what needs to be done.
> - Bobby
>
>
>
>   On Wednesday, May 27, 2015 4:59 AM, Spico Florin 
> wrote:
>
>
> Hello!
> I'm interested in running the storm topologies on yarn.
> I was looking at the yahoo project https://github.com/yahoo/storm-yarn,
> and I observed that there is no activity since 7 months ago. Also,
> the issues and requests lists are not updated.
> Therefore I have some questions:
> 1. Is there any plan to evolve this project?
> 2. Is there any plan to integrate this project in the main branch?
> 3. Is someone using this approach in production ready mode?
>
> I look forward for your answers.
>  Regards,
>  Florin
>
>
>
>
>
>
>
>


Re: Status of running storm on yarn (the yahoo project)

2015-05-27 Thread Bobby Evans
storm-yarn was originally done as a proof of concept.  We had plans to take it 
further, but the amount of work required to make it production ready on a very 
heavily used cluster was more then we were willing to invest at the time.  Most 
of that work was around network scheduling, isolation and prioritization, 
mainly in YARN itself.  There has been some work looking into this, but nothing 
much has happened with it.  At the same time 
http://slider.incubator.apache.org/ showed up and is now the preferred way to 
run Storm on YARN.  To get around the networking issues most people will tag a 
subset of their cluster, a few racks, and only schedule storm to run on those 
nodes.  Long term I really would like to revive storm on yarn, and integrate it 
directly into storm.  Giving storm and the scheduler the ability to request new 
resources with specific constraints opens up a lot of new possibilities.  If 
you want to help out, or if anyone else wants to help out with this work, I 
would be very happy to file some JIRA in open source and help direct what needs 
to be done. 
- Bobby



On Wednesday, May 27, 2015 4:59 AM, Spico Florin  wrote:


Hello!
I'm interested in running the storm topologies on yarn. I was looking at the
yahoo project https://github.com/yahoo/storm-yarn, and I observed that there
is no activity since 7 months ago. Also, the issues and requests lists are
not updated.
Therefore I have some questions:
1. Is there any plan to evolve this project?
2. Is there any plan to integrate this project in the main branch?
3. Is someone using this approach in production ready mode?

I look forward for your answers.
Regards,
Florin






  

Re: unsubscribe

2015-05-27 Thread Nipur Patodi
Hi,
If the file is small, you can pass it as a JSON-serialized object in the storm
config while submitting the topology, and get it from the config map in the
spout and bolt.

Thanks,

_Nipur
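
A minimal sketch of that approach (the config key and JSON payload are made
up; note that values put into the topology config are fixed at submission
time, so this does not cover runtime updates):

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class ConfigBackedBolt extends BaseRichBolt {
  private String sharedJson;
  private OutputCollector collector;

  // At submission time the caller would do something like:
  //   Config conf = new Config();
  //   conf.put("my.shared.resource.json", "{\"threshold\": 10}");
  // and every worker then sees that value in the map handed to prepare().
  public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    this.sharedJson = (String) stormConf.get("my.shared.resource.json");
    // parse sharedJson with any JSON library and cache the result locally
  }

  public void execute(Tuple input) {
    // ... use the cached data, then ack ...
    collector.ack(input);
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // No output fields in this sketch.
  }
}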
On May 27, 2015 3:22 PM, "Chris Bedford"  wrote:

>
>
> On Wed, May 27, 2015 at 1:51 AM, Tousif  wrote:
>
>> Hi ,
>>
>>
>> Is there a way to share a resource file across all workers, similar to
>> HDFS? That resource/config file will have to be updated at run time. I'm
>> not looking at using HDFS for now.
>>
>> --
>>
>>
>> Regards
>> Tousif Khazi
>>
>>
>
>
> --
> Chris Bedford
>
> Founder & Lead Lackey
> Build Lackey Labs:  http://buildlackey.com
> Go Grails!: http://blog.buildlackey.com
>
>
>


Status of running storm on yarn (the yahoo project)

2015-05-27 Thread Spico Florin
Hello!
I'm interested in running the storm topologies on yarn.
I was looking at the yahoo project https://github.com/yahoo/storm-yarn, and
I observed that there is no activity since 7 months ago. Also, the
issues and requests lists are not updated.
Therefore I have some questions:
1. Is there any plan to evolve this project?
2. Is there any plan to integrate this project in the main branch?
3. Is someone using this approach in production ready mode?

I look forward for your answers.
 Regards,
 Florin


unsubscribe

2015-05-27 Thread Chris Bedford
On Wed, May 27, 2015 at 1:51 AM, Tousif  wrote:

> Hi ,
>
>
> Is there a way to share a resource file across all workers, similar to
> HDFS? That resource/config file will have to be updated at run time. I'm not
> looking at using HDFS for now.
>
> --
>
>
> Regards
> Tousif Khazi
>
>


-- 
Chris Bedford

Founder & Lead Lackey
Build Lackey Labs:  http://buildlackey.com
Go Grails!: http://blog.buildlackey.com


sharing a resource across all workers

2015-05-27 Thread Tousif
Hi ,


Is there a way to share a resource file across all workers, similar to HDFS?
That resource/config file will have to be updated at run time. I'm not looking
at using HDFS for now.

-- 


Regards
Tousif Khazi