Can I make a bolt to do a batch job?

2014-07-11 Thread 이승진
Hi all,
 
Assume that there is one spout and one bolt.

The spout emits a single log entry, and the bolt stores it in a DB, for example.

They are connected using shuffle grouping.

It occurs to me that if the bolt could wait for 10 tuples to arrive from the spout and then do a batch insert into the DB, it might be beneficial performance-wise.

Is there any way to make a bolt gather tuples up to a certain size and then do its job?


Re: Can I make a bolt to do a batch job?

2014-07-11 Thread Srinath C
You could use tick tuples to do that. The bolt can be configured to receive
periodic ticks, which you can use to trigger the batch insert.
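
A minimal sketch of such a batching bolt, assuming storm-core 0.9.x (the class name and the flushToDb helper are placeholders, not from this thread): it buffers tuples and flushes either when the buffer reaches the batch size or when a tick tuple arrives, so a slow stream still gets flushed, and it acks only after the batch is persisted.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class BatchingDbBolt extends BaseRichBolt {
    private static final int BATCH_SIZE = 10;
    private OutputCollector collector;
    private List<Tuple> buffer;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffer = new ArrayList<Tuple>();
    }

    @Override
    public void execute(Tuple tuple) {
        if (isTickTuple(tuple)) {
            flush();                     // time-based flush, so a slow stream is not stuck
        } else {
            buffer.add(tuple);
            if (buffer.size() >= BATCH_SIZE) {
                flush();                 // size-based flush
            }
        }
    }

    private void flush() {
        if (buffer.isEmpty()) return;
        flushToDb(buffer);               // hypothetical batch-insert helper
        for (Tuple t : buffer) {
            collector.ack(t);            // ack only after the batch is persisted
        }
        buffer.clear();
    }

    // Tick tuples arrive from the system component on the tick stream.
    private boolean isTickTuple(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    private void flushToDb(List<Tuple> batch) { /* batch INSERT goes here */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}

    // Request a tick every 10 seconds for this bolt only.
    @Override
    public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
        return conf;
    }
}

Acking only after the flush means that if the process dies with tuples still in the buffer, those tuples time out and get replayed rather than being lost.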




Re: Can I make a bolt to do a batch job?

2014-07-11 Thread 이승진
Great, I think I have to try this right away.

So the tick tuples are generated periodically and flow through every single
executor within the topology?

I read some articles online just now, and it sounds like I have to write
code for two cases in every bolt: the normal tuple case and the tick tuple case.

If my understanding is correct, is there any way to make tick tuples apply to
certain bolts only?
 




Re: Can I make a bolt to do a batch job?

2014-07-11 Thread Srinath C
Yes, it can be applied to specific bolts rather than to all bolts in the topology.
You have to specify the property on the Config object passed to the
topology builder.
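
A hedged sketch of enabling ticks for one bolt at wiring time (component names and the LogSpout class are placeholders; BatchingDbBolt is the sketch above, assuming backtype.storm.Config and backtype.storm.topology.TopologyBuilder). addConfiguration sets a per-component override, so only "db-bolt" receives tick tuples; alternatively, the bolt itself can return the same setting from getComponentConfiguration(), as in the earlier sketch.

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("log-spout", new LogSpout());
builder.setBolt("db-bolt", new BatchingDbBolt(), 4)
       .addConfiguration(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10)  // ticks for this bolt only
       .shuffleGrouping("log-spout");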




Storm UI

2014-07-11 Thread Benjamin SOULAS
Hi everyone,

I am currently an intern working toward my master's degree, and I have to
implement topologies and see what's happening. I am trying to view this data
via the Storm UI; my problem is that I can't find enough documentation on
it... I installed the Splunk interface, but I don't know how to use it with
my topologies... Are the Metrics interfaces used for this?

Please, I really need help...

Regards


Re: Storm Drpc Performance

2014-07-11 Thread /pig五餐肉
Thanks, Kidong.
I tried it. However, there seems to be no improvement, and the latency is
higher than with the original DrpcClient class.




-- Original Message --
From: "Kidong Lee";
Sent: Friday, July 11, 2014, 1:45 PM
To: "user";
Subject: Re: Storm Drpc Performance



This could help:
https://github.com/mykidong/storm-finagle-drpc-client




 2014-07-11 11:29 GMT+09:00 /pig五餐肉 <76006...@qq.com>:
 Hi All,

I am using a Storm DRPC topology to do data sorting and filtering.
The input stream is a sequence of data strings.
Everything is fine when I use the DrpcClient class to send strings to our DRPC
server in a single thread; the latency is about 65 ms.
But when I send data in multiple threads (about 200 threads) at the same time,
the latency is high; some threads take more than 1000 ms.

I have also tried the official demo BasicDRPCTestTopology;
the issue is the same.

env: 11 machines (7 GB RAM, 8 cores)

Is there a way to address this issue?

Re: Unique tuple id

2014-07-11 Thread Richard Burkhardt
I was just wondering because ShellBolt generates a unique id for
anchor recognition
(https://github.com/apache/incubator-storm/blob/master/storm-core/src/jvm/backtype/storm/task/ShellBolt.java#L171),
where I thought tuple.getMessageId().hashCode() should also do it, or am I
wrong?


On Wednesday, July 9, 2014, 22:23:46, Danijel Schiavuzzi wrote:

Yes. Check out the Tuple class.

On Wednesday, July 9, 2014, Richard Burkhardt  wrote:

I'm trying to develop something like a remote bolt. To register
the tuples with my remote server, I need a unique tuple id to
ack/fail a tuple asynchronously. Is there something like a unique
id for each emitted tuple?



--
Danijel Schiavuzzi

E: dani...@schiavuzzi.com 
W: www.schiavuzzi.com 
T: +385989035562
Skype: danijels7


Storm UI: not displayed Executors

2014-07-11 Thread 川原駿
Hello.

Recently I upgraded Storm to 0.9.2.

On the Component summary page of the Storm UI, the Executors table is not
displayed when a spout/bolt's "emitted" count is 0.

Please tell me the solution.

Thanks.


Vertica & Storm Error

2014-07-11 Thread satyajit sa
Hi,

I have tried a simple program with Storm for inserting data into Vertica
tables, but I am facing the error below:

ERROR org.apache.zookeeper.server.NIOServerCnxnFactory - Thread
Thread[main,5,main] died
java.lang.RuntimeException: java.io.NotSerializableException:
com.vertica.jdbc.VerticaConnectionImpl
at backtype.storm.utils.Utils.serialize(Utils.java:81)
~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at
backtype.storm.topology.TopologyBuilder.createTopology(TopologyBuilder.java:106)
~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at com.wordcount.example.HelloStorm.main(HelloStorm.java:21) ~[bin/:na]
Caused by: java.io.NotSerializableException:
com.vertica.jdbc.VerticaConnectionImpl
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
~[na:1.7.0_51]
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
~[na:1.7.0_51]
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
~[na:1.7.0_51]
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
~[na:1.7.0_51]
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
~[na:1.7.0_51]
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
~[na:1.7.0_51]
at backtype.storm.utils.Utils.serialize(Utils.java:77)
~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
... 2 common frames omitted

com.vertica.jdbc.VerticaConnectionImpl is from the Vertica JDBC driver.

Is there any other way to avoid this problem and proceed without errors?


Attachments: HelloStorm.java, LineReaderSpout.java, VerticaCON.java,
VerticaInsert.java, WordSpitterBolt.java, xmlParser.java


Re: Storm Drpc Performance

2014-07-11 Thread Derek Dagit

drpc.worker.threads defaults to 64.

Maybe try increasing this?
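
For reference, this is set in storm.yaml on the DRPC server nodes; 128 below is an arbitrary example value, and the drpc daemon needs a restart to pick it up.

# storm.yaml on the DRPC servers
drpc.worker.threads: 128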

--
Derek




Re: Storm UI: not displayed Executors

2014-07-11 Thread Derek Dagit

This should be fixed by either STORM-370 (merged) or STORM-379 (pull request
open).

The script that sorts the table did not check that the list was non-empty
before attempting to sort it, and that resulted in an exception that halted
subsequent rendering of the page.

You can:
- check out the latest Storm and use that, or
- cherry-pick commit 31c786c into the version you are using.
--
Derek




Trident topology design approach

2014-07-11 Thread Tomas Mazukna
I am new to Trident. So far I have implemented multiple Storm topologies
for transaction processing, where messages get processed and saved to HBase,
with one message generating one save.

Now I need to build a topology to calculate transaction timings. The stream
has separate events for transaction start and end, so I need to keep the
begin time and match it up with the end time to get the duration.

What would be the best approach for implementing this in Trident?

Thanks,
-- 
Tomas Mazukna


Re: Vertica & Storm Error

2014-07-11 Thread Nathan Leung
As Vladi says, you should initialize the connection in prepare(), not in
the constructor.  Any objects created in the constructor are serialized and
sent to Nimbus, so you will get this exception if you create anything that
is not serializable in your spout/bolt constructors.
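
A minimal sketch of the pattern being described (the JDBC URL, table, and column names are placeholders, not from the attached code): only serializable settings live in the constructor, while the non-serializable Connection is marked transient and opened in prepare(), which runs on the worker after deserialization.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class VerticaInsertBolt extends BaseRichBolt {
    private final String jdbcUrl;             // e.g. "jdbc:vertica://host:5433/db"
    private final String user;
    private final String password;
    private transient Connection connection;  // never serialized with the topology
    private OutputCollector collector;

    public VerticaInsertBolt(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
    }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            // opened on the worker, after deserialization
            connection = DriverManager.getConnection(jdbcUrl, user, password);
        } catch (Exception e) {
            throw new RuntimeException("failed to open Vertica connection", e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        try (PreparedStatement ps =
                 connection.prepareStatement("INSERT INTO words (word) VALUES (?)")) {
            ps.setString(1, tuple.getString(0));
            ps.executeUpdate();
            collector.ack(tuple);
        } catch (Exception e) {
            collector.fail(tuple);  // let the tuple be replayed
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}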




Re: Vertica & Storm Error

2014-07-11 Thread Vladi Feigin
You're probably trying to pass the VerticaConnectionImpl object as part of
a Storm tuple, but it's not serializable; this is why it fails.
Don't pass it - just initialize it in the persistence bolt (the bolt
writing into Vertica) in the prepare method.
Vladi




debugging tuple failure/redelivery

2014-07-11 Thread Vinay Pothnis
Hello,

I have a Storm cluster with 5 nodes and spouts reading from a RabbitMQ
source. I am seeing strange behaviour where some of the tuples are
failing and getting redelivered from RabbitMQ.

The logs on the individual Storm nodes do not have any exceptions or other
obvious indicators. I believe I am acking from all the bolts that are
part of the topology.

I increased the timeout from the default 30s to 5 minutes. Still, some
messages are failing and getting redelivered. Any pointers as to how to
debug this and figure out the root cause?

If I look at the admin console for a particular queue on RabbitMQ, I see
bursts of deliveries, acknowledgements and redeliveries. For example, if my
message timeout is 30s, I see bursts at an interval of 1 minute; if my
message timeout is 1 minute, I see bursts at an interval of 2 minutes. Any
pointers/suggestions regarding this issue?

Thanks
Vinay


Re: could assign Topology execute in some supervisor

2014-07-11 Thread Nathan Leung
As a sanity check, did you restart your Nimbus?


On Thu, Jul 10, 2014 at 9:53 PM, chenlax  wrote:

> Nathan Leung,
> i add the isolation conf as follow:
>
> storm.scheduler: "backtype.storm.scheduler.IsolationScheduler"
>
> isolation.scheduler.machines:
> "hellotest": 1
>
> and submitting the topology doesn't produce any error or warning log.
>
> Thanks,
> Lax
>
>
> --
> Date: Thu, 10 Jul 2014 10:32:10 -0400
>
> Subject: Re: could assign Topology execute in some supervisor
> From: ncle...@gmail.com
> To: user@storm.incubator.apache.org
>
>
> Your yaml doesn't appear to be properly formatted.  Either that, or your
> email client stripped some characters.
>
>
> On Thu, Jul 10, 2014 at 4:16 AM, chenlax  wrote:
>
> and submitting other topologies also fails to get workers.
>
> Thanks,
> Lax
>
>
> --
> From: lax...@hotmail.com
> To: user@storm.incubator.apache.org
>
> Subject: RE: could assign Topology execute in some supervisor
> Date: Thu, 10 Jul 2014 16:14:12 +0800
>
>
> I tried using the isolation scheduler, but when I submit the topology, it
> can't get any workers.
>
> Cluster Summary: Version 0.8.2, Nimbus uptime 4m 12s, Supervisors 3, Used
> slots 0, Free slots 42, Total slots 42, Executors 0, Tasks 0
> Topology summary: Name hellotest, Id hellotest-2-1404979653, Status ACTIVE,
> Uptime 58s, Num workers 0, Num executors 0, Num tasks 0
>
> Nimbus conf:
> isolation.scheduler.machines{"hellotest" 1}
>
> Thanks,
> Lax
>
>
> --
> Date: Tue, 8 Jul 2014 09:07:27 -0400
> Subject: RE: could assign Topology execute in some supervisor
> From: ncle...@gmail.com
> To: user@storm.incubator.apache.org
>
> Does the isolation scheduler suit your needs?
> https://storm.incubator.apache.org/2013/01/11/storm082-released.html
> On Jul 8, 2014 8:53 AM, "chenlax"  wrote:
>
>
> I mean assigning 2 supervisor machines to execute a topology, where the
> number of workers is more than 2 - maybe 10 or more.
>
> Thanks,
> Lax
>
>
> --
> Date: Tue, 8 Jul 2014 08:11:30 -0400
> Subject: Re: could assign Topology execute in some supervisor
> From: tomas.mazu...@gmail.com
> To: user@storm.incubator.apache.org
>
> yes,
>
> config.setNumWorkers(2);
>
> this will run 2 worker processes, usually on 2 separate supervisor
> machines.
>
>
> On Tue, Jul 8, 2014 at 6:12 AM, chenlax  wrote:
>
> The Storm cluster has 8 supervisors. I want to assign a topology to run on
> 2 supervisors, and have those 2 supervisors execute only that topology.
> How can I do it?
>
> Thanks,
> Lax
>
>
>
>
> --
> Tomas Mazukna
> 678-557-3834
>
>
>
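
For reference, a correctly indented storm.yaml sketch of the configuration discussed in this thread (the topology name is the one from the thread): the scheduler settings are read by Nimbus, so Nimbus must be restarted after the change, and YAML requires the indented line under the map key.

# storm.yaml on the nimbus host
storm.scheduler: "backtype.storm.scheduler.IsolationScheduler"
isolation.scheduler.machines:
  "hellotest": 1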


Re: Storm UI

2014-07-11 Thread Harsha




Storm UI provides metrics about the topologies on the cluster: the
number of tuples emitted and transferred, any last known errors,
etc.

You can start the Storm UI by running STORM_HOME/bin/storm ui,
which runs a daemon on port 8080. If you hover over the table
headers in the Storm UI, it will show you text describing that
particular value.

If you are trying to add custom metrics to your topology, please
refer to this
page [1]http://www.bigdata-cookbook.com/post/72320512609/storm-
metrics-how-to
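
As a concrete example, a minimal sketch of registering a custom metric in a bolt, assuming storm-core 0.9.x (the class and metric names are placeholders). The metric is reported to any registered metrics consumer every 60 seconds:

import java.util.Map;

import backtype.storm.metric.api.CountMetric;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class CountingBolt extends BaseRichBolt {
    private transient CountMetric processedCounter;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        processedCounter = new CountMetric();
        // report this metric every 60 seconds
        context.registerMetric("processed-tuples", processedCounter, 60);
    }

    @Override
    public void execute(Tuple tuple) {
        processedCounter.incr();  // count every processed tuple
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}

To see the values in the worker logs, the topology can register the built-in consumer with conf.registerMetricsConsumer(backtype.storm.metric.LoggingMetricsConsumer.class, 1).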




References

1. http://www.bigdata-cookbook.com/post/72320512609/storm-metrics-how-to


Messages in fly

2014-07-11 Thread Haralds Ulmanis
Does anyone know how to look up the current number of messages in flight?
I'm pushing messages to a spout, and I'd like some logic to tell me when the
cluster is too busy.


writing huge amount of data to HDFS

2014-07-11 Thread Chen Wang
Hi, Guys,
I have a Storm topology with a single-threaded bolt querying a large amount
of data (from Elasticsearch), which emits to an HBase bolt (10 threads) that
does some filtering and then emits to an Avro bolt (10 threads). The Avro
bolt simply emits the tuple to an Avro client, which is received by two
Flume nodes that then sink the data into HDFS. I am testing in local mode.

In the query bolt, I am getting around 15000 entries in a batch. The query
itself takes about 4 seconds; however, the emit method in the query bolt
takes about 20 seconds. Does this mean that the downstream bolts (the HBase
bolt and the Avro bolt) cannot keep up with the query bolt?

How can I tune my topology to make this process as fast as possible? I
tried increasing the HBase threads to 20, but it does not seem to help.

I use shuffleGrouping from the query bolt to the HBase bolt, and from the
HBase bolt to the Avro bolt.

Thanks for any advice.
Chen


Re: writing huge amount of data to HDFS

2014-07-11 Thread Chen Wang
A typo in my previous email:
the emit method in the query bolt takes about 200 (instead of 20) seconds.




Re: writing huge amount of data to HDFS

2014-07-11 Thread Sam Goodwin
Can you show some code? 200 seconds for 15K puts sounds like you're not
batching.




Re: writing huge amount of data to HDFS

2014-07-11 Thread Chen Wang
Here you go:
https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
It's actually pretty straightforward. The only thing worth mentioning is
that I use another thread in the ES bolt to do the actual query and tuple
emit.
Thanks for looking.
Chen





Re: debugging tuple failure/redelivery

2014-07-11 Thread Vinay Pothnis
Hello,

Just to close this thread out and to update:

The issue was with the ZK cluster. There was a connectivity issue among the
ZK nodes, and this was the cause of the redeliveries and failures to ack.

Once the connectivity issue was resolved, the messages were processed
without any redeliveries.

Thanks
Vinay




Re: writing huge amount of data to HDFS

2014-07-11 Thread Chen Wang
Here is the output from the ES query bolt. "Total execution time for this
batch: 179655" (in milliseconds) is the time measured around the .emit call.
As you can see, emitting 14000 entries takes anywhere from 145231 to 18


 INFO  com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=14000 hits=14000 took=26172
40813 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-13_00-00-00
40889 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 782
40890 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 4000 records
59335 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
59335 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=28000 hits=14000 took=18033
238920 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-14_00-00-00
238990 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 179655
238990 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 8000 records
257633 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
257633 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=42000 hits=14000 took=17926
260932 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-15_00-00-00
402852 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-16_00-00-00
402865 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 145231
402865 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 2000 records
417427 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
417427 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=56000 hits=14000 took=13962
417459 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-17_00-00-00
417493 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 66
417493 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 6000 records
429629 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
429629 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=7 hits=14000 took=12009
441208 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-18_00-00-00
744276 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-19_00-00-00
744277 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 314647
744277 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 0 records
779030 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
779030 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=84000 hits=14000 took=34631
785315 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-20_00-00-00
785332 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 6302
785332 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 4000 records
811859 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
811859 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=98000 hits=14000 took=25806
945938 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-21_00-00-00
960308 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 148449
960308 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 8000

Storm 0.9.2 Performance on EC2 Larges

2014-07-11 Thread Gary Malouf
Hi everyone,

We've been banging our heads against the wall trying to get reasonable
performance out of a small storm cluster.

Setup after stripping down trying to debug:

- All servers on EC2 m3.larges
- 2 Kestrel 2.4.1 queue servers
- 3 Storm Servers (1 running ui + nimbus, all running supervisors and thus
workers)
- 2 workers per instance, workers get 2GB of RAM max
- 1 topology with 2 KestrelSpouts

We measure performance by doing the following:

- loading up the queues with a couple million items in each
- deploying the topology
- pulling up the storm ui and tracking the changes in ack counts over time
on the spouts to compute average throughputs


With acking enabled on our spouts, we were getting around 4k messages/second.
With acking disabled on our spouts, we were seeing around 6k messages/second.


Adding a few bolts with acking quickly brings performance down below 800
messages/second - pretty dreadful.  Based on the reports many other people
have posted about their Storm clusters, I find these numbers really
disappointing.  We've tried tuning the worker JVM options and the number of
workers/executors with this simple setup, but could not squeeze anything
more out.

Does anyone have any further suggestions about where we should be looking?
 We are about ready to pull Storm out of production and roll our own
processor.

Thanks,

Gary


Re: writing huge amount of data to HDFS

2014-07-11 Thread Harsha
Hi Chen,

  I looked at your code. The first part is inside a bolt's
execute method? It looks like it is fetching all the data (1 per
call) from Elasticsearch and emitting each value from inside the
execute method, which ends when the ES result set runs out.

It doesn't look like you followed Storm's conventions here; was
there any reason not to use a spout? A bolt's execute method
gets called for every tuple that is passed to it. Docs on
spouts &
bolts: [1]https://storm.incubator.apache.org/documentation/Concepts.html



From your comment in the code, "1 hits per shard will be
returned for each scroll": if it is taking long to read 1
records from ES, I would suggest reducing this batch size.
The idea is that you make quicker calls to ES, push the data
downstream, and make another call to ES for the next batch,
instead of acquiring one big batch in a single call.



 "I am getting around 15000 entries in a batch; the query itself
takes about 4 seconds; however, the emit method in the query
bolt takes about 20 seconds." Can you try reducing the batch
size here too? It looks like the time is spent emitting 15k
entries in one go.



  Was there any reason for using Flume to write to HDFS?
If not, I would recommend
using the [2]https://github.com/ptgoetz/storm-hdfs bolt.
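
For reference, a sketch of wiring up that bolt, following the storm-hdfs README (the HDFS URL, output path, and delimiter are placeholders; the API is the one documented in that repository, so treat this as a sketch rather than a definitive recipe):

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

// sync to HDFS every 1000 tuples, rotate output files at 5 MB
SyncPolicy syncPolicy = new CountSyncPolicy(1000);
FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(5.0f, Units.MB);
FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/storm/out/");
RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");

HdfsBolt hdfsBolt = new HdfsBolt()
        .withFsUrl("hdfs://namenode:8020")
        .withFileNameFormat(fileNameFormat)
        .withRecordFormat(format)
        .withRotationPolicy(rotationPolicy)
        .withSyncPolicy(syncPolicy);

This would replace the Avro-client/Flume hop with a bolt that appends directly to HDFS.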








Re: Storm 0.9.2 Performance on EC2 Larges

2014-07-11 Thread Andrew Montalenti
What's limiting your throughput? Your e-mail doesn't have enough
information to make a diagnosis.

Whether 4k or 6k processed messages per second is "fast" depends on a lot
of factors -- average message size, parallelism, hardware, batching
approach, etc.

P. Taylor Goetz has a nice slide presentation discussing various factors to
think about when scaling Storm topologies for throughput:

http://www.slideshare.net/ptgoetz/scaling-storm-hadoop-summit-2014

One trick I tend to use to identify throughput bottlenecks is to lay out a
topology with mock bolts that do nothing but "pass tuples through",
configured identically to my actual topology from a partitioning /
parallelism standpoint. Then I see how much throughput I get simply piping
tuples from the spout through that mock topology. This can often help you
find issues such as performance bugs originating at the spout,
acking/emitting bugs, or other similar problems. It can also let you remove
some components from your topology to performance-test them in isolation.
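
A minimal sketch of such a pass-through bolt (assuming storm-core's BaseRichBolt; the declared field name is a placeholder, and it assumes upstream components emit a single field). It re-emits each tuple's values anchored to the input, so acking behaviour stays realistic:

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;

public class PassThroughBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        collector.emit(tuple, tuple.getValues());  // anchored, to keep the ack tree realistic
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("payload"));   // placeholder; match the upstream field count
    }
}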

You can also review this recent JIRA ticket about improvements to the Netty
transport. Not only is a lot of engineering effort going into Storm's
performance at scale, but the benchmarks listed there show throughput of
several hundred thousand messages per second, saturating cores and network
on the topology machines.

https://issues.apache.org/jira/browse/STORM-297

Please don't roll your own stream processor -- the world doesn't need
another. :-D Something is likely wrong with the topology's layout and I'm
sure it's fixable.

HTH,

---
Andrew Montalenti
Co-Founder & CTO
http://parse.ly


