Re: Native Streaming Window API

2016-10-05 Thread Tech Id
Thanks a lot, Koji. Appreciate the effort!

On Wed, Oct 5, 2016 at 9:26 AM, Koji Ishida <kishid...@gmail.com> wrote:

> Hi.
>
> Sorry for late reply.
>
> I translated my previous post into English.
>
> I hope it is helpful for friends all over the world :)
>
> http://qiita.com/kojiisd/items/1618cf8d79bb5ed995d3#english-translation
>
> Regards,
> Koji Ishida
>
> On 2016/09/15 9:50, Koji Ishida <kishid...@gmail.com> wrote:
>
> Hi T.I.
>
> >Any chance you want to make it helpful for the entire world, rather than
> just my Japanese friends ;) ?
>
> OK.
>
> My peers and I have now started translating
> my post for the entire world, so wait a bit :)
>
> Thanks,
> Koji.
>
> On Wed, Sep 14, 2016 at 13:58, Tech Id <tech.login@gmail.com> wrote:
>
>> Thank you Koji for the links.
>>
>> We do not understand Japanese :( but it seems your blog post is really
>> useful (just looking at the code and logs).
>> Any chance you want to make it helpful for the entire world, rather than
>> just my Japanese friends ;) ?
>>
>> On Tue, Sep 13, 2016 at 6:21 PM, Koji Ishida <kishid...@gmail.com> wrote:
>>
>>> Hello, T.I.
>>>
>>> Documentation for the Native Streaming API now exists on the Apache Storm
>>> site as Windowing.
>>>
>>> http://storm.apache.org/releases/1.0.2/Windowing.html
>>>
>>> There you can find the specification for the Native Streaming API.
>>>
>>>
>>> If you or your peers can understand Japanese, you can read my post. I
>>> wrote an explanation of the Native Streaming API with basic examples.
>>>
>>> http://qiita.com/kojiisd/items/1618cf8d79bb5ed995d3
>>>
>>> Regards,
>>> Koji Ishida
>>>
>>>
>>> On Wednesday, September 14, 2016, Tech Id <tech.login@gmail.com> wrote:
>>>
>>> Hi,
>>>>
>>>> I want to use the Native Streaming Window API introduced in Storm 1.0.0
>>>>
>>>> Is there some documentation for it, along with a high-level design or an
>>>> example?
>>>>
>>>> Thanks
>>>> T.I.
>>>>
>>>>
>>
>


Re: How will storm replay the tuple tree?

2016-09-13 Thread Tech Id
Thanks Ambud,

I did read some very good things about the acking mechanism in Storm, but I
am not sure it explains why point-to-point tracking is expensive.

Consider the example: Spout --> BoltA --> BoltB.

If BoltB fails, it will report the failure to the acker.
If the acker can ask the Spout to replay, then why can't the acker ask the
parent of BoltB to replay at this point?
I don't think keeping track of a bolt's parent would be expensive.


On a related note, I am a little confused about a statement "When a new
tupletree is born, the spout sends the XORed edge-ids of each tuple
recipient, which the acker records in its pending ledger" in
Acking-framework-implementation.html
<http://storm.apache.org/releases/current/Acking-framework-implementation.html>
.
How does the spout know beforehand which bolts will receive the tuple?
Bolts forward tuples to other bolts based on groupings and dynamically
generated fields. How does the spout know what fields will be generated and
which bolts will receive the tuples? And if it does not know that, how does
it send the XOR of each tuple recipient in a tuple's path, given that each
tuple's path will be different (I think, not sure though)?
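
My current mental model is something like the sketch below (purely
illustrative, NOT Storm's actual code): each participant XORs the ids of
the tuples it acks and the tuples it emits into a single 64-bit ledger per
spout tuple, so the ledger reaches zero exactly when every tuple in the
tree has been acked, and the acker never needs to know the path up front.
Is that roughly right?

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative acker-style XOR ledger (not Storm's code). init() records
    // the ids of the edges the spout emitted; each ack() XORs in the id of
    // the edge being acked plus the ids of any edges emitted in its place.
    // XOR is commutative, so arrival order does not matter: the ledger hits
    // zero exactly when every edge id has been XORed in twice.
    public class AckerLedgerSketch {
        private final Map<Long, Long> pending = new HashMap<>();

        public void init(long rootId, long... spoutEdgeIds) {
            long v = 0;
            for (long e : spoutEdgeIds) v ^= e;
            pending.put(rootId, v);
        }

        public void ack(long rootId, long ackedEdgeId, long... emittedEdgeIds) {
            long v = pending.get(rootId) ^ ackedEdgeId;
            for (long e : emittedEdgeIds) v ^= e;
            if (v == 0) pending.remove(rootId); // tuple tree fully processed
            else pending.put(rootId, v);
        }
    }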


Thx,
T.I.


On Tue, Sep 13, 2016 at 6:37 PM, Ambud Sharma <asharma52...@gmail.com>
wrote:

> Here is a post on it: https://bryantsai.com/fault-tolerant-message-processing-in-storm/.
>
> Point to point tracking is expensive unless you are using transactions.
> Flume does point to point transfers using transactions.
>
> On Sep 13, 2016 3:27 PM, "Tech Id" <tech.login@gmail.com> wrote:
>
>> I agree with this statement about code/architecture, but when some system
>> has an outage, such as one of the end-points (Solr, Couchbase,
>> Elastic-Search, etc.) being down temporarily, a very large number of other
>> fully-functional and healthy systems will receive a large number of
>> duplicate replays (especially in high-throughput topologies).
>>
>> If you can elaborate a little more on the performance cost of tracking
>> tuples, or point to a document that explains it, that would be a great
>> help.
>>
>> Best,
>> T.I.
>>
>> On Tue, Sep 13, 2016 at 12:26 PM, Hart, James W. <jwh...@seic.com> wrote:
>>
>>> Failures should be very infrequent; if they are not, then rethink the
>>> code and architecture.  The performance cost of tracking tuples in the way
>>> that would be required to replay at the failure is large; basically, that
>>> method would slow everything way down for very infrequent failures.
>>>
>>>
>>>
>>> *From:* S G [mailto:sg.online.em...@gmail.com]
>>> *Sent:* Tuesday, September 13, 2016 3:17 PM
>>> *To:* user@storm.apache.org
>>> *Subject:* Re: How will storm replay the tuple tree?
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> I am a little curious to know why we begin at the spout level for case 1.
>>>
>>> If we replay at the failing bolt's parent level (BoltA in this case),
>>> then it should be more performant due to a decrease in duplicate processing
>>> (as compared to whole tuple tree replays).
>>>
>>>
>>>
>>> Only if BoltA itself crashes while replaying should the Spout receive this
>>> as a failure, and only then should the whole tuple tree be replayed.
>>>
>>>
>>>
>>> This saving in duplicate processing will be more visible with several
>>> layers of bolts.
>>>
>>>
>>>
>>> I am sure there is a good reason to replay the whole tuple-tree, and I
>>> would like to know it.
>>>
>>>
>>>
>>> Thanks
>>>
>>> SG
>>>
>>>
>>>
>>> On Tue, Sep 13, 2016 at 10:22 AM, P. Taylor Goetz <ptgo...@gmail.com>
>>> wrote:
>>>
>>> Hi Cheney,
>>>
>>>
>>>
>>> Replays happen at the spout level. So if there is a failure at any point
>>> in the tuple tree (the tuple tree being the anchored emits, unanchored
>>> emits don’t count), the original spout tuple will be replayed. So the
>>> replayed tuple will traverse the topology again, including unanchored
>>> points.
>>>
>>>
>>>
>>> If an unanchored tuple fails downstream, it will not trigger a replay.
>>>
>>>
>>>
>>> Hope this helps.
>>>
>>>
>>>
>>> -Taylor
>>>
>>>
>>>
>>>
>>>
>>> On Sep 13, 2016, at 4:42 AM, Cheney Chen <tbcql1...@gmail.com> wrote:
>>>
>>>
>>>
>>> Hi there,
>>>
>>>
>>>
>>> We're using storm 1.0.1, and I'm checking through
>>> http://storm.apache.org/releases/1.0.1/Guaranteeing-message-processing.html
>>>
>>>
>>>
>>> I have questions about the two scenarios below.
>>>
>>> Assume topology: S (spout) --> BoltA --> BoltB
>>>
>>> 1. S: anchored emit, BoltA: anchored emit
>>>
>>> Suppose BoltB's processing fails (with acking enabled): what will the
>>> replay be? Will it execute both BoltA and BoltB, or only the failed BoltB
>>> processing?
>>>
>>>
>>>
>>> 2. S: anchored emit, BoltA: unanchored emit
>>>
>>> Suppose BoltB's processing fails (with acking enabled): a replay will not happen, correct?
>>>
>>>
>>>
>>> --
>>>
>>> Regards,
>>> Qili Chen (Cheney)
>>>
>>> E-mail: tbcql1...@gmail.com
>>> MP: (+1) 4086217503
>>>
>>>
>>>
>>>
>>>
>>
>>


Re: Native Streaming Window API

2016-09-13 Thread Tech Id
Thank you Koji for the links.

We do not understand Japanese :( but it seems your blog post is really
useful (just looking at the code and logs).
Any chance you want to make it helpful for the entire world, rather than
just my Japanese friends ;) ?

On Tue, Sep 13, 2016 at 6:21 PM, Koji Ishida <kishid...@gmail.com> wrote:

> Hello, T.I.
>
> Documentation for the Native Streaming API now exists on the Apache Storm
> site as Windowing.
>
> http://storm.apache.org/releases/1.0.2/Windowing.html
>
> There you can find the specification for the Native Streaming API.
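>
> As a quick taste, a sliding-window bolt built on that API looks roughly
> like the sketch below (check the Windowing page above for the
> authoritative details; the wiring comment assumes a spout named "spout"):
>
>     import java.util.Map;
>     import org.apache.storm.task.OutputCollector;
>     import org.apache.storm.task.TopologyContext;
>     import org.apache.storm.topology.OutputFieldsDeclarer;
>     import org.apache.storm.topology.base.BaseWindowedBolt;
>     import org.apache.storm.tuple.Fields;
>     import org.apache.storm.tuple.Tuple;
>     import org.apache.storm.tuple.Values;
>     import org.apache.storm.windowing.TupleWindow;
>
>     // Sketch: sums the first field of every tuple in the current window.
>     public class SlidingSumBolt extends BaseWindowedBolt {
>         private OutputCollector collector;
>
>         @Override
>         public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
>             this.collector = collector;
>         }
>
>         @Override
>         public void execute(TupleWindow window) {
>             int sum = 0;
>             for (Tuple t : window.get()) { // all tuples in the window
>                 sum += t.getInteger(0);
>             }
>             collector.emit(new Values(sum));
>         }
>
>         @Override
>         public void declareOutputFields(OutputFieldsDeclarer declarer) {
>             declarer.declare(new Fields("sum"));
>         }
>     }
>
>     // Wiring: a window of 30 tuples that slides every 10 tuples:
>     // builder.setBolt("sum", new SlidingSumBolt()
>     //         .withWindow(new BaseWindowedBolt.Count(30), new BaseWindowedBolt.Count(10)), 1)
>     //        .shuffleGrouping("spout");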
>
>
> If you or your peers can understand Japanese, you can read my post. I
> wrote an explanation of the Native Streaming API with basic examples.
>
> http://qiita.com/kojiisd/items/1618cf8d79bb5ed995d3
>
> Regards,
> Koji Ishida
>
>
> On Wednesday, September 14, 2016, Tech Id <tech.login@gmail.com> wrote:
>
> Hi,
>>
>> I want to use the Native Streaming Window API introduced in Storm 1.0.0
>>
>> Is there some documentation for it, along with a high-level design or an
>> example?
>>
>> Thanks
>> T.I.
>>
>>


Re: How will storm replay the tuple tree?

2016-09-13 Thread Tech Id
I agree with this statement about code/architecture, but when some system
has an outage, such as one of the end-points (Solr, Couchbase,
Elastic-Search, etc.) being down temporarily, a very large number of other
fully-functional and healthy systems will receive a large number of
duplicate replays (especially in high-throughput topologies).

If you can elaborate a little more on the performance cost of tracking
tuples, or point to a document that explains it, that would be a great
help.

Best,
T.I.

On Tue, Sep 13, 2016 at 12:26 PM, Hart, James W.  wrote:

> Failures should be very infrequent; if they are not, then rethink the code
> and architecture.  The performance cost of tracking tuples in the way that
> would be required to replay at the failure is large; basically, that method
> would slow everything way down for very infrequent failures.
>
>
>
> *From:* S G [mailto:sg.online.em...@gmail.com]
> *Sent:* Tuesday, September 13, 2016 3:17 PM
> *To:* user@storm.apache.org
> *Subject:* Re: How will storm replay the tuple tree?
>
>
>
> Hi,
>
>
>
> I am a little curious to know why we begin at the spout level for case 1.
>
> If we replay at the failing bolt's parent level (BoltA in this case), then
> it should be more performant due to a decrease in duplicate processing (as
> compared to whole tuple tree replays).
>
>
>
> Only if BoltA itself crashes while replaying should the Spout receive this
> as a failure, and only then should the whole tuple tree be replayed.
>
>
>
> This saving in duplicate processing will be more visible with several
> layers of bolts.
>
>
>
> I am sure there is a good reason to replay the whole tuple-tree, and I
> would like to know it.
>
>
>
> Thanks
>
> SG
>
>
>
> On Tue, Sep 13, 2016 at 10:22 AM, P. Taylor Goetz 
> wrote:
>
> Hi Cheney,
>
>
>
> Replays happen at the spout level. So if there is a failure at any point
> in the tuple tree (the tuple tree being the anchored emits, unanchored
> emits don’t count), the original spout tuple will be replayed. So the
> replayed tuple will traverse the topology again, including unanchored
> points.
>
>
>
> If an unanchored tuple fails downstream, it will not trigger a replay.
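>
> In code, the difference is just whether the input tuple is passed as the
> anchor when emitting. A sketch of BoltA showing both styles (standard 1.x
> bolt API):
>
>     import java.util.Map;
>     import org.apache.storm.task.OutputCollector;
>     import org.apache.storm.task.TopologyContext;
>     import org.apache.storm.topology.OutputFieldsDeclarer;
>     import org.apache.storm.topology.base.BaseRichBolt;
>     import org.apache.storm.tuple.Fields;
>     import org.apache.storm.tuple.Tuple;
>     import org.apache.storm.tuple.Values;
>
>     public class BoltA extends BaseRichBolt {
>         private OutputCollector collector;
>
>         @Override
>         public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
>             this.collector = collector;
>         }
>
>         @Override
>         public void execute(Tuple input) {
>             // Anchored: the emitted tuple joins input's tuple tree, so a
>             // failure anywhere downstream (e.g. in BoltB) replays the
>             // original spout tuple.
>             collector.emit(input, new Values(input.getValue(0)));
>
>             // Unanchored alternative: not tracked by the acker, so if this
>             // tuple fails in BoltB, no replay is triggered.
>             // collector.emit(new Values(input.getValue(0)));
>
>             collector.ack(input);
>         }
>
>         @Override
>         public void declareOutputFields(OutputFieldsDeclarer declarer) {
>             declarer.declare(new Fields("value"));
>         }
>     }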
>
>
>
> Hope this helps.
>
>
>
> -Taylor
>
>
>
>
>
> On Sep 13, 2016, at 4:42 AM, Cheney Chen  wrote:
>
>
>
> Hi there,
>
>
>
> We're using storm 1.0.1, and I'm checking through
> http://storm.apache.org/releases/1.0.1/Guaranteeing-message-processing.html
>
>
>
> I have questions about the two scenarios below.
>
> Assume topology: S (spout) --> BoltA --> BoltB
>
> 1. S: anchored emit, BoltA: anchored emit
>
> Suppose BoltB's processing fails (with acking enabled): what will the
> replay be? Will it execute both BoltA and BoltB, or only the failed BoltB
> processing?
>
>
>
> 2. S: anchored emit, BoltA: unanchored emit
>
> Suppose BoltB's processing fails (with acking enabled): a replay will not happen, correct?
>
>
>
> --
>
> Regards,
> Qili Chen (Cheney)
>
> E-mail: tbcql1...@gmail.com
> MP: (+1) 4086217503
>
>
>
>
>


Native Streaming Window API

2016-09-13 Thread Tech Id
Hi,

I want to use the Native Streaming Window API introduced in Storm 1.0.0

Is there some documentation for it, along with a high-level design or an
example?

Thanks
T.I.


Question on SolrUpdateBolt

2016-04-15 Thread Tech Id
Hi,

I have a question on the SolrUpdateBolt.execute() method.

It seems that SolrUpdateBolt sends every tuple to Solr in the execute()
method but issues a commit() only after a specified number of documents
have been sent.

Would it be better to batch the documents in memory and then send them to
Solr in one request?

I am drawing inspiration from another very popular search-engine bolt,
EsBolt, which keeps tuples in memory, sends them as one batch request, and
then acks() or fails() them based on that single batch request's outcome.

Here are some pointers into the EsBolt code path that show how they do it:

EsBolt.execute() --> RestRepository.writeToIndex() --> RestRepository.doWriteToIndex()

If we do the same in SolrUpdateBolt, the number of HTTP calls is reduced by
a factor of N, where N is the batch size of the request; that would be a
good performance boost IMO.
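
For illustration, here is a rough sketch of what I mean. This is a
hypothetical bolt, not the actual SolrUpdateBolt code, and
sendBatchToSolr() is a made-up helper:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    // Sketch: buffer tuples in memory, send one request per batch, and
    // ack/fail the whole batch based on that single request's outcome.
    public class BatchingSolrBoltSketch extends BaseRichBolt {
        private static final int BATCH_SIZE = 500; // assumed; tune per workload
        private final List<Tuple> pending = new ArrayList<>();
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            pending.add(input);
            if (pending.size() >= BATCH_SIZE) {
                try {
                    sendBatchToSolr(pending); // one HTTP call for N documents
                    for (Tuple t : pending) collector.ack(t);
                } catch (Exception e) {
                    for (Tuple t : pending) collector.fail(t); // batch replays
                } finally {
                    pending.clear();
                }
            }
        }

        private void sendBatchToSolr(List<Tuple> batch) throws Exception {
            // hypothetical: convert tuples to docs and POST one update request
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no output streams
        }
    }

A real version would also flush on a timer or tick tuple so a partial batch
does not wait indefinitely.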

Thanks,
Tid


Too many errors in Kafka-Spout / Es-Bolt combination

2016-03-31 Thread Tech Id
Hi,

I see an increasing number of tuple failures when increasing the Es-Bolt
parallelism beyond 20.

One error I frequently see in the storm's worker logs is:

10:56:36.145 s.k.KafkaUtils [INFO] Task [4/6] assigned [Partition{host=
kafka-broker01.xyz.com:9092, partition=3}, Partition{host=
kafka-broker01.xyz.com:9092, partition=9}, Partition{host=
kafka-broker01.xyz.com:9092, partition=15}, Partition{host=
kafka-broker02.xyz.com:9092, partition=21}, Partition{host=
kafka-broker00.xyz.com:9092, partition=27}]

10:56:36.145 s.k.ZkCoordinator [INFO] Task [4/6] Deleted partition
managers: []

10:56:36.145 s.k.ZkCoordinator [INFO] Task [4/6] New partition managers: []

10:56:36.145 s.k.ZkCoordinator [INFO] Task [4/6] Finished refreshing

10:56:37.929 s.k.KafkaUtils [WARN] Got fetch request with offset out of
range: [381679652030]

10:56:37.930 s.k.PartitionManager [WARN] Using new offset: 381684046396

10:56:38.038 s.k.PartitionManager [WARN] Removing the failed offsets that
are out of range: [381679656376, ... 12,000 offsets here  ... ,
381679657542]

10:56:38.055 STDERR [INFO] 2016-03-31 10:56:38,040 ERROR Unable to write to
stream UDP:localhost:514 for appender syslog

10:56:38.060 STDERR [INFO] 2016-03-31 10:56:38,042 ERROR An exception
occurred processing Appender syslog
org.apache.logging.log4j.core.appender.AppenderLoggingException: Error
flushing stream UDP:localhost:514

10:56:38.061 STDERR [INFO] at
org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:159)

10:56:38.061 STDERR [INFO] at
org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:112)

10:56:38.061 STDERR [INFO] at
org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:99)

10:56:38.061 STDERR [INFO] at
org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:430)

10:56:38.061 STDERR [INFO] at
org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:409)

10:56:38.062 STDERR [INFO] at
org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:367)

10:56:38.062 STDERR [INFO] at
org.apache.logging.log4j.core.Logger.logMessage(Logger.java:112)

10:56:38.062 STDERR [INFO] at
org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:738)

10:56:38.062 STDERR [INFO] at
org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:708)

10:56:38.063 STDERR [INFO] at
org.apache.logging.slf4j.Log4jLogger.warn(Log4jLogger.java:243)

10:56:38.063 STDERR [INFO] at
storm.kafka.PartitionManager.fill(PartitionManager.java:183)

10:56:38.063 STDERR [INFO] at
storm.kafka.PartitionManager.next(PartitionManager.java:131)

10:56:38.063 STDERR [INFO] at
storm.kafka.KafkaSpout.nextTuple(KafkaSpout.java:141)

10:56:38.064 STDERR [INFO] at
backtype.storm.daemon.executor$fn__5624$fn__5639$fn__5670.invoke(executor.clj:607)

10:56:38.064 STDERR [INFO] at
backtype.storm.util$async_loop$fn__545.invoke(util.clj:479)

10:56:38.064 STDERR [INFO] at clojure.lang.AFn.run(AFn.java:22)

10:56:38.064 STDERR [INFO] at java.lang.Thread.run(Thread.java:745)

10:56:38.064 STDERR [INFO] Caused by: java.io.IOException: Message too long

10:56:38.065 STDERR [INFO] at java.net.PlainDatagramSocketImpl.send(Native
Method)

10:56:38.065 STDERR [INFO] at
java.net.DatagramSocket.send(DatagramSocket.java:693)

10:56:38.065 STDERR [INFO] at
org.apache.logging.log4j.core.net.DatagramOutputStream.flush(DatagramOutputStream.java:103)

10:56:38.065 STDERR [INFO] at
org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:156)

10:56:38.066 STDERR [INFO] ... 16 more

10:56:38.066 STDERR [INFO]

10:56:59.832 b.s.m.n.Server [INFO] Getting metrics for server on port 6704

10:56:59.832 b.s.m.n.Client [INFO] Getting metrics for client connection to
Netty-Client-server-005-17.xyz.com/10.10.10.169:6704

10:56:59.833 b.s.m.n.Client [INFO] Getting metrics for client connection to
Netty-Client-server-006-16.xyz.com/10.10.10.171:6704

10:56:59.833 b.s.m.n.Client [INFO] Getting metrics for client connection to
Netty-Client-server-006-15.xyz.com/10.10.10.170:6704

10:56:59.833 b.s.m.n.Client [INFO] Getting metrics for client connection to
Netty-Client-server-004-17.xyz.com/10.10.10.164:6704

10:56:59.833 b.s.m.n.Client [INFO] Getting metrics for client connection to
Netty-Client-server-005-13.xyz.com/10.10.10.165:6704

10:57:24.446 s.k.KafkaUtils [WARN] Got fetch request with offset out of
range: [381803772151]

10:57:24.447 s.k.PartitionManager [WARN] Using new offset: 381808257383



In particular, see the line:


*Removing the failed offsets that are out of range: [381679656376, ...
12,000 offsets here  ... , 381679657542]*


It actually had 12,000 offsets, which I removed to keep the message small
here.

Why does it have so many out of range offsets?


Does anyone know what I may be doing wrong?


Thanks

Tid


Re: Storm KafkaSpout Integration

2016-03-31 Thread Tech Id
Hey David,

I would be interested in seeing what Kafka-Spouts you found online and why
you found them better.
Also, if you have your own Kafka-Spout open-sourced on GitHub, a link to
that would be great too.

Thanks
Tid

On Thu, Mar 31, 2016 at 2:21 AM, david kavanagh 
wrote:

> Hi Spico,
>
> I changed the parallelism as you suggested but it didn't work. Yesterday
> evening I gave up on using the KafkaSpout class that comes with storm. I
> found some Kafka consumer java classes online and wrote my own Kafka spout
> which is working fine. Thanks for the advice anyway.
>
> Regards
> David
>
> --
> Date: Wed, 30 Mar 2016 21:33:38 +0300
> Subject: Re: Storm KafkaSpout Integration
> From: spicoflo...@gmail.com
> To: user@storm.apache.org
>
> hi,
> i think the problem is that you have set up one partition per
> topic, but you try to consume with 10 kafka spout tasks.
> check this line: builder.setSpout("words", new KafkaSpout(kafkaConfig),
> 10);
> 10 represents the task parallelism for the spout, which in the case of
> kafka should be the same number as the partitions you have set up for the
> kafka topic. you use more than one kafka partition when you would like to
> consume the data from the topic in parallel. please check the very good
> documentation on kafka partitions on the confluent site.
> in my opinion, setting your parallelism hint to 1 would solve the problem.
> the max spout pending has a different meaning.
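>
> in code that means something like this (a sketch reusing the classes from
> david's snippet quoted below; the topic, zkRoot and id strings are guesses
> based on the zookeeper path in the logs):
>
>     GlobalPartitionInformation hostsAndPartitions = new GlobalPartitionInformation();
>     hostsAndPartitions.addPartition(0, new Broker("127.0.0.1", 9092)); // one partition
>     BrokerHosts brokerHosts = new StaticHosts(hostsAndPartitions);
>     SpoutConfig kafkaConfig = new SpoutConfig(brokerHosts,
>             "twitter-topic", "/twitter", "twitter-topic-id");
>
>     TopologyBuilder builder = new TopologyBuilder();
>     // parallelism hint = number of partitions (1 here), so no task is idle
>     builder.setSpout("words", new KafkaSpout(kafkaConfig), 1);
>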
> regards,
> florin
>
> On Wednesday, March 30, 2016, david kavanagh 
> wrote:
> > I am only creating one partition in code here:
> >  GlobalPartitionInformation hostsAndPartitions = new
> GlobalPartitionInformation();
> >  hostsAndPartitions.addPartition(0, new Broker("127.0.0.1", 9092));
> >  BrokerHosts brokerHosts = new StaticHosts(hostsAndPartitions);
> > I hope that answered your question. I am new to both Storm and Kafka so
> I am not sure exactly how it works.
> > If I am understanding you correctly, the line you told me to add in the
> first email should work because I am only creating one partition?
> > config.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
> > Thanks again for the help :-)
> > David
> > 
> > Date: Wed, 30 Mar 2016 15:36:19 +0530
> > Subject: Re: Storm KafkaSpout Integration
> > From: dkira...@aadhya-analytics.com
> > To: user@storm.apache.org
> >
> >
> > Hi david,
> >
> > Can I know how many partitions you have?
> > The statement I have given to you is the default. If you are running
> with multiple partitions, make sure you give the same number, e.g. if you
> are running with two partitions, change the number to 2 in the statement:
> > config.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 2);
> >
> > Best regards,
> > K.Sai Dilip Reddy.
> > On Wed, Mar 30, 2016 at 3:00 PM, david kavanagh 
> wrote:
> >
> > Thanks for the reply!
> > I added the line as you suggested but there is still no difference
> unfortunately.
> > I am just guessing at this stage but judging by the output below, it
> seems like it is something to do with the partitioning or the offset.
> > The warnings start by stating that there are more tasks than
> partitions.
> > Task 1 is assigned the partition that is created in the code
> (highlighted in green), then the rest of the tasks are not assigned any
> partitions.
> > Eventually it states 'Read partition information from:
> /twitter/twitter-topic-id/partition_0  --> null'
> > So it seems like it is not reading data from Kafka at all. I really
> don't understand what is going on here.
> > Any ideas?
> >
> > Kind Regards
> > David
> > --
> > Storm Output:
> > Thread-9-print] INFO  backtype.storm.daemon.executor - Prepared bolt
> print:(2)
> > 32644 [Thread-11-words] INFO
>  org.apache.curator.framework.imps.CuratorFrameworkImpl - Starting
> > 32685 [Thread-19-words] WARN  storm.kafka.KafkaUtils - there are more
> tasks than partitions (tasks: 10; partitions: 1), some tasks will be idle
> > 32686 [Thread-19-words] WARN  storm.kafka.KafkaUtils - Task [5/10] no
> partitions assigned
> > 32686 [Thread-17-words] WARN  storm.kafka.KafkaUtils - there are more
> tasks than partitions (tasks: 10; partitions: 1), some tasks will be idle
> > 32686 [Thread-15-words] WARN  storm.kafka.KafkaUtils - there are more
> tasks than partitions (tasks: 10; partitions: 1), some tasks will be idle
> > 32686 [Thread-17-words] WARN  storm.kafka.KafkaUtils - Task [4/10] no
> partitions assigned
> > 32686 [Thread-15-words] WARN  storm.kafka.KafkaUtils - Task [3/10] no
> partitions assigned
> > 32686 [Thread-11-words] WARN  storm.kafka.KafkaUtils - there are more
> tasks than partitions (tasks: 10; partitions: 1), some tasks will be idle
> > 32686 [Thread-11-words] INFO  storm.kafka.KafkaUtils - Task [1/10]
> assigned [Partition{host=127.0.0.1:9092, partition=0}]
> > 32687 [Thread-29-words] WARN  storm.kafka.KafkaUtils - there 

Re: Next version of storm?

2016-03-30 Thread Tech Id
STORM-971 <https://issues.apache.org/jira/browse/STORM-971> is another good
fix we would like to have.

I would appreciate it if someone could say when the next version of storm
is scheduled.


Thanks for helping,
Regards
Tid

On Sat, Mar 12, 2016 at 6:23 AM, Craig Charleton <craig.charle...@gmail.com>
wrote:

> Mr. Goetz,
>
> I was unable to find the answer to this question online. I apologize if I
> missed it. I was wondering what changes/improvements will be present around
> stateful bolts in 1.0?
>
> Many Thanks!
>
>
>
> > On Mar 11, 2016, at 11:04 PM, P. Taylor Goetz <ptgo...@gmail.com> wrote:
> >
> > 1.0 should be released this month. We will likely update 0.10.x as well.
> >
> > 1.0 is a big release, and we want to make sure it's solid.
> >
> > -Taylor
> >
> >
> >
> >> On Mar 11, 2016, at 10:47 PM, Tech Id <tech.login@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> We need to have the fix for:
> >>STORM-1207: Added flux support for IWindowedBolt
> >>
> >> It's been ~5 months since storm was last released (Oct 2015) and ~1000
> commits have gone into it since then.
> >> If someone can help me find out when the next version of storm will be
> released, it would be a great help.
> >>
> >> Appreciate your time,
> >> Regards
> >> Tid
>


Re: external/storm-elasticsearch - upgrade requested

2016-03-30 Thread Tech Id
Hey Aaron,

Do you have a target throughput that you want to achieve through ES-Bolt?
How many storm machines do you plan to run your topology on?

Thanks


On Wed, Mar 30, 2016 at 10:53 AM, Aaron.Dossett <aaron.doss...@target.com>
wrote:

> In setting up a storm -> ES topology today I ran across that same fact.
> The EsIndexBolt is also synchronous-only, with no option for async.
> This week I'll evaluate adding batching/async to storm-elasticsearch or
> switching to ES-hadoop for my use cases.
>
> From: Tech Id <tech.login@gmail.com>
> Reply-To: "user@storm.apache.org" <user@storm.apache.org>
> Date: Wednesday, March 30, 2016 at 10:24 AM
> To: "user@storm.apache.org" <user@storm.apache.org>
> Subject: Re: external/storm-elasticsearch - upgrade requested
>
> One thing I see in favor of the elasticsearch-hadoop is that it provides
> batching without Trident.
>


Re: external/storm-elasticsearch - upgrade requested

2016-03-30 Thread Tech Id
I think elasticsearch-hadoop has a good number of options to tune and those
are documented well too.
(See
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
)

One thing I see in favor of the elasticsearch-hadoop is that it provides
batching without Trident.
This could be a very good thing for someone not using Trident.

ES-Hadoop also does node discovery of all the ES nodes and then writes
data only to primary shards (which avoids a hop from replica shards to
primary shards).

ES-Hadoop supported routing up to the 1.x versions of ES, and there is a
bug filed to make it work with 2.x as well.

ES-Hadoop also provides good support for time-based index rolling, which
is great for logging-type use cases.

Some more features include authentication, SSL, proxy etc.

I haven't looked closely into storm-elasticsearch to see if it provides
the same functionality.

We should compare the above features and run some performance tests on the
two.

Disclaimer: I am just a simple user of ES-Hadoop and not associated with it
in any way. I would love to switch to some other bolt if it's better or more
performant.
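
As a concrete example, wiring ES-Hadoop's EsBolt with its batching options
looks roughly like the sketch below. The property names come from the
configuration page linked above; the host names, index name, and values are
assumptions (Storm 0.x package names):

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.topology.TopologyBuilder;
    import org.elasticsearch.storm.EsBolt;

    public class EsHadoopSketch {
        public static void main(String[] args) {
            Map<String, Object> conf = new HashMap<>();
            conf.put("es.nodes", "es-node-1,es-node-2"); // assumed hosts
            conf.put("es.batch.size.entries", "1000");   // flush after 1000 docs
            conf.put("es.batch.size.bytes", "1mb");      // or after 1 MB, whichever first
            conf.put("es.storm.bolt.write.ack", "true"); // ack/fail tuples per batch outcome

            TopologyBuilder builder = new TopologyBuilder();
            // ... kafka-spout definition elided ...
            builder.setBolt("es-bolt", new EsBolt("storm/docs", conf), 6)
                   .localOrShuffleGrouping("kafka-spout");
        }
    }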


On Sat, Mar 26, 2016 at 8:11 PM, Jungtaek Lim <kabh...@gmail.com> wrote:

> Lakshmanan,
>
> We looked over elasticsearch-hadoop when we adopted storm-elasticsearch,
> and elasticsearch-hadoop had only basic features at that time
> (no Trident implementation).
> Please refer https://github.com/apache/storm/pull/573 to track our
> discussion on this module.
>
> Thanks!
> Jungtaek Lim (HeartSaVioR)
>
> On Sun, Mar 27, 2016 at 3:12 AM, Lakshmanan Muthuraman <lakshma...@tokbox.com>
> wrote:
>
>> We have been using elasticsearch-hadoop to write to ElasticSearch from
>> Storm. It looks good so far. Any thoughts on adopting ElasticSearch Hadoop
>> as part of Storm's external modules rather than trying to write and
>> maintain our own bolt in the Storm project?
>>
>> Any thoughts?
>>
>> On Tue, Mar 22, 2016 at 2:19 PM, Aaron.Dossett <aaron.doss...@target.com>
>> wrote:
>>
>>> No, we hadn’t looked at that, but once we are streaming into elastic at
>>> scale (not there yet) I would be interested to compare.
>>>
>>> From: Tech Id <tech.login@gmail.com>
>>> Reply-To: "user@storm.apache.org" <user@storm.apache.org>
>>> Date: Monday, March 21, 2016 at 1:21 PM
>>> To: "user@storm.apache.org" <user@storm.apache.org>
>>> Subject: Re: external/storm-elasticsearch - upgrade requested
>>>
>>> Thanks Aaron,
>>>
>>> Did you have a chance to compare the elasticsearch-hadoop (
>>> https://www.elastic.co/guide/en/elasticsearch/hadoop/current/storm.html)
>>> with the storm-elasticsearch (
>>> https://github.com/apache/storm/tree/master/external/storm-elasticsearch)
>>> ?
>>>
>>> Former uses REST client while the latter uses TransportClient.
>>>
>>> It would be interesting to know some performance numbers between them.
>>>
>>> Thanks !
>>>
>>> On Mon, Mar 21, 2016 at 5:40 AM, Aaron.Dossett <aaron.doss...@target.com
>>> > wrote:
>>>
>>>> Hi,
>>>>
>>>> The best way to make feature requests like this is via the Apache Jira.
>>>> (https://issues.apache.org/jira/browse/STORM/)
>>>>
>>>> As it happens, my team (at Target) has adapted the elastic search bolt
>>>> for 2.X and we are using it in production.  We opened a JIRA (
>>>> https://issues.apache.org/jira/browse/STORM-1475) to track that and
>>>> contribute our change back to the project.  We’re cleaning a couple of
>>>> things up right now, but I’m hopeful we will contribute it back soon.
>>>>
>>>> Your second request could be a nice feature enhancement; can you open a
>>>> JIRA for it? In my experience, these external bolts are a really nice way
>>>> to get involved with the project as well. You could try adding a couple of
>>>> features and contributing those.
>>>>
>>>> Thanks!
>>>>
>>>> From: Tech Id <tech.login@gmail.com>
>>>> Reply-To: "user@storm.apache.org" <user@storm.apache.org>
>>>> Date: Friday, March 18, 2016 at 9:45 PM
>>>> To: "user@storm.apache.org" <user@storm.apache.org>
>>>> Subject: external/storm-elasticsearch - upgrade requested
>>>>
>>>> Hi,
>>>>
>>>> I see that the version of elastic-search used
>>>> in external/storm-elasticsearch/pom.xml is quite old (1.6.0) while the
>>>> latest elastic-search is 2.2.0
>>>>
>>>> The 2.x version is not compatible with the 1.x version of elastic-search,
>>>> so I request an upgrade.
>>>>
>>>> Also, current storm-elasticsearch uses TransportClient of
>>>> elastic-search but does not expose many of the useful options in the
>>>> TransportClient (like routing).
>>>>
>>>> I request that you expose those options to users too.
>>>>
>>>>
>>>> Thanks
>>>> Tid
>>>>
>>>
>>>
>>


What's the max throughput anyone has seen for kafka-spout, es-bolt?

2016-03-24 Thread Tech Id
Hi,

We are trying to optimize our Storm topology that uses Kafka-Spout and
Elastic-Search-Bolt (no other spouts/bolts).

Current performance statistics are as follows:

storm-workers:  1
elastic-search primaries : 1
elastic-search replicas : 1
1 process in storm having 1 kafka-spout thread and 6 elastic-search bolt
threads
kafka-fetch-size  : 10 MB
kafka-buffer-size : 11 MB
es-flush-entries-size : 10,000
16gb heap size with new-ratio = 1 (for Elastic-Search as well as Storm)
average kafka-message-size : 1 kb

The maximum ingestion rate we are able to achieve with the above is 800,000
messages per minute from kafka to elastic-search.

These statistics scale almost linearly as we add storm worker
nodes/processes (we use LOCAL_OR_SHUFFLE grouping) and a similar number of
elastic-search nodes.

Can someone comment on these throughput statistics?

Any recommendations on increasing the throughput would be much appreciated.

Thanks,
Tid


Re: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf exception in kafka-storm spout?

2016-03-24 Thread Tech Id
You need to increase the workers' heap size by adding the following to each
worker's storm.yaml file:

worker.childopts: "-Xmx16g"

Then restart each worker's supervisor.
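
Alternatively, if you prefer a per-topology setting instead of editing every
worker's storm.yaml, something like this sketch should work
(Config.TOPOLOGY_WORKER_CHILDOPTS is appended to the worker JVM options for
that topology only; the topology name and builder contents are placeholders):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class SubmitWithBiggerHeap {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ... set spouts/bolts here ...
            Config conf = new Config();
            // Extra JVM options for this topology's workers
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx16g");
            StormSubmitter.submitTopology("kafka-topology", conf, builder.createTopology());
        }
    }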



On Wed, Mar 23, 2016 at 4:07 AM, Sai Dilip Reddy Kiralam <
dkira...@aadhya-analytics.com> wrote:

> Hi,
>
>
>
> When I increase the kafkaConfig.fetchSizeBytes value to 150*1024*1024, I
> get the desired output from kafka-spout, but after some time the kafka
> spout throws the exception below and my spout fails. What should I do to
> handle this? Do I need to add any lines in storm.yaml to handle this
> exception?
>
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
>   at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:622)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at java.util.AbstractCollection.toString(AbstractCollection.java:464)
>   at clojure.core$str.invoke(core.clj:513)
>   at clojure.core$str$fn__3896.invoke(core.clj:517)
>   at clojure.core$str.doInvoke(core.clj:519)
>   at clojure.lang.RestFn.invoke(RestFn.java:516)
>   at backtype.storm.daemon.task$mk_tasks_fn$fn__6307.invoke(task.clj:152)
>   at 
> backtype.storm.daemon.executor$fn__6579$fn__6594$send_spout_msg__6612.invoke(executor.clj:482)
>   at 
> backtype.storm.daemon.executor$fn__6579$fn$reify__6621.emit(executor.clj:526)
>   at 
> backtype.storm.spout.SpoutOutputCollector.emit(SpoutOutputCollector.java:49)
>   at 
> backtype.storm.spout.SpoutOutputCollector.emit(SpoutOutputCollector.java:63)
>   at storm.kafka.PartitionManager.next(PartitionManager.java:141)
>   at storm.kafka.KafkaSpout.nextTuple(KafkaSpout.java:141)
>   at 
> backtype.storm.daemon.executor$fn__6579$fn__6594$fn__6623.invoke(executor.clj:565)
>   at backtype.storm.util$async_loop$fn__459.invoke(util.clj:463)
>   at clojure.lang.AFn.run(AFn.java:24)
>   at java.lang.Thread.run(Thread.java:745)
>
> Need suggestions
>
>
>
>
> *Best regards,*
> *K.Sai Dilip Reddy.*
>


external/storm-elasticsearch - upgrade requested

2016-03-18 Thread Tech Id
Hi,

I see that the version of elastic-search used
in external/storm-elasticsearch/pom.xml is quite old (1.6.0), while the
latest elastic-search is 2.2.0.

The 2.x version is not compatible with the 1.x version of elastic-search,
so I request an upgrade.

Also, current storm-elasticsearch uses TransportClient of elastic-search
but does not expose many of the useful options in the TransportClient (like
routing).

I request that you expose those options to users too.


Thanks
Tid


Re: Next version of storm?

2016-03-11 Thread Tech Id
Thank you Taylor !

On Fri, Mar 11, 2016 at 8:04 PM, P. Taylor Goetz <ptgo...@gmail.com> wrote:

> 1.0 should be released this month. We will likely update 0.10.x as well.
>
> 1.0 is a big release, and we want to make sure it's solid.
>
> -Taylor
>
>
>
> > On Mar 11, 2016, at 10:47 PM, Tech Id <tech.login@gmail.com> wrote:
> >
> > Hi,
> >
> > We need to have the fix for:
> > STORM-1207: Added flux support for IWindowedBolt
> >
> > It's been ~5 months since storm was last released (Oct 2015) and ~1000
> commits have gone into it since then.
> > If someone can help me find out when the next version of storm will be
> released, it would be a great help.
> >
> > Appreciate your time,
> > Regards
> > Tid
>


Next version of storm?

2016-03-11 Thread Tech Id
Hi,

We need to have the fix for:
STORM-1207: Added flux support for IWindowedBolt

It's been ~5 months since storm was last released (Oct 2015) and ~1000
commits have gone into it since then.
If someone can help me find out when the next version of storm will be
released, it would be a great help.

Appreciate your time,
Regards
Tid