Re: Trident Kafka Spout - Ack count increasing even though no messages are processed

2014-09-27 Thread Nathan Marz
Trident executes a batch every 500ms (by default). A batch involves a bunch of coordination messages going out to all the bolts to coordinate the batch (even if the batch is empty). So that's what you're seeing. On Fri, Sep 26, 2014 at 12:32 PM, Deepak Subhramanian < deepak.subhraman...@gmail.com>

Re: Duplicate record processed

2014-09-19 Thread Nathan Marz
Storm processes many records from a partition at once. If a record fails, all tuples from the offset onwards will have to be retried. So in your case, if a record before that record failed, it's possible that a subsequent record that had already been successfully processed was retried. This is tru
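
The replay behaviour described above can be modelled in a few lines. This is only a sketch in plain Python, not Storm or Kafka spout code: the function name, the `fail_first_time` parameter, and the assumption that every record from the current offset onward is in flight together are all illustrative.

```python
def process_with_replay(records, fail_first_time):
    """Model offset-based replay.

    records: list of record values, indexed by offset.
    fail_first_time: set of offsets that fail on their first attempt only.
    Returns every successful processing event, in order.
    """
    already_failed = set()
    processed_log = []
    start = 0
    while start < len(records):
        earliest_failure = None
        # Everything from `start` onward is in flight in this pass.
        for offset in range(start, len(records)):
            if offset in fail_first_time and offset not in already_failed:
                already_failed.add(offset)
                if earliest_failure is None:
                    earliest_failure = offset
            else:
                processed_log.append(records[offset])
        if earliest_failure is None:
            break
        # Replay from the failed offset, including later records
        # that had already succeeded.
        start = earliest_failure
    return processed_log


log = process_with_replay(["a", "b", "c", "d"], fail_first_time={1})
print(log)
```

In this model "c" and "d" are processed twice even though they never failed, which is the duplicate the thread asks about.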

Re: Isolation Scheduler resources

2014-09-01 Thread Nathan Marz
You'd probably want to make use of the component-specific configuration on spouts and bolts to funnel the information to your scheduler as to which tasks should be colocated. On Mon, Sep 1, 2014 at 1:42 PM, Nathan Marz wrote: > Yes, you can definitely accomplish that by writing

Re: Isolation Scheduler resources

2014-09-01 Thread Nathan Marz
e > read latency really matters here. Is a pluggable scheduler (all bolt tasks > in a strong node) and a static concurrent hashmap a good approach to it? > > Thanks, > Michael > > > > On Mon, Sep 1, 2014 at 8:10 PM, Nathan Marz wrote: > >> There's not muc

Re: Isolation Scheduler resources

2014-09-01 Thread Nathan Marz
There's not much more to it than what was documented on the blog post announcing it: http://storm.incubator.apache.org/2013/01/11/storm082-released.html What are your questions about it? On Mon, Sep 1, 2014 at 12:04 PM, Michael Vogiatzis wrote: > Hello, > > Does anyone know of any good resour

Re: transient errors of "Tuple created with wrong number of fields"

2014-08-27 Thread Nathan Marz
Hi Jie, A few questions: 1. How many topologies do you have running on this cluster? 2. Once the topology recovers, do the errors still happen or do they disappear for the most part? 3. Have you noticed any worker failures on the cluster around the time these errors happen? -Nathan On Tue, Aug

Re: data cleansing in real time systems

2014-08-20 Thread Nathan Marz
Deletion is typically done by running a job that copies the master dataset into a new folder, filtering out bad data along the way. This is expensive, but that's ok since this is only done in rare circumstances. When I've done this in the past I'm extra careful before deleting the corrupted master

Re: Acking and failing same tuple.

2014-07-27 Thread Nathan Marz
You shouldn't do this, so this behavior is undefined. On Sat, Jul 26, 2014 at 2:26 PM, Rishabh Goyal wrote: > Hi, > > What happens if I ack and fail same tuple during the execute method ? Is > the tuple finally acked or replayed ? > > -- Twitter: @nathanmarz http://nathanmarz.com

Re: v0.9.2-incubating and .ser files

2014-06-19 Thread Nathan Marz
A stack dump of all workers would be useful in the case of a topology freeze. On Thu, Jun 19, 2014 at 3:52 PM, Nathan Marz wrote: > There were a bunch of changes to the internals, so a regression is > certainly possible. Let us know as many details as possible if you are able > to

Re: v0.9.2-incubating and .ser files

2014-06-19 Thread Nathan Marz
There were a bunch of changes to the internals, so a regression is certainly possible. Let us know as many details as possible if you are able to reproduce it. On Thu, Jun 19, 2014 at 3:09 PM, Andrew Montalenti wrote: > Another interesting 0.9.2 issue I came across: the IConnection interface >

Re: Trident State and Static State

2014-06-16 Thread Nathan Marz
at be a bad practice?) > > - update a state with something like partitionAggregate(sf) or > partitionPersist(sf) > > - create a newStatic(sf) for querying this state (that we have updated) > later in the topology > > > On Jun 17, 2014 2:18 AM, "Nathan Marz" wrote:

Re: Trident State and Static State

2014-06-16 Thread Nathan Marz
Static state just refers to a state that is not maintained by your Trident topology but which you still want to be able to query, so something like a database that some other system is responsible for updating. On Mon, Jun 16, 2014 at 4:21 AM, Carlos Rodriguez wrote: > Hi guys, > > We are using

Re: how does PersistentAggregate distribute the DB Calls ?

2014-06-03 Thread Nathan Marz
When possible it will do as much aggregation as it can Storm-side, to minimize how much it needs to interact with the database. So if you do a persistent global count, for example, it will compute the count for the batch (in parallel), and then the task that finishes the global count will do a single get/upda
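
As a rough illustration of the global-count example above, the sketch below counts each partition of a batch separately, combines the partial counts, and only then touches the store once. It is a plain-Python model, not Trident code; `CountingStore` and `persistent_global_count` are hypothetical names, and the get/put counters exist only to make the number of database calls visible.

```python
class CountingStore:
    """Hypothetical key-value store that counts its reads and writes."""

    def __init__(self):
        self.data = {}
        self.gets = 0
        self.puts = 0

    def get(self, key, default=0):
        self.gets += 1
        return self.data.get(key, default)

    def put(self, key, value):
        self.puts += 1
        self.data[key] = value


def persistent_global_count(batch_partitions, store, key="count"):
    # Per-partition counts: in a real topology these would run in parallel.
    partial_counts = [len(partition) for partition in batch_partitions]
    # One task combines the partial counts for the whole batch.
    batch_total = sum(partial_counts)
    # A single read and a single write per batch, regardless of batch size.
    new_total = store.get(key) + batch_total
    store.put(key, new_total)
    return new_total


store = CountingStore()
persistent_global_count([["a", "b"], ["c"], []], store)
print(store.data, store.gets, store.puts)
```

The point of the design is that database traffic grows with the number of batches, not with the number of tuples.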

Re: [VOTE] Storm Logo Contest - Round 1

2014-05-18 Thread Nathan Marz
#10 - 5 points On Sun, May 18, 2014 at 4:47 AM, Marco wrote: > #3 - 2 pt. > #4 - 1 pt. > #5 - 2 pts. > > Il Venerdì 16 Maggio 2014 18:41, P. Taylor Goetz ha > scritto: > > > This is a call to vote on selecting the top 3 Storm logos from the 11 > entries received. This is the first of two rou

Re: Are batches processed sequentially or in parallel?

2014-05-14 Thread Nathan Marz
er batch coordinator ? > > Weide > > > On Tue, May 13, 2014 at 11:30 AM, Nathan Marz wrote: > >> Both. topology.max.spout.pending specifies how many batches are processed >> in parallel. However, for state updates, the batches are processed >> sequentially. So

Re: Are batches processed sequentially or in parallel?

2014-05-14 Thread Nathan Marz
persistentAggregate uses partitionPersist underneath the hood. On Wed, May 14, 2014 at 10:13 AM, Nathan Marz wrote: > Yes. > > > On Tue, May 13, 2014 at 5:33 PM, Weide Zhang wrote: > >> Hi Nathan, >> >> I have a followup question on this. If I'

Re: Are batches processed sequentially or in parallel?

2014-05-13 Thread Nathan Marz
Both. topology.max.spout.pending specifies how many batches are processed in parallel. However, for state updates, the batches are processed sequentially. So the state update for batch 2 won't be executed until the state update for batch 1 succeeds. On Tue, May 13, 2014 at 10:31 AM, Raphael Hsieh
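
The "parallel processing, sequential state updates" rule above can be sketched as follows. This is a plain-Python model under stated assumptions, not Trident's coordinator: batch ids are assumed to be consecutive integers starting at 1, and `commit_in_order` is a hypothetical helper.

```python
def commit_in_order(finish_order, apply_update):
    """Apply state updates strictly in batch-id order.

    finish_order: batch ids in the order their processing finished,
                  which may be out of order when batches run in parallel.
    apply_update: callback invoked once per batch, in ascending id order.
    """
    ready = set()
    next_batch = 1  # assumption: ids are 1, 2, 3, ...
    for batch_id in finish_order:
        ready.add(batch_id)
        # Batch N+1 waits until batch N's state update has been applied.
        while next_batch in ready:
            apply_update(next_batch)
            ready.remove(next_batch)
            next_batch += 1


committed = []
commit_in_order([2, 3, 1, 4], committed.append)
print(committed)
```

Batches 2 and 3 finish processing first here, but their state updates are held back until batch 1 has been committed.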

Re: BasicOutputCollector: Question

2014-04-30 Thread Nathan Marz
Yes, that's right. On Wed, Apr 30, 2014 at 4:43 AM, Kashyap Mhaisekar wrote: > Ok. So, using BaseBasicBolt guarantees acks for tuples on execute method > completion, right? > > Regards, > Kashyap > > > On Wednesday, April 30, 2014, Nathan Marz wrote: > >>

Re: BasicOutputCollector: Question

2014-04-30 Thread Nathan Marz
That sounds like a bad idea. The whole point of using BasicBolts is to not manually ack tuples. If you want control over acking, use RichBolts. On Tue, Apr 29, 2014 at 9:18 PM, Kashyap Mhaisekar wrote: > Hi, > Is there a way to get reference to outputCollector using > BasicOutputCollector? I w

Re: How to think of batches vs partitions

2014-04-17 Thread Nathan Marz
described > here<https://github.com/nathanmarz/storm/wiki/Trident-state#opaque-transactional-spouts> > ( > https://github.com/nathanmarz/storm/wiki/Trident-state#opaque-transactional-spouts) > myself? > > > > On Thu, Apr 17, 2014 at 11:00 AM, Nathan Marz wrote: >

Re: How to think of batches vs partitions

2014-04-17 Thread Nathan Marz
> how does this happen ? Will it query the datastore for the current value, >> then add the current aggregate value to the stored value in order to create >> the global aggregate ? Where does this logic happen ? I can't seem to find >> where this happens in the per

Re: How to think of batches vs partitions

2014-04-16 Thread Nathan Marz
Batches are processed sequentially, but each batch is partitioned (and therefore processed in parallel). As a batch is processed, it can be repartitioned an arbitrary number of times throughout the Trident topology. On Wed, Apr 16, 2014 at 4:28 PM, Raphael Hsieh wrote: > Hi I found myself being

Re: Multiple Storm components getting assigned to same worker slot despite of free slots being available

2014-03-21 Thread Nathan Marz
topology.optimize doesn't do anything at the moment. It was something planned for in the early days but turned out to be unnecessary. On Fri, Mar 21, 2014 at 9:00 PM, bijoy deb wrote: > Thanks Drew.I am going to try those options and see if that helps. > > Thanks > Bijoy > > > On Fri, Mar 21, 201

Re: Tuple message id uniqueness

2014-03-18 Thread Nathan Marz
No, it doesn't have to be. You're in full control of it. Internally Storm generates its own tuple ID and maintains a map from that globally unique tuple id to your spout id. The spout id is simply used in the ack/fail methods of the spout (so that you know what was acked/failed) On Tue, Mar 18, 2
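
The mapping described above, from an internally generated tuple id to the message id supplied by the spout, might look roughly like this. It is a plain-Python model, not Storm's implementation: `AckTracker` is a hypothetical name, and a simple counter stands in for whatever id generation Storm uses internally.

```python
import itertools


class AckTracker:
    """Model of mapping internal unique ids to user-chosen spout message ids."""

    def __init__(self):
        self._next_id = itertools.count(1)  # stand-in for Storm's own ids
        self._pending = {}                  # internal id -> spout message id

    def emit(self, spout_msg_id):
        # The internal id is unique even if spout_msg_id is repeated.
        internal_id = next(self._next_id)
        self._pending[internal_id] = spout_msg_id
        return internal_id

    def ack(self, internal_id):
        # The spout is told which of *its* ids was acked.
        return self._pending.pop(internal_id)


tracker = AckTracker()
first = tracker.emit("order-42")
second = tracker.emit("order-42")  # same user id, different internal id
print(first != second, tracker.ack(first), tracker.ack(second))
```

Because uniqueness is handled internally, the message id handed to the spout's ack/fail methods can be whatever is convenient for the spout.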

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

2014-03-13 Thread Nathan Marz
onnect to other tech, storm-kafka, > storm-cassandra, etc > extras for other things like storm-starter, storm-deploy, storm-puppet > > > > On 13 Mar 2014, at 3:57 pm, Nathan Marz wrote: > > I don't like either name tbh. Storm itself is already broken into modules >

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

2014-03-12 Thread Nathan Marz
> > >>> hi, I am in agreement with Taylor and believe I understand his > >>> intent. An incredible tool/framework/application like Storm is > >>> only enhanced and gains value from the number of well maintained > >>> and vetted modules that can

Re: licensing model of storm

2014-03-03 Thread Nathan Marz
The latest Storm release 0.9.1 is all Apache-licensed. We replaced the ZeroMQ transport with a Netty transport (though you can still use the zeromq transport by changing the settings) On Mon, Mar 3, 2014 at 12:46 AM, Richards Peter wrote: > Hi team, > > We are using storm 0.8.2 in our project. I

Re: Implementing a Trident Spout

2014-03-02 Thread Nathan Marz
Throwing a FailedException is how you programmatically fail a batch. On Sun, Mar 2, 2014 at 8:45 AM, David Smith wrote: > Also I don't see a way to a fail batch programmatically like you can do > with traditional storm? What happens if I throw a Failed Exception from a > within function, state q

Re: Snapshottable and Snapshotget

2014-03-02 Thread Nathan Marz
Snapshottable is used for storing a single value, like a global count. SnapshotGet retrieves that value into your Stream. The globalKey is fixed On Sun, Mar 2, 2014 at 8:02 PM, Jahagirdar, Madhu < madhu.jahagir...@philips.com> wrote: > All, > > 1) Could any one explain the usecase where Snap

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

2014-02-25 Thread Nathan Marz
I'm only +1 for pulling in storm-kafka and updating it. Other projects put these contrib modules in a "contrib" folder and keep them managed as completely separate codebases. As it's not actually a "module" necessary for Storm, there's an argument there for doing it that way rather than via the mul

Re: trident multiput

2014-02-19 Thread Nathan Marz
If you're using transactional states, then Trident will store the batch id with the stored values. So if there's a partial failure, only the ones with different batch ids will be updated on the next try. On Wed, Feb 19, 2014 at 12:11 PM, Adrian Mocanu wrote: > Hi > > When using multiPut in Trid
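
A minimal sketch of the idea above: each stored value carries the id of the batch that last updated it, so a retry after a partial failure skips keys that already reflect the batch. This is a plain-Python model, not Trident's state API; `multi_put`, the `(batch_id, value)` layout, and the `fail_after` switch used to simulate a partial failure are all assumptions for illustration.

```python
def multi_put(store, batch_id, updates, fail_after=None):
    """Apply additive updates {key: delta} for one batch.

    store maps key -> (batch_id_of_last_update, value).
    fail_after: if set, raise before applying the update at that position.
    """
    for position, (key, delta) in enumerate(updates.items()):
        if fail_after is not None and position == fail_after:
            raise RuntimeError("simulated partial failure")
        stored_batch_id, value = store.get(key, (None, 0))
        if stored_batch_id == batch_id:
            continue  # this key was already updated by this batch
        store[key] = (batch_id, value + delta)


store = {}
updates = {"a": 1, "b": 2}
try:
    multi_put(store, 1, updates, fail_after=1)  # "a" written, "b" not
except RuntimeError:
    pass
multi_put(store, 1, updates)  # retry: only "b" is applied
print(store)
```

After the retry, "a" has been incremented once rather than twice, because its stored batch id already matched.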

Re: Trident not working on streaming spout

2014-02-17 Thread Nathan Marz
That's not true, it won't make a difference performance-wise whether you emit one tuple or many in nextTuple. The batching happens automatically after that. On Mon, Feb 17, 2014 at 3:51 PM, Danijel Schiavuzzi wrote: > Also, Trident is batch-based, so in nextTuple() you should really emit > batch

Re: Consistent partitions?

2014-02-16 Thread Nathan Marz
Yes batchId + partitionIndex consistently represents the same data as long as: 1. Any repartitioning you do is deterministic (so partitionBy is, but shuffle is not) 2. You're using a spout that replays the exact same batch each time (which is true of transactional spouts but not of opaque transact
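
To make the first condition above concrete, the sketch below contrasts a deterministic, hash-based repartitioning with a random one. It is a plain-Python model, not Trident's partitionBy or shuffle: the function names are hypothetical, and CRC32 is used only because Python's built-in string hash varies between processes.

```python
import random
import zlib


def partition_by(tuples, field, num_partitions):
    """Deterministic: the same field value always lands in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        index = zlib.crc32(str(t[field]).encode("utf-8")) % num_partitions
        partitions[index].append(t)
    return partitions


def shuffle(tuples, num_partitions, rng=random):
    """Non-deterministic: a replayed batch may be split differently."""
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        partitions[rng.randrange(num_partitions)].append(t)
    return partitions


batch = [{"user": "ann"}, {"user": "bob"}, {"user": "ann"}]
# Replaying the same batch gives identical partitions with partition_by.
print(partition_by(batch, "user", 4) == partition_by(batch, "user", 4))
```

With the hash-based version, replaying a batch reproduces the same contents at each partition index; with the random version there is no such guarantee.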

Re: storm jar slowness

2014-02-14 Thread Nathan Marz
Yes, it is arbitrary. Opening an issue for this is a good idea. On Fri, Feb 14, 2014 at 5:24 PM, Adam Lewis wrote: > This is promising, 150KB chunk size is giving me over 1 MB/sec; 300KB > chunk size up to 2 MB/sec although spikier performance. My experimental > setup leaves a lot to be desire

Re: How to evaluating latency per tuple?

2014-02-04 Thread Nathan Marz
1 is already calculated by Storm and reported in default metrics. For 2 you could track that yourself in your spout implementation. On Mon, Feb 3, 2014 at 7:40 PM, Sho Nakatani wrote: > Hi all, > > I'm new to Storm. > I'm finding the way to count latency per tuple. > > What I exactly want are:

Re: Threading and partitioning on trident bolts

2014-01-27 Thread Nathan Marz
Correct, they will be run sequentially within the bolt. Threading within a bolt adds a ton of complication (needing to synchronize around output collector, for instance), so that's really bad. What are you trying to achieve by adding more threading? You won't get better resource usage because the r

Re: batch and partition - differences

2014-01-20 Thread Nathan Marz
A batch is the set of tuples that are computed on together in a single run of the topology. Each stage of the processing is split into partitions for parallelization. On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer wrote: > Hi, it is not clear what is the different between batch and partition on > the Trident

Re: Storm - Stream Partitions

2014-01-13 Thread Nathan Marz
All this information is available through the TopologyContext. On Mon, Jan 13, 2014 at 1:34 AM, Klausen Schaefersinho < klaus.schaef...@gmail.com> wrote: > Hi, > > I have to store the state of a bolt persistently. As I have quite a lot of > events coming through I would like to use some kind of

Re: Can spout.nextTuple method be blocked?

2013-12-27 Thread Nathan Marz
You shouldn't do that because the spout will be unable to ack or fail tuples while it's blocked. On Fri, Dec 27, 2013 at 5:06 PM, Link Wang wrote: > hi, all > I wonder to know what will happen If I provide a blocked or dead loop > implementation for nextTuple in my spout, does it just means that

Re: CombinerAggregator API

2013-12-21 Thread Nathan Marz
Actually, never mind, those cases aren't problems since these objects are contained within the operation of the combiner aggregator. But I still wouldn't do it as it goes against the spirit of the API. On Sat, Dec 21, 2013 at 6:56 PM, Nathan Marz wrote: > They could potentially

Re: CombinerAggregator API

2013-12-21 Thread Nathan Marz
in what > circumstances could those values be reused? > > Thanks. > > > On Sat, Dec 21, 2013 at 5:49 PM, Nathan Marz wrote: > >> You shouldn't do that. Those input values could potentially be re-used >> for other operations. I would recommend looking into using

Re: CombinerAggregator API

2013-12-21 Thread Nathan Marz
You shouldn't do that. Those input values could potentially be re-used for other operations. I would recommend looking into using persistent data structures. On Sat, Dec 21, 2013 at 6:42 AM, Adam Lewis wrote: > Hi All, > > When implementing CombinerAggregator in Trident, > is > there any issue
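
The advice above, in sketch form: a combine step that returns a new value leaves its inputs untouched, while one that updates an input in place changes a value that something else may still hold. This is plain Python rather than the Trident CombinerAggregator interface, the function names are hypothetical, and immutable tuples stand in for the persistent data structures recommended above.

```python
def combine_in_place(left, right):
    """Discouraged: mutates `left`, which may be reused elsewhere."""
    left.extend(right)
    return left


def combine(left, right):
    """Preferred: builds a new immutable value and leaves inputs alone."""
    return left + right


shared = [1, 2]
combine_in_place(shared, [3])
print(shared)          # the caller's list has been changed

safe = (1, 2)
result = combine(safe, (3,))
print(safe, result)    # the input tuple is unchanged
```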