Connection reset when submitting topology

2014-06-24 Thread Sam Goodwin
I've been working on a Storm topology and recently it has become impossible to submit my topology. It just times out. The only thing that has changed since I was last able to successfully submit the topology is the size of the jar/logic of the topology. It's worth mentioning that it has randomly su

Re: Trident persistent aggregation on multiple fields

2014-07-09 Thread Sam Goodwin
This may work. myStream.groupBy(new Fields("user")) .persistentAggregate(new MemoryMapState.Factory(), new Fields("some_int_1"), new Sum(), new Fields("total_some_int_1") .newValuesStream()

Re: Trident persistent aggregation on multiple fields

2014-07-09 Thread Sam Goodwin
This may work. myStream.groupBy(new Fields("user")) .persistentAggregate(new MemoryMapState.Factory(), new Fields("some_int_1"), new Sum(), new Fields("total_some_int_1") .newValuesStream()

Re: Trident persistent aggregation on multiple fields

2014-07-09 Thread Sam Goodwin
n-existent field: 'some_int_2' from stream containing > fields fields: <[user, total_some_int_1]> > > What's the purpose of newValuesStream() then ? > > > 2014-07-09 19:20 GMT+02:00 Sam Goodwin : > >> This may work. >> >>

Re: Trident persistent aggregation on multiple fields

2014-07-10 Thread Sam Goodwin
I don't think this works for persistentAggregate. On Thu, Jul 10, 2014 at 4:30 PM, Xuehui He wrote: > copy from : > http://storm.incubator.apache.org/documentation/Trident-API-Overview.html > > > Sometimes you want to execute multiple aggregators at the same time. This > is called chaining and

Re: writing huge amount of data to HDFS

2014-07-11 Thread Sam Goodwin
Can you show some code? 200 seconds for 15K puts sounds like you're not batching. On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang wrote: > typo in previous email > The emit method in the query bolt takes about 200(instead of 20) seconds.. > > > On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang > wrote:

Re: Scaling Storm Trident by add additional nodes (processes)

2014-07-16 Thread Sam Goodwin
ParalleismHint is definitely the way to scale up consumption of a Kafka topic. Make sure the parallelism doesn't exceed the number of partitions though and try to keep it balanced. Not sure how your environment is set up but simply deploy more hosts to the cluster and then use the rebalance option

Re: setup a bitmap in bolt

2014-07-16 Thread Sam Goodwin
Are you trying to count the number of distinct actions seen per user? The HyperLogLog algorithm was designed for that purpose If you are using Redis it already has an in-built HyperLogLog data structure. I've built a variety of different data structures and used them in Storm. What exactly is the

Re: How to implememt distinct count in trident topolgy?

2014-07-16 Thread Sam Goodwin
Even with Redis you'll need to maintain the sliding window yourself. Does it need to be exact? If you want to estimate the number of distinct users seen in a sliding window then use the HyperLogLog data structure with a ring buffer. It's fast, accurate and memory efficient. For example, allocate 6

Re: Real time data processing on Hadoop WITHOUT Java

2014-07-17 Thread Sam Goodwin
Summingbird is a big data platform. Part of it attempts to provide a functional interface to real-time data analysis https://github.com/twitter/summingbird On Thu, Jul 17, 2014 at 11:59 PM, Darryl Stoflet wrote: > There is some exploratory work underway to translate pig jobs into storm. > I kno

Re: anyone use Storm kafkaSpout implement a HyperLoglog

2014-08-15 Thread Sam Goodwin
I'm not too sure about how postgres hll works but i'm assuming you're going to have to send every tuple to Postgres DB remotely. This is very expensive. Where if you build your hll data strucuture in storm you only have to persist the fixed size serialized version of the hll to the database each tr