Re: How long until fields grouping gets overwhelmed with data?
Thank you very much Erik. So this is the code:

    @Override
    public List chooseTasks(int taskId, List values) {
        int targetTaskIndex = Math.abs(TupleUtils.listHashCode(outFields.select(groupFields, values))) % numTasks;
        return Collections.singletonList(targetTasks.get(targetTaskIndex));
    }

TupleUtils.listHashCode leads to

    public static int listHashCode(List alist) {
        if (alist == null) {
            return 1;
        } else {
            return Arrays.deepHashCode(alist.toArray());
        }
    }

So it does seem like the function is independent of which Spout emits it. What matters is the value of the field(s) on which the fields grouping is done.

Thanks everyone!

On Thu, Aug 11, 2016 at 3:52 PM, Erik Weathers wrote:
> I think these are the appropriate code pointers:
>
> Original Clojure-based storm-core:
> https://github.com/apache/storm/blob/v0.9.6/storm-core/src/clj/backtype/storm/daemon/executor.clj#L36-L39
>
> New Java-based storm-core:
> https://github.com/apache/storm/blob/3b1ab3d8a7da7ed35adc448d24f1f1ccb6c5ff27/storm-core/src/jvm/org/apache/storm/daemon/GrouperFactory.java#L157-L161

--
Regards,
Navin
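For anyone who wants to try the arithmetic outside Storm, here is a minimal standalone sketch. It only mirrors the two methods quoted above; the class name FieldsGroupingSketch and the method chooseTaskIndex are made up for illustration and are not part of Storm. It assumes the grouping field holds an ordinary value such as a String.

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch: not Storm's classes, only the same arithmetic as the quoted chooseTasks().
final class FieldsGroupingSketch {

    // Mirrors the quoted TupleUtils.listHashCode: null list hashes to 1.
    static int listHashCode(List<?> values) {
        return values == null ? 1 : Arrays.deepHashCode(values.toArray());
    }

    // Hypothetical helper: index into the target task list for the given grouping-field values.
    static int chooseTaskIndex(List<?> groupValues, int numTasks) {
        return Math.abs(listHashCode(groupValues)) % numTasks;
    }

    public static void main(String[] args) {
        int numTasks = 10; // ten tasks of bolt B, as in the thread

        // "Spout S1" and "spout S2" each emit the same value x.
        int fromS1 = chooseTaskIndex(Arrays.asList("x"), numTasks);
        int fromS2 = chooseTaskIndex(Arrays.asList("x"), numTasks);

        System.out.println("S1 -> task index " + fromS1);
        System.out.println("S2 -> task index " + fromS2);
    }
}
```

Nothing in the computation refers to the emitting component, so both calls produce the same index. One thing to be aware of in this sketch: Math.abs(Integer.MIN_VALUE) is still negative in Java, so a hash of exactly Integer.MIN_VALUE would yield a negative index here.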
Re: How long until fields grouping gets overwhelmed with data?
I think these are the appropriate code pointers:

Original Clojure-based storm-core:
https://github.com/apache/storm/blob/v0.9.6/storm-core/src/clj/backtype/storm/daemon/executor.clj#L36-L39

New Java-based storm-core:
https://github.com/apache/storm/blob/3b1ab3d8a7da7ed35adc448d24f1f1ccb6c5ff27/storm-core/src/jvm/org/apache/storm/daemon/GrouperFactory.java#L157-L161

On Thu, Aug 11, 2016 at 2:57 AM, Navin Ipe wrote:
> True, but that's what I wanted to confirm by mentioning spout S1 and S2.
> Will S1 and S2 use their own n mod hash functions or is it a common
> function decided by Storm? (If anyone could offer a pointer on where I
> could find this in the Storm source code, I could try finding it myself too)
Re: How long until fields grouping gets overwhelmed with data?
True, but that's what I wanted to confirm by mentioning spout S1 and S2. Will S1 and S2 use their own n mod hash functions or is it a common function decided by Storm? (If anyone could offer a pointer on where I could find this in the Storm source code, I could try finding it myself too.)

On Thu, Aug 11, 2016 at 2:36 PM, Gireesh Ramji wrote:
> It does not matter who hashes it; as long as they all use the same hash
> function, it will go to the same bolt.

--
Regards,
Navin
Re: How long until fields grouping gets overwhelmed with data?
It does not matter who hashes it; as long as they all use the same hash function, it will go to the same bolt.

From: Navin Ipe
To: user@storm.apache.org
Sent: Thursday, August 11, 2016 4:56 PM
Subject: Re: How long until fields grouping gets overwhelmed with data?

If the hash is dynamically computed and is stateless, then that brings up one more question.

Let's say there are two spout classes S1 and S2. I create 10 tasks of S1 and 10 tasks of S2.
There are 10 tasks of a bolt B.

S1 and S2 are fieldsGrouped with B.

I receive data x in S1 and another data x in S2.

If S1's emit of x goes to task1 of B, then will S2's emit of x also go to task1 of B?

Basically the question is: Is the hash value decided by the Spout or by Storm? Because if it is decided by the spout, then S1's emit of x can go to task 1 but S2's emit of x might go to some other task of the bolt, and that won't serve the purpose of someone who wants all x'es to go to one bolt.
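A small caveat on "the same hash function", sketched below under an assumption: that the hash is computed in each emitting task's own JVM via Arrays.deepHashCode over the grouping values, as in the code pointers elsewhere in this thread. In that case the key objects themselves need a value-based hashCode(). The classes ValueKey and IdentityKey and the helper groupingHash are hypothetical names for illustration only.

```java
import java.util.Arrays;

final class KeyHashSketch {

    // Hypothetical key type with value-based equals/hashCode: equal ids give equal hashes in any JVM.
    static final class ValueKey {
        private final String id;
        ValueKey(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof ValueKey && ((ValueKey) o).id.equals(id);
        }
        @Override public int hashCode() { return id.hashCode(); }
    }

    // Hypothetical key type that inherits Object.hashCode(), which is not derived from the id.
    static final class IdentityKey {
        private final String id;
        IdentityKey(String id) { this.id = id; }
    }

    // Hash of a single grouping value, computed the same way Arrays.deepHashCode treats a one-element list.
    static int groupingHash(Object key) {
        return Arrays.deepHashCode(new Object[] { key });
    }

    public static void main(String[] args) {
        // Value-based keys: two separate instances with the same content hash identically.
        System.out.println(groupingHash(new ValueKey("x")) == groupingHash(new ValueKey("x")));
        // Primitive arrays are hashed by content by deepHashCode.
        System.out.println(groupingHash(new byte[] {1, 2}) == groupingHash(new byte[] {1, 2}));
        // Identity-based keys: nothing ties the two hashes together, so the result is not predictable.
        System.out.println(groupingHash(new IdentityKey("x")) == groupingHash(new IdentityKey("x")));
    }
}
```

Strings, boxed numbers, and primitive arrays are fine in this respect. A custom class used as a grouping field would need equals() and hashCode() overridden for two spouts to agree on the target task.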
Re: How long until fields grouping gets overwhelmed with data?
If the hash is dynamically computed and is stateless, then that brings up one more question.

Let's say there are two spout classes S1 and S2. I create 10 tasks of S1 and 10 tasks of S2.
There are 10 tasks of a bolt B.

S1 and S2 are fieldsGrouped with B.

I receive data x in S1 and another data x in S2.

If S1's emit of x goes to task1 of B, then will S2's emit of x also go to task1 of B?

*Basically the question is:* Is the hash value decided by the Spout or by Storm? Because if it is decided by the spout, then S1's emit of x can go to task 1 but S2's emit of x might go to some other task of the bolt, and that won't serve the purpose of someone who wants all x'es to go to one bolt.

On Wed, Aug 10, 2016 at 8:58 PM, Navin Ipe wrote:
> Oh that's good to know. I assume it works like this:
> https://en.wikipedia.org/wiki/Hash_function#Hashing_uniformly_distributed_data

--
Regards,
Navin
Re: How long until fields grouping gets overwhelmed with data?
Oh that's good to know. I assume it works like this:
https://en.wikipedia.org/wiki/Hash_function#Hashing_uniformly_distributed_data

On Wed, Aug 10, 2016 at 6:23 PM, Nathan Leung wrote:
> It's based on a modulo of a hash of the field. The fields grouping is
> stateless.

--
Regards,
Navin
Re: How long until fields grouping gets overwhelmed with data?
It's based on a modulo of a hash of the field. The fields grouping is stateless.

On Aug 10, 2016 8:18 AM, "Navin Ipe" wrote:
> Hi,
>
> For spouts to be able to continuously send a fields grouped tuple to the
> same bolt, it would have to store a key value map something like this,
> right?
>
> field1023 ---> Bolt1
> field1343 ---> Bolt3
> field1629 ---> Bolt5
> field1726 ---> Bolt1
> field1481 ---> Bolt3
>
> So if my topology runs for a very long time and the spout generates many
> unique field values, won't this key value map run out of memory eventually?
>
> OR is there a failsafe or a map limit that Storm has to handle this
> without crashing?
>
> If memory problems could happen, what would be an alternative way to solve
> this problem where many unique fields could get generated over time?
>
> --
> Regards,
> Navin
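To illustrate why no key-to-bolt map is needed, here is a rough sketch of stateless routing. It is not Storm's code: the class StatelessRoutingSketch and its methods are invented for this example, it hashes a single String field with String.hashCode(), and it uses Math.floorMod so the index is never negative (that choice is an assumption of the sketch, not a statement about what Storm does).

```java
final class StatelessRoutingSketch {

    // Pure function of the field value and the task count; nothing is remembered between calls.
    static int route(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Routes `uniqueKeys` distinct field values and returns how many landed on each task.
    static long[] countPerTask(int uniqueKeys, int numTasks) {
        // The only allocation is one counter per task, kept here purely for reporting.
        long[] counts = new long[numTasks];
        for (int i = 0; i < uniqueKeys; i++) {
            counts[route("field" + i, numTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] counts = countPerTask(1_000_000, 5);
        for (int t = 0; t < counts.length; t++) {
            System.out.println("task " + t + ": " + counts[t] + " keys");
        }
    }
}
```

Memory use in this sketch depends on the number of tasks, not on the number of unique field values seen, because the target is recomputed from the value on every call instead of being looked up.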