Re: Batch size warnings
Right you are, thank you Cody. Wondering if I may reach out again to the list and ask a similar question in a more specific way: Scenario: Cassandra 3.x, small cluster (<10 nodes), 1 DC Is a batch warn threshold of 50kb, and average batch sizes in the 40kb range a recipe for regret? Should we be considering a solution such as Cody elucidated earlier in the thread, or am I over-worrying the issue? On Wed, Dec 7, 2016 at 4:08 PM, Cody Yancey wrote: > There is a disconnect between write.3 and write.4, but it can only affect > performance, not consistency. The presence or absence of a row's txnUUID in > the IncompleteTransactions table is the ultimate source of truth, and rows > whose txnUUID are not null will be checked against that truth in the read > path. > > And yes, it is a good point, failures with this model will accumulate and > degrade performance if you never clear out old failed transactions. The > tables we have that use this generally use TTLs so we don't really care as > long as irrecoverable transaction failures are very rare. > > Thanks, > Cody > > On Wed, Dec 7, 2016 at 1:56 PM, Voytek Jarnot > wrote: > >> Appreciate the long writeup Cody. >> >> Yeah, we're good with temporary inconsistency (thankfully) as well. I'm >> going to try to ride the batch train and hope it doesn't derail - our load >> is fairly static (or, more precisely, increase in load is fairly slow and >> can be projected). >> >> Enjoyed your two-phase commit text. Presumably one would also have some >> cleanup implementation that culls any failed updates (write.5) which could >> be identified in read.3 / read.4? Still a disconnect possible between >> write.3 and write.4, but there's always something... >> >> We're insert-only (well, with some deletes via TTL, but anyway), so >> that's somewhat tempting, but I'd rather not prematurely optimize. Unless, >> of course, anyone's got experience such that "batches over XXkb are >> definitely going to be a problem". >> >> Appreciate everyone's time. >> --Voytek Jarnot >> >> On Wed, Dec 7, 2016 at 11:31 AM, Cody Yancey wrote: >> >>> Hi Voytek, >>> I think the way you are using it is definitely the canonical way. >>> Unfortunately, as you learned, there are some gotchas. We tried >>> substantially increasing the batch size and it worked for a while, until we >>> reached new scale, and we increased it again, and so forth. It works, but >>> soon you start getting write timeouts, lots of them. And the thing about >>> multi-partition batch statements is that they offer atomicity, but not >>> isolation. This means your database can temporarily be in an inconsistent >>> state while writes are propagating to the various machines. >>> >>> For our use case, we could deal with temporary inconsistency, as long as >>> it was for a strictly bounded period of time, on the order of a few >>> seconds. Unfortunately, as with all things eventually consistent, it >>> degrades to "totally inconsistent" when your database is under heavy load >>> and the time-bounds expand beyond what the application can handle. When a >>> batch write times out, it often still succeeds (eventually) but your tables >>> can be inconsistent for >>> >>> minutes, even while nodetool status shows all nodes up and normal. >>> >>> But there is another way, that requires us to take a page from our RDBMS >>> ancestors' book: multi-phase commit. >>> >>> Similar to logged batch writes, multi-phase commit patterns typically >>> entail some write amplification cost for the benefit of stronger >>> consistency guarantees across isolatable units (in Cassandra's case, >>> *partitions*). However, multi-phase commit offers stronger guarantees >>> that batch writes, and ALL of the additional write load is completely >>> distributed as per your load-balancing policy, where as batch writes all go >>> through one coordinator node, then get written in their entirety to the >>> batch log on two or three nodes, and then get dispersed in a distributed >>> fashion from there. >>> >>> A typical two-phase commit pattern looks like this: >>> >>> The Write Path >>> >>>1. The client code chooses a random UUID. >>>2. The client writes the UUID into the IncompleteTransactions table, >>>which only has one column, the transactionUUID. >>>3. The client makes all of the inserts involved in the transaction, >>>IN PARALLEL, with the transactionUUID duplicated in every inserted row. >>>4. The client deletes the UUID from IncompleteTransactions table. >>>5. The client makes parallel updates to all of the rows it inserted, >>>IN PARALLEL, setting the transactionUUID to null. >>> >>> The Read Path >>> >>>1. The client reads some rows from a partition. If this particular >>>client request can handle extraneous rows, you are done. If not, read on >>> to >>>step #2. >>>2. The client gathers the set of unique transactionUUIDs. In the >>>main case, they've all been deleted by step #5 in the Write Path. If no
Re: Batch size warnings
There is a disconnect between write.3 and write.4, but it can only affect performance, not consistency. The presence or absence of a row's txnUUID in the IncompleteTransactions table is the ultimate source of truth, and rows whose txnUUID are not null will be checked against that truth in the read path. And yes, it is a good point, failures with this model will accumulate and degrade performance if you never clear out old failed transactions. The tables we have that use this generally use TTLs so we don't really care as long as irrecoverable transaction failures are very rare. Thanks, Cody On Wed, Dec 7, 2016 at 1:56 PM, Voytek Jarnot wrote: > Appreciate the long writeup Cody. > > Yeah, we're good with temporary inconsistency (thankfully) as well. I'm > going to try to ride the batch train and hope it doesn't derail - our load > is fairly static (or, more precisely, increase in load is fairly slow and > can be projected). > > Enjoyed your two-phase commit text. Presumably one would also have some > cleanup implementation that culls any failed updates (write.5) which could > be identified in read.3 / read.4? Still a disconnect possible between > write.3 and write.4, but there's always something... > > We're insert-only (well, with some deletes via TTL, but anyway), so that's > somewhat tempting, but I'd rather not prematurely optimize. Unless, of > course, anyone's got experience such that "batches over XXkb are definitely > going to be a problem". > > Appreciate everyone's time. > --Voytek Jarnot > > On Wed, Dec 7, 2016 at 11:31 AM, Cody Yancey wrote: > >> Hi Voytek, >> I think the way you are using it is definitely the canonical way. >> Unfortunately, as you learned, there are some gotchas. We tried >> substantially increasing the batch size and it worked for a while, until we >> reached new scale, and we increased it again, and so forth. It works, but >> soon you start getting write timeouts, lots of them. And the thing about >> multi-partition batch statements is that they offer atomicity, but not >> isolation. This means your database can temporarily be in an inconsistent >> state while writes are propagating to the various machines. >> >> For our use case, we could deal with temporary inconsistency, as long as >> it was for a strictly bounded period of time, on the order of a few >> seconds. Unfortunately, as with all things eventually consistent, it >> degrades to "totally inconsistent" when your database is under heavy load >> and the time-bounds expand beyond what the application can handle. When a >> batch write times out, it often still succeeds (eventually) but your tables >> can be inconsistent for >> >> minutes, even while nodetool status shows all nodes up and normal. >> >> But there is another way, that requires us to take a page from our RDBMS >> ancestors' book: multi-phase commit. >> >> Similar to logged batch writes, multi-phase commit patterns typically >> entail some write amplification cost for the benefit of stronger >> consistency guarantees across isolatable units (in Cassandra's case, >> *partitions*). However, multi-phase commit offers stronger guarantees >> that batch writes, and ALL of the additional write load is completely >> distributed as per your load-balancing policy, where as batch writes all go >> through one coordinator node, then get written in their entirety to the >> batch log on two or three nodes, and then get dispersed in a distributed >> fashion from there. >> >> A typical two-phase commit pattern looks like this: >> >> The Write Path >> >>1. The client code chooses a random UUID. >>2. The client writes the UUID into the IncompleteTransactions table, >>which only has one column, the transactionUUID. >>3. The client makes all of the inserts involved in the transaction, >>IN PARALLEL, with the transactionUUID duplicated in every inserted row. >>4. The client deletes the UUID from IncompleteTransactions table. >>5. The client makes parallel updates to all of the rows it inserted, >>IN PARALLEL, setting the transactionUUID to null. >> >> The Read Path >> >>1. The client reads some rows from a partition. If this particular >>client request can handle extraneous rows, you are done. If not, read on >> to >>step #2. >>2. The client gathers the set of unique transactionUUIDs. In the main >>case, they've all been deleted by step #5 in the Write Path. If not, go to >>#3. >>3. For remaining transactionUUIDs (which should be a very small >>number), query the IncompleteTransactions table. >>4. The client code culls rows where the transactionUUID existed in >>the IncompleteTransactions table. >> >> This is just an example, one that is reasonably performant for >> ledger-style non-updated inserts. For transactions involving updates to >> possibly existing data, more effort is required, generally the client needs >> to be smart enough to merge updates based on a timestamp, with a periodic >> batch
Re: Batch size warnings
Appreciate the long writeup Cody. Yeah, we're good with temporary inconsistency (thankfully) as well. I'm going to try to ride the batch train and hope it doesn't derail - our load is fairly static (or, more precisely, increase in load is fairly slow and can be projected). Enjoyed your two-phase commit text. Presumably one would also have some cleanup implementation that culls any failed updates (write.5) which could be identified in read.3 / read.4? Still a disconnect possible between write.3 and write.4, but there's always something... We're insert-only (well, with some deletes via TTL, but anyway), so that's somewhat tempting, but I'd rather not prematurely optimize. Unless, of course, anyone's got experience such that "batches over XXkb are definitely going to be a problem". Appreciate everyone's time. --Voytek Jarnot On Wed, Dec 7, 2016 at 11:31 AM, Cody Yancey wrote: > Hi Voytek, > I think the way you are using it is definitely the canonical way. > Unfortunately, as you learned, there are some gotchas. We tried > substantially increasing the batch size and it worked for a while, until we > reached new scale, and we increased it again, and so forth. It works, but > soon you start getting write timeouts, lots of them. And the thing about > multi-partition batch statements is that they offer atomicity, but not > isolation. This means your database can temporarily be in an inconsistent > state while writes are propagating to the various machines. > > For our use case, we could deal with temporary inconsistency, as long as > it was for a strictly bounded period of time, on the order of a few > seconds. Unfortunately, as with all things eventually consistent, it > degrades to "totally inconsistent" when your database is under heavy load > and the time-bounds expand beyond what the application can handle. When a > batch write times out, it often still succeeds (eventually) but your tables > can be inconsistent for > > minutes, even while nodetool status shows all nodes up and normal. > > But there is another way, that requires us to take a page from our RDBMS > ancestors' book: multi-phase commit. > > Similar to logged batch writes, multi-phase commit patterns typically > entail some write amplification cost for the benefit of stronger > consistency guarantees across isolatable units (in Cassandra's case, > *partitions*). However, multi-phase commit offers stronger guarantees > that batch writes, and ALL of the additional write load is completely > distributed as per your load-balancing policy, where as batch writes all go > through one coordinator node, then get written in their entirety to the > batch log on two or three nodes, and then get dispersed in a distributed > fashion from there. > > A typical two-phase commit pattern looks like this: > > The Write Path > >1. The client code chooses a random UUID. >2. The client writes the UUID into the IncompleteTransactions table, >which only has one column, the transactionUUID. >3. The client makes all of the inserts involved in the transaction, IN >PARALLEL, with the transactionUUID duplicated in every inserted row. >4. The client deletes the UUID from IncompleteTransactions table. >5. The client makes parallel updates to all of the rows it inserted, >IN PARALLEL, setting the transactionUUID to null. > > The Read Path > >1. The client reads some rows from a partition. If this particular >client request can handle extraneous rows, you are done. If not, read on to >step #2. >2. The client gathers the set of unique transactionUUIDs. In the main >case, they've all been deleted by step #5 in the Write Path. If not, go to >#3. >3. For remaining transactionUUIDs (which should be a very small >number), query the IncompleteTransactions table. >4. The client code culls rows where the transactionUUID existed in the >IncompleteTransactions table. > > This is just an example, one that is reasonably performant for > ledger-style non-updated inserts. For transactions involving updates to > possibly existing data, more effort is required, generally the client needs > to be smart enough to merge updates based on a timestamp, with a periodic > batch job that cleans out obsolete inserts. If it feels like reinventing > the wheel, that's because it is. But it just might be the quickest path to > what you need. > > Thanks, > Cody > > On Wed, Dec 7, 2016 at 10:15 AM, Edward Capriolo > wrote: > >> I have been circling around a thought process over batches. Now that >> Cassandra has aggregating functions, it might be possible write a type of >> record that has an END_OF_BATCH type marker and the data can be suppressed >> from view until it was all there. >> >> IE you write something like a checksum record that an intelligent client >> can use to tell if the rest of the batch is complete. >> >> On Wed, Dec 7, 2016 at 11:58 AM, Voytek Jarnot >> wrote: >> >>> Been about a month since I have up on it, bu
Re: Batch size warnings
Hi Voytek, I think the way you are using it is definitely the canonical way. Unfortunately, as you learned, there are some gotchas. We tried substantially increasing the batch size and it worked for a while, until we reached new scale, and we increased it again, and so forth. It works, but soon you start getting write timeouts, lots of them. And the thing about multi-partition batch statements is that they offer atomicity, but not isolation. This means your database can temporarily be in an inconsistent state while writes are propagating to the various machines. For our use case, we could deal with temporary inconsistency, as long as it was for a strictly bounded period of time, on the order of a few seconds. Unfortunately, as with all things eventually consistent, it degrades to "totally inconsistent" when your database is under heavy load and the time-bounds expand beyond what the application can handle. When a batch write times out, it often still succeeds (eventually) but your tables can be inconsistent for minutes, even while nodetool status shows all nodes up and normal. But there is another way, that requires us to take a page from our RDBMS ancestors' book: multi-phase commit. Similar to logged batch writes, multi-phase commit patterns typically entail some write amplification cost for the benefit of stronger consistency guarantees across isolatable units (in Cassandra's case, *partitions*). However, multi-phase commit offers stronger guarantees that batch writes, and ALL of the additional write load is completely distributed as per your load-balancing policy, where as batch writes all go through one coordinator node, then get written in their entirety to the batch log on two or three nodes, and then get dispersed in a distributed fashion from there. A typical two-phase commit pattern looks like this: The Write Path 1. The client code chooses a random UUID. 2. The client writes the UUID into the IncompleteTransactions table, which only has one column, the transactionUUID. 3. The client makes all of the inserts involved in the transaction, IN PARALLEL, with the transactionUUID duplicated in every inserted row. 4. The client deletes the UUID from IncompleteTransactions table. 5. The client makes parallel updates to all of the rows it inserted, IN PARALLEL, setting the transactionUUID to null. The Read Path 1. The client reads some rows from a partition. If this particular client request can handle extraneous rows, you are done. If not, read on to step #2. 2. The client gathers the set of unique transactionUUIDs. In the main case, they've all been deleted by step #5 in the Write Path. If not, go to #3. 3. For remaining transactionUUIDs (which should be a very small number), query the IncompleteTransactions table. 4. The client code culls rows where the transactionUUID existed in the IncompleteTransactions table. This is just an example, one that is reasonably performant for ledger-style non-updated inserts. For transactions involving updates to possibly existing data, more effort is required, generally the client needs to be smart enough to merge updates based on a timestamp, with a periodic batch job that cleans out obsolete inserts. If it feels like reinventing the wheel, that's because it is. But it just might be the quickest path to what you need. Thanks, Cody On Wed, Dec 7, 2016 at 10:15 AM, Edward Capriolo wrote: > I have been circling around a thought process over batches. Now that > Cassandra has aggregating functions, it might be possible write a type of > record that has an END_OF_BATCH type marker and the data can be suppressed > from view until it was all there. > > IE you write something like a checksum record that an intelligent client > can use to tell if the rest of the batch is complete. > > On Wed, Dec 7, 2016 at 11:58 AM, Voytek Jarnot > wrote: > >> Been about a month since I have up on it, but it was very much related to >> the stuff you're dealing with ... Basically Cassandra just stepping on its >> own er, tripping over its own feet streaming MVs. >> >> On Dec 7, 2016 10:45 AM, "Benjamin Roth" wrote: >> >>> I meant the mv thing >>> >>> Am 07.12.2016 17:27 schrieb "Voytek Jarnot" : >>> Sure, about which part? default batch size warning is 5kb I've increased it to 30kb, and will need to increase to 40kb (8x default setting) to avoid WARN log messages about batch sizes. I do realize it's just a WARNing, but may as well avoid those if I can configure it out. That said, having to increase it so substantially (and we're only dealing with 5 tables) is making me wonder if I'm not taking the correct approach in terms of using batches to guarantee atomicity. On Wed, Dec 7, 2016 at 10:13 AM, Benjamin Roth >>> > wrote: > Could you please be more specific? > > Am 07.12.2016 17:10 schrieb "Voytek Jarnot" : > >> Should've mentioned - run
Re: Batch size warnings
@Ed, what you just said reminded me a lot of RAMP transactions. I did a blog post on it here: http://rustyrazorblade.com/2015/11/ramp-made-easy/ I've been considering doing a follow up on how to do a Cassandra data model enabling RAMP transactions, but that takes time, and I have almost zero of that. On Wed, Dec 7, 2016 at 9:16 AM Edward Capriolo wrote: > I have been circling around a thought process over batches. Now that > Cassandra has aggregating functions, it might be possible write a type of > record that has an END_OF_BATCH type marker and the data can be suppressed > from view until it was all there. > > IE you write something like a checksum record that an intelligent client > can use to tell if the rest of the batch is complete. > > On Wed, Dec 7, 2016 at 11:58 AM, Voytek Jarnot > wrote: > > Been about a month since I have up on it, but it was very much related to > the stuff you're dealing with ... Basically Cassandra just stepping on its > own er, tripping over its own feet streaming MVs. > > On Dec 7, 2016 10:45 AM, "Benjamin Roth" wrote: > > I meant the mv thing > > Am 07.12.2016 17:27 schrieb "Voytek Jarnot" : > > Sure, about which part? > > default batch size warning is 5kb > I've increased it to 30kb, and will need to increase to 40kb (8x default > setting) to avoid WARN log messages about batch sizes. I do realize it's > just a WARNing, but may as well avoid those if I can configure it out. > That said, having to increase it so substantially (and we're only dealing > with 5 tables) is making me wonder if I'm not taking the correct approach > in terms of using batches to guarantee atomicity. > > On Wed, Dec 7, 2016 at 10:13 AM, Benjamin Roth > wrote: > > Could you please be more specific? > > Am 07.12.2016 17:10 schrieb "Voytek Jarnot" : > > Should've mentioned - running 3.9. Also - please do not recommend MVs: I > tried, they're broken, we punted. > > On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot > wrote: > > The low default value for batch_size_warn_threshold_in_kb is making me > wonder if I'm perhaps approaching the problem of atomicity in a non-ideal > fashion. > > With one data set duplicated/denormalized into 5 tables to support > queries, we use batches to ensure inserts make it to all or 0 tables. This > works fine, but I've had to bump the warn threshold and fail threshold > substantially (8x higher for the warn threshold). This - in turn - makes > me wonder, with a default setting so low, if I'm not solving this problem > in the canonical/standard way. > > Mostly just looking for confirmation that we're not unintentionally doing > something weird... > > > > >
Re: Batch size warnings
I have been circling around a thought process over batches. Now that Cassandra has aggregating functions, it might be possible write a type of record that has an END_OF_BATCH type marker and the data can be suppressed from view until it was all there. IE you write something like a checksum record that an intelligent client can use to tell if the rest of the batch is complete. On Wed, Dec 7, 2016 at 11:58 AM, Voytek Jarnot wrote: > Been about a month since I have up on it, but it was very much related to > the stuff you're dealing with ... Basically Cassandra just stepping on its > own er, tripping over its own feet streaming MVs. > > On Dec 7, 2016 10:45 AM, "Benjamin Roth" wrote: > >> I meant the mv thing >> >> Am 07.12.2016 17:27 schrieb "Voytek Jarnot" : >> >>> Sure, about which part? >>> >>> default batch size warning is 5kb >>> I've increased it to 30kb, and will need to increase to 40kb (8x default >>> setting) to avoid WARN log messages about batch sizes. I do realize it's >>> just a WARNing, but may as well avoid those if I can configure it out. >>> That said, having to increase it so substantially (and we're only dealing >>> with 5 tables) is making me wonder if I'm not taking the correct approach >>> in terms of using batches to guarantee atomicity. >>> >>> On Wed, Dec 7, 2016 at 10:13 AM, Benjamin Roth >>> wrote: >>> Could you please be more specific? Am 07.12.2016 17:10 schrieb "Voytek Jarnot" : > Should've mentioned - running 3.9. Also - please do not recommend > MVs: I tried, they're broken, we punted. > > On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot < > voytek.jar...@gmail.com> wrote: > >> The low default value for batch_size_warn_threshold_in_kb is making >> me wonder if I'm perhaps approaching the problem of atomicity in a >> non-ideal fashion. >> >> With one data set duplicated/denormalized into 5 tables to support >> queries, we use batches to ensure inserts make it to all or 0 tables. >> This >> works fine, but I've had to bump the warn threshold and fail threshold >> substantially (8x higher for the warn threshold). This - in turn - makes >> me wonder, with a default setting so low, if I'm not solving this problem >> in the canonical/standard way. >> >> Mostly just looking for confirmation that we're not unintentionally >> doing something weird... >> > > >>>
Re: Batch size warnings
Ok thanks. Im investingating a Lot. There will be some improvements coming but cannot promise if it will solve All existing problems. We will see and keep working on it. Am 07.12.2016 17:58 schrieb "Voytek Jarnot" : > Been about a month since I have up on it, but it was very much related to > the stuff you're dealing with ... Basically Cassandra just stepping on its > own er, tripping over its own feet streaming MVs. > > On Dec 7, 2016 10:45 AM, "Benjamin Roth" wrote: > >> I meant the mv thing >> >> Am 07.12.2016 17:27 schrieb "Voytek Jarnot" : >> >>> Sure, about which part? >>> >>> default batch size warning is 5kb >>> I've increased it to 30kb, and will need to increase to 40kb (8x default >>> setting) to avoid WARN log messages about batch sizes. I do realize it's >>> just a WARNing, but may as well avoid those if I can configure it out. >>> That said, having to increase it so substantially (and we're only dealing >>> with 5 tables) is making me wonder if I'm not taking the correct approach >>> in terms of using batches to guarantee atomicity. >>> >>> On Wed, Dec 7, 2016 at 10:13 AM, Benjamin Roth >>> wrote: >>> Could you please be more specific? Am 07.12.2016 17:10 schrieb "Voytek Jarnot" : > Should've mentioned - running 3.9. Also - please do not recommend > MVs: I tried, they're broken, we punted. > > On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot < > voytek.jar...@gmail.com> wrote: > >> The low default value for batch_size_warn_threshold_in_kb is making >> me wonder if I'm perhaps approaching the problem of atomicity in a >> non-ideal fashion. >> >> With one data set duplicated/denormalized into 5 tables to support >> queries, we use batches to ensure inserts make it to all or 0 tables. >> This >> works fine, but I've had to bump the warn threshold and fail threshold >> substantially (8x higher for the warn threshold). This - in turn - makes >> me wonder, with a default setting so low, if I'm not solving this problem >> in the canonical/standard way. >> >> Mostly just looking for confirmation that we're not unintentionally >> doing something weird... >> > > >>>
Re: Batch size warnings
Been about a month since I have up on it, but it was very much related to the stuff you're dealing with ... Basically Cassandra just stepping on its own er, tripping over its own feet streaming MVs. On Dec 7, 2016 10:45 AM, "Benjamin Roth" wrote: > I meant the mv thing > > Am 07.12.2016 17:27 schrieb "Voytek Jarnot" : > >> Sure, about which part? >> >> default batch size warning is 5kb >> I've increased it to 30kb, and will need to increase to 40kb (8x default >> setting) to avoid WARN log messages about batch sizes. I do realize it's >> just a WARNing, but may as well avoid those if I can configure it out. >> That said, having to increase it so substantially (and we're only dealing >> with 5 tables) is making me wonder if I'm not taking the correct approach >> in terms of using batches to guarantee atomicity. >> >> On Wed, Dec 7, 2016 at 10:13 AM, Benjamin Roth >> wrote: >> >>> Could you please be more specific? >>> >>> Am 07.12.2016 17:10 schrieb "Voytek Jarnot" : >>> Should've mentioned - running 3.9. Also - please do not recommend MVs: I tried, they're broken, we punted. On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot >>> > wrote: > The low default value for batch_size_warn_threshold_in_kb is making > me wonder if I'm perhaps approaching the problem of atomicity in a > non-ideal fashion. > > With one data set duplicated/denormalized into 5 tables to support > queries, we use batches to ensure inserts make it to all or 0 tables. > This > works fine, but I've had to bump the warn threshold and fail threshold > substantially (8x higher for the warn threshold). This - in turn - makes > me wonder, with a default setting so low, if I'm not solving this problem > in the canonical/standard way. > > Mostly just looking for confirmation that we're not unintentionally > doing something weird... > >>
Re: Batch size warnings
I meant the mv thing Am 07.12.2016 17:27 schrieb "Voytek Jarnot" : > Sure, about which part? > > default batch size warning is 5kb > I've increased it to 30kb, and will need to increase to 40kb (8x default > setting) to avoid WARN log messages about batch sizes. I do realize it's > just a WARNing, but may as well avoid those if I can configure it out. > That said, having to increase it so substantially (and we're only dealing > with 5 tables) is making me wonder if I'm not taking the correct approach > in terms of using batches to guarantee atomicity. > > On Wed, Dec 7, 2016 at 10:13 AM, Benjamin Roth > wrote: > >> Could you please be more specific? >> >> Am 07.12.2016 17:10 schrieb "Voytek Jarnot" : >> >>> Should've mentioned - running 3.9. Also - please do not recommend MVs: >>> I tried, they're broken, we punted. >>> >>> On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot >>> wrote: >>> The low default value for batch_size_warn_threshold_in_kb is making me wonder if I'm perhaps approaching the problem of atomicity in a non-ideal fashion. With one data set duplicated/denormalized into 5 tables to support queries, we use batches to ensure inserts make it to all or 0 tables. This works fine, but I've had to bump the warn threshold and fail threshold substantially (8x higher for the warn threshold). This - in turn - makes me wonder, with a default setting so low, if I'm not solving this problem in the canonical/standard way. Mostly just looking for confirmation that we're not unintentionally doing something weird... >>> >>> >
Re: Batch size warnings
Sure, about which part? default batch size warning is 5kb I've increased it to 30kb, and will need to increase to 40kb (8x default setting) to avoid WARN log messages about batch sizes. I do realize it's just a WARNing, but may as well avoid those if I can configure it out. That said, having to increase it so substantially (and we're only dealing with 5 tables) is making me wonder if I'm not taking the correct approach in terms of using batches to guarantee atomicity. On Wed, Dec 7, 2016 at 10:13 AM, Benjamin Roth wrote: > Could you please be more specific? > > Am 07.12.2016 17:10 schrieb "Voytek Jarnot" : > >> Should've mentioned - running 3.9. Also - please do not recommend MVs: I >> tried, they're broken, we punted. >> >> On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot >> wrote: >> >>> The low default value for batch_size_warn_threshold_in_kb is making me >>> wonder if I'm perhaps approaching the problem of atomicity in a non-ideal >>> fashion. >>> >>> With one data set duplicated/denormalized into 5 tables to support >>> queries, we use batches to ensure inserts make it to all or 0 tables. This >>> works fine, but I've had to bump the warn threshold and fail threshold >>> substantially (8x higher for the warn threshold). This - in turn - makes >>> me wonder, with a default setting so low, if I'm not solving this problem >>> in the canonical/standard way. >>> >>> Mostly just looking for confirmation that we're not unintentionally >>> doing something weird... >>> >> >>
Re: Batch size warnings
Could you please be more specific? Am 07.12.2016 17:10 schrieb "Voytek Jarnot" : > Should've mentioned - running 3.9. Also - please do not recommend MVs: I > tried, they're broken, we punted. > > On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot > wrote: > >> The low default value for batch_size_warn_threshold_in_kb is making me >> wonder if I'm perhaps approaching the problem of atomicity in a non-ideal >> fashion. >> >> With one data set duplicated/denormalized into 5 tables to support >> queries, we use batches to ensure inserts make it to all or 0 tables. This >> works fine, but I've had to bump the warn threshold and fail threshold >> substantially (8x higher for the warn threshold). This - in turn - makes >> me wonder, with a default setting so low, if I'm not solving this problem >> in the canonical/standard way. >> >> Mostly just looking for confirmation that we're not unintentionally doing >> something weird... >> > >
Re: Batch size warnings
Should've mentioned - running 3.9. Also - please do not recommend MVs: I tried, they're broken, we punted. On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot wrote: > The low default value for batch_size_warn_threshold_in_kb is making me > wonder if I'm perhaps approaching the problem of atomicity in a non-ideal > fashion. > > With one data set duplicated/denormalized into 5 tables to support > queries, we use batches to ensure inserts make it to all or 0 tables. This > works fine, but I've had to bump the warn threshold and fail threshold > substantially (8x higher for the warn threshold). This - in turn - makes > me wonder, with a default setting so low, if I'm not solving this problem > in the canonical/standard way. > > Mostly just looking for confirmation that we're not unintentionally doing > something weird... >