Re: cassandra-stress HexStrings generator

2018-12-14 Thread Saleil Bhat (BLOOMBERG/ 731 LEX)
Main issue is resolved. The test I was using to determine normality was too 
sensitive to discretization, so it was yielding a negative result even though 
the data looked pretty normal on visual inspection.  The tool only ever uses 
the Strings generator; HexStrings is unused.

The only (minor) concern is that the Strings generator generates some control 
characters as part of the generated string. I presume that this behavior is 
undesired and that the characters should be restricted to ASCII printing 
characters. 

Thanks, 
-Saleil

From: bened...@apache.org At: 12/13/18 17:10:17To:  Saleil Bhat (BLOOMBERG/ 731 
LEX ) ,  dev@cassandra.apache.org
Subject: Re: cassandra-stress HexStrings generator

I’m honestly not sure.  The code has changed since I last worked on it, which 
was years ago.  I suspect the profile mode has entirely supplanted the prior 
modes, and that these older modes supported the HexStrings generator.

Perhaps somebody else can help answer this question.


> On 13 Dec 2018, at 17:37, Saleil Bhat (BLOOMBERG/ 731 LEX) 
 wrote:
> 
> Ah ok thanks. This brings up another question: how did the HexStrings 
generator code path even get called? 
> 
> 
> 
> When I saw these results, I was using the following test table: 
>  CREATE TABLE testtable (
>  partition_key text,
>  clustering_column text, 
>  value text,
>  PRIMARY KEY (partition_key, clustering_column)
>  )
> 
> 
> From StressProfile.java, any column of type TEXT should use the Strings 
generator. 
> However, my data looks suspiciously like the HexStrings generator was being 
used instead. 
> 
> 
> First, the generated strings included control characters like SUB (\x1A), BEL 
(\x07), etc. However, the Strings generator code looks like it forces the 
characters to be in the printing characters range. 
> Second, the result I documented previously (that the characters are normally 
distributed, but the strings are not), matches the implementation of 
HexStrings. 
> 
> 
> 
> Do you know why this might be the case?
> 
> Thanks, 
> -Saleil 
> 
> 
> From: bened...@apache.org At: 12/12/18 18:09:14To:  Saleil Bhat (BLOOMBERG/ 
731 LEX ) ,  dev@cassandra.apache.org
> Subject: Re: cassandra-stress HexStrings generator
> 
> Yes, I’m pretty sure you understood correctly (I wrote most of this, but it’s 
> been a long time so I cannot remember much for certain).  
> 
> It should be implemented like the Strings generator.  It looks like both 
> HexStrings and HexBytes are incorrect, and have been for a long time.
> 
> 
>> On 12 Dec 2018, at 22:27, Saleil Bhat (BLOOMBERG/ 731 LEX) 
>  wrote:
>> 
>> Hi, 
>> 
>> I have a question about the behavior of the HexStrings value generator in 
the 
> cassandra-stress tool, particularly concerning its population/identity 
> distribution.  
>> 
>> 
>> Per the discussion in JIRA item CASSANDRA-6146 concerning the stress YAML 
> profile, the population field in a columnspec “represents the total unique 
> population distribution of that column across rows.”
>> 
>> 
>> I interpreted this to mean that if I specify some distribution 'F' for a 
> column, then the probability of occurrence for each potential value of that 
> column is given by 'F'. 
>> 
>> So, for example, if I provided the following columnspec for a text column: 
>> name: fake_column 
>>  size: fixed(32) 
>>population: gaussian(1..100)  
>> and then generated a large amount of data according to this specification, 
>> I would expect there to be 100 distinct values for ‘fake_column’, and that a 
> histogram of the frequency of occurrence of each value would be roughly 
> bell-shaped. 
>> 
>> 
>> 
>> However, the current implementation of the HexStrings generator deviates 
from 
> this expectation. In the current implementation, each CHARACTER in the string 
> is drawn from F, rather than the string as a whole. Therefore, if you plot 
the 
> histogram of frequency of occurrence for each character, you get a 
bell-shaped 
> curve, but the distribution of the occurrences of whole strings (the actual 
> columns) is something else. 
>> 
>> 
>> My question is, is this the desired behavior for string columns? Was my 
> expectation/interpretation incorrect? If so, can anyone give some insight as 
to 
> why strings are designed to behave this way and what the use case is for this 
> behavior? 
>> 
>> Thanks, 
>> -Saleil 
> 
> 




Re: CASSANDRA-14925 DecimalSerializer.toString() can OOM

2018-12-14 Thread dinesh.jo...@yahoo.com.INVALID
I think it makes sticking to trunk as this change will affect log messages and 
may break tooling that depends on certain patterns.
Dinesh 

On Friday, December 14, 2018, 4:09:51 PM GMT+5:30, Jasonstack Zhao Yang 
 wrote:  
 
 Hi,

Would like to get some feedback for CASSANDRA-14925.

In order to avoid potential OOM attack, we propose to change
DecimalSerializer.toString() from `BigDecimal.toPlainString()` to
`BigDecimal.toString()` on Trunk.

This change should not cause any compatibility issues..

Thanks
Zhao Yang
  

CASSANDRA-14925 DecimalSerializer.toString() can OOM

2018-12-14 Thread Jasonstack Zhao Yang
Hi,

Would like to get some feedback for CASSANDRA-14925.

In order to avoid potential OOM attack, we propose to change
DecimalSerializer.toString() from `BigDecimal.toPlainString()` to
`BigDecimal.toString()` on Trunk.

This change should not cause any compatibility issues..

Thanks
Zhao Yang


Re: Revisit the proposal to use github PR

2018-12-14 Thread Benedict Elliott Smith
So, without hopefully getting too dragged into this, I’m honestly not convinced 
that GitHub is so phenomenal for review.  Though perhaps I just suck at the UX.

- Missed notifications, to both me and seemingly Jira.  Jira sometimes fails my 
personal email too, but commits@ can be relied upon.
- Replication from GitHub PRs is super unpleasant to read (so I don’t).
- Complex patches are noisy - impossible to easily signal/emphasise important 
commentary; gets lost in the maelstrom of nitpicks
- Stale comments.  Force-pushes lose their context.  Perhaps you can expand 
them?  At the very least, they are hidden by default, meaning probably the most 
important commentary that required major changes is least accessible.

I have personally found a great balance to be using GitHub for nitpicks or 
questions, because in their context they’re a lot easier for both parties, and 
they’re also less critical.  I encourage questions to be resolved by actually 
commenting the code, so that future readers don’t have to look in GitHub or 
anywhere else for the answer.

Complex feedback potentially requiring a rethink of the approach, or a 
refactor, etc, I prefer to leave in Jira.  We have a lot of practice having 
complex discussions here, and are fairly good at keeping the discussion 
focused, to de/emphasise un/important points.

That said, I would be totally fine with using GitHub PRs, if we keep important 
discussions to Jira.

Basically, +1 Sylvain’s view, or thereabouts.


> On 13 Dec 2018, at 18:00, Jason Brown  wrote:
> 
> To clarify my position: Github PRs are great for *reviewing* code, and the
> commentary is much easier to follow imo. But for *merging* code, esp into
> our multi-branch strategy, PRs don't fit well, unless there's some
> technique I and perhaps others are unaware of.
> 
> On Thu, Dec 13, 2018 at 9:47 AM Ariel Weisberg  wrote:
> 
>> Hi,
>> 
>> I'm not clear on what github makes worse. It preserves more history then
>> the JIRA approach. When people invitably force push their branches you
>> can't tell from the link to a branch on JIRA. Github preserves the comments
>> and force push history so you know what version of the code each comment
>> applied to. Github also tracks when requests for changes are acknowledged
>> and resolved.  I have had to make the same change request many times and
>> keep track independently whether it was resolved. This has also resulting
>> in mistakes getting merged when I missed a comment that was ignored.
>> 
>> Now that github can CC JIRA that also CCs to commits@. It's better then
>> JIRA comments because each comment includes a small diff of the code the
>> comment applies to. To do that in JIRA I have to manually link to the code
>> the PR and most people don't do that for every comment so some of them are
>> inscrutable after the fact. Also manually created links sometimes refer to
>> references that disappear or get force pushed. It's a bit tricky to get
>> right.
>> 
>> To me arguing against leveraging a better code review workflow (whether
>> github or some other tool) is like arguing against using source control
>> tools. Sure the filesystem and home grown scripts can be used to work
>> around lack of source control, but why would you?
>> 
>> I see two complaints so far. One is that github PRs encourage nitpicking.
>> I don't see tool based solution to that off the cuff. Another is that
>> github doesn't by default CC JIRA. Maybe we can just refuse to accept
>> improperly formatted PRs and look into auto rejecting ones that don't refer
>> to a ticket?
>> 
>> Ariel
>> 
>> On Thu, Dec 13, 2018, at 12:20 PM, Aleksey Yeschenko wrote:
>>> There are some nice benefits to GH PRs, one of them is that we could
>>> eventually set up CircleCI hooks that would explicitly prevent commits
>>> that don’t pass the tests.
>>> 
>>> But handling multiple branches would indeed be annoying. Would have to
>>> either submit 1 PR per branch - which is both tedious and non-atomic -
>>> or do a mixed approach, with a PR for the oldest branch, then a manual
>>> merge upwards. The latter would be kinda meh, especially when commits
>>> for different branches diverge.
>>> 
>>> For me personally, the current setup works quite well, and I mostly
>>> share Sylvain’s opinion above, for the same reasons listed.
>>> 
>>> —
>>> AY
>>> 
 On 13 Dec 2018, at 08:15, Sylvain Lebresne  wrote:
 
 Fwiw, I personally find it very useful to have all discussion, review
 comments included, in the same place (namely JIRA, since for better or
 worse, that's what we use for tracking tickets). Typically, that means
 everything gets consistently pushed to the  commits@ mailing list,
>> which I
 find extremely convenient to keep track of things. I also have a theory
 that the inline-comments type of review github PR give you is very
 convenient for nitpicks, shallow or spur-of-the-moment comments, but
 doesn't help that much for deeper reviews, and that it thus to