Re: High cost of failed Math.*Exact calls

2025-01-10 Thread John Rose
On 13 Dec 2024, at 5:22, Raffaello Giulietti wrote:

> For your specific multiplication use case you might try with
>
> long high = Math.multiplyHigh(a, b);
> long low = a * b;
> if (high == 0) return low;
> return "the big integer consisting of high and low";
>
> It might be possible that the multiplyHigh() and * on the same operands, when 
> appearing adjacent to each other, get optimized to just one instruction.
> And if not, they might be executed "in parallel" inside the CPU.

Here’s a relevant RFE, not yet implemented:

https://bugs.openjdk.org/browse/JDK-8285871
Math.multiplyHigh and multiply on same inputs can be computed faster if their 
computation is shared

And FWIW, a lambda-based API could be made generic in the lambda result, so 
that the full BigInteger (or whatever) could be returned immediately:

public static <R>
R multiplyExact(long x, long y, BiFunction<Long, Long, R> recompute) {
  if (<no overflow>)  return x * y;
  return recompute.apply(x, y);
}

That lambda can also cover use cases with throwing and sigils and nulls.

To work efficiently it also needs some sort of box suppression, either by 
reliable inlining and EA, or by handling Long as a value class.

For completeness, if we get primitives over generics, then we could also spell 
it with lower-case “L”s:

public static <R>
R multiplyExact(long x, long y, BiFunction<long, long, R> recompute) {
  if (<no overflow>)  return x * y;
  return recompute.apply(x, y);
}
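For concreteness, here is a hedged, compilable rendering of the boxed
sketch above.  The overflow test (`high == (low >> 63)`, i.e. the high
half of the 128-bit product is just the sign extension of the low half)
and the extra fast-path lambda are my additions, not part of the sketch:

```java
import java.math.BigInteger;
import java.util.function.BiFunction;
import java.util.function.LongFunction;

public class MultiplyExactSketch {
    // Hedged sketch: the fast-path lambda is added so the generic result
    // type compiles; the overflow condition assumes the 128-bit product
    // fits in a signed long iff the high half is the sign extension of
    // the low half.
    public static <R> R multiplyExact(long x, long y,
                                      LongFunction<R> fast,
                                      BiFunction<Long, Long, R> recompute) {
        long low = x * y;
        long high = Math.multiplyHigh(x, y);
        if (high == (low >> 63)) return fast.apply(low);   // no overflow
        return recompute.apply(x, y);
    }

    public static void main(String[] args) {
        BiFunction<Long, Long, BigInteger> big =
            (a, b) -> BigInteger.valueOf(a).multiply(BigInteger.valueOf(b));
        System.out.println(multiplyExact(3L, 4L, BigInteger::valueOf, big));
        System.out.println(multiplyExact(Long.MAX_VALUE, 2L,
                                         BigInteger::valueOf, big));
    }
}
```

The second call overflows a long, so the `recompute` lambda supplies
the full BigInteger product.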



Re: RFR: 8293336: AOT-linking of invokedynamic for lambda expression and string concat [v5]

2024-10-21 Thread John Rose
On 15 Oct 2024, at 12:35, Ioi Lam wrote:

> On Tue, 15 Oct 2024 19:08:20 GMT, Dan Heidinga  wrote:
>
>>> 597:
>>> 598: /** Number of CPUS, to place bounds on some sizings */
>>> 599: static @Stable int NCPU;
>>
>> I would prefer to not mark this `@Stable` at this time as it would have 
>> different assembly and runtime values and instead add a followup RFE to 
>> investigate adding it in later.
>
> We have been archiving `ConcurrentHashMap` objects for many JDK releases (to 
> support archiving of modules) and `NCPU` would change between CDS archive 
> dump time and program run time. We have not seen any problems, so I think 
> it's safe to use the `@Stable` annotation.

Is it necessary to get constant folding from NCPU?  I guess a division is slow, 
but it does not seem to be a hot path, at all.  The division is not in any 
loop.  If there is no performance issue, it’s perhaps more honest just to use a 
mutable variable.  I guess what we need here is something that means “stable 
except for AOT activity”.

Normally, making something stable means it should be used according to the 
contract of stable, which is only one second value, and no third, after the 
initial default.
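As a rough illustration of that contract (default value, then at most
one further value, never a third), here is a plain-field stand-in for
an NCPU-style stable; the names are mine, not the JDK's:

```java
public class StableSketch {
    // In the JDK this field would carry the private @Stable annotation;
    // the contract: the default (0), then at most one "second" value.
    static int NCPU;

    static int ncpu() {
        int n = NCPU;
        if (n == 0) NCPU = n = Runtime.getRuntime().availableProcessors();
        return n;   // callers only ever observe the one computed value
    }

    public static void main(String[] args) {
        System.out.println(ncpu() >= 1);        // true
        System.out.println(ncpu() == ncpu());   // true: no third value
    }
}
```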

There is a sub-rosa contract, as well, which is that we can add third values if 
they do not conflict with the second values.  But we have to convince ourselves 
that is safe, because the second and third values can (in general) co-exist as 
folded constants in JIT code.

We currently refuse to fold stable constants in AOT code (and there’s no AOT 
code in this JEP anyway) so the AOT-only stable binding is tolerable.

Here’s what I think we should do later (not in this PR):  Amend the contract 
for @Stable that the value is allowed to be different during the AOT assembly 
phase than later.  Working out the details will require some discussion:  
Should the AOT-AP reset stable values that need new bindings?  (Probably…?)  Or 
should there be a warmup method that takes full responsibility for managing the 
stable value during the production run?

For now, I’m content to make this guy NCPU either stable or not.  But it is 
clear that, along with other details, we will need an amended story for the 
life-cycle of a stable variable that includes the AOT assembly phase.

Eventually, something like this may well play into a similar Leyden story for 
user-visible stables.  Those would be declared by an object API, not the 
private annotation we use in the JDK.  User-visible stable objects are likely 
to be wrappers for JDK-level stable-annotated fields, and that’s how the 
turtles tend to stack up.

No PR change requested here!


Re: RFR: 8293336: AOT-linking of invokedynamic for lambda expression and string concat

2024-10-08 Thread John Rose
On 8 Oct 2024, at 14:47, Ioi Lam wrote:

> On Wed, 25 Sep 2024 17:14:54 GMT, Chen Liang  wrote:
>
>>> 402: MethodType primordialMT = new MethodType(rtype, ptypes);
>>> 403: if (AOTHolder.archivedMethodTypes != null) {
>>> 404: MethodType mt = 
>>> AOTHolder.archivedMethodTypes.get(primordialMT);
>>
>> Can we populate these AOT MTs into the internTable so we only look up once?
>
> There's a trade off here --
> start-up will be slowed down if the archivedMethodTypes table is big. Also, 
> `internTable` uses more memory as it wraps every entry inside a weak 
> reference.

I think there is a design pattern we should be thinking about,
which I have been calling the “Three Table Solution”.  (And
there might be just Two Tables.)  The runtime table is the
existing intern table with its weak references.  It is
designed for fast queries, fast updates, and good relations
with the GC.  It does not have the best possible footprint.
The AOT table is archivedMethodTypes.  It is designed for
compactness, fast static boot-up, and reasonably fast queries.
It is read-only.  (It is permitted to have slower queries,
such as in a log-N tree instead of an O(1) hash table, but if
that is the case then entries should be migrated into the
main runtime table as they are extracted.)

A third table, maybe, is the startup table; it collects
entries at startup before the big main runtime table can
be created.  It might also serve instead of the runtime
table during the assembly phase.  Its main property is
simplicity, meaning lack of complex init-dependencies
on the rest of the JDK.  It does not need weak refs,
or any other kind of scalability, because it will
be quickly retired when the system boots up.  In
fact, the AOT table is likely to fulfill all early
queries, so the startup table is probably empty,
as long as an AOT cache is available.  The startup
table might be useful for exploded builds, as well
as for the assembly phase.  Note that while the
startup table exists, the main runtime table is
not yet created; its static reference is null.
(It should be a @Stable field!)

That third table is not very useful if we can figure
out how to use the main runtime table (with its
weak refs) at all init phases.  So maybe this design
pattern is the Two Table Solution, if we can tame
weak refs for Leyden.

But I do think we have a long-term need for two
tables, with an AOT one optimized for fast boot-up
and compactness (and slow to create, and not mutated
during startup).

The order of operations could be:

A. check AOT table; if present, return
B. check main table or startup table; if present, return
C. make new object, install in appropriate table, return

But in some cases it is better to always check the main
table first, if it exists.  (A @Stable reference will
make the query method inline, after the JIT gets ahold
of the object.)  Like this:

A. check main table; if present, return
B. check AOT table or startup table
B1. if present, and if main table exists, move to main table
B2. if present, then return
C. make new object, install in appropriate table, return
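A minimal sketch of that second ordering, with made-up names and plain
maps standing in for the real weak-ref intern table and the read-only
archived table:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TwoTableIntern {
    // Stand-ins: the real AOT table is a compact archived structure and
    // the real main table holds weak references.
    static final Map<String, String> AOT_TABLE = Map.of("k1", "v1");
    static final Map<String, String> MAIN_TABLE = new ConcurrentHashMap<>();

    static String intern(String key) {
        String v = MAIN_TABLE.get(key);            // A. check main table
        if (v != null) return v;
        v = AOT_TABLE.get(key);                    // B. check AOT table
        if (v != null) {
            MAIN_TABLE.putIfAbsent(key, v);        // B1. migrate to main table
            return v;                              // B2. return
        }
        // C. make new object, install in the main table, return
        return MAIN_TABLE.computeIfAbsent(key, k -> "new:" + k);
    }

    public static void main(String[] args) {
        System.out.println(intern("k1"));                  // hits AOT, migrates
        System.out.println(intern("k2"));                  // makes new object
        System.out.println(MAIN_TABLE.containsKey("k1"));  // migrated: true
    }
}
```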

> PR Review Comment: 
> https://git.openjdk.org/jdk/pull/21143#discussion_r1782292378


Re: RFR: 8338023: Support two vector selectFrom API [v5]

2024-08-27 Thread John Rose
On 23 Aug 2024, at 15:33, Paul Sandoz wrote:

> The float/double conversion bothers me, not suggesting we do something about 
> it here, noting down for any future conversation on shuffles.

Yes, it’s a pain which is noticeable in the vector/shuffle conversions.
In the worst case it adds dynamic reformatting operations to get from
the artificially “uniform” float/double index format into the real
format the hardware requires.  As a workaround, the user could convert
the float/double payloads bitwise into int/long payloads, and then
do the shuffling in the uniform int/long API, later reconverting
back to float/double after the payloads are reordered.  Those
conversions don’t actually use any dynamic operations.

For prototyping, it seems fine to take the hit and ignore the fact
that the index vectors are in an odd (though “uniform”) format.


Re: RFR: 8338023: Support two vector selectFrom API [v3]

2024-08-21 Thread John Rose
On 21 Aug 2024, at 11:30, Paul Sandoz wrote:

> Is it possible for the intrinsic to be responsible for wrapping, if needed? 
> I was looking at 
> [`vpermi2b`](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpermi2b&ig_expand=4917,4982,5004,5010,5014&techs=AVX_512)
>  and AFAICT it implicitly wraps, operating on the lower N bits. Is that 
> correct?

That’s not a bad idea.  But it is also possible (and routine) for the
JIT to take an expression like (i >> (j&31)) down to (i >> j) if the
hardware takes care of the (j&31) inside its >> operation.  I think
that some hardware permutation operations do something similar to >>
in that they simply ignore irrelevant bits in the steering indexes.
(Other operations do exotic things with irrelevant bits, such as
interpreting the sign bit as a command to “force this one to zero”.)

If the wrapping operation for steering indexes is just a vpand against
a simple constant, then maybe (maybe!) the JIT can easily drop that
vpand, when the input is passed to a friendly auto-masking instruction,
just like with (i >> (j&31)).
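The scalar analogue is easy to check: the JLS specifies that int shifts
use only the low five bits of the count, so the explicit mask is
redundant and the JIT can drop it:

```java
public class ShiftMaskDemo {
    public static void main(String[] args) {
        int i = -8, j = 33;
        // For int shifts the count is taken mod 32, so (i >> j) already
        // behaves like (i >> (j & 31)).
        System.out.println(i >> j);                       // -4, same as -8 >> 1
        System.out.println((i >> j) == (i >> (j & 31)));  // true
    }
}
```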

On the other hand, Paul’s idea might be more robust.  It would require
that the permutation intrinsics would apply vpand at the right places,
and omit vpand when possible.

On the other other hand (the first hand) the classic way of doing it
doesn’t introduce vpand inside of intrinsics, which has a routine
advantage:  The vpands introduced outside of the intrinsic can be
user-introduced or framework-introduced or both.  In all cases, the
JIT treats them uniformly and can collapse them together.  Putting
magic fixup instructions inside of intrinsic expansion risks making
them invisible to the routine optimizations of the JIT.  So,
assuming the vpand gets good optimization, putting it outside of
the intrinsic is the most robust option, as long as “good optimization”
includes the >>(j&31) trick for auto-masking instructions.  So the
intrinsic should look for a vpand in its steering input, and pop off
the IR node if the hardware masking is found to produce the same result.


Re: RFR: 8338023: Support two vector selectFrom API [v3]

2024-08-21 Thread John Rose
On 21 Aug 2024, at 10:51, Sandhya Viswanathan wrote:

> @jatin-bhateja Thanks, the PR (https://github.com/openjdk/jdk/pull/20634) is 
> still work in progress and can be simplified much further. The changes I am 
> currently working on are do wrap by default for rearrange and selectFrom as 
> suggested by John and Paul, no additional api with boolean wrap as parameter, 
> and no changes to shuffle constructors.

Yes, thank you Sandhya; this is the destination I hope to arrive at.
Not necessarily 100% in this PR, but this PR should be consistent with it.

…To review:  Shuffles store their indexes “partially wrapped” so as
to preserve information about which indexes were out of bounds, but
they also preserve all index values mod VLEN.  It’s always an option,
though not a requirement, to fully wrap, removing the OOB info and
reducing every index down to 0..VLEN-1.  When using a vector instead
of a shuffle for steering, we think of this as creating a temp shuffle
first, then doing the appropriate operation(s).  But for best instruction
selection, we have found that it’s fastest to force everything down to
0..VLEN-1 immediately, at least in the vector case, and to a doubled
dynamic range, mod 2VLEN, for the two-input case.  There’s always an
equivalent expression which uses an explicit shuffle to carry either
VLEN (fully wrapped) or 2VLEN (partially wrapped) indexes.  For the
vector-steered version we implement only the most favorable pattern
of shuffle usage, one which never throws.  And of course we don’t
build a temp shuffle either.
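A scalar sketch of the two index reductions described above (my own
rendering, assuming VLEN is a power of two; the real operations act
per lane):

```java
public class WrapSketch {
    // Partial wrap: keep OOB info as negatives in [-VLEN, -1] while
    // staying congruent mod VLEN (a 2*VLEN dynamic range).
    static int partialWrap(int i, int vlen) {
        int w = i & (2 * vlen - 1);              // reduce mod 2*VLEN
        return w < vlen ? w : w - 2 * vlen;      // upper half goes negative
    }

    // Full wrap: drop the OOB info, reduce every index to [0, VLEN).
    static int fullWrap(int i, int vlen) {
        return i & (vlen - 1);
    }

    public static void main(String[] args) {
        System.out.println(partialWrap(3, 8));   // 3: in range, unchanged
        System.out.println(partialWrap(9, 8));   // -7: OOB, still 1 mod 8
        System.out.println(fullWrap(9, 8));      // 1
    }
}
```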


Re: RFR: 8338023: Support two vector selectFrom API

2024-08-16 Thread John Rose
(Better late than never, although I wish I’d been more explicit
about this on panama-dev.)

I think we should be moving away from throwing exceptions on all
reorder/shuffle/permute vector ops, and moving toward wrapping.
These ops all operate on vectors (small arrays) of vector lane
indexes (small array indexes in a fixed domain, always a power
of two).  The throwing behavior checks an input for bad indexes
and throws a (scalar) exception if there are any at all.  The
wrapping behavior reduces bad indexes to good ones by an unsigned
modulo operation (which is at worst a mask for powers of two).

If I’m right, then new API points should start out with wrap
semantics, not throw semantics.  And old API points should be
migrated ASAP.

There’s no loss of functionality in such a move.  Instead the
defaults are moved around.  Before, throwing was the default
and wrapping was an explicit operation.  After, wrapping would
be the default and throwing would be explicit.  Both wrapping
and throwing checks are available through explicit calls to
VectorShuffle methods checkIndexes and wrapIndexes.
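Since VLEN is always a power of two, the wrapping reduction is just a
mask.  A scalar sketch of what a per-lane wrap amounts to:

```java
public class WrapDemo {
    // Unsigned modulo by a power-of-two VLEN reduces to a single mask,
    // which is why wrapping is cheap compared to a check-and-throw.
    static int wrap(int i, int vlen) {
        return i & (vlen - 1);
    }

    public static void main(String[] args) {
        System.out.println(wrap(17, 16));   // 1
        System.out.println(wrap(-1, 16));   // 15
    }
}
```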

OK, so why is wrapping better than throwing?  And first, why
did we start with throwing as the default?  Well, we chose
throwing as the default to make the vector operations
more Java-like.  Java scalar operations don’t try to reduce
bad array indexes into the array domain; they throw.  Since
a shuffle op is like an array reference, it makes sense to
emulate the checks built into Java array references.

Or it did make sense.  I think there is a technical debt
here which is turning out to be hard to pay off.  The tech
debt is to suppress or hoist or strength-reduce the vector
instructions that perform the check for invalid indexes
(in parallel), then ask “did any of those checks fail?”
(a mask reduction), then do a conditional branch to
failure code.  I think I was over-confident that our
scalar tactics for reducing array range checks would
apply to vectors as well.  On second thought, vectorizing
our key optimization, of loop range splitting (pre/main/post
loops) is kind of a nightmare.

Instead, consider the alternative of wrapping.  First,
you use vpand or the like to mask the indexes down to
the valid range.  Then you run the shuffle/permute
instruction.  That’s it.  There is no scalar query
or branch.  And, there are probably some circumstances
where you can omit the vpand operation:  Perhaps the
hardware already masks the inputs (as with shift
instructions).  Or, perhaps C2 can do bitwise inference
of the vectors and figure out that the vpand is a nop.
(I am agitating for bitwise types in C2; this is a use
case for them.)  In the worst case, the vpand op is
fast and pipelines well.

This is why I think we should switch, ASAP, to masking
instead of throwing, on bad indexes.

I think some of our reports from customers have shown
that the extra checks necessary for throwing on bad
indexes are giving their code surprising slowdowns,
relative to C-based vector code.

Did I miss a point?

— John

On 14 Aug 2024, at 3:43, Jatin Bhateja wrote:

> On Mon, 12 Aug 2024 22:03:44 GMT, Paul Sandoz  wrote:
>
>> The results look promising. I can provide guidance on the specification 
>> e.g., we can specify the behavior in terms of rearrange, with the addition 
>> of throwing on out of bounds indexes.
>>
>> Regarding the throwing of exceptions, some wider context will help to know 
>> where we are heading before we finalize the specification. I believe we are 
>> considering changing the default throwing behavior for index out of bounds 
>> to wrapping, thereby we can avoid bounds checks. If that is the case we 
>> should wait until that is done then update rather than submitting a CSR just 
>> yet?
>>
>> I see you created a specific intrinsic, which will avoid the cost of shuffle 
>> creation. Should we apply the same approach (in a subsequent PR) to the 
>> single argument shuffle? Or perhaps if we manage to optimize shuffles and 
>> change the default wrapping we don't require a specific intrinsic and can 
> just defer to rearrange?
>
> Hi @PaulSandoz ,
> Thanks for your comments. With this new API we intend to enforce stricter 
> specification w.r.t to index values to emit a lean instruction sequence 
> preventing any cycles spent on massaging inputs to a consumable form, thus 
> preventing redundant wrapping and unwrapping operations.
>
> Existing [two vector rearrange 
> API](https://docs.oracle.com/en/java/javase/22/docs/api/jdk.incubator.vector/jdk/incubator/vector/Vector.html#rearrange(jdk.incubator.vector.VectorShuffle,jdk.incubator.vector.Vector))
>  has a flexible specification which allows wrapping out of bounds shuffle 
> indexes into exceptional index with a -ve value.
>
> Even if we optimize existing two vector rearrange implementation we will 
> still need to emit additional instructions to generate an indexes which lie 
> within two vector range [0, 2*VLEN). I see this as a specialized API like 
> vector compress

Re: [External] : Re: New candidate JEP: 471: Deprecate the Memory-Access Methods in sun.misc.Unsafe for Removal

2024-05-03 Thread John Rose
P.S. I didn’t directly address David’s request for that magic number N.
I would just assume 32 bits as the conservative element size (for oops)
and work from there.  If the cache line size is 64 bytes, then N=16.
These are robust assumptions, even if they waste a little space
when N could be 8 (because oops are 64 bits).  If the objects are
flattened, N=16 is still conservatively correct, as long as they
flatten to at least 4 bytes (which is very likely, and if not just
use a primitive array).  The same external considerations that
determine the D$ line size, to a value other than 64, can also
configure the parameter N, to a value other than 16.  That is
how I would build David’s widget without using Unsafe.
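The arithmetic above, spelled out (the 64-byte line size and 4-byte
element size are the stated conservative assumptions):

```java
public class ContendedStride {
    public static void main(String[] args) {
        int cacheLineBytes = 64;   // assumed D$ line size
        int elementBytes = 4;      // conservative: 32-bit (compressed) oops
        int n = cacheLineBytes / elementBytes;
        // a[i] and a[i+n] then never share a cache line
        System.out.println(n);     // 16
    }
}
```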


Re: [External] : Re: New candidate JEP: 471: Deprecate the Memory-Access Methods in sun.misc.Unsafe for Removal

2024-05-03 Thread John Rose
Some of these sorts of use cases could be covered by somehow discovering
a platform-specific parameter N such that a[I] and a[I+N] are not likely
to have false sharing in a data cache, for any I.  (Or a series of N0,
N1, … that prevent a[N0], a[N1]… from being falsely shared.)  David is
right that this parameter depends on the physical array stride,
and that depends on (1) compressed oops or not, and/or (2) flattening
of specific element types.  It depends on cache architecture also.
Throw Leyden into the mix, and it is not clear that just querying
a number is the right answer, either, since it might be predicted
one way by Leyden and end up differently in application deployment.

But this is fragile and error-prone, I think.  It might be nicer
to encapsulate it as folks suggest, even though its users are
power users.  (They are probably power users in a different
domain from VM implementation.)

So I also would prefer to see, rather than a fragile idiom for using
arrays, an API that features an opaque container object and a varhandle
for accessing its (un-)contended “lanes”. The thing would feel like
an “array with varhandles”.  And the opaque object might even be
an array of some sort, if you looked at it closely.  I don’t think
it needs to be wrapped in an envelope to hide its nature.  Power
users ought to know not to look too closely at the container object.

It seems to me that this should be built first on top of the JDK,
not inside the JDK, as a way to shake out the design before the JDK
commits to such a design.

The @Contended feature does not ensure that (un-)contended fields
fall at the start of a cache line, because the VM does not align
objects that precisely (by default).  Likewise, an indexed version
like we are talking about here would not guarantee any particular
offset that the data lanes would fall in (within cache lines).
It would simply set unrelated ones far enough apart to ensure
that they cannot fall on the same cache line; this requires that
the platform guarantee an appropriate minimum distance.

BTW, the @Contended feature allows fields to be grouped.  A good
corresponding feature for arrays (with varhandles) should allow
for grouping similarly.

So, a first draft API might look something like this:

class ContendedArrayFactory {
  ContendedArrayFactory(Class<?> elementType, int groupSize);
  VarHandle accessor(int which);  // which < groupSize
  ContendedArrayFactory(Class<?> elementType); // default groupSize=1
  VarHandle accessor();  // default which=0
  Object make(int length);  // storage len = roundup(groupSize,D$len)*length
}

In a Leyden setting, each ContendedArrayFactory, and the arrays it
makes, should be (re)generated fresh at startup, if there is suspicion
that data cache size could change.  Frankly, I don’t think there will
be such a suspicion.  Leyden tends to “bake in” environmental settings
such as whether oop compression is enabled.

(The real core libs people should weigh in; I’m just brainstorming here.)

On 3 May 2024, at 11:20, Ron Pressler wrote:

>> On 3 May 2024, at 18:33, David Lloyd  wrote:
>>
>>
>> On Fri, May 3, 2024 at 10:12 AM Mark Reinhold  
>> wrote:
>> https://openjdk.org/jeps/471
>>
>>   Summary: Deprecate the memory-access methods in sun.misc.Unsafe for
>>   removal in a future release.
>>
>>
>> We still use Unsafe fairly often in various Red Hat products (primarily 
>> because our baseline support JDK for these products is typically 11 or 17 at 
>> present), in a variety of ways for a variety of reasons. Most of these uses 
>> of Unsafe should be transitionable to `MemorySegment` using multi-release 
>> JARs, and a bit of exploratory work has already been done on this. However 
>> there is one unfortunate exception (that I know of).
>>
>> In order to avoid false sharing in certain specific high-concurrency 
>> situations, I have lately used arrays to space out certain value locations 
>> by using the smallest data cache line size (which is detected via an 
>> existing library) and dividing it by the array scale to determine the length 
>> of array to allocate in order to accommodate these values. I then use 
>> multiples of the cache line size (in bytes), offset from the array base, to 
>> locate the elements to access.
>>
>> It is possible to continue this more or less as-is for primitive types (at 
>> least, it is if one assumes certain facts around primitive data type size 
>> and alignment to be true), but for objects, without knowing their size, we 
>> can't know how much padding to reserve around the value location to ensure 
>> that the contended values are not falsely shared.
>>
>> I seem to recall (years ago now so I might be a bit fuzzy on it) that the 
>> lack of public API around `@Contended` was mildly controversial in the past. 
>> The proposed remedy was to use arrays for this purpose, if I recall 
>> correctly. However there does not seem to be any good way to do this anymore 
>> (at least for objects) without simply guessing, 

Re: Vector (and integer) API: unsigned min/max

2024-04-18 Thread John Rose
On Apr 17, 2024, at 7:14 AM, David Lloyd  wrote:
> 
> 2. Add .MIN_VALUE, min/max with a value or vector also offset by 
> .MIN_VALUE, and then subtract the offset again

I think that’s the path of least resistance. It’s just a vxor on each operand, 
with a constant mask. That can be done in Java code. CPU’s that implement 
native unsigned cmp can peephole optimize. 
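XOR with MIN_VALUE flips the sign bit, which in two's complement is the
same as adding the offset; a scalar sketch for int:

```java
public class UnsignedMinSketch {
    // Flipping the sign bit maps unsigned order onto signed order, so one
    // XOR per operand plus a signed min implements the unsigned version.
    static int minUnsigned(int a, int b) {
        int bias = Integer.MIN_VALUE;
        return Math.min(a ^ bias, b ^ bias) ^ bias;
    }

    public static void main(String[] args) {
        // 0xFFFFFFFF (-1) is the unsigned max, so the unsigned min is 1
        System.out.println(minUnsigned(1, -1));
        // unsigned min of 0xFFFFFFFE and 0xFFFFFFFF, printed unsigned
        System.out.println(Integer.toUnsignedLong(minUnsigned(-2, -1)));
    }
}
```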

Re: RFR: 8315454: Add an immutable BitSet

2023-09-02 Thread John Rose
On 1 Sep 2023, at 2:38, Maurizio Cimadamore wrote:

> On Fri, 1 Sep 2023 09:07:34 GMT, Per Minborg  wrote:
>
>>> Maybe it would make sense to start with a JDK-internal variant before 
>>> exploring potential public API?
>>
>> This is certainly one alternative.
>
> When looking at this I thought about that too. It seems like the particular 
> case where this optimization is applicable is very specific, and I'm on the 
> fence as to whether this deserves a public API. At the same time, there might 
> be near-misses that could be difficult to address once the API is set in 
> stone. I'd suggest to let it bake (inside the JDK) for some time, and later 
> on decide whether it should be exposed as a public API.
>
> All that said, the idea of exposing a fast predicate over an immutable 
> snapshot is very cute and clever :-)

There is a related API I call Classifier which takes a List<T> l and returns an 
immutable ToIntFunction<T>, where the function works like x->l.indexOf(x).  The 
reason it might deserve (someday) to be its own API is that the factory for a 
classifier might be allowed to do significant work to create an optimized 
instruction sequence for classifying its inputs.  The literature of perfect 
hash functions would apply in many cases, and the classifier spun up for some 
particular list might operate in O(1) time.  The reason I started thinking 
about such things was considering the optimization of switches (say, over 
strings or other values) and the optimization of RE character classes.  Perfect 
hashing applies in both cases, but the heuristic algorithms should be hidden 
behind an API, and Classifier is what came up for me as the centerpiece.  (That 
plus classifier factories on various kinds of lists and arrays and other 
sequences or sets.)
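A naive sketch of such a factory (my own names; the interesting part
would be an optimized, e.g. perfect-hash, implementation hidden behind
the same API):

```java
import java.util.List;
import java.util.function.ToIntFunction;

public class ClassifierSketch {
    // The factory is where significant setup work could go (perfect
    // hashing, spun-up instruction sequences); this version just snapshots.
    static <T> ToIntFunction<T> classifier(List<T> list) {
        List<T> snapshot = List.copyOf(list);  // immutable snapshot
        return x -> snapshot.indexOf(x);       // behaves like x -> l.indexOf(x)
    }

    public static void main(String[] args) {
        ToIntFunction<String> c =
            classifier(List.of("case1", "case2", "case3"));
        System.out.println(c.applyAsInt("case2"));  // 1
        System.out.println(c.applyAsInt("other"));  // -1: not classified
    }
}
```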

— John


Re: Questions about using `assert` in Java

2023-07-17 Thread John Rose

On 17 Jul 2023, at 2:08, Alan Bateman wrote:


> On 15/07/2023 17:53, Daohan Qu wrote:
>
> You will find places in the JDK code, esp. in performance critical
> code, where assertions are commented out. The reason is that asserts,
> even if disabled, increase the method size and can impact inlining by
> the compiler at run-time.  So while useful when debugging some issue
> in such code, they are commented out to avoid increasing the method
> size.


In JDK code that must be hand-tuned for performance, small methods are 
sometimes broken into 2-3 even smaller methods.  This is rarely done, 
because it is a compromise against maintainability.  It can provide a 
way to introduce asserts even into performance sensitive methods, if 
there is some particular need.


The basic idea is to have the assert expression be a call to a private 
static helper method, so its code size is minimal.  Then, if necessary, 
have the actual computation (before or after the assert) moved into a 
second helper method.


Here’s an example.  (It doesn’t do anything interesting, but it has 
a few mildly complicated expressions.)


```
class AssertExample {
public static double circleQueryFast(double x, double y) {
assert circleQueryChecks(x, y);
return circleQueryCompute(x, y);
}
private static boolean circleQueryChecks(double x, double y) {
return x*x + y*y < 1.0;
}
private static double circleQueryCompute(double x, double y) {
double x2 = x*x, y2 = y*y;
return x2 < y2 ? y - x : x - y;
}
public static double circleQuerySlow(double x, double y) {
double x2 = x*x, y2 = y*y;
assert x2 + y2 < 1.0;
return x2 < y2 ? y - x : x - y;
}
}
```

In this example, the “fast” public method will often inline and 
optimize better than the “slow” public method.  This is because the 
all-in-one “slow” method is rather large, and the other three, 
although they collectively are even larger, are each simple enough to 
get a very favorable rating from the inlining policy.


Specifically, if a method has 35 bytecodes or less it gets favorable 
inlining decision.  (That is, `MaxInlineSize=35`.)  All of the above 
methods except the “slow” one are within this limit.  (Try `javac 
AssertExample.java  && javap -c -p AssertExample.class`.)


That might seem an outrageous oversimplification, but if you think about 
it, the inlining policy of an online JIT needs to be relatively simple, 
so it can make its decisions very quickly.  Getting a more balanced 
decision here would require something like a backtracking inlining 
policy, or a search over potential call graphs.  That is certainly 
possible, but it was not the practical decision 25 years ago when this 
inlining policy was designed.


Should we fix it?  Well, yes, and there are RFEs to work on for this:

https://bugs.openjdk.org/browse/JDK-6445664

But it’s not just a matter of filing RFEs or willpower or resources; 
there are risks due to the scale of Java code already deployed on 
HotSpot.  Anything we do now, after years of people tuning their Java 
code against this (implicit) contract, had better not make the existing 
code run much slower.  Or maybe there could be some user-friendly path 
to detection and remediation, perhaps using diagnostic flags.


For asserts, I think it would be relatively safe and efficient to remove 
the effects of untaken blocks (including assert logic) from the metric 
that MaxInlineSize is compared against.


That is, since an assert compiles to an if/then construct, where the 
if-condition is a constant which is usually false (only true under `java 
-ea`), it would be reasonable to note somehow (at inline policy time) 
that the relevant if/then-block will never be taken.  This analysis 
could be gated by profile information, at least in part.  The JIT does 
this sort of thing anyway, but it is not connected fully to the inlining 
policy, at present.


But this will require some research.  It could perturb even existing 
programs, with a certain number of them having a slight loss of 
performance.  The C2 inlining policy has become almost fossilized by its 
own success.  It’s easy to show cases where it gives the wrong answer, 
but hard to show an alternative that would do no harm to the millions of 
workloads out there.


— John

Re: Exposing checked exceptions that a MethodHandle can throw

2023-05-22 Thread John Rose

Thanks for the good response Remi.
This is part of a larger FAQ, “why are MHs hard to use?”
Part of the answer is “they model bytecode behavior”.
(Perhaps they should have been called BytecodeBehaviorHandles.)

Lifting such raw VM-level behaviors up into Java source code is 
necessary, but it will never be pleasant, at least not until Java has 
enough capabilities in its syntax and type system to model such beasties 
without extra layers of object wrapping.


That would include fully incorporating exception checking into the 
generic type system, a hard problem.  It would also involve making some 
way to name a Java method (perhaps as `Foo::bar` or `myFoo::bar` or some 
lambda) but get a MH out of the expression.  Also some kinds of varargs 
processing might be needed to “add suger” to varargs-related MH 
transforms.


The amount of Java language engineering necessary for such things is so 
large it will never be done, if MHs are the only use case.  There are 
far too many important improvements to make.  (Thought experiment:  
Should we drop some part of the pattern matching work in order to make 
room for method handle exception checks?  I thought not.)


Perhaps in the future there will come a time when exception checking is 
tracked and/or `Foo::bar` syntax is made more generic, and that will 
benefit MHs, but if it happens it will be for a list of weighty reasons, 
apart from MHs.


For now, MH code has to be written in a low-level style in Java source 
code.  (But it works beautifully at the assembly code level, if you are 
spinning bytecodes.)  For example, Java source methods which exist to 
process MHs should just declare `throws Throwable`.  Then you catch the 
non-existent throwables at the boundary.


You can make this a little easier on yourself, sometimes, if you write a 
higher-order helper function that takes a Throwable-throwing lambda and 
sanitizes it down to the exceptions you expect.  Then there’s just one 
clunky `catch`, and your low-level Throwable-throwing code goes inside 
lambda bodies.
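As a sketch of that pattern (the names `ThrowingSupplier` and `sanitized` are made up for the illustration, not JDK API):

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class MhBoundary {
    // A lambda shape that is allowed to throw anything, so raw MH
    // invocations can live inside its body.
    @FunctionalInterface
    interface ThrowingSupplier<T> {
        T get() throws Throwable;
    }

    // The one clunky catch lives here; callers see only unchecked exceptions.
    static <T> T sanitized(ThrowingSupplier<T> body) {
        try {
            return body.get();
        } catch (RuntimeException | Error e) {
            throw e;  // expected unchecked cases pass through
        } catch (Throwable t) {
            // the "non-existent throwables" are caught at the boundary
            throw new AssertionError("unexpected checked throwable", t);
        }
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        MethodHandle concat = MethodHandles.lookup().findVirtual(
                String.class, "concat",
                MethodType.methodType(String.class, String.class));
        // No try-catch at the use site; the MH call sits inside the lambda.
        String s = sanitized(() -> (String) concat.invokeExact("foo", "bar"));
        System.out.println(s);  // foobar
    }
}
```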


On 21 May 2023, at 6:47, Remi Forax wrote:

> - Original Message -
> From: "-" 
> To: "core-libs-dev" 
> Sent: Sunday, May 21, 2023 6:52:44 AM
> Subject: Exposing checked exceptions that a MethodHandle can throw
>
>> Hello,
>> I am eliciting a discussion on the feasibility of tracking checked
>> exceptions thrown by a MethodHandle. It is already requested in
>> https://bugs.openjdk.org/browse/JDK-8268116 as it appears useful in
>> the development of the Foreign Function Interface API.
>
> At the bytecode level, checked exceptions are stored in an attribute
> associated with a method.
>
> https://docs.oracle.com/javase/specs/jvms/se20/html/jvms-4.html#jvms-4.7.5
>
> If you have a direct MethodHandle, you can already get the checked
> exceptions using Lookup.revealDirect() + MethodHandleInfo.reflectAs().
>
>> Currently, explicit MethodHandle usages are hampered by the catch
>> block as the invoke methods are declared to throw any Throwable. Could
>> it be possible that we specify the types of possible exceptions at
>> MethodHandle invocation, so that:
>> 1. Javac accepts code that just calls the handle without an ugly
>> try-catch block
>> 2. If the exceptions anticipated at the invocation site are incompatible
>> with (i.e. more specific than) those declared by the invoked handle,
>> it can throw an exception like the existing `WrongMethodTypeException`
>> eagerly.
>
> The bug you reference seems to be about runtime information, but the
> paragraph above is about type-checking information.
> The question here is "is the Java type system good enough to track
> checked exceptions in a backward compatible way?"
> Practically, I believe the answer is no: you cannot compose functions
> with different type parameters representing exceptions, there is no
> varargs of type parameters, there is no default type argument for a
> type parameter, etc.
>
> It's also a significant departure from the way the method handle API
> was created; the idea behind it is that each combiner has the semantics
> of an existing bytecode. But all invoke* bytecodes are oblivious to
> exceptions. Said differently, the method handle API represents how the
> JVM works, not how Java the language works.
>
>> Is such a plan feasible? Tracking of exceptions should be easy from
>> direct MHs and the MethodHandles combinators already, while the
>> invocation semantics update might be more complex, maybe in the form
>> of compiler-inserted stubs before the actual invoke calls. I wish such
>> improvements can make MethodHandle more friendly to direct usages in
>> code, so users don't need to wrap MH invocations in try-catch blocks
>> everywhere.
>
> See above.
>
>> Chen Liang
>
> Rémi

Re: RFR: 8292914: Drop the counter from lambda class names

2023-02-22 Thread John Rose

On 15 Feb 2023, at 10:34, Mandy Chung wrote:

> On Wed, 15 Feb 2023 17:32:38 GMT, David M. Lloyd  wrote:
>
>> The class generated for lambda proxies is now defined as a hidden
>> class. This means that the counter, which was used to ensure a unique
>> class name and avoid clashes, is now redundant. In addition to
>> performing redundant work, this also impacts build reproducibility
>> for native image generators, which might already have a strategy to
>> cope with hidden classes but cannot cope with indeterminate
>> definition order for lambda proxy classes.
>>
>> This solves JDK-8292914 by making lambda proxy names always be stable
>> without any configuration needed. This would also replace #10024.
>
> We want the generated classes to be dumped for debugging before the
> hidden class is defined, e.g. to troubleshoot class loading issues (see
> the `-Djdk.internal.lambda.dumpProxyClasses=` system property).




I agree with David that the extra counter is a problem.  Further, I 
think it will continue to be a problem in any likely Leyden use cases, 
precisely because it blocks reproducibility and predictability.  So, 
although Brian is right that there are likely to be more problems like 
it in the future, and we will surely want a comprehensive solution, I 
think removing the counter is legitimate incremental progress, not 
likely to be contradicted by future developments.


While I’m agreeing with everybody here I’ll also agree with Mandy:
We need a simple debugging story, and the global counter makes it
simple.  But surely there are other things we could do as well, including
using the global counter to name the dump file but *not* to name the
class itself.  That is, don’t pollute the HC classfile bytes with the
counter “noise” but by all means use it to sequence debugging
activities.


Maybe even better, we could use the “decorated” class name assigned 
by the JVM, *after* the HC is loaded, to form the dump file name.  One 
easy way to do this is rename the dump file after the JVM loads the HC 
and picks the decoration.


(Note that the decoration is just the I-hash of the class mirror of the 
HC, and *also* has the good property that it does *not* pollute the 
classfile bytes.  It’s OK that each fresh invocation of the JVM is 
likely to choose a different decoration, as long as we don’t try to 
predict it statically.  This means backtraces cannot be fully predicted 
statically; tough luck there.)


Another way to handle it is ask the JVM to do the file dumping.  This 
might seem excessively indirect, given the simple thing we do now in the 
Java code, but… we probably want to be able to ask the JVM (for Leyden 
training runs) to report all class files loaded (especially those 
dynamically generated!) so we can analyze them and do good stuff with 
them.


This implies that, whether or not we dump HC files from Java code, the 
*JVM also has to dump them* somewhere or other.


Where this takes me is:  The responsibility for dumping HC contents 
should be moved from Java code to the JVM.


I still think removing the counter is a good step in isolation, but (as 
Brian says) the root issue is that we want good reporting of HC contents 
for more than just ad hoc debugging.  So we need to think more broadly 
about how to preserve and report HC classfile contents, for Leyden as 
well as ad hoc debugging.


Re: Math.clamp method?

2023-01-28 Thread John Rose

> On Jan 27, 2023, at 2:46 AM, Tagir Valeev  wrote:
> 
> …
>> 
>> 
>> public static int clamp(long value, int min, int max)
>> public static long clamp(long value, long min, long max)
>> public static double clamp(double value, double min, double max)
>> public static float clamp(double value, float min, float max)
> 
> I really like the idea of int clamp(long value, int min, int max), as
> this makes the method much more useful. On the other hand, I have
> doubts about
> float clamp(double value, float min, float max), as it could be called
> accidentally and we will have accidental precision loss. E.g., imagine
> the following code:

That was a typo. Oops, sorry. All I am trying to suggest is to widen both the 
first argument and the return value to long or double. 
> 
> // bounds are declared as floats for some reason
> // this usually doesn't matter, as these numbers are exact in float
> and in double
> float lower = 0.0f;
> float upper = 10.0f;
> 
> double v = 1/3.;
> System.out.println(v);
> v = Math.clamp(v, lower, upper); // oops!
> System.out.println(v);
> 
> Having float clamp(double, float, float), this method will be linked
> here, resulting in accidental precision loss. The output is:
> 0.3333333333333333
> 0.3333333432674408
> 
> Given that float type is rarely used, I'd stick to the following signatures:
> public static int clamp(long value, int min, int max)
> public static long clamp(long value, long min, long max)
> public static double clamp(double value, double min, double max)
> public static float clamp(float value, float min, float max)
> 
>> I thought about cross-domain clamps, which preserve 64 bit
>> precision and double range at the same time in different places:
>> 
>> public static long clamp(long value, double min, double max)
>> public static long clamp(double value, long min, long max)
> 
> This is even more dangerous, as simply calling
> double value = ...
> value = clamp(value, 0, 1); // not 0.0 and 1.0
> you will get 0 instead of the original value that fits the range.
> I imagine that many people will step on this. If these methods are desired,
> it would be better to name them differently, to avoid overload ambiguity.
> For now, I would avoid this.

Of course. Widen the return if you widen the first argument. Then it works out. 
> 
>> The subexpression max-min never overflows, when it
>> is regarded as an unsigned number, and the “>>>1”
>> immediately converts that unsigned number into the
>> correctly signed positive number. That’s the only
>> way to avoid overflow for all inputs. The third
>> idiom above (the “smartest” of the three) will fail,
>> for example, if you feed it negative values.
>> For some uses that is OK, but for others it isn’t.
>> 
>> public static int midpoint(int min, int max)
>> public static long midpoint(long min, long max)
>> public static double midpoint(double min, double max)
>> public static float midpoint(float min, float max)
> 
> midpoint is another great idea, I like it! Could also be considered 
> separately.
> 
> With best regards,
> Tagir Valeev.


Re: Math.clamp method?

2023-01-25 Thread John Rose
P.S. Sorry, I botched a use case, saying x where I meant y:


```
   long x = …, y = a*x*x + b*x + c;
   double xf = x, yf = a*xf*xf + b*xf + c;
   double yc = clamp(yf, Long.MIN_VALUE, Long.MAX_VALUE);
   boolean saturated = false;
   if (yc != yf) {   // error processing
  saturated = true;  // or throw an exception
  y = yc;  // saturate towards the required magnitude
   }
```




Re: Math.clamp method?

2023-01-25 Thread John Rose
Dealing with numeric ranges is error-prone, so I welcome these.

Another operation on ranges is membership.  This is easy to derive
from the clamp methods, I think:

```
   checkRange(x,min,max) := clamp(x,min,max) == x
```

That idiom works even for NaNs, which are never in any range.
I wonder about having a separate method for this. The idiom does
fail for empty ranges, so it’s almost worth making a method that
returns false with empty ranges instead of throwing.  Probably
not, though.
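As a sketch, with a local `clamp` standing in for the proposed method, the membership idiom and its NaN behavior look like this (the names here are illustrative, not the proposed API):

```java
public class RangeCheck {
    // Local stand-in for the proposed Math.clamp(double, double, double).
    static double clamp(double value, double min, double max) {
        return Math.min(max, Math.max(value, min));
    }

    // Membership derived from clamp: x is in [min, max] iff clamping is a no-op.
    static boolean checkRange(double x, double min, double max) {
        return clamp(x, min, max) == x;
    }

    public static void main(String[] args) {
        System.out.println(checkRange(5.0, 0.0, 10.0));        // true
        System.out.println(checkRange(-1.0, 0.0, 10.0));       // false
        // NaN propagates through clamp, and NaN != NaN, so NaN is never in range.
        System.out.println(checkRange(Double.NaN, 0.0, 10.0)); // false
    }
}
```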

But I think there is no harm and real benefit from widening
the value parameters, so value is always either long or double.
That will save casts (which also cause bugs) for some use
cases, such as doing 32x32-to-64-bit arithmetic followed
by overflow checks and/or saturation.  That’s a common use
case for clamping.

```
public static int clamp(long value, int min, int max)
public static long clamp(long value, long min, long max)
public static double clamp(double value, double min, double max)
public static float clamp(double value, float min, float max)
```

I wonder if there is a use case for these methods (either clamp
or the range-membership method), with 64-bit overflow detection
for complex expressions.  The idea is that you compute the
expression in two domains, double and long.  The double domain
correctly tracks magnitude (but not precision) while long
tracks 64 bits of precision.  At the end, if the double value
is in the range representable by longs, you take the long
value, with full confidence.

```
   long x = …, y = a*x*x + b*x + c;
   double xf = x, yf = a*xf*xf + b*xf + c;
   double xc = clamp(xf, Long.MIN_VALUE, Long.MAX_VALUE);
   boolean saturated = false;
   if (xc != xf) {   // error processing
  saturated = true;  // or throw an exception
  x = xc;  // saturate towards the required magnitude
   }
```

That use case doesn’t require a membership operation per se,
just clamp.
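A runnable version of that dual-domain idiom, saturating instead of throwing (the helper name `polyOrSaturate` is made up for the illustration):

```java
public class DualDomain {
    // Evaluate a*x*x + b*x + c in two domains: long tracks 64 bits of
    // precision, double tracks magnitude. If the double result survives
    // clamping to the long range, the long result did not overflow.
    static long polyOrSaturate(long a, long b, long c, long x) {
        long y = a * x * x + b * x + c;
        double xf = x, yf = a * xf * xf + b * xf + c;
        double yc = Math.min(Math.max(yf, (double) Long.MIN_VALUE),
                             (double) Long.MAX_VALUE);
        if (yc != yf) {
            return (long) yc;  // overflow: saturate toward the required magnitude
        }
        return y;  // magnitude in range, so the 64-bit result is trustworthy
    }

    public static void main(String[] args) {
        System.out.println(polyOrSaturate(1, 0, 0, 3));        // 9
        // x*x wraps to 0 in long arithmetic, but the double domain
        // catches the overflow and we saturate:
        System.out.println(polyOrSaturate(1, 0, 0, 1L << 32)); // Long.MAX_VALUE
    }
}
```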

I thought about cross-domain clamps, which preserve 64 bit
precision and double range at the same time in different places:

```
public static long clamp(long value, double min, double max)
public static long clamp(double value, long min, long max)
```

Those are perhaps overly subtle.  I think it would be OK for
nearly all use cases to cast the min/max values to the appropriate
type (same as value) and then call one of the clamp functions
originally proposed.  Note that (long)(double)Long.MAX_VALUE
is the same as Long.MAX_VALUE, and the same is true for MIN_VALUE.
So these critical parameters would be processed as the same
limits by both the regular and cross-typed overloading.

Now for another topic.

There is another common range operation which is very often the
source of bugs, and I’d like us to consider it for this group.

That is the midpoint operation, naively coded as (min+max)/2,
and more subtly as min+(max-min)/2 and even more subtly what
our friend Mr. Stackoverflow recommends, (min+max)>>>1.
All three of these idioms hide bugs, for some values of
max or min.

The fully correct idiom (for 2s complement numbers) is:

```
  midpoint(min, max) := min + ((max - min) >>> 1)
```

The subexpression max-min never overflows, when it
is regarded as an unsigned number, and the “>>>1”
immediately converts that unsigned number into the
correctly signed positive number.  That’s the only
way to avoid overflow for all inputs.  The third
idiom above (the “smartest” of the three) will fail,
for example, if you feed it negative values.
For some uses that is OK, but for others it isn’t.

```
public static int midpoint(int min, int max)
public static long midpoint(long min, long max)
public static double midpoint(double min, double max)
public static float midpoint(float min, float max)
```
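A small demonstration of why the naive idioms fail where the proposed one does not (the class name is just for the illustration):

```java
public class MidpointDemo {
    // The fully correct 2s-complement idiom: (max - min) never overflows
    // when viewed as unsigned, and >>> 1 halves it back into signed range.
    static int midpoint(int min, int max) {
        return min + ((max - min) >>> 1);
    }

    public static void main(String[] args) {
        int lo = Integer.MAX_VALUE - 10, hi = Integer.MAX_VALUE;
        System.out.println((lo + hi) / 2);    // -6: the naive sum overflowed
        System.out.println(midpoint(lo, hi)); // 2147483642, as expected
        // The "smart" (min + max) >>> 1 also fails on negative inputs:
        System.out.println((-2 + -2) >>> 1);  // 2147483646, not -2
        System.out.println(midpoint(-2, -2)); // -2
    }
}
```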

All kinds of divide and conquer algorithms need a robust
midpoint operation.  And many of those are coded wrongly,
because affine ranges are hard to reason about.

The int and long midpoints are more useful than the floating
point ones, because ints and longs are used as indexes into
arrays (including Panama native arrays), where subdivision
and binary search is common (and very often buggy).

More generally, this midpoint operation performs an affine
weighted sum within the given numeric range, so one might
think that a floating point “alpha” parameter would be
interesting.  The midpoint operations hardwire alpha=0.5,
and they use clever fast fixpoint division to divide by 2.

There are applications for using alpha other than 0.5,
such as a 10x subdivision of a range using alpha=0.1, 0.2,
and so on.  I don’t recommend an “affine midpoint” method,
because I think the clamp methods combined with floating
point math can get the right answers.  For example, given
a 64-bit range, you can use double arithmetic to compute
affine inner points and then clamp them, like this:

```
final double ALPHA = 0.1;  //float would be OK too
final long min = …, max = …;
final double incr = ((double)max - min) * ALPHA;
for (int i = 1; i < 10; i++)
  addPoint(clamp(
```

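A complete, runnable sketch of that subdivision loop, assuming `addPoint` simply consumes each point and using a cross-domain clamp as described earlier (both assumptions, since the original fragment stops mid-call):

```java
public class Subdivide {
    // Cross-domain clamp sketch: clamp a double into a long range.
    static long clamp(double value, long min, long max) {
        return (long) Math.min(Math.max(value, (double) min), (double) max);
    }

    // Stand-in for whatever consumes the subdivision points.
    static void addPoint(long p) {
        System.out.println(p);
    }

    public static void main(String[] args) {
        // 10x subdivision of a wide long range via double arithmetic,
        // with each inner point clamped back into the range.
        final double ALPHA = 0.1;
        final long min = Long.MIN_VALUE / 2, max = Long.MAX_VALUE / 2;
        final double incr = ((double) max - min) * ALPHA;
        for (int i = 1; i < 10; i++)
            addPoint(clamp((double) min + i * incr, min, max));
    }
}
```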
Re: RFR of JDK-8285932 Implementation of JEP-430 String Templates (Preview)

2022-11-16 Thread John Rose

On 16 Nov 2022, at 11:10, Alex Buckley wrote:


…
For example, the following code contains a template expression that 
uses the template processor `RAW`, which simply yields the 
`StringTemplate` passed to it:


int x = 10;
int y = 20;
StringTemplate st = RAW."\{x} + \{y} = \{x + y}";
List<String> fragments = st.fragments();
List<Object> values = st.values();

`fragments` will be equivalent to `List.of(" + ", " = ")` and `values` 
will be the equivalent of `List.of(10, 20, 30)`.

…
To preserve the semantics of string templates and text block 
templates, the list returned by `fragments()` must be one element 
larger than the list returned by `values()`.


And yet, in the example given above, the list of fragments `List.of(" + 
", " = ")` is one element *smaller* than the list of values `List.of(10, 
20, 30)`.  The example is wrong.  It’s worth a note in the doc that if 
an interpolated expression begins and/or ends the template, there will 
be a zero length fragment at the beginning or end of the fragments list.
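For the example template, the fragments would then be `List.of("", " + ", " = ", "")`. The interleaving invariant can be sketched in plain Java, without the preview StringTemplate API (the names here are illustrative):

```java
import java.util.List;

public class Interleave {
    // The invariant: fragments.size() == values.size() + 1, with possibly
    // zero-length fragments at either end of the template.
    static String interpolate(List<String> fragments, List<Object> values) {
        if (fragments.size() != values.size() + 1)
            throw new IllegalArgumentException("need one more fragment than values");
        StringBuilder sb = new StringBuilder(fragments.get(0));
        for (int i = 0; i < values.size(); i++)
            sb.append(values.get(i)).append(fragments.get(i + 1));
        return sb.toString();
    }

    public static void main(String[] args) {
        int x = 10, y = 20;
        List<String> fragments = List.of("", " + ", " = ", "");
        List<Object> values = List.of(x, y, x + y);
        System.out.println(interpolate(fragments, values)); // 10 + 20 = 30
    }
}
```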

Re: Collections.shuffle to accept RandomGenerator

2022-10-05 Thread John Rose
On 3 Oct 2022, at 14:32, Stuart Marks wrote:

> …
> The Arrays class does need some attention and probably should be considered 
> separately. It's lacking some other things too, like reverse(). One issue 
> with modifying the Arrays class is providing overloads for some or all of the 
> primitives. We've kind of held off because adding primitive overloads adds to 
> the clutter of an already-cluttered class. There are some functions that 
> support only "common" primitives int/long/double (see Arrays::setAll) which 
> reduces the clutter; but I've missed overloads for certain arrays like byte[].
>
> (I'm not saying not to do this; I'm saying this is more than dropping in a 
> single Arrays::shuffle method.)

FTR, since we are talking about shuffles and other permutations, I’d like to 
bring up a pet RFE for Arrays:


Arrays.sort should provide optional external comparator for int and long arrays

And to round out a story that’s been bubbling in my mind for a while, here is 
another pet RFE to keep it company:


Arrays.sort should be accompanied by permutation vector operations
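Lacking those RFEs, today's workaround is to sort a boxed permutation vector with an external comparator and then apply it to the data, which incurs exactly the boxing overhead the RFEs would avoid (a sketch; the method names are made up):

```java
import java.util.Arrays;
import java.util.Comparator;

public class IndirectSort {
    // Sort indices 0..n-1 by the values they point at; the result is a
    // permutation vector that sorts `data` without moving it.
    static int[] sortedPermutation(int[] data) {
        Integer[] perm = new Integer[data.length];
        for (int i = 0; i < perm.length; i++) perm[i] = i;
        Arrays.sort(perm, Comparator.comparingInt(i -> data[i]));  // external comparator
        int[] out = new int[perm.length];
        for (int i = 0; i < perm.length; i++) out[i] = perm[i];    // unbox
        return out;
    }

    // One of the "permutation vector operations": apply perm to data.
    static int[] apply(int[] perm, int[] data) {
        int[] out = new int[perm.length];
        for (int i = 0; i < perm.length; i++) out[i] = data[perm[i]];
        return out;
    }

    public static void main(String[] args) {
        int[] data = {30, 10, 20};
        int[] perm = sortedPermutation(data);
        System.out.println(Arrays.toString(perm));              // [1, 2, 0]
        System.out.println(Arrays.toString(apply(perm, data))); // [10, 20, 30]
    }
}
```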