Re: PR78888 - add value range info for tolower/toupper

2017-08-03 Thread Jakub Jelinek
On Thu, Aug 03, 2017 at 02:38:35PM +, Joseph Myers wrote:
> On Thu, 3 Aug 2017, Jakub Jelinek wrote:
> 
> > In any case, you should probably investigate all the locales say in glibc or
> > some other big locale repository whether tolower/toupper have the expected
> > properties there.
> 
> They don't.  In tr_TR.UTF-8, toupper ('i') == 'i', because 'İ', the 
> correct uppercase version (as returned in locale tr_TR.ISO-8859-9), is a 
> multibyte character and toupper can only return single-byte characters.

Indeed,
#include 
#include 

int
main ()
{
  setlocale (LC_ALL, "");
  int i;
  for (i = -1000; i < 1000; i++)
if (tolower (i) >= 'A' && tolower (i) <= 'Z')
  __builtin_abort ();
else if (toupper (i) >= 'a' && toupper (i) <= 'z')
  __builtin_abort ();
  return 0;
}
fails for LC_ALL=tr_TR.UTF-8, because tolower ('I') is 'I'.
Not to mention that the result is unspecified if the functions are called
with a value outside of the range of unsigned char or EOF.
Therefore, this optimization is invalid.

Jakub


Re: PR78888 - add value range info for tolower/toupper

2017-08-03 Thread Joseph Myers
On Thu, 3 Aug 2017, Jakub Jelinek wrote:

> In any case, you should probably investigate all the locales say in glibc or
> some other big locale repository whether tolower/toupper have the expected
> properties there.

They don't.  In tr_TR.UTF-8, toupper ('i') == 'i', because 'İ', the 
correct uppercase version (as returned in locale tr_TR.ISO-8859-9), is a 
multibyte character and toupper can only return single-byte characters.

-- 
Joseph S. Myers
jos...@codesourcery.com

Re: PR78888 - add value range info for tolower/toupper

2017-08-03 Thread Jakub Jelinek
On Thu, Aug 03, 2017 at 04:13:38PM +0530, Prathamesh Kulkarni wrote:
> > You are hardcoding here host characters and using it for target.
> > I think you need to use
> > lang_hooks.to_target_charset
> > (really no idea how it works or doesn't in LTO, but gimple-fold.c is already
> > using it among others).
> > Also, you're assuming that the 'a'-'z' and 'A'-'Z' ranges are without gaps,
> > which isn't the case for e.g. EBCDIC.  So I think you'd need to verify
> > (once?) that the target charset has this property.
> Hi Jakub,
> Thanks for the suggestions, I wasn't aware about lang_hooks.to_target_charset.
> Would following be a correct check that target has ascii charset and only then
> set range to ~[a, z] ?
> 
> target_a = lang_hooks.to_target_charset ('a');
> target_z = lang_hooks.to_target_charset('z');
> if (target_a == 'a' && target_z == 'z')
>   {
>  // set vr to ~['a', 'z']
>   }

No.  E.g. if host == target charset is EBCDIC, then the condition is true,
but it isn't a consecutive range.
A rough check that would maybe work (at least would be false for EBCDIC
and true for ASCII) could be something like:
if (target_z > target_a && target_z - target_a == 25)
(which is not exact, some strange charset could have say '0'-'9' block in
the middle of 'a'-'z' block, and say 'h'-'r' outside of the range), but
perhaps good enough for real-world charsets.
In any case, you should probably investigate all the locales say in glibc or
some other big locale repository whether tolower/toupper have the expected
properties there.
Plus of course, the set vr to ~[ ] needs to use target_a/target_z.

Jakub


Re: PR78888 - add value range info for tolower/toupper

2017-08-03 Thread Prathamesh Kulkarni
On 3 August 2017 at 13:21, Jakub Jelinek  wrote:
> On Thu, Aug 03, 2017 at 12:58:06PM +0530, Prathamesh Kulkarni wrote:
>> --- a/gcc/tree-vrp.c
>> +++ b/gcc/tree-vrp.c
>> @@ -3778,6 +3778,19 @@ extract_range_basic (value_range *vr, gimple *stmt)
>>   return;
>> }
>> break;
>> + case CFN_BUILT_IN_TOUPPER:
>> + case CFN_BUILT_IN_TOLOWER:
>> +   if (tree lhs = gimple_call_lhs (stmt))
>> + {
>> +   tree type = TREE_TYPE (lhs);
>> +   unsigned char min = (cfn == CFN_BUILT_IN_TOUPPER) ? 'a' : 'A';
>> +   unsigned char max = (cfn == CFN_BUILT_IN_TOUPPER) ? 'z' : 'Z';
>> +   tree range_min = build_int_cstu (type, min);
>> +   tree range_max = build_int_cstu (type, max);
>> +   set_value_range (vr, VR_ANTI_RANGE, range_min, range_max, NULL);
>> +   return;
>
> You are hardcoding here host characters and using it for target.
> I think you need to use
> lang_hooks.to_target_charset
> (really no idea how it works or doesn't in LTO, but gimple-fold.c is already
> using it among others).
> Also, you're assuming that the 'a'-'z' and 'A'-'Z' ranges are without gaps,
> which isn't the case for e.g. EBCDIC.  So I think you'd need to verify
> (once?) that the target charset has this property.
Hi Jakub,
Thanks for the suggestions, I wasn't aware about lang_hooks.to_target_charset.
Would following be a correct check that target has ascii charset and only then
set range to ~[a, z] ?

target_a = lang_hooks.to_target_charset ('a');
target_z = lang_hooks.to_target_charset('z');
if (target_a == 'a' && target_z == 'z')
  {
 // set vr to ~['a', 'z']
  }

Thanks,
Prathamesh
>
> Jakub


Re: PR78888 - add value range info for tolower/toupper

2017-08-03 Thread Jakub Jelinek
On Thu, Aug 03, 2017 at 12:58:06PM +0530, Prathamesh Kulkarni wrote:
> --- a/gcc/tree-vrp.c
> +++ b/gcc/tree-vrp.c
> @@ -3778,6 +3778,19 @@ extract_range_basic (value_range *vr, gimple *stmt)
>   return;
> }
> break;
> + case CFN_BUILT_IN_TOUPPER:
> + case CFN_BUILT_IN_TOLOWER:
> +   if (tree lhs = gimple_call_lhs (stmt))
> + {
> +   tree type = TREE_TYPE (lhs);
> +   unsigned char min = (cfn == CFN_BUILT_IN_TOUPPER) ? 'a' : 'A';
> +   unsigned char max = (cfn == CFN_BUILT_IN_TOUPPER) ? 'z' : 'Z';
> +   tree range_min = build_int_cstu (type, min);
> +   tree range_max = build_int_cstu (type, max);
> +   set_value_range (vr, VR_ANTI_RANGE, range_min, range_max, NULL);
> +   return;

You are hardcoding here host characters and using it for target.
I think you need to use
lang_hooks.to_target_charset
(really no idea how it works or doesn't in LTO, but gimple-fold.c is already
using it among others).
Also, you're assuming that the 'a'-'z' and 'A'-'Z' ranges are without gaps,
which isn't the case for e.g. EBCDIC.  So I think you'd need to verify
(once?) that the target charset has this property.

Jakub