[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2019-08-25 Thread Wei Mi via Phabricator via cfe-commits
wmi added a comment.

In D36562#1642403 , @chill wrote:

> In D36562#1641930 , @wmi wrote:
>
> > In D36562#1639441 , @chill wrote:
> >
> > > Shouldn't we disable `OPT_ffine_grained_bitfield_accesses` only if TSAN 
> > > is active?
> >
> >
> > I don't remember why it is disabled for all sanitizer modes. It seems you
> > are right that disabling the option is only necessary for TSAN. Do you
> > actually need the option to work in other sanitizer modes?
>
>
> Well, yes and no. We have the option enabled by default, and it causes a
> warning when we use it together with `-fsanitize=memtag` (we aren't really
> concerned with other sanitizers). That warning broke a few builds (e.g. CMake
> doing tests and not wanting to see *any* diagnostics). We can work around
> that in a number of ways, e.g. we can leave the default off for AArch64.
>
> I'd prefer, though, to have an upstream solution, if that's considered
> beneficial for all LLVM users, and this seems like such a case: let anyone
> use the option with sanitizers, unless it's known that some sanitizers'
> utility is affected negatively (as with TSAN).


Thanks for providing the background in detail. I sent out a patch for it: 
https://reviews.llvm.org/D66726
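
To make the proposed gating concrete, here is a minimal standalone sketch of
the decision logic (a model only, with placeholder names; it is not the actual
CompilerInvocation code from D66726):

  #include <iostream>

  // Standalone model of the gating discussed above: honor
  // -ffine-grained-bitfield-accesses for most sanitizers, but ignore it and
  // warn when TSAN is enabled.  Names below are placeholders for illustration.
  struct BitfieldAccessPolicy {
    bool UseFineGrainedAccesses; // effective value of the codegen option
    bool WarnFlagIgnored;        // whether to emit the "flag ignored" warning
  };

  BitfieldAccessPolicy decidePolicy(bool FlagRequested, bool ThreadSanEnabled) {
    if (FlagRequested && ThreadSanEnabled)
      return {false, true}; // keep the old behavior only for TSAN
    return {FlagRequested, false};
  }

  int main() {
    BitfieldAccessPolicy P = decidePolicy(/*FlagRequested=*/true,
                                          /*ThreadSanEnabled=*/false);
    std::cout << P.UseFineGrainedAccesses << ' ' << P.WarnFlagIgnored << '\n'; // 1 0
  }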


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D36562/new/

https://reviews.llvm.org/D36562



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2019-08-23 Thread Momchil Velikov via Phabricator via cfe-commits
chill added a comment.

In D36562#1641930 , @wmi wrote:

> In D36562#1639441 , @chill wrote:
>
> > Shouldn't we disable `OPT_ffine_grained_bitfield_accesses` only if TSAN is 
> > active?
>
>
> I don't remember why it is disabled for all sanitizer modes. It seems you are
> right that disabling the option is only necessary for TSAN. Do you actually
> need the option to work in other sanitizer modes?


Well, yes and no. We have the option enabled by default, and it causes a
warning when we use it together with `-fsanitize=memtag` (we aren't really
concerned with other sanitizers). That warning broke a few builds (e.g. CMake
doing tests and not wanting to see *any* diagnostics). We can work around that
in a number of ways, e.g. we can leave the default off for AArch64.

I'd prefer, though, to have an upstream solution, if that's considered
beneficial for all LLVM users, and this seems like such a case: let anyone use
the option with sanitizers, unless it's known that some sanitizers' utility is
affected negatively (as with TSAN).


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D36562/new/

https://reviews.llvm.org/D36562





[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2019-08-22 Thread Wei Mi via Phabricator via cfe-commits
wmi added a comment.

In D36562#1639441 , @chill wrote:

> Shouldn't we disable `OPT_ffine_grained_bitfield_accesses` only if TSAN is 
> active?


I don't remember why it is disabled for all sanitizer modes. It seems you are
right that disabling the option is only necessary for TSAN. Do you actually
need the option to work in other sanitizer modes?


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D36562/new/

https://reviews.llvm.org/D36562





[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2019-08-21 Thread Momchil Velikov via Phabricator via cfe-commits
chill added a comment.
Herald added a project: LLVM.

Shouldn't we disable `OPT_ffine_grained_bitfield_accesses` only if TSAN is 
active?


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D36562/new/

https://reviews.llvm.org/D36562





[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-10-16 Thread Wei Mi via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rL315915: [Bitfield] Add an option to access bitfield in a 
fine-grained manner. (authored by wmi).

Changed prior to commit:
  https://reviews.llvm.org/D36562?vs=118181&id=119170#toc

Repository:
  rL LLVM

https://reviews.llvm.org/D36562

Files:
  cfe/trunk/include/clang/Basic/DiagnosticDriverKinds.td
  cfe/trunk/include/clang/Driver/Options.td
  cfe/trunk/include/clang/Frontend/CodeGenOptions.def
  cfe/trunk/lib/CodeGen/CGRecordLayoutBuilder.cpp
  cfe/trunk/lib/Driver/ToolChains/Clang.cpp
  cfe/trunk/lib/Frontend/CompilerInvocation.cpp
  cfe/trunk/test/CodeGenCXX/finegrain-bitfield-access.cpp

Index: cfe/trunk/include/clang/Driver/Options.td
===
--- cfe/trunk/include/clang/Driver/Options.td
+++ cfe/trunk/include/clang/Driver/Options.td
@@ -1045,6 +1045,13 @@
   Group, Flags<[CC1Option]>,
   HelpText<"Filename defining the whitelist for imbuing the 'never instrument' XRay attribute.">;
 
+def ffine_grained_bitfield_accesses : Flag<["-"],
+  "ffine-grained-bitfield-accesses">, Group, Flags<[CC1Option]>,
+  HelpText<"Use separate accesses for bitfields with legal widths and alignments.">;
+def fno_fine_grained_bitfield_accesses : Flag<["-"],
+  "fno-fine-grained-bitfield-accesses">, Group, Flags<[CC1Option]>,
+  HelpText<"Use large-integer access for consecutive bitfield runs.">;
+
 def flat__namespace : Flag<["-"], "flat_namespace">;
 def flax_vector_conversions : Flag<["-"], "flax-vector-conversions">, Group;
 def flimited_precision_EQ : Joined<["-"], "flimited-precision=">, Group;
Index: cfe/trunk/include/clang/Basic/DiagnosticDriverKinds.td
===
--- cfe/trunk/include/clang/Basic/DiagnosticDriverKinds.td
+++ cfe/trunk/include/clang/Basic/DiagnosticDriverKinds.td
@@ -330,4 +330,8 @@
   "unable to find a Visual Studio installation; "
   "try running Clang from a developer command prompt">,
   InGroup>;
+
+def warn_drv_fine_grained_bitfield_accesses_ignored : Warning<
+  "option '-ffine-grained-bitfield-accesses' cannot be enabled together with a sanitizer; flag ignored">,
+  InGroup;
 }
Index: cfe/trunk/include/clang/Frontend/CodeGenOptions.def
===
--- cfe/trunk/include/clang/Frontend/CodeGenOptions.def
+++ cfe/trunk/include/clang/Frontend/CodeGenOptions.def
@@ -179,6 +179,7 @@
 CODEGENOPT(SanitizeStats , 1, 0) ///< Collect statistics for sanitizers.
 CODEGENOPT(SimplifyLibCalls  , 1, 1) ///< Set when -fbuiltin is enabled.
 CODEGENOPT(SoftFloat , 1, 0) ///< -soft-float.
+CODEGENOPT(FineGrainedBitfieldAccesses, 1, 0) ///< Enable fine-grained bitfield accesses.
 CODEGENOPT(StrictEnums   , 1, 0) ///< Optimize based on strict enum definition.
 CODEGENOPT(StrictVTablePointers, 1, 0) ///< Optimize based on the strict vtable pointers
 CODEGENOPT(TimePasses, 1, 0) ///< Set when -ftime-report is enabled.
Index: cfe/trunk/test/CodeGenCXX/finegrain-bitfield-access.cpp
===
--- cfe/trunk/test/CodeGenCXX/finegrain-bitfield-access.cpp
+++ cfe/trunk/test/CodeGenCXX/finegrain-bitfield-access.cpp
@@ -0,0 +1,162 @@
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -ffine-grained-bitfield-accesses \
+// RUN:   -emit-llvm -o - %s | FileCheck %s
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -ffine-grained-bitfield-accesses \
+// RUN:   -emit-llvm -fsanitize=address -o - %s | FileCheck %s --check-prefix=SANITIZE
+// Check -fsplit-bitfields will be ignored since sanitizer is enabled.
+
+struct S1 {
+  unsigned f1:2;
+  unsigned f2:6;
+  unsigned f3:8;
+  unsigned f4:4;
+  unsigned f5:8;
+};
+
+S1 a1;
+unsigned read8_1() {
+  // CHECK-LABEL: @_Z7read8_1v
+  // CHECK: %bf.load = load i8, i8* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 1), align 1
+  // CHECK-NEXT: %bf.cast = zext i8 %bf.load to i32
+  // CHECK-NEXT: ret i32 %bf.cast
+  // SANITIZE-LABEL: @_Z7read8_1v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE: %bf.lshr = lshr i32 %bf.load, 8
+  // SANITIZE: %bf.clear = and i32 %bf.lshr, 255
+  // SANITIZE: ret i32 %bf.clear
+  return a1.f3;
+}
+void write8_1() {
+  // CHECK-LABEL: @_Z8write8_1v
+  // CHECK: store i8 3, i8* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 1), align 1
+  // CHECK-NEXT: ret void
+  // SANITIZE-LABEL: @_Z8write8_1v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: %bf.clear = and i32 %bf.load, -65281
+  // SANITIZE-NEXT: %bf.set = or i32 %bf.clear, 768
+  // SANITIZE-NEXT: store i32 %bf.set, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: ret void
+  a1.f3 = 3;
+}
+
+unsigned read8_2() {
+  // CHECK-LABEL: @_Z7read8_2v
+  // CHECK: %bf.load = load i16, i16* 

[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-10-12 Thread Hal Finkel via Phabricator via cfe-commits
hfinkel accepted this revision.
hfinkel added a comment.
This revision is now accepted and ready to land.

LGTM




Comment at: include/clang/Frontend/CodeGenOptions.def:182
 CODEGENOPT(SoftFloat , 1, 0) ///< -soft-float.
+CODEGENOPT(FineGrainedBitfieldAccesses, 1, 0) ///< Enable finegrained bitfield 
accesses.
 CODEGENOPT(StrictEnums   , 1, 0) ///< Optimize based on strict enum 
definition.

finegrained -> fine-grained

(I suppose we're not sticking to 80 cols in this file anyway)



Repository:
  rL LLVM

https://reviews.llvm.org/D36562





[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-10-08 Thread Hal Finkel via Phabricator via cfe-commits
hfinkel added inline comments.



Comment at: include/clang/Basic/DiagnosticDriverKinds.td:335
+def warn_drv_fine_grained_bitfield_accesses_ignored : Warning<
+  "option '-ffine-grained-bitfield-accesses' cannot be enabled together with 
sanitizer; flag ignored">,
+  InGroup;

with a sanitizer



Comment at: include/clang/Driver/Options.td:1041
+  "ffine-grained-bitfield-accesses">, Group, Flags<[CC1Option]>,
+  HelpText<"Use separate access for bitfields with legal widths and 
alignments.">;
+def fno_fine_grained_bitfield_accesses : Flag<["-"],

access -> accesses



Comment at: include/clang/Driver/Options.td:1044
+  "fno-fine-grained-bitfield-accesses">, Group, 
Flags<[CC1Option]>,
+  HelpText<"Use a big integer wrap for a consecutive run of bitfields.">;
+

Use large-integer access for consecutive bitfield runs.



Comment at: include/clang/Frontend/CodeGenOptions.def:182
 CODEGENOPT(SoftFloat , 1, 0) ///< -soft-float.
+CODEGENOPT(FineGrainedBitfieldAccesses, 1, 0) ///< Use separate access for 
bitfields
+  ///< with legal widths and 
alignments.

These lines are too long.



Comment at: lib/CodeGen/CGRecordLayoutBuilder.cpp:449
+// Otherwise, try to add bitfields to the run.
+if (Run != FieldEnd && !IsBetterAsSingleFieldRun(Run) &&
+Field != FieldEnd && !IsBetterAsSingleFieldRun(Field) &&

Why do you have the `IsBetterAsSingleFieldRun(Run)` check here (where we'll
evaluate it multiple times, for all of the fields in the run)? Can't you apply
the predicate above directly?

  // Any non-zero-length bitfield can start a new run.
  if (Field->getBitWidthValue(Context) != 0 &&
      !IsBetterAsSingleFieldRun(Field)) {
    Run = Field;
    StartBitOffset = getFieldBitOffset(*Field);
    ...



Repository:
  rL LLVM

https://reviews.llvm.org/D36562





[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-10-05 Thread Wei Mi via Phabricator via cfe-commits
wmi updated this revision to Diff 117947.
wmi added a comment.

Address Hal's comment.


Repository:
  rL LLVM

https://reviews.llvm.org/D36562

Files:
  include/clang/Basic/DiagnosticDriverKinds.td
  include/clang/Driver/Options.td
  include/clang/Frontend/CodeGenOptions.def
  lib/CodeGen/CGRecordLayoutBuilder.cpp
  lib/Driver/ToolChains/Clang.cpp
  lib/Frontend/CompilerInvocation.cpp
  test/CodeGenCXX/finegrain-bitfield-access.cpp

Index: test/CodeGenCXX/finegrain-bitfield-access.cpp
===
--- test/CodeGenCXX/finegrain-bitfield-access.cpp
+++ test/CodeGenCXX/finegrain-bitfield-access.cpp
@@ -0,0 +1,162 @@
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -ffine-grained-bitfield-accesses \
+// RUN:   -emit-llvm -o - %s | FileCheck %s
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -ffine-grained-bitfield-accesses \
+// RUN:   -emit-llvm -fsanitize=address -o - %s | FileCheck %s --check-prefix=SANITIZE
+// Check -fsplit-bitfields will be ignored since sanitizer is enabled.
+
+struct S1 {
+  unsigned f1:2;
+  unsigned f2:6;
+  unsigned f3:8;
+  unsigned f4:4;
+  unsigned f5:8;
+};
+
+S1 a1;
+unsigned read8_1() {
+  // CHECK-LABEL: @_Z7read8_1v
+  // CHECK: %bf.load = load i8, i8* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 1), align 1
+  // CHECK-NEXT: %bf.cast = zext i8 %bf.load to i32
+  // CHECK-NEXT: ret i32 %bf.cast
+  // SANITIZE-LABEL: @_Z7read8_1v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE: %bf.lshr = lshr i32 %bf.load, 8
+  // SANITIZE: %bf.clear = and i32 %bf.lshr, 255
+  // SANITIZE: ret i32 %bf.clear
+  return a1.f3;
+}
+void write8_1() {
+  // CHECK-LABEL: @_Z8write8_1v
+  // CHECK: store i8 3, i8* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 1), align 1
+  // CHECK-NEXT: ret void
+  // SANITIZE-LABEL: @_Z8write8_1v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: %bf.clear = and i32 %bf.load, -65281
+  // SANITIZE-NEXT: %bf.set = or i32 %bf.clear, 768
+  // SANITIZE-NEXT: store i32 %bf.set, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: ret void
+  a1.f3 = 3;
+}
+
+unsigned read8_2() {
+  // CHECK-LABEL: @_Z7read8_2v
+  // CHECK: %bf.load = load i16, i16* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 2), align 2
+  // CHECK-NEXT: %bf.lshr = lshr i16 %bf.load, 4
+  // CHECK-NEXT: %bf.clear = and i16 %bf.lshr, 255
+  // CHECK-NEXT: %bf.cast = zext i16 %bf.clear to i32
+  // CHECK-NEXT: ret i32 %bf.cast
+  // SANITIZE-LABEL: @_Z7read8_2v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: %bf.lshr = lshr i32 %bf.load, 20
+  // SANITIZE-NEXT: %bf.clear = and i32 %bf.lshr, 255
+  // SANITIZE-NEXT: ret i32 %bf.clear
+  return a1.f5;
+}
+void write8_2() {
+  // CHECK-LABEL: @_Z8write8_2v
+  // CHECK: %bf.load = load i16, i16* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 2), align 2
+  // CHECK-NEXT: %bf.clear = and i16 %bf.load, -4081
+  // CHECK-NEXT: %bf.set = or i16 %bf.clear, 48
+  // CHECK-NEXT: store i16 %bf.set, i16* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 2), align 2
+  // CHECK-NEXT: ret void
+  // SANITIZE-LABEL: @_Z8write8_2v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: %bf.clear = and i32 %bf.load, -267386881
+  // SANITIZE-NEXT: %bf.set = or i32 %bf.clear, 3145728
+  // SANITIZE-NEXT: store i32 %bf.set, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: ret void
+  a1.f5 = 3;
+}
+
+struct S2 {
+  unsigned long f1:16;
+  unsigned long f2:16;
+  unsigned long f3:6;
+};
+
+S2 a2;
+unsigned read16_1() {
+  // CHECK-LABEL: @_Z8read16_1v
+  // CHECK: %bf.load = load i16, i16* getelementptr inbounds (%struct.S2, %struct.S2* @a2, i32 0, i32 0), align 8
+  // CHECK-NEXT: %bf.cast = zext i16 %bf.load to i64
+  // CHECK-NEXT: %conv = trunc i64 %bf.cast to i32
+  // CHECK-NEXT: ret i32 %conv
+  // SANITIZE-LABEL: @_Z8read16_1v
+  // SANITIZE: %bf.load = load i64, i64* bitcast {{.*}}, align 8
+  // SANITIZE-NEXT: %bf.clear = and i64 %bf.load, 65535
+  // SANITIZE-NEXT: %conv = trunc i64 %bf.clear to i32
+  // SANITIZE-NEXT: ret i32 %conv
+  return a2.f1;
+}
+unsigned read16_2() {
+  // CHECK-LABEL: @_Z8read16_2v
+  // CHECK: %bf.load = load i16, i16* getelementptr inbounds (%struct.S2, %struct.S2* @a2, i32 0, i32 1), align 2
+  // CHECK-NEXT: %bf.cast = zext i16 %bf.load to i64
+  // CHECK-NEXT: %conv = trunc i64 %bf.cast to i32
+  // CHECK-NEXT: ret i32 %conv
+  // SANITIZE-LABEL: @_Z8read16_2v
+  // SANITIZE: %bf.load = load i64, i64* bitcast {{.*}}, align 8
+  // SANITIZE-NEXT: %bf.lshr = lshr i64 %bf.load, 16
+  // SANITIZE-NEXT: %bf.clear = and i64 %bf.lshr, 65535
+  // SANITIZE-NEXT: %conv = trunc i64 %bf.clear to i32
+  // SANITIZE-NEXT: ret i32 %conv
+  return a2.f2;
+}
+
+void write16_1() {
+  // 

[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-10-05 Thread Wei Mi via Phabricator via cfe-commits
wmi marked an inline comment as done.
wmi added inline comments.



Comment at: lib/CodeGen/CGRecordLayoutBuilder.cpp:444
 // Add bitfields to the run as long as they qualify.
-if (Field != FieldEnd && Field->getBitWidthValue(Context) != 0 &&
-Tail == getFieldBitOffset(*Field)) {
-  Tail += Field->getBitWidthValue(Context);
-  ++Field;
-  continue;
+if (Field != FieldEnd && !SingleFieldRun) {
+  SingleFieldRun = betterBeSingleFieldRun(Field);

hfinkel wrote:
> The logic here is not obvious. Can you please add a comment. SingleFieldRun 
> here is only not equal to `betterBeSingleFieldRun(Field)` if we've skipped 
> 0-length bitfields, right? Please explain what's going on and also please 
> make sure there's a test case.
I restructured the code a little bit and hope the logic is clearer now. I
already have a testcase added for it.
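
For readers following along, here is a small self-contained model of the
run-splitting idea being reviewed (it mirrors the predicate's intent, not the
actual CGRecordLayoutBuilder code; the legal widths and the S1 layout are
assumptions taken from the test above):

  #include <cstddef>
  #include <iostream>
  #include <vector>

  // Toy model: a bitfield is "better as a single-field run" if its width is a
  // legal integer width and its bit offset is naturally aligned for that
  // width.  Such a field ends the current run and forms a run of its own.
  struct BitField { const char *Name; unsigned Width; };

  static bool isLegalIntWidth(unsigned W) {
    return W == 8 || W == 16 || W == 32 || W == 64; // assumed target widths
  }

  static bool isBetterAsSingleFieldRun(unsigned Width, unsigned Offset) {
    return isLegalIntWidth(Width) && Offset % Width == 0;
  }

  int main() {
    // Same layout as S1 in the test: expected runs are {f1,f2}, {f3}, {f4,f5}.
    std::vector<BitField> Fields = {{"f1", 2}, {"f2", 6}, {"f3", 8},
                                    {"f4", 4}, {"f5", 8}};
    unsigned Offset = 0, Run = 0;
    bool RunIsSingle = false;
    for (std::size_t I = 0; I != Fields.size(); ++I) {
      bool Single = isBetterAsSingleFieldRun(Fields[I].Width, Offset);
      // Start a new run at the first field, right after a single-field run,
      // or when the current field wants to be alone.
      if (I == 0 || RunIsSingle || Single) {
        ++Run;
        RunIsSingle = Single;
      }
      std::cout << Fields[I].Name << " -> run " << Run << '\n';
      Offset += Fields[I].Width;
    }
  }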


Repository:
  rL LLVM

https://reviews.llvm.org/D36562





[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-10-05 Thread Hal Finkel via Phabricator via cfe-commits
hfinkel added inline comments.



Comment at: lib/CodeGen/CGRecordLayoutBuilder.cpp:411
+  // component so as it can be accessed directly with lower cost.
+  auto betterBeSingleFieldRun = [&](RecordDecl::field_iterator Field) {
+if (!Types.getCodeGenOpts().FineGrainedBitfieldAccesses)

betterBeSingleFieldRun -> IsBetterAsSingleFieldRun



Comment at: lib/CodeGen/CGRecordLayoutBuilder.cpp:444
 // Add bitfields to the run as long as they qualify.
-if (Field != FieldEnd && Field->getBitWidthValue(Context) != 0 &&
-Tail == getFieldBitOffset(*Field)) {
-  Tail += Field->getBitWidthValue(Context);
-  ++Field;
-  continue;
+if (Field != FieldEnd && !SingleFieldRun) {
+  SingleFieldRun = betterBeSingleFieldRun(Field);

The logic here is not obvious. Can you please add a comment. SingleFieldRun 
here is only not equal to `betterBeSingleFieldRun(Field)` if we've skipped 
0-length bitfields, right? Please explain what's going on and also please make 
sure there's a test case.


Repository:
  rL LLVM

https://reviews.llvm.org/D36562





[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-27 Thread Wei Mi via Phabricator via cfe-commits
wmi updated this revision to Diff 116896.
wmi added a comment.

Address Hal's comment: separate bitfields into shards delimited by the
naturally-sized-and-aligned fields.


Repository:
  rL LLVM

https://reviews.llvm.org/D36562

Files:
  include/clang/Basic/DiagnosticDriverKinds.td
  include/clang/Driver/Options.td
  include/clang/Frontend/CodeGenOptions.def
  lib/CodeGen/CGRecordLayoutBuilder.cpp
  lib/Driver/ToolChains/Clang.cpp
  lib/Frontend/CompilerInvocation.cpp
  test/CodeGenCXX/bitfield-split.cpp

Index: test/CodeGenCXX/bitfield-split.cpp
===
--- test/CodeGenCXX/bitfield-split.cpp
+++ test/CodeGenCXX/bitfield-split.cpp
@@ -0,0 +1,162 @@
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -ffine-grained-bitfield-accesses \
+// RUN:   -emit-llvm -o - %s | FileCheck %s
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -ffine-grained-bitfield-accesses \
+// RUN:   -emit-llvm -fsanitize=address -o - %s | FileCheck %s --check-prefix=SANITIZE
+// Check -fsplit-bitfields will be ignored since sanitizer is enabled.
+
+struct S1 {
+  unsigned f1:2;
+  unsigned f2:6;
+  unsigned f3:8;
+  unsigned f4:4;
+  unsigned f5:8;
+};
+
+S1 a1;
+unsigned read8_1() {
+  // CHECK-LABEL: @_Z7read8_1v
+  // CHECK: %bf.load = load i8, i8* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 1), align 1
+  // CHECK-NEXT: %bf.cast = zext i8 %bf.load to i32
+  // CHECK-NEXT: ret i32 %bf.cast
+  // SANITIZE-LABEL: @_Z7read8_1v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE: %bf.lshr = lshr i32 %bf.load, 8
+  // SANITIZE: %bf.clear = and i32 %bf.lshr, 255
+  // SANITIZE: ret i32 %bf.clear
+  return a1.f3;
+}
+void write8_1() {
+  // CHECK-LABEL: @_Z8write8_1v
+  // CHECK: store i8 3, i8* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 1), align 1
+  // CHECK-NEXT: ret void
+  // SANITIZE-LABEL: @_Z8write8_1v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: %bf.clear = and i32 %bf.load, -65281
+  // SANITIZE-NEXT: %bf.set = or i32 %bf.clear, 768
+  // SANITIZE-NEXT: store i32 %bf.set, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: ret void
+  a1.f3 = 3;
+}
+
+unsigned read8_2() {
+  // CHECK-LABEL: @_Z7read8_2v
+  // CHECK: %bf.load = load i16, i16* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 2), align 2
+  // CHECK-NEXT: %bf.lshr = lshr i16 %bf.load, 4
+  // CHECK-NEXT: %bf.clear = and i16 %bf.lshr, 255
+  // CHECK-NEXT: %bf.cast = zext i16 %bf.clear to i32
+  // CHECK-NEXT: ret i32 %bf.cast
+  // SANITIZE-LABEL: @_Z7read8_2v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: %bf.lshr = lshr i32 %bf.load, 20
+  // SANITIZE-NEXT: %bf.clear = and i32 %bf.lshr, 255
+  // SANITIZE-NEXT: ret i32 %bf.clear
+  return a1.f5;
+}
+void write8_2() {
+  // CHECK-LABEL: @_Z8write8_2v
+  // CHECK: %bf.load = load i16, i16* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 2), align 2
+  // CHECK-NEXT: %bf.clear = and i16 %bf.load, -4081
+  // CHECK-NEXT: %bf.set = or i16 %bf.clear, 48
+  // CHECK-NEXT: store i16 %bf.set, i16* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 2), align 2
+  // CHECK-NEXT: ret void
+  // SANITIZE-LABEL: @_Z8write8_2v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: %bf.clear = and i32 %bf.load, -267386881
+  // SANITIZE-NEXT: %bf.set = or i32 %bf.clear, 3145728
+  // SANITIZE-NEXT: store i32 %bf.set, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: ret void
+  a1.f5 = 3;
+}
+
+struct S2 {
+  unsigned long f1:16;
+  unsigned long f2:16;
+  unsigned long f3:6;
+};
+
+S2 a2;
+unsigned read16_1() {
+  // CHECK-LABEL: @_Z8read16_1v
+  // CHECK: %bf.load = load i16, i16* getelementptr inbounds (%struct.S2, %struct.S2* @a2, i32 0, i32 0), align 8
+  // CHECK-NEXT: %bf.cast = zext i16 %bf.load to i64
+  // CHECK-NEXT: %conv = trunc i64 %bf.cast to i32
+  // CHECK-NEXT: ret i32 %conv
+  // SANITIZE-LABEL: @_Z8read16_1v
+  // SANITIZE: %bf.load = load i64, i64* bitcast {{.*}}, align 8
+  // SANITIZE-NEXT: %bf.clear = and i64 %bf.load, 65535
+  // SANITIZE-NEXT: %conv = trunc i64 %bf.clear to i32
+  // SANITIZE-NEXT: ret i32 %conv
+  return a2.f1;
+}
+unsigned read16_2() {
+  // CHECK-LABEL: @_Z8read16_2v
+  // CHECK: %bf.load = load i16, i16* getelementptr inbounds (%struct.S2, %struct.S2* @a2, i32 0, i32 1), align 2
+  // CHECK-NEXT: %bf.cast = zext i16 %bf.load to i64
+  // CHECK-NEXT: %conv = trunc i64 %bf.cast to i32
+  // CHECK-NEXT: ret i32 %conv
+  // SANITIZE-LABEL: @_Z8read16_2v
+  // SANITIZE: %bf.load = load i64, i64* bitcast {{.*}}, align 8
+  // SANITIZE-NEXT: %bf.lshr = lshr i64 %bf.load, 16
+  // SANITIZE-NEXT: %bf.clear = and i64 %bf.lshr, 65535
+  // SANITIZE-NEXT: %conv = trunc i64 %bf.clear to i32
+  // SANITIZE-NEXT: ret i32 %conv
+  return 

[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-26 Thread Wei Mi via Phabricator via cfe-commits
wmi added a comment.

In https://reviews.llvm.org/D36562#880808, @hfinkel wrote:

> You seem to be only changing the behavior for the "separatable" fields, but I 
> suspect you want to change the behavior for the others too. The bitfield 
> would be decomposed into shards, separated by the naturally-sized-and-aligned 
> fields. Each access only loads its shard. For example, in your test case you 
> have:
>
>   struct S3 {
> unsigned long f1:14;
> unsigned long f2:18;
> unsigned long f3:32;
>   };
>   
>
> and you test that, with this option, loading/storing to a3.f3 only access the 
> specific 4 bytes composing f3. But if you load f1 or f2, we're still loading 
> all 8 bytes, right? I think we should only load/store the lower 4 bytes when 
> we access a3.f1 and/or a3.f2.


This is intentional. If the struct S3 is like the following:

  struct S3 {
    unsigned long f1:14;
    unsigned long f2:32;
    unsigned long f3:18;
  };

and if there is no write of a.f2 between a.f1 and a.f3, the loads of a.f1 and
a.f2 can still be shared. It tries to preserve the combining opportunity as
much as possible while reducing the cost of accessing
naturally-sized-and-aligned fields.

> Otherwise, you can again end up with the narrow-store/wide-load problem for 
> nearby fields under a different set of circumstances.

Good catch. It is indeed possible to have the problem. Considering the big
performance impact and the triaging difficulty of the store-forwarding problem,
I have to sacrifice the combining opportunity above and take the suggestion
just as you describe.

Thanks,
Wei.
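
To make the agreed direction concrete, here is a small sketch of the shard
decomposition for Hal's original S3 (the comments describe the intended access
widths under -ffine-grained-bitfield-accesses, not verified IR):

  // Sketch of the "shard" decomposition discussed above.  f3 has a legal
  // width (32) and a naturally aligned bit offset (32), so it becomes its own
  // shard; f1 and f2 together form the lower 32-bit shard.
  struct S3 {
    unsigned long f1 : 14; // bits  0..13 -- lower shard
    unsigned long f2 : 18; // bits 14..31 -- lower shard
    unsigned long f3 : 32; // bits 32..63 -- its own shard
  };

  S3 a3;

  unsigned long readF1() { return a3.f1; } // intended: 4-byte load of the lower shard
  unsigned long readF3() { return a3.f3; } // intended: 4-byte load of the upper shard
  void writeF3() { a3.f3 = 5; }            // intended: 4-byte store to the upper shard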




Comment at: include/clang/Driver/Options.td:1032
 
+def fsplit_bitfields : Flag<["-"], "fsplit-bitfields">,
+  Group, Flags<[CC1Option]>,

hfinkel wrote:
> I'm not opposed to -fsplit-bitfields, but I'd prefer if we find something 
> more self-explanatory. It's not really clear what "splitting a bitfield" 
> means. Maybe?
> 
>   -fsplit-bitfield-accesses
>   -fdecomposed-bitfield-accesses
>   -fsharded-bitfield-accesses
>   -ffine-grained-bitfield-accesses
> 
> (I think that I prefer -ffine-grained-bitfield-accesses, although it's the 
> longest)
Ok. 



Comment at: include/clang/Driver/Options.td:1034
+  Group, Flags<[CC1Option]>,
+  HelpText<"Enable splitting bitfield with legal width and alignment in 
LLVM.">;
+def fno_split_bitfields : Flag<["-"], "fno-split-bitfields">,

hfinkel wrote:
> How about?
> 
>   Use separate access for bitfields with legal widths and alignments.
> 
> I don't think that "in LLVM" is needed here (or we could put "in LLVM" on an 
> awful lot of these options).
Sure.


Repository:
  rL LLVM

https://reviews.llvm.org/D36562





[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-26 Thread Hal Finkel via Phabricator via cfe-commits
hfinkel added a comment.

You seem to be only changing the behavior for the "separatable" fields, but I 
suspect you want to change the behavior for the others too. The bitfield would 
be decomposed into shards, separated by the naturally-sized-and-aligned fields. 
Each access only loads its shard. For example, in your test case you have:

  struct S3 {
unsigned long f1:14;
unsigned long f2:18;
unsigned long f3:32;
  };

and you test that, with this option, loading/storing to a3.f3 only access the 
specific 4 bytes composing f3. But if you load f1 or f2, we're still loading 
all 8 bytes, right? I think we should only load/store the lower 4 bytes when we 
access a3.f1 and/or a3.f2.

Otherwise, you can again end up with the narrow-store/wide-load problem for 
nearby fields under a different set of circumstances.
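
As a rough illustration of that hazard (a sketch only; the comments state the
access pattern this discussion implies, not measured codegen):

  // If only f3 gets its own 4-byte access while f1/f2 still use an 8-byte
  // load of the whole allocation unit, the narrow store below cannot be
  // forwarded to the wide load that immediately follows it.
  struct S3 {
    unsigned long f1 : 14;
    unsigned long f2 : 18;
    unsigned long f3 : 32;
  };

  S3 a3;

  unsigned long touch() {
    a3.f3 = 1;    // narrow 4-byte store to the upper half of the word
    return a3.f1; // wide 8-byte load of the whole word -> forwarding stall risk
  }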




Comment at: include/clang/Driver/Options.td:1032
 
+def fsplit_bitfields : Flag<["-"], "fsplit-bitfields">,
+  Group, Flags<[CC1Option]>,

I'm not opposed to -fsplit-bitfields, but I'd prefer if we find something more 
self-explanatory. It's not really clear what "splitting a bitfield" means. 
Maybe?

  -fsplit-bitfield-accesses
  -fdecomposed-bitfield-accesses
  -fsharded-bitfield-accesses
  -ffine-grained-bitfield-accesses

(I think that I prefer -ffine-grained-bitfield-accesses, although it's the 
longest)



Comment at: include/clang/Driver/Options.td:1034
+  Group, Flags<[CC1Option]>,
+  HelpText<"Enable splitting bitfield with legal width and alignment in 
LLVM.">;
+def fno_split_bitfields : Flag<["-"], "fno-split-bitfields">,

How about?

  Use separate access for bitfields with legal widths and alignments.

I don't think that "in LLVM" is needed here (or we could put "in LLVM" on an 
awful lot of these options).



Comment at: lib/CodeGen/CGExpr.cpp:1679
+/// field has legal integer width, and its bitfield offset is naturally
+/// aligned, it is better to access the bitfield like a separate integer var.
+static bool IsSeparatableBitfield(const CGBitFieldInfo ,

var -> variable


Repository:
  rL LLVM

https://reviews.llvm.org/D36562





[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-21 Thread Wei Mi via Phabricator via cfe-commits
wmi updated this revision to Diff 116232.
wmi added a comment.

Changes following the discussion:

- Put the bitfield split logic under an option that is off by default.
- When a sanitizer is enabled, the bitfield split option will be ignored and a
  warning message will be emitted.

In addition, a test is added.


Repository:
  rL LLVM

https://reviews.llvm.org/D36562

Files:
  include/clang/Basic/DiagnosticDriverKinds.td
  include/clang/Driver/Options.td
  include/clang/Frontend/CodeGenOptions.def
  lib/CodeGen/CGExpr.cpp
  lib/Driver/ToolChains/Clang.cpp
  lib/Frontend/CompilerInvocation.cpp
  test/CodeGenCXX/bitfield-split.cpp

Index: test/CodeGenCXX/bitfield-split.cpp
===
--- test/CodeGenCXX/bitfield-split.cpp
+++ test/CodeGenCXX/bitfield-split.cpp
@@ -0,0 +1,161 @@
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -fsplit-bitfields -emit-llvm \
+// RUN:  -o - %s | FileCheck %s
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -fsplit-bitfields -emit-llvm \
+// RUN:  -fsanitize=address -o - %s | FileCheck %s --check-prefix=SANITIZE
+// Check -fsplit-bitfields will be ignored since sanitizer is enabled.
+
+struct S1 {
+  unsigned f1:2;
+  unsigned f2:6;
+  unsigned f3:8;
+  unsigned f4:4;
+  unsigned f5:8;
+};
+
+S1 a1;
+unsigned read8_1() {
+  // CHECK-LABEL: @_Z7read8_1v
+  // CHECK: %bf.load = load i8, i8* getelementptr (i8, i8* bitcast (%struct.S1* @a1 to i8*), i32 1), align 1
+  // CHECK-NEXT: %bf.cast = zext i8 %bf.load to i32
+  // CHECK-NEXT: ret i32 %bf.cast
+  // SANITIZE-LABEL: @_Z7read8_1v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE: %bf.lshr = lshr i32 %bf.load, 8
+  // SANITIZE: %bf.clear = and i32 %bf.lshr, 255
+  // SANITIZE: ret i32 %bf.clear
+  return a1.f3;
+}
+void write8_1() {
+  // CHECK-LABEL: @_Z8write8_1v
+  // CHECK: store i8 3, i8* getelementptr (i8, i8* bitcast (%struct.S1* @a1 to i8*), i32 1), align 1
+  // CHECK-NEXT: ret void
+  // SANITIZE-LABEL: @_Z8write8_1v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: %bf.clear = and i32 %bf.load, -65281
+  // SANITIZE-NEXT: %bf.set = or i32 %bf.clear, 768
+  // SANITIZE-NEXT: store i32 %bf.set, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: ret void
+  a1.f3 = 3;
+}
+
+unsigned read8_2() {
+  // CHECK-LABEL: @_Z7read8_2v
+  // CHECK: %bf.load = load i32, i32* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 0), align 4
+  // CHECK-NEXT: %bf.lshr = lshr i32 %bf.load, 20
+  // CHECK-NEXT: %bf.clear = and i32 %bf.lshr, 255
+  // CHECK-NEXT: ret i32 %bf.clear
+  // SANITIZE-LABEL: @_Z7read8_2v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: %bf.lshr = lshr i32 %bf.load, 20
+  // SANITIZE-NEXT: %bf.clear = and i32 %bf.lshr, 255
+  // SANITIZE-NEXT: ret i32 %bf.clear
+  return a1.f5;
+}
+void write8_2() {
+  // CHECK-LABEL: @_Z8write8_2v
+  // CHECK: %bf.load = load i32, i32* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 0), align 4
+  // CHECK-NEXT: %bf.clear = and i32 %bf.load, -267386881
+  // CHECK-NEXT: %bf.set = or i32 %bf.clear, 3145728
+  // CHECK-NEXT: store i32 %bf.set, i32* getelementptr inbounds (%struct.S1, %struct.S1* @a1, i32 0, i32 0), align 4
+  // CHECK-NEXT: ret void
+  // SANITIZE-LABEL: @_Z8write8_2v
+  // SANITIZE: %bf.load = load i32, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: %bf.clear = and i32 %bf.load, -267386881
+  // SANITIZE-NEXT: %bf.set = or i32 %bf.clear, 3145728
+  // SANITIZE-NEXT: store i32 %bf.set, i32* getelementptr inbounds {{.*}}, align 4
+  // SANITIZE-NEXT: ret void
+  a1.f5 = 3;
+}
+
+struct S2 {
+  unsigned long f1:16;
+  unsigned long f2:16;
+  unsigned long f3:6;
+};
+
+S2 a2;
+unsigned read16_1() {
+  // CHECK-LABEL: @_Z8read16_1v
+  // CHECK: %bf.load = load i16, i16* bitcast (%struct.S2* @a2 to i16*), align 2
+  // CHECK-NEXT: %bf.cast = zext i16 %bf.load to i64
+  // CHECK-NEXT: %conv = trunc i64 %bf.cast to i32
+  // CHECK-NEXT: ret i32 %conv
+  // SANITIZE-LABEL: @_Z8read16_1v
+  // SANITIZE: %bf.load = load i64, i64* bitcast {{.*}}, align 8
+  // SANITIZE-NEXT: %bf.clear = and i64 %bf.load, 65535
+  // SANITIZE-NEXT: %conv = trunc i64 %bf.clear to i32
+  // SANITIZE-NEXT: ret i32 %conv
+  return a2.f1;
+}
+unsigned read16_2() {
+  // CHECK-LABEL: @_Z8read16_2v
+  // CHECK: %bf.load = load i16, i16* bitcast (i8* getelementptr (i8, i8* bitcast (%struct.S2* @a2 to i8*), i32 2) to i16*), align 2
+  // CHECK-NEXT: %bf.cast = zext i16 %bf.load to i64
+  // CHECK-NEXT: %conv = trunc i64 %bf.cast to i32
+  // CHECK-NEXT: ret i32 %conv
+  // SANITIZE-LABEL: @_Z8read16_2v
+  // SANITIZE: %bf.load = load i64, i64* bitcast {{.*}}, align 8
+  // SANITIZE-NEXT: %bf.lshr = lshr i64 %bf.load, 16
+  // SANITIZE-NEXT: %bf.clear = and i64 %bf.lshr, 65535
+  // SANITIZE-NEXT: %conv = trunc i64 %bf.clear to i32
+  // 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-04 Thread Xinliang David Li via cfe-commits
On Mon, Sep 4, 2017 at 1:57 AM, Chandler Carruth 
wrote:

> On Sun, Sep 3, 2017 at 10:41 PM Hal Finkel via llvm-commits <
> llvm-comm...@lists.llvm.org> wrote:
>
>> Nevertheless, I think that you've convinced me that this is a least-bad
>> solution. I'll want a flag preserving the old behavior. Something like
>> -fwiden-bitfield-accesses (modulo bikeshedding).
>>
> Several different approaches have been discussed in this thread, I'm not
> sure what you mean about "least-bad solution"...
>
> I remain completely unconvinced that we should change the default
> behavior. At most, I'm not strongly opposed to adding an attribute that
> indicates "please try to use narrow loads for this bitfield member" and is
> an error if applied to a misaligned or non-byte-sized bitfield member. But
> I remain strongly opposed to changing the default behavior.
>

Having an option and making it off by default for now is a fine solution. We
can also use this option to collect more empirical data on a wider range of
tests and architectures with the community's help.


> We have one or two cases that regress and are easily addressed by source
> changes (either to not use bitfields or to use an attribute). I don't think
> that is cause to change the lowering Clang has used for years and
> potentially regress many other use cases.
>

Note that this is not about those particular cases. It is more about doing the
right thing (making the compiler and the HW work together to drive the best
performance for our users).

thanks,

David


Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-04 Thread Xinliang David Li via cfe-commits
On Mon, Sep 4, 2017 at 9:17 AM, Hal Finkel  wrote:

>
> On 09/04/2017 03:57 AM, Chandler Carruth wrote:
>
> On Sun, Sep 3, 2017 at 10:41 PM Hal Finkel via llvm-commits <
> llvm-comm...@lists.llvm.org> wrote:
>
>> Nevertheless, I think that you've convinced me that this is a least-bad
>> solution. I'll want a flag preserving the old behavior. Something like
>> -fwiden-bitfield-accesses (modulo bikeshedding).
>>
> Several different approaches have been discussed in this thread, I'm not
> sure what you mean about "least-bad solution"...
>
> I remain completely unconvinced that we should change the default
> behavior. At most, I'm not strongly opposed to adding an attribute that
> indicates "please try to use narrow loads for this bitfield member" and is
> an error if applied to a misaligned or non-byte-sized bitfield member.
>
>
> I like this solution too (modulo the fact that I dislike all of these
> solutions). Restricting this only to affecting the loads, and not the
> stores, is also an interesting thought. The attribute could also be on the
> access itself (which, at least from the theoretical perspective, I'd
> prefer).
>
> But I remain strongly opposed to changing the default behavior. We have
> one or two cases that regress and are easily addressed by source changes
> (either to not use bitfields or to use an attribute). I don't think that is
> cause to change the lowering Clang has used for years and potentially
> regress many other use cases.
>
>
> I have mixed feelings about all of the potential fixes here. To walk
> through my thoughts on this:
>
>  1. I don't like any solutions that require changes affecting source level
> semantics. Something that the compiler does automatically is fine, as is an
> attribute.
>
>  2. Next, regarding default behavior, we have a trade off:
>
>A. Breaking apart the accesses, as proposed here, falls into the
> category of "generally, it makes everything a little bit slower." But it's
> worse than that because it comes on a spectrum. I can easily construct
> variants of the provided benchmark which make the separate loads have a bad
> performance penalty. For example:
>
> $ cat ~/tmp/3m.cpp
> class A {
> public:
> #ifdef BF
>   unsigned long f7:8;
>   unsigned long f6:8;
>   unsigned long f5:8;
>   unsigned long f4:8;
>   unsigned long f3:8;
>   unsigned long f2:8;
>   unsigned long f1:8;
>   unsigned long f0:8;
> #else
>   unsigned char f7;
>   unsigned char f6;
>   unsigned char f5;
>   unsigned char f4;
>   unsigned char f3;
>   unsigned char f2;
>   unsigned char f1;
>   unsigned char f0;
> #endif
> };
> A a;
> bool b;
> unsigned long N = 10;
>
> __attribute__((noinline))
> void foo() {
>   a.f4 = 3;
> }
>
> __attribute__((noinline))
> void goo() {
>   b = (a.f0 == 0 && a.f1 == (unsigned char)-1 &&
>a.f2 == 0 && a.f3 == 0 && a.f4 == 0 && a.f5 == 0 &&
>a.f6 == 0 && a.f7 == (unsigned char)-1);
> }
>
> int main() {
>   unsigned long i;
>   foo();
>   for (i = 0; i < N; i++) {
> goo();
>   }
> }
>
> Run this and you'll find that our current scheme, on Haswell, beats
> the separate-loads scheme by nearly 60% (e.g., 2.77s for separate loads vs.
> 1.75s for the current bitfield lowering).
>
> So, how common is it to have a bitfield with a large number of fields
> that could be loaded separately (i.e. have the right size and alignment)
> and have code that loads a large number of them like this (i.e. without
> other work to hide the relative costs)? I have no idea, but regardless,
> there is definitely a high-cost end to this spectrum.
>


Not sure about the answer. The test case itself is a little extreme, and there
are many ways to change it slightly to tilt the balance. For instance:
1) add one more field with a store forwarding issue;
2) reduce the number of tested fields in goo by two or three or more (see the
   sketch below);
3) add two or three more field stores in foo();
4) store a non-zero value to a.f0 so that short-circuiting can happen without
   load widening; or
5) add more stores in foo, and enable inlining and more aggressive
   constant/copy propagation.
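
A sketch of the second variant (no timings are claimed here; the definitions
mirror the 3m.cpp listing quoted above, with an arbitrary iteration count):

// Variant of Hal's benchmark for item 2: goo() reads only three of the eight
// byte-sized bitfields.  With fewer fields read, the fine-grained scheme does
// less extra load work, while the wide-load scheme still risks the forwarding
// stall after foo()'s narrow store to a.f4.
class A {
public:
  unsigned long f7 : 8, f6 : 8, f5 : 8, f4 : 8;
  unsigned long f3 : 8, f2 : 8, f1 : 8, f0 : 8;
};
A a;
bool b;
unsigned long N = 100000000; // arbitrary iteration count for the sketch

__attribute__((noinline)) void foo() { a.f4 = 3; }

__attribute__((noinline)) void goo() {
  b = (a.f0 == 0 && a.f4 == 0 && a.f7 == (unsigned char)-1);
}

int main() {
  foo();
  for (unsigned long i = 0; i < N; i++)
    goo();
}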




>
>   B. However, our current scheme can trigger expensive architectural
> hazards. Moreover, there's no local after-the-fact analysis that can fix
> this consistently. I think that Wei has convincingly demonstrated both of
> these things. How common is this? I have no idea. More specifically, how do
> the relative costs of hitting these hazards compare to the costs of the
> increased number of loads under the proposed scheme? I have no idea (and
> this certainly has a workload-dependent answer).
>
>  C. This situation seems unlikely to change in the future: it seems like a
> physics problem. The data surrounding the narrower store is simply not in
> the pipeline to be matched with the wider load. Keeping the data in the
> pipeline would have extra costs, perhaps significant. I'm guessing the
> basic structure of this hazard is here to stay.
>
>  D. In the long run, could this be a 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-04 Thread Hal Finkel via cfe-commits


On 09/04/2017 03:57 AM, Chandler Carruth wrote:
On Sun, Sep 3, 2017 at 10:41 PM Hal Finkel via llvm-commits 
> wrote:


Nevertheless, I think that you've convinced me that this is a
least-bad solution. I'll want a flag preserving the old behavior.
Something like -fwiden-bitfield-accesses (modulo bikeshedding).

Several different approaches have been discussed in this thread, I'm 
not sure what you mean about "least-bad solution"...


I remain completely unconvinced that we should change the default 
behavior. At most, I'm not strongly opposed to adding an attribute 
that indicates "please try to use narrow loads for this bitfield 
member" and is an error if applied to a misaligned or non-byte-sized 
bitfield member.


I like this solution too (modulo the fact that I dislike all of these 
solutions). Restricting this only to affecting the loads, and not the 
stores, is also an interesting thought. The attribute could also be on 
the access itself (which, at least from the theoretical perspective, I'd 
prefer).


But I remain strongly opposed to changing the default behavior. We 
have one or two cases that regress and are easily addressed by source 
changes (either to not use bitfields or to use an attribute). I don't 
think that is cause to change the lowering Clang has used for years 
and potentially regress many other use cases.


I have mixed feelings about all of the potential fixes here. To walk 
through my thoughts on this:


 1. I don't like any solutions that require changes affecting source 
level semantics. Something that the compiler does automatically is fine, 
as is an attribute.


 2. Next, regarding default behavior, we have a trade off:

   A. Breaking apart the accesses, as proposed here, falls into the 
category of "generally, it makes everything a little bit slower." But 
it's worse than that because it comes on a spectrum. I can easily 
construct variants of the provided benchmark which make the separate 
loads have a bad performance penalty. For example:


$ cat ~/tmp/3m.cpp
class A {
public:
#ifdef BF
  unsigned long f7:8;
  unsigned long f6:8;
  unsigned long f5:8;
  unsigned long f4:8;
  unsigned long f3:8;
  unsigned long f2:8;
  unsigned long f1:8;
  unsigned long f0:8;
#else
  unsigned char f7;
  unsigned char f6;
  unsigned char f5;
  unsigned char f4;
  unsigned char f3;
  unsigned char f2;
  unsigned char f1;
  unsigned char f0;
#endif
};
A a;
bool b;
unsigned long N = 10;

__attribute__((noinline))
void foo() {
  a.f4 = 3;
}

__attribute__((noinline))
void goo() {
  b = (a.f0 == 0 && a.f1 == (unsigned char)-1 &&
   a.f2 == 0 && a.f3 == 0 && a.f4 == 0 && a.f5 == 0 &&
   a.f6 == 0 && a.f7 == (unsigned char)-1);
}

int main() {
  unsigned long i;
  foo();
  for (i = 0; i < N; i++) {
goo();
  }
}

Run this and you'll find that our current scheme, on Haswell, beats 
the separate-loads scheme by nearly 60% (e.g., 2.77s for separate loads 
vs. 1.75s for the current bitfield lowering).


So, how common is it to have a bitfield with a large number of 
fields that could be loaded separately (i.e. have the right size and 
alignment) and have code that loads a large number of them like this 
(i.e. without other work to hide the relative costs)? I have no idea, 
but regardless, there is definitely a high-cost end to this spectrum.


  B. However, our current scheme can trigger expensive architectural 
hazards. Moreover, there's no local after-the-fact analysis that can fix 
this consistently. I think that Wei has convincingly demonstrated both 
of these things. How common is this? I have no idea. More specifically, 
how do the relative costs of hitting these hazards compare to the costs 
of the increased number of loads under the proposed scheme? I have no 
idea (and this certainly has a workload-dependent answer).


 C. This situation seems unlikely to change in the future: it seems 
like a physics problem. The data surrounding the narrower store is 
simply not in the pipeline to be matched with the wider load. Keeping 
the data in the pipeline would have extra costs, perhaps significant. 
I'm guessing the basic structure of this hazard is here to stay.


 D. In the long run, could this be a PGO-driven decision? Yes, and this 
seems optimal. It would depend on infrastructure enhancements, and 
critically, the hardware having the right performance counter(s) to sample.


So, as to the question of what to do right now, I have two thoughts: 1) 
All of the solutions will be bad for someone. 2) Which is a least-bad 
default depends on the workload. Your colleagues seem to be asserting 
that, for Google, the separate loads are least bad (and, FWIW, you're 
more likely to have hot code like this than I am). This is definitely an 
issue on which reasonable people can disagree. In the end, I'll 
begrudgingly agree that this should be an empirical decision. We should 
have some 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-04 Thread Chandler Carruth via cfe-commits
On Sun, Sep 3, 2017 at 10:41 PM Hal Finkel via llvm-commits <
llvm-comm...@lists.llvm.org> wrote:

> Nevertheless, I think that you've convinced me that this is a least-bad
> solution. I'll want a flag preserving the old behavior. Something like
> -fwiden-bitfield-accesses (modulo bikeshedding).
>
Several different approaches have been discussed in this thread, I'm not
sure what you mean about "least-bad solution"...

I remain completely unconvinced that we should change the default behavior.
At most, I'm not strongly opposed to adding an attribute that indicates
"please try to use narrow loads for this bitfield member" and is an error
if applied to a misaligned or non-byte-sized bitfield member. But I remain
strongly opposed to changing the default behavior. We have one or two cases
that regress and are easily addressed by source changes (either to not use
bitfields or to use an attribute). I don't think that is cause to change
the lowering Clang has used for years and potentially regress many other
use cases.


Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-03 Thread Hal Finkel via cfe-commits


On 09/03/2017 11:22 PM, Wei Mi wrote:

On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel  wrote:

On 09/03/2017 10:38 PM, Xinliang David Li wrote:

Store forwarding stall cost is usually much higher compared with a load
hitting L1 cache. For instance, on Haswell,  the latter is ~4 cycles, while
the store forwarding stalls cost about 10 cycles more than a successful
store forwarding, which is roughly 15 cycles. In some scenarios, the store
forwarding stalls can be as high as 50 cycles. See Agner's documentation.


I understand. As I understand it, there are two potential ways to fix this
problem:

  1. You can make the store wider (to match the size of the wide load, thus
permitting forwarding).
  2. You can make the load smaller (to match the size of the small store,
thus permitting forwarding).

At least in this benchmark, which is a better solution?

Thanks again,
Hal


For this benchmark, the smaller load is better. On my Sandy Bridge desktop,
the wider store takes 3.77s and the smaller load 3.45s. If store forwarding is
blocked, it costs 6.9s.

However, we don't have a good way to narrow the load to match the store
shrinking, because the field information has been lost. For the IR below:

define void @_Z3goov() local_unnamed_addr #0 {
   %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
   %2 = and i64 %1, 0xff
   %3 = icmp eq i64 %2, 0x
   %4 = zext i1 %3 to i8
   store i8 %4, i8* @b, align 1, !tbaa !2
   ret void
}

We know the 24-bit range from bit 32 to bit 56 of @a is accessed, but we don't
know whether that range contains 8-bit + 16-bit bitfields, 16-bit + 8-bit
bitfields, or 8-bit + 8-bit + 8-bit bitfields. Once the load shrinking done
locally is inconsistent with the store shrinking, we will have a store
forwarding issue and will suffer a huge regression.
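
To illustrate the ambiguity, consider two hypothetical layouts that cover the
same bit range (a sketch only; the claim is just that the field boundaries are
no longer visible once the access has been widened to the pattern above, not
that the frontend emits byte-identical IR):

// Two layouts whose fields occupy the same 24-bit range starting at bit 32.
// After the usual widening/merging, checks on that range can end up as the
// same "load i64 + and + icmp" shape, so a late IR pass cannot tell which
// narrower load would line up with a store emitted in another function.
struct VariantA {
  unsigned long pad : 32;
  unsigned long x : 8;  // bits 32..39
  unsigned long y : 16; // bits 40..55
};
struct VariantB {
  unsigned long pad : 32;
  unsigned long x : 16; // bits 32..47
  unsigned long y : 8;  // bits 48..55
};

VariantA va;
VariantB vb;

bool checkA() { return va.x == 0 && va.y == 0; }
bool checkB() { return vb.x == 0 && vb.y == 0; }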


Understood. This is a convincing argument. The cost of splitting the 
loads seems not high, at least not in isolation.


Thanks again,
Hal



Thanks,
Wei.





In other words, the optimizer needs to be taught to avoid defeating  the HW
pipeline feature as much as possible.

David

On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel  wrote:


On 09/03/2017 03:44 PM, Wei Mi wrote:

On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel  wrote:

On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:

On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li 
wrote:


On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
 wrote:

chandlerc added a comment.

I'm really not a fan of the degree of complexity and subtlety that this
introduces into the frontend, all to allow particular backend optimizations.

I feel like this is Clang working around a fundamental deficiency in LLVM and
we should instead find a way to fix this in LLVM itself.

As has been pointed out before, user code can synthesize large integers that
small bit sequences are extracted from, and Clang and LLVM should handle those
just as well as actual bitfields.

Can we see how far we can push the LLVM side before we add complexity to Clang
here? I understand that there remain challenges to LLVM's stuff, but I don't
think those challenges make *all* of the LLVM improvements off the table, I
don't think we've exhausted all ways of improving the LLVM changes being
proposed, and I think we should still land all of those and re-evaluate how
important these issues are when all of that is in place.


The main challenge of doing this in LLVM is that inter-procedural analysis
(and possibly cross-module analysis) is needed (for store forwarding issues).

Wei, perhaps you can provide a concrete test case to illustrate the issue so
that reviewers have a good understanding.

David

Here is a runnable testcase:
 1.cc 
class A {
public:
 unsigned long f1:2;
 unsigned long f2:6;
 unsigned long f3:8;
 unsigned long f4:4;
};
A a;
unsigned long b;
unsigned long N = 10;

__attribute__((noinline))
void foo() {
 a.f3 = 3;
}

__attribute__((noinline))
void goo() {
 b = a.f3;
}

int main() {
 unsigned long i;
 for (i = 0; i < N; i++) {
   foo();
   goo();
 }
}

Now trunk takes about twice the running time compared with trunk + this
patch. That is because trunk shrinks the store of a.f3 in foo (done by the
DAGCombiner) but does not shrink the load of a.f3 in goo, so store
forwarding will be blocked.


I can confirm that this is true on Haswell and also on a POWER8.
Interestingly, on a POWER7, the performance is the same either way (on the
good side). I ran the test case as presented and where I replaced f3 with a
non-bitfield unsigned char member. Thinking that the POWER7 result might be
because it's big-endian, I flipped the order of the fields, and found that
the version where f3 is not a bitfield is faster than otherwise, but only by
12.5%.

Why, in this case, don't we shrink 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-03 Thread Hal Finkel via cfe-commits


On 09/04/2017 12:12 AM, Xinliang David Li wrote:



On Sun, Sep 3, 2017 at 9:23 PM, Hal Finkel > wrote:



On 09/03/2017 11:06 PM, Xinliang David Li wrote:

I think we can think this in another way.

For modern CPU architectures which support store forwarding with
store queues, it is generally not "safe" to blindly do local
optimizations to widen the load/stores


Why not widen stores? Generally the problem with store forwarding
is where the load is wider than the store (plus alignment issues).


True, but probably with some caveats which are target dependent. Store
widening also requires additional bit operations (and possibly an additional
load), so it is tricky to model the overall benefit.



without sophisticated inter-procedural analysis. Doing so will
run the risk of greatly reducing performance of some programs. Keeping
the naturally aligned load/store using its natural type is safer.

Does it make sense?


It makes sense. I could, however, say the same thing about
inlining. We need to make inlining decisions locally, but they
have global impact. Nevertheless, we need to make inlining
decisions, and there's no practical way to make them in a truly
non-local way.


Speaking of inlining, we are actually thinking of ways to make the 
decisions more globally optimal, but that is off topic.


Neat.



We also don't pessimize common cases to improve performance in
rare cases. In the common case, reducing pressure on the memory
units, and reducing the critical path, seem likely to me to be
optimal. If that's not true, or doing otherwise has negligible
cost (but can prevent rare downsides), we should certainly
consider those options.


Since we don't do load widening for non-bitfield cases (but only the very
limited case of naturally aligned bitfields), it is hard to say we pessimize
common cases for rare cases:


1) the upside of widening such accesses is not high, from experience with
other compilers (which do not do so);
2) there is an obvious downside of introducing additional extract
instructions, which degrades performance;
3) there is an obvious downside of severely degrading performance when store
forwarding is blocked.


I suspect that it's relatively rare to hit these store-to-load forwarding
issues compared to the number of times the system stores or loads bitfields.
In any case, I did some experiments on my Haswell
system and found that the load from Wei's benchmark which is split into 
two loads, compared to the single load version, is 0.012% slower. I, 
indeed, won't worry about that too much. On my P8, I couldn't measure a 
difference. Obviously, this does somewhat miss the point, as the real 
cost in this kind of thing comes in stressing the memory units in code 
with a lot going on, not in isolated cases.


Nevertheless, I think that you've convinced me that this is a least-bad 
solution. I'll want a flag preserving the old behavior. Something like 
-fwiden-bitfield-accesses (modulo bikeshedding).






And none of this answers the question of whether it's better to
have the store wider or the load split and narrow.



It seems safer to do store widening more aggressively to avoid the
store forwarding stall issue, but doing this aggressively may also mean
other runtime overhead is introduced (extra loads, data merges, etc.).


Yes. Wei confirmed that this is slower.

Thanks again,
Hal



Thanks,

David


Thanks again,
Hal




David



On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel > wrote:


On 09/03/2017 10:38 PM, Xinliang David Li wrote:

Store forwarding stall cost is usually much higher compared
with a load hitting L1 cache. For instance, on Haswell,  the
latter is ~4 cycles, while the store forwarding stalls cost
about 10 cycles more than a successful store forwarding,
which is roughly 15 cycles. In some scenarios, the store
forwarding stalls can be as high as 50 cycles. See Agner's
documentation.


I understand. As I understand it, there are two potential
ways to fix this problem:

 1. You can make the store wider (to match the size of the
wide load, thus permitting forwarding).
 2. You can make the load smaller (to match the size of the
small store, thus permitting forwarding).

At least in this benchmark, which is a better solution?

Thanks again,
Hal




In other words, the optimizer needs to be taught to avoid
defeating  the HW pipeline feature as much as possible.

David

On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel > wrote:


On 09/03/2017 03:44 PM, Wei Mi wrote:

On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel


Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-03 Thread Xinliang David Li via cfe-commits
On Sun, Sep 3, 2017 at 9:23 PM, Hal Finkel  wrote:

>
> On 09/03/2017 11:06 PM, Xinliang David Li wrote:
>
> I think we can think this in another way.
>
> For modern CPU architectures which supports store forwarding with store
> queues, it is generally not "safe" to blindly do local optimizations to
> widen the load/stores
>
>
> Why not widen stores? Generally the problem with store forwarding is where
> the load is wider than the store (plus alignment issues).
>
>
True, but probably with some caveats which are target dependent.  Store
widening also requires additional bit operations (and possibly an
additional load), so it is tricky to model the overall benefit.



> without sophisticated inter-procedural analysis. Doing so will run the
> risk of greatly reduce performance of some programs. Keep the naturally
> aligned load/store using its natural type is safer.
>
> Does it make sense?
>
>
> It makes sense. I could, however, say the same thing about inlining. We
> need to make inlining decisions locally, but they have global impact.
> Nevertheless, we need to make inlining decisions, and there's no practical
> way to make them in a truly non-local way.
>

Speaking of inlining, we are actually thinking of ways to make the
decisions more globally optimal, but that is off topic.


>
> We also don't pessimize common cases to improve performance in rare cases.
> In the common case, reducing pressure on the memory units, and reducing the
> critical path, seem likely to me to be optimal. If that's not true, or
> doing otherwise has negligible cost (but can prevent rare downsides), we
> should certainly consider those options.
>

Since we don't do load widening for non-bitfield cases (only the very
limited case of naturally aligned bitfields), it is hard to say we
pessimize common cases for rare cases:

1) the upside of widening such accesses is not high, from experience with
other compilers (which do not do so)
2) there is an obvious downside of introducing additional extract
instructions, which degrades performance
3) there is an obvious downside of severely degrading performance when
store forwarding is blocked.



> And none of this answers the question of whether it's better to have the
> store wider or the load split and narrow.
>


It seems safer to do store widening more aggressively to avoid the
store forwarding stall issue, but doing this aggressively may also mean
other runtime overhead is introduced (extra loads, data merges, etc.).
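
To make the extra work concrete, here is a minimal C++ sketch (purely
illustrative, assuming the usual little-endian layout of the 1.cc example
where f3 occupies bits 8-15 of the 64-bit allocation unit; the names are
invented):

#include <cstdint>
#include <cstring>

// Stand-in for the 8-byte allocation unit that holds f1..f4 of 'class A'.
struct Unit { unsigned char bytes[8]; };

// Narrow store, as produced by the DAGCombiner shrinking for 'a.f3 = 3':
// a single one-byte store, no load of the containing word.
void store_f3_narrow(Unit &u, uint8_t v) {
  u.bytes[1] = v;                      // byte 1 holds bits 8..15, i.e. f3
}

// Widened store that keeps the access at the natural 64-bit width: it
// needs an extra 8-byte load plus mask/shift/or before the 8-byte store.
void store_f3_wide(Unit &u, uint8_t v) {
  uint64_t word;
  std::memcpy(&word, u.bytes, 8);      // extra load of the whole unit
  word &= ~(uint64_t{0xff} << 8);      // clear the old f3 bits
  word |= uint64_t{v} << 8;            // merge in the new value
  std::memcpy(u.bytes, &word, 8);      // store the whole unit back
}

The wide form keeps a following 64-bit load of the unit forwardable, at
the price of exactly the extra load and bit manipulation mentioned above.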

Thanks,

David


>
> Thanks again,
> Hal
>
>
>
> David
>
>
>
> On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel  wrote:
>
>>
>> On 09/03/2017 10:38 PM, Xinliang David Li wrote:
>>
>> Store forwarding stall cost is usually much higher compared with a load
>> hitting L1 cache. For instance, on Haswell,  the latter is ~4 cycles, while
>> the store forwarding stalls cost about 10 cycles more than a successful
>> store forwarding, which is roughly 15 cycles. In some scenarios, the store
>> forwarding stalls can be as high as 50 cycles. See Agner's documentation.
>>
>>
>> I understand. As I understand it, there are two potential ways to fix
>> this problem:
>>
>>  1. You can make the store wider (to match the size of the wide load,
>> thus permitting forwarding).
>>  2. You can make the load smaller (to match the size of the small store,
>> thus permitting forwarding).
>>
>> At least in this benchmark, which is a better solution?
>>
>> Thanks again,
>> Hal
>>
>>
>>
>> In other words, the optimizer needs to be taught to avoid defeating  the
>> HW pipeline feature as much as possible.
>>
>> David
>>
>> On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel  wrote:
>>
>>>
>>> On 09/03/2017 03:44 PM, Wei Mi wrote:
>>>
 On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel  wrote:

> On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
>
>> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <
>> davi...@google.com>
>> wrote:
>>
>>>
>>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>>>  wrote:
>>>
 chandlerc added a comment.

 I'm really not a fan of the degree of complexity and subtlety that
 this
 introduces into the frontend, all to allow particular backend
 optimizations.

 I feel like this is Clang working around a fundamental deficiency in
 LLVM
 and we should instead find a way to fix this in LLVM itself.

 As has been pointed out before, user code can synthesize large
 integers
 that small bit sequences are extracted from, and Clang and LLVM
 should
 handle those just as well as actual bitfields.

 Can we see how far we can push the LLVM side before we add
 complexity to
 Clang here? I understand that there remain challenges to LLVM's
 stuff,
 but I
 don't think 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-03 Thread Hal Finkel via cfe-commits


On 09/03/2017 11:06 PM, Xinliang David Li wrote:

I think we can think this in another way.

For modern CPU architectures which support store forwarding with
store queues, it is generally not "safe" to blindly do local
optimizations to widen the loads/stores


Why not widen stores? Generally the problem with store forwarding is 
where the load is wider than the store (plus alignment issues).


without sophisticated inter-procedural analysis. Doing so will run the
risk of greatly reducing the performance of some programs. Keeping the
naturally aligned load/store using its natural type is safer.


Does it make sense?


It makes sense. I could, however, say the same thing about inlining. We 
need to make inlining decisions locally, but they have global impact. 
Nevertheless, we need to make inlining decisions, and there's no 
practical way to make them in a truly non-local way.


We also don't pessimize common cases to improve performance in rare 
cases. In the common case, reducing pressure on the memory units, and 
reducing the critical path, seem likely to me to be optimal. If that's 
not true, or doing otherwise has negligible cost (but can prevent rare 
downsides), we should certainly consider those options.


And none of this answers the question of whether it's better to have the 
store wider or the load split and narrow.


Thanks again,
Hal



David



On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel > wrote:



On 09/03/2017 10:38 PM, Xinliang David Li wrote:

Store forwarding stall cost is usually much higher compared with
a load hitting L1 cache. For instance, on Haswell,  the latter is
~4 cycles, while the store forwarding stalls cost about 10 cycles
more than a successful store forwarding, which is roughly 15
cycles. In some scenarios, the store forwarding stalls can be as
high as 50 cycles. See Agner's documentation.


I understand. As I understand it, there are two potential ways to
fix this problem:

 1. You can make the store wider (to match the size of the wide
load, thus permitting forwarding).
 2. You can make the load smaller (to match the size of the small
store, thus permitting forwarding).

At least in this benchmark, which is a better solution?

Thanks again,
Hal




In other words, the optimizer needs to be taught to avoid
defeating  the HW pipeline feature as much as possible.

David

On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel > wrote:


On 09/03/2017 03:44 PM, Wei Mi wrote:

On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel
> wrote:

On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:

On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David
Li >
wrote:


On Tue, Aug 22, 2017 at 6:37 PM, Chandler
Carruth via Phabricator
> wrote:

chandlerc added a comment.

I'm really not a fan of the degree of
complexity and subtlety that this
introduces into the frontend, all to
allow particular backend
optimizations.

I feel like this is Clang working around
a fundamental deficiency in
LLVM
and we should instead find a way to fix
this in LLVM itself.

As has been pointed out before, user code
can synthesize large integers
that small bit sequences are extracted
from, and Clang and LLVM should
handle those just as well as actual
bitfields.

Can we see how far we can push the LLVM
side before we add complexity to
Clang here? I understand that there
remain challenges to LLVM's stuff,
but I
don't think those challenges make *all*
of the LLVM improvements off the
table, I don't think we've exhausted all
ways of improving the LLVM
changes
being proposed, and I think we should
still land all of those and
re-evaluate how important these issues
are when all of that 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-03 Thread Wei Mi via cfe-commits
On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel  wrote:
>
> On 09/03/2017 10:38 PM, Xinliang David Li wrote:
>
> Store forwarding stall cost is usually much higher compared with a load
> hitting L1 cache. For instance, on Haswell,  the latter is ~4 cycles, while
> the store forwarding stalls cost about 10 cycles more than a successful
> store forwarding, which is roughly 15 cycles. In some scenarios, the store
> forwarding stalls can be as high as 50 cycles. See Agner's documentation.
>
>
> I understand. As I understand it, there are two potential ways to fix this
> problem:
>
>  1. You can make the store wider (to match the size of the wide load, thus
> permitting forwarding).
>  2. You can make the load smaller (to match the size of the small store,
> thus permitting forwarding).
>
> At least in this benchmark, which is a better solution?
>
> Thanks again,
> Hal
>

For this benchmark, the smaller load is better. On my Sandy Bridge
desktop, the wider store takes 3.77s and the smaller load 3.45s. If store
forwarding is blocked, it costs 6.9s.

However, we don't have a good way to narrow the load to match the store
shrinking, because the field information has been lost.  For the IR
below:

define void @_Z3goov() local_unnamed_addr #0 {
  %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
  %2 = and i64 %1, 0xff
  %3 = icmp eq i64 %2, 0x
  %4 = zext i1 %3 to i8
  store i8 %4, i8* @b, align 1, !tbaa !2
  ret void
}

We know the 24-bit range from bit 32 to bit 56 of @a is accessed, but
we don't know whether that range contains 8-bit + 16-bit bitfields,
16-bit + 8-bit bitfields, or 8-bit + 8-bit + 8-bit bitfields. Once the
load shrinking done locally is inconsistent with the store shrinking, we
will have a store forwarding issue and will suffer a huge regression.
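
For illustration, two hypothetical classes (invented for this example)
that occupy the same 24-bit window of the 64-bit allocation unit but
would want different shrunken accesses, assuming the usual little-endian
layout:

// Both layouts place their fields in bits [32,56) of the allocation
// unit.  A local pass that only sees the merged i64 load/and/icmp above
// cannot tell which split the stores in other translation units use.
class B1 {
public:
  unsigned long pad:32;
  unsigned long g1:16;   // bits [32,48)
  unsigned long g2:8;    // bits [48,56)
};

class B2 {
public:
  unsigned long pad:32;
  unsigned long h1:8;    // bits [32,40)
  unsigned long h2:8;    // bits [40,48)
  unsigned long h3:8;    // bits [48,56)
};

Shrinking the load to a 16-bit access at bit 32 would match shrunken
stores to B1::g1 but not shrunken stores to B2::h2, which is exactly the
inconsistency described above.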

Thanks,
Wei.



>
>
> In other words, the optimizer needs to be taught to avoid defeating  the HW
> pipeline feature as much as possible.
>
> David
>
> On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel  wrote:
>>
>>
>> On 09/03/2017 03:44 PM, Wei Mi wrote:
>>>
>>> On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel  wrote:

 On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
>
> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li 
> wrote:
>>
>>
>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>>  wrote:
>>>
>>> chandlerc added a comment.
>>>
>>> I'm really not a fan of the degree of complexity and subtlety that
>>> this
>>> introduces into the frontend, all to allow particular backend
>>> optimizations.
>>>
>>> I feel like this is Clang working around a fundamental deficiency in
>>> LLVM
>>> and we should instead find a way to fix this in LLVM itself.
>>>
>>> As has been pointed out before, user code can synthesize large
>>> integers
>>> that small bit sequences are extracted from, and Clang and LLVM
>>> should
>>> handle those just as well as actual bitfields.
>>>
>>> Can we see how far we can push the LLVM side before we add complexity
>>> to
>>> Clang here? I understand that there remain challenges to LLVM's
>>> stuff,
>>> but I
>>> don't think those challenges make *all* of the LLVM improvements off
>>> the
>>> table, I don't think we've exhausted all ways of improving the LLVM
>>> changes
>>> being proposed, and I think we should still land all of those and
>>> re-evaluate how important these issues are when all of that is in
>>> place.
>>
>>
>> The main challenge of doing  this in LLVM is that inter-procedural
>> analysis
>> (and possibly cross module) is needed (for store forwarding issues).
>>
>> Wei, perhaps you can provide concrete test case to illustrate the
>> issue
>> so
>> that reviewers have a good understanding.
>>
>> David
>
> Here is a runable testcase:
>  1.cc 
> class A {
> public:
> unsigned long f1:2;
> unsigned long f2:6;
> unsigned long f3:8;
> unsigned long f4:4;
> };
> A a;
> unsigned long b;
> unsigned long N = 10;
>
> __attribute__((noinline))
> void foo() {
> a.f3 = 3;
> }
>
> __attribute__((noinline))
> void goo() {
> b = a.f3;
> }
>
> int main() {
> unsigned long i;
> for (i = 0; i < N; i++) {
>   foo();
>   goo();
> }
> }
> 
> Now trunk takes about twice running time compared with trunk + this
> patch. That is because trunk shrinks the store of a.f3 in foo (Done by
> DagCombiner) but not shrink the load of a.f3 in goo, so store
> forwarding will be blocked.


 I can confirm that this is true on 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-03 Thread Xinliang David Li via cfe-commits
I think we can think this in another way.

For modern CPU architectures which support store forwarding with store
queues, it is generally not "safe" to blindly do local optimizations to
widen the loads/stores without sophisticated inter-procedural analysis.
Doing so will run the risk of greatly reducing the performance of some
programs. Keeping the naturally aligned load/store using its natural type
is safer.

Does it make sense?

David



On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel  wrote:

>
> On 09/03/2017 10:38 PM, Xinliang David Li wrote:
>
> Store forwarding stall cost is usually much higher compared with a load
> hitting L1 cache. For instance, on Haswell,  the latter is ~4 cycles, while
> the store forwarding stalls cost about 10 cycles more than a successful
> store forwarding, which is roughly 15 cycles. In some scenarios, the store
> forwarding stalls can be as high as 50 cycles. See Agner's documentation.
>
>
> I understand. As I understand it, there are two potential ways to fix this
> problem:
>
>  1. You can make the store wider (to match the size of the wide load, thus
> permitting forwarding).
>  2. You can make the load smaller (to match the size of the small store,
> thus permitting forwarding).
>
> At least in this benchmark, which is a better solution?
>
> Thanks again,
> Hal
>
>
>
> In other words, the optimizer needs to be taught to avoid defeating  the
> HW pipeline feature as much as possible.
>
> David
>
> On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel  wrote:
>
>>
>> On 09/03/2017 03:44 PM, Wei Mi wrote:
>>
>>> On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel  wrote:
>>>
 On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:

> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li  >
> wrote:
>
>>
>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>>  wrote:
>>
>>> chandlerc added a comment.
>>>
>>> I'm really not a fan of the degree of complexity and subtlety that
>>> this
>>> introduces into the frontend, all to allow particular backend
>>> optimizations.
>>>
>>> I feel like this is Clang working around a fundamental deficiency in
>>> LLVM
>>> and we should instead find a way to fix this in LLVM itself.
>>>
>>> As has been pointed out before, user code can synthesize large
>>> integers
>>> that small bit sequences are extracted from, and Clang and LLVM
>>> should
>>> handle those just as well as actual bitfields.
>>>
>>> Can we see how far we can push the LLVM side before we add
>>> complexity to
>>> Clang here? I understand that there remain challenges to LLVM's
>>> stuff,
>>> but I
>>> don't think those challenges make *all* of the LLVM improvements off
>>> the
>>> table, I don't think we've exhausted all ways of improving the LLVM
>>> changes
>>> being proposed, and I think we should still land all of those and
>>> re-evaluate how important these issues are when all of that is in
>>> place.
>>>
>>
>> The main challenge of doing  this in LLVM is that inter-procedural
>> analysis
>> (and possibly cross module) is needed (for store forwarding issues).
>>
>> Wei, perhaps you can provide concrete test case to illustrate the
>> issue
>> so
>> that reviewers have a good understanding.
>>
>> David
>>
> Here is a runable testcase:
>  1.cc 
> class A {
> public:
> unsigned long f1:2;
> unsigned long f2:6;
> unsigned long f3:8;
> unsigned long f4:4;
> };
> A a;
> unsigned long b;
> unsigned long N = 10;
>
> __attribute__((noinline))
> void foo() {
> a.f3 = 3;
> }
>
> __attribute__((noinline))
> void goo() {
> b = a.f3;
> }
>
> int main() {
> unsigned long i;
> for (i = 0; i < N; i++) {
>   foo();
>   goo();
> }
> }
> 
> Now trunk takes about twice running time compared with trunk + this
> patch. That is because trunk shrinks the store of a.f3 in foo (Done by
> DagCombiner) but not shrink the load of a.f3 in goo, so store
> forwarding will be blocked.
>

 I can confirm that this is true on Haswell and also on an POWER8.
 Interestingly, on a POWER7, the performance is the same either way (on
 the
 good side). I ran the test case as presented and where I replaced f3
 with a
 non-bitfield unsigned char member. Thinking that the POWER7 result
 might be
 because it's big-Endian, I flipped the order of the fields, and found
 that
 the version where f3 is not a bitfield is faster than otherwise, but
 only by
 12.5%.

 Why, in this case, don't 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-03 Thread Hal Finkel via cfe-commits


On 09/03/2017 10:38 PM, Xinliang David Li wrote:
Store forwarding stall cost is usually much higher compared with a 
load hitting L1 cache. For instance, on Haswell,  the latter is ~4 
cycles, while the store forwarding stalls cost about 10 cycles more 
than a successful store forwarding, which is roughly 15 cycles. In 
some scenarios, the store forwarding stalls can be as high as 50 
cycles. See Agner's documentation.


I understand. As I understand it, there are two potential ways to fix 
this problem:


 1. You can make the store wider (to match the size of the wide load, 
thus permitting forwarding).
 2. You can make the load smaller (to match the size of the small 
store, thus permitting forwarding).


At least in this benchmark, which is a better solution?

Thanks again,
Hal



In other words, the optimizer needs to be taught to avoid defeating 
 the HW pipeline feature as much as possible.


David

On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel > wrote:



On 09/03/2017 03:44 PM, Wei Mi wrote:

On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel > wrote:

On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:

On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li
>
wrote:


On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth
via Phabricator
> wrote:

chandlerc added a comment.

I'm really not a fan of the degree of
complexity and subtlety that this
introduces into the frontend, all to allow
particular backend
optimizations.

I feel like this is Clang working around a
fundamental deficiency in
LLVM
and we should instead find a way to fix this
in LLVM itself.

As has been pointed out before, user code can
synthesize large integers
that small bit sequences are extracted from,
and Clang and LLVM should
handle those just as well as actual bitfields.

Can we see how far we can push the LLVM side
before we add complexity to
Clang here? I understand that there remain
challenges to LLVM's stuff,
but I
don't think those challenges make *all* of the
LLVM improvements off the
table, I don't think we've exhausted all ways
of improving the LLVM
changes
being proposed, and I think we should still
land all of those and
re-evaluate how important these issues are
when all of that is in place.


The main challenge of doing  this in LLVM is that
inter-procedural
analysis
(and possibly cross module) is needed (for store
forwarding issues).

Wei, perhaps you can provide concrete test case to
illustrate the issue
so
that reviewers have a good understanding.

David

Here is a runable testcase:
 1.cc 
class A {
public:
unsigned long f1:2;
unsigned long f2:6;
unsigned long f3:8;
unsigned long f4:4;
};
A a;
unsigned long b;
unsigned long N = 10;

__attribute__((noinline))
void foo() {
a.f3 = 3;
}

__attribute__((noinline))
void goo() {
b = a.f3;
}

int main() {
unsigned long i;
for (i = 0; i < N; i++) {
  foo();
  goo();
}
}

Now trunk takes about twice running time compared with
trunk + this
patch. That is because trunk shrinks the store of a.f3
in foo (Done by
DagCombiner) but not shrink the load of a.f3 in goo,

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-03 Thread Xinliang David Li via cfe-commits
Store forwarding stall cost is usually much higher compared with a load
hitting L1 cache. For instance, on Haswell,  the latter is ~4 cycles, while
the store forwarding stalls cost about 10 cycles more than a successful
store forwarding, which is roughly 15 cycles. In some scenarios, the store
forwarding stalls can be as high as 50 cycles. See Agner's documentation.
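
For readers less familiar with the effect, a minimal C++ sketch of the
pattern that triggers the stall (illustrative only; the function names
are invented and the two accesses would normally sit in different
functions, as in the 1.cc testcase):

#include <cstdint>
#include <cstring>

// Narrow store immediately followed by a wider, overlapping load: the
// load needs bytes the store queue entry does not cover, so the core
// waits for the store to reach the cache -- a store forwarding stall.
uint64_t narrow_store_then_wide_load(unsigned char *p, uint8_t v) {
  p[1] = v;                  // 1-byte store into the middle of the word
  uint64_t w;
  std::memcpy(&w, p, 8);     // 8-byte load overlapping that store
  return w;
}

// Matching widths: the 8-byte load is fully covered by the 8-byte store,
// so forwarding succeeds and no stall occurs.
uint64_t wide_store_then_wide_load(unsigned char *p, uint64_t v) {
  std::memcpy(p, &v, 8);     // 8-byte store
  uint64_t w;
  std::memcpy(&w, p, 8);     // 8-byte load of the same bytes
  return w;
}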

In other words, the optimizer needs to be taught to avoid defeating  the HW
pipeline feature as much as possible.

David

On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel  wrote:

>
> On 09/03/2017 03:44 PM, Wei Mi wrote:
>
>> On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel  wrote:
>>
>>> On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
>>>
 On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li 
 wrote:

>
> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>  wrote:
>
>> chandlerc added a comment.
>>
>> I'm really not a fan of the degree of complexity and subtlety that
>> this
>> introduces into the frontend, all to allow particular backend
>> optimizations.
>>
>> I feel like this is Clang working around a fundamental deficiency in
>> LLVM
>> and we should instead find a way to fix this in LLVM itself.
>>
>> As has been pointed out before, user code can synthesize large
>> integers
>> that small bit sequences are extracted from, and Clang and LLVM should
>> handle those just as well as actual bitfields.
>>
>> Can we see how far we can push the LLVM side before we add complexity
>> to
>> Clang here? I understand that there remain challenges to LLVM's stuff,
>> but I
>> don't think those challenges make *all* of the LLVM improvements off
>> the
>> table, I don't think we've exhausted all ways of improving the LLVM
>> changes
>> being proposed, and I think we should still land all of those and
>> re-evaluate how important these issues are when all of that is in
>> place.
>>
>
> The main challenge of doing  this in LLVM is that inter-procedural
> analysis
> (and possibly cross module) is needed (for store forwarding issues).
>
> Wei, perhaps you can provide concrete test case to illustrate the issue
> so
> that reviewers have a good understanding.
>
> David
>
 Here is a runable testcase:
  1.cc 
 class A {
 public:
 unsigned long f1:2;
 unsigned long f2:6;
 unsigned long f3:8;
 unsigned long f4:4;
 };
 A a;
 unsigned long b;
 unsigned long N = 10;

 __attribute__((noinline))
 void foo() {
 a.f3 = 3;
 }

 __attribute__((noinline))
 void goo() {
 b = a.f3;
 }

 int main() {
 unsigned long i;
 for (i = 0; i < N; i++) {
   foo();
   goo();
 }
 }
 
 Now trunk takes about twice running time compared with trunk + this
 patch. That is because trunk shrinks the store of a.f3 in foo (Done by
 DagCombiner) but not shrink the load of a.f3 in goo, so store
 forwarding will be blocked.

>>>
>>> I can confirm that this is true on Haswell and also on an POWER8.
>>> Interestingly, on a POWER7, the performance is the same either way (on
>>> the
>>> good side). I ran the test case as presented and where I replaced f3
>>> with a
>>> non-bitfield unsigned char member. Thinking that the POWER7 result might
>>> be
>>> because it's big-Endian, I flipped the order of the fields, and found
>>> that
>>> the version where f3 is not a bitfield is faster than otherwise, but
>>> only by
>>> 12.5%.
>>>
>>> Why, in this case, don't we shrink the load? It seems like we should
>>> (and it
>>> seems like a straightforward case).
>>>
>>> Thanks again,
>>> Hal
>>>
>>> Hal, thanks for trying the test.
>>
>> Yes, it is straightforward to shrink the load in the test. I can
>> change the testcase a little to show why it is sometime difficult to
>> shrink the load:
>>
>> class A {
>> public:
>>unsigned long f1:16;
>>unsigned long f2:16;
>>unsigned long f3:16;
>>unsigned long f4:8;
>> };
>> A a;
>> bool b;
>> unsigned long N = 10;
>>
>> __attribute__((noinline))
>> void foo() {
>>a.f4 = 3;
>> }
>>
>> __attribute__((noinline))
>> void goo() {
>>b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
>> }
>>
>> int main() {
>>unsigned long i;
>>for (i = 0; i < N; i++) {
>>  foo();
>>  goo();
>>}
>> }
>>
>> For the load a.f4 in goo, it is diffcult to motivate its shrink after
>> instcombine because the comparison with a.f3 and the comparison with
>> a.f4 are merged:
>>
>> define void @_Z3goov() local_unnamed_addr #0 {
>>%1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
>>

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-03 Thread Hal Finkel via cfe-commits


On 09/03/2017 03:44 PM, Wei Mi wrote:

On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel  wrote:

On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:

On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li 
wrote:


On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
 wrote:

chandlerc added a comment.

I'm really not a fan of the degree of complexity and subtlety that this
introduces into the frontend, all to allow particular backend
optimizations.

I feel like this is Clang working around a fundamental deficiency in
LLVM
and we should instead find a way to fix this in LLVM itself.

As has been pointed out before, user code can synthesize large integers
that small bit sequences are extracted from, and Clang and LLVM should
handle those just as well as actual bitfields.

Can we see how far we can push the LLVM side before we add complexity to
Clang here? I understand that there remain challenges to LLVM's stuff,
but I
don't think those challenges make *all* of the LLVM improvements off the
table, I don't think we've exhausted all ways of improving the LLVM
changes
being proposed, and I think we should still land all of those and
re-evaluate how important these issues are when all of that is in place.


The main challenge of doing  this in LLVM is that inter-procedural
analysis
(and possibly cross module) is needed (for store forwarding issues).

Wei, perhaps you can provide concrete test case to illustrate the issue
so
that reviewers have a good understanding.

David

Here is a runable testcase:
 1.cc 
class A {
public:
unsigned long f1:2;
unsigned long f2:6;
unsigned long f3:8;
unsigned long f4:4;
};
A a;
unsigned long b;
unsigned long N = 10;

__attribute__((noinline))
void foo() {
a.f3 = 3;
}

__attribute__((noinline))
void goo() {
b = a.f3;
}

int main() {
unsigned long i;
for (i = 0; i < N; i++) {
  foo();
  goo();
}
}

Now trunk takes about twice the running time compared with trunk + this
patch. That is because trunk shrinks the store of a.f3 in foo (done by
the DAGCombiner) but does not shrink the load of a.f3 in goo, so store
forwarding will be blocked.


I can confirm that this is true on Haswell and also on a POWER8.
Interestingly, on a POWER7, the performance is the same either way (on the
good side). I ran the test case as presented and where I replaced f3 with a
non-bitfield unsigned char member. Thinking that the POWER7 result might be
because it's big-endian, I flipped the order of the fields, and found that
the version where f3 is not a bitfield is faster than otherwise, but only by
12.5%.

Why, in this case, don't we shrink the load? It seems like we should (and it
seems like a straightforward case).

Thanks again,
Hal


Hal, thanks for trying the test.

Yes, it is straightforward to shrink the load in the test. I can
change the testcase a little to show why it is sometimes difficult to
shrink the load:

class A {
public:
   unsigned long f1:16;
   unsigned long f2:16;
   unsigned long f3:16;
   unsigned long f4:8;
};
A a;
bool b;
unsigned long N = 10;

__attribute__((noinline))
void foo() {
   a.f4 = 3;
}

__attribute__((noinline))
void goo() {
   b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
}

int main() {
   unsigned long i;
   for (i = 0; i < N; i++) {
 foo();
 goo();
   }
}

For the load a.f4 in goo, it is difficult to motivate shrinking it after
instcombine, because the comparison with a.f3 and the comparison with
a.f4 are merged:

define void @_Z3goov() local_unnamed_addr #0 {
   %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
   %2 = and i64 %1, 0xff
   %3 = icmp eq i64 %2, 0x
   %4 = zext i1 %3 to i8
   store i8 %4, i8* @b, align 1, !tbaa !2
   ret void
}


Exactly. But this behavior is desirable, locally. There's no good answer 
here: We either generate extra load traffic here (because we need to 
load the fields separately), or we widen the store (which generates 
extra load traffic there). Do you know, in terms of performance, which 
is better in this case (i.e., is it better to widen the store or split 
the load)?


 -Hal



Thanks,
Wei.


The testcases shows the potential problem of store shrinking. Before
we decide to do store shrinking, we need to know all the related loads
will be shrunk,  and that requires IPA analysis. Otherwise, when load
shrinking was blocked for some difficult case (Like the instcombine
case described in
https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
performance regression will happen.

Wei.




Repository:
rL LLVM

https://reviews.llvm.org/D36562




___
llvm-commits mailing list
llvm-comm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits


--
Hal Finkel
Lead, Compiler Technology and 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-03 Thread Wei Mi via cfe-commits
On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel  wrote:
>
> On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
>>
>> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li 
>> wrote:
>>>
>>>
>>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>>>  wrote:

 chandlerc added a comment.

 I'm really not a fan of the degree of complexity and subtlety that this
 introduces into the frontend, all to allow particular backend
 optimizations.

 I feel like this is Clang working around a fundamental deficiency in
 LLVM
 and we should instead find a way to fix this in LLVM itself.

 As has been pointed out before, user code can synthesize large integers
 that small bit sequences are extracted from, and Clang and LLVM should
 handle those just as well as actual bitfields.

 Can we see how far we can push the LLVM side before we add complexity to
 Clang here? I understand that there remain challenges to LLVM's stuff,
 but I
 don't think those challenges make *all* of the LLVM improvements off the
 table, I don't think we've exhausted all ways of improving the LLVM
 changes
 being proposed, and I think we should still land all of those and
 re-evaluate how important these issues are when all of that is in place.
>>>
>>>
>>> The main challenge of doing  this in LLVM is that inter-procedural
>>> analysis
>>> (and possibly cross module) is needed (for store forwarding issues).
>>>
>>> Wei, perhaps you can provide concrete test case to illustrate the issue
>>> so
>>> that reviewers have a good understanding.
>>>
>>> David
>>
>> Here is a runable testcase:
>>  1.cc 
>> class A {
>> public:
>>unsigned long f1:2;
>>unsigned long f2:6;
>>unsigned long f3:8;
>>unsigned long f4:4;
>> };
>> A a;
>> unsigned long b;
>> unsigned long N = 10;
>>
>> __attribute__((noinline))
>> void foo() {
>>a.f3 = 3;
>> }
>>
>> __attribute__((noinline))
>> void goo() {
>>b = a.f3;
>> }
>>
>> int main() {
>>unsigned long i;
>>for (i = 0; i < N; i++) {
>>  foo();
>>  goo();
>>}
>> }
>> 
>> Now trunk takes about twice running time compared with trunk + this
>> patch. That is because trunk shrinks the store of a.f3 in foo (Done by
>> DagCombiner) but not shrink the load of a.f3 in goo, so store
>> forwarding will be blocked.
>
>
> I can confirm that this is true on Haswell and also on an POWER8.
> Interestingly, on a POWER7, the performance is the same either way (on the
> good side). I ran the test case as presented and where I replaced f3 with a
> non-bitfield unsigned char member. Thinking that the POWER7 result might be
> because it's big-Endian, I flipped the order of the fields, and found that
> the version where f3 is not a bitfield is faster than otherwise, but only by
> 12.5%.
>
> Why, in this case, don't we shrink the load? It seems like we should (and it
> seems like a straightforward case).
>
> Thanks again,
> Hal
>

Hal, thanks for trying the test.

Yes, it is straightforward to shrink the load in the test. I can
change the testcase a little to show why it is sometimes difficult to
shrink the load:

class A {
public:
  unsigned long f1:16;
  unsigned long f2:16;
  unsigned long f3:16;
  unsigned long f4:8;
};
A a;
bool b;
unsigned long N = 10;

__attribute__((noinline))
void foo() {
  a.f4 = 3;
}

__attribute__((noinline))
void goo() {
  b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
}

int main() {
  unsigned long i;
  for (i = 0; i < N; i++) {
foo();
goo();
  }
}

For the load a.f4 in goo, it is difficult to motivate shrinking it after
instcombine, because the comparison with a.f3 and the comparison with
a.f4 are merged:

define void @_Z3goov() local_unnamed_addr #0 {
  %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
  %2 = and i64 %1, 0xff
  %3 = icmp eq i64 %2, 0x
  %4 = zext i1 %3 to i8
  store i8 %4, i8* @b, align 1, !tbaa !2
  ret void
}
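
To make "shrinking the load" concrete, a C++ sketch of what goo would
have to look like to match a shrunken one-byte store to f4 (illustrative
only, assuming the usual little-endian layout where f3 occupies bytes 4-5
and f4 occupies byte 6 of the allocation unit):

#include <cstdint>
#include <cstring>

// Hypothetical shrunken form of goo(): two narrow loads that line up
// with a shrunken store to f4 elsewhere, instead of one i64 load of the
// whole allocation unit as in the IR above.
bool goo_shrunk(const unsigned char *a_bytes) {
  uint16_t f3;
  std::memcpy(&f3, a_bytes + 4, 2);   // bytes 4-5: f3
  uint8_t f4 = a_bytes[6];            // byte 6: f4
  return f4 == 0 && f3 == 0xffff;
}

Recovering this split form after instcombine has merged the two
comparisons requires knowing the original field boundaries, which is the
information this patch tries to preserve by keeping naturally aligned,
legal-integer-width bitfields such as f3 as separate locations.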

Thanks,
Wei.

>>
>> The testcases shows the potential problem of store shrinking. Before
>> we decide to do store shrinking, we need to know all the related loads
>> will be shrunk,  and that requires IPA analysis. Otherwise, when load
>> shrinking was blocked for some difficult case (Like the instcombine
>> case described in
>> https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
>> performance regression will happen.
>>
>> Wei.
>>
>>


 Repository:
rL LLVM

 https://reviews.llvm.org/D36562



>> ___
>> llvm-commits mailing list
>> llvm-comm...@lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-09-02 Thread Hal Finkel via cfe-commits


On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:

On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li  wrote:


On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
 wrote:

chandlerc added a comment.

I'm really not a fan of the degree of complexity and subtlety that this
introduces into the frontend, all to allow particular backend optimizations.

I feel like this is Clang working around a fundamental deficiency in LLVM
and we should instead find a way to fix this in LLVM itself.

As has been pointed out before, user code can synthesize large integers
that small bit sequences are extracted from, and Clang and LLVM should
handle those just as well as actual bitfields.

Can we see how far we can push the LLVM side before we add complexity to
Clang here? I understand that there remain challenges to LLVM's stuff, but I
don't think those challenges make *all* of the LLVM improvements off the
table, I don't think we've exhausted all ways of improving the LLVM changes
being proposed, and I think we should still land all of those and
re-evaluate how important these issues are when all of that is in place.


The main challenge of doing  this in LLVM is that inter-procedural analysis
(and possibly cross module) is needed (for store forwarding issues).

Wei, perhaps you can provide concrete test case to illustrate the issue so
that reviewers have a good understanding.

David

Here is a runable testcase:
 1.cc 
class A {
public:
   unsigned long f1:2;
   unsigned long f2:6;
   unsigned long f3:8;
   unsigned long f4:4;
};
A a;
unsigned long b;
unsigned long N = 10;

__attribute__((noinline))
void foo() {
   a.f3 = 3;
}

__attribute__((noinline))
void goo() {
   b = a.f3;
}

int main() {
   unsigned long i;
   for (i = 0; i < N; i++) {
 foo();
 goo();
   }
}

Now trunk takes about twice the running time compared with trunk + this
patch. That is because trunk shrinks the store of a.f3 in foo (done by
the DAGCombiner) but does not shrink the load of a.f3 in goo, so store
forwarding will be blocked.


I can confirm that this is true on Haswell and also on a POWER8.
Interestingly, on a POWER7, the performance is the same either way (on
the good side). I ran the test case as presented and where I replaced f3
with a non-bitfield unsigned char member. Thinking that the POWER7
result might be because it's big-endian, I flipped the order of the
fields, and found that the version where f3 is not a bitfield is faster
than otherwise, but only by 12.5%.


Why, in this case, don't we shrink the load? It seems like we should 
(and it seems like a straightforward case).


Thanks again,
Hal



The testcase shows the potential problem of store shrinking. Before
we decide to do store shrinking, we need to know that all the related
loads will be shrunk, and that requires IPA analysis. Otherwise, when
load shrinking is blocked for some difficult case (like the instcombine
case described in
https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
a performance regression will happen.

Wei.





Repository:
   rL LLVM

https://reviews.llvm.org/D36562




___
llvm-commits mailing list
llvm-comm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits


--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-22 Thread Wei Mi via cfe-commits
On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li  wrote:
>
>
> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>  wrote:
>>
>> chandlerc added a comment.
>>
>> I'm really not a fan of the degree of complexity and subtlety that this
>> introduces into the frontend, all to allow particular backend optimizations.
>>
>> I feel like this is Clang working around a fundamental deficiency in LLVM
>> and we should instead find a way to fix this in LLVM itself.
>>
>> As has been pointed out before, user code can synthesize large integers
>> that small bit sequences are extracted from, and Clang and LLVM should
>> handle those just as well as actual bitfields.
>>
>> Can we see how far we can push the LLVM side before we add complexity to
>> Clang here? I understand that there remain challenges to LLVM's stuff, but I
>> don't think those challenges make *all* of the LLVM improvements off the
>> table, I don't think we've exhausted all ways of improving the LLVM changes
>> being proposed, and I think we should still land all of those and
>> re-evaluate how important these issues are when all of that is in place.
>
>
> The main challenge of doing  this in LLVM is that inter-procedural analysis
> (and possibly cross module) is needed (for store forwarding issues).
>
> Wei, perhaps you can provide concrete test case to illustrate the issue so
> that reviewers have a good understanding.
>
> David

Here is a runable testcase:
 1.cc 
class A {
public:
  unsigned long f1:2;
  unsigned long f2:6;
  unsigned long f3:8;
  unsigned long f4:4;
};
A a;
unsigned long b;
unsigned long N = 10;

__attribute__((noinline))
void foo() {
  a.f3 = 3;
}

__attribute__((noinline))
void goo() {
  b = a.f3;
}

int main() {
  unsigned long i;
  for (i = 0; i < N; i++) {
foo();
goo();
  }
}

Now trunk takes about twice the running time compared with trunk + this
patch. That is because trunk shrinks the store of a.f3 in foo (done by
the DAGCombiner) but does not shrink the load of a.f3 in goo, so store
forwarding will be blocked.

The testcase shows the potential problem of store shrinking. Before
we decide to do store shrinking, we need to know that all the related
loads will be shrunk, and that requires IPA analysis. Otherwise, when
load shrinking is blocked for some difficult case (like the instcombine
case described in
https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
a performance regression will happen.

Wei.


>>
>>
>>
>> Repository:
>>   rL LLVM
>>
>> https://reviews.llvm.org/D36562
>>
>>
>>
>
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-22 Thread Xinliang David Li via cfe-commits
On Tue, Aug 22, 2017 at 7:46 PM, Chandler Carruth 
wrote:

>
>
> On Tue, Aug 22, 2017 at 7:18 PM Xinliang David Li via llvm-commits <
> llvm-comm...@lists.llvm.org> wrote:
>
>> On Tue, Aug 22, 2017 at 7:10 PM, Chandler Carruth via llvm-commits <
>> llvm-comm...@lists.llvm.org> wrote:
>>
>>> On Tue, Aug 22, 2017 at 7:03 PM Xinliang David Li via cfe-commits <
>>> cfe-commits@lists.llvm.org> wrote:
>>>
 On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator <
 revi...@reviews.llvm.org> wrote:

> chandlerc added a comment.
>
> I'm really not a fan of the degree of complexity and subtlety that
> this introduces into the frontend, all to allow particular backend
> optimizations.
>
> I feel like this is Clang working around a fundamental deficiency in
> LLVM and we should instead find a way to fix this in LLVM itself.
>
> As has been pointed out before, user code can synthesize large
> integers that small bit sequences are extracted from, and Clang and LLVM
> should handle those just as well as actual bitfields.
>
> Can we see how far we can push the LLVM side before we add complexity
> to Clang here? I understand that there remain challenges to LLVM's stuff,
> but I don't think those challenges make *all* of the LLVM improvements off
> the table, I don't think we've exhausted all ways of improving the LLVM
> changes being proposed, and I think we should still land all of those and
> re-evaluate how important these issues are when all of that is in place.
>

 The main challenge of doing  this in LLVM is that inter-procedural
 analysis (and possibly cross module) is needed (for store forwarding
 issues).

 Wei, perhaps you can provide concrete test case to illustrate the issue
 so that reviewers have a good understanding.

>>>
>>> It doesn't seem like all options for addressing that have been
>>> exhausted. And even then, I feel like trying to fix this with non-obvious
>>> (to the programmer) frontend heuristics isn't a good solution. I actually
>>> *prefer* the source work around of "don't use a bitfield if you *must* have
>>> narrow width access across modules where the optimizer cannot see enough to
>>> narrow them and you happen to know that there is a legal narrow access that
>>> works". Because that way the programmer has *control* over this rather than
>>> being at the whim of whichever side of the heuristic they end up on.
>>>
>>
>>
>> The source workaround solution *does not* scale. Most importantly, user
>> may not even be aware of the problem (and performance loss) unless
>>  compiling the code with another compiler and notice the performance
>> difference.
>>
>
> I don't really understand this argument...
>
> I don't think we can realistically chase all performance problems that
> users aren't aware of or can't measure. And our users *do* care and seem
> very good at benchmarking and noticing problems. =]
>
>
Here you assume that there is a single hotspot that the user can notice
easily, but this is usually not the case. At a very large scale, we have a
long tail of not-so-hot functions, and similar problems like this can add up.

Besides, if the problem were obvious for users to detect, why wasn't this
problem detected in LLVM earlier?   I admit that in this particular case we
actually relied on competitive analysis -- that is why I said depending on
users for this is not reliable.

For years I have told our (optimization) users not to hack source code or
do manual performance tuning (for instance, adding always_inline, adding
the inline keyword, adding manually inserted prefetches, manual loop
unrolling/peeling, etc.). Not only does this method not scale (it only
helps the user's own program), but more importantly it may get stale over
time -- for instance, due to hardware changes or shifting of hot spots.
It is not uncommon to see a user's manual optimization which used to work
later become a performance 'bug' without being noticed for years.   I am
totally fine with temporary source workarounds, but they should not be
relied upon longer term.



> I also think it is OK if in rare cases programmers have to adapt their
> source code to optimize well. We can and should make this as rare as
> reasonable from an engineering tradeoff perspective, but at a certain point
> we need to work with programmers to find a way to solve the problem. There
> exist coding patterns that Clang and LLVM will never be particularly
> effective at optimizing, and that seems OK. We just need to have reasonable
> engineering requirements for those cases:
>

As I said, for rare cases or as a temporary solution, it is fine to use a
source workaround, but we should continue to push for a compiler-based
solution, because the compiler is for everybody/every app.  If the compiler
really sucks at something, it should at least produce warnings about it
(with something like a performance sanitizer), but that should 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-22 Thread Chandler Carruth via cfe-commits
On Tue, Aug 22, 2017 at 7:18 PM Xinliang David Li via llvm-commits <
llvm-comm...@lists.llvm.org> wrote:

> On Tue, Aug 22, 2017 at 7:10 PM, Chandler Carruth via llvm-commits <
> llvm-comm...@lists.llvm.org> wrote:
>
>> On Tue, Aug 22, 2017 at 7:03 PM Xinliang David Li via cfe-commits <
>> cfe-commits@lists.llvm.org> wrote:
>>
>>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator <
>>> revi...@reviews.llvm.org> wrote:
>>>
 chandlerc added a comment.

 I'm really not a fan of the degree of complexity and subtlety that this
 introduces into the frontend, all to allow particular backend 
 optimizations.

 I feel like this is Clang working around a fundamental deficiency in
 LLVM and we should instead find a way to fix this in LLVM itself.

 As has been pointed out before, user code can synthesize large integers
 that small bit sequences are extracted from, and Clang and LLVM should
 handle those just as well as actual bitfields.

 Can we see how far we can push the LLVM side before we add complexity
 to Clang here? I understand that there remain challenges to LLVM's stuff,
 but I don't think those challenges make *all* of the LLVM improvements off
 the table, I don't think we've exhausted all ways of improving the LLVM
 changes being proposed, and I think we should still land all of those and
 re-evaluate how important these issues are when all of that is in place.

>>>
>>> The main challenge of doing  this in LLVM is that inter-procedural
>>> analysis (and possibly cross module) is needed (for store forwarding
>>> issues).
>>>
>>> Wei, perhaps you can provide concrete test case to illustrate the issue
>>> so that reviewers have a good understanding.
>>>
>>
>> It doesn't seem like all options for addressing that have been exhausted.
>> And even then, I feel like trying to fix this with non-obvious (to the
>> programmer) frontend heuristics isn't a good solution. I actually *prefer*
>> the source work around of "don't use a bitfield if you *must* have narrow
>> width access across modules where the optimizer cannot see enough to narrow
>> them and you happen to know that there is a legal narrow access that
>> works". Because that way the programmer has *control* over this rather than
>> being at the whim of whichever side of the heuristic they end up on.
>>
>
>
> The source workaround solution *does not* scale. Most importantly, user
> may not even be aware of the problem (and performance loss) unless
>  compiling the code with another compiler and notice the performance
> difference.
>

I don't really understand this argument...

I don't think we can realistically chase all performance problems that
users aren't aware of or can't measure. And our users *do* care and seem
very good at benchmarking and noticing problems. =]

I also think it is OK if in rare cases programmers have to adapt their
source code to optimize well. We can and should make this as rare as
reasonable from an engineering tradeoff perspective, but at a certain point
we need to work with programmers to find a way to solve the problem. There
exist coding patterns that Clang and LLVM will never be particularly
effective at optimizing, and that seems OK. We just need to have reasonable
engineering requirements for those cases:

1) They need to be quite rare. So far, this issue has come up fairly
infrequently. Not never, but there are many issues that seem to come up
much more often such as inlining issues.

2) We need an effective way to explain how programmers should write code to
optimize well, and this explanation needs to be as simple as possible.

3) We likely need to give programmers *choices* about which direction their
code gets optimized for. And this is (IMO) the biggest practical issue with
the approach in this change (beyond race detection and other semantic
issues).


I'm suggesting that we teach programmers to choose between a bitfield to
get combining-friendly access to tightly packed data, and integer types of
an appropriate size to get inexpensive and predictable loads and stores.
For sub-byte-width and non-byte-aligned bitfields this is already a
necessary property and so it seems like a consistent and easily explained
and taught model.
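
As a sketch of that source-level choice (the struct names are invented
here, not taken from the patch):

#include <cstdint>

// Idiom 1: a bitfield gives combining-friendly access to tightly packed
// data; adjacent fields may legitimately be read and written as one wide
// unit.
struct PackedCounters {
  unsigned long small:4;
  unsigned long medium:12;
  unsigned long large:16;
};

// Idiom 2: integer members of an appropriate size give inexpensive and
// predictable narrow loads and stores, at the cost of looser packing.
struct SplitCounters {
  uint8_t  small;    // only the low 4 bits are meaningful
  uint16_t medium;   // only the low 12 bits are meaningful
  uint16_t large;
};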

While there are times where there is a single "right" answer and we only
have to provide that, I don't think this is one of them. I've seen rare but
serious complaints about *both* failing to combine adjacent bit fields and
failing to narrow to small legal integer types. These seem in fundamental
tension, and I would prefer to establish idiomatic patterns for programmers
to request the behavior they need in their applications.


>
> David
>
>>
>>
>>>
>>> David
>>>


 Repository:
   rL LLVM

 https://reviews.llvm.org/D36562



 ___
>>> cfe-commits mailing list
>>> cfe-commits@lists.llvm.org
>>> 

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-22 Thread Hal Finkel via cfe-commits


On 08/22/2017 09:18 PM, Xinliang David Li via llvm-commits wrote:



On Tue, Aug 22, 2017 at 7:10 PM, Chandler Carruth via llvm-commits 
> wrote:


On Tue, Aug 22, 2017 at 7:03 PM Xinliang David Li via cfe-commits
>
wrote:

On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via
Phabricator > wrote:

chandlerc added a comment.

I'm really not a fan of the degree of complexity and
subtlety that this introduces into the frontend, all to
allow particular backend optimizations.

I feel like this is Clang working around a fundamental
deficiency in LLVM and we should instead find a way to fix
this in LLVM itself.

As has been pointed out before, user code can synthesize
large integers that small bit sequences are extracted
from, and Clang and LLVM should handle those just as well
as actual bitfields.

Can we see how far we can push the LLVM side before we add
complexity to Clang here? I understand that there remain
challenges to LLVM's stuff, but I don't think those
challenges make *all* of the LLVM improvements off the
table, I don't think we've exhausted all ways of improving
the LLVM changes being proposed, and I think we should
still land all of those and re-evaluate how important
these issues are when all of that is in place.


The main challenge of doing  this in LLVM is that
inter-procedural analysis (and possibly cross module) is
needed (for store forwarding issues).

Wei, perhaps you can provide concrete test case to illustrate
the issue so that reviewers have a good understanding.


It doesn't seem like all options for addressing that have been
exhausted. And even then, I feel like trying to fix this with
non-obvious (to the programmer) frontend heuristics isn't a good
solution. I actually *prefer* the source work around of "don't use
a bitfield if you *must* have narrow width access across modules
where the optimizer cannot see enough to narrow them and you
happen to know that there is a legal narrow access that works".
Because that way the programmer has *control* over this rather
than being at the whim of whichever side of the heuristic they end
up on.



The source workaround solution *does not* scale. Most importantly,
users may not even be aware of the problem (and performance loss)
unless they compile the code with another compiler and notice the
performance difference.


I agree with this, but it's not clear that this has to scale in that 
sense. I don't like basing this on the bitfield widths because it makes 
users pick between expressing semantic information and expressing target 
tuning information using the same construct. What if the optimal answer 
here is different on different platforms? I don't want to encourage 
users to ifdef their aggregates to sometimes be bitfields and sometimes 
not for tuning reasons. If need be, please add an attribute. Any 
heuristic that you pick here is going to help some cases and hurt 
others. If we're at the level of needing IPA to look at store-to-load 
forwarding effects, then we've really already lost. Either you need to 
actually do the IPA, or even in the backend, any heuristic that you 
choose will help some things and hurt others. Hopefully, we're not 
really there yet. I'm looking forward to seeing more examples of the 
kinds of problems you're trying to solve.


Thanks again,
Hal



David


David



Repository:
  rL LLVM

https://reviews.llvm.org/D36562




___
cfe-commits mailing list
cfe-commits@lists.llvm.org 
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits



___
llvm-commits mailing list
llvm-comm...@lists.llvm.org 
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits







--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

___
cfe-commits mailing list
cfe-commits@lists.llvm.org

Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-22 Thread Xinliang David Li via cfe-commits
On Tue, Aug 22, 2017 at 7:10 PM, Chandler Carruth via llvm-commits <
llvm-comm...@lists.llvm.org> wrote:

> On Tue, Aug 22, 2017 at 7:03 PM Xinliang David Li via cfe-commits <
> cfe-commits@lists.llvm.org> wrote:
>
>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator <
>> revi...@reviews.llvm.org> wrote:
>>
>>> chandlerc added a comment.
>>>
>>> I'm really not a fan of the degree of complexity and subtlety that this
>>> introduces into the frontend, all to allow particular backend optimizations.
>>>
>>> I feel like this is Clang working around a fundamental deficiency in
>>> LLVM and we should instead find a way to fix this in LLVM itself.
>>>
>>> As has been pointed out before, user code can synthesize large integers
>>> that small bit sequences are extracted from, and Clang and LLVM should
>>> handle those just as well as actual bitfields.
>>>
>>> Can we see how far we can push the LLVM side before we add complexity to
>>> Clang here? I understand that there remain challenges to LLVM's stuff, but
>>> I don't think those challenges make *all* of the LLVM improvements off the
>>> table, I don't think we've exhausted all ways of improving the LLVM changes
>>> being proposed, and I think we should still land all of those and
>>> re-evaluate how important these issues are when all of that is in place.
>>>
>>
>> The main challenge of doing  this in LLVM is that inter-procedural
>> analysis (and possibly cross module) is needed (for store forwarding
>> issues).
>>
>> Wei, perhaps you can provide concrete test case to illustrate the issue
>> so that reviewers have a good understanding.
>>
>
> It doesn't seem like all options for addressing that have been exhausted.
> And even then, I feel like trying to fix this with non-obvious (to the
> programmer) frontend heuristics isn't a good solution. I actually *prefer*
> the source work around of "don't use a bitfield if you *must* have narrow
> width access across modules where the optimizer cannot see enough to narrow
> them and you happen to know that there is a legal narrow access that
> works". Because that way the programmer has *control* over this rather than
> being at the whim of whichever side of the heuristic they end up on.
>


The source workaround solution *does not* scale. Most importantly, the user may
not even be aware of the problem (and the performance loss) unless they compile
the code with another compiler and notice the performance difference.

David

>
>
>>
>> David
>>
>>>
>>>
>>> Repository:
>>>   rL LLVM
>>>
>>> https://reviews.llvm.org/D36562
>>>
>>>
>>>
>>> ___
>> cfe-commits mailing list
>> cfe-commits@lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
>>
>
> ___
> llvm-commits mailing list
> llvm-comm...@lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>
>
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-22 Thread Chandler Carruth via cfe-commits
On Tue, Aug 22, 2017 at 7:03 PM Xinliang David Li via cfe-commits <
cfe-commits@lists.llvm.org> wrote:

> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator <
> revi...@reviews.llvm.org> wrote:
>
>> chandlerc added a comment.
>>
>> I'm really not a fan of the degree of complexity and subtlety that this
>> introduces into the frontend, all to allow particular backend optimizations.
>>
>> I feel like this is Clang working around a fundamental deficiency in LLVM
>> and we should instead find a way to fix this in LLVM itself.
>>
>> As has been pointed out before, user code can synthesize large integers
>> that small bit sequences are extracted from, and Clang and LLVM should
>> handle those just as well as actual bitfields.
>>
>> Can we see how far we can push the LLVM side before we add complexity to
>> Clang here? I understand that there remain challenges to LLVM's stuff, but
>> I don't think those challenges make *all* of the LLVM improvements off the
>> table, I don't think we've exhausted all ways of improving the LLVM changes
>> being proposed, and I think we should still land all of those and
>> re-evaluate how important these issues are when all of that is in place.
>>
>
> The main challenge of doing  this in LLVM is that inter-procedural
> analysis (and possibly cross module) is needed (for store forwarding
> issues).
>
> Wei, perhaps you can provide concrete test case to illustrate the issue so
> that reviewers have a good understanding.
>

It doesn't seem like all options for addressing that have been exhausted.
And even then, I feel like trying to fix this with non-obvious (to the
programmer) frontend heuristics isn't a good solution. I actually *prefer*
the source work around of "don't use a bitfield if you *must* have narrow
width access across modules where the optimizer cannot see enough to narrow
them and you happen to know that there is a legal narrow access that
works". Because that way the programmer has *control* over this rather than
being at the whim of whichever side of the heuristic they end up on.


>
> David
>
>>
>>
>> Repository:
>>   rL LLVM
>>
>> https://reviews.llvm.org/D36562
>>
>>
>>
>> ___
> cfe-commits mailing list
> cfe-commits@lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
>
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


Re: [PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-22 Thread Xinliang David Li via cfe-commits
On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator <
revi...@reviews.llvm.org> wrote:

> chandlerc added a comment.
>
> I'm really not a fan of the degree of complexity and subtlety that this
> introduces into the frontend, all to allow particular backend optimizations.
>
> I feel like this is Clang working around a fundamental deficiency in LLVM
> and we should instead find a way to fix this in LLVM itself.
>
> As has been pointed out before, user code can synthesize large integers
> that small bit sequences are extracted from, and Clang and LLVM should
> handle those just as well as actual bitfields.
>
> Can we see how far we can push the LLVM side before we add complexity to
> Clang here? I understand that there remain challenges to LLVM's stuff, but
> I don't think those challenges make *all* of the LLVM improvements off the
> table, I don't think we've exhausted all ways of improving the LLVM changes
> being proposed, and I think we should still land all of those and
> re-evaluate how important these issues are when all of that is in place.
>

The main challenge of doing this in LLVM is that inter-procedural analysis
(and possibly cross-module analysis) is needed (for store forwarding issues).

Wei, perhaps you can provide a concrete test case to illustrate the issue so
that reviewers have a good understanding.

David

>
>
> Repository:
>   rL LLVM
>
> https://reviews.llvm.org/D36562
>
>
>
>
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-22 Thread Chandler Carruth via Phabricator via cfe-commits
chandlerc added a comment.

I'm really not a fan of the degree of complexity and subtlety that this 
introduces into the frontend, all to allow particular backend optimizations.

I feel like this is Clang working around a fundamental deficiency in LLVM and 
we should instead find a way to fix this in LLVM itself.

As has been pointed out before, user code can synthesize large integers that 
small bit sequences are extracted from, and Clang and LLVM should handle those 
just as well as actual bitfields.

Can we see how far we can push the LLVM side before we add complexity to Clang 
here? I understand that there remain challenges to LLVM's stuff, but I don't 
think those challenges make *all* of the LLVM improvements off the table, I 
don't think we've exhausted all ways of improving the LLVM changes being 
proposed, and I think we should still land all of those and re-evaluate how 
important these issues are when all of that is in place.


Repository:
  rL LLVM

https://reviews.llvm.org/D36562



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-22 Thread Wei Mi via Phabricator via cfe-commits
wmi updated this revision to Diff 112271.
wmi added a comment.

Try another idea suggested by David.

All the bitfields in a single run are still wrapped inside a large integer
according to CGBitFieldInfo. For bitfields that have a legal integer width and
are naturally aligned, we change how their loads/stores are generated in LLVM
IR (in EmitLoadOfBitfieldLValue and EmitStoreThroughBitfieldLValue). All the
other bitfields are still accessed using a wide load/store plus masking
operations. Here is an example:

class A {

  unsigned long f1:2;
  unsigned long f2:6;
  unsigned long f3:8;
  unsigned long f4:4;

};
A a;

f1, f2, f3 and f4 will still be wrapped as a large integer. f1, f2 and f4 will
have the same access code as before. f3 will be accessed as if it were a
separate unsigned char variable.

In this way, we reduce the chance of blocking bitfield access combining. An
a.f1 access and an a.f4 access can be combined as long as no a.f3 access stands
between them. We will generate two fewer instructions for foo below, and one
more instruction for goo. So it is better, but not perfect.

void foo (unsigned long n1, unsigned long n3, unsigned long n4) {

  a.f1 = n1;
  a.f4 = n4;
  a.f3 = n3;

}

void goo (unsigned long n1, unsigned long n3, unsigned long n4) {

  a.f1 = n1;
  a.f3 = n3; // a.f3 will still block the combining of a.f1 and a.f4 because
             // a.f3 is accessed independently.
  a.f4 = n4;

}
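
To make the new access pattern concrete, here is a rough sketch (not part of
the patch; it assumes the little-endian x86-64 layout above, where f3 starts at
byte 1 of the 64-bit storage unit, and the helper name is made up): a store to
a.f3 now behaves as if it were a plain byte store, while f1, f2 and f4 keep the
wide load-mask-store sequence.

  // Conceptual equivalent of the separate access emitted for a.f3.
  void store_f3(A *p, unsigned long n3) {
    // f3 occupies bits 8..15 of the storage unit, i.e. byte 1 on a
    // little-endian target, so it can be written without touching f1/f2/f4.
    unsigned char *bytes = reinterpret_cast<unsigned char *>(p);
    bytes[1] = static_cast<unsigned char>(n3);
  }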


Repository:
  rL LLVM

https://reviews.llvm.org/D36562

Files:
  lib/CodeGen/CGExpr.cpp
  test/CodeGen/2009-12-07-BitFieldAlignment.c

Index: test/CodeGen/2009-12-07-BitFieldAlignment.c
===
--- test/CodeGen/2009-12-07-BitFieldAlignment.c
+++ test/CodeGen/2009-12-07-BitFieldAlignment.c
@@ -4,7 +4,7 @@
 struct S {
   int a, b;
   void *c;
-  unsigned d : 8;
+  unsigned d : 7;
   unsigned e : 8;
 };
 
Index: lib/CodeGen/CGExpr.cpp
===
--- lib/CodeGen/CGExpr.cpp
+++ lib/CodeGen/CGExpr.cpp
@@ -1653,30 +1653,72 @@
   return EmitLoadOfBitfieldLValue(LV, Loc);
 }
 
+/// Check whether the current field is better accessed separately. When the
+/// field has a legal integer width and its bit offset is naturally aligned,
+/// it is better to access the bitfield as if it were a separate integer var.
+static bool IsSeparatableBitfield(const CGBitFieldInfo &Info,
+                                  const llvm::DataLayout &DL,
+                                  ASTContext &Context) {
+  if (!DL.isLegalInteger(Info.Size))
+    return false;
+  // Make sure the field is naturally aligned.
+  if (Info.Offset %
+      (DL.getABIIntegerTypeAlignment(Info.Size) * Context.getCharWidth()) !=
+      0)
+    return false;
+  return true;
+}
+
 RValue CodeGenFunction::EmitLoadOfBitfieldLValue(LValue LV,
  SourceLocation Loc) {
   const CGBitFieldInfo &Info = LV.getBitFieldInfo();
 
   // Get the output type.
   llvm::Type *ResLTy = ConvertType(LV.getType());
 
   Address Ptr = LV.getBitFieldAddress();
-  llvm::Value *Val = Builder.CreateLoad(Ptr, LV.isVolatileQualified(), "bf.load");
-
-  if (Info.IsSigned) {
-    assert(static_cast<unsigned>(Info.Offset + Info.Size) <= Info.StorageSize);
-    unsigned HighBits = Info.StorageSize - Info.Offset - Info.Size;
-    if (HighBits)
-      Val = Builder.CreateShl(Val, HighBits, "bf.shl");
-    if (Info.Offset + HighBits)
-      Val = Builder.CreateAShr(Val, Info.Offset + HighBits, "bf.ashr");
+  llvm::Value *Val = nullptr;
+  if (IsSeparatableBitfield(Info, CGM.getDataLayout(), getContext())) {
+    // Ptr is the pointer to the Bitfield Storage. Add Offset to the Ptr
+    // if the Offset is not zero.
+    if (Info.Offset != 0) {
+      Address BC = Builder.CreateBitCast(Ptr, Int8PtrTy);
+      llvm::Value *GEP = Builder.CreateGEP(
+          BC.getPointer(),
+          llvm::ConstantInt::get(Int32Ty,
+                                 Info.Offset / getContext().getCharWidth()));
+      Ptr = Address(GEP, Ptr.getAlignment());
+    }
+    // Adjust the element type of Ptr if it has different size with Info.Size.
+    if (Ptr.getElementType()->getScalarSizeInBits() != Info.Size) {
+      llvm::Type *BFType = llvm::Type::getIntNTy(getLLVMContext(), Info.Size);
+      llvm::Type *BFTypePtr =
+          llvm::Type::getIntNPtrTy(getLLVMContext(), Info.Size);
+      unsigned Align = CGM.getDataLayout().getABITypeAlignment(BFType) *
+                       getContext().getCharWidth();
+      Ptr = Builder.CreateBitCast(
+          Address(Ptr.getPointer(), getContext().toCharUnitsFromBits(Align)),
+          BFTypePtr);
+    }
+    Val = Builder.CreateLoad(Ptr, LV.isVolatileQualified(), "bf.load");
   } else {
-    if (Info.Offset)
-      Val = Builder.CreateLShr(Val, Info.Offset, "bf.lshr");
-    if (static_cast<unsigned>(Info.Offset) + Info.Size < Info.StorageSize)
-      Val = Builder.CreateAnd(Val, llvm::APInt::getLowBitsSet(Info.StorageSize,
-   

[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-10 Thread Wei Mi via Phabricator via cfe-commits
wmi added a comment.

I limited the bitfield separation in the last update to only happen at the
beginning of a run, so no bitfield combining will be blocked.

I think I need to explain why I changed direction and started to work on the
problem in the frontend. Keeping the information by generating a wide type and
letting an LLVM pass do the narrowing originally looked like a good solution to
me. However, I saw in real cases that narrowing at a late LLVM stage has more
difficulties than I originally thought. Some of them are solved, but at the
cost of code complexity, and others are harder.

- Store forwarding issue: To extract legal integer width bitfields from the
  large integer generated by the frontend, we need to split both the stores and
  the loads related to those bitfields. If a store is narrowed and the related
  load is not, the store ends up narrower than the load, and the target may
  have difficulty forwarding the pending store to the load, which hurts
  performance (a sketch of this cross-function case follows at the end of this
  comment). Note, we found cases where the related load and store are in
  different functions, so when deciding whether to narrow a store, we may have
  no idea whether the related load will be narrowed. If we cannot know that all
  the related loads will be narrowed, it is better not to narrow the store.

- After instcombine, some bitfield access information will be lost. The case we
  saw is:

    unsigned long bf1 : 16;
    unsigned long bf2 : 16;
    unsigned long bf3 : 16;
    unsigned long bf4 : 8;

    bool cond = "bf3 == 0 && bf4 == -1";

  Before instcombine, bf3 and bf4 are extracted from the i64 separately. We can
  tell from the extraction pattern that bf3 is a 16-bit access and bf4 is an
  8-bit access. After instcombine, bf3 and bf4 are merged into a single 24-bit
  access, and the comparison above is changed to: extract the 24-bit data from
  the i64 (%bf.load = wide i64 load, %extract = and %bf.load, 0xffffff00000000)
  and compare %extract with 0xff000000000000. The information that there are
  two legal integer bitfield accesses is lost, and we won't do narrowing for
  the load here.

  Because we cannot split the load here, we trigger the store forwarding issue.

That is why I am exploring multiple directions for the bitfield access issue.
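
As a sketch of the cross-function store forwarding case mentioned above (the
struct, function names and layout are made up for illustration): if setter()
below has its bitfield store narrowed to 16 bits, but getter(), compiled
elsewhere, still performs the full 64-bit load of the run, the pending narrow
store cannot be forwarded to the wider load and the CPU stalls. Whether the
stall actually happens depends on which of the two accesses the optimizer
manages to narrow, which is exactly the information that is hard to get
without inter-procedural analysis.

  struct S {
    unsigned long bf1 : 16;
    unsigned long bf2 : 16;
    unsigned long bf3 : 16;
    unsigned long bf4 : 8;
  };

  // Possibly compiled in one translation unit: a candidate for an i16 store.
  void setter(S *s, unsigned v) { s->bf2 = v; }

  // Possibly compiled in another translation unit: a wide i64 load of the
  // whole run feeding two extracts.
  unsigned getter(const S *s) { return s->bf2 + s->bf3; }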


Repository:
  rL LLVM

https://reviews.llvm.org/D36562



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-10 Thread Wei Mi via Phabricator via cfe-commits
wmi updated this revision to Diff 110641.
wmi added a comment.

Don't separate a bitfield in the middle of a run, because that can hinder
combining of bitfield accesses. Only separate a bitfield at the beginning of a
run.


Repository:
  rL LLVM

https://reviews.llvm.org/D36562

Files:
  lib/CodeGen/CGRecordLayoutBuilder.cpp
  test/CodeGen/2009-12-07-BitFieldAlignment.c
  test/CodeGenCXX/2009-12-23-MissingSext.cpp
  test/CodeGenCXX/bitfield.cpp
  test/CodeGenCXX/pod-member-memcpys.cpp
  test/OpenMP/atomic_capture_codegen.cpp
  test/OpenMP/atomic_read_codegen.c
  test/OpenMP/atomic_update_codegen.cpp
  test/OpenMP/atomic_write_codegen.c

Index: test/OpenMP/atomic_write_codegen.c
===
--- test/OpenMP/atomic_write_codegen.c
+++ test/OpenMP/atomic_write_codegen.c
@@ -28,12 +28,14 @@
 int4 int4x;
 
 struct BitFields {
-  int : 32;
+  int : 2;
+  int : 30;
   int a : 31;
 } bfx;
 
 struct BitFields_packed {
-  int : 32;
+  int : 2;
+  int : 30;
   int a : 31;
 } __attribute__ ((__packed__)) bfx_packed;
 
@@ -58,13 +60,15 @@
 } __attribute__ ((__packed__)) bfx3_packed;
 
 struct BitFields4 {
-  short : 16;
+  short : 6;
+  short : 10;
   int a: 1;
   long b : 7;
 } bfx4;
 
 struct BitFields4_packed {
-  short : 16;
+  short : 6;
+  short : 10;
   int a: 1;
   long b : 7;
 } __attribute__ ((__packed__)) bfx4_packed;
Index: test/OpenMP/atomic_update_codegen.cpp
===
--- test/OpenMP/atomic_update_codegen.cpp
+++ test/OpenMP/atomic_update_codegen.cpp
@@ -27,12 +27,14 @@
 int4 int4x;
 
 struct BitFields {
-  int : 32;
+  int : 2;
+  int : 30;
   int a : 31;
 } bfx;
 
 struct BitFields_packed {
-  int : 32;
+  int : 2;
+  int : 30;
   int a : 31;
 } __attribute__ ((__packed__)) bfx_packed;
 
@@ -57,13 +59,15 @@
 } __attribute__ ((__packed__)) bfx3_packed;
 
 struct BitFields4 {
-  short : 16;
+  short : 6;
+  short : 10;
   int a: 1;
   long b : 7;
 } bfx4;
 
 struct BitFields4_packed {
-  short : 16;
+  short : 6;
+  short : 10;
   int a: 1;
   long b : 7;
 } __attribute__ ((__packed__)) bfx4_packed;
Index: test/OpenMP/atomic_read_codegen.c
===
--- test/OpenMP/atomic_read_codegen.c
+++ test/OpenMP/atomic_read_codegen.c
@@ -28,12 +28,14 @@
 int4 int4x;
 
 struct BitFields {
-  int : 32;
+  int : 2;
+  int : 30;
   int a : 31;
 } bfx;
 
 struct BitFields_packed {
-  int : 32;
+  int : 2;
+  int : 30;
   int a : 31;
 } __attribute__ ((__packed__)) bfx_packed;
 
@@ -58,13 +60,15 @@
 } __attribute__ ((__packed__)) bfx3_packed;
 
 struct BitFields4 {
-  short : 16;
+  short : 6;
+  short : 10;
   int a: 1;
   long b : 7;
 } bfx4;
 
 struct BitFields4_packed {
-  short : 16;
+  short : 6;
+  short : 10;
   int a: 1;
   long b : 7;
 } __attribute__ ((__packed__)) bfx4_packed;
Index: test/OpenMP/atomic_capture_codegen.cpp
===
--- test/OpenMP/atomic_capture_codegen.cpp
+++ test/OpenMP/atomic_capture_codegen.cpp
@@ -27,12 +27,14 @@
 int4 int4x;
 
 struct BitFields {
-  int : 32;
+  int : 2;
+  int : 30;
   int a : 31;
 } bfx;
 
 struct BitFields_packed {
-  int : 32;
+  int : 2;
+  int : 30;
   int a : 31;
 } __attribute__ ((__packed__)) bfx_packed;
 
@@ -57,13 +59,15 @@
 } __attribute__ ((__packed__)) bfx3_packed;
 
 struct BitFields4 {
-  short : 16;
+  short : 6;
+  short : 10;
   int a: 1;
   long b : 7;
 } bfx4;
 
 struct BitFields4_packed {
-  short : 16;
+  short : 6;
+  short : 10;
   int a: 1;
   long b : 7;
 } __attribute__ ((__packed__)) bfx4_packed;
Index: test/CodeGenCXX/pod-member-memcpys.cpp
===
--- test/CodeGenCXX/pod-member-memcpys.cpp
+++ test/CodeGenCXX/pod-member-memcpys.cpp
@@ -69,7 +69,7 @@
 
 struct BitfieldMember3 {
   virtual void f();
-  int   : 8;
+  int z : 8;
   int x : 1;
   int y;
 };
Index: test/CodeGenCXX/bitfield.cpp
===
--- test/CodeGenCXX/bitfield.cpp
+++ test/CodeGenCXX/bitfield.cpp
@@ -478,3 +478,42 @@
 s->b = x;
   }
 }
+
+namespace N8 {
+  // If a bitfield is at the beginning of a group of bitfields, has the width of a legal integer type,
+  // and is naturally aligned, the bitfield will be accessed like a separate memory location.
+  struct S {
+unsigned b1 : 16;
+unsigned b2 : 6;
+unsigned b3 : 2;
+unsigned b4 : 3;
+  };
+  unsigned read00(S* s) {
+// CHECK-X86-64-LABEL: define i32 @_ZN2N86read00EPNS_1SE
+// CHECK-X86-64:   %[[cast:.*]] = bitcast %{{.*}} to i16*
+// CHECK-X86-64:   %[[val:.*]]  = load i16, i16* %[[cast]]
+// CHECK-X86-64:   %[[ext:.*]]  = zext i16 %[[val]] to i32
+// CHECK-X86-64:  ret i32 %[[ext]]
+// CHECK-PPC64-LABEL: define zeroext i32 @_ZN2N86read00EPNS_1SE
+// CHECK-PPC64:   %[[val:.*]]   = load i32, i32* {{.*}}
+// 

[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

2017-08-09 Thread Wei Mi via Phabricator via cfe-commits
wmi added a comment.

In https://reviews.llvm.org/D36562#837594, @chandlerc wrote:

> This has been discussed before and I still pretty strongly disagree with it.
>
> This cripples the ability of TSan to find race conditions between accesses to 
> consecutive bitfields -- and these bugs have actually come up.


I guess you mean that accessing different bitfields in a consecutive run
simultaneously can cause a data race, because the order of bitfields in a
consecutive run is implementation defined. With the change, TSan may miss
reporting that, but such a data race could still be exposed by a different
compiler.

This can be solved by detecting TSan mode in the code: if TSan is enabled, we
can stop splitting the bitfields.
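
A minimal sketch of the kind of race involved (the struct and thread bodies
are made up for illustration):

  #include <thread>

  struct S {
    unsigned f1 : 8;
    unsigned f2 : 8; // f1 and f2 belong to the same memory location in the
                     // C++ memory model, so the two writes below race.
  };

  S s;

  int main() {
    std::thread t1([] { s.f1 = 1; }); // read-modify-write of the shared unit
    std::thread t2([] { s.f2 = 2; }); // read-modify-write of the same unit
    t1.join();
    t2.join();
  }

When both writes are emitted as load/mask/store of the shared 32-bit unit,
TSan sees conflicting accesses to the same bytes and reports the race. If the
fields are split into separate byte-sized accesses, the two threads touch
disjoint bytes and TSan stays silent, even though the program is still racy at
the language level.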

> We also have had cases in the past where LLVM missed significant bitfield 
> combines and optimizations due to loading them as separate integers. Those 
> would become problems again, and I think they would be even harder to solve 
> than narrowing the access is going to be because we will have strictly less 
> information to work with.

How about only separating legal-integer-width bitfields at the beginning of a
consecutive run? Then it won't hinder bitfield combining. This way, it can
still help in some cases, including the important case we saw.

> Ultimately, while I understand the appeal of this approach, I don't think it 
> is correct and I think we should instead figure out how to optimize these 
> memory accesses well in LLVM. That approach will have the added benefit of 
> optimizing cases where the user has manually used a large integer to simulate 
> bitfields, and making combining and canonicalization substantially easier.




Repository:
  rL LLVM

https://reviews.llvm.org/D36562



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits