Re: AutoFDO profile toolchain is open-sourced

2015-05-08 Thread Ilya Palachev

On 11.04.2015 01:49, Xinliang David Li wrote:

On Fri, Apr 10, 2015 at 3:43 PM, Jan Hubicka  wrote:

LBR is used for both cfg edge profiling and indirect call Target value
profiling.

I see, that makes sense ;)  I guess if we want to support profile collection
on targets w/o this feature we could still use one of the algorithms that
try to guess edge profile from BB profile.

Our experience with sampling cycles or retired instructions to guess
BB profile has not been great -- the profile quality is significantly
worse than LBR (which can almost match instrumentation based profile).
Suppose that I have no opportunity to collect a profile on an x86 
machine with LBR support, and the only available architecture is 
arm/aarch64 (the application code differs significantly between 
architectures because of manual optimizations and different function 
names and structure, so a profile collected on x86 cannot simply be 
reused).


Honza has mentioned that it is possible to guess an edge profile from a 
BB profile. Do you think this could help in the situation described 
above? Yes, it would be much worse than LBR, but could it still give 
any performance benefit compared with no edge profile at all?
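For concreteness, here is a minimal sketch of the kind of propagation
such an algorithm could do, assuming only basic-block counts are known
(all names below are invented for illustration; GCC's actual logic is
more involved and also uses static branch-prediction heuristics):

#include <cstddef>
#include <cstdint>
#include <vector>

// bb_count[b] is the sampled count of basic block b; edge counts start
// out unknown (-1).
struct Edge { int src, dst; std::int64_t count; };

// Flow conservation: in every block, the incoming edge counts and the
// outgoing edge counts must each sum to the block count. Whenever a
// block has exactly one unknown incident edge in one direction, that
// edge's count is forced. Iterate to a fixed point; edges that remain
// unknown would need heuristics to resolve.
void guess_edge_counts(const std::vector<std::int64_t> &bb_count,
                       std::vector<Edge> &edges) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (std::size_t b = 0; b < bb_count.size(); ++b) {
      for (int dir = 0; dir < 2; ++dir) {   // 0: out-edges, 1: in-edges
        std::int64_t known_sum = 0;
        Edge *unknown = nullptr;
        int n_unknown = 0;
        for (Edge &e : edges) {
          if ((dir ? e.dst : e.src) != (int)b) continue;
          if (e.count < 0) { unknown = &e; ++n_unknown; }
          else known_sum += e.count;
        }
        if (n_unknown == 1) {
          std::int64_t c = bb_count[b] - known_sum;
          unknown->count = c > 0 ? c : 0;   // clamp sampling noise
          changed = true;
        }
      }
    }
  }
}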


--
Ilya


Re: AutoFDO profile toolchain is open-sourced

2015-04-27 Thread Ilya Palachev

Hi,

On 21.04.2015 20:25, Dehao Chen wrote:

OTOH, the most important patch (insn-level discriminator support) is
not in yet. Cary has just retired. Do you know if anyone would be
interested in porting insn-level discriminator support to trunk?


Do you mean r210338, r210397, r210523 and r214745?
Can you explain why these patches are important for autofdo?
What work would be needed to port them to the current gcc-5 branch?
Do you expect them to be applied to the gcc-6 branch?
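For context, discriminators extend the debug line table so that
distinct basic blocks (and, with the insn-level patches, distinct
instructions) sharing a single source line can be told apart when
samples are mapped back to the source. A tiny illustration
(hypothetical code, not taken from the patches):

int g();
int h();
// One source line, two basic blocks: without discriminators, samples
// hitting either arm of the conditional map to the same line, and
// autofdo cannot tell how often each side was taken.
int f(int a) { return a ? g() : h(); }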

--
Ilya


Re: AutoFDO profile toolchain is open-sourced

2015-04-21 Thread Ilya Palachev

On 21.04.2015 14:57, Diego Novillo wrote:

From the autofdo page: https://github.com/google/autofdo

[ ... ]
Inputs:

--profile: PERF_PROFILE collected using linux perf (with last branch record).
In order to collect this profile, you will need an Intel CPU that has
last branch record (LBR) support. You also need to have your linux
kernel configured with LBR support. To profile:
# perf record -c PERIOD -e EVENT -b -o perf.data -- ./command
EVENT refers to BR_INST_RETIRED:TAKEN if available. For some
architectures, BR_INST_EXEC:TAKEN also works.
[ ... ]

The important one for autofdo is -b. It asks perf to use LBR registers
for branch tracking (assuming your architecture supports it).


Thanks! It worked. Big programs now produce correspondingly big gcov 
files. Sorry for the confusing message.
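
For the record, a sketch of the full working sequence (program and file
names here are invented; the exact event name depends on the CPU and
perf version):

   $ perf record -b -e branch-instructions -o perf.data -- ./myprog
   $ create_gcov --binary myprog --profile perf.data --gcov myprog.gcov
   $ g++ -O2 -fauto-profile=myprog.gcov -o myprog.opt myprog.cc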


But why doesn't create_gcov report that no branch events were found? It 
creates an empty gcov file and says nothing :(


Moreover, the README mentioned above says that perf should also be run 
with the option -e BR_INST_RETIRED:TAKEN.

I tried to add it, but perf reported:

   invalid or unsupported event: 'BR_INST_RETIRED:TAKEN'
   Run 'perf list' for a list of valid events

On my x86_64 machine, perf list contains:

   $ sudo perf list | grep -i br
     branch-instructions OR branches                  [Hardware event]
     branch-misses                                    [Hardware event]
     branch-loads                                     [Hardware cache event]
     branch-load-misses                               [Hardware cache event]
     branch-instructions OR cpu/branch-instructions/  [Kernel PMU event]
     branch-misses OR cpu/branch-misses/              [Kernel PMU event]
     mem:<addr>[:access]                              [Hardware breakpoint]
     syscalls:sys_enter_brk                           [Tracepoint event]
     syscalls:sys_exit_brk                            [Tracepoint event]

BR_INST_RETIRED:TAKEN is not there. Do you use some specific perf 
configuration to get it?


I then tried the option "-e branch-instructions". Before adding it, I 
got the following error:

   E0421 15:57:39.308374 11551 perf_parser.cc:210] Mapped 50% of
   samples, expected at least 95%

and with "-e branch-instructions" the error disappeared.

However, performance decreases after adding "-fauto-profile=file.gcov" 
or "-fprofile-use=file.gcov" to the compiler options: the program 
becomes 10% slower than before.
Can you explain that? Perhaps I should configure perf so that it can 
collect BR_INST_RETIRED:TAKEN events? How can that be done?
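
If the symbolic event name is unknown to a given perf build, a raw
event can sometimes be used instead. As a hedged example: on many Intel
microarchitectures BR_INST_RETIRED.NEAR_TAKEN is event 0xC4 with umask
0x20, so something like the following may work (verify against your
CPU's documentation before relying on it):

   $ perf record -b -e r20c4 -o perf.data -- ./command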


--
Best regards,
Ilya Palachev


Re: AutoFDO profile toolchain is open-sourced

2015-04-21 Thread Ilya Palachev

ping?

On 15.04.2015 10:41, Ilya Palachev wrote:

Hi,

One more question.

Does anybody know which options perf should be run with to collect 
data suitable for the autofdo converter?
I obtain identical output for different programs, and it appears to be 
effectively empty (1600 bytes).

The gcov files have the same md5sum for different programs:

   # Data for a simple program with 30 lines of code:
   $ md5sum ytest.gcov
   d85481c9154aa606ce4893b64fe109e7  ytest.gcov

   # Data for a program constructing a 3D Delaunay triangulation of
   # 100 points:
   $ md5sum experimentCGAL_convexHullDynamic.gcov
   d85481c9154aa606ce4893b64fe109e7  experimentCGAL_convexHullDynamic.gcov


We tried collecting perf data with the option --call-graph fp, but it 
does not help: the output gcov data is still the same.

Sometimes create_gcov reports the following error:

   E0421 13:10:37.125629  8732 perf_parser.cc:209] Mapped 50% of
   samples, expected at least 95%

But this does not mean that too few samples were collected in the 
profile, because 99% of samples are mapped in the case of a very 
simple program (with one function).

I have been trying to find a working case for more than a week but 
have not succeeded.

Can anybody show me at least one case where create_gcov works?
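
One way to check whether a perf.data file contains branch records at
all (rather than plain samples) is to dump the recorded event
attributes; this is only a suggestion, since the output format varies
across perf versions:

   $ perf evlist -i perf.data -v
   (look for branch_sample_type in the output; if it is missing, perf
   record was run without -b and there is no LBR data to convert)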

--
Best regards,
Ilya Palachev




Re: AutoFDO profile toolchain is open-sourced

2015-04-15 Thread Ilya Palachev

Hi,

One more question.

On 10.04.2015 23:39, Jan Hubicka wrote:

I must say I did not even try running AutoFDO myself (so I am happy to hear
it works).


I tried the create_gcov executable built from the AutoFDO repository 
on github.
The problem is that the data it generates is always 1600 bytes, 
regardless of the profile data given to it.

Steps to reproduce the issue:

1. Build AutoFDO under x86_64

2. Build, for example, the benchmark ytest.c (see attachment):

   g++ -O2 -o ytest ytest.c -g2

(I used a g++ freshly built from the gcc-5-branch of 
git://gcc.gnu.org/git/gcc.git)


3. Run it under perf to collect the profile data:

   sudo perf record ./ytest


Perf reports no error and prints:

   [ perf record: Woken up 1 times to write data ]
   [ perf record: Captured and wrote 0.125 MB perf.data (~5442 samples) ]


Perf generates perf.data.

4. Run create_gcov on the obtained data:

   create_gcov --binary ytest --profile perf.data --gcov ytest.gcov
   --debug_dump

It creates two files:
* ytest.gcov, which is 1600 bytes in size
* ytest.gcov.imports, which is empty

There is also no debug output from the program.
If I run create_llvm_prof on the same data:

   create_llvm_prof --binary ytest --profile perf.data --out ytest.out
   --debug_dump

it prints the following log:

   Length of symbol map: 1
   Number of functions:  0

and creates an empty file ytest.out.

This is not true: all functions in the benchmark are marked with 
__attribute__((noinline)), and readelf confirms that they stay in the 
binary:

   $ readelf -s ytest | grep px_cycle
    56: 00400640   111 FUNC    GLOBAL DEFAULT   12 _Z8px_cyclei
   $ readelf -s ytest | grep py_cycle
    60: 004006b0    36 FUNC    GLOBAL DEFAULT   12 _Z8py_cyclev

The size of the resulting gcov data is the same (1600 bytes) for 
different levels of debug information (-g0, -g1, -g2) and for 
different input source files.


What am I doing wrong?
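
(Note: as the 2015-04-21 messages earlier in this archive show, the
root cause is step 3: perf record was run without -b, so no LBR branch
records were collected and create_gcov had nothing to convert. A
recording step that produces usable data would look something like:

   sudo perf record -b -e branch-instructions -o perf.data -- ./ytest)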

--
Best regards,
Ilya Palachev

#define DX (480*4)
#define DY (640*4)

int* src = new int[DX*DY];
int* dst = new int[DX*DY];
int pxm = DX;
int pym = DY;

void px_cycle(int py) __attribute__((noinline));
void px_cycle(int py) {
    int *p1 = dst + (py*pxm);
    int *p2 = src + (pym - py - 1);
    for (int px = 0; px < pxm; px++) {
        if (px < pym && py < pxm) {
            *p1 = *p2;
        }
        p1++;
        p2 += pym;
    }
}

void py_cycle() __attribute__((noinline));
void py_cycle() {
    for (int py = 0; py < pym; py++) {
        px_cycle(py);
    }
}

int main() {
    int i;
    for (i = 0; i < 100; i++) {
        py_cycle();
    }
    return 0;
}


Re: AutoFDO profile toolchain is open-sourced

2015-04-07 Thread Ilya Palachev

Hi,

Here are some questions about AutoFDO.

On 08.05.2014 02:55, Dehao Chen wrote:

We have open-sourced AutoFDO profile toolchain in:

https://github.com/google/autofdo

For GCC developers, the most important tool is create_gcov, which
converts sampling based profile to GCC-readable profile. Please refer
to the readme file
(https://raw.githubusercontent.com/google/autofdo/master/README) for
more details.


The README file mentioned above says: "In order to collect this 
profile, you will need to have an Intel CPU that have last branch 
record (LBR) support." Is this information obsolete? Chrome Canary 
builds use AutoFDO for ARMv7l 
(https://code.google.com/p/chromium/issues/detail?id=434587).

What about AArch64? Is it supported?


To use the profile, one need to checkout
https://gcc.gnu.org/svn/gcc/branches/google/gcc-4_8. We are working on
porting AutoFDO to trunk
(http://gcc.gnu.org/ml/gcc-patches/2014-05/msg00438.html).


AutoFDO has now been merged into the gcc-5.0 (trunk) branch.
Is it possible to backport it to the 4.9 branch? Can you estimate the 
effort required for that?




We have limited doc inside the open-sourced package, and we are
planning to add more content to the wiki page
(https://github.com/google/autofdo/wiki). Feel free to send me emails
or discuss on github if you have any questions.

Cheers,
Dehao


--
Best regards,
Ilya


[RFC] cortex-a{53,57}-simd.md missing?

2015-02-19 Thread Ilya Palachev

Hi all.

This is a question related to the current development of the AArch64 backend.

In the latest trunk revision of GCC 5.0, the directory gcc/config/arm 
contains the following files:


cortex-a{8,9,15,17}.md
cortex-a{8,9,15,17}-neon.md

These files contain constructs of the form

(define_insn_reservation insn-name default_latency condition regexp)

for both scalar and vector (NEON) instructions.

But for AArch64 AdvSIMD instructions, cortex-a53.md contains only the 
following lines:

;; Crude Advanced SIMD approximation.

(define_insn_reservation "cortex_53_advsimd" 4
  (and (eq_attr "tune" "cortexa53")
       (eq_attr "is_neon_type" "yes"))
  "cortex_a53_simd0")

Does this mean that all AdvSIMD instructions on cortex-a53 are assumed 
to have a latency of 4?
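
For contrast, a per-type reservation in the style of cortex-a57.md
would look roughly like the following (the reservation name and latency
are invented for illustration, not taken from any real .md file):

;; Hypothetical per-type reservation in the cortex-a57.md style:
(define_insn_reservation "cortex_a53_neon_add_example" 3
  (and (eq_attr "tune" "cortexa53")
       (eq_attr "type" "neon_add,neon_add_q"))
  "cortex_a53_simd0")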


In cortex-a57.md the description of "neon" instructions is much more 
complete: it contains many reservations for different SIMD 
instructions. It appeared in trunk just a month ago.

Are there any plans to release detailed pipeline descriptions for the 
SIMD instructions of cortex-a53?

How much could this influence the performance of the generated code?

--
Best regards,
Ilya Palachev



Performance for AArch64 in ILP32 ABI

2014-10-31 Thread Ilya Palachev

Hi,

According to this mail thread 
https://gcc.gnu.org/ml/gcc-patches/2013-12/msg00282.html GCC has ILP32 
GNU/Linux support.


1. The question is: how reasonable would it be, from a performance 
point of view, to build the *whole* Linux distribution in ILP32 mode?


IIRC, gcc built for i686 can run faster than gcc built for x86_64 on 
the same hardware, because there are many data structures with 
pointer-typed fields, and with 32-bit pointers these structures take 
less memory. As a result, the smaller structures load from memory 
faster and cause fewer cache misses. Is the same true for the AArch64 
ILP32 ABI?
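
A small illustration of this point (the sizes below assume typical
alignment rules):

// Doubly-linked list node: two pointers plus an int key.
// LP64:  8 + 8 + 4 (+ 4 bytes padding) = 24 bytes per node.
// ILP32: 4 + 4 + 4                     = 12 bytes per node,
// so twice as many nodes fit into each 64-byte cache line.
struct Node {
    Node *next;
    Node *prev;
    int   key;
};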


A second idea is that if integers are 32 bits wide, twice as many of 
them can be kept in CPU registers as when they are 64 bits wide, so 
fewer loads/stores to memory are needed.


2. What's the current status of ILP32 support implementation in GCC?

3. Has anybody tried to benchmark ILP32 vs. LP64 builds of AArch64 
binaries? Is it possible to compare the performance of these modes?


Best regards,
Ilya Palachev


Re: What are open tasks about GIMPLE loop optimizations?

2014-08-18 Thread Ilya Palachev

Dear Evgeniya,

Maybe the missed optimizations in the vectorizer will be of interest 
to you:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947

It lists a lot of open tasks that can strongly influence performance, 
but many of them have remained unsolved for years.
The gcc vectorizer currently recognizes a certain set of patterns, but 
there are many that are implemented in icc or llvm and not in gcc.


Best regards.
Ilya



*From:* Evgeniya Maenkova 
*Sent:* Friday, August 15, 2014 4:45PM
*To:* gcc@gcc.gnu.org
*Subject:* What are open tasks about GIMPLE loop optimizations?

Dear GCC Developers,

Nobody has answered my question below, so perhaps something is wrong with my email :)

So let me clarify in more details what I’m asking about.

I’ve made some very, very basic evaluation of the GCC code ([1]) and
started to think about a concrete task for contributing to GCC
(language and machine optimization would be interesting to me, in
particular loop optimization).

I cannot come up with such a task myself because my knowledge of GCC
and of compilers in general is not sufficient. And even if I could
think of something, GCC developers probably have their own
understanding of what is needed.

I then looked at the GCC site for an answer. All I could find about
loop optimizations was the GNU Tools Cauldron 2012 talk “Status of
High level Loop Optimizations”, which is probably out of date in 2014.

Unfortunately, I do not have enough time, so I would not commit to a
task which is on the critical path. (Are you interested only in
full-time developers?)

So it would be great if you could advise some tasks which could be
useful to gcc at some point, but which nobody will miss if I cannot do
them (as you had no time/people for these tasks anyway :) ).

What do you think?

Thanks,

Evgeniya

[1] Used GDB to look inside GCC. Wrote some notes in my blog which
could be useful to other newbies
(http://perfstories.wordpress.com/2013/11/17/compiler-internals-introduction-to-a-new-post-series/).




-- Forwarded message --
From: Evgeniya Maenkova 
Date: Fri, Aug 8, 2014 at 6:50 PM
Subject: GIMPLE optimization passes: any Road Map?
To: gcc@gcc.gnu.org


Dear GCC Developers!

Could you please clarify the status of the GIMPLE loop passes?

Where can I find the latest changes to these passes? Are they on trunk
or on one of the branches? Is there a roadmap for GIMPLE loop
optimizations that I could look at?

Actually, I ask these questions because I would like to contribute to
GCC. GIMPLE optimizations would be interesting to me (in particular,
loop optimizations).

However, I’m a newbie at GCC and do not have much time, so I would not
commit to a task which is on the critical path.

So it would be great if you could advise some tasks which could be
useful to gcc at some point, but which nobody will miss if I can’t do
them (as you had no time/people for these tasks anyway :) ).

Thank you!

Evgeniya





Re: Comparison of GCC-4.9 and LLVM-3.4 performance on SPECInt2000 for x86-64 and ARM

2014-07-09 Thread Ilya Palachev

Dear all,

Do you have any results of GCC vs. LLVM performance comparisons across 
different versions (for the *ARM* architecture)?
Such comparisons are hard to find on the Web: Phoronix usually 
publishes comparisons for x86 and x86_64, and its last comparison for 
ARM was performed in 2012:


LLVM/Clang vs. GCC On The ARM Cortex-A15 Preview (1 December 2012):
http://www.phoronix.com/scan.php?page=article&item=llvm_gcc_a15&num=1

GCC vs. LLVM/Clang Compilers On ARMv7 Linux (9 May 2012):
http://www.phoronix.com/scan.php?page=news_item&px=MTA5OTM

Has anybody ever tried to track the dynamics of GCC and LLVM 
performance (i.e. two comparative graphs, version by version) on the 
ARM architecture?


Best regards,

Ilya Palachev

