[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-11-23 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #27 from Jerry DeLisle  ---
This should be fixed now fairly well.  At this point there are not a lot of
major issues in our own libgfortran library. I will leave this bug report open
for a while if any issues arise.

Here is my latest profile. On my machine I am actuall getting slightly better
performance on trunk (8) vs. gfortran 6

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self  
 time   seconds   seconds name
 15.11  4.30 4.30 printf_positional
 14.65  8.47 4.17 __printf_fp_l
  7.13 10.50 2.03 __parse_one_specmb
  6.08 12.23 1.73 build_float_string
  4.16 13.42 1.19 formatted_transfer_scalar_write
  2.39 14.10 0.68 next_format0
  2.28 14.75 0.65 _IO_no_init
  2.23 15.38 0.64 __mpn_mul_1
  2.21 16.01 0.63 strtol_l_internal
  2.18 16.63 0.62 vfprintf
  1.86 17.16 0.53 __strchrnul_sse2
  1.79 17.67 0.51 _gfortrani_next_format
  1.77 18.18 0.51 _IO_str_init_static_internal
  1.76 18.68 0.50 __mpn_extract_double
  1.72 19.17 0.49 vsnprintf
  1.72 19.66 0.49 _IO_default_xsputn
  1.69 20.14 0.48 _int_malloc
  1.62 20.60 0.46 __memmove_sse2_unaligned_erms
  1.28 20.96 0.37 get_float_string
  1.12 21.28 0.32 _IO_padn
  1.02 21.57 0.29 __memset_sse2_unaligned
  1.02 21.86 0.29 __wmemset_sse2_unaligned
  0.91 22.12 0.26 flt128_ais

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-11-23 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #26 from Jerry DeLisle  ---
Author: jvdelisle
Date: Thu Nov 23 17:19:18 2017
New Revision: 255108

URL: https://gcc.gnu.org/viewcvs?rev=255108=gcc=rev
Log:
2017-11-23  Jerry DeLisle  

Backport from trunk
PR libgfortran/78549
* io/io.h (newunit_free): Add declaration. Clean some whitespace.
* io/transfer.c (st_read_done, st_write_done): Call newunit_free.
* io/unit.c (newunit_free): Change type from static void to void.

Modified:
branches/gcc-7-branch/libgfortran/ChangeLog
branches/gcc-7-branch/libgfortran/io/io.h
branches/gcc-7-branch/libgfortran/io/transfer.c
branches/gcc-7-branch/libgfortran/io/unit.c

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-11-20 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #25 from Jerry DeLisle  ---
Author: jvdelisle
Date: Tue Nov 21 02:17:11 2017
New Revision: 254982

URL: https://gcc.gnu.org/viewcvs?rev=254982=gcc=rev
Log:
2017-11-20  Jerry DeLisle  

PR libgfortran/78549
* io/io.h (newunit_free): Add declaration. Clean some whitespace.
* io/transfer.c (st_read_done, st_write_done): Call newunit_free.
* io/unit.c (newunit_free): Change type from static void to void.

Modified:
trunk/libgfortran/ChangeLog
trunk/libgfortran/io/io.h
trunk/libgfortran/io/transfer.c
trunk/libgfortran/io/unit.c

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-11-18 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #24 from Jerry DeLisle  ---
After a minor test tweak.

$ gfc -static pr78549.f 
$ time ./a.out 

real0m24.049s
user0m22.941s
sys 0m0.962s

$ gfc6 -static pr78549.f 
$ time ./a.out 

real0m22.838s
user0m22.718s
sys 0m0.005s

Without the tweak on trunk:

$ gfc -static pr78549.f 
$ time ./a.out 

real0m28.916s
user0m27.697s
sys 0m1.024s

With the minor tweak we are still a bit slower than 6, but then we added a few
features since then also, so not too bad.  My initial tweak is going to require
a little logic to be added.

Next report will be after I have that sorted out.

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-11-16 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #23 from Jerry DeLisle  ---
Well, instrumenting a little bit I see that delete_root is getting called many
many many times. So, every call to newunit_alloc is assigning a new unit number
which is getting added to the treap and never reused in anyway.

The result of this is that at the end of the program when any units left on the
treap are deleted, we loop through over one million deletes.

So I now have a better idea of what we need to do.

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-11-10 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

Thomas Koenig  changed:

   What|Removed |Added

 CC||manfred99 at gmx dot ch

--- Comment #22 from Thomas Koenig  ---
*** Bug 82938 has been marked as a duplicate of this bug. ***

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-10-07 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #21 from Thomas Koenig  ---
Jerry,

I think you know more about this than I do, so please go ahead.

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-10-07 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #20 from Jerry DeLisle  ---
(In reply to Thomas Koenig from comment #19)
> Created attachment 42320 [details]
> patch which failes with dtio_14 and dtio_15, among others
> 
> Well, this one doesn't work yet.

Do you want to continue with it or would you like me to pick up from where you
are?

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-10-07 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #19 from Thomas Koenig  ---
Created attachment 42320
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42320=edit
patch which failes with dtio_14 and dtio_15, among others

Well, this one doesn't work yet.

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-10-07 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #18 from Dominique d'Humieres  ---
I do not see any timing difference between 7.2.1 (r253502) and trunk.

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-10-07 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #17 from Thomas Koenig  ---
I mean apart from the fact that this creates a huge memory leak :-)

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-10-07 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #16 from Thomas Koenig  ---
Jerry,

what do you think of this approach? This creates a local copy
of the gfc_unit, without putting it into the tree.

Index: transfer.c
===
--- transfer.c  (Revision 253377)
+++ transfer.c  (Arbeitskopie)
@@ -4170,6 +4170,9 @@ st_write_done (st_parameter_dt *dtp)
}
   unlock_unit (dtp->u.p.current_unit);
 }
+  if (is_internal_unit (dtp))
+free (dtp->u.p.current_unit);
+
   library_end ();
 }

Index: unit.c
===
--- unit.c  (Revision 253377)
+++ unit.c  (Arbeitskopie)
@@ -225,7 +225,7 @@ insert (gfc_unit *new, gfc_unit *t)
 /* insert_unit()-- Create a new node, insert it into the treap.  */

 static gfc_unit *
-insert_unit (int n)
+insert_unit (int n, int internal)
 {
   gfc_unit *u = xcalloc (1, sizeof (gfc_unit));
   u->unit_number = n;
@@ -238,8 +238,11 @@ static gfc_unit *
   __GTHREAD_MUTEX_INIT_FUNCTION (>lock);
 #endif
   __gthread_mutex_lock (>lock);
-  u->priority = pseudo_random ();
-  unit_root = insert (u, unit_root);
+  if (!internal)
+{
+  u->priority = pseudo_random ();
+  unit_root = insert (u, unit_root);
+}
   return u;
 }

@@ -316,14 +319,21 @@ delete_unit (gfc_unit *old)

 /* get_gfc_unit()-- Given an integer, return a pointer to the unit
structure.  Returns NULL if the unit does not exist,
-   otherwise returns a locked unit. */
+   otherwise returns a locked unit.  If internal, do not put it
+   in the unit cache.  */

 static gfc_unit *
-get_gfc_unit (int n, int do_create)
+get_gfc_unit (int n, int do_create, int internal)
 {
   gfc_unit *p;
   int c, created = 0;

+  if (internal)
+{
+  p = insert_unit (n, 1);
+  return p;
+}
+  
   __gthread_mutex_lock (_lock);
 retry:
   for (c = 0; c < CACHE_SIZE; c++)
@@ -347,7 +357,7 @@ retry:

   if (p == NULL && do_create)
 {
-  p = insert_unit (n);
+  p = insert_unit (n, 0);
   created = 1;
 }

@@ -405,7 +415,7 @@ found:
 gfc_unit *
 find_unit (int n)
 {
-  return get_gfc_unit (n, 0);
+  return get_gfc_unit (n, 0, 0);
 }


@@ -412,7 +422,7 @@ find_unit (int n)
 gfc_unit *
 find_or_create_unit (int n)
 {
-  return get_gfc_unit (n, 1);
+  return get_gfc_unit (n, 1, 0);
 }


@@ -548,7 +558,7 @@ get_unit (st_parameter_dt *dtp, int do_create)

   dtp->u.p.unit_is_internal = 1;
   dtp->common.unit = newunit_alloc ();
-  unit = get_gfc_unit (dtp->common.unit, do_create);
+  unit = get_gfc_unit (dtp->common.unit, do_create, 1);
   set_internal_unit (dtp, unit, kind);
   fbuf_init (unit, 128);
   return unit;
@@ -563,9 +573,9 @@ get_unit (st_parameter_dt *dtp, int do_create)
  is not allowed, such units must be created with
  OPEN(NEWUNIT=...).  */
   if (dtp->common.unit < 0)
-return get_gfc_unit (dtp->common.unit, 0);
+return get_gfc_unit (dtp->common.unit, 0, 0);

-  return get_gfc_unit (dtp->common.unit, do_create);
+  return get_gfc_unit (dtp->common.unit, do_create, 0);
 }


@@ -592,7 +602,7 @@ init_units (void)

   if (options.stdin_unit >= 0)
 {  /* STDIN */
-  u = insert_unit (options.stdin_unit);
+  u = insert_unit (options.stdin_unit, 0);
   u->s = input_stream ();

   u->flags.action = ACTION_READ;
@@ -624,7 +634,7 @@ init_units (void)

   if (options.stdout_unit >= 0)
 {  /* STDOUT */
-  u = insert_unit (options.stdout_unit);
+  u = insert_unit (options.stdout_unit, 0);
   u->s = output_stream ();

   u->flags.action = ACTION_WRITE;
@@ -655,7 +665,7 @@ init_units (void)

   if (options.stderr_unit >= 0)
 {  /* STDERR */
-  u = insert_unit (options.stderr_unit);
+  u = insert_unit (options.stderr_unit, 0);
   u->s = error_stream ();

   u->flags.action = ACTION_WRITE;

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-10-06 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #15 from Jerry DeLisle  ---
For my own baseline:

gcc6: real  0m6.948s

gcc7: real  0m9.906s

gcc8: real  0m10.415s

I backported removal of the caching mentioned in comment #14 to gcc7. The two
should be identical except that gcc7 is a released branch. IIRC released
branches have debugging code removed. I do not know if this accounts for the
difference between gcc7 and gcc8.

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-10-06 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

--- Comment #14 from Jerry DeLisle  ---
The real issue is that to support DTIO with internal units I had to actually
use a gfc_unit structure. Before DTIO we never did this.  At the time of doing
DTIO I did not have a 'better idea' since by the nature of it, DTIO passes a
UNIT to the child procedures. 

I do have something percolating in my head on this that will be better. I have
not finished thinking it through yet. I have to do some factoring on the
structures.

(In reply to Dominique d'Humieres from comment #9)
> Gfortran is even slower now (~9.5s), likely revision r251374 (pr78387).

This is because the gfc_unit caching scheme I tried before was incompatble with
large numbers of threads. I decided that whole approach to the issue was a
rabbit down a rabbit hole. I removed it, and we are slower.

Working on it.

[Bug libfortran/78549] [8 Regression] Very slow formatted internal file output

2017-10-06 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549

Thomas Koenig  changed:

   What|Removed |Added

 Status|WAITING |NEW
   Target Milestone|--- |8.0
Summary|Very slow formatted |[8 Regression] Very slow
   |internal file output|formatted internal file
   ||output

--- Comment #13 from Thomas Koenig  ---
Here is the timing for gcc 6:

Each sample counts as 0.01 seconds.
  %   cumulative   self  self total   
 time   seconds   secondscalls  ms/call  ms/call  name
 32.15  0.99 0.99 output_float
 12.34  1.37 0.38 formatted_transfer
 11.04  1.71 0.34 next_format0
  7.14  1.93 0.22 write_float
  3.73  2.05 0.12 _gfortrani_write_block
  3.25  2.15 0.10 next_char
  2.92  2.24 0.09 data_transfer_init
  2.60  2.32 0.08 _gfortrani_next_format
  2.60  2.40 0.08 get_fnode
  2.60  2.48 0.08 write_integer
  2.27  2.55 0.07 _gfortrani_parse_format
  2.27  2.62 0.07 finalize_transfer
  2.27  2.69 0.07 parse_format_list
  1.95  2.75 0.06 _gfortrani_next_record
  1.14  2.78 0.04 _gfortran_transfer_array
  0.97  2.81 0.03 _gfortrani_mem_alloc_r4

So, this does appear to be a regression for gcc8 at least.

The time is also longer (~20 s for trunk, ~14 s for gcc 6).