https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102435

            Bug ID: 102435
           Summary: gcc 9: aarch64 -ftree-loop-vectorize results in wrong
                    code
           Product: gcc
           Version: 9.4.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dimi...@unified-streaming.com
  Target Milestone: ---

We noticed a problem with a loop optimization enabled by -O3 on a program
targeting AArch64. It turns out that this problem is specifically caused by
-ftree-loop-vectorize, and has actually been fixed by (or as a side-effect of)
commit https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=c89366b12ff4f362
("[AArch64] Support vectorising with multiple vector sizes") by Richard
Sandiford.

However, this commit was made on master while master was gcc-10, so the
problem does not occur with gcc 10.x and 11.x, but it *does* occur with 9.x.
In our case, gcc 9 is the default compiler on Ubuntu 20.04 for arm64,
e.g. gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04).
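
A build can flag the affected configuration at compile time. The sketch below
is one way to do that, assuming every gcc 9.x release targeting AArch64 is
affected (this report only establishes it for the versions we tested). The
function names are hypothetical; __GNUC__, __clang__ and __aarch64__ are the
compilers' predefined macros.

```cpp
// Hypothetical helper: true for the configuration described in this report,
// i.e. gcc major version 9 targeting AArch64. Assumes all 9.x are affected.
constexpr bool is_suspect_compiler(int gcc_major, bool targets_aarch64)
{
  return gcc_major == 9 && targets_aarch64;
}

// Evaluates the predicate for the compiler building this translation unit.
// clang also defines __GNUC__, so it is excluded explicitly.
constexpr bool building_with_suspect_compiler()
{
#if defined(__GNUC__) && !defined(__clang__)
  constexpr int gcc_major = __GNUC__;
#else
  constexpr int gcc_major = 0;
#endif
#if defined(__aarch64__)
  constexpr bool targets_aarch64 = true;
#else
  constexpr bool targets_aarch64 = false;
#endif
  return is_suspect_compiler(gcc_major, targets_aarch64);
}
```

A static_assert(!building_with_suspect_compiler(), "...") in a translation
unit built with -ftree-loop-vectorize would then turn the silent miscompile
into a build error.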

Reduced test case:

// g++ -std=c++17 -O2 -ftree-loop-vectorize testcase.cpp
// or
// g++ -std=c++17 -O3 testcase.cpp

#include <cassert>
#include <cstdint>
#include <iostream>
#include <utility> // std::move
#include <vector>

struct sample_t
{
  sample_t(uint64_t dts, uint32_t duration)
  : dts_(dts)
  , duration_(duration)
  , cto_(0)
  , sample_description_index_(0)
  , pos_(0)
  , size_(0)
  , flags_(0)
  , aux_pos_(0)
  , aux_size_(0)
  {
  }

  uint64_t dts_;
  uint32_t duration_;
  int32_t cto_;
  uint32_t sample_description_index_;
  uint64_t pos_;
  uint32_t size_;
  uint32_t flags_;
  uint64_t aux_pos_;
  uint32_t aux_size_;
};

typedef std::vector<sample_t> samples_t;

__attribute__((__noinline__))
samples_t get_result(samples_t&& samples)
{
  uint64_t base_media_decode_time = ~0;

  auto first = samples.begin();
  auto last = samples.end();
  if(first != last)
  {
    base_media_decode_time = first->dts_;

    uint32_t duration = 0;
    for(--last; first != last; ++first)
    {
      duration = static_cast<uint32_t>(first[1].dts_ - first->dts_);

      first->duration_ = duration;
    }

    first->duration_ = duration;
  }

  return samples;
}

int main(void)
{
  samples_t samples_in = { {0, 3}, {3, 3}, {6, 3}, {9, 1}, {10, 2} };
  samples_t samples_out = get_result(std::move(samples_in));

  for(sample_t sample : samples_out)
  {
    std::cout << sample.dts_ << ", " << sample.duration_ << '\n';
  }

  // Expected output:
  // 0, 3
  // 3, 3
  // 6, 3
  // 9, 1
  // 10, 1
  //
  // Bad output:
  // 0, 3
  // 3, 0
  // 6, 0
  // 9, 0
  // 10, 0

  return 0;
}
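
For reference, the intended behaviour of the loop in get_result() can be
restated without iterators. This is a minimal sketch: the helper name
expected_durations is hypothetical and not part of the test case. It only
spells out what the loop is meant to compute, so that the expected output
listed in the test case can be checked on any compiler.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the original test case): returns the
// durations get_result() is expected to store, given the dts values in order.
std::vector<uint32_t> expected_durations(const std::vector<uint64_t>& dts)
{
  std::vector<uint32_t> durations(dts.size());
  if(dts.empty())
  {
    return durations;
  }

  uint32_t duration = 0;
  for(std::size_t i = 0; i + 1 < dts.size(); ++i)
  {
    // Same arithmetic as the loop body in get_result().
    duration = static_cast<uint32_t>(dts[i + 1] - dts[i]);
    durations[i] = duration;
  }

  // The last sample reuses the previous delta (0 for a single sample).
  durations.back() = duration;
  return durations;
}
```

For the input used in main(), the dts values 0, 3, 6, 9, 10 give the durations
3, 3, 3, 1, 1, which matches the expected output listed in the test case.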

Note that it appears vital that struct sample_t is fairly large: removing all
of the members after the first two makes the output correct, even with gcc 9
and -ftree-loop-vectorize. I have not determined precisely what the cutoff
size is.
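
As a starting point for narrowing down the cutoff, the following sketch prints
the size and some member offsets of the full struct next to a two-member
variant. The names small_sample_t and print_layout are hypothetical and used
for illustration only; the constructor is omitted because it does not affect
the layout. The actual numbers depend on the target ABI, so the code prints
them rather than hard-coding them.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

// Same members, in the same order, as sample_t in the test case.
struct sample_t
{
  uint64_t dts_;
  uint32_t duration_;
  int32_t cto_;
  uint32_t sample_description_index_;
  uint64_t pos_;
  uint32_t size_;
  uint32_t flags_;
  uint64_t aux_pos_;
  uint32_t aux_size_;
};

// Hypothetical reduced variant: only the first two members are kept.
struct small_sample_t
{
  uint64_t dts_;
  uint32_t duration_;
};

// Prints the element size (the stride between consecutive vector elements)
// and the offsets of the two members the loop reads and writes.
void print_layout()
{
  std::cout << "sizeof(sample_t)              = " << sizeof(sample_t) << '\n'
            << "offsetof(sample_t, dts_)      = " << offsetof(sample_t, dts_) << '\n'
            << "offsetof(sample_t, duration_) = " << offsetof(sample_t, duration_) << '\n'
            << "sizeof(small_sample_t)        = " << sizeof(small_sample_t) << '\n';
}
```

Adding members back to small_sample_t one at a time and rerunning the test
case with each variant would be one way to find the size at which the output
becomes wrong.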
