I wrote a small test program that solves a non-linear equation using the 
RK4 solver implemented in deal.II and assembles the right-hand side with 
the matrix-free framework (code is attached). Afterwards I wanted to check 
the scaling behavior, since the program is meant to serve as the basis for 
a larger one. Therefore I ran several tests, both on the development 
machine (i7-6700) with 8 threads and on the high-performance machine 
(E5-2650 v4 
<https://ark.intel.com/content/www/us/en/ark/products/91767/intel-xeon-processor-e5-2650-v4-30m-cache-2-20-ghz.html>) 
with 24 threads. Both machines were configured to use the AVX extensions 
in deal.II, and the program itself was compiled in release mode.
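
A minimal sketch of the two ways the TBB thread count per rank can be 
limited (the attached code passes numbers::invalid_unsigned_int, i.e. all 
available threads, so the limit has to come from the environment; the 
hard-coded variant is shown for illustration only):

// Variant A: hard-code the per-rank thread limit (illustration only):
Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv,
                                                    /*max_num_threads=*/4);
// Variant B: keep invalid_unsigned_int in the code and set the limit
// via the environment when launching, e.g.
//   DEAL_II_NUM_THREADS=4 mpirun -np 2 ./program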

When running the program in both configurations, I compared the time it 
took to take the first step in time:

Local machine:
MPI ranks    TBB threads    Time (s)
1            8              170
2            4               40
4            2               20

HPC:
MPI ranks    TBB threads    Time (s)
1            24             840
2            12             887
4            6              424
8            3               41
12           2               28
24           1               14
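
The times are the wall time of the first call to evolve_one_time_step, 
measured essentially like this (a sketch; names as in the attached code):

Timer timer;
time = embedded_explicit_runge_kutta.evolve_one_time_step(
    [this](const double time, const vector_t &y) {
      return this->evaluate_diffusion(time, y);
    },
    time, time_step, locally_relevant_solution_MF);
pcout << "First step: " << timer.wall_time() << " s" << std::endl;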

I do not fully understand this behavior: Why is the code so much slower on 
the E5 than on the i7, except with 24 threads? Is it due to the lower 
clock frequency, or the older microarchitecture (Broadwell vs. Skylake)? 
Why does going from 1 MPI rank to 2 on the i7 reduce the runtime by a 
factor of four, while going from 2 to 4 ranks only gives the expected 
factor of two? Similarly for the E5: going from 1 rank to 2 does not speed 
up the code at all, going from 2 to 4 halves the execution time (as 
expected), but going from 4 to 8 gives a factor of ten. The steps after 
that follow the expected pattern again.

Are there any explanations for the observed behavior? And if not, what can 
I do to investigate further?
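
One idea I had is to time the operator application in isolation, outside 
the Runge-Kutta driver, to see whether the anomaly is in the matrix-free 
cell_loop itself or in the vector operations around it. A minimal sketch, 
using the classes from the attached code:

vector_t src, dst;
laplace_operator.initialize_vector(src);
laplace_operator.initialize_vector(dst);
src = 1.;

Timer timer;
for (unsigned int i = 0; i < 100; ++i)
  laplace_operator.apply(0., src, dst);
pcout << "100 operator applications: " << timer.wall_time() << " s"
      << std::endl;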

Thanks!

/* ---------------------------------------------------------------------
 *
 * Copyright (C) 2009 - 2019 by the deal.II authors
 *
 * This file is part of the deal.II library.
 *
 * The deal.II library is free software; you can use it, redistribute
 * it, and/or modify it under the terms of the GNU Lesser General
 * Public License as published by the Free Software Foundation; either
 * version 2.1 of the License, or (at your option) any later version.
 * The full text of the license can be found in the file LICENSE.md at
 * the top level directory of deal.II.
 *
 * ---------------------------------------------------------------------
 *
 * Author: Wolfgang Bangerth, Texas A&M University, 2009, 2010
 *         Timo Heister, University of Goettingen, 2009, 2010
 */
#include <deal.II/base/quadrature_lib.h>
#include <deal.II/base/function.h>
#include <deal.II/base/timer.h>
#include <deal.II/lac/generic_linear_algebra.h>

#include <deal.II/base/time_stepping.h>

#include <deal.II/base/utilities.h>
#include <deal.II/base/conditional_ostream.h>
#include <deal.II/base/index_set.h>
#include <deal.II/base/convergence_table.h>

#include <deal.II/lac/vector.h>
#include <deal.II/lac/full_matrix.h>
#include <deal.II/lac/solver_cg.h>
#include <deal.II/lac/affine_constraints.h>
#include <deal.II/lac/dynamic_sparsity_pattern.h>
#include <deal.II/lac/sparsity_tools.h>
#include <deal.II/lac/linear_operator.h>
#include <deal.II/lac/parpack_solver.h>
#include <deal.II/lac/solver_gmres.h>
#include <deal.II/lac/sparse_direct.h>
#include <deal.II/lac/la_parallel_vector.h>

#include <deal.II/grid/grid_generator.h>
#include <deal.II/grid/tria_accessor.h>
#include <deal.II/grid/tria_iterator.h>

#include <deal.II/dofs/dof_handler.h>
#include <deal.II/dofs/dof_accessor.h>
#include <deal.II/dofs/dof_tools.h>

#include <deal.II/fe/fe_values.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/fe/fe_system.h>

#include <deal.II/numerics/vector_tools.h>
#include <deal.II/numerics/data_out.h>
#include <deal.II/numerics/error_estimator.h>

#include <deal.II/matrix_free/fe_evaluation.h>
#include <deal.II/matrix_free/matrix_free.h>
#include <deal.II/matrix_free/operators.h>

#include <deal.II/differentiation/ad.h>

#include <deal.II/distributed/tria.h>
#include <deal.II/distributed/grid_refinement.h>

#include <fstream>
#include <iostream>

// Compile-time settings for the test case.
constexpr double time_step = 0.01;

constexpr unsigned int fe_degree            = 3;
constexpr unsigned int n_q_points_1d        = fe_degree + 2;
using Number                                = double;
constexpr unsigned int n_components         = 2;
constexpr unsigned int refinement_degree    = 6;

// use_scale_function selects how the inverse mass matrix is applied in
// LaplaceOperator::apply(): true scales the result vector in a separate
// sweep, false fuses zeroing and scaling into the cell_loop via pre/post
// lambdas. use_threads toggles the TBB-parallel matrix-free scheme.
constexpr bool use_scale_function = true;
constexpr bool use_threads = use_scale_function;

namespace Step40
{
	using namespace dealii;

	using vector_t = LinearAlgebra::distributed::Vector<Number>;

	// Debug helper: writes all global vector entries to a file. It accesses
	// every global index and is therefore only safe in a serial run.
	void print_vector(const vector_t &input_vector, const std::string &filename){
		std::ofstream data_out(filename);
		for(size_t i = 0; i < input_vector.size(); ++i)
			data_out << input_vector(i) << '\n';
		data_out.close();
	}

	template <int dim, int degree, int n_points_1d>
	class LaplaceOperator
	{
	public:
		static constexpr unsigned int n_quadrature_points_1d = n_points_1d;

		LaplaceOperator() : computing_times(6), local_time(0.) {}

		void reinit(const Mapping<dim> &   mapping,
					const DoFHandler<dim> &dof_handler,
					const AffineConstraints<double> &hanging_node_constraints);

		void print_computing_times() const;

		void apply(const double                                      current_time,
				   const vector_t &src,
				   vector_t &      dst) const;

		void project(const Function<dim> &                       function,
					 vector_t &solution) const;

		void
		initialize_vector(vector_t &vector) const;

	private:
		MatrixFree<dim, Number>     data;
		mutable std::vector<double> computing_times;
		mutable double local_time;

		LinearAlgebra::distributed::Vector<double> inv_mass_matrix;

		void local_apply_inverse_mass_matrix(
				const MatrixFree<dim, Number> &                   data,
				vector_t &      dst,
				const vector_t &src,
				const std::pair<unsigned int, unsigned int> &     cell_range) const;

		void local_apply_cell(
				const MatrixFree<dim, Number> &                   data,
				vector_t &      dst,
				const vector_t &src,
				const std::pair<unsigned int, unsigned int> &     cell_range) const;
	};

	template <int dim, int degree, int n_points_1d>
	void LaplaceOperator<dim, degree, n_points_1d>::reinit(
			const Mapping<dim> &mapping,
			const DoFHandler<dim> &dof_handler,
			const AffineConstraints<double> &hanging_node_constraints){
		std::vector<const DoFHandler<dim> *>           dof_handlers({&dof_handler});
		AffineConstraints<double>                      dummy(hanging_node_constraints);
		std::vector<const AffineConstraints<double> *> constraints({&hanging_node_constraints});
		std::vector<Quadrature<1>>                     quadratures(
		{QGauss<1>(n_q_points_1d),
		 QGauss<1>(fe_degree + 1)});
		typename MatrixFree<dim, Number>::AdditionalData additional_data;
		additional_data.mapping_update_flags =
				(update_gradients
				 | update_JxW_values
				 | update_quadrature_points
				 | update_values);

		if(use_threads)
			additional_data.tasks_parallel_scheme =
					MatrixFree<dim, Number>::AdditionalData::partition_partition;
		else
			additional_data.tasks_parallel_scheme =
					MatrixFree<dim, Number>::AdditionalData::none;

		data.reinit(mapping,
					dof_handlers,
					constraints,
					quadratures,
					additional_data);

		data.initialize_dof_vector(inv_mass_matrix);
		FEEvaluation<dim, degree, n_points_1d, n_components, Number> fe_eval(data);
		const unsigned int           n_q_points = fe_eval.n_q_points;
		for (unsigned int cell = 0; cell < data.n_cell_batches(); ++cell)
		{
			fe_eval.reinit(cell);
			for (unsigned int q = 0; q < n_q_points; ++q)
				fe_eval.submit_value(Tensor<1, n_components, VectorizedArray<Number>>({make_vectorized_array(1.),
																					   make_vectorized_array(1.)}), q);
			fe_eval.integrate(true, false);
			fe_eval.distribute_local_to_global(inv_mass_matrix);
		}
		inv_mass_matrix.compress(VectorOperation::add);
		for (unsigned int k = 0; k < inv_mass_matrix.local_size(); ++k)
			if (inv_mass_matrix.local_element(k) > 1e-15)
				inv_mass_matrix.local_element(k) =
						1. / inv_mass_matrix.local_element(k);
			else
				inv_mass_matrix.local_element(k) = 1;

		local_time = 0.;
	}

	template <int dim, int degree, int n_points_1d>
	void LaplaceOperator<dim, degree, n_points_1d>::initialize_vector(
			vector_t &vector) const
	{
		data.initialize_dof_vector(vector);
	}

	template <int dim, int degree, int n_points_1d>
	void LaplaceOperator<dim, degree, n_points_1d>::print_computing_times() const
	{
		ConditionalOStream        pcout(std::cout,
										Utilities::MPI::this_mpi_process(MPI_COMM_WORLD) ==
										0);
		Utilities::MPI::MinMaxAvg data;
		if (computing_times[2] + computing_times[3] > 0)
		{
			std::ios_base::fmtflags flags(std::cout.flags());
			std::cout << std::setprecision(3);
			pcout << "   Apply called "
				  << static_cast<std::size_t>(computing_times[3]) << " times"
																  << std::endl;
			pcout << "   RK update called "
				  << static_cast<std::size_t>(computing_times[2]) << " times"
																  << std::endl;
			pcout << "   Evaluate L_h:            ";
			data = Utilities::MPI::min_max_avg(computing_times[0], MPI_COMM_WORLD);
			pcout << std::scientific << std::setw(10) << data.min << " (p"
				  << std::setw(4) << data.min_index << ") " << std::setw(10)
				  << data.avg << std::setw(10) << data.max << " (p" << std::setw(4)
				  << data.max_index << ")" << std::endl;
			pcout << "   Inverse mass (+RK upd):  ";
			data = Utilities::MPI::min_max_avg(computing_times[1], MPI_COMM_WORLD);
			pcout << std::scientific << std::setw(10) << data.min << " (p"
				  << std::setw(4) << data.min_index << ") " << std::setw(10)
				  << data.avg << std::setw(10) << data.max << " (p" << std::setw(4)
				  << data.max_index << ")" << std::endl;
			pcout << "   Compute transport speed: ";
			data = Utilities::MPI::min_max_avg(computing_times[4], MPI_COMM_WORLD);
			pcout << std::scientific << std::setw(10) << data.min << " (p"
				  << std::setw(4) << data.min_index << ") " << std::setw(10)
				  << data.avg << std::setw(10) << data.max << " (p" << std::setw(4)
				  << data.max_index << ")" << std::endl;
			pcout << "   Compute errors/norms:    ";
			data = Utilities::MPI::min_max_avg(computing_times[5], MPI_COMM_WORLD);
			pcout << std::scientific << std::setw(10) << data.min << " (p"
				  << std::setw(4) << data.min_index << ") " << std::setw(10)
				  << data.avg << std::setw(10) << data.max << " (p" << std::setw(4)
				  << data.max_index << ")" << std::endl;
			std::cout.flags(flags);
		}
	}

	// Cell contribution of the right-hand side: evaluates the manufactured
	// forcing term at the (vectorized) quadrature points and the non-linear
	// term -u * grad(u), then integrates and scatters into dst.
	template <int dim, int degree, int n_points_1d>
	void LaplaceOperator<dim, degree, n_points_1d>::local_apply_cell(
			const MatrixFree<dim, Number> &                   data,
			vector_t &      dst,
			const vector_t &src,
			const std::pair<unsigned int, unsigned int> &     cell_range) const
	{
		FEEvaluation<dim, degree, n_points_1d, n_components, Number> phi(data);

		for (unsigned int cell = cell_range.first; cell < cell_range.second; ++cell)
		{
			phi.reinit(cell);
			phi.read_dof_values_plain(src);
			phi.evaluate(true, true);
			for (unsigned int q = 0; q < phi.n_q_points; ++q)
			{
				Tensor<1, n_components, VectorizedArray<Number>> phi_value = phi.get_value(q);
				Tensor<2, n_components, VectorizedArray<Number>> phi_gradient = phi.get_gradient(q);

				VectorizedArray<Number> local_value =
						make_vectorized_array<Number>(1.);
				for(unsigned int i = 0; i < VectorizedArray<Number>::n_array_elements; ++i){
					Point<dim> p;
					for(unsigned int d = 0; d < dim; ++d){
						p(d) = phi.quadrature_point(q)(d)[i];
					}
					local_value[i] = (-2 * M_PI * M_PI * exp(-2 * M_PI * M_PI * local_time) * sin(M_PI * p(0)) * sin(M_PI * p(1))
									  - exp(-4 * M_PI * M_PI * local_time) * (0.5 * M_PI * M_PI * (-cos(2 * M_PI * (p(0) - p(1)))
																								   - cos(2 * M_PI * (p(0) + p(1)))
																								   + cos(2 * M_PI * p(0))
																								   + cos(2 * M_PI * p(1))
																								   )
																			  )
									  );
				}
				Tensor<1, n_components, VectorizedArray<Number>> local_value_tensor({local_value,
																					 local_value});
				phi.submit_value(local_value_tensor, q);

				Tensor<2, n_components, VectorizedArray<Number>> gradient_value;
				for(size_t c = 0; c < n_components; ++c)
					for(size_t n = 0; n < n_components; ++n)
						gradient_value[c][n] = -1. * phi_value[n] * phi_gradient[c][n];
				phi.submit_gradient(gradient_value, q);
			}
			phi.integrate_scatter(true, true, dst);
		}
	}

	// Note: this worker ignores the given cell range and scales the entire
	// vector by the inverse mass matrix; it is currently unused.
	template <int dim, int degree, int n_points_1d>
	void LaplaceOperator<dim, degree, n_points_1d>::local_apply_inverse_mass_matrix(
			const MatrixFree<dim, Number> &                   data,
			vector_t &      dst,
			const vector_t &src,
			const std::pair<unsigned int, unsigned int> &     cell_range) const
	{
		(void) data;
		(void) cell_range;
		dst = src;
		dst.scale(inv_mass_matrix);
	}

	// Evaluate dst = M^{-1} L_h(t, src). Each Runge-Kutta stage calls this
	// once, so the per-step times accumulate here (computing_times[0]).
	template <int dim, int degree, int n_points_1d>
	void LaplaceOperator<dim, degree, n_points_1d>::apply(
			const double                                      current_time,
			const vector_t &src,
			vector_t &      dst) const
	{
		Timer timer;
		local_time = current_time;

		data.initialize_dof_vector(dst);

		// Variant 1: let cell_loop zero dst (last argument true), then
		// apply the inverse mass matrix in a separate sweep over dst.
		if(use_scale_function){
			data.cell_loop(&LaplaceOperator::local_apply_cell,
						   this,
						   dst,
						   src,
						   true);
			dst.scale(inv_mass_matrix);
		}
		// Variant 2: fuse the zeroing of dst and the inverse-mass-matrix
		// scaling into the cell_loop via pre/post range lambdas, so each
		// vector entry is touched while it is still in cache.
		else
			data.cell_loop(&LaplaceOperator::local_apply_cell,
						   this,
						   dst,
						   src,
						   [&](const unsigned int start_range, const unsigned int end_range){
							   DEAL_II_OPENMP_SIMD_PRAGMA
							   for(unsigned int i = start_range; i < end_range; ++i){
								   dst.local_element(i) = 0.;
							   }
						   },
						   [&](const unsigned int start_range, const unsigned int end_range){
							   DEAL_II_OPENMP_SIMD_PRAGMA
							   for(unsigned int i = start_range; i < end_range; ++i){
								   dst.local_element(i) *= inv_mass_matrix.local_element(i);
							   }
						   });
		computing_times[0] += timer.wall_time();
		computing_times[3] += 1.;
	}

	template <int dim>
	class InitialCondition : public Function<dim>
	{
	public:
		InitialCondition() : Function<dim>(n_components){

		}
		virtual double value(const Point<dim> &p, const unsigned int component) const override;
		virtual Tensor<1, dim> gradient(const Point<dim> &p, const unsigned int component) const override;
	};

	template <int dim>
	double InitialCondition<dim>::value(const Point<dim> &p, const unsigned int) const{
		const double x = p[0];
		const double y = p[1];

		return sin(M_PI * x) * sin(M_PI * y);
	}

	template <int dim>
	Tensor<1, dim> InitialCondition<dim>::gradient(const Point<dim> &p, const unsigned int) const{
		const double x = p[0];
		const double y = p[1];

		Tensor<1, dim> return_value;
		return_value[0] = M_PI * cos(M_PI * x) * sin(M_PI * y);
		return_value[1] = M_PI * sin(M_PI * x) * cos(M_PI * y);

		return return_value;
	}

	template <int dim>
	class Solution : public Function<dim>
	{
	public:
		Solution(const double time) : Function<dim>(n_components), time(time) {

		}
		virtual double value(const Point<dim> &p, const unsigned int component) const override;
		virtual Tensor<1, dim> gradient(const Point<dim> &p, const unsigned int component) const override;
	private:
		const double time;
	};

	template <int dim>
	double Solution<dim>::value(const Point<dim> &p, const unsigned int) const{
		const double x = p[0];
		const double y = p[1];

		return exp(-2 * M_PI * M_PI * time) * sin(M_PI * x) * sin(M_PI * y);
	}

	template <int dim>
	Tensor<1, dim> Solution<dim>::gradient(const Point<dim> &p, const unsigned int) const{
		const double x = p[0];
		const double y = p[1];

		Tensor<1, dim> return_value;
		return_value[0] = exp(-2 * M_PI * M_PI * time) * M_PI * cos(M_PI * x) * sin(M_PI * y);
		return_value[1] = exp(-2 * M_PI * M_PI * time) * M_PI * sin(M_PI * x) * cos(M_PI * y);

		return return_value;
	}

	template <int dim>
	class LaplaceProblem
	{
	public:
		LaplaceProblem();
		void run_with_MF();
	private:
		void setup_system();

		void output_results(const unsigned int cycle, const double time);
		void output_results(const vector_t & local_solution, const unsigned int cycle) const;

		vector_t evaluate_diffusion(const double          time,
									const vector_t &y) const;

		unsigned int
		embedded_explicit_method(const TimeStepping::runge_kutta_method method,
								 const unsigned int n_time_steps,
								 const double       initial_time,
								 const double       final_time);

		void process_solution(const unsigned int cycle);

		MPI_Comm mpi_communicator;
		parallel::distributed::Triangulation<dim> triangulation;
		FESystem<dim>       fe;
		MappingQGeneric<dim> mapping;
		DoFHandler<dim> dof_handler;
		AffineConstraints<double> constraints;
		LaplaceOperator<dim, fe_degree, n_q_points_1d> laplace_operator;
		vector_t locally_relevant_solution_MF;
		SparseMatrix<double> mass_matrix;
		SparseMatrix<double> system_matrix;
		SparsityPattern sparsity_pattern;

		ConditionalOStream pcout;
		TimerOutput        computing_timer, global_timer;

		ConvergenceTable convergence_table;
	};

	template <int dim>
	LaplaceProblem<dim>::LaplaceProblem()
		: mpi_communicator(MPI_COMM_WORLD)
		, triangulation(mpi_communicator,
						typename Triangulation<dim>::MeshSmoothing(
							Triangulation<dim>::smoothing_on_refinement |
							Triangulation<dim>::smoothing_on_coarsening))
		, fe(FE_Q<dim>(fe_degree), n_components)
		, mapping(fe.degree + 1)
		, dof_handler(triangulation)
		, pcout(std::cout,
				(Utilities::MPI::this_mpi_process(mpi_communicator) == 0))
		, computing_timer(mpi_communicator,
						  pcout,
						  TimerOutput::summary,
						  TimerOutput::wall_times)
		, global_timer (mpi_communicator,
						pcout,
						TimerOutput::summary,
						TimerOutput::wall_times)
	{}

	template <int dim>
	void LaplaceProblem<dim>::setup_system()
	{
		TimerOutput::Scope t(computing_timer, "setup");
		dof_handler.distribute_dofs(fe);

		IndexSet locally_relevant_dofs;
		DoFTools::extract_locally_relevant_dofs(dof_handler, locally_relevant_dofs);

		constraints.clear();
		constraints.reinit(locally_relevant_dofs);
		DoFTools::make_hanging_node_constraints(dof_handler, constraints);
		FEValuesExtractors::Scalar first_element(0),
				second_element(1);

		VectorTools::interpolate_boundary_values(dof_handler,
												 0,
												 Functions::ZeroFunction<dim>(n_components),
												 constraints,
												 fe.component_mask(first_element));

		VectorTools::interpolate_boundary_values(dof_handler,
												 0,
												 Functions::ZeroFunction<dim>(n_components),
												 constraints,
												 fe.component_mask(second_element));
		constraints.close();

		laplace_operator.reinit(mapping, dof_handler, constraints);
		laplace_operator.initialize_vector(locally_relevant_solution_MF);
	}

	template <int dim>
	vector_t LaplaceProblem<dim>::evaluate_diffusion(const double time,
													 const vector_t &y) const{
		vector_t tmp;
		laplace_operator.initialize_vector(tmp);
		laplace_operator.apply(time, y, tmp);
		return tmp;
	}

	template <int dim>
	unsigned int LaplaceProblem<dim>::embedded_explicit_method(const TimeStepping::runge_kutta_method method,
															   const unsigned int n_time_steps,
															   const double       initial_time,
															   const double       final_time){

		TimerOutput::Scope t(computing_timer, "Embedded method");
		double time_step =
				(final_time - initial_time) / static_cast<double>(n_time_steps);
		double       time          = initial_time;
		const double coarsen_param = 1.2;
		const double refine_param  = 0.8;
		const double min_delta     = 1e-8;
		const double max_delta     = 10 * time_step;
		const double refine_tol    = 1e-1;
		const double coarsen_tol   = 1e-5;

		TimeStepping::EmbeddedExplicitRungeKutta<vector_t>
				embedded_explicit_runge_kutta(method,
											  coarsen_param,
											  refine_param,
											  min_delta,
											  max_delta,
											  refine_tol,
											  coarsen_tol);

		unsigned int n_steps = 0;
		//		static int counter = 0;
		while (time < final_time)
		{
			if(n_steps == 0)
				output_results (locally_relevant_solution_MF, n_steps);
			if (time + time_step > final_time)
				time_step = final_time - time;
			time = embedded_explicit_runge_kutta.evolve_one_time_step(
						[this](const double time, const vector_t &y) {
				return this->evaluate_diffusion(time, y);
			},
			time,
			time_step,
			locally_relevant_solution_MF);
			constraints.distribute(locally_relevant_solution_MF);

			time_step = embedded_explicit_runge_kutta.get_status().delta_t_guess;
			++n_steps;

			//output_results (locally_relevant_solution_MF, n_steps);
		}
		//throw;
		//		counter++;
		return n_steps;
	}

	template <int dim>
	void LaplaceProblem<dim>::output_results(const unsigned int cycle, const double time)
	{
		TimerOutput::Scope t(computing_timer, "Output results");

		constraints.distribute(locally_relevant_solution_MF);

		DataOut<dim> data_out;

		vector_t exact_solution, local_solution;

		laplace_operator.initialize_vector (exact_solution);
		laplace_operator.initialize_vector (local_solution);

		VectorTools::interpolate(dof_handler,
								 Solution<dim>(time),
								 exact_solution);

		data_out.attach_dof_handler(dof_handler);

		locally_relevant_solution_MF.update_ghost_values();

		local_solution = locally_relevant_solution_MF;
		local_solution.update_ghost_values();
		exact_solution.update_ghost_values();

		std::vector<DataComponentInterpretation::DataComponentInterpretation> data_component_interpretation{DataComponentInterpretation::component_is_scalar, DataComponentInterpretation::component_is_scalar};

		data_out.add_data_vector(local_solution,
								 std::vector<std::string>{"first_MF_solution", "second_MF_solution"},
								 DataOut<dim>::type_dof_data,
								 data_component_interpretation);
		data_out.add_data_vector(exact_solution,
								 std::vector<std::string>{"first_exact_solution", "second_exact_solution"},
								 DataOut<dim>::type_dof_data,
								 data_component_interpretation);


		Vector<float> subdomain(triangulation.n_active_cells());
		for (unsigned int i = 0; i < subdomain.size(); ++i)
			subdomain(i) = triangulation.locally_owned_subdomain();
		data_out.add_data_vector(subdomain, "subdomain");

		data_out.build_patches();

		data_out.write_vtu_with_pvtu_record("",
											"solution",
											cycle,
											mpi_communicator);

		if(cycle > 0){
			convergence_table.set_precision("L2", 3);
			convergence_table.set_precision("H1", 3);
			convergence_table.set_precision("Linfty", 3);
			convergence_table.set_scientific("L2", true);
			convergence_table.set_scientific("H1", true);
			convergence_table.set_scientific("Linfty", true);

			convergence_table.add_column_to_supercolumn("cycle", "n cells");
			convergence_table.add_column_to_supercolumn("cells", "n cells");
			std::vector<std::string> new_order;
			new_order.emplace_back("n cells");
			new_order.emplace_back("H1");
			new_order.emplace_back("L2");
			convergence_table.set_column_order(new_order);
			convergence_table.evaluate_convergence_rates(
						"L2", ConvergenceTable::reduction_rate);
			convergence_table.evaluate_convergence_rates(
						"L2", ConvergenceTable::reduction_rate_log2);
			convergence_table.evaluate_convergence_rates(
						"H1", ConvergenceTable::reduction_rate);
			convergence_table.evaluate_convergence_rates(
						"H1", ConvergenceTable::reduction_rate_log2);
			pcout << std::endl;
			if(Utilities::MPI::this_mpi_process(MPI_COMM_WORLD) == 0)
				convergence_table.write_text(std::cout);
			convergence_table.clear();
		}
	}

	template <int dim>
	void LaplaceProblem<dim>::output_results(const vector_t & local_solution, const unsigned int cycle) const
	{
		//TimerOutput::Scope t(computing_timer, "Output results");
		DataOut<dim> data_out;

		data_out.attach_dof_handler(dof_handler);
		//constraints.distribute(local_solution);
		//local_solution.update_ghost_values();
		std::vector<DataComponentInterpretation::DataComponentInterpretation> data_component_interpretation{DataComponentInterpretation::component_is_scalar, DataComponentInterpretation::component_is_scalar};

		data_out.add_data_vector(local_solution,
								 std::vector<std::string>{"first_MF_solution", "second_MF_solution"},
								 DataOut<dim>::type_dof_data,
								 data_component_interpretation);

		Vector<float> subdomain(triangulation.n_active_cells());
		for (unsigned int i = 0; i < subdomain.size(); ++i)
			subdomain(i) = triangulation.locally_owned_subdomain();
		data_out.add_data_vector(subdomain, "subdomain");
		data_out.build_patches();

		data_out.write_vtu_with_pvtu_record("",
											"local_solution",
											cycle,
											mpi_communicator);
	}

	template <int dim>
	void LaplaceProblem<dim>::process_solution(const unsigned int cycle){
		Vector<float> difference_per_cell(triangulation.n_active_cells());

		vector_t local_solution;
		laplace_operator.initialize_vector (local_solution);
		constraints.distribute(locally_relevant_solution_MF);
		locally_relevant_solution_MF.update_ghost_values();

		local_solution = locally_relevant_solution_MF;

		VectorTools::integrate_difference(dof_handler,
										  local_solution,
										  Solution<dim>((cycle + 1) * time_step),
										  difference_per_cell,
										  QGauss<dim>(fe.degree + 1),
										  VectorTools::L2_norm);
		const double L2_error =
				VectorTools::compute_global_error(triangulation,
												  difference_per_cell,
												  VectorTools::L2_norm);

		VectorTools::integrate_difference(dof_handler,
										  local_solution,
										  Solution<dim>((cycle + 1) * time_step),
										  difference_per_cell,
										  QGauss<dim>(fe.degree + 1),
										  VectorTools::H1_seminorm);
		const double H1_error =
				VectorTools::compute_global_error(triangulation,
												  difference_per_cell,
												  VectorTools::H1_seminorm);

		const QTrapez<1>     q_trapez;
		const QIterated<dim> q_iterated(q_trapez, fe.degree * 2 + 1);
		VectorTools::integrate_difference(dof_handler,
										  local_solution,
										  Solution<dim>((cycle + 1) * time_step),
										  difference_per_cell,
										  q_iterated,
										  VectorTools::Linfty_norm);
		const double Linfty_error =
				VectorTools::compute_global_error(triangulation,
												  difference_per_cell,
												  VectorTools::Linfty_norm);

		const unsigned int n_active_cells = triangulation.n_active_cells();
		const unsigned int n_dofs         = dof_handler.n_dofs();
		//		pcout << "Cycle " << cycle << ':' << std::endl
		//			  << "   Number of active cells:       " << n_active_cells
		//			  << std::endl
		//			  << "   Number of degrees of freedom: " << n_dofs << std::endl;
		convergence_table.add_value("cycle", cycle);
		convergence_table.add_value("cells", n_active_cells);
		convergence_table.add_value("dofs", n_dofs);
		convergence_table.add_value("L2", L2_error);
		convergence_table.add_value("H1", H1_error);
		convergence_table.add_value("Linfty", Linfty_error);
	}

	template <int dim>
	void LaplaceProblem<dim>::run_with_MF(){
		//TimerOutput::Scope t(global_timer, "Run-function");
		{
			pcout << "Number of MPI ranks:            "
				  << Utilities::MPI::n_mpi_processes(MPI_COMM_WORLD) << std::endl;
			pcout << "Number of threads on each rank: "
				  << MultithreadInfo::n_threads() << std::endl;
			const unsigned int n_vect_doubles =
					VectorizedArray<double>::n_array_elements;
			const unsigned int n_vect_bits = 8 * sizeof(double) * n_vect_doubles;
			pcout << "Vectorization over " << n_vect_doubles
				  << " doubles = " << n_vect_bits << " bits ("
				  << Utilities::System::get_current_vectorization_level()
				  << "), VECTORIZATION_LEVEL=" << DEAL_II_COMPILER_VECTORIZATION_LEVEL
				  << std::endl
				  << std::endl;
		}

		const unsigned int n_cycles = 8;

		double local_time = 0., local_end_time = 0.;

		GridGenerator::hyper_cube(triangulation);
		triangulation.refine_global(refinement_degree);

		setup_system();

		for (unsigned int cycle = 0; cycle < n_cycles; ++cycle)
		{
			local_time = cycle * time_step;
			local_end_time = (cycle + 1) * time_step;
			pcout << "Cycle " << cycle << ':' << std::endl;

			//triangulation.refine_global(1);


			if(cycle == 0){

				vector_t local_solution;
				laplace_operator.initialize_vector(local_solution);

				VectorTools::project(dof_handler,
									 constraints,
									 QGauss<dim>(fe.degree + 1),
									 InitialCondition<dim>(),
									 local_solution);

				locally_relevant_solution_MF = local_solution;
			}

			if (Utilities::MPI::n_mpi_processes(mpi_communicator) <= 32 && cycle == 0)
			{
				output_results(cycle, local_time);
			}

			pcout << "   Number of active cells:       "
				  << triangulation.n_global_active_cells() << std::endl
				  << "   Number of degrees of freedom: " << dof_handler.n_dofs()
				  << std::endl;

			Timer        timer;
			unsigned int timestep_number = 1;

			process_solution(0);

			timestep_number = embedded_explicit_method(TimeStepping::DOPRI,
													   10,
													   local_time,
													   local_end_time);

			pcout << "Number of time steps: " << timestep_number << '\n';
			timestep_number = 1;
			process_solution(1);
			if (Utilities::MPI::n_mpi_processes(mpi_communicator) <= 32)
			{
				TimerOutput::Scope t(computing_timer, "output");
				output_results(cycle + 1, local_end_time);
			}
			computing_timer.print_summary();
			computing_timer.reset();
			pcout << std::endl;
		}
		//		global_timer.print_summary();
		//		global_timer.reset();
		local_time = 0.;
	}
} // namespace Step40
int main(int argc, char *argv[])
{
	try
	{
		using namespace dealii;
		using namespace Step40;
		// invalid_unsigned_int lets deal.II use all available threads per rank
		Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, dealii::numbers::invalid_unsigned_int);
		LaplaceProblem<2> laplace_problem_2d;
		//laplace_problem_2d.run();
		laplace_problem_2d.run_with_MF();
	}
	catch (std::exception &exc)
	{
		std::cerr << std::endl
				  << std::endl
				  << "----------------------------------------------------"
				  << std::endl;
		std::cerr << "Exception on processing: " << std::endl
				  << exc.what() << std::endl
				  << "Aborting!" << std::endl
				  << "----------------------------------------------------"
				  << std::endl;
		return 1;
	}
	catch (...)
	{
		std::cerr << std::endl
				  << std::endl
				  << "----------------------------------------------------"
				  << std::endl;
		std::cerr << "Unknown exception!" << std::endl
				  << "Aborting!" << std::endl
				  << "----------------------------------------------------"
				  << std::endl;
		return 1;
	}
	return 0;
}
