The selection of each process actually stays the same size since the region_count is not changing.

Ok, let me understand this again:
Your dataset size is constant (no matter what process count you execute with), and processes are reading parts of the dataset. When you execute your program with, say, 16 processes, is the dataset divided (more or less) equally among the 16 processes? When you increase the process count to 36, is it divided equally among the 36 processes, so that the amount of data each process reads decreases as you scale, since the file size stays the same? If not, then you are reading parts of the dataset multiple times as you scale, and the performance degradation is expected; it is like comparing, in the serial case, the performance of 1 read operation to that of n read operations.
If yes, then move on to the second part.
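
Just to make the expected setup concrete, here is a minimal sketch (illustrative only, not taken from your code) of one way to derive a disjoint region_index from the MPI rank, so that each process reads a different block and the whole dataset is read exactly once:

    #include <mpi.h>

    /* Illustrative sketch only: one possible disjoint decomposition,
       assuming exactly region_count[0]*region_count[1]*region_count[2]
       processes (here 16*16*16 = 4096), so each rank reads a different
       block and the dataset as a whole is read exactly once. */
    void rank_to_region_index(const int region_count[3], int region_index[3])
    {
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      region_index[0] =  rank / (region_count[1] * region_count[2]);
      region_index[1] = (rank / region_count[2]) % region_count[1];
      region_index[2] =  rank % region_count[2];
    }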


The result of running "lfs getstripe filename | grep stripe" is:

    lmm_stripe_count:   4
    lmm_stripe_size:    1048576
    lmm_stripe_offset:  286


The stripe count is way too small for a ~1 TB file. Your system administrator should have guidelines on what the stripe count and size should be for certain file sizes; I would check that and readjust those parameters accordingly.
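
If the file does get regenerated with different striping, one way to request it from the program side is through MPI-Info hints at file creation time. A minimal sketch, assuming ROMIO honors the reserved "striping_factor" and "striping_unit" hints on your Lustre setup (the values below are placeholders, not recommendations, and they have no effect on an already-existing file, which would have to be rewritten into a directory striped with lfs setstripe):

    #include <hdf5.h>
    #include <mpi.h>

    /* Illustrative only: ask the MPI-IO layer for a wider stripe when the
       file is created. "striping_factor" (stripe count) and "striping_unit"
       (stripe size in bytes) are reserved MPI-IO hint keys; the values here
       are placeholders. */
    hid_t create_striped_file(const char* filename)
    {
      MPI_Info info;
      MPI_Info_create(&info);
      MPI_Info_set(info, "striping_factor", "64");      /* stripe count */
      MPI_Info_set(info, "striping_unit", "4194304");   /* 4 MB stripes */

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
      hid_t file_id = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      H5Pclose(fapl);
      MPI_Info_free(&info);
      return file_id;
    }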

Thanks,
Mohamad



Let me check on the second question and confirm.

On Wed, May 30, 2012 at 11:01 AM, Mohamad Chaarawi [via hdf-forum] <[hidden email]> wrote:

    Hi Yucong,

    On 5/30/2012 12:33 PM, Yucong Ye wrote:

    The region_index changes according to the MPI rank while the
    region_count stays the same, which is 16,16,16.


    Ok, I just needed to make sure that the selections are set up in a
    way that is compatible with the scaling being done (as the number
    of processes increases, the selection of each process decreases
    accordingly). The performance numbers you provided are indeed
    troubling, but that could be due to several reasons, some being:

      * The stripe size & count of your file on Lustre could be too
        small. Although this is a read operation (no file locking is
        done by the OSTs), increasing the number of I/O processes puts
        too much burden on the OSTs. Could you check those 2
        parameters of your file? You can do that by running this on
        the command line:
          o lfs getstripe filename | grep stripe
      * The MPI-I/O implementation is not doing aggregation. If you
        are using ROMIO, two-phase I/O should do this for you; by
        default the number of aggregators is set to the number of
        nodes (not processes). I would also try increasing the
        cb_buffer_size (the default is 4 MB). A sketch of how to pass
        these hints is shown after this list.
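
    Here is a minimal sketch of how such ROMIO hints could be passed
    through the file access property list that the reading program
    already creates (the hint values below are placeholders to
    experiment with, not tuned recommendations):

        #include <hdf5.h>
        #include <mpi.h>

        /* Illustrative only: pass ROMIO collective-buffering hints when
           opening the file for reading. The values are placeholders. */
        hid_t open_with_romio_hints(const char* filename)
        {
          MPI_Info info;
          MPI_Info_create(&info);
          MPI_Info_set(info, "romio_cb_read", "enable");    /* force two-phase reads */
          MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB collective buffer */
          MPI_Info_set(info, "cb_nodes", "64");             /* number of aggregators */

          hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
          H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
          hid_t file_id = H5Fopen(filename, H5F_ACC_RDONLY, fapl);

          H5Pclose(fapl);
          MPI_Info_free(&info);
          return file_id;
        }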

    Thanks,
    Mohamad

    On May 30, 2012 8:19 AM, "Mohamad Chaarawi" <[hidden email]> wrote:

        Hi Chrisyeshi,

        Is the region_index & region_count the same on all processes?
        i.e. Are you just reading the same data on all processes?

        Mohamad

        On 5/29/2012 3:02 PM, chrisyeshi wrote:

            Hi,

            I am having trouble reading from a 721 GB file using 4096
            nodes.
            When I test with a few nodes it works, but when I test
            with more nodes it takes significantly more time.
            All the test program does is read in the data and then
            delete it.
            Here's the timing information:

            Nodes    |    Time For Running Entire Program
            16              4:28
            32              6:55
            64              8:56
            128            11:22
            256            13:25
            512            15:34

            768            28:34
            800            29:04

            I am running the program on a Cray XK6 system, and the
            file system is Lustre.

            *There is a big gap after 512 nodes, and with 4096 nodes
            it couldn't finish in 6 hours.
            Is this normal? Shouldn't it be a lot faster?*

            Here is my reading function; it's similar to the sample
            HDF5 parallel program:

            #include <hdf5.h>
            #include <mpi.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <assert.h>

            void readData(const char* filename, int region_index[3],
                          int region_count[3], float* flow_field[6])
            {
              char attributes[6][50];
              sprintf(attributes[0], "/uvel");
              sprintf(attributes[1], "/vvel");
              sprintf(attributes[2], "/wvel");
              sprintf(attributes[3], "/pressure");
              sprintf(attributes[4], "/temp");
              sprintf(attributes[5], "/OH");

              herr_t status;
              hid_t file_id;
              hid_t dset_id;

              // open the file collectively with the MPI-IO driver
              hid_t acc_tpl = H5Pcreate(H5P_FILE_ACCESS);
              status = H5Pset_fapl_mpio(acc_tpl, MPI_COMM_WORLD, MPI_INFO_NULL);
              file_id = H5Fopen(filename, H5F_ACC_RDONLY, acc_tpl);
              status = H5Pclose(acc_tpl);

              for (int i = 0; i < 6; ++i)
              {
                // open dataset
                dset_id = H5Dopen(file_id, attributes[i], H5P_DEFAULT);

                // get the dataset extent and the size of this process's region
                hid_t spac_id = H5Dget_space(dset_id);
                hsize_t htotal_size3[3];
                status = H5Sget_simple_extent_dims(spac_id, htotal_size3, NULL);
                hsize_t region_size3[3] = {htotal_size3[0] / region_count[0],
                                           htotal_size3[1] / region_count[1],
                                           htotal_size3[2] / region_count[2]};

                // select this process's hyperslab in the file space
                hsize_t start[3] = {region_index[0] * region_size3[0],
                                    region_index[1] * region_size3[1],
                                    region_index[2] * region_size3[2]};
                hsize_t count[3] = {region_size3[0], region_size3[1],
                                    region_size3[2]};
                status = H5Sselect_hyperslab(spac_id, H5S_SELECT_SET,
                                             start, NULL, count, NULL);
                hid_t memspace = H5Screate_simple(3, count, NULL);

                // collective read into a freshly allocated buffer
                hid_t xfer_plist = H5Pcreate(H5P_DATASET_XFER);
                status = H5Pset_dxpl_mpio(xfer_plist, H5FD_MPIO_COLLECTIVE);

                flow_field[i] = (float *) malloc(count[0] * count[1] * count[2]
                                                 * sizeof(float));
                status = H5Dread(dset_id, H5T_NATIVE_FLOAT, memspace, spac_id,
                                 xfer_plist, flow_field[i]);

                // clean up
                H5Dclose(dset_id);
                H5Sclose(spac_id);
                H5Sclose(memspace);
                H5Pclose(xfer_plist);
              }
              H5Fclose(file_id);
            }

            *Do you see any problem with this function? I am new to
            parallel HDF5.*

            Thanks in advance!



















_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
