It might be worth trying with --fair-sched=yes, just in case what you see
is due to the unfairness of thread scheduling.
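
For example, assuming the program under test is run as ./your_program (a
placeholder; substitute your real command line):

    valgrind --fair-sched=yes ./your_program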

Philippe

On Fri, 2018-01-26 at 06:57 +0000, Wuweijia wrote:
> Hi:
> 
> How large is 'm_nStep'?  [Are you sure?]
> 
> The source is below; all of the fields are plain integers. Which value do you care about?
> class CDynamicScheduling
> {
> public:
>         static const int m_nDefaultStepUnit;
>         static const int m_nDefaultStepFactor;
> 
> private:
>         int m_nBegin;
>         int m_nEnd;
>         int m_nStep;
> #if defined(_MSC_VER)
>         std::atomic<int> m_nCurrent;
> #else
>         int m_nCurrent;
> #endif
>         // ... (member functions such as GetProcLoop elided)
> };
> 
> I hope the actual source contains a comment such as:
>      Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of 
> pixels in pSrc[].
> 
>     Yes, you are right. It just computes the average of 2x2 blocks.
> 
> Earlier I showed you just the aarch64 NEON code; below is the same function,
> but the plain C implementation (as used on x86):
> 
>         UINT16 *pDstL;
>         UINT16 *pSrcL;
>         INT32 dstWDiv2 = srcW >> 1;
> //      INT32 dstHDiv2 = srcH >> 1;
>         INT32 x, y;
>         INT32 posDst,posSrc;
> 
>         pSrcL = pSrc;
>         pDstL = pDst;
> 
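>         // each loop iteration claims the next band of output rows [beginY, endY)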
>         int beginY, endY;
>         while (pDS->GetProcLoop(beginY, endY))
>         {
> //              for (y = 0; y < dstHDiv2; y++)
>                 for (y = beginY; y < endY; y++)
>                 {
>                         for (x = 0; x < dstWDiv2; x++)
>                         {
>                                 posDst = y*dstStride + x;
>                                 posSrc = (y<<1)*srcStride + (x<<1);
>                                 pDstL[posDst] = (pSrcL[posSrc] + pSrcL[posSrc + 1]
>                                         + pSrcL[posSrc + srcStride]
>                                         + pSrcL[posSrc + srcStride + 1] + 2) >> 2;
>                         }
>                 }
>         }
> 
>        pSrc is the image buffer, about 11 M pixels (Width: 3968, Height: 2976,
> srcStride: 3968).
>        It means four threads compute the averages of the 2x2 blocks.
>        pSrc is divided into many small pieces, and each thread computes the
> average of the pieces it claims. The split is not fixed by design but by the
> scheduling state of the running threads: threads that hold the CPU compute
> more pieces, and threads that do not hold the CPU compute fewer.
>      
>        
> BR
> Owen
> 
> -----Original Message-----
> From: John Reiser [mailto:jrei...@bitwagon.com]
> Sent: 2018-01-26 12:44
> To: valgrind-users@lists.sourceforge.net
> Subject: Re: [Valgrind-users] Re: Re: Re: [Help] Valgrind sometimes runs the
> program very slowly; it lasts at least one hour. Can you show me why, or
> some way to analyze it?
> 
> On 01/25/2018 15:37 UTC, Wuweijia wrote:
> 
> >     Function1:
> > bool CDynamicScheduling::GetProcLoop(
> >          int& nBegin,
> >          int& nEndPlusOne)
> > {
> >          int curr = __sync_fetch_and_add(&m_nCurrent, m_nStep);
> 
> How large is 'm_nStep'?  [Are you sure?] The overhead of switching threads
> in valgrind would be reduced by making m_nStep as large as possible. It looks
> like the code in Function2 would produce the same values regardless.
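> 
> For scale (my numbers, using the 2976-row image posted in the reply above,
> i.e. 1488 output rows; nothing here was measured):
> 
>      // m_nStep =  1  -> 1488 GetProcLoop() calls, one atomic add each
>      // m_nStep = 64  ->   24 GetProcLoop() calls
>      // a chunk near rows/(nThreads*4) keeps the load balanced while
>      // sharply cutting the scheduling overhead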
> 
> 
> >          if (curr > m_nEnd)
> >          {
> >                  return false;
> >          }
> > 
> >          nBegin = curr;
> >          int limit = m_nEnd + 1;
> 
> Local variable 'limit' is unused.  By itself this is unimportant, but it 
> might be a clue to something that is not shown here.
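> 
> If 'limit' was meant to clamp the final chunk (my guess, not the posted
> code), the intended statement was perhaps:
> 
>          nEndPlusOne = (curr + m_nStep < limit) ? (curr + m_nStep) : limit;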
> 
> >          nEndPlusOne = curr + m_nStep;
> >          return true;
> > }
> >     
> >     
> >     Function2:
> >     ....
> >     int beginY, endY;
> >    while (pDS->GetProcLoop(beginY, endY)){
> >      for (y = beginY; y < endY; y++){
> >        for(x = 0; x < dstWDiv2-7; x+=8){
> >          vtmp0 = vld2q_u16(&pSrc[(y<<1)*srcStride+(x<<1)]);
> >          vtmp1 = vld2q_u16(&pSrc[((y<<1)+1)*srcStride+(x<<1)]);
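> >          // vld2q_u16 de-interleaves: val[0] holds even columns, val[1] odd ones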
> 
> I hope the actual source contains a comment such as:
>      Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of 
> pixels in pSrc[].
> 
> >          vst1q_u16(&pDst[y*dstStride+x],
> >                    (vtmp0.val[0] + vtmp0.val[1] + vtmp1.val[0]
> >                     + vtmp1.val[1] + vdupq_n_u16(2)) >> vdupq_n_u16(2));
> >        }
> >        for(; x < dstWDiv2; x++){
> >          pDst[y*dstStride+x] = (pSrc[(y<<1)*srcStride+(x<<1)]
> >                  + pSrc[(y<<1)*srcStride+(x<<1)+1]
> >                  + pSrc[((y<<1)+1)*srcStride+(x<<1)]
> >                  + pSrc[((y<<1)+1)*srcStride+((x<<1)+1)] + 2) >> 2;
> >        }
> >      }
> >    }
> > 
> >    return;
> > }   
> 
