OK, so I started from a clean distribution and CHPL_TARGET_ARCH=native is now
running w/o the CHPL_MEM error. I think there must be a bug somewhere in
the make clean (if I were using a git repo, I would have used git clean
-fdx and probably avoided this problem). I'm glad it's a build issue.
Now looking into perf issues w/ the binary compiled with --fast.
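For reference, the sequence that recovered the build here was roughly the following (a sketch; exact steps may vary by version, and CHPL_HOME is assumed to point at the unpacked 1.11.0 tree):

```shell
# Rebuild the runtime after changing CHPL_TARGET_ARCH; the runtime is
# compiled with the target-architecture setting baked in, so a stale
# build can leave mismatched pieces behind.
cd $CHPL_HOME
export CHPL_TARGET_ARCH=native
make clean     # in a git checkout, 'git clean -fdx' is more thorough
make
```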
On Tue, May 12, 2015 at 8:12 PM, Brian Guarraci <[email protected]> wrote:
> Yesterday, I did multiple full clean builds (for various combos)
> suspecting what you suggest. I think there are some bugs and I need to dig
> in to provide better clues for this list. I'm using 1.11.0 official build.
>
> One strange symptom was that using -nl 1 showed this error but -nl 16 just
> hung indefinitely, with no cpu or network activity. Tried this with no
> optimizations as well, same result. Could also be partly related to the
> weird gasnet issue I mentioned.
>
> > On May 12, 2015, at 7:40 PM, Michael Ferguson <[email protected]>
> wrote:
> >
> > Hi Brian -
> >
> > I've seen errors like that if I don't run 'make' again in the
> > compiler directory after setting the CHPL_TARGET_ARCH environment
> > variable. You need to run 'make' again since the runtime builds
> > with the target architecture setting.
> >
> > Of course this and the CHPL_MEM errors could be bugs...
> >
> > Hope that helps,
> >
> > -michael
> >
> >> On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:
> >>
> >> Quick, partial follow-up:
> >>
> >> I recompiled Chapel with CHPL_TARGET_ARCH=native and hit the following
> >> error:
> >>
> >> /home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19:
> >> /home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak:
> >> No such file or directory
> >> make: *** No rule to make target
> >> `/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'.
> >> Stop.
> >> error: compiling generated source
> >>
> >> I tweaked the gasnet folder names, but haven't gotten it to work yet.
> >> Subsequent attempts to run the compiled program resulted in "set CHPL_MEM
> >> to a more appropriate mem type". I tried setting the mem type to a few
> >> different choices and it didn't help. Needs more investigation.
> >>
> >>
> >> I did manage to run some local --fast tests though, and the --local
> >> version ran in 5s while the --no-local version ran in about 12s.
> >>
> >> Additionally, I played around with the local keyword and that also seems
> >> to make some difference, at least locally. I need to try this on the
> >> distributed version when I stand it back up.
> >>
> >>> On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]>
> wrote:
> >>>
> >>>
> >>> For well-written programs, most of the --local vs. --no-local
> >>> differences show up as CPU overhead rather than network overhead. I.e.,
> >>> we tend not to do unnecessary communications, we simply execute extra
> >>> scalar code to determine that communication is unnecessary, and the
> >>> presence of this code hinders the back-end compiler's ability to
> >>> optimize the per-node computation.
> >>>
> >>> Here's an example: A given array access like A[i] may not know whether
> >>> the access is local or remote, so will introduce communication-related
> >>> code to disambiguate. Even if that code doesn't generate communication,
> >>> it can be ugly enough to throw the back-end C compiler off.
> >>>
> >>> Some workarounds to deal with this:
> >>>
> >>> * the 'local' block (documented in doc/release/technotes/README.local
> >>> -- this is a big hammer and likely to be replaced with more
> >>> data-centric capabilities going forward, but can be helpful in the
> >>> meantime if you can get it working in a chunk of code).
> >>>
> >>> * I don't know how fully-fleshed out these features are, but there
> >>> are at least draft capabilities for .localAccess() and .localSlice()
> >>> methods on some array types to reduce overheads like the ones in
> >>> my simple example above. I.e., if I know that A[i] is local for a
> >>> given distributed array, A.localAccess(i) is likely to give better
> >>> performance.
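A minimal Chapel sketch of the two workarounds described above, assuming a Block-distributed array (the method names are as given in this thread; their availability and exact behavior may vary across Chapel versions):

```chapel
use BlockDist;

config const n = 8;
const Space = {1..n};
const D = Space dmapped Block(boundingBox=Space);
var A: [D] int;

// (1) localAccess: skip the local-vs-remote disambiguation code when we
// know A[i] lives on the executing locale. A forall over a Block-
// distributed domain runs each iteration on the owning locale, so each
// element accessed here is local.
forall i in D do
  A.localAccess(i) = i;

// (2) 'local' block: assert that the enclosed code performs no
// communication (a runtime error fires if it does), letting the
// compiler drop the communication checks inside.
on Locales[0] {
  local {
    var x = 0;
    x += 1;
  }
}
```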
> >>>
> >>>
> >>> But maybe I should start with a higher-level question: What kinds of
> >>> data structures does your code use, and what types of idioms do you use
> >>> to get multi-locale executions going? (e.g., distributed arrays +
> >>> foralls? Or more manually-distributed data structures + task
> >>> parallelism + on-clauses?)
> >>>
> >>> Thanks,
> >>>
> >>> -Brad
> >>>
> >>>
> >>>> On Mon, 11 May 2015, Brian Guarraci wrote:
> >>>>
> >>>> I was aware of the on-going progress in optimizing the comm, but I'll
> >>>> take a look at your docs. I'll also give the --local vs. --no-local
> >>>> experiment a try.
> >>>>
> >>>> I tested the network layer and saw my nodes were operating near peak
> >>>> network capacity, so it wasn't a transport issue. Yes, the CPUs (of
> >>>> which there are 4 per node) were nearly fully pegged. Considering the
> >>>> level of complexity of the code, I suspect it was mostly overhead. I
> >>>> was even looking for a way to pin the execution to a single proc, as
> >>>> I wondered if there was some kind of thrashing going on between
> >>>> procs. The funny thing was the more I tried to optimize the program
> >>>> to do less network traffic, the slower it got.
> >>>>
> >>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]>
> >>>> wrote:
> >>>>
> >>>>>
> >>>>> Hi Brian --
> >>>>>
> >>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you
> >>>>> better results as long as you're not cross-compiling. Alternatively,
> >>>>> you can set it to 'none', which will squash the warning you're
> >>>>> getting. In any case, I wouldn't expect the lack of --specialize
> >>>>> optimizations to be the problem here (but if you're applying
> >>>>> components of --fast manually, you'd want to be sure to add -O in
> >>>>> addition to --no-checks).
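As a rough sketch of the point above about applying pieces of --fast by hand (the expansion shown is approximate, not an exact definition of what --fast enables):

```shell
# --fast bundles several things; if you can't use it wholesale,
# combine the pieces yourself -- and don't forget -O:
chpl --fast prog.chpl -o prog
chpl -O --no-checks --specialize prog.chpl -o prog   # approximate equivalent
```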
> >>>>>
> >>>>> Generally speaking, Chapel programs compiled for --no-local
> >>>>> (multi-locale execution) tend to generate much worse per-node code
> >>>>> than those compiled for --local (single-locale execution), and this
> >>>>> is an area of active optimization effort. See the "Performance
> >>>>> Optimizations and Generated Code Improvements" release note slides at:
> >>>>>
> >>>>> http://chapel.cray.com/download.html#releaseNotes
> >>>>> http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
> >>>>>
> >>>>> and particularly, the section entitled "the 'local field' pragma" for
> >>>>> more details on this effort (starts at slide 34).
> >>>>>
> >>>>> In a nutshell, the Chapel compiler conservatively assumes that
> >>>>> things are remote rather than local when in doubt (to emphasize
> >>>>> correctness over fast-but-incorrect programs), and then gets into
> >>>>> doubt far more often than it should. We're currently working on
> >>>>> tightening up this gap.
> >>>>>
> >>>>> This could explain the full difference in performance that you're
> >>>>> seeing, or something else may be happening. One way to check into
> >>>>> this might be to run a --local vs. --no-local execution with
> >>>>> CHPL_COMM=none to see how much overhead is added. The fact that all
> >>>>> CPUs are pegged is a good indication that you don't have a problem
> >>>>> with load balance or distributing data/computation across nodes, I'd
> >>>>> guess?
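The comparison suggested above might look like the following (program name hypothetical; with CHPL_COMM=none there is no communication at all, so any remaining --no-local slowdown is pure per-node overhead):

```shell
export CHPL_COMM=none
chpl --fast --local    myprog.chpl -o myprog-local
chpl --fast --no-local myprog.chpl -o myprog-nolocal
time ./myprog-local
time ./myprog-nolocal
```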
> >>>>>
> >>>>> -Brad
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
> >>>>>
> >>>>>> I should add that I did supply --no-checks, and that helped about
> >>>>>> 10%.
> >>>>>>
> >>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> It says:
> >>>>>>>
> >>>>>>> warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'.
> >>>>>>> If you want any specialization to occur please set CHPL_TARGET_ARCH
> >>>>>>> to a proper value.
> >>>>>>>
> >>>>>>> It's unclear which target arch is appropriate.
> >>>>>>>
> >>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hi Brian --
> >>>>>>>>
> >>>>>>>> Getting --fast working should definitely be the first priority.
> >>>>>>>> What about it fails to work?
> >>>>>>>>
> >>>>>>>> -Brad
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I've been testing my search index on my 16-node ARM system and
> >>>>>>>>> have been running into some strange behavior. The cool part is
> >>>>>>>>> that the locale partitioning concept seems to work well; the
> >>>>>>>>> downside is that the system is very slow. I've rewritten the
> >>>>>>>>> approach a few different ways and haven't made a dent, so wanted
> >>>>>>>>> to ask a few questions.
> >>>>>>>>>
> >>>>>>>>> On the ARM processors, I can only use FIFO and can't optimize
> >>>>>>>>> (--fast doesn't work). Is this going to significantly affect
> >>>>>>>>> cross-locale performance?
> >>>>>>>>>
> >>>>>>>>> I've looked at the generated C code and tried to minimize the
> >>>>>>>>> _comm_ operations in core methods, but it doesn't seem to help.
> >>>>>>>>> Network usage is still quite low (100K/s) while CPUs are pegged.
> >>>>>>>>> Are there any profiling tools I can use to understand what might
> >>>>>>>>> be going on here?
> >>>>>>>>>
> >>>>>>>>> Generally, on my laptop or single node, I can index about 1.1MM
> >>>>>>>>> records in under 10s. With 16 nodes, it takes 10min to do 100k
> >>>>>>>>> records.
> >>>>>>>>>
> >>>>>>>>> Wondering if there's some systemic issue at play here and how I
> >>>>>>>>> can further investigate.
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>> Brian
> >
>
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers