Yesterday I did multiple full clean builds (for various combos), suspecting what you suggest. I think there are some bugs, and I need to dig in to provide better clues for this list. I'm using the official 1.11.0 build.
One strange symptom was that using -nl 1 showed this error, but -nl 16 just hung indefinitely, with no CPU or network activity. I tried this with no optimizations as well, with the same result. It could also be partly related to the weird gasnet issue I mentioned.

> On May 12, 2015, at 7:40 PM, Michael Ferguson <[email protected]> wrote:
>
> Hi Brian -
>
> I've seen errors like that if I don't run 'make' again in the
> compiler directory after setting the CHPL_TARGET_ARCH environment
> variable. You need to run 'make' again since the runtime builds
> with the target architecture setting.
>
> Of course, this and the CHPL_MEM errors could be bugs...
>
> Hope that helps,
>
> -michael
>
>> On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:
>>
>> A quick, partial follow-up:
>>
>> I recompiled Chapel with CHPL_TARGET_ARCH=native and hit the following
>> error:
>>
>> /home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19:
>> /home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak: No such file or directory
>> make: *** No rule to make target
>> `/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'. Stop.
>> error: compiling generated source
>>
>> I tweaked the gasnet folder names but haven't gotten it to work yet.
>> Subsequent attempts to run the compiled program resulted in "set CHPL_MEM
>> to a more appropriate mem type". I tried setting the mem type to a few
>> different choices, but that didn't help. It needs more investigation.
>>
>> I did manage to run some local --fast tests, though: the --local version
>> ran in 5s, while the --no-local version ran in about 12s.
>>
>> Additionally, I played around with the 'local' keyword, and that also
>> seems to make some difference, at least locally. I need to try it on the
>> distributed version when I stand it back up.
>>
>>> On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]> wrote:
>>>
>>> For well-written programs, most of the --local vs. --no-local
>>> differences show up as CPU overhead rather than network overhead. That
>>> is, we tend not to do unnecessary communication; we simply execute extra
>>> scalar code to determine that communication is unnecessary, and the
>>> presence of this code hinders the back-end compiler's ability to
>>> optimize the per-node computation.
>>>
>>> Here's an example: a given array access like A[i] may not know whether
>>> the access is local or remote, so it will introduce communication-
>>> related code to disambiguate. Even if that code doesn't generate
>>> communication, it can be ugly enough to throw the back-end C compiler
>>> off.
>>>
>>> Some workarounds to deal with this:
>>>
>>> * the 'local' block (documented in doc/release/technotes/README.local)
>>>   -- this is a big hammer and likely to be replaced with more data-
>>>   centric capabilities going forward, but it can be helpful in the
>>>   meantime if you can get it working in a chunk of code.
>>>
>>> * I don't know how fully fleshed out these features are, but there are
>>>   at least draft capabilities for .localAccess() and .localSlice()
>>>   methods on some array types to reduce overheads like the ones in my
>>>   simple example above. I.e., if I know that A[i] is local for a given
>>>   distributed array, A.localAccess(i) is likely to give better
>>>   performance.
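To make Brad's two workarounds above concrete, here is a minimal Chapel
sketch. The block-distributed array A, its size, and the loops are
hypothetical stand-ins (1.11-era BlockDist syntax), and the
localSubdomain() query is assumed to be available:

    use BlockDist;

    config const n = 1000000;

    // A block-distributed array: each locale owns a contiguous chunk.
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var A: [D] real;

    // Workaround 1: the 'local' block asserts that the enclosed code
    // touches only local data, letting the compiler drop wide-pointer
    // and communication checks (it halts if the assertion is violated).
    coforall loc in Locales do on loc {
      const mine = A.localSubdomain();  // indices owned by this locale
      local {
        for i in mine do
          A[i] = i;                     // compiled as purely local accesses
      }
    }

    // Workaround 2: .localAccess() skips the locality check for a single
    // element the caller knows is local.
    coforall loc in Locales do on loc {
      for i in A.localSubdomain() do
        A.localAccess(i) = 2 * A.localAccess(i);
    }

Compiling such a program with --no-local and inspecting the generated C
for a plain A[i] access versus these two forms is one way to see the
disambiguation code Brad describes.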
>>> But maybe I should start with a higher-level question: what kinds of
>>> data structures does your code use, and what types of idioms do you use
>>> to get multi-locale executions going? (E.g., distributed arrays +
>>> foralls? Or more manually-distributed data structures + task
>>> parallelism + on-clauses?)
>>>
>>> Thanks,
>>>
>>> -Brad
>>>
>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>
>>>> I was aware of the ongoing progress in optimizing the comm layer, but
>>>> I'll take a look at your docs. I'll also give the --local vs.
>>>> --no-local experiment a try.
>>>>
>>>> I tested the network layer and saw my nodes were able to operate near
>>>> peak network capacity, so it wasn't a transport issue. Yes, the CPUs
>>>> (of which there are 4 per node) were nearly fully pegged. Considering
>>>> the level of complexity of the code, I suspect it was mostly overhead.
>>>> I was even looking for a way to pin the execution to a single proc, as
>>>> I wondered whether there was some kind of thrashing going on between
>>>> procs. The funny thing was the more I tried to optimize the program to
>>>> do less network traffic, the slower it got.
>>>>
>>>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]> wrote:
>>>>>
>>>>> Hi Brian --
>>>>>
>>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you
>>>>> better results as long as you're not cross-compiling. Alternatively,
>>>>> you can set it to 'none', which will squash the warning you're
>>>>> getting. In any case, I wouldn't expect the lack of --specialize
>>>>> optimizations to be the problem here (but if you're passing
>>>>> components of --fast manually, you'd want to be sure to add -O in
>>>>> addition to --no-checks).
>>>>>
>>>>> Generally speaking, Chapel programs compiled for --no-local
>>>>> (multi-locale execution) tend to generate much worse per-node code
>>>>> than those compiled for --local (single-locale execution), and this
>>>>> is an area of active optimization effort. See the "Performance
>>>>> Optimizations and Generated Code Improvements" release note slides
>>>>> at:
>>>>>
>>>>> http://chapel.cray.com/download.html#releaseNotes
>>>>> http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>>>>>
>>>>> and particularly the section entitled "the 'local field' pragma" for
>>>>> more details on this effort (starts at slide 34).
>>>>>
>>>>> In a nutshell, the Chapel compiler conservatively assumes that things
>>>>> are remote rather than local when in doubt (to emphasize correctness
>>>>> over fast-but-incorrect programs), and then gets into doubt far more
>>>>> often than it should. We're currently working on tightening up this
>>>>> gap.
>>>>>
>>>>> This could explain the full difference in performance that you're
>>>>> seeing, or something else may be happening. One way to check into
>>>>> this might be to run a --local vs. --no-local execution with
>>>>> CHPL_COMM=none to see how much overhead is added. The fact that all
>>>>> CPUs are pegged is a good indication that you don't have a problem
>>>>> with load balance or distributing data/computation across nodes, I'd
>>>>> guess?
>>>>>
>>>>> -Brad
>>>>>
>>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>>
>>>>>> I should add that I did supply --no-checks, and that helped about 10%.
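As a reference point for Brad's higher-level question above, here is a
minimal sketch of the two multi-locale idioms he contrasts; the array,
its size, and the work done per element are hypothetical:

    use BlockDist;

    config const n = 16;
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var A: [D] int;

    // Idiom 1: distributed arrays + foralls. Each iteration runs on the
    // locale that owns its index, so data placement is handled implicitly.
    forall i in D do
      A[i] = here.id;

    // Idiom 2: explicit task parallelism + on-clauses. The programmer
    // creates one task per locale and operates on that locale's chunk
    // directly (localSubdomain() assumed available, as above).
    coforall loc in Locales do on loc {
      for i in A.localSubdomain() do
        A[i] = here.id;
    }

    writeln(A);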
>>>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]> wrote:
>>>>>>>
>>>>>>> It says:
>>>>>>>
>>>>>>>   warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'.
>>>>>>>   If you want any specialization to occur please set
>>>>>>>   CHPL_TARGET_ARCH to a proper value.
>>>>>>>
>>>>>>> It's unclear which target arch is appropriate.
>>>>>>>
>>>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Brian --
>>>>>>>>
>>>>>>>> Getting --fast working should definitely be the first priority.
>>>>>>>> What about it fails to work?
>>>>>>>>
>>>>>>>> -Brad
>>>>>>>>
>>>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've been testing my search index on my 16-node ARM system and
>>>>>>>>> have been running into some strange behavior. The cool part is
>>>>>>>>> that the locale partitioning concept seems to work well; the
>>>>>>>>> downside is that the system is very slow. I've rewritten the
>>>>>>>>> approach a few different ways and haven't made a dent, so I
>>>>>>>>> wanted to ask a few questions.
>>>>>>>>>
>>>>>>>>> On the ARM processors, I can only use FIFO tasks and can't
>>>>>>>>> optimize (--fast doesn't work). Is this going to significantly
>>>>>>>>> affect cross-locale performance?
>>>>>>>>>
>>>>>>>>> I've looked at the generated C code and tried to minimize the
>>>>>>>>> _comm_ operations in core methods, but it doesn't seem to help.
>>>>>>>>> Network usage is still quite low (100K/s) while the CPUs are
>>>>>>>>> pegged. Are there any profiling tools I can use to understand
>>>>>>>>> what might be going on here?
>>>>>>>>>
>>>>>>>>> Generally, on my laptop or a single node, I can index about 1.1MM
>>>>>>>>> records in under 10s. With 16 nodes, it takes 10 minutes to do
>>>>>>>>> 100k records.
>>>>>>>>>
>>>>>>>>> I'm wondering if there's some systemic issue at play here and how
>>>>>>>>> I can investigate further.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Brian
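On Brian's profiling question: in addition to external tools, Chapel's
CommDiagnostics module can count the implicit communication operations
(gets, puts, forks) a multi-locale run performs, which helps localize
where the _comm_ calls in the generated C actually fire. A minimal
sketch; the distributed array and loop are hypothetical stand-ins for
the real indexing code:

    use CommDiagnostics, BlockDist;

    config const n = 100000;
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var counts: [D] int;

    startCommDiagnostics();

    // Iterating over a plain range runs entirely on the current locale,
    // so accesses to elements owned by other locales show up as remote
    // gets/puts in the diagnostics.
    forall i in 1..n do
      counts[i] += 1;

    stopCommDiagnostics();

    // One record per locale, reporting gets, puts, forks, etc.
    writeln(getCommDiagnostics());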
