For well-written programs, most of the --local vs. --no-local differences show up as CPU overhead rather than network overhead. That is, we tend not to perform unnecessary communication; instead, we execute extra scalar code to determine that communication is unnecessary, and the presence of this code hinders the back-end compiler's ability to optimize the per-node computation.
Here's an example: a given array access like A[i] may not know whether the access is local or remote, so the compiler will introduce communication-related code to disambiguate. Even if that code never generates communication, it can be ugly enough to throw the back-end C compiler off. Some workarounds to deal with this:

* the 'local' block (documented in doc/release/technotes/README.local) -- this is a big hammer and likely to be replaced with more data-centric capabilities going forward, but it can be helpful in the meantime if you can get it working in a chunk of code.

* I don't know how fully fleshed-out these features are, but there are at least draft capabilities for .localAccess() and .localSlice() methods on some array types to reduce overheads like the ones in my simple example above. I.e., if I know that A[i] is local for a given distributed array, A.localAccess(i) is likely to give better performance.

But maybe I should start with a higher-level question: what kinds of data structures does your code use, and what types of idioms do you use to get multi-locale execution going? (e.g., distributed arrays + foralls? Or more manually-distributed data structures + task parallelism + on-clauses?)

Thanks,
-Brad

On Mon, 11 May 2015, Brian Guarraci wrote:

> I was aware of the on-going progress in optimizing the comm, but I'll take
> a look at your docs. I'll also give the --local vs. --no-local experiment a
> try.
>
> I tested the network layer and saw my nodes were operating near peak
> network capacity, so it wasn't a transport issue. Yes, the CPUs (of which
> there are 4 per node) were nearly fully pegged. Considering the level of
> complexity of the code, I suspect it was mostly overhead. I was even
> looking for a way to pin the execution to a single proc, as I wondered if
> there was some kind of thrashing going on between procs. The funny thing
> was that the more I tried to optimize the program to do less network
> traffic, the slower it got.
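[Editor's note: a minimal sketch of the locality idioms discussed above, using the syntax of the Chapel 1.11-era compiler; the array name A, size n, and the exact availability of localAccess on a given array type are illustrative assumptions, not taken from the thread.]

```chapel
use BlockDist;

config const n = 1000;

// A Block-distributed array: element A[i] lives on whichever locale owns
// index i, so a plain A[i] access carries a runtime local/remote check.
const D = {1..n} dmapped Block(boundingBox={1..n});
var A: [D] real;

forall i in D {
  // Plain access: communication-related code is generated even though,
  // under this forall, index i is owned by the executing locale.
  A[i] = i;
}

forall i in D {
  // A 'local' block asserts that no communication occurs inside it
  // (a runtime error if violated), freeing the back-end C compiler
  // to optimize the loop body.
  local {
    A[i] += 1;
  }
}

forall i in D {
  // The draft localAccess method skips the locality check for a
  // single access that the programmer knows is local.
  A.localAccess(i) += 1;
}
```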
>
> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]> wrote:
>
>> Hi Brian --
>>
>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you better
>> results as long as you're not cross-compiling. Alternatively, you can set
>> it to 'none', which will squash the warning you're getting. In any case, I
>> wouldn't expect the lack of --specialize optimizations to be the problem
>> here (but if you're passing components of --fast manually, you'd want to
>> be sure to add -O in addition to --no-checks).
>>
>> Generally speaking, Chapel programs compiled for --no-local (multi-locale
>> execution) tend to generate much worse per-node code than those compiled
>> for --local (single-locale execution), and this is an area of active
>> optimization effort. See the "Performance Optimizations and Generated
>> Code Improvements" release note slides at:
>>
>> http://chapel.cray.com/download.html#releaseNotes
>> http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>>
>> and particularly the section entitled "the 'local field' pragma" for more
>> details on this effort (starts at slide 34).
>>
>> In a nutshell, the Chapel compiler conservatively assumes that things are
>> remote rather than local when in doubt (to emphasize correctness over
>> fast-but-incorrect programs), and then gets into doubt far more often
>> than it should. We're currently working on tightening up this gap.
>>
>> This could explain the full difference in performance that you're seeing,
>> or something else may be happening. One way to check into this might be
>> to run a --local vs. --no-local execution with CHPL_COMM=none to see how
>> much overhead is added. The fact that all CPUs are pegged is a good
>> indication that you don't have a problem with load balance or
>> distributing data/computation across nodes, I'd guess?
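[Editor's note: a sketch of the --local vs. --no-local experiment suggested in the quoted message above. The flags and environment variables are real chpl settings of that era; the source filename indexer.chpl and binary names are made up for illustration.]

```sh
# Pin the target architecture so --specialize has something to work with
# (avoids the "CHPL_TARGET_ARCH is 'unknown'" warning):
export CHPL_TARGET_ARCH=native

# Disable the communication layer entirely for this experiment:
export CHPL_COMM=none

# Baseline: single-locale code generation.
chpl --fast --local indexer.chpl -o indexer-local
./indexer-local

# Same program compiled as if multi-locale. With CHPL_COMM=none, no real
# communication can occur, so any slowdown here is pure code-generation
# overhead from the conservative locality checks.
chpl --fast --no-local indexer.chpl -o indexer-nolocal
./indexer-nolocal

# If assembling --fast by hand, remember -O as well as --no-checks:
#   chpl -O --no-checks --specialize indexer.chpl
```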
>>
>> -Brad
>>
>>
>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>
>>> I should add that I did supply --no-checks, and that helped about 10%.
>>>
>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]> wrote:
>>>
>>>> It says:
>>>>
>>>> warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'. If you
>>>> want any specialization to occur please set CHPL_TARGET_ARCH to a proper
>>>> value.
>>>>
>>>> It's unclear which target arch is appropriate.
>>>>
>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]> wrote:
>>>>
>>>>> Hi Brian --
>>>>>
>>>>> Getting --fast working should definitely be the first priority. What
>>>>> about it fails to work?
>>>>>
>>>>> -Brad
>>>>>
>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've been testing my search index on my 16-node ARM system and have
>>>>>> been running into some strange behavior. The cool part is that the
>>>>>> locale partitioning concept seems to work well; the downside is that
>>>>>> the system is very slow. I've rewritten the approach a few different
>>>>>> ways and haven't made a dent, so I wanted to ask a few questions.
>>>>>>
>>>>>> On the ARM processors, I can only use FIFO and can't optimize (--fast
>>>>>> doesn't work). Is this going to significantly affect cross-locale
>>>>>> performance?
>>>>>>
>>>>>> I've looked at the generated C code and tried to minimize the _comm_
>>>>>> operations in core methods, but it doesn't seem to help. Network usage
>>>>>> is still quite low (100K/s) while the CPUs are pegged. Are there any
>>>>>> profiling tools I can use to understand what might be going on here?
>>>>>>
>>>>>> Generally, on my laptop or a single node, I can index about 1.1MM
>>>>>> records in under 10s. With 16 nodes, it takes 10min to do 100k records.
>>>>>>
>>>>>> I'm wondering if there's some systemic issue at play here and how I
>>>>>> can investigate further.
>>>>>>
>>>>>> Thanks!
>>>>>> Brian

_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
