Hi Brian -

I've seen errors like that when I don't re-run 'make' in the compiler
directory after setting the CHPL_TARGET_ARCH environment variable. You
need to run 'make' again because the runtime is built using the target
architecture setting.
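Concretely, the sequence I have in mind looks something like this (the
CHPL_HOME path is just an example, taken from the paths in your error
output -- point it at your actual source tree):

```shell
# Sketch of the rebuild described above (paths are examples, not tested):
export CHPL_HOME="$HOME/src/chapel-1.11.0"
export CHPL_TARGET_ARCH=native

# Re-running make rebuilds the runtime for the new target architecture:
if cd "$CHPL_HOME"; then
  make
fi
```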

Of course this and the CHPL_MEM errors could be bugs...

Hope that helps,

-michael

On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:

>A quick, partial follow-up:
>
>recompiled Chapel with CHPL_TARGET_ARCH=native and hit the following
>error:
>
>/home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19:
>/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak: No such file or directory
>make: *** No rule to make target
>`/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'. Stop.
>error: compiling generated source
>
>I tweaked the gasnet folder names, but haven't gotten it to work yet.
>Subsequent attempts to run the compiled program resulted in "set CHPL_MEM
>to a more appropriate mem type".  I tried setting the mem type to a few
>different choices, but that didn't help.  Needs more investigation.
>
>
>I did manage to run some local --fast tests though, and the --local
>version ran in 5s while the --no-local version ran in about 12s.
>
>Additionally, I played around with the local keyword and that also seems
>to make some difference, at least locally.  I need to try this on the
>distributed version when I stand it back up.
>
>On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]> wrote:
>>
>>
>> For well-written programs, most of the --local vs. --no-local
>> differences show up as CPU overhead rather than network overhead.  I.e.,
>> we tend not to do unnecessary communication; we simply execute extra
>> scalar code to determine that communication is unnecessary, and the
>> presence of this code hinders the back-end compiler's ability to
>> optimize the per-node computation.
>>
>> Here's an example:  A given array access like A[i] may not know whether
>> the access is local or remote, so will introduce communication-related
>> code to disambiguate.  Even if that code doesn't generate communication,
>> it can be ugly enough to throw the back-end C compiler off.
>>
>> Some workarounds to deal with this:
>>
>> * the 'local' block (documented in doc/release/technotes/README.local
>>   -- this is a big hammer and likely to be replaced with more data-
>>   centric capabilities going forward, but can be helpful in the
>>   meantime if you can get it working in a chunk of code).
>>
>> * I don't know how fully-fleshed out these features are, but there
>>   are at least draft capabilities for .localAccess() and .localSlice()
>>   methods on some array types to reduce overheads like the ones in
>>   my simple example above.  I.e., if I know that A[i] is local for a
>>   given distributed array, A.localAccess(i) is likely to give better
>>   performance.
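[For the archive: the two workarounds Brad describes might look roughly
like the sketch below.  The array, names, and the Block constructor call
are my guesses for illustration, not tested code.]

```chapel
use BlockDist;

config const n = 8;
const Space = {1..n};
const D = Space dmapped Block(boundingBox=Space);
var A: [D] int;

forall i in D {
  // 'local' asserts to the compiler that no communication happens in
  // this block, so it can skip generating the remote-access checks:
  local {
    A[i] = 2*i;
  }
}

// Draft alternative for a single access known to be local:
// var x = A.localAccess(i);
```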
>>
>>
>> But maybe I should start with a higher-level question:  What kinds of
>>data structures does your code use, and what types of idioms do you use
>>to get multi-locale executions going?  (e.g., distributed arrays +
>>foralls? Or more manually-distributed data structures + task
>>parallelism + on-clauses?)
>>
>> Thanks,
>>
>> -Brad
>>
>>
>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>
>>> I was aware of the on-going progress in optimizing the comm, but I'll
>>> take a look at your docs.  I'll also give the --local vs. --no-local
>>> experiment a try.
>>>
>>> I tested the network layer and saw my nodes were operating near peak
>>> network capacity, so it wasn't a transport issue.  Yes, the CPUs (of
>>> which there are 4 per node) were nearly fully pegged.  Considering the
>>> level of complexity of the code, I suspect it was mostly overhead.  I
>>> was even looking for a way to pin the execution to a single proc, as I
>>> wondered if there was some kind of thrashing going on between procs.
>>> The funny thing was that the more I tried to optimize the program to
>>> do less network traffic, the slower it got.
>>>
>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]>
>>>wrote:
>>>
>>>>
>>>> Hi Brian --
>>>>
>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you
>>>> better results as long as you're not cross-compiling.  Alternatively,
>>>> you can set it to 'none', which will squash the warning you're
>>>> getting.  In any case, I wouldn't expect the lack of --specialize
>>>> optimizations to be the problem here (but if you're throwing
>>>> components of --fast manually, you'd want to be sure to add -O in
>>>> addition to --no-checks).
>>>>
>>>> Generally speaking, Chapel programs compiled for --no-local
>>>> (multi-locale execution) tend to generate much worse per-node code
>>>> than those compiled for --local (single-locale execution), and this
>>>> is an area of active optimization effort.  See the "Performance
>>>> Optimizations and Generated Code Improvements" release note slides
>>>> at:
>>>>
>>>>         http://chapel.cray.com/download.html#releaseNotes
>>>>         http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>>>>
>>>> and particularly, the section entitled "the 'local field' pragma" for
>>>> more details on this effort (starts at slide 34).
>>>>
>>>> In a nutshell, the Chapel compiler conservatively assumes that things
>>>> are remote rather than local when in doubt (to emphasize correctness
>>>> over fast but incorrect programs), and then gets into doubt far more
>>>> often than it should.  We're currently working on tightening up this
>>>> gap.
>>>>
>>>> This could explain the full difference in performance that you're
>>>> seeing, or something else may be happening.  One way to check into
>>>> this might be to run a --local vs. --no-local execution with
>>>> CHPL_COMM=none to see how much overhead is added.  The fact that all
>>>> CPUs are pegged is a good indication that you don't have a problem
>>>> with load balance or distributing data/computation across nodes, I'd
>>>> guess?
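[For the archive: that comparison might look roughly like this; the
program and file names here are invented for illustration.]

```shell
# Compile the same program both ways with communication compiled out,
# then compare single-node timings (assumes 'chpl' is on your PATH):
export CHPL_COMM=none
if command -v chpl >/dev/null; then
  chpl --fast --local    myprog.chpl -o myprog-local
  chpl --fast --no-local myprog.chpl -o myprog-nolocal
  time ./myprog-local
  time ./myprog-nolocal
fi
```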
>>>>
>>>> -Brad
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>
>>>>> I should add that I did supply --no-checks and that helped about 10%.
>>>>>
>>>>>
>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]>
>>>>>wrote:
>>>>>
>>>>>> It says:
>>>>>>
>>>>>>
>>>>>> warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'.
>>>>>> If you want any specialization to occur please set CHPL_TARGET_ARCH
>>>>>> to a proper value.
>>>>>> It's unclear which target arch is appropriate.
>>>>>>
>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>> Hi Brian --
>>>>>>>
>>>>>>> Getting --fast working should definitely be the first priority.
>>>>>>>What
>>>>>>> about it fails to work?
>>>>>>>
>>>>>>> -Brad
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>
>>>>>>>>
>>>>>>>> I've been testing my search index on my 16 node ARM system and
>>>>>>>> have been running into some strange behavior.  The cool part is
>>>>>>>> that the locale partitioning concept seems to work well; the
>>>>>>>> downside is that the system is very slow.  I've rewritten the
>>>>>>>> approach a few different ways and haven't made a dent, so wanted
>>>>>>>> to ask a few questions.
>>>>>>>>
>>>>>>>> On the ARM processors, I can only use FIFO and can't optimize
>>>>>>>> (--fast doesn't work).  Is this going to significantly affect
>>>>>>>> cross-locale performance?
>>>>>>>>
>>>>>>>> I've looked at the generated C code and tried to minimize the
>>>>>>>> _comm_ operations in core methods, but it doesn't seem to help.
>>>>>>>> Network usage is still quite low (100K/s) while CPUs are pegged.
>>>>>>>> Are there any profiling tools I can use to understand what might
>>>>>>>> be going on here?
>>>>>>>>
>>>>>>>> Generally, on my laptop or single node, I can index about 1.1MM
>>>>>>>> records in under 10s.  With 16 nodes, it takes 10min to do 100k
>>>>>>>> records.
>>>>>>>>
>>>>>>>> Wondering if there's some systemic issue at play here, and how I
>>>>>>>> can further investigate.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Brian
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>
>
>


_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
