I was aware of the ongoing progress in optimizing the comm layer, but I'll
take a look at your docs.  I'll also give the --local vs. --no-local
experiment a try.
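
For reference, a minimal sketch of that experiment as I understand it (the
program and binary names are placeholders): compiling with CHPL_COMM=none
means any remaining slowdown comes from --no-local code generation rather
than actual communication.

```shell
# Compile out the communication layer entirely; with CHPL_COMM=none any
# remaining --local vs. --no-local gap is generated-code overhead, not
# network traffic.
export CHPL_COMM=none

# Baseline: single-locale code generation (the default).
chpl --fast --local indexer.chpl -o indexer-local
./indexer-local

# Same program, but with multi-locale code generation forced on.
chpl --fast --no-local indexer.chpl -o indexer-nolocal
./indexer-nolocal
```

Comparing the two timings should isolate how much of the gap is pure
--no-local overhead.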

I tested the network layer and saw my nodes were operating near peak
network capacity, so it wasn't a transport issue.  Yes, the CPUs (of which
there are 4 per node) were nearly fully pegged.  Given the complexity of
the code, I suspect it was mostly overhead.  I was even looking for a way
to pin the execution to a single proc, as I wondered whether there was some
kind of thrashing going on between procs.  The funny thing was that the
more I optimized the program to generate less network traffic, the slower
it got.
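
As an aside, on Linux the pinning experiment I had in mind can be sketched
with taskset (binary name is a placeholder; CHPL_RT_NUM_THREADS_PER_LOCALE
is the runtime knob I believe limits the thread count, but treat that as an
assumption):

```shell
# Pin the whole process, including the tasking layer's pthreads, to core 0
# to see whether cross-core thrashing is part of the slowdown.
taskset -c 0 ./indexer

# Alternatively, ask the Chapel runtime itself to use a single thread.
CHPL_RT_NUM_THREADS_PER_LOCALE=1 ./indexer
```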

On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]> wrote:

>
> Hi Brian --
>
> I believe that setting CHPL_TARGET_ARCH to 'native' should get you better
> results as long as you're not cross-compiling.  Alternatively, you can set
> it to 'none' which will squash the warning you're getting.  In any case, I
> wouldn't expect the lack of --specialize optimizations to be the problem
> here (but if you're throwing components of --fast manually, you'd want to
> be sure to add -O in addition to --no-checks).
>
> Generally speaking, Chapel programs compiled for --no-local (multi-locale
> execution) tend to generate much worse per-node code than those compiled
> for --local (single-locale execution), and this is an area of active
> optimization effort.  See the "Performance Optimizations and Generated Code
> Improvements" release note slides at:
>
>         http://chapel.cray.com/download.html#releaseNotes
>         http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>
> and particularly, the section entitled "the 'local field' pragma" for more
> details on this effort (starts at slide 34).
>
> In a nutshell, the Chapel compiler conservatively assumes that things are
> remote rather than local when in doubt (to emphasize correctness over fast
> but incorrect programs), and then gets into doubt far more often than it
> should.  We're currently working on tightening up this gap.
>
> This could explain the full difference in performance that you're seeing,
> or something else may be happening.  One way to check into this might be to
> run a --local vs. --no-local execution with CHPL_COMM=none to see how much
> overhead is added.  The fact that all CPUs are pegged is a good indication
> that you don't have a problem with load balance or distributing
> data/computation across nodes, I'd guess?
>
> -Brad
>
>
>
>
> On Mon, 11 May 2015, Brian Guarraci wrote:
>
>> I should add that I did supply --no-checks and that helped about 10%.
>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]> wrote:
>>
>>> It says:
>>> warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'. If you
>>> want any specialization to occur please set CHPL_TARGET_ARCH to a proper
>>> value.
>>> It's unclear which target arch is appropriate.
>>>
>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]>
>>> wrote:
>>>
>>>
>>>> Hi Brian --
>>>>
>>>> Getting --fast working should definitely be the first priority.  What
>>>> about it fails to work?
>>>>
>>>> -Brad
>>>>
>>>>
>>>>
>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've been testing my search index on my 16-node ARM system and have
>>>>> been running into some strange behavior.  The cool part is that the
>>>>> locale partitioning concept seems to work well; the downside is that
>>>>> the system is very slow.  I've rewritten the approach a few different
>>>>> ways and haven't made a dent, so wanted to ask a few questions.
>>>>>
>>>>> On the ARM processors, I can only use FIFO and can't optimize (--fast
>>>>> doesn't work).  Is this going to significantly affect cross-locale
>>>>> performance?
>>>>>
>>>>> I've looked at the generated C code and tried to minimize the _comm_
>>>>> operations in core methods, but it doesn't seem to help.  Network
>>>>> usage is still quite low (100K/s) while CPUs are pegged.  Are there
>>>>> any profiling tools I can use to understand what might be going on
>>>>> here?
>>>>>
>>>>> Generally, on my laptop or single node, I can index about 1.1MM
>>>>> records in under 10s.  With 16 nodes, it takes 10min to do 100k
>>>>> records.
>>>>>
>>>>> Wondering if there's some systemic issue at play here and how I can
>>>>> further investigate.
>>>>>
>>>>> Thanks!
>>>>> Brian
>>>>>
>>>>>
>>>>>
>>>
>>
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
