Yesterday I did multiple full clean builds (for various combos), suspecting what you suggest. I think there are some bugs, and I need to dig in to provide better clues for this list. I'm using the official 1.11.0 build.
One strange symptom was that using -nl 1 showed this error, but -nl 16 just hung indefinitely, with no CPU or network activity. I tried this with no optimizations as well, with the same result. It could also be partly related to the weird gasnet issue I mentioned.

> On May 12, 2015, at 7:40 PM, Michael Ferguson <[email protected]> wrote:
>
> Hi Brian -
>
> I've seen errors like that if I don't run 'make' again in the
> compiler directory after setting the CHPL_TARGET_ARCH environment
> variable. You need to run 'make' again since the runtime builds
> with the target architecture setting.
>
> Of course, this and the CHPL_MEM errors could be bugs...
>
> Hope that helps,
>
> -michael
>
>> On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:
>>
>> A quick, partial follow-up:
>>
>> I recompiled Chapel with CHPL_TARGET_ARCH=native and hit the following
>> error:
>>
>> /home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19:
>> /home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak: No such file or directory
>> make: *** No rule to make target
>> `/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'. Stop.
>> error: compiling generated source
>>
>> I tweaked the gasnet folder names but haven't gotten it to work yet.
>> Subsequent attempts to run the compiled program resulted in "set CHPL_MEM
>> to a more appropriate mem type". I tried setting the mem type to a few
>> different choices, but that didn't help. It needs more investigation.
>>
>> I did manage to run some local --fast tests, though: the --local version
>> ran in 5s, while the --no-local version ran in about 12s.
>>
>> Additionally, I played around with the 'local' keyword, and that also
>> seems to make some difference, at least locally. I need to try it on the
>> distributed version when I stand it back up.
>>
>>> On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]> wrote:
>>>
>>> For well-written programs, most of the --local vs. --no-local
>>> differences show up as CPU overhead rather than network overhead. That
>>> is, we tend not to do unnecessary communication; we simply execute extra
>>> scalar code to determine that communication is unnecessary, and the
>>> presence of this code hinders the back-end compiler's ability to
>>> optimize the per-node computation.
>>>
>>> Here's an example: a given array access like A[i] may not know whether
>>> the access is local or remote, so it will introduce communication-
>>> related code to disambiguate. Even if that code doesn't generate
>>> communication, it can be ugly enough to throw the back-end C compiler
>>> off.
>>>
>>> Some workarounds to deal with this:
>>>
>>> * the 'local' block (documented in doc/release/technotes/README.local)
>>>   -- this is a big hammer and likely to be replaced with more data-
>>>   centric capabilities going forward, but it can be helpful in the
>>>   meantime if you can get it working in a chunk of code.
>>>
>>> * I don't know how fully fleshed out these features are, but there are
>>>   at least draft capabilities for .localAccess() and .localSlice()
>>>   methods on some array types to reduce overheads like the ones in my
>>>   simple example above. I.e., if I know that A[i] is local for a given
>>>   distributed array, A.localAccess(i) is likely to give better
>>>   performance.
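To make Brad's two workarounds above concrete, here is a minimal Chapel
sketch. The block-distributed array A, its size, and the loops are
hypothetical stand-ins (1.11-era BlockDist syntax), and the
localSubdomain() query is assumed to be available:

    use BlockDist;

    config const n = 1000000;

    // A block-distributed array: each locale owns a contiguous chunk.
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var A: [D] real;

    // Workaround 1: the 'local' block asserts that the enclosed code
    // touches only local data, letting the compiler drop wide-pointer
    // and communication checks (it halts if the assertion is violated).
    coforall loc in Locales do on loc {
      const mine = A.localSubdomain();  // indices owned by this locale
      local {
        for i in mine do
          A[i] = i;                     // compiled as purely local accesses
      }
    }

    // Workaround 2: .localAccess() skips the locality check for a single
    // element the caller knows is local.
    coforall loc in Locales do on loc {
      for i in A.localSubdomain() do
        A.localAccess(i) = 2 * A.localAccess(i);
    }

Compiling such a program with --no-local and inspecting the generated C
for a plain A[i] access versus these two forms is one way to see the
disambiguation code Brad describes.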
>>> But maybe I should start with a higher-level question: what kinds of
>>> data structures does your code use, and what types of idioms do you use
>>> to get multi-locale executions going? (E.g., distributed arrays +
>>> foralls? Or more manually-distributed data structures + task
>>> parallelism + on-clauses?)
>>>
>>> Thanks,
>>>
>>> -Brad
>>>
>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>
>>>> I was aware of the ongoing progress in optimizing the comm layer, but
>>>> I'll take a look at your docs. I'll also give the --local vs.
>>>> --no-local experiment a try.
>>>>
>>>> I tested the network layer and saw my nodes were able to operate near
>>>> peak network capacity, so it wasn't a transport issue. Yes, the CPUs
>>>> (of which there are 4 per node) were nearly fully pegged. Considering
>>>> the level of complexity of the code, I suspect it was mostly overhead.
>>>> I was even looking for a way to pin the execution to a single proc, as
>>>> I wondered whether there was some kind of thrashing going on between
>>>> procs. The funny thing was the more I tried to optimize the program to
>>>> do less network traffic, the slower it got.
>>>>
>>>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]> wrote:
>>>>>
>>>>> Hi Brian --
>>>>>
>>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you
>>>>> better results as long as you're not cross-compiling. Alternatively,
>>>>> you can set it to 'none', which will squash the warning you're
>>>>> getting. In any case, I wouldn't expect the lack of --specialize
>>>>> optimizations to be the problem here (but if you're passing
>>>>> components of --fast manually, you'd want to be sure to add -O in
>>>>> addition to --no-checks).
>>>>>
>>>>> Generally speaking, Chapel programs compiled for --no-local
>>>>> (multi-locale execution) tend to generate much worse per-node code
>>>>> than those compiled for --local (single-locale execution), and this
>>>>> is an area of active optimization effort. See the "Performance
>>>>> Optimizations and Generated Code Improvements" release note slides
>>>>> at:
>>>>>
>>>>> http://chapel.cray.com/download.html#releaseNotes
>>>>> http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>>>>>
>>>>> and particularly the section entitled "the 'local field' pragma" for
>>>>> more details on this effort (starts at slide 34).
>>>>>
>>>>> In a nutshell, the Chapel compiler conservatively assumes that things
>>>>> are remote rather than local when in doubt (to emphasize correctness
>>>>> over fast-but-incorrect programs), and then gets into doubt far more
>>>>> often than it should. We're currently working on tightening up this
>>>>> gap.
>>>>>
>>>>> This could explain the full difference in performance that you're
>>>>> seeing, or something else may be happening. One way to check into
>>>>> this might be to run a --local vs. --no-local execution with
>>>>> CHPL_COMM=none to see how much overhead is added. The fact that all
>>>>> CPUs are pegged is a good indication that you don't have a problem
>>>>> with load balance or distributing data/computation across nodes, I'd
>>>>> guess?
>>>>>
>>>>> -Brad
>>>>>
>>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>>
>>>>>> I should add that I did supply --no-checks, and that helped about 10%.
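As a reference point for Brad's higher-level question above, here is a
minimal sketch of the two multi-locale idioms he contrasts; the array,
its size, and the work done per element are hypothetical:

    use BlockDist;

    config const n = 16;
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var A: [D] int;

    // Idiom 1: distributed arrays + foralls. Each iteration runs on the
    // locale that owns its index, so data placement is handled implicitly.
    forall i in D do
      A[i] = here.id;

    // Idiom 2: explicit task parallelism + on-clauses. The programmer
    // creates one task per locale and operates on that locale's chunk
    // directly (localSubdomain() assumed available, as above).
    coforall loc in Locales do on loc {
      for i in A.localSubdomain() do
        A[i] = here.id;
    }

    writeln(A);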
>>>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]> wrote:
>>>>>>>
>>>>>>> It says:
>>>>>>>
>>>>>>>   warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'.
>>>>>>>   If you want any specialization to occur please set
>>>>>>>   CHPL_TARGET_ARCH to a proper value.
>>>>>>>
>>>>>>> It's unclear which target arch is appropriate.
>>>>>>>
>>>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Brian --
>>>>>>>>
>>>>>>>> Getting --fast working should definitely be the first priority.
>>>>>>>> What about it fails to work?
>>>>>>>>
>>>>>>>> -Brad
>>>>>>>>
>>>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've been testing my search index on my 16-node ARM system and
>>>>>>>>> have been running into some strange behavior. The cool part is
>>>>>>>>> that the locale partitioning concept seems to work well; the
>>>>>>>>> downside is that the system is very slow. I've rewritten the
>>>>>>>>> approach a few different ways and haven't made a dent, so I
>>>>>>>>> wanted to ask a few questions.
>>>>>>>>>
>>>>>>>>> On the ARM processors, I can only use FIFO tasks and can't
>>>>>>>>> optimize (--fast doesn't work). Is this going to significantly
>>>>>>>>> affect cross-locale performance?
>>>>>>>>>
>>>>>>>>> I've looked at the generated C code and tried to minimize the
>>>>>>>>> _comm_ operations in core methods, but it doesn't seem to help.
>>>>>>>>> Network usage is still quite low (100K/s) while the CPUs are
>>>>>>>>> pegged. Are there any profiling tools I can use to understand
>>>>>>>>> what might be going on here?
>>>>>>>>>
>>>>>>>>> Generally, on my laptop or a single node, I can index about 1.1MM
>>>>>>>>> records in under 10s. With 16 nodes, it takes 10 minutes to do
>>>>>>>>> 100k records.
>>>>>>>>>
>>>>>>>>> I'm wondering if there's some systemic issue at play here and how
>>>>>>>>> I can investigate further.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Brian
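On Brian's profiling question: in addition to external tools, Chapel's
CommDiagnostics module can count the implicit communication operations
(gets, puts, forks) a multi-locale run performs, which helps localize
where the _comm_ calls in the generated C actually fire. A minimal
sketch; the distributed array and loop are hypothetical stand-ins for
the real indexing code:

    use CommDiagnostics, BlockDist;

    config const n = 100000;
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var counts: [D] int;

    startCommDiagnostics();

    // Iterating over a plain range runs entirely on the current locale,
    // so accesses to elements owned by other locales show up as remote
    // gets/puts in the diagnostics.
    forall i in 1..n do
      counts[i] += 1;

    stopCommDiagnostics();

    // One record per locale, reporting gets, puts, forks, etc.
    writeln(getCommDiagnostics());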
