Hi Brian -

I've seen errors like that when I don't re-run 'make' in the compiler directory after setting the CHPL_TARGET_ARCH environment variable. You need to run 'make' again because the runtime is built using the target architecture setting.
Of course, this and the CHPL_MEM errors could be bugs...

Hope that helps,
-michael

On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:

> Quick, partial follow-up:
>
> I recompiled Chapel with CHPL_TARGET_ARCH=native and hit the following
> error:
>
>   /home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19: /home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak: No such file or directory
>   make: *** No rule to make target `/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'. Stop.
>   error: compiling generated source
>
> I tweaked the gasnet folder names, but haven't gotten it to work yet.
> Subsequent attempts to run the compiled program resulted in "set CHPL_MEM
> to a more appropriate mem type". I tried setting the mem type to a few
> different choices, but that didn't help. Needs more investigation.
>
> I did manage to run some local --fast tests, though: the --local version
> ran in 5s while the --no-local version ran in about 12s.
>
> Additionally, I played around with the 'local' keyword and that also
> seems to make some difference, at least locally. I need to try this on
> the distributed version when I stand it back up.
>
> On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]> wrote:
>
>> For well-written programs, most of the --local vs. --no-local
>> differences show up as CPU overhead rather than network overhead. That
>> is, we tend not to do unnecessary communication; we simply execute
>> extra scalar code to determine that communication is unnecessary, and
>> the presence of this code hinders the back-end compiler's ability to
>> optimize the per-node computation.
>>
>> Here's an example: a given array access like A[i] may not know whether
>> the access is local or remote, so it will introduce
>> communication-related code to disambiguate. Even if that code doesn't
>> generate communication, it can be ugly enough to throw the back-end C
>> compiler off.
>>
>> Some workarounds to deal with this:
>>
>>  * the 'local' block (documented in doc/release/technotes/README.local)
>>    -- this is a big hammer and likely to be replaced with more
>>    data-centric capabilities going forward, but it can be helpful in
>>    the meantime if you can get it working in a chunk of code.
>>
>>  * I don't know how fully fleshed-out these features are, but there
>>    are at least draft capabilities for .localAccess() and .localSlice()
>>    methods on some array types to reduce overheads like the ones in
>>    my simple example above. I.e., if I know that A[i] is local for a
>>    given distributed array, A.localAccess(i) is likely to give better
>>    performance.
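>>
>> To make those concrete, here's a rough, untested sketch of both
>> workarounds (the array, its size, and the Block distribution are made
>> up purely for illustration):
>>
>>   use BlockDist;
>>
>>   config const n = 100000;
>>   const D = {1..n} dmapped Block(boundingBox={1..n});
>>   var A: [D] real;
>>
>>   forall i in D {
>>     // a plain A[i] carries a locality check when compiled --no-local;
>>     // if you know index i is owned by the current locale, the draft
>>     // localAccess capability skips that check:
>>     //   A.localAccess(i) = i;
>>     A[i] = i;
>>   }
>>
>>   // the 'local' block asserts that everything inside touches only
>>   // local data, so the compiler can drop the communication paths:
>>   on Locales[0] {
>>     local {
>>       A[1] = 3.14;  // element 1 is owned by locale 0 under this Block dist
>>     }
>>   }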
>>
>> But maybe I should start with a higher-level question: what kinds of
>> data structures does your code use, and what types of idioms do you
>> use to get multi-locale execution going? (E.g., distributed arrays +
>> foralls? Or more manually-distributed data structures + task
>> parallelism + on-clauses?)
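>>
>> (For reference, those two families look roughly like this -- again an
>> untested sketch with made-up names, and it assumes the localSubdomain()
>> query is available on the Block-distributed domain:)
>>
>>   use BlockDist;
>>
>>   const D = {1..1000} dmapped Block(boundingBox={1..1000});
>>   var A: [D] int;
>>
>>   // (a) distributed array + forall: each iteration runs on the locale
>>   //     that owns index i, so A[i] is a local access
>>   forall i in D do
>>     A[i] = here.id;
>>
>>   // (b) manually-distributed work: explicit task parallelism with
>>   //     on-clauses, one task per locale working on its own chunk
>>   coforall loc in Locales do on loc {
>>     for i in D.localSubdomain() do
>>       A[i] = here.id;
>>   }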
>>
>> Thanks,
>> -Brad
>>
>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>
>>> I was aware of the ongoing progress in optimizing the comm layer, but
>>> I'll take a look at your docs. I'll also give the --local vs.
>>> --no-local experiment a try.
>>>
>>> I tested the network layer and saw my nodes were operating near peak
>>> network capacity, so it wasn't a transport issue. Yes, the CPUs (of
>>> which there are 4 per node) were nearly fully pegged. Considering the
>>> level of complexity of the code, I suspect it was mostly overhead. I
>>> was even looking for a way to pin the execution to a single proc, as I
>>> wonder whether there was some kind of thrashing going on between
>>> procs. The funny thing was that the more I tried to optimize the
>>> program to do less network traffic, the slower it got.
>>>
>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]> wrote:
>>>
>>>> Hi Brian --
>>>>
>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you
>>>> better results as long as you're not cross-compiling. Alternatively,
>>>> you can set it to 'none', which will squash the warning you're
>>>> getting. In any case, I wouldn't expect the lack of --specialize
>>>> optimizations to be the problem here (but if you're passing
>>>> components of --fast manually, you'd want to be sure to add -O in
>>>> addition to --no-checks).
>>>>
>>>> Generally speaking, Chapel programs compiled for --no-local
>>>> (multi-locale execution) tend to generate much worse per-node code
>>>> than those compiled for --local (single-locale execution), and this
>>>> is an area of active optimization effort. See the "Performance
>>>> Optimizations and Generated Code Improvements" release note slides
>>>> at:
>>>>
>>>>   http://chapel.cray.com/download.html#releaseNotes
>>>>   http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>>>>
>>>> and particularly the section entitled "the 'local field' pragma" for
>>>> more details on this effort (starts at slide 34).
>>>>
>>>> In a nutshell, the Chapel compiler conservatively assumes that things
>>>> are remote rather than local when in doubt (to emphasize correctness
>>>> over fast-but-incorrect programs), and then gets into doubt far more
>>>> often than it should. We're currently working on tightening up this
>>>> gap.
>>>>
>>>> This could explain the full difference in performance that you're
>>>> seeing, or something else may be happening. One way to check into
>>>> this might be to run a --local vs. --no-local execution with
>>>> CHPL_COMM=none to see how much overhead is added. The fact that all
>>>> CPUs are pegged is a good indication that you don't have a problem
>>>> with load balance or with distributing data/computation across nodes,
>>>> I'd guess?
>>>>
>>>> -Brad
>>>>
>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>
>>>>> I should add that I did supply --no-checks and that helped about 10%.
>>>>>
>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]> wrote:
>>>>>
>>>>>> It says:
>>>>>>
>>>>>>   warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'.
>>>>>>   If you want any specialization to occur please set
>>>>>>   CHPL_TARGET_ARCH to a proper value.
>>>>>>
>>>>>> It's unclear which target arch is appropriate.
>>>>>>
>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Brian --
>>>>>>>
>>>>>>> Getting --fast working should definitely be the first priority.
>>>>>>> What about it fails to work?
>>>>>>>
>>>>>>> -Brad
>>>>>>>
>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've been testing my search index on my 16-node ARM system and
>>>>>>>> have been running into some strange behavior. The cool part is
>>>>>>>> that the locale partitioning concept seems to work well; the
>>>>>>>> downside is that the system is very slow. I've rewritten the
>>>>>>>> approach a few different ways and haven't made a dent, so I
>>>>>>>> wanted to ask a few questions.
>>>>>>>>
>>>>>>>> On the ARM processors, I can only use FIFO and can't optimize
>>>>>>>> (--fast doesn't work). Is this going to significantly affect
>>>>>>>> cross-locale performance?
>>>>>>>>
>>>>>>>> I've looked at the generated C code and tried to minimize the
>>>>>>>> _comm_ operations in core methods, but it doesn't seem to help.
>>>>>>>> Network usage is still quite low (100K/s) while the CPUs are
>>>>>>>> pegged. Are there any profiling tools I can use to understand
>>>>>>>> what might be going on here?
>>>>>>>>
>>>>>>>> Generally, on my laptop or a single node, I can index about 1.1MM
>>>>>>>> records in under 10s. With 16 nodes, it takes 10 minutes to do
>>>>>>>> 100k records.
>>>>>>>>
>>>>>>>> I'm wondering if there's some systemic issue at play here and how
>>>>>>>> I can investigate further.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Brian
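
(Regarding the profiling question above: if this Chapel build includes the
standard CommDiagnostics module, one lightweight way to see whether the
_comm_ operations really went down is to count remote operations around
the indexing phase. A minimal sketch, assuming the module's
start/stop/get routines:)

  use CommDiagnostics;

  startCommDiagnostics();
  // ... run the indexing phase being measured ...
  stopCommDiagnostics();

  // one entry per locale, with counts of remote gets, puts, etc.
  writeln(getCommDiagnostics());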
