Thanks for the tip, Adam. We briefly looked at Chronos a couple of days ago and felt it was overkill for us.
Just yesterday we were able to successfully distribute part of our workload across a 15-node mesos cluster (240 cores) using a modified mesos-submit framework we hacked from the examples into our distribution code (Python). We plan to explore the options further in the near future, probably re-examining things we skipped too fast (like Chronos). Hope I'll be able to share back what we come up with at the end.

Wish I could attend the upcoming MesosCon. I'm sure I would have picked up some useful stuff there. Will there be recordings of the sessions available?

For scheduling workflows of batch jobs, I would recommend looking into the Chronos framework for Mesos:

https://github.com/airbnb/chronos
http://mesosphere.io/learn/run-chronos-on-mesos/
http://nerds.airbnb.com/introducing-chronos/

On Thu, Jul 24, 2014 at 4:58 AM, Itamar Ostricher <ita...@yowza3d.com> wrote:

> Not written in MPI. Each task is a stand-alone execution of a binary
> program that takes 1-2 data file paths as parameters (GCS paths), with
> the output stored in another GCS file (path as flag).
> Different tasks do not need to communicate with others. Tasks only talk
> with GCS to read and write their data.
> The biggest bottleneck (for most tasks) is CPU. A few of the tasks do little
> processing, so in these cases the bottleneck is GCS latency.
>
> Our current solution is to run N services on each machine (N = number of
> cores on the machine), with the main Python script sending commands to
> available services (using sockets).
> We are not happy with this solution because it requires us to deal with
> too many low-level details, like tracking the status of the services,
> restarting lost tasks, collecting logs, etc.
>
>
> On Thu, Jul 24, 2014 at 11:37 AM, Tomas Barton <barton.to...@gmail.com>
> wrote:
>
>> Depends on the nature of your tasks. Your code is written in MPI? Your
>> tasks need to communicate with others? One task will operate on all files,
>> some subset, or just one file?
>> You might have:
>> - one task per machine running on as many cores as possible
>> - many smaller tasks starting in a dynamic manner depending on the
>>   data
>>
>> What is the biggest bottleneck you have? Disk read/write, network, CPU,
>> memory?
>>
>> Writing your own framework is possible, if you can take advantage of some
>> problem-specific property.
>>
>>
>> On 24 July 2014 07:34, Itamar Ostricher <ita...@yowza3d.com> wrote:
>>
>>> many: we have a processing pipeline with ~10 stages (one C++ program per
>>> stage usually), batch processing (almost-)all pairs of files in the
>>> dataset. The dataset contains >10K files at the moment, so a couple of
>>> hundred million program executions would be my definition for
>>> "many" in this case :-)
>>>
>>> I'll start with a few machines with deploy scripts and a small subset of
>>> the dataset just to get the hang of it.
>>> It's a bit difficult to comprehend the stack, with all the possible
>>> options and combinations, though.
>>> If I have a main Python script that generates all the processing
>>> pipeline commands (that can be simply executed via shell), should I use a
>>> specific framework (like Hydra)? Or maybe use raw mesos? Or maybe I should
>>> write my own framework?
>>>
>>>
>>> On Wed, Jul 23, 2014 at 2:25 PM, Tomas Barton <barton.to...@gmail.com>
>>> wrote:
>>>
>>>> Define many :) If you want to use some provisioning tools like Puppet,
>>>> Chef, Ansible... there are quite a few modules to do this job:
>>>>
>>>> http://mesosphere.io/learn/#tools
>>>>
>>>> If you have only a few machines, you might be fine with deploy scripts.
>>>>
>>>> An example of an MPI framework is here:
>>>>
>>>> https://github.com/mesosphere/mesos-hydra
>>>>
>>>>
>>>>
>>>>
>>>> On 23 July 2014 12:26, Itamar Ostricher <ita...@yowza3d.com> wrote:
>>>>
>>>>> Thanks Tomas.
>>>>>
>>>>> ldconfig didn't change anything. make still failed.
>>>>>
>>>>> But the Debian package installed like a charm, so I'm good :-)
>>>>> Now I just need to figure out how to use it...
>>>>> (going to start with [1], unless anyone chimes in with a better
>>>>> recommended starting point for a mesos-newbie who is trying to set up a
>>>>> cluster of GCE instances in order to distribute execution of *many* C++
>>>>> programs working on a large dataset that is currently stored in Google
>>>>> Cloud Storage.)
>>>>>
>>>>> [1] http://mesos.apache.org/documentation/latest/deploy-scripts/
>>>>>
>>>>>
>>>>> On Wed, Jul 23, 2014 at 11:55 AM, Tomas Barton <barton.to...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> that's quite strange. Try to run
>>>>>>
>>>>>> ldconfig
>>>>>>
>>>>>> and then again make.
>>>>>>
>>>>>> You can find binary packages for Debian here:
>>>>>> http://mesosphere.io/downloads/
>>>>>>
>>>>>> Tomas
>>>>>>
>>>>>>
>>>>>> On 23 July 2014 10:09, Itamar Ostricher <ita...@yowza3d.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to do a clean build of mesos from the 0.19.0 tarball.
>>>>>>> I was following the instructions from
>>>>>>> http://mesos.apache.org/gettingstarted/ step by step. Got to
>>>>>>> running `make`, which ran for quite a while, and exited with errors (see
>>>>>>> the end of the output below).
>>>>>>>
>>>>>>> Extra env info: I'm trying to do this build on a 64-bit Debian GCE
>>>>>>> instance:
>>>>>>> itamar@mesos-test-1:/tmp/mesos-0.19.0/build$ uname -a
>>>>>>> Linux mesos-test-1 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64
>>>>>>> GNU/Linux
>>>>>>>
>>>>>>> Assistance will be much appreciated!
>>>>>>> Alternatively, I don't mind using precompiled binaries, if anyone
>>>>>>> can point me in the direction of such binaries for the GCE environment I
>>>>>>> described :-)
>>>>>>>
>>>>>>> tail of make output:
>>>>>>> ----------------------------
>>>>>>>
>>>>>>> libtool: link: warning:
>>>>>>> `/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../lib/libgflags.la'
>>>>>>> seems to be moved
>>>>>>> *** Warning: Linking the shared library libmesos.la against the
>>>>>>> *** static library ../3rdparty/leveldb/libleveldb.a is not portable!
>>>>>>> libtool: link: warning:
>>>>>>> `/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../lib/libgflags.la'
>>>>>>> seems to be moved
>>>>>>> libtool: link: g++ -fPIC -DPIC -shared -nostdlib
>>>>>>> /usr/lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu/crti.o
>>>>>>> /usr/lib/gcc/x86_64-linux-gnu/4.7/crtbeginS.o -Wl,--whole-archive
>>>>>>> ./.libs/libmesos_no_3rdparty.a ../3rdparty/libprocess/.libs/libprocess.a
>>>>>>> ./.libs/libjava.a -Wl,--no-whole-archive
>>>>>>> ../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src/.libs/libprotobuf.a
>>>>>>> ../3rdparty/libprocess/3rdparty/glog-0.3.3/.libs/libglog.a
>>>>>>> -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../lib
>>>>>>> ../3rdparty/leveldb/libleveldb.a
>>>>>>> ../3rdparty/zookeeper-3.4.5/src/c/.libs/libzookeeper_mt.a
>>>>>>> /tmp/mesos-0.19.0/build/3rdparty/libprocess/3rdparty/glog-0.3.3/.libs/libglog.a
>>>>>>> /usr/lib/libgflags.so -lpthread
>>>>>>> /tmp/mesos-0.19.0/build/3rdparty/libprocess/3rdparty/libev-4.15/.libs/libev.a
>>>>>>> -lsasl2 /usr/lib/x86_64-linux-gnu/libcurl-nss.so -lz -lrt
>>>>>>> -L/usr/lib/gcc/x86_64-linux-gnu/4.7
>>>>>>> -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu
>>>>>>> -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu
>>>>>>> -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../..
>>>>>>> -lstdc++
>>>>>>> -lm
>>>>>>> -lc -lgcc_s /usr/lib/gcc/x86_64-linux-gnu/4.7/crtendS.o
>>>>>>> /usr/lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu/crtn.o
>>>>>>> -pthread -Wl,-soname -Wl,libmesos-0.19.0.so -o .libs/
>>>>>>> libmesos-0.19.0.so
>>>>>>> libtool: link: (cd ".libs" && rm -f "libmesos.so" && ln -s "
>>>>>>> libmesos-0.19.0.so" "libmesos.so")
>>>>>>> libtool: link: ( cd ".libs" && rm -f "libmesos.la" && ln -s "../
>>>>>>> libmesos.la" "libmesos.la" )
>>>>>>> g++ -DPACKAGE_NAME=\"mesos\" -DPACKAGE_TARNAME=\"mesos\"
>>>>>>> -DPACKAGE_VERSION=\"0.19.0\" -DPACKAGE_STRING=\"mesos\ 0.19.0\"
>>>>>>> -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\"
>>>>>>> -DVERSION=\"0.19.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1
>>>>>>> -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1
>>>>>>> -DHAVE_MEMORY_H=1
>>>>>>> -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1
>>>>>>> -DHAVE_UNISTD_H=1
>>>>>>> -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_PTHREAD=1
>>>>>>> -DMESOS_HAS_JAVA=1
>>>>>>> -DHAVE_PYTHON=\"2.7\" -DMESOS_HAS_PYTHON=1 -DHAVE_LIBZ=1
>>>>>>> -DHAVE_LIBCURL=1
>>>>>>> -DHAVE_LIBSASL2=1 -I.
>>>>>>> -I../../src -Wall -Werror
>>>>>>> -DLIBDIR=\"/usr/local/lib\" -DPKGLIBEXECDIR=\"/usr/local/libexec/mesos\"
>>>>>>> -DPKGDATADIR=\"/usr/local/share/mesos\" -I../../include
>>>>>>> -I../../3rdparty/libprocess/include
>>>>>>> -I../../3rdparty/libprocess/3rdparty/stout/include -I../include
>>>>>>> -I../3rdparty/libprocess/3rdparty/boost-1.53.0
>>>>>>> -I../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src
>>>>>>> -I../3rdparty/libprocess/3rdparty/picojson-4f93734
>>>>>>> -I../3rdparty/libprocess/3rdparty/glog-0.3.3/src
>>>>>>> -I../3rdparty/leveldb/include
>>>>>>> -I../3rdparty/zookeeper-3.4.5/src/c/include
>>>>>>> -I../3rdparty/zookeeper-3.4.5/src/c/generated -pthread -g -g2 -O2 -MT
>>>>>>> local/mesos_local-main.o -MD -MP -MF local/.deps/mesos_local-main.Tpo
>>>>>>> -c -o
>>>>>>> local/mesos_local-main.o `test -f 'local/main.cpp' || echo
>>>>>>> '../../src/'`local/main.cpp
>>>>>>> mv -f local/.deps/mesos_local-main.Tpo
>>>>>>> local/.deps/mesos_local-main.Po
>>>>>>> /bin/bash ../libtool --tag=CXX --mode=link g++ -pthread -g -g2
>>>>>>> -O2 -o mesos-local local/mesos_local-main.o libmesos.la -lsasl2
>>>>>>> -lcurl -lz -lrt
>>>>>>> libtool: link: g++ -pthread -g -g2 -O2 -o .libs/mesos-local
>>>>>>> local/mesos_local-main.o ./.libs/libmesos.so /usr/lib/libgflags.so
>>>>>>> -lpthread -lsasl2 /usr/lib/x86_64-linux-gnu/libcurl-nss.so -lz -lrt
>>>>>>> -pthread
>>>>>>> ./.libs/libmesos.so: error: undefined reference to 'dlopen'
>>>>>>> ./.libs/libmesos.so: error: undefined reference to 'dlsym'
>>>>>>> ./.libs/libmesos.so: error: undefined reference to 'dlerror'
>>>>>>> collect2: error: ld returned 1 exit status
>>>>>>> make[2]: *** [mesos-local] Error 1
>>>>>>> make[2]: Leaving directory `/tmp/mesos-0.19.0/build/src'
>>>>>>> make[1]: *** [all] Error 2
>>>>>>> make[1]: Leaving directory `/tmp/mesos-0.19.0/build/src'
>>>>>>> make: *** [all-recursive] Error 1
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
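
For readers following the thread: the workload described above (one stand-alone binary run per pair of input files, with the output path passed as a flag) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the code anyone in the thread actually used. The binary name `./stage1`, the `--output=` flag, the `gs://` paths and the output naming scheme are all hypothetical.

```python
import itertools
import posixpath


def pair_count(num_files):
    """Number of unordered pairs among num_files distinct files."""
    return num_files * (num_files - 1) // 2


def stage_commands(binary, input_paths, output_prefix):
    """Lazily yield one argv list per unordered pair of input files.

    The flag name and output naming below are hypothetical; adapt them
    to whatever the real stage binary expects.
    """
    for first, second in itertools.combinations(sorted(input_paths), 2):
        out_name = "{}__{}.out".format(
            posixpath.basename(first), posixpath.basename(second))
        out_path = posixpath.join(output_prefix, out_name)
        yield [binary, first, second, "--output=" + out_path]


if __name__ == "__main__":
    files = ["gs://bucket/c.dat", "gs://bucket/a.dat", "gs://bucket/b.dat"]
    for argv in stage_commands("./stage1", files, "gs://out/stage1"):
        print(" ".join(argv))
    print(pair_count(10000))  # 49995000 pairs per stage for 10,000 files
```

Because the generator is lazy, the full list of commands never has to sit in memory; each argv list can be handed to whatever launches tasks (a local process pool, or a Mesos framework such as the modified mesos-submit mentioned at the top of the thread).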