Welcome to the community, Richard! I suspect Hadoop can be useful for more than just splitting data and stitching the results back together. Depending on your use cases, it may come in handy for managing your machines, restarting failed tasks, scheduling work when data becomes available, etc. I wouldn't necessarily count it out. I'm sorry, I am not familiar with Celery, so I can't provide a direct comparison. Also, in the fairly likely event that your input data grows, you wouldn't have to rewrite your infrastructure code if you wrote your Hadoop code properly.
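For reference, the core of what you describe (call a few executables in sequence per image, process a few images at a time, retry a failed step) can be sketched with nothing but Python's standard library. This is only a minimal single-machine sketch under my own assumptions, not Hadoop or Celery code: the function names (run_stage, run_pipeline, process_all) are made up for illustration, and each entry in `stages` stands in for one of your real executables, which I don't know. It does not distribute work across machines or move data around, which is the part a scheduler such as YARN or a work queue would have to cover.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_stage(command, image_path, max_retries=2):
    """Run one executable on one image, retrying if it exits non-zero.

    `command` is a list such as ["some_tool", "--flag"] (hypothetical);
    the image path is appended as the last argument.
    """
    for attempt in range(max_retries + 1):
        result = subprocess.run(command + [image_path])
        if result.returncode == 0:
            return attempt + 1  # number of attempts that were needed
    raise RuntimeError(f"{command} failed on {image_path}")


def run_pipeline(image_path, stages, max_retries=2):
    """Call each stage in sequence; a stage runs only if the previous one succeeded."""
    for command in stages:
        run_stage(command, image_path, max_retries)
    return image_path


def process_all(image_paths, stages, workers=3):
    """Process several images concurrently; `workers` caps simultaneous pipelines.

    Threads are sufficient here because the heavy work happens in the
    child processes, not in Python itself.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda path: run_pipeline(path, stages), image_paths))
```

A call would look something like process_all(["scan1.png", "scan2.png"], [["step1.exe"], ["step2.exe"]]), where the file and executable names are placeholders. If any stage keeps failing after its retries, the RuntimeError for that image surfaces when the results are collected.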
HTH
Ravi

On Mon, Jul 18, 2016 at 9:23 AM, Marcin Tustin <mtus...@handybook.com> wrote:

> I think you're confused as to what these things are.
>
> The fundamental question is do you want to run one job on sub parts of the
> data, then stitch their results together (in which case
> hive/map-reduce/spark will be for you), or do you essentially already have
> splitting to computer-sized chunks figured out, and you just need a work
> queue? In the latter case there are a number of alternatives. I happen to
> like python, and would recommend celery (potentially wrapped by something
> like airflow) for that case.
>
> On Mon, Jul 18, 2016 at 12:17 PM, Richard Whitehead <
> richard.whiteh...@ieee.org> wrote:
>
>> Hello,
>>
>> I wonder if the community can help me get started.
>>
>> I’m trying to design the architecture of a project and I think that using
>> some Apache Hadoop technologies may make sense, but I am completely new to
>> distributed systems and to Apache (I am a very experienced developer, but
>> my expertise is image processing on Windows!).
>>
>> The task is very simple: call 3 or 4 executables in sequence to process
>> some data. The data is just a simple image and the processing takes tens
>> of minutes.
>>
>> We are considering a distributed architecture to increase throughput
>> (latency does not matter). So we need a way to queue work on remote
>> computers, and a way to move the data around. The architecture will have
>> to work on a single server, or on a couple of servers in a rack, or in the
>> cloud; 2 or 3 computers maximum.
>>
>> Being new to all this I would prefer something simple rather than
>> something super-powerful.
>>
>> I was considering Hadoop YARN and Hadoop DFS, does this make sense? I’m
>> assuming MapReduce would be over the top, is that the case?
>>
>> Thanks in advance.
>>
>> Richard