Hi Jesse,

I didn't provide more details on this. The input CAD data is small (though more like 20-50 MB per file), but as I said there is a lot of very IO-bound data processing involved, which produces a large amount of temporary data (still not Big Data; rather a few hundred MB, up to a few GB at most).
This is about distributing many of these local data-processing tasks to several nodes, in order to provide a scalable real-time service for users of a web site. So I'd mostly use Beam as a building block for distributing and monitoring jobs, rather than for anything Big Data.

Thanks
Ben

Sent from my iPad

On 23.05.2016, at 21:59, Jesse Anderson <[email protected]> wrote:

Benjamin,

Sorry, the successes and failures are a bit too nuanced for an email.

A quick check on average CAD files says they're around 1 MB. That'd be a poor use of HDFS.

Thanks,

Jesse

On Mon, May 23, 2016 at 11:08 AM Stadin, Benjamin <[email protected]> wrote:

Hi Jesse,

Yes, this is what I'm looking for. I want to deploy and run the same code, mostly written in Python as well as C++, on different nodes. I also want to benefit from the job distribution and job monitoring / administration capabilities. I only need parallelization to a minor degree later.

Though I'm hesitant to use HDFS, or any other distributed file system. Since I process the data only on one node, it would probably be a big disadvantage for this data to be distributed to other nodes as well via HDFS.

Could you maybe share some info about successful implementations and configurations of such a distributed job engine?

Thanks
Ben

From: Jesse Anderson <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, 23 May 2016 at 19:22
To: "[email protected]" <[email protected]>
Subject: Re: Force pipe executions to run on same node

Benjamin,

I've had a few students use Big Data frameworks as a distributed job engine, with varying degrees of success. With Beam, your success will really depend on the runner, as JB said.

If I understand your use case correctly, if you were using Hadoop MapReduce, you'd be using a map-only job. Beam would give you the ability to run the same code on several different execution engines. If that isn't your goal, you might look elsewhere.

Thanks,

Jesse

On Mon, May 23, 2016 at 6:47 AM Jean-Baptiste Onofré <[email protected]> wrote:

Hi Benjamin,

Your data processing doesn't seem to be fully big-data oriented and distributed. Maybe Apache Camel is more appropriate for such a scenario. You can always delegate part of the data processing to Beam from Camel (using a Kafka topic, for instance).

Regards
JB

On 05/22/2016 11:01 PM, Stadin, Benjamin wrote:
> Hi JB,
>
> None so far. I'm still thinking about how to achieve what I want to do,
> and whether Beam makes sense for my usage scenario.
>
> I'm mostly interested in just orchestrating tasks across individual
> machines and service endpoints, depending on their workload. My
> application is not so much about Big Data and parallelism, but about
> local data processing and local parallelization.
>
> An example scenario:
> - A user uploads a set of CAD files.
> - Data from the CAD files is extracted in parallel.
> - A whole bunch of native tools operate on this extracted data set in a
>   pipe of their own. Due to the amount of data generated and consumed,
>   it doesn't make sense at all to distribute these tasks to other
>   machines. It's very IO bound.
> - For the same reason, it doesn't make sense to distribute data using an
>   RDD. It's rather favorable to do only some tasks (such as CAD data
>   extraction) in parallel, and otherwise run the other data tasks as a
>   group on a single node, in order to avoid IO bottlenecks.
>
> So I don't have typical Big Data processing in mind. What I'm looking
> for is rather an integrated environment that provides some kind of
> parallel task execution, task management and administration, as well as
> a message bus and event system.
>
> Is Beam a choice for such a rather non-Big-Data scenario?
>
> Regards,
> Ben
>
>
> On 21.05.16, 18:59, "Jean-Baptiste Onofré" <[email protected]> wrote:
>
>> Hi Ben,
>>
>> It's not SDK related; it depends more on the runner.
>>
>> Which runner are you using?
>>
>> Regards
>> JB
>>
>> On 05/21/2016 04:22 PM, Stadin, Benjamin wrote:
>>> Hi,
>>>
>>> I need to control Beam pipes/filters so that pipe executions that
>>> match a certain criterion are executed on the same node.
>>>
>>> In Spring XD this can be controlled by defining groups
>>> (http://docs.spring.io/spring-xd/docs/1.2.0.RELEASE/reference/html/#deployment)
>>> and then specifying deployment criteria to match this group.
>>>
>>> Is this possible with Beam?
>>>
>>> Best
>>> Ben
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>

--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
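For illustration, a minimal sketch of the map-only shape Jesse describes, in the Beam Python SDK. The file paths and the "cad-extract" command it shells out to are hypothetical stand-ins for the native toolchain; since there is no shuffle after the ParDo, each file's IO-heavy processing stays on whichever worker picks it up.

# Map-only sketch; "cad-extract" and the paths are hypothetical.
import subprocess

import apache_beam as beam


class ProcessCadFile(beam.DoFn):
    """Runs the native, IO-bound toolchain for one file on the local worker."""

    def process(self, path):
        # Hypothetical native tool; temporary data stays on this worker's
        # local disk because no shuffle follows this step.
        subprocess.run(["cad-extract", path], check=True)
        yield path


with beam.Pipeline() as p:
    (p
     | "ListUploads" >> beam.Create(["/uploads/a.dwg", "/uploads/b.dwg"])
     | "ProcessLocally" >> beam.ParDo(ProcessCadFile()))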

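One possible way to approximate Spring XD's deployment groups in Beam, sketched as an assumption rather than an established recipe: key each task by an upload/group id and apply GroupByKey. All values for one key are delivered together to a single process call on one worker, so the grouped, IO-bound steps for one upload run on the same machine and can share its local disk. The group ids and task names below are made up for illustration.

# Co-location sketch via GroupByKey; ids and task names are hypothetical.
import apache_beam as beam


def run_group_locally(element):
    """Receives every task for one upload id together, on one worker."""
    group_id, tasks = element
    for task in sorted(tasks):
        # Hypothetical: run each IO-bound step against this worker's
        # local disk, without shipping intermediate data between nodes.
        print("running %s for %s" % (task, group_id))
    yield group_id


with beam.Pipeline() as p:
    (p
     | beam.Create([("upload-1", "extract"),
                    ("upload-1", "convert"),
                    ("upload-2", "extract")])
     | beam.GroupByKey()  # all tasks for one key land on one worker
     | beam.FlatMap(run_group_locally))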