Re: [infinispan-dev] Infinispan - Hadoop integration

Mircea Markus Fri, 14 Mar 2014 08:26:43 -0700

On Mar 14, 2014, at 9:06, Emmanuel Bernard <emman...@hibernate.org> wrote:


> 
> 
>> On 13 mars 2014, at 23:39, Sanne Grinovero <sa...@infinispan.org> wrote:
>> 
>>> On 13 March 2014 22:19, Mircea Markus <mmar...@redhat.com> wrote:
>>> 
>>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <sa...@infinispan.org> wrote:
>>>> 
>>>>> On 13 March 2014 22:05, Mircea Markus <mmar...@redhat.com> wrote:
>>>>> 
>>>>> On Mar 13, 2014, at 20:59, Ales Justin <ales.jus...@gmail.com> wrote:
>>>>> 
>>>>>>> - also important to notice that we will have both an Hadoop and an 
>>>>>>> Infinispan cluster running in parallel: the user will interact with the 
>>>>>>> former in order to run M/R tasks. Hadoop will use Infinispan 
>>>>>>> (integration achieved through InputFormat and OutputFormat ) in order 
>>>>>>> to get the data to be processed.
>>>>>> 
>>>>>> Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as 
>>>>>> well -- hence 1JVM?
>>>>> 
>>>>> good point, ideally it should be a single VM: reduced serialization cost 
>>>>> (in vm access) and simpler architecture. That's if you're not using C/S 
>>>>> mode, of course.
>>>> 
>>>> ?
>>>> Don't try confusing us again on that :-)
>>>> I think we agreed that the job would *always* run in strict locality
>>>> with the datacontainer (i.e. in the same JVM). Sure, an Hadoop client
>>>> would be connecting from somewhere else but that's unrelated.
>>> 
>>> we did discuss the possibility of running it over hotrod though, do you see 
>>> a problem with that?
>> 
>> No of course not, we discussed that. I just mean I think that needs to
>> be clarified on the list that the Hadoop engine will always run in the
>> same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
>> native clients, or Hadoop clients over Hot Rod) can indeed connect
>> remotely, but it's important to clarify that the processing itself
>> will take advantage of locality in all configurations. In other words,
>> to clarify that the serialization cost you mention for clients is just
>> to transfer the job definition and optionally the final processing
>> result.
>> 
> 
> Not quite. The serialization cost Mircea mentions I think is between the 
> Hadoop vm and the Infinispan vm on a single node. The serialization does not 
> require network traffic but is still shuffling data between two processes 
> basically. We could eliminate this by starting both Hadoop and Infinispan 
> from the same VM but that requires more work than necessary for a prototype.

thanks for the clarification, indeed this is the serialization overhead I had 
in mind.

> 
> So to clarify, we will have a cluster of nodes where each node contains two 
> JVM, one running an Hadoop process, one running an Infinispan process. The 
> Hadoop process would only read the data from the Infinispan process in the 
> same node during a normal M/R execution. 

Cheers,
-- 
Mircea Markus
Infinispan lead (www.infinispan.org)





_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] Infinispan - Hadoop integration

Reply via email to