Re: [infinispan-dev] Infinispan - Hadoop integration

Emmanuel Bernard Mon, 17 Mar 2014 08:32:39 -0700

Got it now.
That being said, if Alan is correct (one JVM per M/R task run per node), we 
will need to implement C/S local key and keyset lookup.
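
To make "local key and keyset lookup" concrete, here is a minimal sketch in
plain Java. It is not an Infinispan or Hot Rod API: LocalKeyLookup, ownerOf,
isLocal and localKeys are hypothetical names, and ownership is assumed to be
a simple hash modulo over the node list, standing in for whatever topology
information the server would really expose.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not an Infinispan or Hot Rod API.
final class LocalKeyLookup {

    // Assumption: a key's owner is chosen by a plain modulo of its hash over
    // the node list. A real data grid would consult its own topology instead.
    static String ownerOf(Object key, List<String> nodes) {
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    // Local key lookup: is this key owned by the node we are running on?
    static boolean isLocal(Object key, List<String> nodes, String localNode) {
        return ownerOf(key, nodes).equals(localNode);
    }

    // Local keyset lookup: the sorted subset of keys owned by localNode.
    static <K extends Comparable<K>> Set<K> localKeys(
            Iterable<K> keys, List<String> nodes, String localNode) {
        Set<K> result = new TreeSet<>();
        for (K key : keys) {
            if (isLocal(key, nodes, localNode)) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-a", "node-b");
        List<Integer> keys = List.of(0, 1, 2, 3, 4, 5);
        System.out.println(localKeys(keys, nodes, "node-a")); // prints [0, 2, 4]
        System.out.println(localKeys(keys, nodes, "node-b")); // prints [1, 3, 5]
    }
}
```

An M/R task running in its own JVM would feed only the keys returned by
localKeys for its node into the job, so each entry is read from the
Infinispan process on the same machine.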


Emmanuel

On 14 Mar 2014, at 12:34, Sanne Grinovero <sa...@infinispan.org> wrote:

> On 14 March 2014 09:06, Emmanuel Bernard <emman...@hibernate.org> wrote:
>> 
>> 
>>> On 13 March 2014, at 23:39, Sanne Grinovero <sa...@infinispan.org> wrote:
>>> 
>>>> On 13 March 2014 22:19, Mircea Markus <mmar...@redhat.com> wrote:
>>>> 
>>>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <sa...@infinispan.org> wrote:
>>>>> 
>>>>>> On 13 March 2014 22:05, Mircea Markus <mmar...@redhat.com> wrote:
>>>>>> 
>>>>>> On Mar 13, 2014, at 20:59, Ales Justin <ales.jus...@gmail.com> wrote:
>>>>>> 
>>>>>>>> - also important to notice that we will have both an Hadoop and an 
>>>>>>>> Infinispan cluster running in parallel: the user will interact with 
>>>>>>>> the former in order to run M/R tasks. Hadoop will use Infinispan 
>>>>>>>> (integration achieved through InputFormat and OutputFormat ) in order 
>>>>>>>> to get the data to be processed.
>>>>>>> 
>>>>>>> Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as 
>>>>>>> well -- hence 1JVM?
>>>>>> 
>>>>>> good point, ideally it should be a single VM: reduced serialization cost 
>>>>>> (in vm access) and simpler architecture. That's if you're not using C/S 
>>>>>> mode, of course.
>>>>> 
>>>>> ?
>>>>> Don't try confusing us again on that :-)
>>>>> I think we agreed that the job would *always* run in strict locality
>>>>> with the datacontainer (i.e. in the same JVM). Sure, an Hadoop client
>>>>> would be connecting from somewhere else but that's unrelated.
>>>> 
>>>> we did discuss the possibility of running it over hotrod though, do you 
>>>> see a problem with that?
>>> 
>>> No of course not, we discussed that. I just mean I think that needs to
>>> be clarified on the list that the Hadoop engine will always run in the
>>> same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
>>> native clients, or Hadoop clients over Hot Rod) can indeed connect
>>> remotely, but it's important to clarify that the processing itself
>>> will take advantage of locality in all configurations. In other words,
>>> to clarify that the serialization cost you mention for clients is just
>>> to transfer the job definition and optionally the final processing
>>> result.
>>> 
>> 
>> Not quite. The serialization cost Mircea mentions I think is between the 
>> Hadoop vm and the Infinispan vm on a single node. The serialization does not 
>> require network traffic but is still shuffling data between two processes 
>> basically. We could eliminate this by starting both Hadoop and Infinispan 
>> from the same VM but that requires more work than necessary for a prototype.
> 
> Ok so there was indeed confusion on terminology: I don't agree with that 
> design.
> From an implementor's effort perspective, having to set up a Hot Rod
> client rather than embedding an Infinispan node is approximately the
> same work, or slightly more as you have to start both. Also, embedded
> mode is easier to test.
> 
> Hot Rod is not meant to be used on the same node, especially not if
> you only want to access data in strict locality; for example it
> wouldn't be able to iterate on all keys of the current server node
> (limited to those keys only). I might be wrong as I'm not too
> familiar with Hot Rod, but I think it might not even be able to
> iterate on keys at all; maybe today it can actually via some trick,
> but the point is this is a conceptual mismatch for it.
> 
> Where you say this doesn't require network traffic, you need to consider
> that while it may not use the physical network wire, being on
> localhost, the data would still be transferred over a costly network
> stream, as we don't do off-heap buffer sharing yet.
> 
>> So to clarify, we will have a cluster of nodes where each node contains two 
>> JVMs, one running an Hadoop process, one running an Infinispan process. The 
>> Hadoop process would only read the data from the Infinispan process in the 
>> same node during a normal M/R execution.
> 
> So we discussed two use cases:
> - engage Infinispan to accelerate an existing Hadoop deployment
> - engage Hadoop to run an Hadoop job on existing data in Infinispan
> In neither case do I see why I'd run them in separate JVMs: seems less
> effective and more work to get done, and no benefit unless you're
> thinking about independent JVM tuning? That might be something to
> consider, but I doubt tuning independence would ever offset the cost
> of serialized transfer of each entry.
> 
> The second use case could be used via Hot Rod too, but that's a
> different discussion, actually just a nice side effect of Hadoop being
> language agnostic that we would take advantage of.
> 
> Sanne
> 
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev@lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> 
