On Mar 14, 2014, at 9:06, Emmanuel Bernard <emman...@hibernate.org> wrote:
>
>
>> On 13 mars 2014, at 23:39, Sanne Grinovero <sa...@infinispan.org> wrote:
>>
>>> On 13 March 2014 22:19, Mircea Markus <mmar...@redhat.com> wrote:
>>>
>>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <sa...@infinispan.org> wrote:
>>>>
>>>>> On 13 March 2014 22:05, Mircea Markus <mmar...@redhat.com> wrote:
>>>>>
>>>>> On Mar 13, 2014, at 20:59, Ales Justin <ales.jus...@gmail.com> wrote:
>>>>>
>>>>>>> - also important to notice that we will have both an Hadoop and an
>>>>>>> Infinispan cluster running in parallel: the user will interact with the
>>>>>>> former in order to run M/R tasks. Hadoop will use Infinispan
>>>>>>> (integration achieved through InputFormat and OutputFormat ) in order
>>>>>>> to get the data to be processed.
>>>>>>
>>>>>> Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as
>>>>>> well -- hence 1JVM?
>>>>>
>>>>> good point, ideally it should be a single VM: reduced serialization cost
>>>>> (in vm access) and simpler architecture. That's if you're not using C/S
>>>>> mode, of course.
>>>>
>>>> ?
>>>> Don't try confusing us again on that :-)
>>>> I think we agreed that the job would *always* run in strict locality
>>>> with the datacontainer (i.e. in the same JVM). Sure, an Hadoop client
>>>> would be connecting from somewhere else but that's unrelated.
>>>
>>> we did discuss the possibility of running it over hotrod though, do you see
>>> a problem with that?
>>
>> No of course not, we discussed that. I just mean I think that needs to
>> be clarified on the list that the Hadoop engine will always run in the
>> same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
>> native clients, or Hadoop clients over Hot Rod) can indeed connect
>> remotely, but it's important to clarify that the processing itself
>> will take advantage of locality in all configurations. In other words,
>> to clarify that the serialization cost you mention for clients is just
>> to transfer the job definition and optionally the final processing
>> result.
>>
>
> Not quite. The serialization cost Mircea mentions I think is between the
> Hadoop vm and the Infinispan vm on a single node. The serialization does not
> require network traffic but is still shuffling data between two processes
> basically. We could eliminate this by starting both Hadoop and Infinispan
> from the same VM but that requires more work than necessary for a prototype.

thanks for the clarification, indeed this is the serialization overhead I had in mind.

>
> So to clarify, we will have a cluster of nodes where each node contains two
> JVM, one running an Hadoop process, one running an Infinispan process. The
> Hadoop process would only read the data from the Infinispan process in the
> same node during a normal M/R execution.

Cheers,
-- 
Mircea Markus
Infinispan lead (www.infinispan.org)

_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev