Got it now. That being said, if Alan is correct (one JVM per M/R task run per node), we will need to implement C/S local key and keyset lookup.
Emmanuel On 14 Mar 2014, at 12:34, Sanne Grinovero <sa...@infinispan.org> wrote: > On 14 March 2014 09:06, Emmanuel Bernard <emman...@hibernate.org> wrote: >> >> >>> On 13 mars 2014, at 23:39, Sanne Grinovero <sa...@infinispan.org> wrote: >>> >>>> On 13 March 2014 22:19, Mircea Markus <mmar...@redhat.com> wrote: >>>> >>>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <sa...@infinispan.org> wrote: >>>>> >>>>>> On 13 March 2014 22:05, Mircea Markus <mmar...@redhat.com> wrote: >>>>>> >>>>>> On Mar 13, 2014, at 20:59, Ales Justin <ales.jus...@gmail.com> wrote: >>>>>> >>>>>>>> - also important to notice that we will have both an Hadoop and an >>>>>>>> Infinispan cluster running in parallel: the user will interact with >>>>>>>> the former in order to run M/R tasks. Hadoop will use Infinispan >>>>>>>> (integration achieved through InputFormat and OutputFormat ) in order >>>>>>>> to get the data to be processed. >>>>>>> >>>>>>> Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as >>>>>>> well -- hence 1JVM? >>>>>> >>>>>> good point, ideally it should be a single VM: reduced serialization cost >>>>>> (in vm access) and simpler architecture. That's if you're not using C/S >>>>>> mode, of course. >>>>> >>>>> ? >>>>> Don't try confusing us again on that :-) >>>>> I think we agreed that the job would *always* run in strict locality >>>>> with the datacontainer (i.e. in the same JVM). Sure, an Hadoop client >>>>> would be connecting from somewhere else but that's unrelated. >>>> >>>> we did discuss the possibility of running it over hotrod though, do you >>>> see a problem with that? >>> >>> No of course not, we discussed that. I just mean I think that needs to >>> be clarified on the list that the Hadoop engine will always run in the >>> same JVM. Clients (be it Hot Rod via new custom commands or Hadoop >>> native clients, or Hadoop clients over Hot Rod) can indeed connect >>> remotely, but it's important to clarify that the processing itself >>> will take advantage of locality in all configurations. In other words, >>> to clarify that the serialization cost you mention for clients is just >>> to transfer the job definition and optionally the final processing >>> result. >>> >> >> Not quite. The serialization cost Mircea mentions I think is between the >> Hadoop vm and the Infinispan vm on a single node. The serialization does not >> require network traffic but is still shuffling data between two processes >> basically. We could eliminate this by starting both Hadoop and Infinispan >> from the same VM but that requires more work than necessary for a prototype. > > Ok so there was indeed confusion on terminology: I don't agree with that > design. >> From an implementor's effort perspective having to setup an Hot Rod > client rather than embedding an Infinispan node is approximately the > same work, or slightly more as you have to start both. Also to test > it, embedded mode it easier. > > Hot Rod is not meant to be used on the same node, especially not if > you only want to access data in strict locality; for example it > wouldn't be able to iterated on all keys of the current server node > (and limiting to those keys only). I might be wrong as I'm not too > familiar with Hot Rod, but I think it might not even be able to > iterate on keys at all; maybe today it can actually via some trick, > but the point is this is a conceptual mismatch for it. > > Where you say this doesn't require nework traffic you need to consider > that while it's true this might not be using the physical network wire > being localhost, it would still be transferred over a costly network > stream, as we don't do off-heap buffer sharing yet. > >> So to clarify, we will have a cluster of nodes where each node contains two >> JVM, one running an Hadoop process, one running an Infinispan process. The >> Hadoop process would only read the data from the Infinispan process in the >> same node during a normal M/R execution. > > So we discussed two use cases: > - engage Infinispan to accelerate an existing Hadoop deployment > - engage Hadoop to run an Hadoop job on existing data in Infinispan > In neither case I see why I'd run them in separate JVMs: seems less > effective and more work to get done, and no benefit unless you're > thinking about independent JVM tuning? That might be something to > consider, but I doubt tuning independence would ever offset the cost > of serialized transfer of each entry. > > The second use case could be used via Hot Rod too, but that's a > different discussion, actually just a nice side effect of Hadoop being > language agnostic that we would take advantage of. > > Sanne > >> _______________________________________________ >> infinispan-dev mailing list >> infinispan-dev@lists.jboss.org >> https://lists.jboss.org/mailman/listinfo/infinispan-dev > > _______________________________________________ > infinispan-dev mailing list > infinispan-dev@lists.jboss.org > https://lists.jboss.org/mailman/listinfo/infinispan-dev
_______________________________________________ infinispan-dev mailing list infinispan-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev