Hi,

I had a very good conversation with Jonathan Halliday, Sanne and Emmanuel 
around the integration between Infinispan and Hadoop. Just to recap, the goal 
is to be able to run Hadoop M/R tasks on data that is stored in Infinispan in 
order to gain speed. (once we have a prototype in place, one of the first tasks 
will be to validate this speed assumptions).

In previous discussions we explored the idea of providing an HDFS 
implementation for Infinispan, which whilst doable, might not be the best 
integration point:
- in order to run M/R jobs, hadoop interacts with two interfaces: 
InputFormat[1] and OutputFormat[2]
- it's the specific InputFormat and OutputForman implementations that work on 
top of HDFS
- instead of implementing HDFS, we could provide implementations for the 
InputFormat and OutputFormat interfaces, which would give us more flexibility
- this seems to be the preferred integration point for other systems, such as 
Cassandra
- also important to notice that we will have both an Hadoop and an Infinispan 
cluster running in parallel: the user will interact with the former in order to 
run M/R tasks. Hadoop will use Infinispan (integration achieved through 
InputFormat and OutputFormat ) in order to get the data to be processed. 
- Assumptions that we'll need to validate: this approach doesn't impose any 
constraint on how data is stored in Infinispan and should allow data to be read 
through the Map interface. Also InputFormat and OutputFormat implementations 
would only use get(k) and keySet() methods, and no native Infinispan M/R, which 
means that C/S access should also be possible.
- very important: hadoop HDFS is an append only file system, and the M/R tasks 
operate on a snapshot of the data. From a task's perspective, all the data in 
the storage doesn't change after the task is started. More data can be appended 
whilst the task runs, but this won't be visible by the task. Infinispan doesn't 
have such an append structure, nor MVCC. The closest thing we have is the 
Snapshot isolation transactions implemented by the cloudTM project (this is not 
integrated yet). I assume that the M/R tasks are built with this 
snapshot-issolation requirement from the storage - this is something that we 
should investigate as well. It is possible that, in the first stages of this 
integration, we would require data stored in ISPM to be read only.

[1] 
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/InputFormat.html
[2] 
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/OutputFormat.html


Cheers,
-- 
Mircea Markus
Infinispan lead (www.infinispan.org)





_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Reply via email to