On Mon, Jul 14, 2014 at 9:36 AM, Pat Ferrel <[email protected]> wrote:

> 1) every change to the DSL should be implemented either in core-math or in
> _both_ engines, right? So every committer will have to be willing to take
> this on when changing the DSL, right? We don’t want divergence in DSL
> implementation.
>

Well, I think that every committer should sign up to build a bit of a
consortium of engine-oriented committers to handle the change. There will
be some specialization before long.

> 2) are we going to allow the build to be broken for extended periods
> (hopefully only a day or two) until one or the other expert gets time to
> help with a DSL implementation?
>

No. I think that the original committer should insert a stub
implementation that throws an exception and file a JIRA. The unit test for
the capability may have to be limited temporarily, but the build should not
break. The engine-doesn't-do-this JIRA should be a release stopper.

> This is for cases where #1 is not possible. This will happen with both
> tests and abstract defs in core-math that are carried through other engine
> specific classes. The way to get things to compile may not be immediately
> obvious so to keep things going a profile or target for each engine might
> help.
>

Profile is an interesting idea.

> 3) This will create an instant split in what algos are implemented on h2o
> and spark. We should clearly mark these and ideally minimize them.
>

Agree.

> 4) Users are going to be confused. Do they need to install Spark or not,
> what runs on what, what are the differences? The ideal is to say it all
> runs on both so all users have to do is choose their engine. But that may
> never happen. How do we handle this? There is coming confusion over Hadoop
> mr vs Spark, we don’t want to add to this.
>

Fair point. Just like the confusion between XFS and EXT3 and EXT4 and ZFS.
Needs documentation.

> 5) Can we agree on file level formats and/or other ways to pass a
> parallelized drm from one engine to the other? This will allow us to create
> hybrid pipelines, potentially easing user confusion.
>

I want to avoid file level data communication as much as possible.

Will it be possible to make the file handling generic? I can see how it
might be and how it might not be possible.

Can we push the file handling back on the user?

Can we only support a few persistence technologies (say, local file, hdfs
and URL)?
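
To make the stub suggestion under point 2 concrete, here is a rough sketch of what I have in mind. Every name in it (DistributedEngine, EngineA, EngineB, cumulativeSum) is hypothetical and invented for illustration; none of it is the actual DSL or either engine's real API, and the JIRA key is deliberately left as a placeholder.

```java
import java.util.Arrays;

// Hypothetical engine-neutral contract; not the real DSL interface.
interface DistributedEngine {
    // Newly added operation that every engine must now provide.
    double[] cumulativeSum(double[] row);
}

// The committer who adds the operation implements it on the engine they know.
class EngineA implements DistributedEngine {
    @Override
    public double[] cumulativeSum(double[] row) {
        double[] out = new double[row.length];
        double running = 0.0;
        for (int i = 0; i < row.length; i++) {
            running += row[i];
            out[i] = running;
        }
        return out;
    }
}

// The other engine gets a stub so the build keeps compiling.
class EngineB implements DistributedEngine {
    @Override
    public double[] cumulativeSum(double[] row) {
        // Placeholder text only; the real JIRA key would be filled in here.
        throw new UnsupportedOperationException(
            "cumulativeSum is not implemented on EngineB yet; see the tracking JIRA");
    }
}

class StubDemo {
    // Helper a shared unit test could use to limit itself temporarily:
    // it reports whether the engine supports the operation at all.
    static boolean supportsCumulativeSum(DistributedEngine engine) {
        try {
            engine.cumulativeSum(new double[] {1.0});
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        double[] result = new EngineA().cumulativeSum(new double[] {1.0, 2.0, 3.0});
        System.out.println("EngineA result: " + Arrays.toString(result));
        System.out.println("EngineA supported: " + supportsCumulativeSum(new EngineA()));
        System.out.println("EngineB supported: " + supportsCumulativeSum(new EngineB()));
    }
}
```

The point of the helper is that the shared test can skip the unsupported engine instead of failing, while the stub itself keeps the gap loud at runtime until the release-stopper JIRA is closed.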
