Hi Nick,
Thank you for the comments.
(2014/11/22 3:42), Nick Dimiduk wrote:
> I would also encourage you to consider joining forces with DataFu,
> rather than "competing". I think there's a real appetite for a holistic
> toolbox of patterns and implementations that can span these projects.
> From my understanding, there's nothing about DataFu that's unique to
> Pig, they just need the work done to abstract away the Pig bits and
> implement the Hive interfaces.
My current understanding is that DataFu is a collection of UDFs for
Apache Pig. DataFu does not yet support Hive interfaces; is extending
DataFu to Hive a direction that the DataFu community has agreed on?
My concern is that merging the Hivemall codebase into DataFu would make
DataFu's build and packaging process complex and the target/objective
of the project unclear.
I do not think that Hivemall competes with DataFu because
1) some users prefer Pig and others prefer Hive, and
2) Pig/DataFu is useful for tasks that HiveQL is unsuited to (e.g.,
complex feature engineering steps). After preprocessing with DataFu,
Hivemall can be applied for classification/regression in a scalable way
in Hive.
> Is there anything about Hivemall that's unique to Hive, that wouldn't be
> applicable to Pig as well?
The techniques used in Hivemall (e.g., training data amplification that
emulates iterative training, and machine learning algorithms implemented
as table-generating functions) could be applicable to Apache Pig.
However, I am not a heavy user of Pig, and porting Hivemall to Pig would
require a lot of work. So, for now I plan to stick with HiveQL
interfaces (Hive, HCatalog, and Tez as the software stack of Hivemall)
in developing Hivemall, because a SQL-like interface is friendly to a
broader range of developers.
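
As a rough illustration of the amplification idea mentioned above, the
following Python sketch repeats each training row several times, shuffles
the result, and then runs a single sequential pass of an online learner
over it, so that one pass behaves roughly like several epochs. This is
not Hivemall code: the function names (amplify,
train_perceptron_single_pass, predict) and the choice of a perceptron
are assumptions made only for this example.

```python
import random


def amplify(rows, xtimes, seed=0):
    """Repeat every training row `xtimes` times and shuffle the copies.

    Hypothetical helper for illustration; a single pass over the result
    sees each example `xtimes` times, in random order.
    """
    rng = random.Random(seed)
    amplified = [row for row in rows for _ in range(xtimes)]
    rng.shuffle(amplified)
    return amplified


def train_perceptron_single_pass(rows, n_features):
    """Run ONE sequential pass of a perceptron (no bias term).

    Each row is (features, label) with label in {-1, +1}.
    """
    w = [0.0] * n_features
    for x, y in rows:
        score = sum(wi * xi for wi, xi in zip(w, x))
        if y * score <= 0:  # misclassified (or undecided): update weights
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 or -1 for feature vector `x` under weights `w`."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


# Tiny linearly separable toy dataset (made up for this example).
data = [
    ([2.0, 1.0], 1),
    ([1.0, 2.0], 1),
    ([-1.0, -2.0], -1),
    ([-2.0, -1.0], -1),
]

amplified = amplify(data, xtimes=3)
weights = train_perceptron_single_pass(amplified, n_features=2)
```

In a SQL-like setting, the learner would only ever scan the amplified
table once, which is why amplification can stand in for an explicit
iteration loop.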
Thanks,
Makoto
--
*******************************************
Makoto YUI <m....@aist.go.jp>
Information Technology Research Institute, AIST.
https://staff.aist.go.jp/m.yui/index_e.html
*******************************************