Hi Nick,

Thank you for the comments.

(2014/11/22 3:42), Nick Dimiduk wrote:
> I would also encourage you to consider joining forces with DataFu,
> rather than "competing". I think there's a real appetite a wholistic
> toolbox of patterns and implementations that can span these projects.
> From my understanding, there's nothing about DataFu that's unique to
> Pig, they just need the work done to abstract away the Pig bits and
> implement the Hive interfaces.

My current understanding is that DataFu is a collection of UDFs for Apache Pig. DataFu does not yet support Hive interfaces; is there a consensus in the DataFu community to extend DataFu to Hive?

My concern is that merging the Hivemall codebase into DataFu would complicate DataFu's build and packaging process and blur the project's objective.

I do not think that Hivemall competes with DataFu, because
1) some users prefer Pig and others prefer Hive, and
2) Pig/DataFu is useful for tasks that HiveQL is unsuited to (e.g., complex feature engineering steps). After preprocessing with DataFu, Hivemall can be applied for classification/regression in a scalable way in Hive.

> Is there anything about Hivemall that's unique to Hive, that wouldn't be
> applicable to Pig as well?

The techniques used in Hivemall (e.g., training data amplification that emulates iterative training, and machine learning algorithms implemented as table-generating functions) could be applicable to Apache Pig as well.

However, I am not a heavy user of Pig, and porting Hivemall to Pig would require a lot of work. So, for now, I plan to stick with HiveQL interfaces (Hive, HCatalog, and Tez as the software stack of Hivemall) in developing Hivemall, because an SQL-like interface is friendly to a broader range of developers.

Thanks,
Makoto

--
*******************************************
Makoto YUI <m....@aist.go.jp>
Information Technology Research Institute, AIST.
https://staff.aist.go.jp/m.yui/index_e.html
*******************************************
