Re: [ANN] Hivemall: Hive scalable machine learning library
Hi,

I added support for state-of-the-art classifiers (not yet supported in Mahout), as well as Hivemall's cute(!?) logo, in Hivemall 0.1-rc3. Newly supported classifiers include:

- Confidence Weighted (CW)
- Adaptive Regularization of Weight Vectors (AROW)
- Soft Confidence Weighted (SCW1, SCW2)

These classifiers are much smarter than the standard SGD-based or passive-aggressive classifiers. Please check them out for yourself.

Thanks,
Makoto

(2013/10/11 4:28), Clark Yang (杨卓荦) wrote:

It looks really cool. I think I will try it out.

Cheers,
Zhuoluo (Clark) Yang

2013/10/5 Makoto YUI yuin...@gmail.com:

Hi Edward,

Thank you for your interest. The Hivemall project does not plan to have a dedicated mailing list; I will answer questions and comments on Twitter or through GitHub issues (with a "question" label).

BTW, I just added a CTR (Click-Through-Rate) prediction example using the dataset provided by a commercial search engine provider for KDD Cup 2012 track 2:
https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset

I guess many of you are working on ad CTR/CVR prediction. This example might be of some help in understanding how to do it entirely within Hive.

Thanks,
Makoto @myui

(2013/10/04 23:02), Edward Capriolo wrote:

Looks cool, I'm already starting to play with it.

On Friday, October 4, 2013, Makoto Yui yuin...@gmail.com wrote:

Hi Dean,

Thank you for your interest in Hivemall. Twitter's paper actually influenced me in developing Hivemall, and I initially implemented such functionality as Pig UDFs.
Though my Pig ML library is not released, you can find a similar attempt for Pig at https://github.com/y-tag/java-pig-MyUDFs

Thanks,
Makoto

2013/10/3 Dean Wampler deanwamp...@gmail.com:

This is great news! I know that Twitter has done something similar with UDFs for Pig, as described in this paper:
http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf

I'm glad to see the same thing start with Hive.

Dean

On Wed, Oct 2, 2013 at 10:21 AM, Makoto YUI yuin...@gmail.com wrote:

Hello all,

My employer, AIST, has given the thumbs up to open-source our machine learning library, named Hivemall. Hivemall is a scalable machine learning library running on Hive/Hadoop, licensed under the LGPL 2.1.
https://github.com/myui/hivemall

Hivemall provides machine learning functionality, as well as feature engineering functions, through Hive UDFs/UDAFs/UDTFs. It is designed to scale to the number of training instances as well as the number of training features.

Hivemall is very easy to use, as every machine learning step is done within HiveQL.

-- Installation is just as follows:
add jar /tmp/hivemall.jar;
source /tmp/define-all.hive;

-- Logistic regression is performed by a query:
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label) as (feature, weight)
  FROM training_features
) t
GROUP BY feature;

You can find detailed examples on our wiki pages.
https://github.com/myui/hivemall/wiki/_pages

Though we consider Hivemall to be much easier to use and more scalable than Mahout for classification/regression tasks, please check it for yourself. If you have a Hive environment, you can evaluate Hivemall within 5 minutes or so.

Hope you enjoy the
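For readers unfamiliar with the online learners mentioned at the top of this thread: CW/AROW-style algorithms keep a per-weight confidence (a variance) alongside each weight mean, and update low-confidence weights more aggressively than a plain SGD learner would. Below is a minimal, illustrative Python sketch of a diagonal-covariance AROW update. This is not Hivemall's implementation; the class name, the r parameter, and the dense-list feature format are assumptions made for the demo.

```python
# Illustrative diagonal AROW binary classifier (NOT Hivemall's code).
# Labels are +1/-1; features are dense lists of floats.

class ArowClassifier:
    def __init__(self, n_features, r=0.1):
        self.mu = [0.0] * n_features     # weight means
        self.sigma = [1.0] * n_features  # per-weight variances (diagonal covariance)
        self.r = r                       # regularization parameter

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.mu, x))

    def update(self, x, y):
        margin = y * self.predict(x)
        if margin >= 1.0:                # no hinge loss -> no update
            return
        # confidence of this example under the current covariance
        v = sum(s * xi * xi for s, xi in zip(self.sigma, x))
        beta = 1.0 / (v + self.r)
        alpha = (1.0 - margin) * beta
        for i, xi in enumerate(x):
            self.mu[i] += alpha * y * self.sigma[i] * xi
            self.sigma[i] -= beta * self.sigma[i] * self.sigma[i] * xi * xi

# Tiny usage example: learn the sign of the first feature.
clf = ArowClassifier(n_features=2)
for _ in range(10):
    clf.update([1.0, 0.0], +1)
    clf.update([-1.0, 0.0], -1)
print(clf.predict([1.0, 0.0]) > 0)  # prints True
```

Hivemall exposes these classifiers as Hive functions, so a user never writes this loop; the sketch is only to convey why the update differs from plain SGD (the variance term shrinks for frequently seen features, making their weights harder to move).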
NullPointerException on Sample Tables / CDH 4.4
Hey everyone,

I was supplied with a decent ten-node CDH 4.4 cluster, only 7 days old, on which someone had tried some HBase stuff. I wanted to apply my workflows (consisting of HiveQL scripts and Oozie workflows), which work on another cluster, to that cluster, but unfortunately I ran into the following issue: when running any SELECT statement against any table (sample tables or tables belonging to my workflow), I get the following error stack (from Beeswax and from the Hive CLI):

java.io.IOException: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.sortLocatedBlocks(DatanodeManager.java:334)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1245)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:413)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:172)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44938)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1751)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1747)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1745)

I already tried setting up a new (blank) table and inserting some dummy data, but the error stays the same. Some connection between HDFS and Hive regarding block information seems to be broken.

Any idea how to fix this, where the relevant configuration is, or what other things we should check?

Cheers
Fabian
Re: NullPointerException on Sample Tables / CDH 4.4
Somehow, after three days of searching: I just deployed the client configuration for all Hive roles again, and the error seems to be gone. Let's see what the future brings.

Cheers
Re: [ANN] Hivemall: Hive scalable machine learning library
Just tried this for some hot trends in forum management. It was pretty impressive. I will try it more deeply and, if possible, integrate it into my product.

Thanks for the awesome work.

Nitin
Re: NPE org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable
Development environment: Hive 0.11, Hadoop 1.0.3

2013/10/11 xinyan Yang moon.yan...@gmail.com:

Hi, when I run this SQL it fails. Can anyone give me advice?

select e.udid as udid, e.app_id as app_id
from acorn_3g.ClientChannelDefine cc
join (
  select udid, app_id, from_id
  from (
    select u.device_id as udid, u.app_id as app_id, g.device_id as 3gdid, u.from_id as from_id
    from acorn_3g.user_device_info u
    left outer join (select device_id from acorn_3g.3g_device_id where log_date'2013-09-15') g
      on u.device_id = g.device_id
    where u.log_date='2013-09-15' and u.from_id0 and u.type=1
  ) f1
  where 3gdid is null
) e on (e.from_id = cc.from_id)

Error info:

Task with the most failures(4):
- Task ID: task_201305281414_236693_m_01
  URL: http://YZSJHL18-22.opi.com:50030/taskdetails.jsp?jobid=job_201305281414_236693&tipid=task_201305281414_236693_m_01
- Diagnostic Messages for this Task:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:162)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:198)
    at org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:212)
    at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1377)
    at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1381)
    at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1381)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:611)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:144)
    ... 8 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:186)
    ... 14 more

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 343 Reduce: 2 Cumulative CPU: 3478.61 sec HDFS Read: 1862106687 HDFS Write: 3838425 SUCCESS
Job 1: Map: 2 HDFS Read: 0 HDFS Write: 0 FAIL
Re: NPE org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable
Hello Xinyan,

Can you attach the query plan (the output of EXPLAIN)? I think a bad plan caused the error. Also, can you try Hive trunk? This looks like a bug that was fixed after the 0.11 release.

Thanks,

Yin
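The failing frame is MapJoinOperator.loadHashTable, so while gathering the EXPLAIN output, one hedged workaround to try (assuming the NPE is triggered by Hive automatically converting the join into a map join; I believe 0.11 turned `hive.auto.convert.join` on by default) is to disable that conversion and rerun:

```sql
-- Hedged workaround while debugging, not a fix: fall back to a reduce-side
-- (common) join so MapJoinOperator's hash-table load is never reached.
set hive.auto.convert.join=false;

-- Then capture the plan that was requested above:
EXPLAIN
select e.udid as udid, e.app_id as app_id
from acorn_3g.ClientChannelDefine cc
join ( /* subquery as in the original statement */ ) e
  on (e.from_id = cc.from_id);
```

If the query succeeds with the conversion disabled, that narrows the bug to the map-join path rather than the query itself.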
hive partition pruning on joining on partition column
I have a requirement I am trying to support in Hive, and I am not sure if it is doable. I have Hadoop 1.1.1 with Hive 0.9.0 (using Derby as the metastore).

I partition my data by a dt column, so my table 'foo' has partitions like 'dt=2013-07-01' through 'dt=2013-07-30'. Now the user wants to query the data for Saturdays only. To make this flexible, instead of asking the end user to find out which dates in that month are Saturdays, I added a lookup table (call it 'bar') in Hive with the following columns: year, month, day, dt_format, week_of_day.

So I want to see if I can join foo and bar and still get partition pruning:

select *
from foo
join bar on (bar.year=2013 and bar.month=7 and bar.day_of_week=6 and bar.dt_format = foo.dt)

I tried several ways, like switching the table order, joining with a subquery, etc., but none of them makes partition pruning work on table foo in this case. Is this really achievable in Hive?

Thanks
Yong
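One workaround when the optimizer cannot prune through a join like this is to compute the qualifying dates outside Hive and inline them as partition literals, which the pruner handles reliably. A minimal Python sketch (the table name foo and the dt format are taken from the example above; the script itself is hypothetical):

```python
from datetime import date, timedelta

def saturdays(year, month):
    """All Saturdays of the given month, formatted like the dt partition values."""
    d = date(year, month, 1)
    # advance to the first Saturday (Monday=0 ... Saturday=5)
    d += timedelta(days=(5 - d.weekday()) % 7)
    days = []
    while d.month == month:
        days.append(d.isoformat())
        d += timedelta(days=7)
    return days

dts = saturdays(2013, 7)
print(dts)  # ['2013-07-06', '2013-07-13', '2013-07-20', '2013-07-27']
print("select * from foo where dt in (%s)"
      % ", ".join("'%s'" % dt for dt in dts))
```

Generating the query this way keeps the user-facing flexibility while guaranteeing that only the four Saturday partitions of July 2013 are scanned.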
Re: hive partition pruning on joining on partition column
The easiest way to do this is to create a table where each date maps to week of month, week of year, day of week, and day of month, and then join on just the date, putting the conditions in the WHERE clause. In my understanding it is easy to manipulate the date column, and you can join just on date and get results based on the WHERE conditions.

PS: this is what we currently do, where we have to run continuous rollup analytics for year-to-date or parameter-to-date calculations.

Wait for others to give you better solutions.

Nitin Pawar