[GitHub] incubator-hivemall issue #107: [WIP][HIVEMALL-132] Generalize f1score UDAF t...

2017-08-02 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/107
  
@nzw0301 Could you add tests for the binary measure (and for the multi-label measure)?




[jira] [Commented] (HIVEMALL-138) Implement to_top_k_ordered_map

2017-08-02 Thread Takuya Kitazawa (JIRA)

[ https://issues.apache.org/jira/browse/HIVEMALL-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112100#comment-16112100 ]

Takuya Kitazawa commented on HIVEMALL-138:
--

Instead of creating a completely new UDAF, another option is to add a fourth 
argument `k` to `to_ordered_map`.
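
For illustration, a minimal sketch in plain Java of the kind of size-bounded buffer such an aggregation could keep per group (the class and method names are hypothetical, not the actual Hivemall UDAF code):

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-group aggregation buffer: an ordered map that keeps only
// the top-k entries by key (descending). Illustration only, not Hivemall code.
public class TopKOrderedMapBuffer {
    private final int k;
    private final TreeMap<Integer, Integer> map =
            new TreeMap<>(Collections.reverseOrder());

    public TopKOrderedMapBuffer(int k) {
        this.k = k;
    }

    // Add one (key, value) pair, then trim the map back to k entries.
    public void add(int key, int value) {
        map.put(key, value);
        if (map.size() > k) {
            map.pollLastEntry(); // drop the smallest key
        }
    }

    // Merge a partial buffer produced on another mapper, then trim again.
    public void merge(Map<Integer, Integer> other) {
        map.putAll(other);
        while (map.size() > k) {
            map.pollLastEntry();
        }
    }

    public Map<Integer, Integer> terminate() {
        return map;
    }
}
```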

> Implement to_top_k_ordered_map
> --
>
> Key: HIVEMALL-138
> URL: https://issues.apache.org/jira/browse/HIVEMALL-138
> Project: Hivemall
>  Issue Type: New Feature
>Reporter: Takuya Kitazawa
>Assignee: Takuya Kitazawa
>Priority: Minor
>
> As an alternative to the "each_top_k" functionality, let us implement a 
> "to_top_k_ordered_map(int k, int key, int value)" UDAF. Compared to the 
> CLUSTER BY + "each_top_k" option, a UDAF enables us to utilize map-side 
> aggregation.
> According to [~myui]:
> A problem is that multiple to_top_k_ordered_map UDAFs are concurrently 
> executed and memory consumption is not reduced.
> to_top_k_ordered_map has O(|article_id|*k) (or 
> O(|article_id|*k/reducers*combiner_effect_ratio) per reducer) space 
> complexity, while each_top_k has O(k) (or O(k/reducers) per reducer) space 
> complexity in an operator. each_top_k internally uses a priority queue (not 
> sorting), assuming the given inputs are sorted by the group key using CLUSTER 
> BY. The shuffle involves a scalable external sort, so high memory space 
> complexity can be avoided.





[jira] [Created] (HIVEMALL-138) Implement to_top_k_ordered_map

2017-08-02 Thread Takuya Kitazawa (JIRA)
Takuya Kitazawa created HIVEMALL-138:


 Summary: Implement to_top_k_ordered_map
 Key: HIVEMALL-138
 URL: https://issues.apache.org/jira/browse/HIVEMALL-138
 Project: Hivemall
  Issue Type: New Feature
Reporter: Takuya Kitazawa
Assignee: Takuya Kitazawa
Priority: Minor


As an alternative to the "each_top_k" functionality, let us implement a 
"to_top_k_ordered_map(int k, int key, int value)" UDAF. Compared to the CLUSTER 
BY + "each_top_k" option, a UDAF enables us to utilize map-side aggregation.

According to [~myui]:

A problem is that multiple to_top_k_ordered_map UDAFs are concurrently executed 
and memory consumption is not reduced.

to_top_k_ordered_map has O(|article_id|*k) (or 
O(|article_id|*k/reducers*combiner_effect_ratio) per reducer) space 
complexity, while each_top_k has O(k) (or O(k/reducers) per reducer) space 
complexity in an operator. each_top_k internally uses a priority queue (not 
sorting), assuming the given inputs are sorted by the group key using CLUSTER BY. 
The shuffle involves a scalable external sort, so high memory space complexity 
can be avoided.
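
For contrast, a minimal sketch of the O(k) strategy described above for each_top_k: a size-bounded priority queue over rows that arrive already clustered by the group key (the names are illustrative, not the actual Hivemall implementation):

```java
import java.util.PriorityQueue;

// Minimal sketch of each_top_k's bounded state: a min-heap of size k per group.
// Because rows arrive clustered by the group key (CLUSTER BY), only k scores
// need to be held in memory at any time.
public class EachTopKSketch {
    private final int k;
    private final PriorityQueue<Double> heap; // min-heap: smallest retained score on top

    public EachTopKSketch(int k) {
        this.k = k;
        this.heap = new PriorityQueue<>(k);
    }

    // Called for each row of the current group.
    public void offer(double score) {
        if (heap.size() < k) {
            heap.add(score);
        } else if (score > heap.peek()) {
            heap.poll();   // evict the current k-th best score
            heap.add(score);
        }
    }

    // Called when the group key changes: emit the retained top-k and reset.
    public Double[] flush() {
        Double[] topK = heap.toArray(new Double[0]);
        heap.clear();
        return topK;
    }
}
```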





[GitHub] incubator-hivemall issue #105: [WIP][HIVEMALL-24] Scalable field-aware facto...

2017-08-02 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/105
  
memory consumption of FFM is estimated as follows:

```
(  4  +  4*factors  +    8    +   4   +  8  ) * fields * features (bytes)
   ~     ~~~~~~~~~       ~        ~      ~
   Wi       V[k]      adagrad    ftrl   ffm

(4 + 4*4*8 + 4 + 8) * 39 * 2^20 bytes = 5.88 GiB
```
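
For reference, the arithmetic behind that figure can be reproduced as follows (a sketch in plain Java; the constants are taken verbatim from the estimate above, not read from the implementation):

```java
// Sanity check of the memory estimate quoted above. The constants (39 fields,
// 2^20 features, per-entry byte counts) are copied from the comment.
public class FfmMemoryEstimate {
    public static void main(String[] args) {
        long fields = 39;
        long features = 1L << 20;                    // 2^20 feature dimensions
        long bytesPerEntry = 4 + 4 * 4 * 8 + 4 + 8;  // 144 bytes per (field, feature) entry
        long total = bytesPerEntry * fields * features;
        System.out.println(total + " bytes");        // 5,888,802,816 bytes, i.e. the ~5.88 figure above
    }
}
```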




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread kottmann
Github user kottmann commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@myui that was done for the 1.6.0 release, and in maxent 3.0.3 it was 
modified to run in multiple threads. 

You probably need to take a similar approach to the one we took for multi-threaded 
training, e.g. split the amount of work done per iteration and scale it out to 
multiple machines, merge the parameters, and repeat for the next iteration.
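
For illustration, a minimal sketch of that iteration-level pattern, assuming a hypothetical Worker abstraction (not an OpenNLP or Hivemall API):

```java
import java.util.List;

// Illustration only: split each training iteration across workers, merge the
// resulting parameters by averaging, and feed them into the next iteration.
public class DistributedIterativeTraining {

    // Hypothetical worker holding one shard of the training data.
    interface Worker {
        double[] computePartialParams(double[] currentParams);
    }

    static double[] train(List<Worker> workers, int numFeatures, int maxIterations) {
        double[] params = new double[numFeatures];
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] merged = new double[numFeatures];
            for (Worker w : workers) {
                double[] partial = w.computePartialParams(params); // work done on one machine
                for (int i = 0; i < numFeatures; i++) {
                    merged[i] += partial[i] / workers.size();      // merge by averaging
                }
            }
            params = merged;  // broadcast merged parameters for the next iteration
        }
        return params;
    }
}
```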




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@kottmann Do you know in which version the maxent classifier was moved to 
opennlp-tools?
The versioning schemes of the opennlp-maxent and opennlp-tools modules are very 
different.

https://mvnrepository.com/artifact/org.apache.opennlp/opennlp-maxent
https://mvnrepository.com/artifact/org.apache.opennlp/opennlp-tools




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread kottmann
Github user kottmann commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@myui the maxent 3.0.1 version went through Apache IP clearance when the 
code base was moved from SourceForge, and should be almost identical to 3.0.0.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread kottmann
Github user kottmann commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
Sure, there are ways to make this work across multiple machines, but then 
you can't use it like we ship it. Maybe the best solution for you would be to 
just take the code you need, strip it down and get rid of opennlp as a 
dependency?




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@helenahm I agree with using Hivemall's Matrix to reduce memory consumption and 
creating a custom BigGISTrainer for Hivemall.

My concern is that the modification should be based on the latest release of 
Apache OpenNLP, `v1.8.1`, if there is no reason to use a pre-Apache release. 

Anyway, I will look into your PR after merging 
https://github.com/apache/incubator-hivemall/pull/105, maybe next week. Some 
refactoring will be applied (such as removing debug prints and unused code) by 
forking your PR branch. 

BTW, multi-threading should be avoided when running a task in a YARN 
container. It is better to parallelize by Hive.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
"Yeah, so if you have a lot of training data then running out of memory is 
one symptom you run into, but that is not the actual problem of this 
implementation." 
- it was the big problem for me to use on Hadoop and that is why i had to 
alter the training code 
- the newer version of the code is as bad as the old one from this point of 
view

"The actual cause is that it won't scale beyond one machine." 
- yes, that is why I really like what Hivemall project is about, and that 
is why i needed MaxEnt for Hive 

"In case you manage to make this run with much more data the time it will 
take to run will be uncomfortably high." 
-- that is why i have tested my new implementation on almost 100 mils of 
training samples and saw each of 302 mappers finish work in very reasonable 
time 




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread kottmann
Github user kottmann commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@helenahm as far as I know the training data is stored once in memory, and 
then for each thread a copy of the parameters is stored. 

Yeah, so if you have a lot of training data then running out of memory is 
one symptom you run into, but that is not the actual problem of this 
implementation. The actual cause is that it won't scale beyond one machine.

Bottom line: if you want to use GIS training with lots of data, don't use 
this implementation. The training requires a certain amount of CPU time, and it 
increases with the amount of training data. In case you manage to make this run 
with much more data, the time it will take to run will be uncomfortably high.




[GitHub] incubator-hivemall issue #107: [HIVEMALL-132] Generalize f1score UDAF to sup...

2017-08-02 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/107
  
Also, some other DDLs need to be updated. Please grep for `tree_export` 
to find out which DDLs to update.




[GitHub] incubator-hivemall issue #107: [HIVEMALL-132] Generalize f1score UDAF to sup...

2017-08-02 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/107
  
@nzw Could you update the user guide to include `fmeasure` and `f1score` in 
`incubator-hivemall/docs/gitbook/eval/classification_measures.md`?

Run `npm install gitbook-cli; gitbook install; gitbook serve` in docs/gitbook.

Also, could you revise the current Evaluation section of 
https://treasure-data.gyazo.com/5ec4b737dcedd55353f8126040ea5366 to

```
• Binary Classification metrics
  • Area Under the ROC Curve
• Regression metrics
• Ranking metrics
```

Refer to the examples in 
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
https://turi.com/learn/userguide/evaluation/classification.html#f_scores

