[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-12-10 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-11:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed. Thanks Drew for your help.

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.1
> Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11-all-cleanup-20091128.patch, 
> MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, 
> MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.
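
The concern in the description can be sketched in plain Java. This is not Mahout or Hadoop code: the class names, the "t1" key, and the Map standing in for JobConf are assumptions made purely for illustration. The sketch shows what could happen if two jobs did share one JVM: the second job's configuration overwrites the static field the first job reads, while per-instance fields keep the two apart.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical clusterer that keeps its threshold in a static field.
class StaticThresholdClusterer {
    static double t1; // one value shared by every instance in the JVM

    static void configure(Map<String, String> conf) {
        t1 = Double.parseDouble(conf.get("t1"));
    }

    boolean covers(double distance) {
        return distance < t1;
    }
}

// The same logic with the threshold held per instance.
class InstanceThresholdClusterer {
    private final double t1;

    InstanceThresholdClusterer(Map<String, String> conf) {
        this.t1 = Double.parseDouble(conf.get("t1"));
    }

    boolean covers(double distance) {
        return distance < t1;
    }
}

class StaticFieldDemo {
    // Builds a tiny configuration map; a stand-in for a real job configuration.
    static Map<String, String> conf(String t1) {
        Map<String, String> m = new HashMap<>();
        m.put("t1", t1);
        return m;
    }

    public static void main(String[] args) {
        // Job A is configured with 3.0, then job B overwrites the static field with 1.0.
        StaticThresholdClusterer.configure(conf("3.0"));
        StaticThresholdClusterer jobA = new StaticThresholdClusterer();
        StaticThresholdClusterer.configure(conf("1.0"));
        System.out.println("static, job A covers 2.0: " + jobA.covers(2.0)); // false

        // With instance fields each job keeps its own threshold.
        InstanceThresholdClusterer a = new InstanceThresholdClusterer(conf("3.0"));
        InstanceThresholdClusterer b = new InstanceThresholdClusterer(conf("1.0"));
        System.out.println("instance, job A covers 2.0: " + a.covers(2.0)); // true
        System.out.println("instance, job B covers 2.0: " + b.covers(2.0)); // false
    }
}
```

Whether Hadoop ever runs two jobs inside one JVM is exactly the open question in the description; the sketch only shows why the static design would break if it did.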

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-12-01 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-11:
--

Status: Patch Available  (was: Open)

see attached: MAHOUT-11-all-cleanup-20091128.patch





[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-28 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-11:
--

Attachment: MAHOUT-11-all-cleanup-20091128.patch

MAHOUT-11-all-cleanup-20091128.patch eliminates the use of static fields for 
configuration in the clustering code in all cases where it was present: canopy, 
kmeans, fuzzykmeans and meanshift. It retains Isabel's original patch to the 
kmeans package, with the exception of the items discussed previously, and adds 
similar changes to the other packages. It also includes the RandomSeedGenerator 
fix and its unit test from the earlier patches.

Applied against rev 883446, all unit tests are passing, and I've run the kmeans 
code on real data. It would be really great if someone could double check the 
changes and comment.






[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-23 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-11:
--

Attachment: MAHOUT-11-kmeans-cleanup.patch

Attached a patch that takes Isabel's original patch to remove static fields in 
kmeans clustering, makes the discussed change for the output collectors, and 
cleans up some warnings and an unused instance of the convergenceDelta variable. 
It also fixes the RandomSeedGenerator in kmeans clustering and adds a unit test 
for it. Also, KMeansClusterer no longer extends Cluster -- it wasn't necessary 
to do so.

Isabel, are you planning on taking a crack at the rest of the clustering code 
that uses static fields? I'm finding this issue a great way to become familiar 
with the code, and if you're not already intending to work on it, I could give 
it a try.



> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11-kmeans-cleanup.patch, 
> MAHOUT-11-RandomSeedGenerator.patch, MAHOUT-11.patch




[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-20 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-11:
--

Attachment: MAHOUT-11-RandomSeedGenerator.patch

Found the problem, which I believe is isolated to the case where kmeans 
clustering uses random seed clusters as a basis for clustering.

In RandomSeedGenerator, no cluster ids are assigned, so all clusters generated 
get an id of zero when being written to the sequence file. If all cluster ids 
are zero, KMeansClusterer.outputPointWithClusterInfo winds up assigning all 
points to the same cluster. This issue was hidden previously because cluster 
ids were assigned in the Cluster(Vector) constructor.

I've attached a small patch for RandomSeedGenerator. This should probably be 
accompanied by a unit test, but I haven't had the chance to put one together.
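
The failure mode described above can be reproduced in a few lines of plain Java. This is only a sketch, not the RandomSeedGenerator or KMeansClusterer code: SeedCluster, ClusterIdDemo and their methods are hypothetical names, centers are one-dimensional, and sequential ids are an assumed shape for the fix rather than what the attached patch necessarily does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a seed cluster: an id plus a one-dimensional center.
class SeedCluster {
    final int id;
    final double center;

    SeedCluster(int id, double center) {
        this.id = id;
        this.center = center;
    }
}

class ClusterIdDemo {
    // Mimics the reported bug: every seed keeps the default id of zero.
    static List<SeedCluster> seedsWithoutIds(double[] centers) {
        List<SeedCluster> seeds = new ArrayList<>();
        for (double c : centers) {
            seeds.add(new SeedCluster(0, c));
        }
        return seeds;
    }

    // Assumed shape of the fix: give each seed a distinct, sequential id.
    static List<SeedCluster> seedsWithIds(double[] centers) {
        List<SeedCluster> seeds = new ArrayList<>();
        for (int i = 0; i < centers.length; i++) {
            seeds.add(new SeedCluster(i, centers[i]));
        }
        return seeds;
    }

    // Labels each point with the id of its nearest seed and counts points per id.
    static Map<Integer, Integer> countByClusterId(List<SeedCluster> seeds, double[] points) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (double p : points) {
            SeedCluster nearest = seeds.get(0);
            for (SeedCluster s : seeds) {
                if (Math.abs(p - s.center) < Math.abs(p - nearest.center)) {
                    nearest = s;
                }
            }
            counts.merge(nearest.id, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] centers = {0.0, 10.0};
        double[] points = {0.1, 0.2, 9.8, 9.9};
        System.out.println(countByClusterId(seedsWithoutIds(centers), points)); // {0=4}
        System.out.println(countByClusterId(seedsWithIds(centers), points));    // {0=2, 1=2}
    }
}
```

Even though the nearest center is computed correctly in both runs, the output keyed by id collapses to a single cluster when every id is zero, which matches the symptom reported here.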



> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11-RandomSeedGenerator.patch, MAHOUT-11.patch




[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-19 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-11:
---

Attachment: MAHOUT-11.patch

I am not the original author of the source, but I still managed to get the 
static fields out of the k-means clustering code. All unit tests are still 
passing. However, I would feel a lot better if someone else double-checked the 
changes made.

Looking at the code, I spotted some more points that could benefit from being 
revisited (e.g. usage of deprecated MapReduce APIs and introduction of status 
reports). But this should be done in a separate issue.

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11.patch




[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-11:


Priority: Major  (was: Minor)
Assignee: (was: Sean Owen)

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3




[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-11:


 Priority: Minor  (was: Major)
Fix Version/s: 0.3
 Assignee: Sean Owen  (was: Dawid Weiss)

I agree with this; it is bad design. I will take on this old issue and try to 
patch it up.

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 0.3
