[jira] [Commented] (MAHOUT-1385) Caching Encoders don't cache
[ https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910299#comment-13910299 ] Johannes Schulte commented on MAHOUT-1385: --

Hi Manjo. My point was that these classes are (hopefully) meant to be a performance improvement, which would be the case if the string's hash code could be reused (because the Java String object caches its own hash code). Using the byte[] hash code is evil because it depends on the object reference. Using another library / hashing strategy for the values inside the byte array is nonsense, because computing that hash is exactly what we are trying to cache, if I understand the intent of the creator correctly.

The more I think about it - was this ever correct? Using the String hash code as a lookup key for the MurmurHash-based location? There will be different collisions, leading to other results than with no caching, which should be avoided?

Caching Encoders don't cache
Key: MAHOUT-1385 URL: https://issues.apache.org/jira/browse/MAHOUT-1385 Project: Mahout Issue Type: Bug Affects Versions: 0.8 Reporter: Johannes Schulte Priority: Minor Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch

The Caching... line of encoders contains code for caching the hash codes of terms added to the vector. However, the method hashForProbe inside these classes is never called, because its signature takes a String for the original-form parameter (instead of byte[] like the other encoders). Changing this to byte[], however, would lose the Java String's internal caching of its hash code, which is used as the key in the cache map, triggering another hash code calculation.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
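[Editor's note] To make the caching idea concrete, here is a minimal, hypothetical sketch (class and method names are simplified and are not Mahout's actual encoder API): the expensive byte-level hash is computed once per distinct term and remembered in a map keyed by the original String, so lookups ride on the JDK's cached String.hashCode(). Keying the map by byte[] would break this, because byte[] hashCode()/equals() are identity-based. The FNV-1a hash below is only a stand-in for MurmurHash.

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    public class CachingEncoderSketch {

      private final Map<String, Integer> hashCache = new HashMap<>();

      // Stand-in for the expensive hash over the term's UTF-8 bytes
      // (FNV-1a, purely for illustration; Mahout uses MurmurHash).
      private static int expensiveHash(byte[] bytes) {
        int h = 0x811C9DC5;
        for (byte b : bytes) {
          h ^= (b & 0xFF);
          h *= 0x01000193;
        }
        return h;
      }

      // Simplified: cache the base hash per term and derive the probe location from it.
      public int hashForProbe(String term, int probe, int dataSize) {
        int base = hashCache.computeIfAbsent(
            term, t -> expensiveHash(t.getBytes(StandardCharsets.UTF_8)));
        return Math.floorMod(base + probe, dataSize);
      }

      public static void main(String[] args) {
        CachingEncoderSketch enc = new CachingEncoderSketch();
        System.out.println(enc.hashForProbe("foo", 0, 100)); // hash computed and cached
        System.out.println(enc.hashForProbe("foo", 1, 100)); // served from the cache
      }
    }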
Re: Streaming KMeans clustering
Right. Up until now i'm helping myself with some minDf truncation, since i am using tf-idf weighted vectors anyway and have the idf counts at hand. But having a true loss-driven sparsification might be better.

On Sun, Dec 29, 2013 at 11:43 PM, Ted Dunning ted.dunn...@gmail.com wrote: Johannes, One thing that might be of real interest is something that I haven't tried, nor read about. It should still be interesting. The idea is L_1 regularized clustering. This will give you (hopefully) sparse centroids that still are useful. In your example, where you want defensible (aka salable) clusters, it would make the clusters much easier to understand. It would also make them take less memory.

On Sun, Dec 29, 2013 at 1:55 PM, Johannes Schulte johannes.schu...@gmail.com wrote: Ted, thanks for the long response! I agree with you on the benefit of a lot of clusters. The reason i chose 10 is not because i think there are truly 10 clusters in the data (there are probably thousands; e-commerce site behavioural data), but for technical reasons:
- the cluster distances are used in a probability prediction model with rare-events data, so i want every cluster to contain at least some positive examples
- the cluster centroids need to be kept in memory online for real-time feature vector generation and olr scoring; the vectors are quite big and there are ~100 clients to handle simultaneously (100 clients * 10 clusters * ~1 non-zero features per cluster centroid * 8 byte per vector element)
- when selling the clusters, visualized, it's easier to plot something with only 10 clusters
i'll give it another try with more clusters in the sketch phase and see what i can achieve. thanks for your help!

On Sun, Dec 29, 2013 at 1:32 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte johannes.schu...@gmail.com wrote: Okay, understood. For a lot of clusters (which i don't necessarily attribute to big data problems, but definitely to nearest-neighbour-like usage of clusters), the "every cluster sees every point" approach doesn't scale well.

Nearest neighbor sorts of problems. Or market segmentation kinds of problems. Or feature generation kinds of problems. Or volume quantization kinds of problems. The point is that with more data, you can derive a more detailed model. That means more clusters. The ideal case is that we have an infinite mixture model to describe the data distribution, but we haven't seen most of the mixture components yet. As we get more data, we can justify saying we have more components. Even if you think that there are *really* 10 clusters in the data, I can make the case that you are going to get a better unsupervised description of those clusters by using 100 or more clusters in the k-means algorithm. The idea is that each of your real clusters will be composed of several of the k-means clusters, and finding the attachments is much easier because 100 clusters is much smaller than the original number of samples. As a small data example, suppose you are looking at Anderson's Iris data, which is available built into R.
If we plot the 150 data points against the features we have, in pairs, we can see various patterns and see that a non-linear classifier should do quite well in separating the classes (I hope the image makes it through the emailer): [image: Inline image 1]

But if we try to do k-means on these data with only 3 clusters, we get very poor assignment to the three species:

    k = kmeans(iris[,1:4], centers=3, nstart=10)
    table(iris$Species, k$cluster)

                cluster
                  1  2  3
    setosa       50  0  0
    versicolor    0 48  2
    virginica     0 14 36

Cluster 1 captured the isolated setosa species rather well, but versicolor and virginica are not well separated, because cluster 2 has 80% of versicolor and 20% of virginica. On the other hand, if we use 7 clusters,

    k = kmeans(iris[,1:4], centers=7, nstart=10)
    table(iris$Species, k$cluster)

                cluster
                  1  2  3  4  5  6  7
    setosa        0  0 28  0 22  0  0
    versicolor    0  7  0 20  0  0 23
    virginica    12  0  0  1  0 24 13

each cluster is now composed of almost exactly one species. Only cluster 4 has any impurity, and it is 95% composed of just versicolor samples. What this means is that we can use the 7-cluster k-means results to build a classifier that has a 1-of-7 input feature (the cluster id) instead of 4 real values. That is, we have compressed the 4 original continuous features down to about 2.7 bits on average, and this compressed representation actually makes building a classifier nearly trivial.

    -sum(table(k$cluster) * log(table(k$cluster)/ 150
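[Editor's note] A back-of-the-envelope check (not part of the original mail) of the "about 2.7 bits" figure, using the per-cluster totals read off the 7-cluster table above (12, 7, 28, 21, 22, 24, 36 points out of 150):

    public class ClusterEntropy {
      public static void main(String[] args) {
        // Cluster sizes taken from the column sums of the 7-cluster table above.
        int[] clusterSizes = {12, 7, 28, 21, 22, 24, 36};
        int n = 150;
        double bits = 0.0;
        for (int size : clusterSizes) {
          double p = (double) size / n;
          bits -= p * Math.log(p) / Math.log(2);   // convert natural log to base 2
        }
        System.out.printf("average code length = %.2f bits%n", bits);  // prints ~2.67 bits
      }
    }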
Re: Streaming KMeans clustering
Okay, understood. For a lot of clusters (which i don't necessarily attribute to big data problems, but definitely to nearest-neighbour-like usage of clusters), the "every cluster sees every point" approach doesn't scale well. However, for (final) 1000 clusters i see around 200 distance measure calculations per point, a lot better than 1000. And another thing that makes me think is that the test runs A LOT faster than with 10 final clusters.

On Sat, Dec 28, 2013 at 1:09 AM, Ted Dunning ted.dunn...@gmail.com wrote: Of course, this is just for one pass of k-means. If you need 4 passes, you have break-even. More typically for big data problems, k=1000 or some such. Total number of distance computations for streaming k-means will still be about 40 (or adjust to the more theory-motivated value of log k + log log N = 10 + 5, and then adjust with a bit of fudge for the real world). For k-means in that case, you still have 1000 distances to compute per pass and multiple passes to do. That ratio then becomes something more like 10,000 / 40 = 250.

On Fri, Dec 27, 2013 at 12:55 PM, Johannes Schulte johannes.schu...@gmail.com wrote: I updated the repository (with the typo) g...@github.com:baunz/cluster-comprarison.git to include more logging information about the number of times the distance measure calculation is triggered (which is the most expensive thing imo). The factor of distance measure calculations per point seen is about 40 for streaming k-means and 10 for regular k-means (because there are 10 clusters). This is of course dependent on the searchSize parameter, but i used the default value of 2.

On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm isa...@apache.org wrote: Hi Dan, On Fri, 27 Dec 2013 14:13:51 +0200 Dan Filimon dfili...@apache.org wrote: Thoughts? First of all - good to see you back on dev@ :) Seems a few people have run into these issues. As currently there is no high-level documentation for the whole streaming kmeans implementation - would you mind writing up the limitations and advice you have for users of this algorithm? Doesn't need to be anything fancy - essentially a "here's how you compute how much memory you need to run this, here's the limitations and the flags to deal with these, here's things that should be changed or fixed in a later iteration" - unless your previous mail covers all of this already. This could save people a few debugging cycles when getting started with this at scale. Feel free to get it into our web page (if you are short on time, just write something up using markdown, I can take over publishing it). Isabel
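[Editor's note] A rough restatement of Ted's distance-computation estimate above, as a runnable sketch. Only k = 1000, the ~40 per-point figure and the 10,000 / 40 ratio come from the thread; the number of points and the number of batch passes below are assumptions for illustration.

    public class DistanceCalcEstimate {
      public static void main(String[] args) {
        int k = 1000;            // final number of clusters (from the thread)
        long n = 1_000_000_000L; // number of points, assumed for illustration
        int passes = 10;         // batch k-means iterations, assumed

        // Streaming k-means: roughly log2(k) + log2(log2(n)) distance calcs per point,
        // padded with a real-world fudge factor to the ~40 mentioned in the thread.
        double log2k = Math.log(k) / Math.log(2);                              // ~10
        double log2log2n = Math.log(Math.log(n) / Math.log(2)) / Math.log(2);  // ~5
        double streamingPerPoint = log2k + log2log2n;

        // Batch k-means: k distance calcs per point per pass.
        long batchPerPoint = (long) k * passes;                                // 10,000

        System.out.printf("streaming ~ %.0f distance calcs per point (theory, before fudge)%n",
            streamingPerPoint);
        System.out.printf("batch     ~ %d distance calcs per point%n", batchPerPoint);
        System.out.printf("ratio     ~ %.0f%n", batchPerPoint / 40.0);         // ~250, as in the thread
      }
    }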
Re: Streaming KMeans clustering
I updated the repository (with the typo) g...@github.com:baunz/cluster-comprarison.git to include more logging information about the number of times the distance measure calculation is triggered (which is the most expensive thing imo). The factor of distance measure calculations per point seen is about 40 for streaming k-means and 10 for regular k-means (because there are 10 clusters). This is of course dependent on the searchSize parameter, but i used the default value of 2.

On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm isa...@apache.org wrote: Hi Dan, On Fri, 27 Dec 2013 14:13:51 +0200 Dan Filimon dfili...@apache.org wrote: Thoughts? First of all - good to see you back on dev@ :) Seems a few people have run into these issues. As currently there is no high-level documentation for the whole streaming kmeans implementation - would you mind writing up the limitations and advice you have for users of this algorithm? Doesn't need to be anything fancy - essentially a "here's how you compute how much memory you need to run this, here's the limitations and the flags to deal with these, here's things that should be changed or fixed in a later iteration" - unless your previous mail covers all of this already. This could save people a few debugging cycles when getting started with this at scale. Feel free to get it into our web page (if you are short on time, just write something up using markdown, I can take over publishing it). Isabel
Re: Streaming KMeans clustering
Hi, i also had problems getting up to speed, but i blamed the cardinality of the vectors for that. i didn't do the math exactly, but while streaming k-means improves over regular k-means by using log(k) and (number of datapoints / k) passes, the dimension parameter d from the original k*d*n stays untouched, right? What is your vector's cardinality?

On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Ted, What were the CLI parameters when you ran this test for 1M points - no. of clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?

On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com wrote: For reference, on a 16 core machine, I was able to run the sequential version of streaming k-means on 1,000,000 points, each with 10 dimensions, in about 20 seconds. The map-reduce versions are comparable subject to scaling, except for startup time.

On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org wrote: That the algorithm runs a single reducer is expected. The algorithm creates a sketch of the data in parallel in the map phase, which is collected by the reducer afterwards. The reducer then applies an expensive in-memory clustering algorithm to the sketch. Which dataset are you using for testing? I can also do some tests on a cluster here. I can imagine two possible causes for the problems: Maybe there's a problem with the vectors and some calculations take very long because the wrong access pattern or implementation is chosen. Another problem could be that the mappers and reducers have too little memory and spend a lot of time running garbage collections. --sebastian

On 23.12.2013 22:14, Suneel Marthi wrote: Has anyone been successful running Streaming KMeans clustering on a large dataset (> 100,000 points)? It just seems to take a very long time (> 4hrs) for the mappers to finish on about 300K data points, and the reduce phase has only a single reducer running and throws an OOM, failing the job several hours after the job has been kicked off. It's the same story when trying to run in sequential mode. Looking at the code, the bottleneck seems to be in StreamingKMeans.clusterInternal(); without understanding the behaviour of the algorithm I am not sure if the sequence of steps in there is correct. There are a few calls that call themselves repeatedly over and over again, like StreamingKMeans.clusterInternal() and Searcher.searchFirst(). We really need to have this working on datasets that are larger than the 20K Reuters dataset. I am trying to run this on 300K vectors with k=100, km=1261 and FastProjectSearch.
Re: Streaming KMeans clustering
Hey Sebastian, it was a text-like clustering problem with a dimensionality of 100,000; the number of data points could have been a million, but i always cancelled it after a while (i used the java classes, not the command line version, and monitored the progress). As for my statements above: they are possibly not quite correct. Sure, the projection search reduces the amount of searching needed, but when i looked into the code, i identified two problems, if i remember correctly:
- the searching of pending additions
- the projection itself
but i'll have to retry that and look into the code again. i ended up using the old k-means code on a sample of the data.. cheers, johannes

On Wed, Dec 25, 2013 at 11:17 AM, Sebastian Schelter s...@apache.org wrote: Hi Johannes, can you share some details about the dataset that you ran streaming k-means on (number of datapoints, cardinality, etc)? @Ted/Suneel Shouldn't the approximate searching techniques (e.g. projection search) help cope with high-dimensional inputs? --sebastian

On 25.12.2013 10:42, Johannes Schulte wrote: Hi, i also had problems getting up to speed, but i blamed the cardinality of the vectors for that. i didn't do the math exactly, but while streaming k-means improves over regular k-means by using log(k) and (number of datapoints / k) passes, the dimension parameter d from the original k*d*n stays untouched, right? What is your vector's cardinality?

On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Ted, What were the CLI parameters when you ran this test for 1M points - no. of clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?

On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com wrote: For reference, on a 16 core machine, I was able to run the sequential version of streaming k-means on 1,000,000 points, each with 10 dimensions, in about 20 seconds. The map-reduce versions are comparable subject to scaling, except for startup time.

On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org wrote: That the algorithm runs a single reducer is expected. The algorithm creates a sketch of the data in parallel in the map phase, which is collected by the reducer afterwards. The reducer then applies an expensive in-memory clustering algorithm to the sketch. Which dataset are you using for testing? I can also do some tests on a cluster here. I can imagine two possible causes for the problems: Maybe there's a problem with the vectors and some calculations take very long because the wrong access pattern or implementation is chosen. Another problem could be that the mappers and reducers have too little memory and spend a lot of time running garbage collections. --sebastian

On 23.12.2013 22:14, Suneel Marthi wrote: Has anyone been successful running Streaming KMeans clustering on a large dataset (> 100,000 points)? It just seems to take a very long time (> 4hrs) for the mappers to finish on about 300K data points, and the reduce phase has only a single reducer running and throws an OOM, failing the job several hours after the job has been kicked off. It's the same story when trying to run in sequential mode. Looking at the code, the bottleneck seems to be in StreamingKMeans.clusterInternal(); without understanding the behaviour of the algorithm I am not sure if the sequence of steps in there is correct. There are a few calls that call themselves repeatedly over and over again, like StreamingKMeans.clusterInternal() and Searcher.searchFirst(). We really need to have this working on datasets that are larger than the 20K Reuters dataset. I am trying to run this on 300K vectors with k=100, km=1261 and FastProjectSearch.
Re: Streaming KMeans clustering
everybody should have the right to do job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G"); for that :)

For my problems, i always felt the sketching took too long. i put up a simple comparison here: g...@github.com:baunz/cluster-comprarison.git It generates some sample vectors and clusters them with regular k-means and streaming k-means, both sequentially. i took 10 k-means iterations as a benchmark and used the default values for FastProjectionSearch from the kMeans driver class. VisualVM tells me the most time is spent in FastProjectionSearch.remove(), which is called on every added datapoint. Maybe i got something wrong, but for these sparse, high-dimensional vectors i never got streaming k-means faster than the regular version.

On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Not sure how that would work in a corporate setting wherein there's a fixed system-wide setting that cannot be overridden. Sent from my iPhone

On Dec 25, 2013, at 9:44 AM, Sebastian Schelter s...@apache.org wrote: On 25.12.2013 14:19, Suneel Marthi wrote: On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com wrote: For reference, on a 16 core machine, I was able to run the sequential version of streaming k-means on 1,000,000 points, each with 10 dimensions, in about 20 seconds. The map-reduce versions are comparable subject to scaling, except for startup time.

@Ted, were u working off the Streaming KMeans impl as in Mahout 0.8? Not sure how this would have even worked for u in sequential mode in light of the issues reported against M-1314, M-1358, M-1380 (all of which impact the sequential mode), unless u had fixed them locally. What were ur estimatedDistanceCutoff, number of clusters 'k', projection search, and how much memory did u have to allocate to the single Reducer?

If I read the source code correctly, the final reducer clusters the sketch, which should contain m * k * log n intermediate centroids, where k is the number of desired clusters, m is the number of mappers run and n is the number of datapoints. Those centroids are expected to be dense, so we can estimate the memory required for the final reducer using this formula.

On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org wrote: That the algorithm runs a single reducer is expected. The algorithm creates a sketch of the data in parallel in the map phase, which is collected by the reducer afterwards. The reducer then applies an expensive in-memory clustering algorithm to the sketch. Which dataset are you using for testing? I can also do some tests on a cluster here. I can imagine two possible causes for the problems: Maybe there's a problem with the vectors and some calculations take very long because the wrong access pattern or implementation is chosen. Another problem could be that the mappers and reducers have too little memory and spend a lot of time running garbage collections. --sebastian

On 23.12.2013 22:14, Suneel Marthi wrote: Has anyone been successful running Streaming KMeans clustering on a large dataset (> 100,000 points)? It just seems to take a very long time (> 4hrs) for the mappers to finish on about 300K data points, and the reduce phase has only a single reducer running and throws an OOM, failing the job several hours after the job has been kicked off. It's the same story when trying to run in sequential mode. Looking at the code, the bottleneck seems to be in StreamingKMeans.clusterInternal(); without understanding the behaviour of the algorithm I am not sure if the sequence of steps in there is correct. There are a few calls that call themselves repeatedly over and over again, like StreamingKMeans.clusterInternal() and Searcher.searchFirst(). We really need to have this working on datasets that are larger than the 20K Reuters dataset. I am trying to run this on 300K vectors with k=100, km=1261 and FastProjectSearch.
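[Editor's note] A quick worked estimate of the reducer-side memory, using only the m * k * log n formula Sebastian gives above; the number of mappers, the centroid dimensionality and the log base are assumptions for illustration, not values from the thread.

    public class SketchMemoryEstimate {
      public static void main(String[] args) {
        int m = 20;          // number of mappers, assumed
        int k = 100;         // desired clusters, as in the 300K-vector run above
        long n = 300_000L;   // number of data points, as in the 300K-vector run above
        int d = 1000;        // centroid dimensionality, assumed; dense doubles

        // Sketch size per the formula above, taking the log as base 2.
        double centroids = m * k * (Math.log(n) / Math.log(2));
        double bytes = centroids * d * 8.0;   // 8 bytes per double entry

        System.out.printf("~%.0f intermediate centroids in the sketch%n", centroids);
        System.out.printf("~%.2f GB for dense centroids in the single reducer%n",
            bytes / (1024.0 * 1024 * 1024));
      }
    }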
Re: Streaming KMeans clustering
To be honest, i always cancelled the sketching after a while because i wasn't satisfied with the points-per-second speed. The version used is the 0.8 release. If i find the time i'm going to look at what is called when and where and how often, and what the problem could be.

On Thu, Dec 26, 2013 at 8:22 AM, Ted Dunning ted.dunn...@gmail.com wrote: Interesting. In Dan's tests on sparse data, he got about 10x speedup net. You didn't run multiple sketching passes, did you? Also, which version? There was a horrendous clone in there at one time.

On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte johannes.schu...@gmail.com wrote: everybody should have the right to do job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G"); for that :) For my problems, i always felt the sketching took too long. i put up a simple comparison here: g...@github.com:baunz/cluster-comprarison.git It generates some sample vectors and clusters them with regular k-means and streaming k-means, both sequentially. i took 10 k-means iterations as a benchmark and used the default values for FastProjectionSearch from the kMeans driver class. VisualVM tells me the most time is spent in FastProjectionSearch.remove(), which is called on every added datapoint. Maybe i got something wrong, but for these sparse, high-dimensional vectors i never got streaming k-means faster than the regular version.

On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Not sure how that would work in a corporate setting wherein there's a fixed system-wide setting that cannot be overridden. Sent from my iPhone

On Dec 25, 2013, at 9:44 AM, Sebastian Schelter s...@apache.org wrote: On 25.12.2013 14:19, Suneel Marthi wrote: On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com wrote: For reference, on a 16 core machine, I was able to run the sequential version of streaming k-means on 1,000,000 points, each with 10 dimensions, in about 20 seconds. The map-reduce versions are comparable subject to scaling, except for startup time.

@Ted, were u working off the Streaming KMeans impl as in Mahout 0.8? Not sure how this would have even worked for u in sequential mode in light of the issues reported against M-1314, M-1358, M-1380 (all of which impact the sequential mode), unless u had fixed them locally. What were ur estimatedDistanceCutoff, number of clusters 'k', projection search, and how much memory did u have to allocate to the single Reducer?

If I read the source code correctly, the final reducer clusters the sketch, which should contain m * k * log n intermediate centroids, where k is the number of desired clusters, m is the number of mappers run and n is the number of datapoints. Those centroids are expected to be dense, so we can estimate the memory required for the final reducer using this formula.

On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org wrote: That the algorithm runs a single reducer is expected. The algorithm creates a sketch of the data in parallel in the map phase, which is collected by the reducer afterwards. The reducer then applies an expensive in-memory clustering algorithm to the sketch. Which dataset are you using for testing? I can also do some tests on a cluster here. I can imagine two possible causes for the problems: Maybe there's a problem with the vectors and some calculations take very long because the wrong access pattern or implementation is chosen. Another problem could be that the mappers and reducers have too little memory and spend a lot of time running garbage collections. --sebastian

On 23.12.2013 22:14, Suneel Marthi wrote: Has anyone been successful running Streaming KMeans clustering on a large dataset (> 100,000 points)? It just seems to take a very long time (> 4hrs) for the mappers to finish on about 300K data points, and the reduce phase has only a single reducer running and throws an OOM, failing the job several hours after the job has been kicked off. It's the same story when trying to run in sequential mode. Looking at the code, the bottleneck seems to be in StreamingKMeans.clusterInternal(); without understanding the behaviour of the algorithm I am not sure if the sequence of steps in there is correct. There are a few calls that call themselves repeatedly over and over again, like StreamingKMeans.clusterInternal() and Searcher.searchFirst(). We really need to have this working on datasets that are larger than the 20K Reuters dataset. I am trying to run this on 300K vectors with k=100, km=1261 and FastProjectSearch.
[jira] [Created] (MAHOUT-1385) Caching Encoders don't cache
Johannes Schulte created MAHOUT-1385: Summary: Caching Encoders don't cache Key: MAHOUT-1385 URL: https://issues.apache.org/jira/browse/MAHOUT-1385 Project: Mahout Issue Type: Bug Affects Versions: 0.8 Reporter: Johannes Schulte Priority: Minor

The Caching... line of encoders contains code for caching the hash codes of terms added to the vector. However, the method hashForProbe inside these classes is never called, because its signature takes a String for the original-form parameter (instead of byte[] like the other encoders). Changing this to byte[], however, would lose the Java String's internal caching of its hash code, which is used as the key in the cache map, triggering another hash code calculation.

-- This message was sent by Atlassian JIRA (v6.1.4#6159)
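[Editor's note] A simplified, self-contained sketch of the defect described above (class and method names are illustrative, not Mahout's exact API): the base class only ever calls the byte[] variant of hashForProbe, so a subclass that declares hashForProbe(String ...) merely adds an overload that is never invoked, and its cache never comes into play.

    import java.nio.charset.StandardCharsets;

    abstract class BaseEncoderSketch {
      public void addToVector(String originalForm, int dataSize) {
        byte[] bytes = originalForm.getBytes(StandardCharsets.UTF_8);
        int location = hashForProbe(bytes, dataSize, 0);   // always the byte[] signature
        System.out.println("location = " + location);
      }

      protected abstract int hashForProbe(byte[] originalForm, int dataSize, int probe);
    }

    class CachingEncoderAttempt extends BaseEncoderSketch {

      // Intended caching entry point, but it only overloads; nothing ever calls it.
      protected int hashForProbe(String originalForm, int dataSize, int probe) {
        System.out.println("cached path (never reached)");
        return 0;
      }

      @Override
      protected int hashForProbe(byte[] originalForm, int dataSize, int probe) {
        // This is what actually runs, bypassing the cache entirely.
        return Math.floorMod(java.util.Arrays.hashCode(originalForm) + probe, dataSize);
      }
    }

    public class Mahout1385Demo {
      public static void main(String[] args) {
        new CachingEncoderAttempt().addToVector("foo", 100);  // prints only the location
      }
    }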
[jira] [Updated] (MAHOUT-1385) Caching Encoders don't cache
[ https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johannes Schulte updated MAHOUT-1385: - Attachment: MAHOUT-1385-test.patch

No solution, but a demonstration of the defect.

Caching Encoders don't cache
Key: MAHOUT-1385 URL: https://issues.apache.org/jira/browse/MAHOUT-1385 Project: Mahout Issue Type: Bug Affects Versions: 0.8 Reporter: Johannes Schulte Priority: Minor Attachments: MAHOUT-1385-test.patch

The Caching... line of encoders contains code for caching the hash codes of terms added to the vector. However, the method hashForProbe inside these classes is never called, because its signature takes a String for the original-form parameter (instead of byte[] like the other encoders). Changing this to byte[], however, would lose the Java String's internal caching of its hash code, which is used as the key in the cache map, triggering another hash code calculation.

-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (MAHOUT-1357) InteractionValueEncoder produces wrong traceDictionary entries
Johannes Schulte created MAHOUT-1357: Summary: InteractionValueEncoder produces wrong traceDictionary entries Key: MAHOUT-1357 URL: https://issues.apache.org/jira/browse/MAHOUT-1357 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.8 Reporter: Johannes Schulte Priority: Minor

In the trace code, the byte values of the terms being hashed are not converted back to a String but just concatenated in their raw form with Arrays.toString(). This makes reverse engineering even harder! The fix is to just create a new String; see the patch attached.

-- This message was sent by Atlassian JIRA (v6.1#6144)
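[Editor's note] A small illustration (not the actual Mahout patch) of the trace-dictionary problem described above: dumping the raw byte values versus converting them back to a readable String before building the trace entry.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class TraceEntryDemo {
      public static void main(String[] args) {
        byte[] term = "price".getBytes(StandardCharsets.UTF_8);

        // What the bug effectively produced: raw byte values, hard to reverse-engineer.
        System.out.println(Arrays.toString(term));                      // [112, 114, 105, 99, 101]

        // What the fix produces: the original term, readable again.
        System.out.println(new String(term, StandardCharsets.UTF_8));   // price
      }
    }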
[jira] [Updated] (MAHOUT-1357) InteractionValueEncoder produces wrong traceDictionary entries
[ https://issues.apache.org/jira/browse/MAHOUT-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johannes Schulte updated MAHOUT-1357: - Status: Patch Available (was: Open)

InteractionValueEncoder produces wrong traceDictionary entries
Key: MAHOUT-1357 URL: https://issues.apache.org/jira/browse/MAHOUT-1357 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.8 Reporter: Johannes Schulte Priority: Minor Attachments: MAHOUT-1357.patch

In the trace code, the byte values of the terms being hashed are not converted back to a String but just concatenated in their raw form with Arrays.toString(). This makes reverse engineering even harder! The fix is to just create a new String; see the patch attached.

-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAHOUT-1357) InteractionValueEncoder produces wrong traceDictionary entries
[ https://issues.apache.org/jira/browse/MAHOUT-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johannes Schulte updated MAHOUT-1357: - Attachment: MAHOUT-1357.patch

InteractionValueEncoder produces wrong traceDictionary entries
Key: MAHOUT-1357 URL: https://issues.apache.org/jira/browse/MAHOUT-1357 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.8 Reporter: Johannes Schulte Priority: Minor Attachments: MAHOUT-1357.patch

In the trace code, the byte values of the terms being hashed are not converted back to a String but just concatenated in their raw form with Arrays.toString(). This makes reverse engineering even harder! The fix is to just create a new String; see the patch attached.

-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1357) InteractionValueEncoder produces wrong traceDictionary entries
[ https://issues.apache.org/jira/browse/MAHOUT-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823719#comment-13823719 ] Johannes Schulte commented on MAHOUT-1357: -- Thanks!

InteractionValueEncoder produces wrong traceDictionary entries
Key: MAHOUT-1357 URL: https://issues.apache.org/jira/browse/MAHOUT-1357 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.8 Reporter: Johannes Schulte Assignee: Suneel Marthi Priority: Minor Fix For: 0.9 Attachments: MAHOUT-1357.patch

In the trace code, the byte values of the terms being hashed are not converted back to a String but just concatenated in their raw form with Arrays.toString(). This makes reverse engineering even harder! The fix is to just create a new String; see the patch attached.

-- This message was sent by Atlassian JIRA (v6.1#6144)
FeatureVectorEncoder Extension
Hi, i created an extension of the feature vector encoder framework that allows a byte array offset and length to be passed in. Some questions remain before creating an issue and attaching a diff:

1. When using Sun conventions with 2 spaces (the link is broken, by the way), which line length should I choose? I'm using Eclipse and my code looks somewhat different with the 80-char line length.

2. I extended all existing methods that take a byte[] array to also take offset and length; the old byte[] methods stay the same. After implementing

    public void addInteractionToVector(byte[] originalForm1, int offset1, int length1, byte[] originalForm2, int offset2, int length2, double weight, Vector data)

i thought it would maybe be smarter to use ByteBuffer for passing in the byte array, offset and positions. A ByteBuffer is created EVERY time inside the MurmurHash class anyway, so it wouldn't produce any more objects.

Any comments / wishes? Cheers, Johannes
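[Editor's note] A sketch of the ByteBuffer idea floated above. The offset/length signature comes from the mail; the ByteBuffer overload, class name and demo values are illustrative, not an agreed-upon Mahout API, and the Vector parameter is omitted to keep the example self-contained.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class InteractionEncoderProposal {

      // Proposed offset/length variant from the mail, delegating to a ByteBuffer overload.
      public void addInteractionToVector(byte[] originalForm1, int offset1, int length1,
                                         byte[] originalForm2, int offset2, int length2,
                                         double weight) {
        addInteractionToVector(ByteBuffer.wrap(originalForm1, offset1, length1),
                               ByteBuffer.wrap(originalForm2, offset2, length2),
                               weight);
      }

      public void addInteractionToVector(ByteBuffer form1, ByteBuffer form2, double weight) {
        // A real encoder would hash both regions (MurmurHash in Mahout) and update the vector;
        // here we only show that the wrapped regions need no extra copies of the arrays.
        System.out.printf("hashing %d + %d bytes with weight %.2f%n",
            form1.remaining(), form2.remaining(), weight);
      }

      public static void main(String[] args) {
        byte[] buf = "category=shoes|brand=acme".getBytes(StandardCharsets.UTF_8);
        // Encode the interaction of two regions of the same backing array.
        new InteractionEncoderProposal().addInteractionToVector(buf, 0, 14, buf, 15, 10, 1.0);
      }
    }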