While reading through the wiki and article material on Mahout, I noticed
that there is a pre-generation step in which vectors are generated
either from text with Lucene or from ARFF with
org.apache.mahout.utils.vectors.arff.Driver. Looking at the k-means
driver and mapper (KMeansMapper.java), I noticed that the mapper takes
a key and then a Vector (point) as input.

 

Would it be smart or practical to write a special record reader for your
file format that reads your data in as vectors directly and emits those
vectors to the mapper, in order to skip the pre-generation step? I'm
just curious about that; maybe I'm missing something, or maybe
vectorization would be cumbersome at that point in the job.
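
To make the idea concrete, here is a rough sketch of the per-record
parsing step such a record reader would perform. It is plain Java with
no Hadoop or Mahout dependencies; the class name SensorLineVectorizer
and the comma-separated input format are assumptions made up for
illustration, not anything from the Mahout code base. In a real job, a
custom Hadoop RecordReader would presumably call something like this for
each line and wrap the resulting array in a Mahout Vector before handing
it to the mapper.

```java
/**
 * Hypothetical helper (not part of Mahout): turns one line of
 * comma-separated numeric readings into a dense array of doubles.
 */
final class SensorLineVectorizer {

    private SensorLineVectorizer() {
        // static utility class, not meant to be instantiated
    }

    /**
     * Parses a line such as "1.0, 2.5, 3" into {1.0, 2.5, 3.0}.
     * Assumes every field is numeric; a blank or malformed field
     * makes Double.parseDouble throw NumberFormatException.
     */
    static double[] parseLine(String line) {
        // Trim the line, then split on commas with optional whitespace.
        String[] fields = line.trim().split("\\s*,\\s*");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i]);
        }
        return values;
    }
}
```

Calling SensorLineVectorizer.parseLine("1.0, 2.5, 3") would give one
dense point per input line, so the vectors would be built at read time
on every run instead of once in a separate pass.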

 

Also, in Grant's article on Mahout he includes the vectorized 2.5 GB
file from Wikipedia, which is already in the correct format via Lucene
to work with a Mahout clustering algorithm. Is there a smaller (under
100 MB) version of this that I could play around with? I'm working with
basic building blocks right now and figuring out the facets of
vectorization with respect to Mahout, so we can learn the base case
(Lucene vectors) and then move on to our specific case (sensor time
series data).

 

Josh Patterson

TVA
