Hi,

I'd like to use Mahout for clustering and classification on tens of
terabytes of data stored on Amazon's S3 storage service.  Each file in my
data yields one data point, and I need to decompress and process the file
before applying machine learning.  Is it necessary to have all the files
pre-processed before using Mahout, or is there a straightforward way to
combine the pre-processing with Mahout?  For example, could I have a script
that does the pre-processing and somehow tell Mahout to run it?

Pre-processing the files prior to running Mahout is simple, but Amazon 
charges for the extra storage space the pre-processed files would use.

Thanks.

Eric
