Hi, I'd like to use Mahout for clustering and classification on tens of terabytes of data stored on Amazon's S3 storage service. Each file in my data yields one data point, and each file has to be decompressed and processed before any machine learning can be applied. Do all the files need to be pre-processed before I use Mahout, or is there a straightforward way to combine the pre-processing with Mahout? For example, I have a script that does the pre-processing; could I somehow tell Mahout to run that script?
Pre-processing all the files before running Mahout would be simple, but Amazon charges for the extra storage space the pre-processed files would use. Thanks. Eric