Eh, hmm, does this job compress by default? I don't have the code here.
That is not generally how Hadoop works, but you could make it do this. I
don't know if there's an override.

On Mar 7, 2012 12:40 AM, "Luke Forehand" <luke.foreh...@networkedinsights.com> wrote:
> Why should it not be compressed in the first place?
>
> Here is the header of one of the reducer parts that was written into
> /mahout/kmeans/clusters-5-final
>
> SEQ org.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster
> )org.apache.hadoop.io.compress.SnappyCodec
>
>
> On 3/6/12 6:33 PM, "Sean Owen" <sro...@gmail.com> wrote:
>
> >OK, but you're talking about reducer output, not mapper output. It
> >should not be compressed in the first place.
> >On Mar 7, 2012 12:29 AM, "Luke Forehand" <
> >luke.foreh...@networkedinsights.com> wrote:
> >
> >> I want the results of the kmeans clustering to be uncompressed or
> >> compressed in a way that my users can natively decompress on their
> >> machines. All our other Hadoop jobs use Snappy compression when writing
> >> output, but our users don't have Snappy and don't particularly want to
> >> install it (especially because of problems installing on Mac). I'll try
> >> adding this param to the HADOOP_OPTS and in the long term probably come
> >> up with a cleaner way to do this. Thanks!
> >>
> >> -Luke
> >>
> >> On 3/6/12 6:24 PM, "Sean Owen" <sro...@gmail.com> wrote:
> >>
> >> >-D arguments are to the JVM so need to be set in HADOOP_OPTS (as I
> >> >recall). Or you configure this in your Hadoop config files. It has no
> >> >meaning to the driver script. Why do you want to disable compression
> >> >after the mapper?
> >> >
> >> >On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
> >> ><luke.foreh...@networkedinsights.com> wrote:
> >> >> I tried the following and it does not work:
> >> >>
> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors \
> >> >>   -c /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 \
> >> >>   -k 10000 -cd 0.01 -x 100 \
> >> >>   -Dmapreduce.map.output.compress=false
> >> >>
> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors \
> >> >>   -c /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 \
> >> >>   -k 10000 -cd 0.01 -x 100 \
> >> >>   -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
> >> >>
> >> >>
> >> >> I am still getting the default codec (Snappy in this case; I don't
> >> >> want the users to have to install native Snappy, which is why I'm
> >> >> trying to override this param). Passing -Dkey=value on the mahout
> >> >> command line does not seem to have any effect on the mapreduce job
> >> >> configuration from what I can tell. Any ideas?
> >> >>
> >> >> -Luke
> >> >>
> >> >> On 3/6/12 3:48 PM, "Sean Owen" <sro...@gmail.com> wrote:
> >> >>
> >> >>>Mapper compression? -Dmapreduce.map.output.compress=false. I think the
> >> >>>key was mapred.output.compress in Hadoop 0.20.0.
> >> >>>I am not sure if there is reducer compression built in, but I could
> >> >>>have missed it.
> >> >>>
> >> >>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
> >> >>><luke.foreh...@networkedinsights.com> wrote:
> >> >>>> Hello,
> >> >>>>
> >> >>>> Is there a way to run the mahout kmeans program from the command
> >> >>>> line, with a parameter that will override (and disable) the reducer
> >> >>>> task compression? I have tried several different ways of specifying
> >> >>>> the -D parameter but I can't seem to get any options to pass through
> >> >>>> to the Hadoop mapreduce configuration.
> >> >>>>
> >> >>>> Thanks!
> >> >>>> Luke