Eh, hmm, does this job compress by default? I don't have the code here.
That is not generally how Hadoop works, but you could make it do this. I
don't know if there's an override.
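Whether the job compresses by default usually comes down to what the cluster's
site configuration sets. A rough check, assuming $HADOOP_CONF_DIR points at the
client config and using the 0.20-era property names:

  # If mapred.output.compress is true here, every job that doesn't override it
  # writes its final output with whatever codec mapred.output.compression.codec
  # names.
  grep -B1 -A2 -E 'mapred.output.compress|mapred.output.compression.codec' \
      "$HADOOP_CONF_DIR"/mapred-site.xml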
On Mar 7, 2012 12:40 AM, "Luke Forehand" <
luke.foreh...@networkedinsights.com> wrote:

> Why should it not be compressed in the first place?
>
> Here is the header of one of the reducer parts that was written into
> /mahout/kmeans/clusters-5-final
>
> SEQ  org.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster
>  )org.apache.hadoop.io.compress.SnappyCodec
>
>
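If the immediate goal is just letting people read these files without installing
Snappy locally, one workaround (a sketch, not something tried in this thread; the
part-file glob is assumed) is to let a machine that already has the native library
and the Mahout jars do the decoding:

  # hadoop fs -text reads the SequenceFile header, picks up the codec it names,
  # and prints key/value toString() output; the Mahout Cluster class must be on
  # the classpath (e.g. via HADOOP_CLASSPATH) for the values to deserialize.
  hadoop fs -text /mahout/kmeans/clusters-5-final/part-* | head

Depending on the Mahout version, seqdumper can produce much the same dump.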
> On 3/6/12 6:33 PM, "Sean Owen" <sro...@gmail.com> wrote:
>
> >Ok, but you're talking about reducer output, not mapper output. It should
> >not be compressed in the first place.
> >On Mar 7, 2012 12:29 AM, "Luke Forehand" <
> >luke.foreh...@networkedinsights.com> wrote:
> >
> >> I want the results of the kmeans clustering to be uncompressed, or
> >> compressed in a way that my users can natively decompress on their
> >> machines.  All our other Hadoop jobs use Snappy compression when writing
> >> output, but our users don't have Snappy and don't particularly want to
> >> install it (especially because of problems installing it on Mac).  I'll
> >> try adding this param to HADOOP_OPTS and in the long term probably come
> >> up with a cleaner way to do this.  Thanks!
> >>
> >> -Luke
> >>
> >> On 3/6/12 6:24 PM, "Sean Owen" <sro...@gmail.com> wrote:
> >>
> >> >-D arguments are to the JVM, so they need to be set in HADOOP_OPTS (as I
> >> >recall), or you can configure this in your Hadoop config files.  They
> >> >have no meaning to the driver script. Why do you want to disable
> >> >compression after the mapper?
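A sketch of that suggestion, assuming the bin/mahout launcher forwards
HADOOP_OPTS into the JVM it starts and that those system properties actually
end up in the job Configuration (neither is guaranteed):

  # Old-style job-output key plus the map-output key; exact names vary by
  # Hadoop version, so verify them against your cluster before relying on this.
  export HADOOP_OPTS="-Dmapred.output.compress=false -Dmapreduce.map.output.compress=false"
  mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c /mahout/initial-clusters/test1 \
    -o /mahout/kmeans/test1 -k 10000 -cd 0.01 -x 100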
> >> >
> >> >On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
> >> ><luke.foreh...@networkedinsights.com> wrote:
> >> >> I tried the following and neither works:
> >> >>
> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors \
> >> >>   -c /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 \
> >> >>   -k 10000 -cd 0.01 -x 100 \
> >> >>   -Dmapreduce.map.output.compress=false
> >> >>
> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors \
> >> >>   -c /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 \
> >> >>   -k 10000 -cd 0.01 -x 100 \
> >> >>   -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
> >> >>
> >> >>
> >> >> And I'm still getting the default codec (which is Snappy in this case,
> >> >> and I don't want the users to have to install native Snappy, which is
> >> >> why I'm trying to override this param).  Passing -Dkey=value on the
> >> >> mahout command line does not seem to have any effect on the mapreduce
> >> >> job configuration, from what I can tell.  Any ideas?
> >> >>
> >> >> -Luke
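For what it's worth: drivers that go through Hadoop's ToolRunner hand their
arguments to GenericOptionsParser, which (as far as I recall) stops at the
first argument it doesn't recognize, so -D settings are conventionally placed
right after the program name, before the tool's own flags. Whether Mahout's
kmeans driver takes that path depends on the version, so this is only a sketch
of the ordering, reusing the paths from the attempts above:

  # Generic -D option first, tool-specific options after it.
  mahout kmeans -Dmapred.output.compress=false \
    -i /mahout/sparse/test1/tfidf-vectors -c /mahout/initial-clusters/test1 \
    -o /mahout/kmeans/test1 -k 10000 -cd 0.01 -x 100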
> >> >>
> >> >> On 3/6/12 3:48 PM, "Sean Owen" <sro...@gmail.com> wrote:
> >> >>
> >> >>>Mapper compression? -Dmapreduce.map.output.compress=false. I think the
> >> >>>key was mapred.output.compress in Hadoop 0.20.0.
> >> >>>I am not sure whether there is built-in reducer compression, but I
> >> >>>could have missed it.
> >> >>>
> >> >>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
> >> >>><luke.foreh...@networkedinsights.com> wrote:
> >> >>>> Hello,
> >> >>>>
> >> >>>> Is there a way to run the mahout kmeans program from the command
> >> >>>> line, with a parameter that will override (and disable) the reducer
> >> >>>> task compression?  I have tried several different ways of specifying
> >> >>>> the -D parameter, but I can't seem to get any options to pass through
> >> >>>> to the hadoop mapreduce configuration.
> >> >>>>
> >> >>>> Thanks!
> >> >>>> Luke
> >> >>
> >>
> >>
>
>
