Re: How to change /tmp directory for mahout usage of map-reduce?

2015-03-31 Thread Vikas Kumar
That was helpful to figure out what was required.
I had to set the right path for the variable *tmp* passed to the function,
changing it from:

Path tmp = new Path("tmp");

to

Path tmp = new Path("<>");

Silly mistake. Thanks for the clue :)
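
For anyone reading this later, here is a minimal stand-alone sketch of why a bare name such as "tmp" can end up somewhere unexpected: a relative path is resolved against a working directory, while an absolute path is used as given. It uses java.nio.file as a stand-in for Hadoop's Path class, which is an assumption made only so that the example runs without Hadoop on the classpath. The helper name resolve and both directory names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class TempDirSketch {
    /**
     * Hypothetical helper (not part of Mahout or Hadoop): resolves a temp
     * directory name against the given working directory unless the name
     * is already an absolute path.
     */
    static Path resolve(Path workingDir, String tempDir) {
        Path p = Paths.get(tempDir);
        return (p.isAbsolute() ? p : workingDir.resolve(p)).normalize();
    }

    public static void main(String[] args) {
        Path workingDir = Paths.get("/home/someuser"); // assumed working directory
        // A bare name lands underneath the working directory.
        System.out.println(resolve(workingDir, "tmp"));
        // An absolute path is used as given.
        System.out.println(resolve(workingDir, "/export/scratch/tmp"));
    }
}
```

Passing an explicit absolute location, as in the fix above, removes the dependence on whatever the working directory happens to be.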

-Vikas

On Wed, Apr 1, 2015 at 1:34 AM, Suneel Marthi 
wrote:

> If u running Spectral KMeans via Command Line, u should be able to set the
> parameter -tempDir to point to a different path
>
> On Wed, Apr 1, 2015 at 1:55 AM, Andrew Musselman <andrew.mussel...@gmail.com>
> wrote:
>
> > Can you let us know which code/scripts you're using?
> >
> > On Tuesday, March 31, 2015, Vikas Kumar  wrote:
> >
> > > Hello,
> > >
> > > I am using Mahout Spectral clustering example which internally calls a
> > > map reduce job. Right now, it is using */tmp/hadoop-/mapred/..*
> > > directory by default for its operations.
> > >
> > > Can someone please let me know how to make mahout to use a different
> > > path?
> > >
> > > Thanks
> > > Vikas
> > >
> >
>


Re: How to change /tmp directory for mahout usage of map-reduce?

2015-03-31 Thread Vikas Kumar
The following line specifically:

SpectralKMeansDriver.run(conf, affinities, output, vectors.size(),
noOfClusters, measure, convergenceDelta, maxIterations, tmp, false);

where other variables are set accordingly. I can send the whole file if
required.

It shows the following in the log which helped me identify that it is using
the user tmp directory:

15/04/01 01:18:13 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
15/04/01 01:18:13 INFO input.FileInputFormat: Total input paths to process
: 1
15/04/01 01:18:13 INFO filecache.TrackerDistributedCacheManager: Creating
vector in 
*/tmp/hadoop-vikas/mapred/local*/archive/-623590149816891030_-1428839080_1939951392/file/export/scratch/vikas/<</tmp/calculations/vector


Thanks
Vikas



On Wed, Apr 1, 2015 at 12:55 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Can you let us know which code/scripts you're using?
>
> On Tuesday, March 31, 2015, Vikas Kumar  wrote:
>
> > Hello,
> >
> > I am using Mahout Spectral clustering example which internally calls a
> > map reduce job. Right now, it is using */tmp/hadoop-/mapred/..*
> > directory by default for its operations.
> >
> > Can someone please let me know how to make mahout to use a different
> > path?
> >
> > Thanks
> > Vikas
> >
>


Re: How to change /tmp directory for mahout usage of map-reduce?

2015-03-31 Thread Suneel Marthi
If you are running Spectral KMeans via the command line, you should be able
to set the -tempDir parameter to point to a different path.

On Wed, Apr 1, 2015 at 1:55 AM, Andrew Musselman  wrote:

> Can you let us know which code/scripts you're using?
>
> On Tuesday, March 31, 2015, Vikas Kumar  wrote:
>
> > Hello,
> >
> > I am using Mahout Spectral clustering example which internally calls a
> > map reduce job. Right now, it is using */tmp/hadoop-/mapred/..*
> > directory by default for its operations.
> >
> > Can someone please let me know how to make mahout to use a different
> > path?
> >
> > Thanks
> > Vikas
> >
>


Re: How to change /tmp directory for mahout usage of map-reduce?

2015-03-31 Thread Andrew Musselman
Can you let us know which code/scripts you're using?

On Tuesday, March 31, 2015, Vikas Kumar  wrote:

> Hello,
>
> I am using Mahout Spectral clustering example which internally calls a map
> reduce job. Right now, it is using */tmp/hadoop-/mapred/..*
> directory by default for its operations.
>
> Can someone please let me know how to make mahout to use a different path?
>
> Thanks
> Vikas
>


How to change /tmp directory for mahout usage of map-reduce?

2015-03-31 Thread Vikas Kumar
Hello,

I am using the Mahout spectral clustering example, which internally runs a
MapReduce job. Right now it uses the */tmp/hadoop-/mapred/..* directory by
default for its operations.

Can someone please let me know how to make Mahout use a different path?

Thanks
Vikas


Re: Text clustering with SVD

2015-03-31 Thread Donni Khan
Hello again,

I ran SSVD on the textual data as follows.
1. Run ssvd:
bin/mahout ssvd -i  outputTV/tfidf/tfidf-vectors/part-r-0  -o svdOutput
-k 100   -us true -U false -V false   -t 1   -ow   -pca true
2. Run kmeans:
 bin/mahout kmeans -i svdOutput/USigma/  -c work/kmeans/kmeans-centroids
-cl -o work/kmeans/cluster -k 10 -ow -x 1000 -dm
org.apache.mahout.common.distance.CosineDistanceMeasure
3. Dumping:
bin/mahout clusterdump  -d outputTV/dictionary.file-0   -dt sequencefile -i
work/kmeans/cluster/clusters-1-final -n 20 -b 100 -o work/kmeans/cDump.txt
-p work/kmeans/cluster/clusteredPoints/
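
For reference, the USigma directory that step 2 reads is the output requested with -us true in step 1: the rows of U with each column scaled by the corresponding singular value. Below is a minimal stand-alone sketch of that product on plain arrays. It does not use Mahout's vector classes, and the class and method names are illustrative only.

```java
class USigmaSketch {
    /** Computes U * diag(sigma): column j of u is scaled by sigma[j]. */
    static double[][] timesSigma(double[][] u, double[] sigma) {
        double[][] out = new double[u.length][];
        for (int i = 0; i < u.length; i++) {
            if (u[i].length != sigma.length) {
                throw new IllegalArgumentException("row " + i + " has the wrong width");
            }
            out[i] = new double[sigma.length];
            for (int j = 0; j < sigma.length; j++) {
                out[i][j] = u[i][j] * sigma[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] u = {{1.0, 2.0}, {3.0, 4.0}}; // made-up values, one row per document
        double[] sigma = {10.0, 0.5};            // made-up singular values
        for (double[] row : timesSigma(u, sigma)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Each output row has k entries (100 in the ssvd command above), one per retained singular vector.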

Are the above steps correct?

I got bad results. In the clustering output, all the words start with the
letter "a*". Does anyone have an idea why?

Thanks in advance,
Donni

On Mon, Mar 30, 2015 at 11:07 PM, Ted Dunning  wrote:

> Lanczos may be more accurate than SSVD, but if you use a power step or
> three, this difference goes away as well.
>
> The best way to select k is actually to pick a value k_max larger than you
> expect to need and then pick random vectors instead of singular vectors.
> To evaluate how many singular vectors you really need, substitute more and
> more of the components of the random vectors with values from the singular
> vectors.  It is common that the best k_max will be 100-300 for text
> applications, but it is also common that the best k < k_max is much, much
> smaller.
>
> The reason that this is a better selection method is because a) random word
> vectors actually work pretty well because they maintain approximate
> independence of words and b) after k gets to a certain (pretty darned
> small) size, all the SVD is doing is acting as a very fancy and slow random
> number generator.
>
>
>
> On Mon, Mar 30, 2015 at 12:00 PM, Dmitriy Lyubimov 
> wrote:
>
> > I am not aware of _any_ scenario under which Lanczos would be faster (see
> > N. Halko's dissertation for comparisons), although admittedly I did not
> > study all possible cases.
> >
> > Having -k=100 is probably enough for anything. I would not recommend
> > running -q>0 for k>100, as the power iterations step would become quite
> > slow.
> >
> > As to your other questions, e.g. the U*sigma result output, see the
> > "overview and usage" link given here:
> > http://mahout.apache.org/users/dim-reduction/ssvd.html
> >
> > On Mon, Mar 30, 2015 at 2:19 AM, Donni Khan <prince.don...@googlemail.com>
> > wrote:
> >
> > > Hello Suneel,
> > > Thanks for the fast reply.
> > > Is SSVD like SVD? Which one is better?
> > > I ran SSVD on my data from Java code, but how do I compute U*Sigma? Can
> > > I do that with Mahout?
> > > Is there an optimal method to determine k?
> > >
> > > Another question: how do I relate the SSVD output to the
> > > word dictionary (the real words)?
> > >
> > > Thank you
> > > Donni
> > >
> > > On Mon, Mar 30, 2015 at 10:04 AM, Suneel Marthi <suneel.mar...@gmail.com>
> > > wrote:
> > >
> > > > Here are the steps if you are using Mahout-mrlegacy in the present Mahout
> > > > trunk:
> > > >
> > > > 1. Generate tfidf vectors from the input corpus using seq2sparse (I am
> > > > assuming you have done this before, hence I am skipping the details).
> > > >
> > > > 2. Run SSVD on the generated tfidf vectors from (1):
> > > >
> > > >   ./bin/mahout ssvd -i  -o  -k 80 -pca true -us true -U false -V false
> > > >
> > > >  k = no. of reduced basis vectors
> > > >
> > > > You will need the U*Sigma output of the PCA flow for the next
> > > > clustering step.
> > > >
> > > > 3. Run KMeans (or any other clustering algo) with the U*Sigma from (2) as
> > > > input.
> > > >
> > > >
> > > > On Mon, Mar 30, 2015 at 3:39 AM, Donni Khan <prince.don...@googlemail.com>
> > > > wrote:
> > > >
> > > > > Hello Mahout users,
> > > > >
> > > > > I'm working on text clustering, and I would like to reduce the features
> > > > > to enhance the clustering process.
> > > > > I would like to apply Singular Value Decomposition before the clustering
> > > > > process. I would be thankful to hear from anyone who has used this before.
> > > > > Is it a good idea for clustering?
> > > > > Is there any other method in Mahout to reduce the text features before
> > > > > clustering?
> > > > > Does anyone have an idea how I can apply SVD using Java code?
> > > > >
> > > > > Thanks in advance,
> > > > > Donni
> > > > >
> > > >
> > >
> >
>
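
The k-selection procedure Ted Dunning describes in the quoted message above (start from random vectors of width k_max, then replace more and more of their components with values from the singular vectors) can be sketched roughly as follows. This is only an illustration on plain arrays: the class and method names are hypothetical, Gaussian random vectors are an assumption (the message does not name a distribution), and the quality measure used to compare the different values of k is left to the reader.

```java
import java.util.Random;

class KSelectionSketch {
    /** Returns n random Gaussian vectors of width kMax (the k = 0 baseline). */
    static double[][] randomVectors(int n, int kMax, long seed) {
        Random rng = new Random(seed);
        double[][] out = new double[n][kMax];
        for (double[] row : out) {
            for (int j = 0; j < kMax; j++) {
                row[j] = rng.nextGaussian();
            }
        }
        return out;
    }

    /** Copies random, overwriting the first k components of each row with those of singular. */
    static double[][] mix(double[][] singular, double[][] random, int k) {
        double[][] out = new double[random.length][];
        for (int i = 0; i < random.length; i++) {
            out[i] = random[i].clone();
            System.arraycopy(singular[i], 0, out[i], 0, k);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] singular = {{0.9, 0.1, 0.0}, {0.2, 0.8, 0.1}}; // made-up values
        double[][] random = randomVectors(singular.length, 3, 42L);
        for (int k = 0; k <= 3; k++) {
            double[][] features = mix(singular, random, k);
            // Cluster on features here and score the result with your own measure.
            System.out.println("k=" + k + " first row: "
                    + java.util.Arrays.toString(features[0]));
        }
    }
}
```

The smallest k beyond which the score stops improving is the number of singular vectors actually needed; the remaining components behave like the random baseline.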