Re: Statistics Tool For Classification/Clustering

2002-02-27 Thread Mark Harrison

Good places to start:

Optimal feature extractors, that's better than PCA because you whiten your
inter class scatter and so put all inter class comparisons on the same
level. The good thing is this will also reduce your feature vector
dimensionality to c-1 (where c is # classes). PCA will not do this.
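
As a rough illustration of the reduction to c-1 dimensions, here is a minimal sketch of Fisher's linear discriminant for c = 2 classes and two features. Treating the "optimal feature extractor" as Fisher's discriminant is an assumption, since the post does not name a specific method, and the toy data and function names are made up for the example. The direction w = Sw^-1 (m1 - m0) rescales by the within-class scatter, so a high-variance feature that does not separate the classes gets little weight.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def within_scatter(classes):
    """Pooled within-class (intra-class) scatter matrix for 2-D features."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in classes:
        m = mean_vec(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class0, class1):
    """w = Sw^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix."""
    sw = within_scatter([class0, class1])
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    m0, m1 = mean_vec(class0), mean_vec(class1)
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(rows, w):
    """Reduce each 2-D vector to a single number (c - 1 = 1 dimension)."""
    return [r[0] * w[0] + r[1] * w[1] for r in rows]

# Hypothetical data: feature 0 has the largest variance (PCA would favour it),
# but only feature 1 separates the two classes.
class0 = [(-4.0, 0.0), (-2.0, 0.2), (0.0, -0.2), (2.0, 0.1), (4.0, -0.1)]
class1 = [(-4.0, 1.0), (-2.0, 1.2), (0.0, 0.8), (2.0, 1.1), (4.0, 0.9)]

w = fisher_direction(class0, class1)
p0 = project(class0, w)
p1 = project(class1, w)
print("direction:", w)
print("class 0 projections:", p0)
print("class 1 projections:", p1)
```

With more than two classes the same idea needs a generalised eigenvalue problem on the between-class and within-class scatter matrices, which is easier with a linear algebra library.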

Check the stats of each class: is it Gaussian or some other known pdf? If so,
apply a parametric classifier.

However, you are lucky if you get good classification after this, so you will
probably need non-linear, non-parametric classifiers. Try K nearest
neighbour, but that might take the age of the Universe, so use a condensing
algorithm first to get a smaller representative set.
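
The condensing step can be sketched as follows. This is a minimal version of Hart-style condensed nearest neighbour with a 1-NN rule, which is one possible reading of "a condensing algorithm"; the post does not say which variant was meant. The feature vectors and genre labels are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_label(store, point):
    """Label of the stored (vector, label) pair closest to point (1-NN)."""
    return min(store, key=lambda item: sq_dist(item[0], point))[1]

def condense(samples):
    """Keep only the samples needed for 1-NN on the stored subset to
    classify every training sample correctly."""
    store = [samples[0]]
    changed = True
    while changed:
        changed = False
        for vec, label in samples:
            # Skip samples already stored; add any sample the store gets wrong.
            if (vec, label) not in store and nearest_label(store, vec) != label:
                store.append((vec, label))
                changed = True
    return store

# Hypothetical training set: two well-separated genres.
samples = [
    ((0.0, 0.0), "rock"), ((0.1, 0.2), "rock"),
    ((0.2, 0.1), "rock"), ((0.3, 0.3), "rock"),
    ((5.0, 5.0), "classical"), ((5.1, 4.9), "classical"),
    ((4.8, 5.2), "classical"), ((5.3, 5.1), "classical"),
]

store = condense(samples)
print(len(samples), "samples condensed to", len(store))
print(nearest_label(store, (4.5, 4.5)))
```

Classification then searches only the condensed store, so the cost per query drops from the size of the full training set to the size of the store.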

Matlab is what I use for coding, there are a lot of free toolboxes around.
Mostly I write my own though.

Best wishes

Andrew


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Statistics Tool For Classification/Clustering

2002-02-27 Thread Mark Harrison

Correction (typo): it should read 'Whiten intra class scatter'.




Re: Statistics Tool For Classification/Clustering

2002-02-14 Thread Rishabh Gupta

Hi all,
I received numerous replies to my query. I can't thank everyone
individually, so I want to thank everyone who has replied. I am now looking
through the information and links that you have provided.
Many Thanks For All Your Help!!

Rishabh



Re: Statistics Tool For Classification/Clustering

2002-02-14 Thread Reg Edwards

Rishabh Gupta [EMAIL PROTECTED] wrote in message
news:a4eje9$ip8$[EMAIL PROTECTED]...
 Hi All,
 I'm a research student at the Department Of Electronics, University Of
 York, UK. I'm working a project related to music analysis and
 classification.
==
Pleased to see you have had many suggestions.

But I would have thought you are sitting right on top of all the books you
may need on the shelves in the university library.








Statistics Tool For Classification/Clustering

2002-02-13 Thread Rishabh Gupta

Hi All,
I'm a research student at the Department Of Electronics, University Of
York, UK. I'm working on a project related to music analysis and
classification. I am at the stage where I perform some analysis on music
files (currently only in MIDI format) and extract about 500 variables that
are related to music properties like pitch, rhythm, polyphony and volume. I
am performing basic analysis like mean and standard deviation but then I
also perform more elaborate analysis like measuring complexity of melody and
rhythm.

The aim is that the variables obtained can be used to perform a number of
different operations.
- The variables can be used to classify / categorise each piece of
music, on its own, in terms of some meta classifier (e.g. rock, pop,
classical).
- The variables can be used to perform comparison between two files. A
variable from one music file can be compared to the equivalent variable in
the other music file. By comparing all the variables in one file with the
equivalent variable in the other file, an overall similarity measurement can
be obtained.

The next stage is to test the ability of the variables obtained to
perform the classification / comparison. I need to identify variables that
are redundant (redundant in the sense of 'they do not provide any
information' and 'they provide the same information as the other variable')
so that they can be removed and I need to identify variables that are
distinguishing (provide the most information).

My Basic Questions Are:
- What are the best statistical techniques / methods that should be
applied here? E.g. I have looked at Principal Component Analysis; this would
be a good method to remove the redundant variables and hence reduce the
amount of data that needs to be processed. Can anyone suggest any other
sensible statistical analysis methods?
- What are the ideal tools / software to perform the clustering /
classification? I have access to SPSS software but I have never used it
before and am not really sure how to apply it or whether it is any good when
dealing with 100s of variables.

So far I have been analysing each variable on its own 'by eye' by plotting
the mean and sd for all music files. However this approach is not feasible
in the long term since I am dealing with such a large number of variables.
In addition, by looking at each variable on its own, I do not find clusters
/ patterns that are only visible through multivariate analysis. If anyone
can recommend a better approach I would greatly appreciate it.

Any help or suggestion that can be offered will be greatly appreciated.

Many Thanks!

Rishabh Gupta







Re: Statistics Tool For Classification/Clustering

2002-02-13 Thread Doug Hoy

In SPSS, Factor Analysis would help you reduce your many variables down to 
bigger, more general ones. As well, Cluster Analysis will let you see how 
your variables group themselves. The results might look like the following:

Factor 1: (percussiveness)
volume of drums
number of drum types
drum melodies...

Factor 2: (happiness)
minor modes
speed
pitch...

Factor 3: (memorableness)
melodic structure
folk music precursor


The cluster analysis would be similar, but would have the variables on a 
branching tree that showed that speed and pitch were closer than drum type 
and folk precursor, say. Would be interesting to see how this works.
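
Outside SPSS, the branching tree of variables can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses 1 - |correlation| as the distance between two variables and single-linkage merging, neither of which is specified in the post, and the four variables and their values are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cluster_variables(columns):
    """Single-linkage agglomerative clustering of variables.

    columns maps a variable name to its values over all music files.
    Returns the merges in the order they happen, closest pair first,
    as (members_a, members_b, distance) tuples.
    """
    clusters = [(name,) for name in columns]

    def distance(a, b):
        # Distance between clusters = smallest 1 - |r| over member pairs.
        return min(1 - abs(pearson(columns[p], columns[q]))
                   for p in a for q in b)

    merges = []
    while len(clusters) > 1:
        candidates = [
            (distance(a, b), a, b)
            for i, a in enumerate(clusters)
            for b in clusters[i + 1:]
        ]
        d, a, b = min(candidates, key=lambda c: c[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical measurements of four variables over six music files.
columns = {
    "speed": [1, 2, 3, 4, 5, 6],
    "pitch": [1.1, 2.3, 2.9, 4.2, 4.8, 6.1],
    "drum_types": [3, 1, 4, 1, 5, 2],
    "folk_precursor": [2, 7, 1, 8, 2, 8],
}

merges = cluster_variables(columns)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

Reading the merges from first to last gives the tree from the leaves upwards: variables joined early carry largely the same information, which is one way to spot redundant ones.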

I wonder if you could calculate some kind of fractal dimension for the 
music too?

Doug H





Re: Statistics Tool For Classification/Clustering

2002-02-13 Thread M Law

In sci.stat.math Rishabh Gupta [EMAIL PROTECTED] wrote:

[ snip ]

It seems that you are new to the field of pattern recognition.
In that case, you may want to check out the classic book
Pattern Classification by Duda, Hart and Stork.

There is a second edition that came out in 2001. It is a classic of the
field, and you may find other insights useful to your problem.

M Law





Re: Statistics Tool For Classification/Clustering

2002-02-13 Thread Art Kendall

Classification is a specialized field. Go to
http://www.pitt.edu/~csna/
and click on class-l.
Although this is the Classification Society of North America, members of the
British Classification Society also follow it.

SPSS should be able to handle what you want to do.  However, you need
face-to-face consulting/collaboration with someone who does this kind of
analysis.  Many of the techniques grew out of psychology, so if CLASS-L doesn't
help you might try your local psychology departments.




Re: Statistics Tool For Classification/Clustering

2002-02-13 Thread Jim Snow


A useful exposition of techniques for initial investigation of a
multivariate data set is given at

  http://www.sas.com/service/library/periodicals/obs/obswww22/

If you search the web for "Andrews plots" you will find more.

My inclination would be to start with an Andrews plot, possibly
using principal component scores for about 20 music files from several
genres. This will enable you to find linear combinations of variables which
best separate the genres. The technique and examples are set out in:

  Gnanadesikan: Multivariate Data Analysis, but this is an old
reference.
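
For readers who have not met the technique, an Andrews plot turns each data vector x = (x1, x2, x3, ...) into the curve f(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ... for t between -pi and pi, and draws one curve per observation. The sketch below only computes the curve values; the function names and the three-component vectors are hypothetical, and the plotting itself is left to whatever tool is at hand.

```python
import math

def andrews_curve(x, t):
    """Value at t of the Andrews curve for data vector x."""
    total = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2  # frequency: 1, 1, 2, 2, 3, 3, ...
        total += xi * (math.sin(k * t) if i % 2 == 1 else math.cos(k * t))
    return total

def trace(x, n_points=101):
    """Sample the curve on an even grid over [-pi, pi] for plotting."""
    ts = [-math.pi + 2 * math.pi * j / (n_points - 1) for j in range(n_points)]
    return [(t, andrews_curve(x, t)) for t in ts]

# Hypothetical principal component scores for two music files.
file_a = [1.0, 2.0, 3.0]
file_b = [1.2, 1.8, 3.1]

curve_a = trace(file_a)
curve_b = trace(file_b)
print(andrews_curve(file_a, 0.0))
```

Files from the same genre should give curves that stay close together over the whole range of t, while a value of t at which two groups of curves pull apart points to a linear combination of the variables that separates them.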

I hope this helps   Jim Snow







Re: Statistics Tool For Classification/Clustering

2002-02-13 Thread Richard Wright

Genres are presumably groups. So linear combinations of variables that
best separate the genres would be more effectively found by linear
canonical variates analysis (aka discriminant analysis).

Richard Wright


On Thu, 14 Feb 2002 03:18:48 GMT, Jim Snow [EMAIL PROTECTED]
wrote:


snipped
My inclination would be to start with an Andrews plot, possibly
using principal component scores for about 20 music files from several
genres. This will enable you to find linear combinations of variable which
best separate the genres. The technique and examples is set out in:
snipped






Re: Statistics Tool For Classification/Clustering

2002-02-13 Thread Jay Warner

You might consider a form of PLS - your measurements may be highly correlated,
and only a very few can do you any good.  You have a great many output vars,
and few enough inputs.
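
To give a feel for what PLS does with correlated measurements, here is a minimal sketch of the first component of PLS with a single response (often called PLS1). Coding the genre as a 0/1 response is an assumption made for the example, as are the three measurement columns. The weight of each variable is proportional to its covariance with the response, so correlated informative variables are combined into one score and an uninformative variable gets a weight near zero.

```python
import math

def centre(values):
    """Subtract the mean from a list of numbers."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def first_pls_component(columns, y):
    """Weights and scores of the first PLS1 component.

    columns is a list of variables (each a list over the music files),
    y is the response for the same files.
    """
    xc = [centre(c) for c in columns]
    yc = centre(y)
    # Weight vector proportional to X^T y, scaled to unit length.
    w = [sum(a * b for a, b in zip(c, yc)) for c in xc]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Score of each file = weighted sum of its centred measurements.
    scores = [sum(w[j] * xc[j][i] for j in range(len(xc)))
              for i in range(len(yc))]
    return w, scores

# Hypothetical data: col_a and col_b are highly correlated and informative,
# col_noise is unrelated to the genre.
col_a = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
col_b = [2.1, 2.3, 1.9, 6.1, 6.0, 5.8]
col_noise = [0.5, -0.4, 0.1, 0.2, -0.5, 0.1]
genre = [0, 0, 0, 1, 1, 1]

w, scores = first_pls_component([col_a, col_b, col_noise], genre)
print("weights:", w)
print("scores:", scores)
```

The weights depend on the units of each variable, so in practice the columns would usually be standardised first.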

Jay

--
Jay Warner
Principal Scientist
Warner Consulting, Inc.
 North Green Bay Road
Racine, WI 53404-1216
USA

Ph: (262) 634-9100
FAX: (262) 681-1133
email: [EMAIL PROTECTED]
web: http://www.a2q.com

The A2Q Method (tm) -- What do you want to improve today?









Re: Statistics Tool For Classification/Clustering

2002-02-13 Thread Jim Snow

Richard Wright [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...
 Genres are presumably groups. So linear combinations of variables that
 best separate the genres would be more effectively found by linear
 canonical variates analysis (aka discriminant analysis).

 Richard Wright


 On Thu, 14 Feb 2002 03:18:48 GMT, Jim Snow [EMAIL PROTECTED]
 wrote:


 snipped

 My inclination would be to start with an Andrews plot, possibly
 using principal component scores for about 20 music files from several
 genres. This will enable you to find linear combinations of variable
which
 best separate the genres. The technique and examples is set out in:
 snipped


Andrews plots and similar techniques do not replace discriminant
analysis, which, as Richard Wright said, finds linear combinations of
variables that best separate the groups. In the book by Gnanadesikan
which first popularised the technique, he examines the variables in the
discriminant space, i.e. a space defined by discriminant functions rather than
principal components or original variables.
The techniques are doing different things.
Andrews plots are to enable examination of the multidimensional data in a
two dimensional plot. Amongst other things, for example, several dimensions
of high difference between say jazz and pop or between jazz and flamenco may
be found, which are not necessarily orthogonal.
Andrews plots are a data reduction technique which is, in many
dimensions, analogous to examining a multidimensional cluster of points
from many viewpoints, so that no possible viewpoint is far from one of
those used. Thus virtually all possible discriminant functions are tried and
the interesting ones noted. In a spirit of exploratory data analysis, this
seems useful.
Rishabh Gupta wrote:
- The variables can be used to perform comparison between two files. A
variable from one music file can be compared to the equivalent variable in
the other music file. By comparing all the variables in one file with the
equivalent variable in the other file, an overall similarity measurement can
be obtained.

Andrews plots reveal the directions in which the two files differ.
Incidentally, the total area between the two traces on the plot is the
Euclidean distance, I think, if the original Andrews weightings are used.
Tukey suggested weightings which examine the multidimensional space more
closely but do not have such a simple interpretation of the difference
between traces. I have not used any of this for some time and I do not have
relevant books, but the material I referred to on the web should be helpful.
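
On the distance remark, a small refinement may help: with the original Andrews weightings, the standard result is that the integral over [-pi, pi] of the squared difference between two traces equals pi times the squared Euclidean distance between the two data vectors, rather than the plain area between the traces. The sketch below checks this numerically; the two five-component vectors are made up, and the midpoint rule is just one convenient way to approximate the integral.

```python
import math

def andrews_curve(x, t):
    """f(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ..."""
    total = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2
        total += xi * (math.sin(k * t) if i % 2 == 1 else math.cos(k * t))
    return total

def integrated_sq_diff(x, y, n=2000):
    """Midpoint-rule integral of (f_x(t) - f_y(t))^2 over [-pi, pi]."""
    h = 2 * math.pi / n
    total = 0.0
    for j in range(n):
        t = -math.pi + (j + 0.5) * h
        d = andrews_curve(x, t) - andrews_curve(y, t)
        total += d * d * h
    return total

def sq_euclidean(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Hypothetical variable values for two music files.
file_a = [1.0, 2.0, 3.0, 4.0, 5.0]
file_b = [2.0, 0.0, 3.0, 1.0, 7.0]

lhs = integrated_sq_diff(file_a, file_b)
rhs = math.pi * sq_euclidean(file_a, file_b)
print(lhs, rhs)
```

So two files whose traces stay close everywhere are also close in the ordinary Euclidean sense, which is what makes the plot usable as a visual similarity measure.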

Straightforward discriminant analysis will certainly find the best
linear discriminator in the least squares sense, but stepwise elimination of
variables in this process may result in discarding a variable with intuitive
appeal in favour of one or several highly correlated with it and the least
squares metric may possibly not be the best. For this and other reasons an
exploratory approach as Rishabh Gupta has begun seems appropriate.

   I still hope this helps   Jim Snow





