Re: (Promotion) The ultimate promotion program!! No more worries about promotion.

2002-02-27 Thread Jim Snow

O This mail is an [advertisement] sent in accordance with Article 50 of the
Act on Promotion of Information and Communications Network Utilization and
Information Protection, etc.
  [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...
  O This mail is an [advertisement] sent in accordance with Article 50 of the
Act on Promotion of Information and Communications Network Utilization and
Information Protection, etc.
  O The e-mail address was obtained from the Internet, and we hold no personal
information other than the address.
  If you wish to refuse further mail, please opt out below. We sincerely
apologise to anyone who does not want this information.

*** Were you worried about promotion? Stop worrying now. ***
Everything about promotion, and all the know-how, is right here.
Ask us anything:  mailto:[EMAIL PROTECTED]

>>> As we are switching to a promotion agency business, we are selling off
the promotion programs we have collected over the past three years at a
bargain price. <<<

*** Beginner promotion package ***

- 2 e-mail extractors
- 1 e-mail editor
- 2 e-mail senders (1 full version, 1 demo)
- e-mail list of 500,000 addresses
- 1 bulletin-board registration tool
- bulletin-board DB of 2,000 boards

=> All of the above for 100,000 won.

*** Intermediate promotion package ***

- 3 e-mail extractors
- 1 e-mail editor
- 3 e-mail senders (2 full versions, 1 demo)
- e-mail list of 1,000,000 addresses
- 1 bulletin-board registration tool
- bulletin-board DB of 5,000 boards

=> All of the above for 200,000 won.

*** Advanced promotion package (1) ***

We install an e-mail extractor and sender directly on your personal homepage.

- e-mail extraction
- duplicate e-mail removal
- automatic handling of opt-outs
- e-mail sending
- provisional sending to opt-outs
- moving provisional refusals to the opt-out list

=> Where it can be installed: your homepage must have a MySQL account.
If you have no paid hosting, 200 MB with one year of hosting costs 4.4000 won
extra.

=> We carry out the above installation for 200,000 won.

*** Advanced promotion package (2) ***

We lease a server loaded with an e-mail list of 10,000,000 addresses to only a
few people (term: 1 year = price: 1,000,000 won). We pass on the promotion
programs and all of our know-how.

*** E-mail advertising agency ***

Using the promotion know-how built up so far, we have spent two years
completing our sending facilities, have assembled 60,000,000 e-mail addresses,
and will run your e-mail promotion on your behalf.


Re: Statistics Tool For Classification/Clustering

2002-02-13 Thread Jim Snow


Rishabh Gupta [EMAIL PROTECTED] wrote in message
news:a4eje9$ip8$[EMAIL PROTECTED]...
 Hi All,
 I'm a research student at the Department Of Electronics, University Of
 York, UK. I'm working on a project related to music analysis and
 classification. I am at the stage where I perform some analysis on music
 files (currently only in MIDI format) and extract about 500 variables that
 are related to music properties like pitch, rhythm, polyphony and volume.
I
 am performing basic analysis like mean and standard deviation but then I
 also perform more elaborate analysis like measuring complexity of melody
and
 rhythm.

 The aim is that the variables obtained can be used to perform a number of
 different operations.
 - The variables can be used to classify / categorise each piece of
 music, on its own, in terms of some meta classifier (e.g. rock, pop,
 classical).
 - The variables can be used to perform comparison between two files. A
 variable from one music file can be compared to the equivalent variable in
 the other music file. By comparing all the variables in one file with the
 equivalent variable in the other file, an overall similarity measurement
can
 be obtained.

 The next stage is to test the ability of the variables obtained to
 perform the classification / comparison. I need to identify variables that
 are redundant (redundant in the sense of 'they do not provide any
 information' and 'they provide the same information as the other
variable')
 so that they can be removed and I need to identify variables that are
 distinguishing (provide the most amount of information).

 My Basic Questions Are:
 - What are the best statistical techniques / methods that should be
 applied here. E.g. I have looked at Principal Component Analysis; this
would
 be a good method to remove the redundant variables and hence reduce some of the
 amount of data that needs to be processed. Can anyone suggest any other
 sensible statistical analysis methods?
 - What are the ideal tools / software to perform the clustering /
 classification. I have access to SPSS software but I have never used it
 before and am not really sure how to apply it or whether it is any good
when
 dealing with 100s of variables.

 So far I have been analysing each variable on its own 'by eye' by plotting
 the mean and sd for all music files. However this approach is not feasible
 in the long term since I am dealing with such a large number of variables.
 In addition, by looking at each variable on its own, I do not find
clusters
 / patterns that are only visible through multivariate analysis. If anyone
 can recommend a better approach I would greatly appreciate it.

 Any help or suggestion that can be offered will be greatly appreciated.



 A useful exposition of techniques for initial investigation of a
multivariate data set is given at

  http://www.sas.com/service/library/periodicals/obs/obswww22/

 If you point your browser at "Andrews plots" you will find more.

My inclination would be to start with an Andrews plot, possibly
using principal component scores for about 20 music files from several
genres. This will enable you to find linear combinations of variables which
best separate the genres. The technique and examples are set out in:

  Gnanadesikan: Multivariate Data Analysis, but this is an old
reference.
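As a minimal illustration of that starting point in Python (the feature
matrix and genre labels below are synthetic placeholders standing in for the
real extracted variables):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import andrews_curves
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 12))               # placeholder for the real feature matrix
    genres = ["rock", "pop", "classical"] * 20  # placeholder genre labels

    # Andrews curves drawn on the first few principal component scores
    scores = PCA(n_components=6).fit_transform(StandardScaler().fit_transform(X))
    frame = pd.DataFrame(scores, columns=[f"pc{i+1}" for i in range(6)])
    frame["genre"] = genres

    andrews_curves(frame, "genre")              # one curve per file, coloured by genre
    plt.show()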

I hope this helps   Jim Snow




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Statistics Tool For Classification/Clustering

2002-02-13 Thread Jim Snow

Richard Wright [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...
 Genres are presumably groups. So linear combinations of variables that
 best separate the genres would be more effectively found by linear
 canonical variates analysis (aka discriminant analysis).

 Richard Wright


 On Thu, 14 Feb 2002 03:18:48 GMT, Jim Snow [EMAIL PROTECTED]
 wrote:


 snipped

 My inclination would be to start with an Andrews plot, possibly
 using principal component scores for about 20 music files from several
 genres. This will enable you to find linear combinations of variables which
 best separate the genres. The technique and examples are set out in:
 snipped


 Andrews plots and similar techniques do not replace discriminant
analysis, which, as Richard Wright said, finds linear combinations of
variables that best separate the groups. In the book by Gnanadesikan
which first popularised the technique, he examines the variables in the
discriminant space, i.e. a space defined by discriminant functions rather than
by principal components or the original variables.
The two techniques are doing different things.
 Andrews plots enable examination of the multidimensional data in a
two-dimensional plot. Amongst other things, several directions of large
difference between, say, jazz and pop, or between jazz and flamenco, may
be found, and these are not necessarily orthogonal.
Andrews plots are a data reduction technique which is, in many
dimensions, analogous to examining a multidimensional cluster of points
from many viewpoints, so that no possible viewpoint is far from one of
those used. Thus virtually all possible discriminant functions are tried and
the interesting ones noted. In a spirit of exploratory data analysis, this
seems useful.
Rishabh Gupta wrote:
- The variables can be used to perform comparison between two files. A
variable from one music file can be compared to the equivalent variable in
the other music file. By comparing all the variables in one file with the
equivalent variable in the other file, an overall similarity measurement can
be obtained.

Andrews plots reveal the directions in which the two files differ.
Incidentally, the total area between the two traces on the plot is the
Euclidean distance, I think, if the original Andrews weightings are used.
Tukey suggested weightings which examine the multidimensional space more
closely but do not have such a simple interpretation of the difference
between traces. I have not used any of this for some time and I do not have
relevant books, but the material I referred to on the web should be helpful.

Straightforward discriminant analysis will certainly find the best
linear discriminator in the least squares sense, but stepwise elimination of
variables in this process may result in discarding a variable with intuitive
appeal in favour of one or several highly correlated with it, and the least
squares metric may not be the best. For this and other reasons an
exploratory approach such as Rishabh Gupta has begun seems appropriate.
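For the canonical variates route Richard Wright suggested, a minimal sketch in
Python is below; the feature matrix and genre labels are synthetic placeholders
and at least three genres are assumed:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 12))               # placeholder for the real feature matrix
    genres = ["rock", "pop", "classical"] * 20  # placeholder genre labels

    lda = LinearDiscriminantAnalysis(n_components=2)
    Z = lda.fit_transform(X, genres)            # canonical variate (discriminant) scores

    for g in sorted(set(genres)):
        mask = np.array([label == g for label in genres])
        plt.scatter(Z[mask, 0], Z[mask, 1], label=g)
    plt.xlabel("canonical variate 1")
    plt.ylabel("canonical variate 2")
    plt.legend()
    plt.show()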

   I still hope this helps   Jim Snow






=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: »¹ÊǺÍÒÔÇ°Ò»ÑùÂ𣿠£º£©

2002-02-04 Thread Jim Snow


[EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...
 Outstanding MLM sites chosen by online vote in 2001 -- please browse them
 patiently:

 http://www.kelvin.13800.com
 New Profit business information network
 http://www.kelvin.uurr.com
 Everybody Earns business information network
 http://www.kelvin.17951.com
 Jinyingtong business information network
 http://www.kelvin.9982.com
 New Thinking business information network
 http://www.kelvin.wapmark.com
 http://kelvinfan.126.com
 999 business information network

 http://www.kelvin.uwuw.com
 Mintmail network (no payment needed outside mainland China; a backup choice)

 http://www.kelvinfan.13800.com
 Entrepreneurship forum

 All of these are excellent MLM sites, money-making websites run by Chinese
 people themselves, your very own business plan!
 The interface is entirely in Chinese and understandable at a glance.
 Haha, the introduction may be a bit over the top, but you really can make
 money from them; absolutely no cheating!

All you need is a bank credit-card account -- don't have one? Apply for one
online, completely free (China Construction Bank recommended).

Then all you have to do is make a few phone calls each week to check the bank
account; haha, just sit back and wait for the money to arrive!!!
 Payment is made entirely by bank transfer, removing all your worries.
 Go and take a look, learn how they operate; you may well be tempted.

 You should know:
 Of the 500,000 millionaires in the United States, 20% made it in the past few
 years through this kind of multi-level information network marketing,
 MULTI-LEVEL MARKETING (MLM).
 In addition, statistics show that on average 45 people in the US become
 millionaires through MLM every day.

 Now MLM has come to China!!!
 These sites use just such a simple, brand-new, legal and entertaining method
 to let every member earn a large fortune without leaving home.

 *** After visiting, please read all the content carefully before choosing to
 register. ***

 Once you register as a full member, you will own your own money-making
 website.
 Through it you can dig up your first pot of gold from the Internet!
 10,000? 100,000? Or 1,000,000? Haha, who knows? Perhaps even more!
 How great -- no more putting up with foreigners; foreign sites have such poor
 credit that you never get paid.

 -- Your dream will come true. :P

(PS: once you become a full member, I will send you, free of charge, e-mail
addresses from all over the country numbering in the hundreds of thousands, to
make your publicity easier.)

 If this letter has disturbed you, please delete it mercilessly!


 =
 Instructions for joining and leaving this list, remarks about the
 problem of INAPPROPRIATE MESSAGES, and archives are available at
   http://jse.stat.ncsu.edu/
 =




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: tricky explanation problem with chi-square on multinomial experiment (dice)

2002-01-25 Thread Jim Snow


Gottfried Helms [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...
 Hi ,

  there was a tricky problem, recently, with the chi-square-density
  of higher df's.
  I discussed that in sci.stat.consult and in a German newsgroup,
  got some answers and also think to have understood the real point.

  But I would like to have a smoother explanation, as I have to
  deal with it in my seminars. Maybe someone out has an idea or
  a better shortcut, how to describe it.
  To illustrate this I just copy-paste an exchange from s.s.consult;
  hope you forgive my laziness. On the other hand: maybe the
  true point comes out better this way.

 Regards
 Gottfried


 3 postings added:
 ---(1/3)---
 [Gottfried]
  Hi -
 
 I'm stumbling in the dark... possibly just missing a
 simple hint.
 I'm trying to explain the concept of significance of the
 deviation of an empirical sample from a given, expected
 distribution.
 If we discuss the chi-square-distribution
   |
   |*
   | *
   | *
   |  *
   |   *
   | *
   |  *
   |*
   +-
 
 then this graph illustrates us very well, that and how a
 small deviation is more likely to happen than a high deviation -
 thus backing the concept of the 95%tiles etc. in the beginners
 literature.
 Just cutting it in equal slices this curve gives us expected
 frequencies of occurrences of samples with individual chi-squared
 deviations from the expected occurrences.
 
 If I have more df's, then the curve changes its shape; in this
 case a 5 df-curve for samples of thrown dice, where I count
 the frequencies of occurrences of each number and the deviation
 of these frequencies from the uniformity.
 
   |
   |
   |
   |
   |*
   |  **
   |**
   | **
   |  * *
   +-
   0X²(df=5)
 
   Now the slices with the highest frequency of occurrences
   are not the ones with the smallest deviation from the
   expected distribution (X²=0) - and even if I accept, that this
   is at least so for the cumulative distribution, it is
   suddenly no more self-explaining. It is congruent with
   the reality, but our common language is different:
   the most likely chisquare-deviation from the uniformity
   is now an area which is not at the zero-mark.
   So, now: do we EXPECT a deviation from uniformity?
   That the count of frequencies of the occurrences of the
   6 dice numbers is NOT most likely uniform? HÄH?
   Is this suddenly the null hypothesis?  And do we calculate
   the deviation of our empirical sample then from this new
   null hypothesis???
 
   I never thought about that in this way, but since I do
   now, I feel a bit confused, maybe I only have to step
   aside a bit?
   Any good hint appreciated -
 
  Gottfried.
 
 --

 ---(2/3)---
 Then one participant answered:

  Actually, that corresponds to the notion that if a random sequence is
  *too* uniform, it isn't really random.  For example, if you were to toss
a
  coin 1000 times, you'd be a little surprised if you got *exactly* 500
  heads and 500 tails.  If you think in terms of taking samples from a
  multinomial population, the non-monotonicity of the chi-square density
  means that a *small* amount of sampling error is more probable than *no*
  sampling error, as well as more probable than a *large* sampling error,
  which I think corresponds pretty well to our intuition.
 

 ---
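 A quick simulation can make that answer concrete. The sketch below (Python,
 with an arbitrary choice of 60 throws of a fair die per experiment) computes
 the chi-squared statistic against uniform expected counts many times; the most
 frequently occupied bins sit near df - 2 = 3 rather than at zero, matching the
 shape of the 5-df curve drawn above:

    import numpy as np

    rng = np.random.default_rng(0)
    n_throws, n_experiments = 60, 100_000
    expected = n_throws / 6.0

    stats = []
    for _ in range(n_experiments):
        counts = np.bincount(rng.integers(0, 6, n_throws), minlength=6)
        stats.append(np.sum((counts - expected) ** 2 / expected))
    stats = np.array(stats)

    hist, edges = np.histogram(stats, bins=np.arange(0, 16, 1.0))
    print(hist)            # the most populated bins are around 2-4, not 0-1
    print(stats.mean())    # close to 5, the number of degrees of freedom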

 --(3/3)-
 I was not really satisfied with this and answered, after I had
 got some more insight:

 [Gottfried]
[] wrote:
   Actually, that corresponds to the notion that if a random sequence
is
   *too* uniform, it isn't really random.  For example, if you were to
toss a
   coin 1000 times, you'd be a little surprised if you got *exactly* 500
   heads and 500 tails.  If you think in terms of taking samples from a
 
 
  Yes, this is true. But it is the same with each other combination.
  No one is more likely to occur (or better: one should say: variation?).
  But then, a student would ask, how could you still call a near-expected
  variation more likely than a far-from-expected variation in general?
 
  The reason is that we don't argue about a specific variation,
  but about properties of a variation, or in this case, of a combination.
  We commonly select the property of having a distance from the
  expected variation, measured in terms 

Re: How To Code Position In A List

2002-01-20 Thread Jim Snow


Ronny Richardson [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...
 I want to analyze data sets where I have two variables, the finishing
 position (ordinal) and a ratio scale performance variable. I want to see
if
 there is a relationship between the finishing position and the value of
the
 ratio performance variable. This would be about the same as seeing if
there
 was a relationship between the order in which an exam was completed and
the
 resulting score.

 My problem is that I have multiple groups and so being #15 in one group
 might put you in the back of the group while being #20 in another group
 might put you near the front.

 Is there a way to recode the position variable to make it meaningful in
 this situation? I have considered percentiles (more specifically deciles)
 but I am not sure this is the best way to go. Any suggestions?


 Ronny Richardson

   Using percentiles is a sensible way to go. Alternatively, replacing ranks
with normal scores would be sensible if you know that the ratio scale
variable is normally distributed with a constant variance. Then Pearson's
product moment correlation coefficient can be calculated across the whole
data set.
However, if there is any doubt about these assumptions, there is a
better way. If Spearman's rank correlation coefficient is calculated for
each group separately, the weighted average of these will be a good
descriptive statistic for the whole set. To obtain a significance test,
calculate the p value of the correlation coefficient for each group
separately (p(1), p(2), ..., p(n)). Then, under the null hypothesis of no
association in any group (this is Fisher's method for combining p values),

  -2*Sum from i=1 to n of log(p(i))

 is an observation on a chi-squared random variable with 2*n degrees of
freedom.
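A minimal sketch of that combination in Python; the groups dictionary below is
synthetic placeholder data standing in for the real (position, score) pairs:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    groups = {g: (np.arange(1, 21), rng.normal(size=20)) for g in ["A", "B", "C"]}

    pvals, rhos, sizes = [], [], []
    for positions, scores in groups.values():
        rho, p = stats.spearmanr(positions, scores)
        rhos.append(rho); pvals.append(p); sizes.append(len(positions))

    weighted_rho = np.average(rhos, weights=sizes)    # descriptive summary for the whole set
    fisher_stat = -2 * np.sum(np.log(pvals))          # -2 * sum(log p(i))
    p_combined = stats.chi2.sf(fisher_stat, df=2 * len(pvals))
    print(weighted_rho, fisher_stat, p_combined)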
  Hope this helps  Jim Snow






=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Interpreting multiple regression Beta is only way?

2002-01-18 Thread Jim Snow


Wuzzy [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...
 Rich Ulrich [EMAIL PROTECTED] wrote in message

 Thanks Rich, most informative, I am trying to determine a method of
 comparing apples to oranges - it seems an important thing to try to
 do, perhaps it is impossible.

 I am trying to
 determine which is better, glycemic index or carbohydrate total in
 predicting glycemic load (Glycemic load=glycemic index*carbohydrate).

 my results as a matrix:

           GI load   GI      Carb
  GI load   1.000
  GI         .533    1.000
  Carb       .858     .124   1.000

 So it seems that carb affects GI load more than GI does... but this is
 on ALL foods... (nobody eats ALL foods so cannot extrapolate to the human
 diet) but I don't think you're allowed to do this kind of comparison
 as Carb and GI are totally different values:

 I suspected that you would be allowed to make the comparisons if you
 use Betas, i.e. measure how many standard deviation
 changes of GI and Carb it requires...  If it takes a bigger standard
 deviation of Carb then you could say that it is more likely that carb
 has a bigger effect on glycemic load.

 you seem to suggest that even using standard deviation changes, you
 cannot compare apples to oranges.  Which sounds right but is
 disappointing...

The glycaemic index is calculated as the area under the blood
glucose curve for the two hours (or 3 hours for diabetics) after ingesting
enough of a food to include 50 grams of carbohydrate, divided by the same
area after ingesting 50 grams of pure glucose, expressed as a percentage.
In some cases a reference food other than glucose is used.

If the area under the curve is the glycaemic load you are studying, I
would expect the model
 Glycemic load=glycemic index*carbohydrate
to fit the data very well when the carbohydrate content is near 50 gm,
provided all the glycaemic indices have been calculated on the same basis.
Using correlations or beta coefficients as you are doing is appropriate
when linear relationships are involved, but not to test for goodness of fit
to this model.
What would be of interest would be a plot of the difference between
the predicted glycaemic load and the observed value, against carbohydrate,
especially for carbohydrate values far from 50 gm. If I have a meal consisting
mainly of eggs or meat, the total carbohydrate content is very low, so the
glycaemic load calculated from the formula may be wrong.
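A minimal sketch of that residual plot in Python; the foods table below holds
made-up placeholder values, and the prediction uses the model exactly as
written above:

    import pandas as pd
    import matplotlib.pyplot as plt

    foods = pd.DataFrame({                       # synthetic placeholder values
        "gi":            [100, 70, 55, 40, 30],
        "carb":          [50, 30, 60, 15, 5],
        "observed_load": [5000, 2000, 3500, 550, 140],
    })

    predicted = foods["gi"] * foods["carb"]      # the model under discussion
    residual = foods["observed_load"] - predicted

    plt.scatter(foods["carb"], residual)
    plt.axvline(50, linestyle="--")              # the 50 g reference point
    plt.axhline(0)
    plt.xlabel("carbohydrate (g)")
    plt.ylabel("observed - predicted glycaemic load")
    plt.show()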
One difficulty with the whole Glycaemic Index approach is that there
is not, as far as I know, any way of calculating the glycaemic load from
foods like cheese, eggs and meat. If the body needs glucose, it will be made
from fat and protein foods.
It is not surprising that it could be hard to persuade volunteers to
ingest 8500 grams of processed cheese, containing 50 gm of carbohydrate, in
order to determine its glycaemic index  :-)
 I would like to see another index constructed giving the glycaemic load
produced by 100 gm of each food, rather than the load produced by that
amount of food which contains 50 gm of carbohydrate.





=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: How to perform Runs Test??

2001-12-23 Thread Jim Snow


Glen [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...
 [EMAIL PROTECTED] (Chia C Chong) wrote in message
news:[EMAIL PROTECTED]...
  I am using nonlinear regression method to find the best parameters for
  my data. I came across a term called runs test from the Internet. It
  mentioned that this is to determines whether my data is differ
  significantly from the equation model I select for the nonlinear
  regression. Can someone please let me know how should I perform the
  run tests??

 You need to use a runs test that's adjusted for the dependence in the
 residuals. The usual runs test in the texts won't apply.

 Glen

I always understood that the runs test was designed to detect systematic
departures from the fitted line because some other curve fitted the data
better. In this context, it is a test for dependence of residuals.

There is a discussion of this at
 http://216.46.227.18/curvefit/systematic_deviation.htm

Any elementary text on non-parametric methods in statistics will
give an example.
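For reference, here is a minimal sketch of the usual (Wald-Wolfowitz) runs
test on the signs of the residuals, written in Python; the residuals at the
end are made-up placeholder values, and Glen's caveat about dependent
residuals still applies:

    import numpy as np
    from scipy.stats import norm

    def runs_test(residuals):
        signs = np.asarray(residuals) > 0        # zeros counted as negative here
        n1, n2 = signs.sum(), (~signs).sum()
        runs = 1 + np.count_nonzero(signs[1:] != signs[:-1])
        mean = 2.0 * n1 * n2 / (n1 + n2) + 1
        var = (2.0 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
               / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
        z = (runs - mean) / np.sqrt(var)
        return runs, z, 2 * norm.sf(abs(z))      # runs, z statistic, two-sided p value

    # residuals taken in the order of the fitted x values (placeholder numbers)
    residuals = np.array([0.5, 0.3, -0.2, -0.4, 0.1, 0.6, -0.3, -0.1, 0.2, -0.5])
    print(runs_test(residuals))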

Hope this helps  Jim Snow
[EMAIL PROTECTED]





=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: How to prove absence of dependence between arguments of function?

2001-12-23 Thread Jim Snow


- Original Message -
From: Estimator [EMAIL PROTECTED]
Newsgroups: sci.stat.edu
Sent: Saturday, 22 December 2001 2:47
Subject: How to prove absence of dependence between arguments of
function?


 I've got a linear function Y=f(x1;x2;x3;x4) of a theoretical
 distribution; in addition x3=g1(x1), x4=g2(x1;x2). Also I've got an
 empirical sample consisting of N sets of these values (magnitudes)
 Y1 x11 x12 x13 x14
 ..
 Yn xn1 xn2 xn3 xn4
 and since x3, x4 are dependent on x1, x2 it's reasonable to
 evaluate x3, x4 from x1, x2 accordingly and analyse Y only from x1,
 x2.

If x3 = g1(x1), then x3 and x1 can only be independent if the
function g1 is a constant, or at least degenerate with respect to x1, and
similarly for Y.

  But I've got a strong belief that in fact all of the arguments
  are independent or the dependence is insignificant. How can I prove
  this mathematically using empirical observations?

You could plot x1 against x3 to convince yourself that there
was no tendency for the points to depart from a random scatter.
Similarly plot x1 vs x4  and x2 vs x4. But this would not give you
objective grounds to include or reject x3 and x4 from the
analysis.

In fact, even if x3 and x4 are uncorrelated with x1 and x2,
your best course would be to retain them in the analysis. Then
you can formally test to see whether they contribute any
explanatory power wrt Y.

If the explanatory power of the model is not significantly
improved by including x3 and x4, you have objective evidence to
exclude them from the model.
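A minimal sketch of that formal check, done as an extra-sum-of-squares F test
in Python; the data below are simulated so the sketch is self-contained, with
x3 and x4 constructed from x1 and x2 as in the question:

    import numpy as np
    from scipy import stats

    def rss(X, y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ beta) ** 2)

    rng = np.random.default_rng(1)
    x1, x2 = rng.normal(size=(2, 200))
    x3 = x1 ** 2                                     # x3 = g1(x1)
    x4 = x1 + x2 + 0.1 * rng.normal(size=200)        # x4 = g2(x1, x2)
    y = 1.0 + 2 * x1 - x2 + rng.normal(size=200)     # y depends on x1, x2 only here

    X_red  = np.column_stack([np.ones(200), x1, x2])
    X_full = np.column_stack([np.ones(200), x1, x2, x3, x4])

    df_num = X_full.shape[1] - X_red.shape[1]
    df_den = len(y) - X_full.shape[1]
    F = ((rss(X_red, y) - rss(X_full, y)) / df_num) / (rss(X_full, y) / df_den)
    print(F, stats.f.sf(F, df_num, df_den))          # F statistic and p value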

  Is there any sense in making a 4x4 correlation matrix (Pearson) and
  proving insignificance of the correlation coefficients (Student's
  t-criterion, for example) between the arguments?

No. One reason is that you would have to conduct six
significance tests and the chance of the 1 in 20 level of the test
being exceeded by chance in one of them is too high. In any event
correlation is only a measure of LINEAR dependence and frequently
data has more complicated dependencies. There is no way to prove
independence from a data set, in the same sense that scientific
theories are never proved, only disproved. In particular, failure
to reject a null hypothesis is not proof of its correctness.

Dependence does not consist of only linear relationships and
lack of correlation does not imply independence. For example if X
is symmetrically distributed on (-1,1) and Y = X^2, then X and Y
are uncorrelated although functionally related.
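A quick simulated check of that example in Python:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 100_000)
    y = x ** 2                               # fully determined by x
    print(np.corrcoef(x, y)[0, 1])           # close to 0: uncorrelated yet dependent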

Hope this helps   Jim Snow
 [EMAIL PROTECTED]




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Standardizing evaluation scores

2001-12-19 Thread Jim Snow


Glen Barnett [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...
 Stan Brown wrote:
  But is it worth it? Don't the easy graders and :tough graders
  pretty much cancel each other out anyway?

 Not if some students only get hard graders and some only get easy
 graders.

 If all students got all graders an equal amount of time it probably
 won't matter at all.

 Glen

If some graders use the whole scale and others only use part of the
scale or concentrate grades near the centre, then using raw scores means you
are giving the full scale graders more weight in the overall ranking of
students. If this is undesirable, grades could be scaled to a common mean
and equal mean deviation.  (Standard deviation would give increased weight
to extremes of the scale.)
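A minimal sketch of such a rescaling in Python; the marks table below is
made-up placeholder data, and each grader's scores are shifted and stretched
to a common mean and mean absolute deviation:

    import pandas as pd

    marks = pd.DataFrame({                         # synthetic placeholder data
        "grader": ["A", "A", "A", "B", "B", "B"],
        "score":  [55, 65, 75, 68, 70, 72],        # grader B compresses the scale
    })

    target_mean = marks["score"].mean()
    target_mad = (marks["score"] - target_mean).abs().mean()

    def rescale(s):
        mad = (s - s.mean()).abs().mean()
        return (s - s.mean()) / mad * target_mad + target_mean

    marks["adjusted"] = marks.groupby("grader")["score"].transform(rescale)
    print(marks)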
In all these adjustments, we lose transparency of the process and this must
be weighed against the gains. I suspect that only sharp contrasts between
the behaviour of the graders and/or different students having different sets
of graders would justify this, and it may well be better dealt with by
instructing the graders appropriately after pointing out that it is
desirable for all graders to have equal weight in assessment.




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



test-ignore

2001-12-10 Thread Jim Snow

test please ignore




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: When to Use t and When to Use z Revisited

2001-12-09 Thread Jim Snow

Ronny Richardson [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...

 A few weeks ago, I posted a message about when to use t and when to use z.

I did not see the earlier postings, so forgive me if I repeat advice already
given. :-)

1. The consequences of using the t distribution instead of the normal
distribution for sample sizes greater than 30 are of no importance in
practice. The difference in the numbers given as confidence limits is so
small that no sensible person would change their course of action based on
that minuscule variation. In the case of a significance test a result just
over or just under, say, the 5% level should always be examined in the
knowledge that the 5% is an arbitrary level and that a level of 4.9%  or
5.1%  could equally well have been chosen.

2. There is no good reason for statistical tables for use in practical
analysis of data to give figures for t on numbers of degrees of freedom over
30 except that it makes it simple to routinely use one set of tables when
the variance is estimated from the sample.
Another reason that books of tables do not include t values for degrees of
freedom between 30, 60, sometimes 120, and infinity is that there is no
need, even for the extreme tails of the distribution and when, for whatever
reason, high accuracy is required, because the intermediate values can be
obtained by harmonic interpolation. That is, the tail entries in the
distribution can be obtained by linear interpolation on 1/n.
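A small check of that interpolation in Python (the 40-degrees-of-freedom case
is just an illustrative choice):

    from scipy import stats

    t30 = stats.t.ppf(0.975, 30)                 # ~2.0423
    t60 = stats.t.ppf(0.975, 60)                 # ~2.0003
    w = (1/30 - 1/40) / (1/30 - 1/60)            # position of 1/40 between 1/30 and 1/60
    t40_interp = t30 + w * (t60 - t30)           # harmonic (1/n) interpolation
    print(t40_interp, stats.t.ppf(0.975, 40))    # both about 2.021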

3. There are situations where the error variance is known. They
generally arise when the errors in the data arise from the use of a
measuring instrument with known accuracy or when the figures available are
known to be truncated to a certain number of decimal places. For example:
Several drivers use cars in a car pool. The distance travelled on each
trip by a driver is recorded, based on the odometer reading. Each
observation has an error which is uniformly distributed on (0, 0.2). The
variance of this error is (0.2)^2/12 = 0.0033 and the standard deviation is
0.0577. To calculate confidence limits for the average distance travelled
by each driver, the z statistic should be used.
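A minimal sketch of that calculation in Python, with made-up trip readings for
one driver:

    import numpy as np
    from scipy.stats import norm

    trips = np.array([12.3, 8.7, 15.2, 9.9, 11.4, 14.8])   # placeholder recorded distances
    sigma = 0.2 / np.sqrt(12)                               # known rounding-error s.d., ~0.0577
    half_width = norm.ppf(0.975) * sigma / np.sqrt(len(trips))
    print(trips.mean() - half_width, trips.mean() + half_width)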

A similar situation could arise in dealing with data in which the error
arises from the rounding of all numbers to the nearest thousand.

   This is an uncommon situation in a business context, but it arises
quite often in scientific work where the inherent accuracy of a measuring
instrument may be known from long experience and need not be estimated from
the small sample currently being examined.

4. You seem to think the Central Limit Theorem is behind the validity of
t vs z tables. This is not so. The CLT only bears on the Normal shape and
the relation of the variance of an average or sum to the population
variance.

Commenting specifically on points in your posting:

Ronny Richardson [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...

 A few weeks ago, I posted a message about when to use t and when to use z.
(snip)
 So, I conclude 1) we use z when we know the sigma and either the data is
 normally distributed or the sample size is greater than 30

   Yes, but the difference if you use t is tiny and of no importance.

so we can use the central limit theorem.

No. The CLT is not the reason. The CLT ensures that the average and
sum are Normally distributed for large enough n. Unless the data is very
skewed or bimodal, n=5 is usually large enough in practice. This is a
separate issue to the choice of Normal or t distribution for inference.

 2) When n<30 and the data is normally distributed, we use t.

 3) When n is greater than 30 and we do not know sigma, we must estimate
 sigma using s so we really should be using t rather than z.

but the difference in the resulting numbers is minuscule and of no
importance.

 Now, every single business statistics book I have examined, including the
 four referenced below, uses z values when performing hypothesis testing or
 computing confidence intervals when n>30.

 Are they

 1. Wrong
 2. Just oversimplifying it without telling the reader

 or am I overlooking something?

 Ronny Richardson

I hope that helps
Jim Snow




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Gen random numbers from distribution

2001-12-05 Thread Jim Snow

1. George Marsaglia and Wai Wan Tsang published a paper dealing with
your problem which gives an efficient procedure for all values of the
parameters. It is

The Monty Python Method for Generating Gamma Variables

in the Journal of Statistical Software, vol. 3, issue 3, 1998.

This is an online journal. The paper is available at

www.jstatsoft.org/v03/i03/
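
For a feel of what such a generator looks like, here is a compact Python
sketch; note that it implements the later Marsaglia-Tsang (2000) squeeze
method rather than the 1998 Monty Python method described in the paper, and is
offered only as an illustration:

    import numpy as np

    def gamma_variate(shape, rng):
        # Marsaglia-Tsang (2000) method; for shape < 1, boost and rescale
        if shape < 1.0:
            return gamma_variate(shape + 1.0, rng) * rng.random() ** (1.0 / shape)
        d = shape - 1.0 / 3.0
        c = 1.0 / np.sqrt(9.0 * d)
        while True:
            x = rng.standard_normal()
            v = (1.0 + c * x) ** 3
            if v <= 0.0:
                continue
            u = rng.random()
            if np.log(u) < 0.5 * x * x + d - d * v + d * np.log(v):
                return d * v

    rng = np.random.default_rng(0)
    sample = [gamma_variate(2.5, rng) for _ in range(10_000)]
    print(np.mean(sample))        # should be close to the shape parameter, 2.5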


2. Correction to my earlier posting in this thread.
 The sentence

3. multiply the result by a^n

should read

3. multiply the result by 1/a





=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=