Re: (Ad) The best publicity program!! Publicity worries are over.
BLANK [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...

O This is an [advertisement] e-mail sent under Article 50 of the Act on Promotion of Information and Communications Network Utilization and Information Protection.
O Your e-mail address was obtained from the Internet; we hold no personal information other than the address. If you wish to refuse further mail, please unsubscribe below. We sincerely apologise to anyone who does not want this information.

Have you been worried about publicity? Worry no more. Everything about publicity, and all the know-how, is here. Ask us anything. mailto:[EMAIL PROTECTED]

As we are now converting to a publicity agency business, we are offering, at bargain prices, the publicity programs we have collected over three years.

For publicity beginners: 2 e-mail extractors, 1 e-mail editor, 2 e-mail senders (1 full version, 1 demo), a list of 500,000 e-mail addresses, 1 bulletin-board auto-registration tool, and a database of 2,000 bulletin boards. All of the above for 100,000 won.

For intermediate users: 3 e-mail extractors, 1 e-mail editor, 3 e-mail senders (2 full versions, 1 demo), a list of 1,000,000 e-mail addresses, 1 bulletin-board auto-registration tool, and a database of 5,000 bulletin boards. All of the above for 200,000 won.

For advanced users (1): we install an e-mail extractor and sender directly on your personal home page - extraction, duplicate removal, automatic opt-out handling, sending, and opt-out list management. Installation requires a MySQL account on the home page; if you have no paid hosting, a 200 MB one-year hosting account costs 44,000 won extra. We do the installation for 200,000 won.

For advanced users (2): we lease a server loaded with a list of 10,000,000 e-mail addresses to a small number of clients (term: one year; price: 1,000,000 won). All publicity programs and know-how are transferred in full.

E-mail advertising agency: with the know-how accumulated over two years of publicity work, complete sending facilities, and 60,000,000 e-mail addresses, we will run your e-mail publicity campaign for you.
Re: Statistics Tool For Classification/Clustering
Rishabh Gupta [EMAIL PROTECTED] wrote in message news:a4eje9$ip8$[EMAIL PROTECTED]...

Hi All, I'm a research student at the Department of Electronics, University of York, UK. I'm working on a project related to music analysis and classification. I am at the stage where I perform some analysis on music files (currently only in MIDI format) and extract about 500 variables that are related to musical properties like pitch, rhythm, polyphony and volume. I am performing basic analysis like mean and standard deviation, but I also perform more elaborate analysis like measuring the complexity of melody and rhythm. The aim is that the variables obtained can be used to perform a number of different operations. - The variables can be used to classify/categorise each piece of music, on its own, in terms of some meta-classifier (e.g. rock, pop, classical). - The variables can be used to perform comparisons between two files. A variable from one music file can be compared to the equivalent variable in the other music file. By comparing all the variables in one file with the equivalent variables in the other file, an overall similarity measurement can be obtained. The next stage is to test the ability of the variables obtained to perform the classification/comparison. I need to identify variables that are redundant (redundant in the sense of 'they do not provide any information' or 'they provide the same information as another variable') so that they can be removed, and I need to identify variables that are distinguishing (provide the most information). My basic questions are: - What are the best statistical techniques/methods that should be applied here? E.g. I have looked at Principal Component Analysis; this would be a good method to remove the redundant variables and hence reduce the amount of data that needs to be processed. Can anyone suggest any other sensible statistical analysis methods?
- What are the ideal tools/software to perform the clustering/classification? I have access to SPSS software but I have never used it before and am not really sure how to apply it, or whether it is any good when dealing with hundreds of variables. So far I have been analysing each variable on its own, 'by eye', by plotting the mean and sd for all music files. However, this approach is not feasible in the long term since I am dealing with such a large number of variables. In addition, by looking at each variable on its own, I do not find clusters/patterns that are only visible through multivariate analysis. If anyone can recommend a better approach it would be greatly appreciated. Any help or suggestion that can be offered will be greatly appreciated.

A useful exposition of techniques for initial investigation of multivariate data sets is given at http://www.sas.com/service/library/periodicals/obs/obswww22/ If you point your browser at Andrews plots you will find more. My inclination would be to start with an Andrews plot, possibly using principal component scores, for about 20 music files from several genres. This will enable you to find linear combinations of variables which best separate the genres. The technique and examples are set out in Gnanadesikan: Multivariate Data Analysis, but this is an old reference. I hope this helps. Jim Snow

= Instructions for joining and leaving this list, remarks about the problem of INAPPROPRIATE MESSAGES, and archives are available at http://jse.stat.ncsu.edu/ =
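Jim Snow's suggestion - principal component scores fed into an Andrews plot - can be sketched in a few lines. This is only an illustration on synthetic data: the feature matrix, genre means, and all sizes below are invented, not Rishabh Gupta's real variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the extracted music features:
# 20 "files" x 6 variables, two "genres" with shifted means.
X = np.vstack([rng.normal(0.0, 1.0, size=(10, 6)),
               rng.normal(2.0, 1.0, size=(10, 6))])

# PCA via SVD of the centred data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                    # principal component scores
explained = s ** 2 / np.sum(s ** 2)   # variance explained per component

def andrews_curve(row, t):
    """Andrews' function f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t
    + x4 sin 2t + ...; one trace per observation."""
    out = np.full_like(t, row[0] / np.sqrt(2.0))
    for k in range(1, len(row)):
        harmonic = (k + 1) // 2
        out += row[k] * (np.sin(harmonic * t) if k % 2 == 1
                         else np.cos(harmonic * t))
    return out

# Evaluate each file's trace on a grid, using the first 4 PC scores.
t = np.linspace(-np.pi, np.pi, 200)
curves = np.array([andrews_curve(r, t) for r in scores[:, :4]])
```

Plotting `curves` (one line per file, coloured by genre) is then the Andrews plot; traces from the two groups should band apart where a separating linear combination exists.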
Re: Statistics Tool For Classification/Clustering
Richard Wright [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...

Genres are presumably groups. So linear combinations of variables that best separate the genres would be more effectively found by linear canonical variates analysis (aka discriminant analysis). Richard Wright

On Thu, 14 Feb 2002 03:18:48 GMT, Jim Snow [EMAIL PROTECTED] wrote: snipped My inclination would be to start with an Andrews plot, possibly using principal component scores for about 20 music files from several genres. This will enable you to find linear combinations of variables which best separate the genres. The technique and examples are set out in: snipped

Andrews plots and similar techniques do not replace discriminant analysis, which, as Richard Wright said, finds linear combinations of variables that best separate the groups. In the book by Gnanadesikan which first popularised the technique, he examines the variables in the discriminant space, i.e. a space defined by discriminant functions rather than principal components or original variables. The techniques are doing different things. Andrews plots enable examination of multidimensional data in a two-dimensional plot. Amongst other things, for example, several dimensions of high difference between, say, jazz and pop or between jazz and flamenco may be found, which are not necessarily orthogonal. Andrews plots are a data reduction technique which is, in many dimensions, analogous to examining a multidimensional cluster of points from many viewpoints, so that no possible viewpoint is far from one of those used. Thus virtually all possible discriminant functions are tried and the interesting ones noted. In a spirit of exploratory data analysis, this seems useful.

Rishabh Gupta wrote: - The variables can be used to perform comparison between two files. A variable from one music file can be compared to the equivalent variable in the other music file.
By comparing all the variables in one file with the equivalent variable in the other file, an overall similarity measurement can be obtained.

Andrews plots reveal the directions in which the two files differ. Incidentally, the total area between the two traces on the plot is the Euclidean distance, I think, if the original Andrews weightings are used. Tukey suggested weightings which examine the multidimensional space more closely but do not have such a simple interpretation of the difference between traces. I have not used any of this for some time and I do not have the relevant books, but the material I referred to on the web should be helpful. Straightforward discriminant analysis will certainly find the best linear discriminator in the least-squares sense, but stepwise elimination of variables in this process may result in discarding a variable with intuitive appeal in favour of one or several highly correlated with it, and the least-squares metric may possibly not be the best. For this and other reasons an exploratory approach, as Rishabh Gupta has begun, seems appropriate. I still hope this helps. Jim Snow
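Richard Wright's alternative, canonical variates (discriminant) analysis, reduces in the two-group case to Fisher's linear discriminant. A minimal sketch on synthetic "genre" data follows; the group labels, sizes, and mean shift are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical "genres", 15 files each, 5 features per file;
# the genre means differ, the within-genre spread is shared.
jazz = rng.normal(0.0, 1.0, size=(15, 5))
pop = rng.normal(1.5, 1.0, size=(15, 5))

# Fisher's linear discriminant (two-group canonical variate):
# w = Sw^{-1} (m1 - m2), with Sw the pooled within-group scatter.
m1, m2 = jazz.mean(axis=0), pop.mean(axis=0)
Sw = (jazz - m1).T @ (jazz - m1) + (pop - m2).T @ (pop - m2)
w = np.linalg.solve(Sw, m1 - m2)

# Scores along the discriminant direction; the two genres separate
# here more cleanly than on any single raw feature.
z_jazz, z_pop = jazz @ w, pop @ w
```

Examining the original variables' loadings in `w` (Gnanadesikan's "discriminant space") then shows which features drive the separation.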
Re: »¹ÊǺÍÒÔÇ°Ò»ÑùÂ𣿠£º£©
[EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...

MLM sites selected by the 2001 Internet vote - be sure to browse them patiently:
http://www.kelvin.13800.com New Profit Business Information Network
http://www.kelvin.uurr.com Everyone-Earns Business Information Network
http://www.kelvin.17951.com Jinyingtong Business Information Network
http://www.kelvin.9982.com New Thinking Business Information Network
http://www.kelvin.wapmark.com
http://kelvinfan.126.com 999 Business Information Network
http://www.kelvin.uwuw.com Mintmail network (no payment needed outside China; an alternative choice)
http://www.kelvinfan.13800.com Entrepreneurship Forum

All are excellent MLM sites, money-making websites run by Chinese people themselves - your own business plan! Fully Chinese interface, understandable at a glance. Ha ha, the introductions are a little overdone, but you really can make money, and absolutely no one is deceived! All you need is a bank credit-card account - don't have one? Apply for one online, completely free (China Construction Bank recommended). Then all you have to do is make a few phone calls each week and check the bank account - ha ha, just sit back and collect the money!!! Payments are made entirely by bank transfer, removing all your worries. Go take a look and learn how they operate; you may well be tempted. You should know: of the 500,000 millionaires in the United States, 20% made their fortunes in the past few years through this kind of multi-level information network marketing, MULTI-LEVEL MARKETING (MLM). In addition, statistics show that on average 45 people per day in the US become millionaires through MLM. Now MLM has come to China!!! These websites use just such a simple, novel, legal and entertaining method to let every member earn enormous wealth without leaving home. After visiting, please read all the content carefully before choosing to register. Once you register as a full member, you own a money-making website of your very own. Through it you can dig your first bucket of gold from the Internet! 10,000? 100,000? Or 1,000,000? Ha ha, who knows? Maybe more! How wonderful - no more putting up with foreigners; the foreign sites' credit is too poor, and you never get the money. Your dream will come true. :P (PS: once you formally become a member, I will send you, free of charge, mail addresses from all over the country numbering in the hundreds of thousands, to make your own promotion easier.) If this letter has disturbed you, please delete it without mercy!
Re: tricky explanation problem with chi-square on multinomial experiment (dice)
Gottfried Helms [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...

Hi, there was a tricky problem recently with the chi-square density at higher df's. I discussed that in sci.stat.consult and in a German newsgroup, got some answers, and also think I have understood the real point. But I would like to have a smoother explanation, as I have to deal with it in my seminars. Maybe someone out there has an idea or a better shortcut for how to describe it. To illustrate this I just copy-paste an exchange from s.s.consult; I hope you forgive my laziness. On the other hand, maybe the true point comes out better this way. Regards, Gottfried

3 postings added:

---(1/3)---
[Gottfried] Hi - I'm stumbling in the dark... eventually only missing some simple hint. I'm trying to explain the concept of significance of the deviation of an empirical sample from a given, expected distribution. If we discuss the chi-square distribution

|
|*
| *
|  *
|   *
|    *
|     *
|      *
+--------

then this graph illustrates very well that, and how, a small deviation is more likely to happen than a high deviation - thus backing the concept of the 95th percentiles etc. in the beginners' literature. Just cutting it into equal slices, this curve gives us the expected frequencies of occurrence of samples with individual chi-squared deviations from the expected occurrences. If I have more df's, then the curve changes its shape; in this case a 5-df curve for samples of thrown dice, where I count the frequencies of occurrence of each number and the deviation of these frequencies from uniformity.

|
|
|
|   *
|  * *
| **  **
|**     **
|          * *
+------------------
0         X²(df=5)

Now the slices with the highest frequency of occurrence are not the ones with the smallest deviation from the expected distribution (X²=0) - and even if I accept that this is at least so for the cumulative distribution, it is suddenly no longer self-explanatory.
It is congruent with reality, but our common language is different: the most likely chi-square deviation from uniformity is now an area which is not at the zero mark. So, now: do we EXPECT a deviation from uniformity? That the count of frequencies of the occurrences of the six dice numbers is NOT most likely uniform? HUH? Is this suddenly the null hypothesis? And do we then calculate the deviation of our empirical sample from this new null hypothesis??? I never thought about it in this way, but since I do now, I feel a bit confused; maybe I only have to step aside a bit? Any good hint appreciated - Gottfried.

---(2/3)---
Then one participant answered: Actually, that corresponds to the notion that if a random sequence is *too* uniform, it isn't really random. For example, if you were to toss a coin 1000 times, you'd be a little surprised if you got *exactly* 500 heads and 500 tails. If you think in terms of taking samples from a multinomial population, the non-monotonicity of the chi-square density means that a *small* amount of sampling error is more probable than *no* sampling error, as well as more probable than a *large* sampling error, which I think corresponds pretty well to our intuition.

---(3/3)---
I was not really satisfied with this and answered, after I had got some more insight: [Gottfried] [] wrote: Actually, that corresponds to the notion that if a random sequence is *too* uniform, it isn't really random. For example, if you were to toss a coin 1000 times, you'd be a little surprised if you got *exactly* 500 heads and 500 tails. If you think in terms of taking samples from a

Yes, this is true. But it is the same with every other combination. No one of them is more likely to occur (or better, one should say: variation?). But then, a student would ask, how could you still call a near-expected variation more likely than a far-from-expected variation in general?
The reason is that we don't argue about a specific variation, but about properties of a variation - or in this case, of a combination. We commonly select the property of having a distance from the expected variation, measured in terms
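The non-monotonicity Gottfried describes is easy to demonstrate by simulation. The sketch below (sample sizes and seed are arbitrary choices, not from the post) rolls a fair die repeatedly and tabulates Pearson's X² statistic, which is approximately chi-square with 5 df:

```python
import numpy as np

rng = np.random.default_rng(2)

# Roll a fair die 600 times and compute Pearson's X^2 against the
# uniform expectation of 100 per face; repeat many times.
n_rolls, n_reps, expected = 600, 5000, 100.0
stats = np.empty(n_reps)
for i in range(n_reps):
    faces = rng.integers(1, 7, size=n_rolls)
    observed = np.bincount(faces, minlength=7)[1:]  # counts of faces 1..6
    stats[i] = np.sum((observed - expected) ** 2 / expected)

# The chi-square(5) density peaks near df - 2 = 3, not at 0:
# near-perfect uniformity is *rarer* than a moderate deviation.
frac_near_zero = np.mean(stats < 1.0)
frac_moderate = np.mean((stats >= 2.0) & (stats < 4.0))
```

A histogram of `stats` reproduces the humped 5-df curve in the post: the slice near X² = 0 holds far fewer samples than the slices around X² = 3.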
Re: How To Code Position In A List
Ronny Richardson [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...

I want to analyze data sets where I have two variables: the finishing position (ordinal) and a ratio-scale performance variable. I want to see if there is a relationship between the finishing position and the value of the ratio performance variable. This would be about the same as seeing if there was a relationship between the order in which an exam was completed and the resulting score. My problem is that I have multiple groups, and so being #15 in one group might put you at the back of the group while being #20 in another group might put you near the front. Is there a way to recode the position variable to make it meaningful in this situation? I have considered percentiles (more specifically deciles) but I am not sure this is the best way to go. Any suggestions? Ronny Richardson

Using percentiles is a sensible way to go. Alternatively, replacing ranks with normal scores would be sensible if you know that the ratio-scale variable is normally distributed with a constant variance. Then Pearson's product-moment correlation coefficient can be calculated across the whole data set. However, if there is any doubt about these assumptions, there is a better way. If Spearman's rank correlation coefficient is calculated for each group separately, the weighted average of these will be a good descriptive statistic for the whole set. To obtain a significance test, calculate the p value of the correlation coefficient for each group separately (p(1), p(2), ..., p(n)). Then, under the null hypothesis, -2*Sum from 1 to n of Log p(i) is an observation on a chi-squared random variable with 2*n degrees of freedom. Hope this helps. Jim Snow
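Jim Snow's combination rule is Fisher's method. A small sketch with made-up groups (the group sizes, slope, and noise level are invented for illustration) shows the mechanics, using scipy for the per-group Spearman p values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical data: three groups of different sizes, each with a
# finishing position and a performance score that tends to fall
# off with position.
pvals = []
for n in (15, 20, 12):
    position = np.arange(1, n + 1)
    score = -0.5 * position + rng.normal(0.0, 3.0, size=n)
    rho, p = stats.spearmanr(position, score)
    pvals.append(p)

# Fisher's combination: under the null of no association in any
# group, -2 * sum(log p_i) ~ chi-square with 2n degrees of freedom.
fisher_stat = -2.0 * np.sum(np.log(pvals))
combined_p = stats.chi2.sf(fisher_stat, df=2 * len(pvals))
```

A small `combined_p` indicates a position-performance association somewhere in the groups, without ever comparing raw positions across groups of different sizes.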
Re: Interpreting multiple regression Beta is only way?
Wuzzy [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... Rich Ulrich [EMAIL PROTECTED] wrote in message

Thanks Rich, most informative. I am trying to determine a method of comparing apples to oranges - it seems an important thing to try to do; perhaps it is impossible. I am trying to determine which is better, glycemic index or carbohydrate total, in predicting glycemic load (glycemic load = glycemic index * carbohydrate). My results as a matrix:

          GI load   GI      Carb
GI load   1.000
GI        .533     1.000
Carb      .858     .124    1.000

So it seems that carb affects GI load more than does GI.. but this is on ALL foods.. (nobody eats ALL foods so cannot extrapolate to the human diet), but I don't think you're allowed to do this kind of comparison as Carb and GI are totally different values. I suspected that you would be allowed to make the comparisons if you use Betas, i.e. measure how many standard-deviation changes of GI and Carb it requires.. If it takes a bigger standard-deviation change of Carb then you could say that it is more likely that carb has a bigger effect on glycemic load. You seem to suggest that even using standard-deviation changes, you cannot compare apples to oranges. Which sounds right but is disappointing..

The glycaemic index is calculated as the area under the blood glucose curve for the two hours (or 3 hours for diabetics) after ingesting enough of a food to include 50 grams of carbohydrate, divided by the same area after ingesting 50 grams of pure glucose, expressed as a percentage. In some cases a reference food other than glucose is used. If the area under the curve is the glycaemic load you are studying, I would expect the model glycemic load = glycemic index * carbohydrate to fit the data very well when the carbohydrate content is near 50 gm, providing all the glycaemic indices have been calculated on the same basis.
Using correlations or beta coefficients as you are doing is appropriate when linear relationships are involved, but not to test for goodness of fit to this model. What would be of interest would be a plot of the difference between the predicted glycaemic load and the observed value, against carbohydrate, especially for carbohydrate values far from 50 gm. If I have a meal mainly of eggs or meat, the total carbohydrate content is very low, so the glycaemic load calculated from the formula may be wrong. One difficulty with the whole Glycaemic Index approach is that there is not, as far as I know, any way of calculating the glycaemic load from foods like cheese, eggs and meat. If the body needs glucose, it will be made from fat and protein foods. It is not surprising that it could be hard to persuade volunteers to ingest 8500 grams of processed cheese, containing 50 gm of carbohydrate, in order to determine its glycaemic index :-) I would like to see another index constructed giving the glycaemic load produced by 100 gm of each food, rather than the load produced by that amount of food which contains 50 gm of carbohydrate.
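The diagnostic Jim Snow proposes - residuals from the load = GI * carbohydrate model plotted against carbohydrate - can be sketched as follows. The data here are entirely synthetic (the lack-of-fit term and all constants are invented to make the pattern visible), not real food measurements:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic "food" data: a glycaemic index (%) and a carbohydrate
# content (g) for 50 made-up foods, where the true load drifts
# away from GI*carb as carbohydrate moves from 50 g.
gi = rng.uniform(30.0, 100.0, size=50)
carb = rng.uniform(5.0, 80.0, size=50)
observed_load = (gi * carb / 100.0
                 + 0.02 * (carb - 50.0) ** 2   # invented lack of fit
                 + rng.normal(0.0, 2.0, size=50))

# The model under test: glycemic load = glycemic index * carbohydrate.
predicted_load = gi * carb / 100.0
residual = observed_load - predicted_load

# A quadratic fit of the residuals against (carb - 50) summarises
# the plot: a clearly nonzero leading coefficient flags the lack
# of fit far from 50 g that the post anticipates.
coeffs = np.polyfit(carb - 50.0, residual, deg=2)
```

On real data, a residual plot showing no trend near carb = 50 but systematic drift far from it would support Jim Snow's point about extrapolating the formula.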
Re: How to perform Runs Test??
Glen [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... [EMAIL PROTECTED] (Chia C Chong) wrote in message news:[EMAIL PROTECTED]...

I am using a nonlinear regression method to find the best parameters for my data. I came across a term called the runs test on the Internet. It was mentioned that this determines whether my data differs significantly from the equation model I selected for the nonlinear regression. Can someone please let me know how I should perform the runs test??

You need to use a runs test that's adjusted for the dependence in the residuals. The usual runs test in the texts won't apply. Glen

I always understood that the runs test was designed to detect systematic departures from the fitted line because some other curve fitted the data better. In this context, it is a test for dependence of residuals. There is a discussion of this at http://216.46.227.18/curvefit/systematic_deviation.htm Any elementary text on non-parametric methods in statistics will give an example. Hope this helps. Jim Snow [EMAIL PROTECTED]

= Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
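For reference, the ordinary (unadjusted) runs test the posters are discussing can be sketched as below; as Glen notes, it is not corrected for dependence in regression residuals, so treat it as the textbook version only. The example data are invented: a straight line fitted to a parabola, which leaves the classic few-runs pattern.

```python
import numpy as np
from scipy import stats

def runs_test(residuals):
    """Wald-Wolfowitz runs test on the signs of the residuals
    (normal approximation); a small p value suggests systematic
    deviation from the fitted curve."""
    signs = residuals > 0
    n1 = np.sum(signs)            # positive residuals
    n2 = np.sum(~signs)           # non-positive residuals
    runs = 1 + np.sum(signs[1:] != signs[:-1])
    mu = 1 + 2.0 * n1 * n2 / (n1 + n2)
    var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mu) / np.sqrt(var)
    return runs, 2.0 * stats.norm.sf(abs(z))

# A straight line fitted to curved data leaves one long arc of
# negative residuals flanked by positive ones: few runs, small p.
x = np.linspace(0.0, 1.0, 40)
y = (x - 0.5) ** 2
fit = np.polyval(np.polyfit(x, y, deg=1), x)
runs, p = runs_test(y - fit)
```

Here the residuals form just three runs (positive, negative, positive), far fewer than expected under randomness, so the test correctly flags the wrong model.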
Re: How to prove absence of dependence between arguments of function?
- Original Message - From: Estimator [EMAIL PROTECTED] Newsgroups: sci.stat.edu Sent: Saturday, 22 December 2001 2:47 Subject: How to prove absence of dependence between arguments of function?

I've got a linear function Y=f(x1;x2;x3;x4) of a theoretical distribution; in addition x3=g1(x1), x4=g2(x1;x2). Also I've got an empirical sample consisting of N sets of these values (magnitudes): Y1 x11 x12 x13 x14 .. Yn xn1 xn2 xn3 xn4. Since x3, x4 are dependent on x1, x2, it's reasonable to evaluate x3, x4 from x1, x2 accordingly and analyse Y only from x1, x2.

If x3 = g1(x1), then x3 and x1 can only be independent if the function g1 is a constant, or at least degenerate with respect to x1, and similarly for Y.

But I've got a strong belief that in fact all of the arguments are independent, or the dependence is insignificant. How can this be proved mathematically using empirical observations?

You could plot x1 against x3 to convince yourself that there is no tendency for the points to depart from a random scatter. Similarly plot x1 vs x4 and x2 vs x4. But this would not give you objective grounds to include or reject x3 and x4 from the analysis. In fact, even if x3 and x4 are uncorrelated with x1 and x2, your best course would be to retain them in the analysis. Then you can formally test whether they contribute any explanatory power with respect to Y. If the explanatory power of the model is not significantly improved by including x3 and x4, you have objective evidence to exclude them from the model.

Is there any sense in making a 4x4 correlation matrix (Pearson) and proving the insignificance of the coefficients of correlation (Student's t-criterion, for example) between arguments?

No. One reason is that you would have to conduct six significance tests, and the chance of the 1-in-20 level of the test being exceeded by chance in one of them is too high. In any event, correlation is only a measure of LINEAR dependence, and frequently data has more complicated dependencies.
There is no way to prove independence from a data set, in the same sense that scientific theories are never proved, only disproved. In particular, failure to reject a null hypothesis is not proof of its correctness. Dependence does not consist only of linear relationships, and lack of correlation does not imply independence. For example, if X is symmetrically distributed on (-1, 1) and Y = X^2, then X and Y are uncorrelated although functionally related.

Hope this helps
Jim Snow [EMAIL PROTECTED]

= Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
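The uncorrelated-but-dependent example above is easy to check numerically. A small sketch (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# X symmetric on (-1, 1); Y = X^2 is completely determined by X,
# yet their (linear) correlation is essentially zero, since Cov(X, X^2)
# = E[X^3] = 0 for a symmetric distribution.
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to 0, despite the exact functional relationship
```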
Re: Standardizing evaluation scores
Glen Barnett [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...

> Stan Brown wrote:
> > But is it worth it? Don't the easy graders and tough graders pretty
> > much cancel each other out anyway?
>
> Not if some students only get hard graders and some only get easy
> graders. If all students got all graders an equal amount of time, it
> probably wouldn't matter at all.
> Glen

If some graders use the whole scale and others use only part of the scale, or concentrate grades near the centre, then using raw scores gives the full-scale graders more weight in the overall ranking of students. If this is undesirable, grades could be scaled to a common mean and equal mean deviation. (Standard deviation would give increased weight to the extremes of the scale.)

In all these adjustments we lose transparency of the process, and this must be weighed against the gains. I suspect that only sharp contrasts between the behaviour of the graders, and/or different students having different sets of graders, would justify this, and the problem may well be better dealt with by instructing the graders appropriately, after pointing out that it is desirable for all graders to have equal weight in the assessment.
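The scaling to a common mean and equal mean deviation described above can be sketched as follows. The target mean, target mean deviation, and example scores are all arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical raw scores from two graders (invented for illustration):
grader_a = np.array([55.0, 60.0, 65.0, 70.0, 75.0])  # uses the middle only
grader_b = np.array([30.0, 50.0, 65.0, 80.0, 95.0])  # uses the whole scale

def rescale(scores, target_mean=65.0, target_mad=10.0):
    """Shift and stretch scores to a common mean and mean absolute deviation."""
    mad = np.mean(np.abs(scores - scores.mean()))
    return target_mean + (scores - scores.mean()) * (target_mad / mad)

a, b = rescale(grader_a), rescale(grader_b)
print(round(a.mean(), 6), round(np.mean(np.abs(a - a.mean())), 6))  # 65.0 10.0
print(round(b.mean(), 6), round(np.mean(np.abs(b - b.mean())), 6))  # 65.0 10.0
```

After rescaling, both graders' marks have the same centre and the same spread as measured by mean absolute deviation, so neither dominates a combined ranking.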
Re: When to Use t and When to Use z Revisited
Ronny Richardson [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...

> A few weeks ago, I posted a message about when to use t and when to use z.

I did not see the earlier postings, so forgive me if I repeat advice already given. :-)

1. The consequences of using the t distribution instead of the normal distribution for sample sizes greater than 30 are of no importance in practice. The differences in the numbers given as confidence limits are so small that no sensible person would change their course of action based on that minuscule variation. In the case of a significance test, a result just over or just under, say, the 5% level should always be examined in the knowledge that 5% is an arbitrary level, and that a level of 4.9% or 5.1% could equally well have been chosen.

2. There is no good reason for statistical tables intended for practical data analysis to give figures for t at more than 30 degrees of freedom, except that it makes it simple to routinely use one set of tables whenever the variance is estimated from the sample. Another reason that books of tables do not include t values for degrees of freedom between 30, 60, sometimes 120, and infinity is that there is no need, even for the extreme tails of the distribution: when, for whatever reason, high accuracy is required, the intermediate values can be obtained by harmonic interpolation. That is, the tail entries in the table can be obtained by linear interpolation on 1/n.

3. There are situations where the error variance is known. They generally arise when the errors in the data come from the use of a measuring instrument of known accuracy, or when the figures available are known to be truncated to a certain number of decimal places. For example: several drivers use cars in a car pool. The distance travelled on each trip by a driver is recorded, based on the odometer reading. Each observation has an error which is uniformly distributed on (0, 0.2).
The variance of this error is (0.2)^2 / 12 = 0.0033, and the standard deviation is 0.0577. To calculate confidence limits for the average distance travelled by each driver, the z statistic should be used. A similar situation could arise in dealing with data in which the error arises from the rounding of all numbers to the nearest thousand. This is an uncommon situation in a business context, but it arises quite often in scientific work, where the inherent accuracy of a measuring instrument may be known from long experience and need not be estimated from the small sample currently being examined.

4. You seem to think the Central Limit Theorem is behind the validity of t vs z tables. This is not so. The CLT bears only on the Normal shape and on the relation of the variance of an average or sum to the population variance.

Commenting specifically on points in your posting:

> A few weeks ago, I posted a message about when to use t and when to use z.
> (snip)
> So, I conclude 1) we use z when we know the sigma and either the data is
> normally distributed or the sample size is greater than 30

Yes, but the difference if you use t is tiny and of no importance.

> so we can use the central limit theorem.

No. The CLT is not the reason. The CLT ensures that the average and the sum are Normally distributed for large enough n. Unless the data is very skewed or bimodal, n = 5 is usually large enough in practice. This is a separate issue from the choice of the Normal or t distribution for inference.

> 2) When n < 30 and the data is normally distributed, we use t.
> 3) When n is greater than 30 and we do not know sigma, we must estimate
> sigma using s, so we really should be using t rather than z.

Yes, but the difference in the resulting numbers is minuscule and of no importance.
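Points 2 and 3 above can be checked with a short calculation. The t critical values below are standard tabulated two-sided 5% values quoted from tables, so treat them as illustrative rather than authoritative.

```python
# (a) Known error variance from a uniform rounding error on (0, 0.2):
# variance of Uniform(0, w) is w^2 / 12.
var = 0.2 ** 2 / 12
sd = var ** 0.5
print(round(var, 5), round(sd, 4))  # 0.00333 0.0577

# (b) Two-sided 5% critical values (standard tabulated values):
z = 1.9600                               # normal
t30, t40, t60 = 2.0423, 2.0211, 2.0003   # t with 30, 40, 60 df

# (c) Harmonic interpolation: linear interpolation on 1/df between the
# tabulated df = 30 and df = 60 entries recovers the df = 40 entry closely.
w = (1 / 30 - 1 / 40) / (1 / 30 - 1 / 60)
t40_interp = t30 + w * (t60 - t30)
print(round(t40_interp, 4))  # 2.0213, vs the tabulated 2.0211
```

Note how small the t-vs-z gap already is at 40 degrees of freedom, and how well interpolation on 1/df fills the gaps between the tabulated rows.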
> Now, every single business statistics book I have examined, including the
> four referenced below, uses z values when performing hypothesis tests or
> computing confidence intervals when n > 30. Are they
> 1. Wrong
> 2. Just oversimplifying it without telling the reader
> or am I overlooking something?
> Ronny Richardson

I hope that helps
Jim Snow
Re: Gen random numbers from distribution
1. George Marsaglia and Wai Wan Tsang published a paper dealing with your problem, which gives an efficient procedure for all values of the parameters: The Monty Python Method for Generating Gamma Variables, Journal of Statistical Software, vol. 3, issue 3, 1998. This is an online journal; the paper is available at www.jstatsoft.org/v03/i03/

2. Correction to my earlier posting in this thread. The sentence

   3. multiply the result by a^n

should read

   3. multiply the result by 1/a
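The corrected step can be sanity-checked numerically. This sketch assumes the procedure being corrected generates a Gamma(n, rate = a) variate for integer shape n by summing n unit exponentials; the earlier posting is not shown here, so that reading is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed reading of the recipe: for integer shape n, a Gamma(n, rate = a)
# variate is the sum of n unit exponentials, multiplied by 1/a.
n_shape, a = 3, 2.0
m = 500_000

g = rng.exponential(size=(m, n_shape)).sum(axis=1) * (1.0 / a)

# Gamma(n, rate = a) has mean n/a and variance n/a^2.
print(round(g.mean(), 2), round(g.var(), 2))  # about 1.5 and 0.75
```

Multiplying by a^n instead would scale the mean to n * a^(n-1), which does not match the target distribution, consistent with the correction above.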