Re: [R] PCA for Binary data
Dear Prof Brian Ripley,

Would you also recommend some packages for non-binary data to do variable and feature selection? Thanks a lot!

Alex

On 6/12/07, Prof Brian Ripley [EMAIL PROTECTED] wrote:

On Tue, 12 Jun 2007, Spencer Graves wrote:

The problem with applying prcomp to binary data is that it is not clear what problem you are solving. The standard principal components and factor analysis models assume that the observations are linear combinations of unobserved, normally distributed common factors (shared variability) plus normal noise, independent between observations and variables. Those assumptions are clearly violated for binary data.

RSiteSearch("PCA for binary data") produced references to 'ade4' and 'FactoMineR'. Have you considered these? I have not used them, but 'FactoMineR' includes functions for 'Multiple Factor Analysis for Mixed [quantitative and qualitative] Data'.

AFAIK, that is not using 'factor analysis' in the same sense as e.g. factanal(). Continuous underlying variables with binary manifest variables are part of latent variable analysis, and package 'ltm' covers a variety of such models. But to begin to give advice we would need to know the scientific problem for which Ranga Chandra Gudivada is looking for a tool. Simon Blomberg mentioned ordination, but that is only one of several classes of uses of PCA (which finds a linear subspace that both has maximal variance within it and is least-squares fitting to the data).

Hope this helps.
Spencer Graves

Josh Gilbert wrote:

I don't understand, what's wrong with using prcomp in this situation?

On Sunday 10 June 2007 12:50 pm, Ranga Chandra Gudivada wrote:

Hi, I was wondering whether there is any package implementing Principal Component Analysis for binary data. Thanks, chandra

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax: +44 1865 272595
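To make the suggestions concrete, here is a minimal sketch of the 'ltm' route Prof Ripley mentions, plus a multiple correspondence analysis alternative. It assumes a data frame X of binary (0/1) items; X is a placeholder name, not an object from this thread.

    library(ltm)              # latent trait models for binary manifest variables
    fit <- ltm(X ~ z1)        # two-parameter logistic model, one latent dimension
    summary(fit)
    factor.scores(fit)        # scores on the latent dimension, a PCA-score analogue

    # Multiple correspondence analysis is another option for binary items,
    # here via MASS::mca with the items recoded as factors:
    library(MASS)
    Xf <- as.data.frame(lapply(X, factor))
    mc <- mca(Xf, nf = 2)     # keep two dimensions
    plot(mc)

Which route is appropriate depends, as noted above, on the scientific question: 'ltm' assumes a continuous latent trait behind the binary items, while MCA is a purely descriptive ordination.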
Re: [R] How to load a big txt file
Dear Chung-hong Chan,

Thanks! Can you recommend a text editor for splitting? I used UltraEdit and TextPad but did not find that they can split files.

Sincerely,
Alex

On 6/6/07, Chung-hong Chan [EMAIL PROTECTED] wrote:

An easy solution would be to split your big txt file with a text editor, e.g. into 5000 rows each, and then combine the data frames together into one.

On 6/7/07, ssls sddd [EMAIL PROTECTED] wrote:

Dear list,

I need to read a big txt file (around 130Mb; 23800 rows and 49 columns) for downstream clustering analysis. I first used

Tumor <- read.table("Tumor.txt", header = TRUE, sep = "\t")

but it took a long time and failed. However, it had no problem if I just put in data of 3 columns. Is there any way to load this big file?

Thanks for any suggestions!

Sincerely,
Alex

--
The scientists of today think deeply instead of clearly. One must be sane to think clearly, but one can think deeply and be quite insane.
Nikola Tesla
http://www.macgrass.com
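If no editor will do the splitting, the same chunking can be done inside R itself. A sketch, assuming the tab-separated Tumor.txt from this thread, read through an open connection in 5000-row pieces (read.table continues from where the previous read stopped when given an open connection):

    con <- file("Tumor.txt", open = "r")
    nms <- strsplit(readLines(con, n = 1), "\t")[[1]]   # header line
    chunks <- list()
    repeat {
      chunk <- tryCatch(
        read.table(con, sep = "\t", nrows = 5000, col.names = nms),
        error = function(e) NULL    # "no lines available" at end of file
      )
      if (is.null(chunk)) break
      chunks[[length(chunks) + 1]] <- chunk
    }
    close(con)
    Tumor <- do.call(rbind, chunks)

One caveat: col.names assumes every data row has exactly as many fields as the header.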
Re: [R] How to load a big txt file
Dear Michael,

It consists of 238305 rows and 50 columns, including the header row and the row-name (ID) column. Thanks!

Alex

On 6/7/07, michael watson (IAH-C) [EMAIL PROTECTED] wrote:

Erm... is that a typo? Are we really talking 23800 rows and 49 columns? Because that doesn't seem that many.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of ssls sddd
Sent: 07 June 2007 10:48
To: r-help@stat.math.ethz.ch
Subject: Re: [R] How to load a big txt file

Dear Chung-hong Chan,

Thanks! Can you recommend a text editor for splitting? I used UltraEdit and TextPad but did not find that they can split files.

Sincerely,
Alex

On 6/6/07, Chung-hong Chan [EMAIL PROTECTED] wrote:

An easy solution would be to split your big txt file with a text editor, e.g. into 5000 rows each, and then combine the data frames together into one.

On 6/7/07, ssls sddd [EMAIL PROTECTED] wrote:

Dear list,

I need to read a big txt file (around 130Mb; 23800 rows and 49 columns) for downstream clustering analysis. I first used

Tumor <- read.table("Tumor.txt", header = TRUE, sep = "\t")

but it took a long time and failed. However, it had no problem if I just put in data of 3 columns. Is there any way to load this big file?

Thanks for any suggestions!

Sincerely,
Alex
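At that size a rough capacity check is reassuring: the numeric part alone is well within an ordinary machine's memory, so the failure is more likely read.table's type guessing than the data volume. A back-of-the-envelope sketch (238304 data rows after the header):

    238304 * 49 * 8 / 2^20   # doubles at 8 bytes each: about 89 MB
    # read.table can transiently use several times the final object size
    # while it guesses column types, which is why supplying colClasses
    # (as in the solution later in this thread) helps so much.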
Re: [R] How to load a big txt file
Dear Jim,

Thanks a lot! The size of the text file is 189,588,541 bytes. It consists of 238305 rows (including the header) and 50 columns (the first column is the ID and the rest are 49 samples).

The first row (the header) looks like:

ID AIRNS_p_Sty5_Mapping250K_Sty_A09_50156.cel AIRNS_p_Sty5_Mapping250K_Sty_A11_50188.cel AIRNS_p_Sty5_Mapping250K_Sty_A12_50204.cel AIRNS_p_Sty5_Mapping250K_Sty_B09_50158.cel AIRNS_p_Sty5_Mapping250K_Sty_C01_50032.cel AIRNS_p_Sty5_Mapping250K_Sty_C12_50208.cel AIRNS_p_Sty5_Mapping250K_Sty_D03_50066.cel AIRNS_p_Sty5_Mapping250K_Sty_D08_50146.cel AIRNS_p_Sty5_Mapping250K_Sty_F03_50070.cel AIRNS_p_Sty5_Mapping250K_Sty_F12_50214.cel AIRNS_p_Sty5_Mapping250K_Sty_G09_50168.cel DOLCE_p_Sty7_Mapping250K_Sty_B04_53892.cel DOLCE_p_Sty7_Mapping250K_Sty_B06_53924.cel DOLCE_p_Sty7_Mapping250K_Sty_C05_53910.cel DOLCE_p_Sty7_Mapping250K_Sty_C10_53990.cel DOLCE_p_Sty7_Mapping250K_Sty_D05_53912.cel DOLCE_p_Sty7_Mapping250K_Sty_E01_53850.cel DOLCE_p_Sty7_Mapping250K_Sty_G12_54030.cel DOLCE_p_Sty7_Mapping250K_Sty_H06_53936.cel DOLCE_p_Sty7_Mapping250K_Sty_H08_53968.cel DOLCE_p_Sty7_Mapping250K_Sty_H11_54016.cel DOLCE_p_Sty7_Mapping250K_Sty_H12_54032.cel GUSTO_p_Sty20_Mapping250K_Sty_C08_81736.cel GUSTO_p_Sty20_Mapping250K_Sty_E03_81660.cel GUSTO_p_Sty20_Mapping250K_Sty_H02_81650.cel HEWED_p_250KSty_Plate_20060123_GOOD_B01_46246.cel HEWED_p_250KSty_Plate_20060123_GOOD_C06_46328.cel HEWED_p_250KSty_Plate_20060123_GOOD_F02_46270.cel HEWED_p_250KSty_Plate_20060123_GOOD_G04_46304.cel HOCUS_p_Sty4_Mapping250K_Sty_B05_55060.cel HOCUS_p_Sty4_Mapping250K_Sty_B12_55172.cel HOCUS_p_Sty4_Mapping250K_Sty_E05_55066.cel SOARS_p_Sty23_Mapping250K_Sty_B07_89024.cel SOARS_p_Sty23_Mapping250K_Sty_C01_88930.cel SOARS_p_Sty23_Mapping250K_Sty_C11_89090.cel SOARS_p_Sty23_Mapping250K_Sty_F07_89032.cel SOARS_p_Sty23_Mapping250K_Sty_H08_89052.cel SOARS_p_Sty23_Mapping250K_Sty_H10_89084.cel VINOS_p_Sty8_Mapping250K_Sty_A04_54082.cel VINOS_p_Sty8_Mapping250K_Sty_A07_54130.cel VINOS_p_Sty8_Mapping250K_Sty_B08_54148.cel VINOS_p_Sty8_Mapping250K_Sty_D01_54040.cel VINOS_p_Sty8_Mapping250K_Sty_D05_54104.cel VINOS_p_Sty8_Mapping250K_Sty_E04_54090.cel VINOS_p_Sty8_Mapping250K_Sty_E12_54218.cel VINOS_p_Sty8_Mapping250K_Sty_G01_54046.cel VINOS_p_Sty8_Mapping250K_Sty_G12_54222.cel VOLTS_p_Sty9_Mapping250K_Sty_G09_57916.cel VOLTS_p_Sty9_Mapping250K_Sty_H12_57966.cel

and the second row looks like:

SNP_A-1780271 1.8564200401306 1.5095599889755 1.7315399646759 1.530769944191 1.6576000452042 1.474179983139 2.1564099788666 1.77572267 1.5979499816895 2.164146185 1.980849981308 2.180370092392 1.8782299757004 2.1485500335693 1.5325000286102 1.7232999801636 2.2281200885773 1.938169482 1.8546999692917 2.1590900421143 2.1928400993347 2.0253200531006 2.6680200099945 2.7435901165009 2.0804998874664 3.2142300605774 2.1001501083374 2.147579908371 3.5244200229645 1.374480009079 1.6613099575043 3.1606800556183 2.0917000770569 1.872725613 1.8952000141144 1.813570022583 1.8180899620056 2.2553699016571 1.927329428 1.6766400337219 1.3424600362778 1.5666999816895 1.7180800437927 1.9548699855804 1.999694824 2.2242999076843 1.7591500282288 2.0480198860168 2.638689994812

Thanks a lot!

Sincerely,
Alex

On 6/6/07, jim holtman [EMAIL PROTECTED] wrote:

It would be useful if you could post the first couple of rows of the data so we can see what it looks like.

On 6/6/07, ssls sddd [EMAIL PROTECTED] wrote:

Dear list,

I need to read a big txt file (around 130Mb; 23800 rows and 49 columns) for downstream clustering analysis.
I first used

Tumor <- read.table("Tumor.txt", header = TRUE, sep = "\t")

but it took a long time and failed. However, it had no problem if I just put in data of 3 columns. Is there any way to load this big file?

Thanks for any suggestions!

Sincerely,
Alex

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem you are trying to solve?
[R] How to do clustering
Dear List,

I have another question to bother you with: how to do clustering. My data consist of 49 columns (49 variables) and 238804 rows. I would like to do hierarchical clustering (unsupervised clustering and PCA). So far I tried pvclust (http://www.is.titech.ac.jp/~shimo/prog/pvclust/), but R always failed with an error like "cannot allocate vector of size ...". I am curious which other packages can perform the clustering analysis while staying memory efficient. Meanwhile, is there any way to extract the features of each cluster? In other words, I would like to identify which features are responsible for classifying these variables (samples).

Thanks a lot!

Sincerely,
Alex
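The memory wall here comes from the distance matrix: hclust over all 238804 rows needs on the order of 238804^2/2 dissimilarities, over 200 GB as doubles, while the 49 samples cost almost nothing to cluster. A sketch along those lines, where X stands for the 238804 x 49 numeric matrix (the name is a placeholder):

    d  <- dist(t(X))                    # 49 x 49 distances between samples
    hc <- hclust(d, method = "average")
    plot(hc)                            # dendrogram of the 49 samples

    pc <- prcomp(t(X), scale. = TRUE)   # PCA of the samples
                                        # (drop zero-variance rows of X first)
    summary(pc)
    head(pc$rotation[, 1:2])            # loadings: which rows drive PC1 and PC2

    # If the 238804 rows themselves must be clustered, avoid the full
    # distance matrix, e.g. with clara() from the recommended package
    # 'cluster', which clusters around medoids using subsamples:
    library(cluster)
    cl <- clara(X, k = 5, samples = 50) # k = 5 is arbitrary here

The rotation matrix from prcomp is one answer to the feature question: large absolute loadings mark the rows that drive the separation of the samples.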
Re: [R] How to load a big txt file
Dear Jim,

It works great. I appreciate your help.

Sincerely,
Alex

On 6/7/07, jim holtman [EMAIL PROTECTED] wrote:

I took your data and duplicated the data line so I had over 100,000 rows, and it took about 40 seconds to read in when specifying colClasses:

system.time(x <- read.table('/tempxx.txt', header = TRUE,
    colClasses = c('factor', rep('numeric', 49))))
   user  system elapsed
  40.98    0.46   42.39

str(x)
'data.frame': 102272 obs. of 50 variables:
 $ ID : Factor w/ 1 level "SNP_A-1780271": 1 1 1 1 1 1 1 1 1 1 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_A09_50156.cel : num 1.86 1.86 1.86 1.86 1.86 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_A11_50188.cel : num 1.51 1.51 1.51 1.51 1.51 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_A12_50204.cel : num 1.73 1.73 1.73 1.73 1.73 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_B09_50158.cel : num 1.53 1.53 1.53 1.53 1.53 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_C01_50032.cel : num 1.66 1.66 1.66 1.66 1.66 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_C12_50208.cel : num 1.47 1.47 1.47 1.47 1.47 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_D03_50066.cel : num 2.16 2.16 2.16 2.16 2.16 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_D08_50146.cel : num 1.78 1.78 1.78 1.78 1.78 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_F03_50070.cel : num 1.60 1.60 1.60 1.60 1.60 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_F12_50214.cel : num 2.16 2.16 2.16 2.16 2.16 ...
 $ AIRNS_p_Sty5_Mapping250K_Sty_G09_50168.cel : num 1.98 1.98 1.98 1.98 1.98 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_B04_53892.cel : num 2.18 2.18 2.18 2.18 2.18 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_B06_53924.cel : num 1.88 1.88 1.88 1.88 1.88 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_C05_53910.cel : num 2.15 2.15 2.15 2.15 2.15 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_C10_53990.cel : num 1.53 1.53 1.53 1.53 1.53 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_D05_53912.cel : num 1.72 1.72 1.72 1.72 1.72 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_E01_53850.cel : num 2.23 2.23 2.23 2.23 2.23 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_G12_54030.cel : num 1.94 1.94 1.94 1.94 1.94 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_H06_53936.cel : num 1.85 1.85 1.85 1.85 1.85 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_H08_53968.cel : num 2.16 2.16 2.16 2.16 2.16 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_H11_54016.cel : num 2.19 2.19 2.19 2.19 2.19 ...
 $ DOLCE_p_Sty7_Mapping250K_Sty_H12_54032.cel : num 2.03 2.03 2.03 2.03 2.03 ...
 $ GUSTO_p_Sty20_Mapping250K_Sty_C08_81736.cel : num 2.67 2.67 2.67 2.67 2.67 ...
 $ GUSTO_p_Sty20_Mapping250K_Sty_E03_81660.cel : num 2.74 2.74 2.74 2.74 2.74 ...
 $ GUSTO_p_Sty20_Mapping250K_Sty_H02_81650.cel : num 2.08 2.08 2.08 2.08 2.08 ...
 $ HEWED_p_250KSty_Plate_20060123_GOOD_B01_46246.cel: num 3.21 3.21 3.21 3.21 3.21 ...
 $ HEWED_p_250KSty_Plate_20060123_GOOD_C06_46328.cel: num 2.1 2.1 2.1 2.1 2.1 ...
 $ HEWED_p_250KSty_Plate_20060123_GOOD_F02_46270.cel: num 2.15 2.15 2.15 2.15 2.15 ...
 $ HEWED_p_250KSty_Plate_20060123_GOOD_G04_46304.cel: num 3.52 3.52 3.52 3.52 3.52 ...
 $ HOCUS_p_Sty4_Mapping250K_Sty_B05_55060.cel : num 1.37 1.37 1.37 1.37 1.37 ...
 $ HOCUS_p_Sty4_Mapping250K_Sty_B12_55172.cel : num 1.66 1.66 1.66 1.66 1.66 ...
 $ HOCUS_p_Sty4_Mapping250K_Sty_E05_55066.cel : num 3.16 3.16 3.16 3.16 3.16 ...
 $ SOARS_p_Sty23_Mapping250K_Sty_B07_89024.cel : num 2.09 2.09 2.09 2.09 2.09 ...
 $ SOARS_p_Sty23_Mapping250K_Sty_C01_88930.cel : num 1.87 1.87 1.87 1.87 1.87 ...
 $ SOARS_p_Sty23_Mapping250K_Sty_C11_89090.cel : num 1.90 1.90 1.90 1.90 1.90 ...
 $ SOARS_p_Sty23_Mapping250K_Sty_F07_89032.cel : num 1.81 1.81 1.81 1.81 1.81 ...
 $ SOARS_p_Sty23_Mapping250K_Sty_H08_89052.cel : num 1.82 1.82 1.82 1.82 1.82 ...
 $ SOARS_p_Sty23_Mapping250K_Sty_H10_89084.cel : num 2.26 2.26 2.26 2.26 2.26 ...
 $ VINOS_p_Sty8_Mapping250K_Sty_A04_54082.cel : num 1.93 1.93 1.93 1.93 1.93 ...
 $ VINOS_p_Sty8_Mapping250K_Sty_A07_54130.cel : num 1.68 1.68 1.68 1.68 1.68 ...
 $ VINOS_p_Sty8_Mapping250K_Sty_B08_54148.cel : num 1.34 1.34 1.34 1.34 1.34 ...
 $ VINOS_p_Sty8_Mapping250K_Sty_D01_54040.cel : num 1.57 1.57 1.57 1.57 1.57 ...
 $ VINOS_p_Sty8_Mapping250K_Sty_D05_54104.cel : num 1.72 1.72 1.72 1.72 1.72 ...
 $ VINOS_p_Sty8_Mapping250K_Sty_E04_54090.cel : num 1.95 1.95 1.95 1.95 1.95 ...
 $ VINOS_p_Sty8_Mapping250K_Sty_E12_54218.cel : num 1.44 1.44 1.44 1.44 1.44 ...
 $ VINOS_p_Sty8_Mapping250K_Sty_G01_54046.cel : num 2.22 2.22 2.22 2.22 2.22 ...
 $ VINOS_p_Sty8_Mapping250K_Sty_G12_54222.cel : num 1.76 1.76 1.76 1.76 1.76 ...
 $ VOLTS_p_Sty9_Mapping250K_Sty_G09_57916.cel : num 2.05 2.05 2.05 2.05 2.05 ...
 $ VOLTS_p_Sty9_Mapping250K_Sty_H12_57966.cel : num 2.64 2.64 2.64 2.64 2.64 ...
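Two further read.table knobs often help at this scale. A sketch using the file name and dimensions from this thread (238304 data rows after the header; the exact count is whatever the file holds):

    Tumor <- read.table("Tumor.txt", header = TRUE, sep = "\t",
                        colClasses = c("character", rep("numeric", 49)),
                        nrows = 238304,     # known row count lets R preallocate
                        comment.char = "")  # disable comment scanning

    # scan() is lower-level still and skips some data-frame overhead;
    # it returns a list of 50 column vectors:
    cols <- scan("Tumor.txt", skip = 1, sep = "\t",
                 what = c(list(ID = character()), rep(list(numeric()), 49)))

Whether to read the ID column as character (as here) or factor (as above) depends on what the downstream analysis needs.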
[R] How to load a big txt file
Dear list,

I need to read a big txt file (around 130Mb; 23800 rows and 49 columns) for downstream clustering analysis. I first used

Tumor <- read.table("Tumor.txt", header = TRUE, sep = "\t")

but it took a long time and failed. However, it had no problem if I just put in data of 3 columns. Is there any way to load this big file?

Thanks for any suggestions!

Sincerely,
Alex