Re: [R] Trees (and Forests) with packages 'party' vs. 'partykit': Different results
Achim, thank you very much for your help; this really cleared up a number of issues. As for the differences in results between the party and partykit implementations of ctree, I guess the situation is indeed as you assumed: four out of five variables have p-values < 2.2e-16. (However, it is not the first of these variables that is selected, but the one in the second column.) I will just continue using the newer implementation. -- Christopher -- View this message in context: http://r.789695.n4.nabble.com/Trees-and-Forests-with-packages-party-vs-partykit-Different-results-tp4712214p4712539.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Trees (and Forests) with packages 'party' vs. 'partykit': Different results
Dear all, I'm currently exploring a dataset with the help of conditional inference trees (still very much a beginner with this technique and with logistic regression methods as a whole, to be honest), since they explained more variation in my dataset than a binary logistic regression with /glm/. I started out with the /party/ package, but after a while I ran into the updated /partykit/ package and tried it out, too. Now, the strange thing is that the two trees look quite different - actually, even the very first split is different. So I did some research and came across the 'forest' concept. However, it seems that the /varimp/ function does not yet work in the /partykit/ implementation, which raises the question of how I should evaluate a /partykit/ forest: how can I find out whether the variables are as important in the forest as in my /partykit/ tree? Is there some way to do this, or some other solution to this problem? I'd prefer to continue with the /partykit/ implementation of ctree, since it allows more settings for the final plot, which I'd need to get the final (large) plot into a readable form. Related to this project, I'd also like to give statistics for the overall model, e.g. overall significance, Nagelkerke's R², a C-value. After a 'regular' binary logistic regression, I would use the lrm function to get these values, but I am unsure whether it would be correct to also apply this method to my tree data. Any help would be greatly appreciated! -- Christopher -- View this message in context: http://r.789695.n4.nabble.com/Trees-and-Forests-with-packages-party-vs-partykit-Different-results-tp4712214.html
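The forest-plus-importance step described above can be sketched with the older /party/ package, where varimp() is available (a minimal sketch on the built-in iris data, which is just a stand-in for the actual dataset; the tuning values are illustrative assumptions):

```r
library(party)  # the older implementation, where varimp() works on cforest objects

set.seed(290875)
# cforest_unbiased() supplies the unbiased settings recommended by the package
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 50, mtry = 2))

# permutation variable importance, one value per predictor
vi <- varimp(cf)
sort(vi, decreasing = TRUE)
```

Whichever implementation is used for the final tree, importances from such a forest can then be compared against the splits that ctree() chooses.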
Re: [R] Finding unique elements faster
Thank you all for your suggestions! I must say I am amazed by the number of people willing to help one another - feels like it was a good idea to start using R. Back when I was still using Perl for such tasks, I would have been happy to have this kind of support! @ Gheorghe Postelnicu: Unfortunately, the data is not yet in a data frame when this part of the program starts. At this point, I am trying to fill in all the relevant vectors (all.word.pairs, word1, word2, freq.word1, freq.word2, typefreq.w1, typefreq.w2, ...) and then combine them into a data frame. I will try to get my head around the doParallel package for the foreach loop, since parallel computing would certainly be helpful. @ Jeff Newmiller: Sounds interesting, but I fear the same problem applies as for Gheorghe's suggestion: I would need a data frame first, for which I do not yet have all the correct values. I will keep the package in mind, though, for future projects. @ Stefan Evert-3: I am not sure I understand what you mean in the second example. Since the counting of types is exactly my problem at the moment, I do not see how I could provide a function that would work more efficiently in the context you are describing; the line of code I gave is exactly my attempt at doing this. Sorry, I might just not be getting what you are aiming at... :-/ Your assumptions are quite correct, however: word1 and word2 do indeed contain word tokens, as does all.word.pairs. The reason for this is that I need the word pairs within the vector to be in the same order as they appeared in the original corpus files. Also, thank you for the link; I will check it out when I am analysing collocates, although I didn't find notes on my specific problem in the slides. Please do not think I was not using reference material for designing my script, though.
I was in fact using Gries (2009), Quantitative Corpus Linguistics with R (http://www.amazon.de/Quantitative-Corpus-Linguistics-Practical-Introduction-ebook/dp/B001Y35H5A/ref=sr_1_1?ie=UTF8qid=1418119630sr=8-1keywords=gries+quantitative+corpus+linguistics) for this. The trouble is that the methods in the book help as far as simple n-gram frequency calculations are concerned (since, e.g., table() would just do the trick), but methods for repeated table lookups at this scale are not included. Best, Christopher -- View this message in context: http://r.789695.n4.nabble.com/Finding-unique-elements-faster-tp4700539p4700582.html
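To make the table() point above concrete, a frequency list for all adjacent word pairs takes only a few lines of base R (toy tokens here, as an assumption, not the actual corpus):

```r
# toy corpus of word tokens (a hypothetical stand-in for the real files)
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat")

# build the bigrams in corpus order by pairing each token with its successor
all.word.pairs <- paste(head(tokens, -1), tail(tokens, -1))

# a single call to table() counts every bigram type at once
freqs <- table(all.word.pairs)
freqs["the cat"]  # occurs twice in the toy corpus
```

Indexing the resulting table with the original pair vector (as in the freq.w1w2.2 line in the first message) then gives a per-token frequency in one vectorized step.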
[R] Finding unique elements faster
Dear all, for the past two weeks I've been working on a script to retrieve word pairs and calculate some of their statistics using R. Everything seemed to work fine until I switched from a small test dataset to the 'real thing' and noticed what a runtime monster I had devised! I could reduce processing time significantly when I realized that with R, I did not have to do everything in loops and count things vector element by vector element, but could just have the program count everything with tables, e.g. with

freq.w1w2.2 <- table(all.word.pairs)[all.word.pairs]

However, now I seem to have run into a performance problem that I cannot solve. I hope there's a kind soul on this list who has some advice for me. On to the problem: the last relic of the afore-mentioned for-loop, which goes through all the word pairs and tries to calculate some statistics on them, is the following line of code:

typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))

(where word1 and word2 are the first and second words within the two-word sequences in all.word.pairs, above). Here, I am trying to count the number of 'types', linguistically speaking, before the second word in the two-word sequence (later, I do the same for the first word within the sequence). The expression works, but given my ~400,000 word pairs/word1's/word2's etc., this takes quite some time - about 10 hours on my machine, in fact, since R cannot use the other three of my four cores. Since I want to repeat the process for another 20 corpora of similar size, I would definitely appreciate some help on this subject. I have been trying

typefreq.after1 <- table(unique(word2[word1]))[2]

and the subset() function, and both seem to work (though I haven't checked whether all the numbers are in fact correctly calculated), but they take about the same amount of time, so that's no use for me. Does anybody have any tips to speed this up? Thank you very much!
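One way to eliminate the remaining per-element loop is to compute the type count once per distinct word1 with tapply() and then index the result back onto the full vector, turning the quadratic scan into a single grouped pass (a sketch on toy vectors, assuming the parallel word1/word2 layout described above):

```r
# toy parallel vectors: first and second word of each pair, in corpus order
word1 <- c("the", "the", "a",   "the", "a")
word2 <- c("cat", "dog", "cat", "cat", "cat")

# distinct word2 types following each distinct word1, computed once per group
types.per.w1 <- tapply(word2, word1, function(x) length(unique(x)))

# broadcast the per-group counts back to one value per pair
typefreq.after1 <- unname(types.per.w1[word1])
typefreq.after1  # -> 2 2 1 2 1
```

This replaces one unique() call per pair (~400,000 of them) with one call per distinct word1, so the running time should drop from hours to seconds.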
-- View this message in context: http://r.789695.n4.nabble.com/Finding-unique-elements-faster-tp4700539.html