Re: [R] Trees (and Forests) with packages 'party' vs. 'partykit': Different results

2015-09-21 Thread apeshifter
Achim, 

thank you very much for your help, this really cleared up a number of
issues.

As for the differences in results between the party and partykit
implementations of ctree, I guess that the situation is indeed as you
assumed. Four out of five variables have p-values <2.2e-16. (However, it is
not the first of these variables that is selected but the one in the second
column.) I will just continue using the newer implementation. 
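
One plausible way such ties can arise, sketched in base R with made-up p-values (the names x1 to x5 and their values are assumptions, not the real data): once p-values fall far below machine precision, a criterion of the form 1 - p can no longer tell them apart, while a comparison on the log scale still can. Whether the two ctree implementations differ in exactly this way is an assumption here.

```r
# Hypothetical p-values: four very strong predictors and one weak one
p <- c(x1 = 1e-20, x2 = 1e-45, x3 = 1e-30, x4 = 1e-18, x5 = 0.2)

# On the 1 - p scale, anything far below .Machine$double.eps (about 2.2e-16)
# is rounded away, so the four strong predictors are exactly tied at 1
crit <- 1 - p
tied <- names(crit)[crit == max(crit)]

# which.max() simply returns the first of the tied maxima
first_of_ties <- names(which.max(crit))

# On the log scale the ordering of the tiny p-values is preserved
best_on_log_scale <- names(which.min(log(p)))

print(tied)
print(first_of_ties)
print(best_on_log_scale)
```

With these invented values, the first-of-ties rule and the log-scale rule pick different variables, which is the kind of discrepancy described above.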

-- Christopher



--
View this message in context: 
http://r.789695.n4.nabble.com/Trees-and-Forests-with-packages-party-vs-partykit-Different-results-tp4712214p4712539.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Trees (and Forests) with packages 'party' vs. 'partykit': Different results

2015-09-14 Thread apeshifter
Dear all, 

I'm currently exploring a dataset with the help of conditional inference
trees (still very much a beginner with this technique & log. reg. methods as
a whole t.b.h.), since they explained more variation in my dataset than a
binary logistic regression with /glm/. I started out with the /party/
package, but after a while I ran into the 'updated' /partykit/ package and
tried this out, too. Now, the strange thing is that both trees look quite
different - actually even the very first split is different. So I did some
research and came across the 'forest' concept. However, it seems that the
/varImp/ function does not yet work in the /partykit/ implementation, which
raises the question of how I should evaluate the /partykit/ forest - how
can I find out whether the same variables are important in the forest as in my
/partykit/ tree? Is there some way to do this or some other solution for
this problem? I'd prefer to continue with the /partykit/ implementation of ctree,
since it allows more settings for the final plot, which I'd need to get the
final (large) plot into a readable form.
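
As a package-independent fallback, permutation importance can be computed by hand for any model that has a predict method. The following is only a sketch under stated assumptions, and it is not the conditional importance that party offers: perm_importance() and predict_class() are hypothetical helpers, the data are simulated, and glm stands in for the forest so that the code runs without extra packages.

```r
# Permutation importance: mean increase in misclassification rate after
# shuffling one predictor at a time (hypothetical helper, not a package API)
perm_importance <- function(fit, data, response, predict_class, n_rep = 20) {
  predictors <- setdiff(names(data), response)
  base_err <- mean(predict_class(fit, data) != data[[response]])
  sapply(predictors, function(v) {
    errs <- replicate(n_rep, {
      shuffled <- data
      shuffled[[v]] <- sample(shuffled[[v]])  # break the link to the response
      mean(predict_class(fit, shuffled) != shuffled[[response]])
    })
    mean(errs) - base_err
  })
}

# Simulated stand-in data: x1 drives the outcome, x2 is pure noise
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# Stand-in model; a forest would be fitted and passed in the same way
fit <- glm(y ~ x1 + x2, data = d, family = binomial)
predict_class <- function(fit, newdata) {
  ifelse(predict(fit, newdata = newdata, type = "response") > 0.5, "yes", "no")
}

imp <- perm_importance(fit, d, "y", predict_class)
print(imp)
```

For a forest, only fit and predict_class would need to change, provided the model's predict method can return class labels or probabilities for new data.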

Related to this project, I'd also like to give statistics for the overall
model, e.g. overall significance, Nagelkerke's R², a C-value. After a
'regular' binary log. reg., I would use the lrm function to get these
values, but I am unsure whether it would be correct to also apply this
method to my tree data.
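
For orientation, both the C statistic and Nagelkerke's R² depend only on the observed outcomes and the predicted probabilities, so they can be computed by hand for any classifier. The sketch below uses a hypothetical helper fit_stats() and invented toy values; it does not settle whether these statistics are appropriate summaries for a tree.

```r
# Hypothetical helper: C statistic (AUC) and Nagelkerke's R^2 from
# observed 0/1 outcomes y and predicted probabilities p
fit_stats <- function(y, p, eps = 1e-12) {
  p <- pmin(pmax(p, eps), 1 - eps)  # avoid log(0), e.g. for pure terminal nodes
  n <- length(y)

  # Log-likelihood of the model and of the intercept-only (null) model
  ll_model <- sum(y * log(p) + (1 - y) * log(1 - p))
  p0 <- mean(y)
  ll_null <- sum(y * log(p0) + (1 - y) * log(1 - p0))

  # Cox-Snell R^2, rescaled by its maximum to give Nagelkerke's R^2
  cox_snell <- 1 - exp(2 * (ll_null - ll_model) / n)
  nagelkerke <- cox_snell / (1 - exp(2 * ll_null / n))

  # C statistic via the rank-sum formula (ties count one half)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  c_stat <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

  c(C = c_stat, R2 = nagelkerke)
}

# Invented toy data: three distinct predicted probabilities, as a small
# tree with three terminal nodes might produce
y <- c(0, 0, 0, 1, 1, 1, 0, 1)
p <- c(0.1, 0.1, 0.4, 0.4, 0.9, 0.9, 0.1, 0.9)
stats <- fit_stats(y, p)
print(stats)
```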

Any help would be greatly appreciated! 

-- Christopher



--
View this message in context: 
http://r.789695.n4.nabble.com/Trees-and-Forests-with-packages-party-vs-partykit-Different-results-tp4712214.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] Finding unique elements faster

2014-12-09 Thread apeshifter
Thank you all for your suggestions! I must say I am amazed by the number of
people who consider helping out another! Feels like it was a good idea to
start using R - back when I was still using Perl for such tasks, I'd have been
happy to have this kind of support!

@ Gheorghe Postelnicu: Unfortunately, the data is not yet in a data frame
when this part of the program starts. At this point, I am trying to fill in
all the relevant vectors (all.word.pairs, word1, word2, freq.word1,
freq.word2, typefreq.w1, typefreq.w2, ...) and then combine them to a data
frame. I will try to get my head around the doParallel package for
the foreach loop, since parallel computing would certainly be helpful. 

@ Jeff Newmiller: Sounds interesting, but I fear the same problem applies as
for Gheorghe's suggestion. I will need a data frame first, for which I do not
have all the correct values... Will keep the package in mind, though, for
future projects.

@ Stefan Evert-3: I am not sure I understand what you mean in the second
example. Since the counting of types is exactly my problem at the moment, I
do not see how I could provide a function that would work more efficiently
in the context you are describing. The line of code that I was giving is
exactly my attempt at doing this... Sorry, I might just not be getting what
you are aiming at... :-/  However, your assumptions are quite correct. word1
and word2 do indeed contain word tokens, as does all.word.pairs. The reason
for this is that I need the word pairs within the vector to be in the same
order as they appeared in the original corpus files. Also, thank you for the
link. I will check this out when I am analysing collocates. I did not,
however, find notes on my specific problem in the slides. Please do
not think I was not using reference material for designing my script, though. I was
in fact using Gries 2009: Quantitative Corpus Linguistics with R
http://www.amazon.de/Quantitative-Corpus-Linguistics-Practical-Introduction-ebook/dp/B001Y35H5A/ref=sr_1_1?ie=UTF8&qid=1418119630&sr=8-1&keywords=gries+quantitative+corpus+linguistics
  
for this. The trouble is that the methods in the book help as far as simple
n-gram frequency calculations are concerned (since, e.g. table() would just
do the trick), but methods for this size of repeated checks on tables are
not included.

Best,
Christopher



--
View this message in context: 
http://r.789695.n4.nabble.com/Finding-unique-elements-faster-tp4700539p4700582.html
Sent from the R help mailing list archive at Nabble.com.



[R] Finding unique elements faster

2014-12-08 Thread apeshifter
Dear all, 

for the past two weeks, I've been working on a script to retrieve word pairs
and calculate some of their statistics using R. Everything seemed to work
fine until I switched from a small test dataset to the 'real thing' and
noticed what a runtime monster I had devised! 

I could reduce processing time significantly when I realized that with R, I
did not have to do everything in loops and count things vector element by
vector element, but could just have the program count everything with
tables, e.g. with 
   freq.w1w2.2 <- table(all.word.pairs)[all.word.pairs]
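
A toy illustration of this table lookup (the example vector is invented; the names pair.counts and freq.w1w2 are chosen for the sketch): table() counts each distinct pair once, and indexing the table by the original vector expands the counts back to one value per token, in corpus order.

```r
# Invented toy data: one entry per word-pair token, in corpus order
all.word.pairs <- c("the cat", "a dog", "the cat", "the dog", "the cat")

# Count every distinct pair once ...
pair.counts <- table(all.word.pairs)

# ... then look the counts up by name, giving one frequency per token
freq.w1w2 <- as.vector(pair.counts[all.word.pairs])

print(freq.w1w2)  # 3 1 3 1 3
```

Note that the lookup relies on all.word.pairs being a character vector; with a factor, the index would be interpreted as integer codes instead of names.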

However, now I seem to have run into a performance problem that I cannot
solve. I hope there's a kind soul on this list who has some advice for me.
On to the problem:

The last relic of the afore-mentioned for-loop that goes through all the
word pairs and tries to calculate some statistics on them is the following
line of code:
   typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))
(where word1 and word2 are the first and second word within the two-word
sequence (all.word.pairs, above))
  
Here, I am trying to count the number of 'types', linguistically speaking,
before the second word in the two-word sequence (later, I am doing the same
for the first word within the sequence). The expression works, but given my
~400,000 word pairs/word1's/word2's etc, this takes quite some time. About
10 hours on my machine, in fact, since R cannot use the other three of the
four cores. Since I want to repeat the process for another 20 corpora of
similar size, I would definitely appreciate some help on this subject.

I have been trying 'typefreq.after1 <- table(unique(word2[word1]))[2]' and the
subset() function and both seem to work (though I haven't checked whether
all the numbers are in fact correctly calculated), but they take about the
same amount of time. So that's no use for me. 
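
One possible vectorised alternative, sketched on invented toy data and not taken from any of the replies: reduce the tokens to distinct (word1, word2) combinations first, count how many distinct combinations each word1 has, and then look that count up per token. The names distinct.pairs and types.per.word1 are made up for this sketch; the loop from above is kept only to check that both versions agree.

```r
# Invented toy data: first and second word of each pair token
word1 <- c("the", "the", "a", "the", "a", "the")
word2 <- c("cat", "dog", "dog", "cat", "dog", "bird")

# Loop version as described above, kept for comparison
slow <- integer(length(word1))
for (i in seq_along(word1)) {
  slow[i] <- length(unique(word2[which(word1 == word1[i])]))
}

# Vectorised version: each distinct (word1, word2) combination is one type
distinct.pairs <- unique(data.frame(word1, word2, stringsAsFactors = FALSE))

# Number of distinct second words per first word
types.per.word1 <- table(distinct.pairs$word1)

# Expand back to one value per token by name lookup
typefreq.after1 <- as.vector(types.per.word1[word1])

print(typefreq.after1)  # 3 3 1 3 1 3
```

The loop scans the whole vector once per token, so its cost grows roughly with the square of the number of pairs, whereas unique() and table() each make a single pass over the data.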

Does anybody have any tips to speed this up?

Thank you very much!



--
View this message in context: 
http://r.789695.n4.nabble.com/Finding-unique-elements-faster-tp4700539.html
Sent from the R help mailing list archive at Nabble.com.
