Thank you, Toke, for your quick response. All your suggestions seem like very good ideas. I also found the capital-letter rule strange because of names and places, so I will skip that part: I do not need an absolute measure, just a ranked order among my documents.
Cheers,
Roland

On Oct 21, 2015, at 11:25, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> Roland Szűcs <roland.sz...@booknwalk.com> wrote:
>> My use case is that I have to calculate the LIX readability index for my
>> documents.
> [...]
>> *B* = Number of periods (defined by period, colon or capital first letter)
> [...]
>> Does anybody have an idea how to get the number of "periods"?
>
> As the positions do not matter, you could make a copyField containing only
> punctuation. And maybe extended with a replace filter so that you have dot,
> comma, colon, bang, question etc. instead of .,:!?
>
> The capital first letter seems a bit strange to me - what about names? But
> anyway, you could do it with a PatternReplaceCharFilter, matching on
> something like
>   ([^.,:!?]\p{Space}*\p{Upper})|(^\p{Upper})
> and replacing with 'capital' (the regexp above probably fails - it was just
> from memory).
>
> - Toke Eskildsen
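
For reference, here is a minimal sketch of the counting idea discussed in this thread, written as standalone Python rather than as Solr analysis configuration. It assumes the usual LIX definition (words per period plus the percentage of words longer than six letters), counts only punctuation as "periods" (the capital-letter rule is skipped, as decided above), and treats a run such as "..." or "?!" as a single period. The names `count_periods` and `lix` are hypothetical helpers, not part of Solr.

```python
import re

# Assumption: '.', ':', '!' and '?' end a "period"; a run of such marks
# ("...", "?!") counts once. The capital-first-letter rule is not applied.
PERIOD_RE = re.compile(r"[.:!?]+")
WORD_RE = re.compile(r"\w+")


def count_periods(text):
    """Count runs of sentence-ending punctuation in the text."""
    return len(PERIOD_RE.findall(text))


def lix(text):
    """Approximate LIX = words / periods + 100 * long_words / words."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    # Treat text without any punctuation as one period to avoid dividing by zero.
    periods = max(1, count_periods(text))
    # Assumption: a "long" word has more than six characters.
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / periods + 100.0 * long_words / len(words)
```

Because only a ranked order among documents is needed, the exact choice of punctuation marks matters less than applying the same rule to every document. Note that this simple pattern would also count decimal points and abbreviations as periods.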