My input is:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0" |}
Analysis is as follows:

WT (WikipediaTokenizer)

text     raw_bytes             start  end  type         flags  position
style    [73 74 79 6c 65]      3      8    <ALPHANUM>   0      1
text     [74 65 78 74]         10     14   <ALPHANUM>   0      2
align    [61 6c 69 67 6e]      15     20   <ALPHANUM>   0      3
left     [6c 65 66 74]         22     26   <ALPHANUM>   0      4
width    [77 69 64 74 68]      28     33   <ALPHANUM>   0      5
50       [35 30]               35     37   <ALPHANUM>   0      6
table    [74 61 62 6c 65]      40     45   <ALPHANUM>   0      7
layout   [6c 61 79 6f 75 74]   46     52   <ALPHANUM>   0      8
fixed    [66 69 78 65 64]      54     59   <ALPHANUM>   0      9
border   [62 6f 72 64 65 72]   62     68   <ALPHANUM>   0      10
0        [30]                  70     71   <ALPHANUM>   0      11

So only the <ALPHANUM> word tokens come through; the wiki table markup itself is not emitted as tokens. (A minimal Java sketch that reproduces this token dump outside the analysis page is at the bottom of this mail.)

2014-02-24 0:28 GMT+02:00 Furkan KAMACI <furkankam...@gmail.com>:

> I've compared the results when using WikipediaTokenizer for the index-time
> analyzer, but there is no difference.
>
>
> 2014-02-23 3:44 GMT+02:00 Ahmet Arslan <iori...@yahoo.com>:
>
>> Hi Furkan,
>>
>> There is org.apache.lucene.analysis.wikipedia.WikipediaTokenizer
>>
>> Ahmet
>>
>>
>> On Sunday, February 23, 2014 2:22 AM, Furkan KAMACI <
>> furkankam...@gmail.com> wrote:
>> Hi;
>>
>> I want to run an NLP algorithm on Wikipedia data. I used the dataimport
>> handler for the dump data and everything is OK. However, there are some
>> texts like:
>>
>> == Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur fakat taşımalı
>> eğitimden yararlanılmaktadır.
>>
>> I think it should be like this:
>>
>> Altyapı bilgileri Köyde, ilköğretim okulu yoktur fakat taşımalı eğitimden
>> yararlanılmaktadır.
>>
>> On the other hand, this should be removed:
>>
>> {| border="0" cellpadding="5" cellspacing="5" |- bgcolor="#aaaaaa"
>> |'''Seçim Yılı''' |'''Muhtar''' |- bgcolor="#dddddd" |[[2009]] |kazım
>> güngör |- bgcolor="#dddddd" | |Ömer Gungor |- bgcolor="#dddddd" | |Fazlı
>> Uzun |- bgcolor="#dddddd" | |Cemal Özden |- bgcolor="#dddddd" | | |}
>>
>> Also, including titles like == Altyapı bilgileri == should be optional
>> (I think they can be removed for some purposes).
>>
>> My question is: is there any analyzer combination to clean up Wikipedia
>> data for Solr?
>>
>> Thanks;
>> Furkan KAMACI
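For completeness, here is a minimal, untested sketch of how the same token dump can be produced directly against WikipediaTokenizer, outside Solr's analysis page. It assumes a Lucene 4.x-style constructor that takes a Reader (on later Lucene versions the tokenizer is created without arguments and the input is supplied via setReader()); the class and variable names here are just placeholders.

import java.io.StringReader;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

public class WikipediaTokenizerDump {
    public static void main(String[] args) throws Exception {
        String input =
            "{| style=\"text-align: left; width: 50%; table-layout: fixed;\" border=\"0\" |}";

        // Lucene 4.x-style: pass the Reader to the constructor.
        // On newer Lucene: new WikipediaTokenizer() then tokenizer.setReader(...).
        WikipediaTokenizer tokenizer = new WikipediaTokenizer(new StringReader(input));

        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
        PositionIncrementAttribute posIncr =
            tokenizer.addAttribute(PositionIncrementAttribute.class);

        tokenizer.reset();
        int position = 0;
        while (tokenizer.incrementToken()) {
            position += posIncr.getPositionIncrement();
            // Print one row per token: text, start/end offsets, type, position.
            System.out.printf("%-8s %3d %3d %-12s %3d%n",
                    term.toString(),
                    offset.startOffset(), offset.endOffset(),
                    type.type(), position);
        }
        tokenizer.end();
        tokenizer.close();
    }
}

Running this on the input above should print the same eleven <ALPHANUM> tokens, with the same offsets and positions as in the table.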