Re: Wikipedia Data Cleaning at Solr

2014-02-24 Thread Furkan KAMACI
My input is this:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0" |}

Analysis is as follows (WT stage; raw_bytes are the UTF-8 bytes of each token):

text   raw_bytes         start  end  flags  position
style  [73 74 79 6c 65]  3      8    0      1
text   [74 65 78 74]     10     14   0      2
align  [61 6c 69 67 6e]  15     20   0      3
left   [6c 65 66 74]     22     26   0      4
width  [77 69 64 74 68]  28     33   0      5
50     [35

Re: Wikipedia Data Cleaning at Solr

2014-02-23 Thread Furkan KAMACI
I've compared the results when using WikipediaTokenizer as the index-time tokenizer, but there is no difference.

2014-02-23 3:44 GMT+02:00 Ahmet Arslan :
> Hi Furkan,
>
> There is org.apache.lucene.analysis.wikipedia.WikipediaTokenizer
>
> Ahmet
>
> On Sunday, February 23, 2014 2:22 AM, Furkan KAM

Re: Wikipedia Data Cleaning at Solr

2014-02-22 Thread Ahmet Arslan
Hi Furkan,

There is org.apache.lucene.analysis.wikipedia.WikipediaTokenizer

Ahmet

On Sunday, February 23, 2014 2:22 AM, Furkan KAMACI wrote:
Hi; I want to run an NLP algorithm on Wikipedia data. I used the dataimport handler for the dump data and everything is OK. However there are some texts as li
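One possible way to wire that tokenizer into a Solr schema is through its factory, solr.WikipediaTokenizerFactory. A minimal sketch (the fieldType name and filter chain here are illustrative, not from the thread):

```xml
<!-- Hypothetical fieldType using the Wikipedia-aware tokenizer -->
<fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The tokenizer mainly assigns wiki-specific token types (e.g. for link targets); whether it alone changes the token text depends on the input, which may explain seeing little difference in the Analysis screen.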

Wikipedia Data Cleaning at Solr

2014-02-22 Thread Furkan KAMACI
Hi;

I want to run an NLP algorithm on Wikipedia data. I used the dataimport handler for the dump data and everything is OK. However, there are some texts like:

== Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur fakat taşımalı eğitimden yararlanılmaktadır.

(In English: "Infrastructure information: There is no [[primary school]] in the village, but bussed education is used.")

I think that it should be like this:
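Since the goal is plain text for NLP, stripping the two markup constructs shown above can be done with stock regular expressions. A minimal sketch (the class name and patterns are illustrative, and far less complete than WikipediaTokenizer's full grammar):

```java
import java.util.regex.Pattern;

// Simplistic wiki-markup stripper: handles only == headings == and [[links]].
public class WikiMarkupStrip {
    // [[target|label]] keeps the label; [[target]] keeps the target
    private static final Pattern LINK =
        Pattern.compile("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]");
    // == Heading == markers are dropped, heading text is kept
    private static final Pattern HEADING =
        Pattern.compile("==+\\s*(.*?)\\s*==+");

    public static String strip(String wikiText) {
        String s = HEADING.matcher(wikiText).replaceAll("$1");
        return LINK.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        String in = "== Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur.";
        System.out.println(strip(in));
        // prints: Altyapı bilgileri Köyde, ilköğretim okulu yoktur.
    }
}
```

Such a filter could run before indexing (e.g. in the import pipeline), leaving only prose for the NLP step; templates, tables, and nested markup would need a real parser.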