Re: Wikipedia Data Cleaning at Solr

Furkan KAMACI Mon, 24 Feb 2014 02:20:24 -0800

My input is:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0" |}


Analysis is as follows:

WT
text    raw_bytes             start  end  type        flags  position
style   [73 74 79 6c 65]      3      8    <ALPHANUM>  0      1
text    [74 65 78 74]         10     14   <ALPHANUM>  0      2
align   [61 6c 69 67 6e]      15     20   <ALPHANUM>  0      3
left    [6c 65 66 74]         22     26   <ALPHANUM>  0      4
width   [77 69 64 74 68]      28     33   <ALPHANUM>  0      5
50      [35 30]               35     37   <ALPHANUM>  0      6
table   [74 61 62 6c 65]      40     45   <ALPHANUM>  0      7
layout  [6c 61 79 6f 75 74]   46     52   <ALPHANUM>  0      8
fixed   [66 69 78 65 64]      54     59   <ALPHANUM>  0      9
border  [62 6f 72 64 65 72]   62     68   <ALPHANUM>  0      10
0       [30]                  70     71   <ALPHANUM>  0      11
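
The analysis above shows that table attribute names such as style and border
still come through as ordinary tokens. One possible workaround is to strip the
markup before the text reaches Solr. The following is only a minimal sketch
under stated assumptions: the function name clean_wikitext and its
keep_headings parameter are hypothetical, the regular expressions cover only
the constructs mentioned in this thread (tables, [[links]], == headings ==,
bold/italic quotes), and nested tables or templates are not handled.

```python
import re

# Hypothetical helper, not part of Solr or Lucene: a rough pre-processing pass
# for the wiki markup discussed in this thread.
TABLE_RE = re.compile(r"\{\|.*?\|\}", re.DOTALL)                 # {| ... |}
LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")     # [[a]] or [[a|b]]
HEADING_RE = re.compile(r"(={2,6})\s*(.*?)\s*\1")                # == Title ==
QUOTES_RE = re.compile(r"'{2,5}")                                # ''italic'', '''bold'''


def clean_wikitext(text, keep_headings=True):
    """Return text with tables dropped and link/heading markup unwrapped."""
    # Drop whole tables first, so links inside them disappear as well.
    text = TABLE_RE.sub(" ", text)
    # Keep only the visible label of each internal link.
    text = LINK_RE.sub(r"\1", text)
    # Either unwrap heading titles or drop them, depending on the flag.
    text = HEADING_RE.sub(r"\2" if keep_headings else " ", text)
    # Remove bold/italic quote runs.
    text = QUOTES_RE.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = ("== Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur "
              "fakat taşımalı eğitimden yararlanılmaktadır.")
    print(clean_wikitext(sample))
    print(clean_wikitext(sample, keep_headings=False))
```

Such a function could be run over each article before indexing. Whether the
same patterns carry over unchanged to a Solr-side char filter such as
solr.PatternReplaceCharFilterFactory has not been tested here.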



2014-02-24 0:28 GMT+02:00 Furkan KAMACI <furkankam...@gmail.com>:

> I compared the results after using WikipediaTokenizer in the index-time
> analyzer, but there is no difference.
>
>
> 2014-02-23 3:44 GMT+02:00 Ahmet Arslan <iori...@yahoo.com>:
>
>> Hi Furkan,
>>
>> There is org.apache.lucene.analysis.wikipedia.WikipediaTokenizer
>>
>> Ahmet
>>
>>
>> On Sunday, February 23, 2014 2:22 AM, Furkan KAMACI <
>> furkankam...@gmail.com> wrote:
>> Hi;
>>
>> I want to run an NLP algorithm on Wikipedia data. I used the
>> DataImportHandler to import the dump and everything works. However, some
>> of the text looks like this:
>>
>> == Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur fakat taşımalı
>> eğitimden yararlanılmaktadır.
>>
>> I think it should look like this:
>>
>> Altyapı bilgileri Köyde, ilköğretim okulu yoktur fakat taşımalı eğitimden
>> yararlanılmaktadır.
>>
>> On the other hand, this should be removed entirely:
>>
>> {| border="0" cellpadding="5" cellspacing="5" |- bgcolor="#aaaaaa"
>> |'''Seçim Yılı''' |'''Muhtar''' |- bgcolor="#dddddd" |[[2009]] |kazım
>> güngör |- bgcolor="#dddddd" | |Ömer Gungor |- bgcolor="#dddddd" | |Fazlı
>> Uzun |- bgcolor="#dddddd" | |Cemal Özden |- bgcolor="#dddddd" | | |}
>>
>> Also, keeping section titles such as == Altyapı bilgileri == should be
>> optional (I think they can be removed for some purposes).
>>
>> My question: is there any analyzer combination that cleans up Wikipedia
>> markup for Solr?
>>
>> Thanks;
>> Furkan KAMACI
>>
>
>
