I have tried to train a NER model for Italian addresses using the training data below. This is only an extract; the full training file has 50,000 records.

VIA <START:street> FRANCESCO ZANARDI <END> <START:number> 985 <END> <START:zip> 40131 <END> <START:town> BOLOGNA <END> <START:province> BO <END>
VIA <START:street> STEFANO BORGIA <END> <START:number> 151 <END> <START:zip> 00168 <END> <START:town> ROMA <END> <START:province> RM <END>
VIALE <START:street> ITALIA <END> <START:number> 40 <END> <START:zip> 83100 <END> <START:town> AVELLINO <END> <START:province> AV <END>
PIAZZA <START:street> ROMA <END> <START:number> 15 <END> <START:zip> 63100 <END> <START:town> ASCOLI PICENO <END> <START:province> AP <END>

I used the following command line to train:

C:\Programmi\apache-opennlp-1.5.2-incubating\bin>opennlp.bat TokenNameFinderTrainer -encoding UTF-8 -lang it -data ../traindata/it-ner-address.train -model ../models/it/it-ner-address.bin

Then I ran the Name Finder tool with the following command:

C:\Programmi\apache-opennlp-1.5.2-incubating\bin>opennlp.bat TokenNameFinder ../models/it/it-ner-address.bin < ../input/it-ner-address.txt > ../output/it-ner-address.txt

The input was a small file of 100 records, and I received the following results (again, this is just an extract):

PZA <START:number> GIOVANNI FONTANA <END> <START:zip> 1 <END> <START:town> 60125 <END> <START:province> ANCONA <END> <START:province> AN <END>
VIA <START:number> A. GARIBALDI <END> <START:zip> 56 <END> <START:town> 60019 <END> <START:province> SENIGALLIA <END> <START:province> AN <END>
VIA <START:number> A. GARIBALDI <END> <START:zip> 56 <END> <START:town> 60019 <END> <START:province> SENIGALLIA <END> <START:province> AN <END>
VIA <START:zip> ACHILLE GRANDI <END> <START:zip> 21 <END> <START:street> INT <END> <START:number> INT A <END> <START:street> 23891 BARZANO' <END> <START:street> LC <END>
VIA <START:number> AGRARIA <END> <START:zip> 2 <END> <START:town> 60035 <END> <START:province> JESI <END> <START:province> AN <END>
VIA <START:number> AGRARIA <END> <START:zip> 2 <END> <START:town> 60035 <END> <START:province> JESI <END> <START:province> AN <END>
VIA <START:street> ALBERTO DA GIUSSANO <END> <START:number> 39 INT <END> <START:zip> I <END> <START:town> 20030 <END> <START:street> SEVESO <END> <START:street> MB <END>
VIA <START:number> AMEDEO <END> <START:zip> 51A <END> <START:town> 24040 <END> <START:province> VERDELLINO <END> <START:province> BG <END>
VIA <START:street> AMEDEO DI SAVOIA 15 INT <END> <START:zip> INT <END> <START:town> 46040 <END> <START:street> CASALROMANO <END> <START:street> MN <END>
VIA <START:number> ANTONIO GRAMSCI <END> <START:zip> 14 <END> <START:town> 61040 <END> <START:town> MONDAVIO PU <END>
VIA <START:town> ARNETTA <END> <START:zip> 20 <END> <START:street> INT <END> <START:number> INT <END> <START:zip> B <END> <START:town> 21045 <END> <START:province> GAZZADA SCHIANNO <END> <START:province> VA <END>
VIA <START:number> BRESCIA <END> <START:zip> 31 <END> <START:town> 26013 <END> <START:province> CREMA <END> <START:province> CR <END>
VIA <START:zip> C. CAVOUR <END> <START:zip> 6 <END> <START:street> PRESSO <END> <START:number> INT <END> <START:zip> FARMA <END> <START:town> 60033 <END> <START:province> CHIARAVALLE <END> <START:province> AN <END>
VIA <START:number> CAMERANO <END> <START:zip> 7 <END> <START:town> 62019 <END> <START:province> RECANATI <END> <START:province> MC <END>
VIA <START:town> CANDIA <END> <START:street> 350 <END> <START:street> INT <END> <START:zip> INT E <END> <START:town> 60131 <END> <START:province> ANCONA <END> <START:province> AN <END>
VIA <START:number> CESARE BECCARIA <END> <START:zip> 49 <END> <START:town> 60019 <END> <START:province> SENIGALLIA <END> <START:province> AN <END>
VIA <START:zip> CESARE PAVESE <END> <START:zip> 28 <END> <START:street> INT <END> <START:zip> INT INT <END> <START:town> 46030 <END> <START:town> BIGARELLO MN <END>

The results are clearly not good. Do you have any idea how I could improve them? I am new to OpenNLP; is there any parameter that I should use when running the training?

Mauro
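
For anyone trying to reproduce or quantify the problem: below is a minimal Python sketch (not part of OpenNLP) that parses the <START:type> ... <END> annotation format used in the records above, checks that the tags in a line are balanced, and flags spans tagged as zip that are not five digits, since Italian postal codes (CAP) have five digits. The function names parse_spans, is_well_formed and non_five_digit_zips are hypothetical helpers introduced only for this illustration, and the sketch assumes one address record per line with whitespace-separated tokens.

```python
import re

# One annotated span in the format used above: <START:type> tokens <END>
SPAN_RE = re.compile(r"<START:(\w+)>\s+(.*?)\s+<END>")


def parse_spans(line):
    """Return the (type, text) pairs annotated in one record, in order."""
    return [(m.group(1), m.group(2)) for m in SPAN_RE.finditer(line)]


def is_well_formed(line):
    """True if START and END tags strictly alternate and every span is closed."""
    open_span = False
    for token in line.split():
        if token.startswith("<START:") and token.endswith(">"):
            if open_span:          # nested START without a closing END
                return False
            open_span = True
        elif token == "<END>":
            if not open_span:      # END without a matching START
                return False
            open_span = False
    return not open_span


def non_five_digit_zips(line):
    """Return the texts of spans tagged 'zip' that are not exactly five digits."""
    return [text for kind, text in parse_spans(line)
            if kind == "zip" and not re.fullmatch(r"\d{5}", text)]


if __name__ == "__main__":
    # First training record and first output record from the post.
    train = ("VIA <START:street> FRANCESCO ZANARDI <END> <START:number> 985 <END> "
             "<START:zip> 40131 <END> <START:town> BOLOGNA <END> <START:province> BO <END>")
    predicted = ("PZA <START:number> GIOVANNI FONTANA <END> <START:zip> 1 <END> "
                 "<START:town> 60125 <END> <START:province> ANCONA <END> <START:province> AN <END>")
    print(parse_spans(train))
    print(is_well_formed(train), non_five_digit_zips(train))
    print(non_five_digit_zips(predicted))
```

Running the same check over the whole training file and over the output file would show whether the training data itself is malformed or whether only the predictions are shifted, which is what the extract above suggests.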
