Hello Peter,
I looked a bit at the new regular expression and there are still some cases that aren’t caught. More specifically, it won’t annotate XML tags that have a dash in their name, so tags such as:
<first-name>
aren’t caught by the current regular expression. I’ve changed the expression so that it works: I replaced the \w+ part that matches the tag name with \w[\w-]*, since XML tag names can contain dashes but cannot start with one. I’ve also updated the unit test so that it includes tags with dashes and underscores, as well as one non-tag.
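
To illustrate the change, here is a minimal sketch comparing the two tag-name patterns. It assumes simplified stand-in patterns (OLD_TAG and NEW_TAG are hypothetical names, and neither is the actual markupPattern from DefaultSeeder.java); only the \w+ versus \w[\w-]* part reflects what is described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkupPatternSketch {
  // Hypothetical, simplified patterns for illustration only.
  // They are NOT the real markupPattern used by DefaultSeeder.java.

  // Tag name is \w+ : letters, digits and underscores, no dashes.
  public static final Pattern OLD_TAG =
      Pattern.compile("</?\\w+(\\s[^<>]*)?/?>");

  // Tag name is \w[\w-]* : dashes are allowed, but not as the first character.
  public static final Pattern NEW_TAG =
      Pattern.compile("</?\\w[\\w-]*(\\s[^<>]*)?/?>");

  // Returns every substring of text matched by the given pattern, in order.
  public static List<String> findTags(Pattern pattern, String text) {
    List<String> result = new ArrayList<>();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      result.add(matcher.group());
    }
    return result;
  }

  public static void main(String[] args) {
    // Two dashed tags, two underscore tags, and one non-tag ("<-b>").
    String text = "<first-name>Ann</first-name> <last_name>Lee</last_name> a <-b> c";
    System.out.println("old: " + findTags(OLD_TAG, text));
    System.out.println("new: " + findTags(NEW_TAG, text));
  }
}
```

With this input, the old pattern should find only the underscore tags, while the new one should also find the dashed tags; "<-b>" stays unmatched by both because a name may not start with a dash.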
I’m attaching the SVN patch to this email.
Manuel

Attachment: MARKUP.patch
Description: Binary data

>Thanks Peter, 
> 
>The quotes are just normal quotes in the original source but the mail software must have changed 
>this. Sorry about that misunderstanding. 
> 
>Cheers 
>Mario  
> 
>> On 21/10/2015, at 16.03, Peter Klügl <peter.klu...@averbis.com> wrote: 
>>  
>> Hi, 
>>  
>> I extended the pattern to support dashes, but not the other quotes. This 
>> can get arbitrarily complex (and slow) if any combination of Unicode 
>> characters that look like quotes should be supported. I still think that 
>> this is not valid XML. Can you give me a link to the standard? 
>>  
>> It's maybe better to solve this in a specific use case before applying 
>> the seeder. 
>>  
>> Best, 
>>  
>> Peter 
>>  
>>> Am 20.10.2015 um 19:22 schrieb Mario Gazzo: 
>>> I believe it should be extended since I think that a RUTA user would expect that 
>>> the MARKUP annotation indeed captures at least XML and HTML markup properly. The examples 
>>> are from a PubMed Central XML file that follows the NISO JATS specification, so I will assume 
>>> it is properly formatted XML without knowing all the details of the spec. 
>>>  
>>> We have managed to implement a crude workaround for now, but let us know when an improved 
>>> version becomes available. 
>>>  
>>> Cheers 
>>> Mario 
>>>  
>>>> On 20 Oct 2015, at 17:56 , Peter Klügl <peter.klu...@averbis.com> wrote: 
>>>>  
>>>> Hi Mario, 
>>>>  
>>>> yes, and the different quote also causes problems (are these valid?). 
>>>>  
>>>> The MARKUP annotation is not created by jflex like the other annotations, 
>>>> but by a postprocessing step using a regular expression. This expression 
>>>> does not cover these cases (markupPattern in DefaultSeeder.java). 
>>>>  
>>>> Should we extend it? 
>>>>  
>>>> Best, 
>>>>  
>>>> Peter 
>>>>  
>>>>> Am 20.10.2015 um 17:26 schrieb Mario Gazzo: 
>>>>> Hi Peter, 
>>>>>  
>>>>> RUTA doesn’t seem to capture some XML markup with attributes. Here are 
>>>>> some examples: 
>>>>>  
>>>>> <xref ref-type="bibr" rid="b35-ehp0113-000220”> 
>>>>> <sec sec-type="methods”> 
>>>>>  
>>>>> The above markup examples are totally missing in the TokenSeed annotations. 
>>>>> I wonder whether it is related to the dash in the attribute names since other markup without 
>>>>> this appears to be captured. 
>>>>>  
>>>>> Can you confirm that the dash could cause the problem? 
>>>>>  
>>>>> Cheers 
>>>>> Mario 
>>  
>
