At my company we do a lot of NLP work, and we've ended up using
bespoke formats: formerly a FeatureStructure serialized to JSON, and
most recently protobufs. Possibly not the answer you were looking for,
Otis, but at least it's a data point.
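
For illustration only, here is a rough sketch of what a standoff-style
document serialized to JSON could look like. The field names (text,
annotations, start, end, type) and the helper functions are assumptions
made up for this example, not the schema of any particular tool or of
the formats mentioned above.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class Annotation:
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    type: str   # hypothetical label, e.g. "PERSON" or "LOCATION"


def to_standoff_json(text: str, annotations: List[Annotation]) -> str:
    """Serialize the untouched text next to its annotations (standoff markup)."""
    for a in annotations:
        if not (0 <= a.start < a.end <= len(text)):
            raise ValueError(f"offsets out of range: {a}")
    return json.dumps(
        {"text": text, "annotations": [asdict(a) for a in annotations]}
    )


def spans(doc_json: str) -> List[Tuple[str, str]]:
    """Recover (surface string, type) pairs from a serialized document."""
    doc = json.loads(doc_json)
    return [
        (doc["text"][a["start"]:a["end"]], a["type"])
        for a in doc["annotations"]
    ]


text = "Otis works on Solr in New York."
doc = to_standoff_json(
    text,
    [
        Annotation(0, 4, "PERSON"),
        Annotation(14, 18, "PRODUCT"),
        Annotation(22, 30, "LOCATION"),
    ],
)
print(spans(doc))
```

Because the annotations only point into the text by offset, two spans
can overlap (say, "New York" and "York") without any change to the
format, which is harder to express with inline markup.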

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Sep 12, 2012 at 7:36 AM, Alexandre Rafalovitch
<arafa...@gmail.com> wrote:
> Otis,
>
> If you are doing Named Entity Recognition, you may want to look at the
> research area concerned with Named Entity Recognition. :-) In general,
> there is inline markup and standoff markup. You seem to be going for
> standoff/stand-alone markup. I am not clear, though, whether it is just
> a 'discovery' format or an actual annotation format (with a reference to
> where in the sentence the entity is, using offsets or token ids).
>
> UIMA (which Solr already integrates with, right?) does NER, so it must
> be using some sort of format.
>
> Also, TREC is one of the competitions and they provide marked-up
> datasets you might be able to learn something from:
> http://ilps.science.uva.nl/trec-entity/
>
> If you are not sure where to start with NER, you can look at my
> collection of papers, though most of them are probably too specific:
> http://www.citeulike.org/user/arafalov
>
> Finally, if you have to deal with overlapping entities, there was an
> article about a month ago about some sort of general format. I can't seem
> to find the article right now, but I could try digging if you are
> still stuck.
>
> Regards,
>     Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>
>
> On Tue, Sep 11, 2012 at 11:51 AM, Otis Gospodnetic
> <otis_gospodne...@yahoo.com> wrote:
>> Hello,
>>
>> If I'm extracting named entities, topics, key phrases/tags, etc. from 
>> documents and I want to have a representation of this document, what format 
>> should I use? Are there any standard or at least common formats or 
>> approaches people use in such situations?
>>
>> For example, the most straightforward format might be something like this:
>>
>>
>> <document>
>>   <title>doc title</title>
>>   <keywords>meta keywords coming from the web page</keywords>
>>   <content>page meat</content>
>>   <entities>named entities recognized in the document</entities>
>>   <topics>topics extracted by the annotator</topics>
>>   <tags>tags extracted by the annotator</tags>
>>   <relations>relations extracted by the annotator</relations>
>> </document>
>>
>> But this is a made-up format - the XML tags above are just what somebody
>> happened to pick.
>>
>> Are there any standard or at least common formats for this?
>>
>>
>> Thanks,
>> Otis
>> ----
>> Performance Monitoring - Solr - ElasticSearch - HBase - 
>> http://sematext.com/spm
>>
>> Search Analytics - http://sematext.com/search-analytics/index.html
