Actually, at my company we do a lot of NLP work, and we've ended up using bespoke formats: formerly a FeatureStructure serialized to JSON, and most recently protobufs. Possibly not the answer you were looking for, Otis, but at least it's a data point.
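
As a rough illustration of what a bespoke JSON format of this kind could look like, here is a minimal Python sketch of standoff annotations (character offsets into the text, stored separately from it). It is not the actual schema described above: the class names, field names, labels, and sample sentence are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Annotation:
    # Hypothetical field names; offsets are character positions in `content`.
    type: str   # kind of annotation, e.g. "entity", "topic", "tag"
    label: str  # annotator-assigned label, e.g. "PERSON"
    start: int  # inclusive character offset
    end: int    # exclusive character offset


@dataclass
class AnnotatedDocument:
    title: str
    content: str
    annotations: list = field(default_factory=list)

    def annotate(self, type_, label, surface):
        """Add a standoff annotation for the first occurrence of `surface`."""
        start = self.content.find(surface)
        if start < 0:
            raise ValueError(f"{surface!r} not found in content")
        ann = Annotation(type_, label, start, start + len(surface))
        self.annotations.append(ann)
        return ann

    def text_of(self, ann):
        """Recover the annotated text from the offsets alone."""
        return self.content[ann.start:ann.end]

    def to_json(self):
        # asdict() recurses into the list of Annotation dataclasses.
        return json.dumps(asdict(self), indent=2)


doc = AnnotatedDocument(
    title="doc title",
    content="Otis asked the Solr list about NER formats.",
)
person = doc.annotate("entity", "PERSON", "Otis")
product = doc.annotate("entity", "PRODUCT", "Solr")
print(doc.to_json())
```

Because the text itself is left untouched, overlapping annotations are just additional entries in the list, which is harder to express with inline markup.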

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Sep 12, 2012 at 7:36 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> Otis,
>
> If you are doing Named Entity Recognition, you may want to look at the
> research area concerned with Named Entity Recognition. :-) In general,
> there is inline markup and standoff markup. You seem to be going for
> standoff/stand-alone markup. I am not clear though whether it is just
> 'discovery' format or actual annotation format (with reference to
> where in the sentence it is with offsets or token ids).
>
> UIMA (which Solr integrates with already, right?) does NER, so it must
> be using some sort of format.
>
> Also, TREC is one of the competitions and they provide marked-up
> datasets you might be able to learn something from:
> http://ilps.science.uva.nl/trec-entity/
>
> If you are not sure where to start with NER, you can look at my
> collection of papers, though most of them are probably too specific:
> http://www.citeulike.org/user/arafalov
>
> Finally, if you have to deal with overlapping entities, there was an
> article about a month ago about some sort of general format. I can't seem
> to find the article right now, but I could try digging if you are
> still stuck.
>
> Regards,
> Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working. (Anonymous - via GTD
> book)
>
>
> On Tue, Sep 11, 2012 at 11:51 AM, Otis Gospodnetic
> <otis_gospodne...@yahoo.com> wrote:
>> Hello,
>>
>> If I'm extracting named entities, topics, key phrases/tags, etc. from
>> documents and I want to have a representation of this document, what format
>> should I use? Are there any standard or at least common formats or
>> approaches people use in such situations?
>>
>> For example, the most straightforward format might be something like this:
>>
>>
>> <document>
>>   <title>doc title</title>
>>   <keywords>meta keywords coming from the web page</keywords>
>>   <content>page meat</content>
>>   <entities>named entities recognized in the document</entities>
>>   <topics>topics extracted by the annotator</topics>
>>   <tags>tags extracted by the annotator</tags>
>>   <relations>relations extracted by the annotator</relations>
>> </document>
>>
>> But this is a made-up format - the XML tags above are just what somebody
>> happened to pick.
>>
>> Are there any standard or at least common formats for this?
>>
>>
>> Thanks,
>> Otis
>> ----
>> Performance Monitoring - Solr - ElasticSearch - HBase -
>> http://sematext.com/spm
>>
>> Search Analytics - http://sematext.com/search-analytics/index.html