A few additions, since this structure is the critical thing:
<paragraph><commentRangeStart id="commentId"/><run><text>John</text></run><commentRangeEnd id="commentId"/></paragraph>
In the example document.xml it appears as:
<!-- comment range, text run "John" -->
<w:commentRangeStart w:id="0"/>
<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
  <w:rPr><w:rtl w:val="0"/></w:rPr>
  <w:t xml:space="preserve">John</w:t>
</w:r>
<w:commentRangeEnd w:id="0"/>
wml.xsd declares both markers as CT_MarkupRange elements:
<xsd:element name="commentRangeStart" type="CT_MarkupRange">
  <xsd:annotation>
    <xsd:documentation>Comment Anchor Range Start</xsd:documentation>
  </xsd:annotation>
</xsd:element>
<xsd:element name="commentRangeEnd" type="CT_MarkupRange">
  <xsd:annotation>
    <xsd:documentation>Comment Anchor Range End</xsd:documentation>
  </xsd:annotation>
</xsd:element>
So if performance isn't a concern here (that is, you don't need to
cache pointers to where the comment ranges are), the pseudo-code for
an XWPFComment method that gets the text a comment refers to would
be:
public String getRefersToText() {
    StringBuilder refersTo = new StringBuilder();
    boolean inRange = false;
    for each CTParagraph in document:
        for each child element of the CTParagraph:
            if child element is a commentRangeStart and id == this.id:
                inRange = true   // start collecting subsequent text runs
                continue
            if inRange and child element is a text run:
                append this text run's text to the refersTo buffer
            if child element is a commentRangeEnd and id == this.id:
                return refersTo.toString()   // assumes one comment does not refer to multiple text ranges
    return refersTo.toString();   // no matching commentRangeEnd was found
}
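
To make the pseudo-code concrete, here is a minimal sketch that works on the raw contents of word/document.xml using only the JDK's DOM parser. It is not POI API: the class name CommentRangeText, the helper parse, and the choice to pass the XML as a String are all assumptions made for illustration. It also assumes the transitional WordprocessingML namespace and, like the pseudo-code, that each comment id has at most one range.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of POI.
final class CommentRangeText {
    // Transitional WordprocessingML namespace (assumed; strict documents use a different URI).
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of all w:t elements between the commentRangeStart and commentRangeEnd with this id. */
    public static String getRefersToText(String documentXml, String commentId) {
        StringBuilder refersTo = new StringBuilder();
        boolean inRange = false;
        // getElementsByTagNameNS returns elements in document order.
        NodeList elements = parse(documentXml).getElementsByTagNameNS(W_NS, "*");
        for (int i = 0; i < elements.getLength(); i++) {
            Element e = (Element) elements.item(i);
            String name = e.getLocalName();
            boolean idMatches = commentId.equals(e.getAttributeNS(W_NS, "id"));
            if ("commentRangeStart".equals(name) && idMatches) {
                inRange = true;                       // start collecting text
            } else if ("commentRangeEnd".equals(name) && idMatches) {
                break;                                // range closed, stop scanning
            } else if (inRange && "t".equals(name)) {
                refersTo.append(e.getTextContent());  // text run content inside the range
            }
        }
        return refersTo.toString();
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse document.xml", ex);
        }
    }
}
```

Here documentXml would be the word/document.xml entry read out of the .docx zip (for example with java.util.zip.ZipFile). Because the scan follows document order rather than going paragraph by paragraph, a range that spans several paragraphs is also collected.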
This would require searching the entire document for every comment.
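
If that repeated scan ever becomes a concern, one alternative is a single pass that collects the annotated text for every comment id at once. The sketch below again uses only the JDK's DOM parser on the raw word/document.xml; CommentRangeIndex and indexAnnotatedText are hypothetical names, not POI API, and the transitional WordprocessingML namespace is assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of POI.
final class CommentRangeIndex {
    // Transitional WordprocessingML namespace (assumed).
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Maps each comment id to the text between its commentRangeStart and commentRangeEnd. */
    public static Map<String, String> indexAnnotatedText(String documentXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse document.xml", ex);
        }

        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        Set<String> open = new LinkedHashSet<>();   // ids whose range is currently open
        NodeList elements = doc.getElementsByTagNameNS(W_NS, "*");   // document order
        for (int i = 0; i < elements.getLength(); i++) {
            Element e = (Element) elements.item(i);
            String id = e.getAttributeNS(W_NS, "id");
            switch (e.getLocalName()) {
                case "commentRangeStart":
                    buffers.putIfAbsent(id, new StringBuilder());
                    open.add(id);
                    break;
                case "commentRangeEnd":
                    open.remove(id);
                    break;
                case "t":
                    // Ranges may overlap, so the text goes to every open range.
                    for (String openId : open) {
                        buffers.get(openId).append(e.getTextContent());
                    }
                    break;
                default:
                    break;
            }
        }

        Map<String, String> result = new LinkedHashMap<>();
        buffers.forEach((id, text) -> result.put(id, text.toString()));
        return result;
    }
}
```

The resulting map could then be joined against the comment table by id, so each comment lookup is a map access instead of a document scan.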
For reference, the current XWPFDocument and XWPFParagraph sources:
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFDocument.java?view=markup
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFParagraph.java?view=markup
On Tue, May 9, 2017 at 11:14 PM, Javen O'Neal <[email protected]> wrote:
> First, if you're using Java 5 (1.5) or later, you can use for-each
> loops for more readable code.
> for (final XWPFComment comment : adoc.getComments()) {
>     final String id = comment.getId();
>     final String author = comment.getAuthor();
>     final String text = comment.getText();
> }
>
> I don't see anything in POI right now that makes it easy to extract
> the annotated text that a comment refers to.
>
> Here's the current implementation of XWPFComment:
> https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFComment.java?view=markup
>
> Taking a look at the OOXML 2006 schemas wml.xsd (download from
> http://www.ecma-international.org/publications/files/ECMA-ST/Office%20Open%20XML%201st%20edition%20Part%204%20(PDF).zip,
> extract OfficeOpenXML-Part4a.zip, extract OfficeOpenXML-XMLSchema.zip,
> open wml.xsd), I see that the comment (*.docx/word/comments.xml)
> doesn't refer to the document text.
>
> <xsd:complexType name="CT_Comment">
>   <xsd:complexContent>
>     <xsd:extension base="CT_TrackChange">
>       <xsd:sequence>
>         <xsd:group ref="EG_BlockLevelElts" minOccurs="0"
>             maxOccurs="unbounded"></xsd:group>
>       </xsd:sequence>
>       <xsd:attribute name="initials" type="ST_String" use="optional">
>         <xsd:annotation>
>           <xsd:documentation>Initials of Comment Author</xsd:documentation>
>         </xsd:annotation>
>       </xsd:attribute>
>     </xsd:extension>
>   </xsd:complexContent>
> </xsd:complexType>
>
> <xsd:complexType name="CT_TrackChange">
>   <xsd:complexContent>
>     <xsd:extension base="CT_Markup">
>       <xsd:attribute name="author" type="ST_String" use="required">
>         <xsd:annotation>
>           <xsd:documentation>Annotation Author</xsd:documentation>
>         </xsd:annotation>
>       </xsd:attribute>
>       <xsd:attribute name="date" type="ST_DateTime" use="optional">
>         <xsd:annotation>
>           <xsd:documentation>Annotation Date</xsd:documentation>
>         </xsd:annotation>
>       </xsd:attribute>
>     </xsd:extension>
>   </xsd:complexContent>
> </xsd:complexType>
>
> <xsd:complexType name="CT_Markup">
>   <xsd:attribute name="id" type="ST_DecimalNumber" use="required">
>     <xsd:annotation>
>       <xsd:documentation>Annotation Identifier</xsd:documentation>
>     </xsd:annotation>
>   </xsd:attribute>
> </xsd:complexType>
>
> Examining the zipped xml contents of a simple comment example docx
> file that I created, I see that the relationship is the other way
> around: the document refers to the comments (this ordering makes more
> sense anyways).
>
> For a simple file that I created with the text "My name is John." and
> annotated the word John with a comment with the message "Noun", here's
> what I got in CommentExample.docx/word/document.xml:
>
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <w:document xmlns....>
>   <w:body>
>     <!-- text paragraph: "My name is [[John]]." -->
>     <w:p w:rsidR="00000000" w:rsidDel="00000000" w:rsidP="00000000"
>         w:rsidRDefault="00000000" w:rsidRPr="00000000">
>       <w:pPr>
>         <w:pBdr/>
>         <w:contextualSpacing w:val="0"/>
>         <w:rPr/>
>       </w:pPr>
>
>       <!-- text run "My name is " -->
>       <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>         <w:rPr><w:rtl w:val="0"/></w:rPr>
>         <w:t xml:space="preserve">My name is </w:t>
>       </w:r>
>
>       <!-- comment range, text run "John" -->
>       <w:commentRangeStart w:id="0"/>
>       <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>         <w:rPr><w:rtl w:val="0"/></w:rPr>
>         <w:t xml:space="preserve">John</w:t>
>       </w:r>
>       <w:commentRangeEnd w:id="0"/>
>
>       <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>         <w:commentReference w:id="0"/>
>       </w:r>
>
>       <!-- text run "." -->
>       <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>         <w:rPr><w:rtl w:val="0"/></w:rPr>
>         <w:t xml:space="preserve">.</w:t>
>       </w:r>
>
>     </w:p>
>     <w:sectPr>
>       <w:pgSz w:h="15840" w:w="12240"/>
>       <w:pgMar w:bottom="1440" w:top="1440" w:left="1440"
>           w:right="1440" w:header="0"/>
>       <w:pgNumType w:start="1"/>
>     </w:sectPr>
>   </w:body>
> </w:document>
>
> So to solve your problem, you could either:
> 1. search the document.xml for all comments, looking up the comment's
> author and text using the ID that is referenced in the document
> commentRangeStart-commentRangeEnd and joining all the text contained
> between those markers
> 2. for each comment in the comment table, find the corresponding
> commentRangeStart and commentRangeEnd tags in document.xml and get the
> corresponding text that was annotated (in this example, John).
>
> If you don't already have a development environment set up, I
> encourage you to do so. Patches are greatly appreciated.
>
> On Tue, May 9, 2017 at 9:42 AM, Ramani Routray <[email protected]> wrote:
>> I have a Microsoft Word (.docx) file and am trying to retrieve the comments
>> and their associated highlighted text. Can you please help?
>>
>> Attaching picture of the sample word document and the java code for
>> extracting the comments. [ A file with a line "My name is John". The word
>> "John" is highlighted with a comment "Noun" ]
>>
>> I am able to extract the comments (Noun, Adjective). I would like to extract
>> the text associated with the comment "Noun" (Noun = John, Adjective = great)
>>
>> FileInputStream fis = new FileInputStream(new File(msWordFilePath));
>> XWPFDocument adoc = new XWPFDocument(fis);
>> XWPFWordExtractor xwe = new XWPFWordExtractor(adoc);
>> XWPFComment[] comments = adoc.getComments();
>>
>> for (int idx = 0; idx < comments.length; idx++) {
>>     MSWordAnnotation annot = new MSWordAnnotation();
>>     annot.setAnnotationName(comments[idx].getId());
>>     annot.setAnnotationValue(comments[idx].getText());
>>     aList.add(annot);
>> }
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>