[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-03 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17574695#comment-17574695
 ] 

Tika User commented on TIKA-3827:
-

The file type is detected as RTF, and when the content is extracted, the 
embedded files come out as two files with the .mpga extension.

> Word Document extracted mpga file extension instead of bitmap 
> --
>
>         Key: TIKA-3827
>         URL: https://issues.apache.org/jira/browse/TIKA-3827
>     Project: Tika
>  Issue Type: Bug
>  Components: parser
>    Reporter: Tika User
>    Priority: Major
> Attachments: example.DOC
>
>
> When we tried to parse the .doc document, Tika extracted two .mpga files, 
> which cannot be opened or played. We suspect they should be bitmap image 
> files. The Tika version we are using is 2.4.1.
> [^example.DOC]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17574756#comment-17574756
 ] 

Tim Allison commented on TIKA-3827:
---

The RTFParser runs file type detection on embedded files. That process is 
identifying this as mpga.  I agree that the files are bitmap, and the RTF file 
encodes that with: {{\wbitmap}}.

So, should we turn off file type detection if the RTF alleges that the embedded 
file is a pict/bitmap?  Or do we need to improve our mpga handling?



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-03 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17574757#comment-17574757
 ] 

Tika User commented on TIKA-3827:
-

Based on the document above, I think Tika should turn off file type detection 
if the RTF alleges that the embedded file is a pict/bitmap.



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-03 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17574765#comment-17574765
 ] 

Tika User commented on TIKA-3827:
-

I think both, because when I process the extracted file separately, Tika still 
treats it as mpga instead of png/bitmap.



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17574779#comment-17574779
 ] 

Tim Allison commented on TIKA-3827:
---

It looks like the embedded files do not have bmp headers.  Are they just the 
raw bytes after what would be the header?  If you extract them (attached), are 
you able to open them?

Magic isn't working because they don't have headers.  I'm working on adding a 
mime type hint if \wbitmap is encountered.

> Attachments: example.DOC, file_1.bmp, file_2.bmp


[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-03 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575047#comment-17575047
 ] 

Tika User commented on TIKA-3827:
-

When I try to open the extracted file in Paint, I see the error below.

!image-2022-08-04-10-53-48-894.png!

> Attachments: example.DOC, file_1.bmp, file_2.bmp, 
> image-2022-08-04-10-52-44-800.png, image-2022-08-04-10-53-48-894.png


[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575184#comment-17575184
 ] 

Tim Allison commented on TIKA-3827:
---

Right.  That's my point above. I don't know if this is a feature of how RTF 
embeds bitmaps (without headers) or if the file is corrupted.  If you open the 
file in a hex editor, you can see the bytes are extracted properly and that the 
files do not start with "BM" (see eg: 
https://gist.github.com/leommoore/f9e57ba2aa4bf197ebc5).  This is the reason 
that Tika's mime detection on the bytes of these images is failing.  There's no 
header.
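
As a rough illustration of the check described above, the sketch below tests 
whether a byte array begins with the two ASCII bytes "BM". The class and 
method names are hypothetical and are not part of Tika; this is only the kind 
of leading-bytes comparison that signature-based detection relies on.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: checks for the two-byte BMP signature. */
final class BmpHeaderCheck {

    // A BMP file header starts with the ASCII characters 'B' and 'M'.
    private static final byte[] BMP_MAGIC = "BM".getBytes(StandardCharsets.US_ASCII);

    /** Returns true only if the data begins with the bytes 'B' 'M'. */
    static boolean startsWithBmpMagic(byte[] data) {
        if (data == null || data.length < BMP_MAGIC.length) {
            return false;
        }
        return data[0] == BMP_MAGIC[0] && data[1] == BMP_MAGIC[1];
    }
}
```

Bytes that fail this test give a magic-based detector nothing to identify 
them as BMP, so it falls through to whatever other rule happens to match.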

> Attachments: Screenshot from 2022-08-04 06-05-09.png, example.DOC, 
> file_1.bmp, file_2.bmp, image-2022-08-04-10-52-44-800.png, 
> image-2022-08-04-10-53-48-894.png


[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-04 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575193#comment-17575193
 ] 

Tika User commented on TIKA-3827:
-

But if we open the original document, we can see those images. If detection 
does not treat the data as a bitmap because it has no headers, which rule 
causes it to be detected as an mpga file? Because of that extension we tried 
to play the files, but could not.

!image-2022-08-04-15-44-48-396.png!

!image-2022-08-04-15-45-10-892.png!

> Attachments: Screenshot from 2022-08-04 06-05-09.png, example.DOC, 
> file_1.bmp, file_2.bmp, image-2022-08-04-10-52-44-800.png, 
> image-2022-08-04-10-53-48-894.png, image-2022-08-04-15-44-48-396.png, 
> image-2022-08-04-15-45-10-892.png


[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575195#comment-17575195
 ] 

Tim Allison commented on TIKA-3827:
---

Ah, thank you.  I hadn't noticed the images when I opened the file in an 
application.  The rule that triggers mpga is here: 
https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L5232

I didn't notice anything in the rtf spec(s) about encoding of {{\wbitmap}}, but 
I'll take a look again.
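
For intuition only: MPEG audio frames begin with an 11-bit sync pattern of all 
ones, so a short signature test for that family typically looks for a leading 
0xFF byte followed by a byte whose top three bits are set. Headerless pixel 
data can start with such bytes by coincidence. The sketch below is a generic 
sync test with a hypothetical class name. It is not a copy of the linked rule 
in tika-mimetypes.xml, and the thread does not state which bytes the extracted 
files actually start with.

```java
/** Hypothetical illustration of an MPEG audio frame-sync test; NOT Tika's actual rule. */
final class MpegSyncCheck {

    /** True if the first 11 bits of the data are all ones. */
    static boolean looksLikeMpegAudioFrame(byte[] data) {
        if (data == null || data.length < 2) {
            return false;
        }
        int first = data[0] & 0xFF;   // bits 1-8 of the sync pattern
        int second = data[1] & 0xFF;  // top three bits complete the sync pattern
        return first == 0xFF && (second & 0xE0) == 0xE0;
    }
}
```

A two-byte test like this is weak evidence on its own, which is one way 
arbitrary binary data can end up labelled as audio.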



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575199#comment-17575199
 ] 

Tim Allison commented on TIKA-3827:
---

This might be relevant: http://justsolve.archiveteam.org/wiki/Raw_bitmap

Note that there is metadata about the image before the raw bytes, width, 
height, color stuff, etc.  
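
To make the point about metadata concrete, here is a minimal sketch that 
turns headerless pixel bytes into an image once the geometry is known. 
Everything about the layout is an assumption made for illustration: 1 bit per 
pixel, most significant bit first, a set bit meaning white, and a caller 
supplied row stride. RTF picture groups can carry values of this kind in 
control words such as \picw, \pich, \wbmbitspixel and \wbmwidthbytes, but this 
thread does not establish that this is the encoding used in example.DOC. The 
class and method names are hypothetical.

```java
import java.awt.image.BufferedImage;

/** Hypothetical sketch: builds an image from headerless 1-bit-per-pixel rows. */
final class RawBitmapSketch {

    /**
     * Assumes 1 bit per pixel, MSB first, set bit = white, and rows that are
     * bytesPerRow bytes long. None of this is confirmed for the file in this issue.
     */
    static BufferedImage fromOneBitRows(byte[] raw, int width, int height, int bytesPerRow) {
        if (bytesPerRow * 8 < width || raw.length < bytesPerRow * height) {
            throw new IllegalArgumentException("raw data too short for the given geometry");
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Locate the byte holding pixel x of row y, then pick its bit.
                int b = raw[y * bytesPerRow + x / 8] & 0xFF;
                boolean set = ((b >> (7 - (x % 8))) & 1) == 1;
                image.setRGB(x, y, set ? 0xFFFFFF : 0x000000);
            }
        }
        return image;
    }
}
```

The resulting BufferedImage could then be written out with 
javax.imageio.ImageIO.write(image, "png", file). Without the width, height and 
bit depth, the same bytes cannot be interpreted at all.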



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575204#comment-17575204
 ] 

Tim Allison commented on TIKA-3827:
---

I'm not having luck with ImageMagick, but this is likely user error: 
https://legacy.imagemagick.org/discourse-server/viewtopic.php?t=22999



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-04 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575208#comment-17575208
 ] 

Tika User commented on TIKA-3827:
-

I tried the same file with the tool at the link below, and both embedded 
attachments come out as PNG files.


[https://products.aspose.app/words/editor] 



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575279#comment-17575279
 ] 

Tim Allison commented on TIKA-3827:
---

My guess is that aspose is doing the correct post-processing to yield a png.  
This happens commonly in PDFs, where the raw bytes for an image that are stored 
have to be manipulated by the application along with other information in the 
document to yield an actual image file.

Are you able to attach the pngs?  I'm curious if they just slapped a png header 
on those bytes or if they were actually transformed.



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-04 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575281#comment-17575281
 ] 

Tika User commented on TIKA-3827:
-

[^example.zip]

I have attached a zip file containing the extracted content and the two PNG 
files.

> Attachments: Screenshot from 2022-08-04 06-05-09.png, example.DOC, 
> example.zip, file_1.bmp, file_2.bmp, image-2022-08-04-10-52-44-800.png, 
> image-2022-08-04-10-53-48-894.png, image-2022-08-04-15-44-48-396.png, 
> image-2022-08-04-15-45-10-892.png


[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575293#comment-17575293
 ] 

Tim Allison commented on TIKA-3827:
---

Yes, those bytes are completely transformed.  Aspose is doing the correct image 
manipulation. Now we just need to find documentation for what that is.



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-05 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575881#comment-17575881
 ] 

Tika User commented on TIKA-3827:
-

Below is the Aspose.Words code:

You can easily extract text from the document using code like this:

{{Document doc = new Document("C:\\Temp\\in.doc");
doc.save("C:\\Temp\\out.txt");}}

Or, if you need to extract the text into a String, you can use code like this:

{{Document doc = new Document("C:\\Temp\\in.doc");
String docText = doc.toString(SaveFormat.TEXT);}}

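The following code can be used for image extraction:

{{// Requires: import com.aspose.words.*;
Document doc = new Document("C:\\Temp\\in.doc");

// Collect every Shape node in the document, including nested ones.
Iterable<Shape> shapes = (Iterable<Shape>) doc.getChildNodes(NodeType.SHAPE, true);
int counter = 0;
for (Shape s : shapes)
{
    if (s.hasImage())
    {
        // Save each image with the extension that matches its image type.
        s.getImageData().save("C:\\Temp\\img_" + counter +
            FileFormatUtil.imageTypeToExtension(s.getImageData().getImageType()));
        counter++;
    }
}}}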



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575890#comment-17575890
 ] 

Tim Allison commented on TIKA-3827:
---

That's the client code, but we don't know what "getImageData()" is doing under 
the hood to transform the raw bytes.



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575891#comment-17575891
 ] 

Tim Allison commented on TIKA-3827:
---

For now, I've added a mediatype hint that the bytes are of type 
{{image/x-rtf-raw-bitmap}}.  This prevents parsers from being applied.

The correct solution would be to figure out the algorithm to manipulate the 
bytes to convert them to an actual image file, but that is beyond my reach atm.
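
The idea of a hint taking precedence over byte sniffing can be sketched as 
follows. This is not the code from the actual change. The class, the method 
names and the tiny sniffing routine are all hypothetical, and only the media 
type string {{image/x-rtf-raw-bitmap}} comes from this thread.

```java
import java.util.Optional;

/** Hypothetical sketch: a container-supplied type hint wins over byte sniffing. */
final class HintedDetection {

    static final String RAW_BITMAP = "image/x-rtf-raw-bitmap";
    static final String OCTET_STREAM = "application/octet-stream";

    /** Uses the hint when the container declares one; otherwise sniffs the bytes. */
    static String resolveType(Optional<String> containerHint, byte[] data) {
        if (containerHint.isPresent()) {
            return containerHint.get();
        }
        return sniff(data);
    }

    /** Toy signature sniffing; real detectors use far larger rule sets. */
    static String sniff(byte[] data) {
        if (data != null && data.length >= 2) {
            if (data[0] == 'B' && data[1] == 'M') {
                return "image/bmp";
            }
            if ((data[0] & 0xFF) == 0xFF && (data[1] & 0xE0) == 0xE0) {
                return "audio/mpeg";
            }
        }
        return OCTET_STREAM;
    }
}
```

With a hint present, bytes that would otherwise look like audio are reported 
under the hinted type, so no audio or image parser is selected for them.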



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-05 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575935#comment-17575935
 ] 

Hudson commented on TIKA-3827:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #730 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/730/])
TIKA-3827 -- override image mime if raw bitmap in RTF (tallison: 
[https://github.com/apache/tika/commit/99533c971d5db7d7f3c501bc6cf67082a8d7f0cc])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/rtf/RTFEmbObjHandler.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/rtf/TextExtractor.java




[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-05 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575945#comment-17575945
 ] 

Tika User commented on TIKA-3827:
-

When will this fix be available? In the next version?



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-07 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576563#comment-17576563
 ] 

Tika User commented on TIKA-3827:
-

[~tallison] Here is the ImageData documentation:

[https://reference.aspose.com/words/java/com.aspose.words/ImageData] 



[jira] [Commented] (TIKA-3827) Word Document extracted mpga file extension instead of bitmap

2022-08-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576786#comment-17576786
 ] 

Tim Allison commented on TIKA-3827:
---

This is helpful for how to use Aspose's API, but I don't see anything in there 
about how to translate the raw bytes of the image as stored into an actual 
image.  As I showed above, Tika is extracting the raw bytes as they appear in 
the file.  However, I think there has to be some kind of transformation to 
convert that to an actual image file.
