Re: [classlib][luni] guessing content mime types

Oliver Deakin Mon, 03 Sep 2007 02:49:36 -0700

Hi Tim,

There is FindMimeFromData() [1], declared in urlmon.h, which may be useful. From [2], it appears that this is the system function IE uses to determine MIME types.


Regards,
Oliver

[1] http://msdn2.microsoft.com/en-us/library/ms775107.aspx
[2] http://msdn2.microsoft.com/en-us/library/ms775147.aspx

Tim Ellison wrote:

On a related note, we do a rubbish job of guessing the content type from
the content of files themselves via
URLConnection#guessContentTypeFromStream(InputStream).  I've added a bit
more logic in there for the most obvious cases, but when you consider
the info in your typical Linux 'magic' file we have a long way to go.
My first thought was whether we could ask the platform to guess for us,
but I don't think there is any equivalent on Windows etc?
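To make the 'magic'-file idea concrete, here is a minimal sketch of table-driven sniffing in Java. It is an illustration only, not Harmony's or the RI's code: the class name MagicSniffer, the 8-byte peek size, and the handful of signatures chosen are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Harmony or the JDK.
final class MagicSniffer {
    private static final int PEEK = 8; // bytes inspected; an arbitrary choice for this sketch

    /** Returns a guessed MIME type, or null if nothing matches or the stream cannot be rewound. */
    static String guess(InputStream in) throws IOException {
        if (!in.markSupported()) {
            return null; // without mark/reset we would consume the caller's data
        }
        in.mark(PEEK);
        byte[] head = new byte[PEEK];
        int n = 0;
        while (n < PEEK) {
            int r = in.read(head, n, PEEK - n);
            if (r < 0) {
                break; // end of stream before PEEK bytes were available
            }
            n += r;
        }
        in.reset(); // leave the stream positioned where the caller had it

        // ISO-8859-1 maps each byte to the char with the same value, so byte
        // signatures can be compared as plain strings.
        String s = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        if (s.startsWith("GIF8")) return "image/gif";
        if (s.startsWith("\u0089PNG")) return "image/png";
        if (s.startsWith("%PDF-")) return "application/pdf";
        if (s.toLowerCase(Locale.ROOT).startsWith("<html")) return "text/html";
        return null;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "<html><body>hi</body></html>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(new ByteArrayInputStream(sample)));
    }
}
```

A real 'magic' file carries far more entries, including offsets and byte masks, so this only shows the shape of the approach.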

Regards,
Tim

Alexey Petrenko wrote:

Looks like both application/rtf and text/rtf are correct from the IANA [1]
point of view.
So I do not see any harm in following the RI's behavior in this case.

By the way, the application/rtf specification looks more recent than text/rtf.

SY, Alexey

1. http://www.iana.org/assignments/media-types/

2007/8/31, Tim Ellison <[EMAIL PROTECTED]>:

The MIME types for a given extension are defined here [1] which we took
from httpd's view of the world.  So while it would be trivial to change
them to be the same as the RI, I'm inclined to:
 - leave rtf as text/rtf
 - add java to our list as text/plain
 - leave doc as application/msword
then figure out how to snoop the stream for other types.
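As a rough sketch of what the proposed table would give for the extensions under discussion, the following Java snippet loads three entries with java.util.Properties and looks up a file name's extension. The class name ExtensionTable and the simple "extension=type" layout are assumptions for illustration; the actual content-types.properties file may be laid out differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper for illustration; not the Harmony implementation.
final class ExtensionTable {
    // Inline stand-in for a properties file; the entries follow the proposal above.
    private static final String TABLE =
            "rtf=text/rtf\n"
          + "java=text/plain\n"
          + "doc=application/msword\n";

    private final Properties types = new Properties();

    ExtensionTable() {
        try {
            types.load(new StringReader(TABLE));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }
    }

    /** Returns the mapped type for the file's extension, or "content/unknown". */
    String lookup(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "content/unknown"; // no extension to look up
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return types.getProperty(ext, "content/unknown");
    }

    public static void main(String[] args) {
        ExtensionTable table = new ExtensionTable();
        for (String name : new String[] {"test.rtf", "Test.java", "test.doc", "test.htx"}) {
            System.out.println(name + ": " + table.lookup(name));
        }
    }
}
```

Anything not in the table, such as test.htx, falls through to "content/unknown", which is where snooping the stream would have to take over.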

[1]
http://svn.apache.org/viewvc/harmony/enhanced/classlib/trunk/depends/files/content-types.properties?revision=494047&view=markup

Thoughts?
Tim


Vasily Zakharov (JIRA) wrote:

[classlib][luni] URLConnection.getContentType() works with files incorrectly
----------------------------------------------------------------------------

                 Key: HARMONY-4699
                 URL: https://issues.apache.org/jira/browse/HARMONY-4699
             Project: Harmony
          Issue Type: Bug
          Components: Classlib
            Reporter: Vasily Zakharov


In the Harmony implementation, java.net.URLConnection.getContentType() works 
incorrectly when it addresses a file URL:

1. For files with .rtf extension, RI returns "application/rtf", while Harmony returns 
"text/rtf".

2. For files with .java extension, RI returns "text/plain", while Harmony returns 
"content/unknown".

3. For files with .doc extension, RI returns "content/unknown", while Harmony returns 
"application/msword". The same is true for other known extensions.

4. For files with unrecognized extension and with HTML content, RI returns "text/html", 
while Harmony returns "content/unknown".

Items 1 and 2 look like minor issues that would be better fixed for 
compatibility with RI.

Item 3 looks like a non-bug difference, as Harmony behaves clearly better than 
RI in these cases.

Item 4 looks like a serious bug, as RI clearly looks into file content for the 
file type, and Harmony does not. Looks like 
org.apache.harmony.luni.internal.net.www.protocol.file.FileURLConnection.getContentType()
 needs to be fixed to use guessContentTypeFromStream() in addition to 
guessContentTypeFromName().
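The fallback suggested in item 4 could look roughly like the sketch below. It uses only the public static methods URLConnection.guessContentTypeFromName() and URLConnection.guessContentTypeFromStream(); the class name ContentTypeFallback and the method shape are assumptions for illustration, not the actual FileURLConnection code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the Harmony FileURLConnection.
final class ContentTypeFallback {
    /** Tries the file name first, then the content, then gives up. */
    static String contentType(String fileName, InputStream content) throws IOException {
        String type = URLConnection.guessContentTypeFromName(fileName);
        if (type == null) {
            // guessContentTypeFromStream needs mark/reset support, so wrap if necessary.
            // Note: wrapping buffers data away from the original stream, so a caller
            // would have to keep reading from the wrapper afterwards.
            InputStream in = content.markSupported()
                    ? content
                    : new BufferedInputStream(content);
            type = URLConnection.guessContentTypeFromStream(in);
        }
        return type != null ? type : "content/unknown";
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>hi</body></html>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(contentType("test.htx", new ByteArrayInputStream(html)));
    }
}
```

The result depends on the class library's own name table and stream heuristics, which is exactly the difference between Harmony and RI that this issue describes.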

The attached archive contains the reproducer with some test files it uses. 
Here's the reproducer code:

public class Test {
    static void printContentType(String fileName) throws java.io.IOException {
        System.out.println(fileName + ": "
                + new java.net.URL("file:" + fileName).openConnection().getContentType());
    }
    public static void main(String argv[]) {
        try {
            printContentType("test.rtf");
            printContentType("Test.java");
            printContentType("test.doc");
            printContentType("test.htx");
        } catch (Exception e) {
            e.printStackTrace(System.out);
        }
    }
}

Output on RI:

test.rtf: application/rtf
Test.java: text/plain
test.doc: content/unknown
test.htx: text/html

Output on Harmony:

test.rtf: text/rtf
Test.java: content/unknown
test.doc: application/msword
test.htx: content/unknown

This issue is a blocker for HARMONY-4696, because JEditorPane.getContentType() 
should, as on RI, be based on URLConnection.getContentType(), which currently 
works incorrectly.


--
Oliver Deakin
Unless stated otherwise above:

IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
