Re: Mimetype detection for JSON

Sebastian Nagel Thu, 16 Apr 2015 15:09:56 -0700

Hi Iain,

> It would also seem simple enough to pass these hints to Tika
> with the following modification to my previously proposed code:
> try {
>   InputStream in = new ByteArrayInputStream(data);
>   Metadata meta = new Metadata();
>   meta.set(Metadata.CONTENT_TYPE,typeName);
>   meta.set(Metadata.RESOURCE_NAME_KEY,url);
>   magicType = this.mimeTypes.detect(in, meta).toString();



Ok, that would be quite close to the current state in trunk:

    if (this.mimeMagic) {
      String magicType = null;
      // pass URL (file name) and (cleansed) content type from protocol to Tika
      Metadata tikaMeta = new Metadata();
      tikaMeta.add(Metadata.RESOURCE_NAME_KEY, url);
      tikaMeta.add(Metadata.CONTENT_TYPE,
          (cleanedMimeType != null ? cleanedMimeType : typeName));
      try {
        InputStream stream = TikaInputStream.get(data);
        try {
          magicType = tika.detect(stream, tikaMeta);
        } finally {


Sorry, looks like I overlooked this little difference:

 magicType = tika.detect(stream, tikaMeta);
vs.
 magicType = this.mimeTypes.detect(in, meta).toString();


Is this correct?
Seems plausible because only "mimeTypes" is explicitly configured to use the
mime.types.file, while "tika" is not, cf. MimeUtil constructor.
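
A minimal sketch of that difference, using only the JDK. ToyDetector and
TwoDetectorsDemo are hypothetical stand-ins, not Tika or Nutch classes, and
the rules are made up for illustration: one detector instance is built with
the site-specific rule, the other with bundled defaults only, so the custom
rule only takes effect if detection goes through the configured instance.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a magic-based detector; this is NOT the Tika API.
final class ToyDetector {
    // Leading byte sequence (as a string) -> MIME type, checked in insertion order.
    private final Map<String, String> magicRules = new LinkedHashMap<>();

    ToyDetector addRule(String prefix, String mimeType) {
        magicRules.put(prefix, mimeType);
        return this;
    }

    String detect(byte[] data) {
        // Only the first few bytes matter for this toy magic check.
        int len = Math.min(data.length, 16);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> rule : magicRules.entrySet()) {
            if (head.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return "application/octet-stream";
    }
}

final class TwoDetectorsDemo {
    // Plays the role of "tika": bundled default rules only.
    static ToyDetector defaults() {
        return new ToyDetector().addRule("%PDF-", "application/pdf");
    }

    // Plays the role of "mimeTypes": defaults plus the site-specific rule.
    static ToyDetector configured() {
        return defaults().addRule("{", "application/json");
    }

    public static void main(String[] args) {
        byte[] data = "{\"a\": 1}".getBytes(StandardCharsets.UTF_8);
        System.out.println(defaults().detect(data));   // prints application/octet-stream
        System.out.println(configured().detect(data)); // prints application/json
    }
}
```

The same content gets two different answers depending on which instance is
asked, which would match the behaviour Iain describes.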

> In fact, I am left wondering why the entire autoResolveContentType in
> MimeUtil.java cannot be replaced by this code

Yes, of course. Please, open a Jira. That's a bug, definitely.

Thanks,
Sebastian


On 04/16/2015 04:13 PM, Iain Lopata wrote:
> Sebastian,
> 
> I am not sure I understand your response.
> 
> While you are correct that the call to the detect method in my revised code
> below only uses the content, in the broader context of MimeUtil.java the
> mime type returned by the server and the filename are both considered before
> MimeUtil returns a final value.
> 
> It would also seem simple enough to pass these hints to Tika with the
> following modification to my previously proposed code:
> 
> try {
>   InputStream in = new ByteArrayInputStream(data);
>   Metadata meta = new Metadata();
>   meta.set(Metadata.CONTENT_TYPE, typeName);
>   meta.set(Metadata.RESOURCE_NAME_KEY, url);
>   magicType = this.mimeTypes.detect(in, meta).toString();
>
>   LOG.debug("Magic Type for " + url + " is " + magicType);
> } catch (Exception e) {
>   // Can't complete magic detection
> }
> 
> In fact, I am left wondering why the entire autoResolveContentType in
> MimeUtil.java cannot be replaced by this code, but for now I will be happy
> with a solution that allows me to add rules to tika-mimetypes.xml such that
> these rules get used by Nutch.
> 
> Iain
> 
> 
>> Date: Thu, 16 Apr 2015 00:26:20 +0200
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: Mimetype detection for JSON
>>
>> Hi Iain,
>>
>> that means mime type detection is done exclusively on content
>> without URL and server content type. There are examples where
>> both will definitely add necessary support, cf. NUTCH-1605.
>>
>> Maybe it's best to let Tika improve the mime detectors, there
>> is still some work ongoing, cf. TIKA-1517.
>>
>> It could be an option, instead of a binary mime.type.magic
>> to set a (weighted) hierarchy of heuristics
>>  magic > URL pattern > HTTP content type
>> or just a list of hints to be used.
>>
>> But it's not that easy, because these hints are often used in combination:
>> a zip file (by signature) with the extension .xlsx is likely to be an Excel
>> Office Open XML spreadsheet. JSON is similar or even worse:
>> a '{' 0x7B in position 0 is only a little hint:
>> - could be also '[' (but less likely)
>> - also RTF has a '{' in position 0
>>
>> Sebastian
>>
>>
>> On 04/15/2015 02:05 PM, Iain Lopata wrote:
>>> The following change to MimeUtil.java seems to solve my problem:
>>>
>>> // magicType = tika.detect(data);
>>> try {
>>>   InputStream in = new ByteArrayInputStream(data);
>>>   Metadata meta = new Metadata();
>>>   magicType = this.mimeTypes.detect(in, meta).toString();
>>>   LOG.debug("Magic Type for " + url + " is " + magicType);
>>> } catch (Exception e) {
>>>   // Can't complete magic detection
>>> }
>>>
>>> However, my confidence that I haven’t broken something else is modest at 
>>> best.
>>>
>>> If this looks like a bug I am happy to create the JIRA entry and submit 
>>> this as a patch, but before I do so can you tell me if this looks sensible?
>>>
>>> -----Original Message-----
>>> From: Iain Lopata [mailto:[email protected]] 
>>> Sent: Tuesday, April 14, 2015 8:43 PM
>>> To: [email protected]
>>> Subject: RE: Mimetype detection for JSON
>>>
>>> It seems to me that setting tika-mimetypes.xml in the Nutch configuration 
>>> causes MimeUtil.java to use the specified file for initial lookup and for 
>>> URL resolution.  However, when it comes to magic detection, the 
>>> tika-mimetypes.xml file in the Tika jar file seems to be used instead.  
>>>
>>> If I update the Tika jar with my match rule it works perfectly. If I only 
>>> place the updated tika-mimetypes.xml file in my Nutch configuration 
>>> directory, the magic detection does not use my match rule.
>>>
>>> Can anyone familiar with the Tika implementation tell me if there is a way 
>>> to update Nutch's MimeUtil.java to instantiate Tika to use the 
>>> configuration file from Nutch?  Or would it be better just to update the 
>>> configuration file in the Tika jar?
>>>
>>> -----Original Message-----
>>> From: Iain Lopata [mailto:[email protected]]
>>> Sent: Tuesday, April 14, 2015 5:32 PM
>>> To: [email protected]
>>> Subject: RE: Mimetype detection for JSON
>>>
>>> Thanks Sebastian.
>>>
>>> mime.type.magic is true.
>>>
>>> I don’t have control over the web server, so cannot test with 
>>> application/javascript
>>>
>>> Time for some deeper debugging it seems.  Will update the list with 
>>> findings.
>>>
>>> -----Original Message-----
>>> From: Sebastian Nagel [mailto:[email protected]]
>>> Sent: Tuesday, April 14, 2015 4:09 PM
>>> To: [email protected]
>>> Subject: Re: Mimetype detection for JSON
>>>
>>> Hi Iain,
>>>
>>>> I have copied tika-mimetypes.xml from the tika jar file and installed 
>>>> a copy in my configuration directory.  I have updated nutch-site.xml 
>>>> to point to this file and the log entries indicate that this is being 
>>>> found.
>>>
>>> ... and the property mime.type.magic is true (default)?
>>>
>>>
>>>> <mime-type type="application/json">
>>>>           <sub-class-of type="application/javascript"/>
>>>
>>> Just as a trial: What happens if you make the web server return 
>>> "application/javascript"
>>> as content type?
>>>
>>>
>>>> I am still getting the content type detected as text/html and the json 
>>>> parser is not being invoked.  Any suggestions as to what to look at next?
>>>
>>> The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the 
>>> following resources to Tika:
>>> - byte stream for magic detection
>>> - URL for additional file name patterns
>>> - content type sent by server
>>> URL and server content type are required as additional hints, e.g., for zip 
>>> containers such as .xlsx, etc.
>>>
>>> I fear that you have to run a debugger to find out what is going wrong.
>>> I would also run first Tika alone with the modified tika-mimetypes.xml, 
>>> just to make sure that the mime magic works as expected.
>>>
>>> Cheers,
>>> Sebastian
>>>
>>> On 04/13/2015 04:26 PM, Iain Lopata wrote:
>>>> I have a page that I am fetching that contains JSON and I have a 
>>>> plugin for parsing JSON.
>>>>
>>>>  
>>>>
>>>> The server sets a mimetype of "text/html" and consequently my json 
>>>> parser does not get invoked.
>>>>
>>>>  
>>>>
>>>> If I run parsechecker from the command line and specify -forceAs 
>>>> "application/json" the json parser is invoked and works successfully.
>>>>
>>>>  
>>>>
>>>> So, I believe that if I can get tika to give me "application/json" as 
>>>> the detected content type for this page, it should work during a crawl.
>>>>
>>>>  
>>>>
>>>> I have copied tika-mimetypes.xml from the tika jar file and installed 
>>>> a copy in my configuration directory.  I have updated nutch-site.xml 
>>>> to point to this file and the log entries indicate that this is being 
>>>> found.
>>>>
>>>>  
>>>>
>>>> In my copy of tika-mimetypes.xml I have added the match rule shown 
>>>> below
>>>>
>>>>  
>>>>
>>>> <mime-type type="application/json">
>>>>   <sub-class-of type="application/javascript"/>
>>>>   <magic priority="100">
>>>>     <match value="{" type="string" offset="0"/>
>>>>   </magic>
>>>>   <glob pattern="*.json"/>
>>>> </mime-type>
>>>>
>>>>  
>>>>
>>>> I know that my match is much too broad, but I am using this just while 
>>>> trying to resolve this problem.
>>>>
>>>>  
>>>>
>>>> I have also set lang.extraction.policy to identify in nutch-site.xml 
>>>> (again primarily for testing purposes).
>>>>
>>>>  
>>>>
>>>> I am still getting the content type detected as text/html and the json 
>>>> parser is not being invoked.  Any suggestions as to what to look at next?
>>>>
>>>>  
>>>>
>>>> Thanks!
>>>>
>>>>  
>>>>
>>>> Iain
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>                                         
>
