RE: Mimetype detection for JSON

Iain Lopata Thu, 16 Apr 2015 07:15:10 -0700

Sebastian,

I am not sure I understand your response.


While you are correct that the call to the detect method in my revised code 
below only uses the content, in the broader context of MimeUtil.java the 
mime type returned by the server and the filename are both considered before 
MimeUtil returns a final value.

It would also seem simple enough to pass these hints to Tika with the 
following modification to my previously proposed code:

try {
  InputStream in = new ByteArrayInputStream(data);
  Metadata meta = new Metadata();
  // Pass the server-supplied type and the URL to Tika as hints
  meta.set(Metadata.CONTENT_TYPE, typeName);
  meta.set(Metadata.RESOURCE_NAME_KEY, url);
  magicType = this.mimeTypes.detect(in, meta).toString();

  LOG.debug("Magic Type for " + url + " is " + magicType);
} catch (Exception e) {
  // Can't complete magic detection
}
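
To make the idea of combining hints concrete without depending on Tika, here 
is a minimal standalone sketch. It is not how Tika or MimeUtil actually 
resolve types; it only illustrates one possible precedence (content magic, 
then file name, then the type declared by the server). The class name 
HintSketch and all of its methods are hypothetical, and the JSON check is 
deliberately as broad as the test rule discussed in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only: not Tika's or Nutch's detection algorithm.
public class HintSketch {

    /** Guess a type from the leading bytes, or return null if nothing matches. */
    static String fromMagic(byte[] data) {
        if (data == null) {
            return null;
        }
        String head = new String(data, 0, Math.min(data.length, 16),
                StandardCharsets.ISO_8859_1).trim();
        // RTF also starts with '{', so it has to be ruled out first.
        if (head.startsWith("{\\rtf")) {
            return "application/rtf";
        }
        // Assumption: a leading '{' or '[' means JSON. This is far too broad
        // for real use.
        if (head.startsWith("{") || head.startsWith("[")) {
            return "application/json";
        }
        return null;
    }

    /** Guess a type from the resource name, or return null if nothing matches. */
    static String fromName(String resourceName) {
        if (resourceName == null) {
            return null;
        }
        String lower = resourceName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".json") ? "application/json" : null;
    }

    /** Combine the hints: magic first, then name, then the declared type. */
    static String detect(byte[] data, String declaredType, String resourceName) {
        String type = fromMagic(data);
        if (type == null) {
            type = fromName(resourceName);
        }
        if (type == null) {
            type = declaredType;
        }
        return type != null ? type : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] json = "{\"a\": 1}".getBytes(StandardCharsets.UTF_8);
        // The server claims text/html, but the content hint takes precedence.
        System.out.println(detect(json, "text/html", "http://example.com/data"));
    }
}
```

A fixed precedence like this is easy to follow, but it cannot express cases 
where hints only make sense in combination (a zip signature plus an .xlsx 
extension, for example), which is a reason to hand all of the hints to the 
detector together rather than ranking them.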

In fact, I am left wondering why the entire autoResolveContentType in 
MimeUtil.java cannot be replaced by this code, but for now I will be happy with 
a solution that allows me to add rules to tika-mimetypes.xml such that these 
rules get used by Nutch.

Iain


> Date: Thu, 16 Apr 2015 00:26:20 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: Mimetype detection for JSON
> 
> Hi Iain,
> 
> that means mime type detection is done exclusively on content
> without URL and server content type. There are examples where
> both will definitely add necessary support, cf. NUTCH-1605.
> 
> Maybe it's best to let Tika improve the mime detectors, there
> is still some work ongoing, cf. TIKA-1517.
> 
> It could be an option, instead of a binary mime.type.magic
> to set a (weighted) hierarchy of heuristics
>  magic > URL pattern > HTTP content type
> or just a list of hints to be used.
> 
> But it's not that easy because these are often used in combination:
> a zip file by signature with extension .xlsx is likely to be an Excel
> Office Open XML spreadsheet. JSON is similar or even worse:
> a '{' 0x7B in position 0 is only a little hint:
> - could be also '[' (but less likely)
> - also RTF has a '{' in position 0
> 
> Sebastian
> 
> 
> On 04/15/2015 02:05 PM, Iain Lopata wrote:
> > The following change to MimeUtil.java seems to solve my problem:
> > 
> > //      magicType = tika.detect(data);
> >             try {
> >                     InputStream in = new ByteArrayInputStream(data);
> >                     Metadata meta = new Metadata();
> >                     magicType = this.mimeTypes.detect(in, meta).toString();
> >                     LOG.debug("Magic Type for" + url + " is " + magicType);
> >             } catch (Exception e) {
> >                     //Can't complete magic detection
> >             }
> > 
> > However, my confidence that I haven’t broken something else is modest at 
> > best.
> > 
> > If this looks like a bug I am happy to create the JIRA entry and submit 
> > this as a patch, but before I do so can you tell me if this looks sensible?
> > 
> > -----Original Message-----
> > From: Iain Lopata [mailto:[email protected]] 
> > Sent: Tuesday, April 14, 2015 8:43 PM
> > To: [email protected]
> > Subject: RE: Mimetype detection for JSON
> > 
> > It seems to me that setting tika-mimetypes.xml in the Nutch configuration 
> > causes MimeUtil.java to use the specified file for initial lookup and for 
> > URL resolution.  However, when it comes to magic detection, the 
> > tika-mimetypes.xml file in the Tika jar file seems to be used instead.  
> > 
> > If I update the Tika jar with my match rule it works perfectly. If I only 
> > place the updated tika-mimetypes.xml file in my Nutch configuration 
> > directory, the magic detection does not use my match rule.
> > 
> > Can anyone familiar with the Tika implementation tell me if there is a way 
> > to update Nutch's MimeUtil.java to instantiate Tika to use the 
> > configuration file from Nutch?  Or would it be better just to update the 
> > configuration file in the Tika jar?
> > 
> > -----Original Message-----
> > From: Iain Lopata [mailto:[email protected]]
> > Sent: Tuesday, April 14, 2015 5:32 PM
> > To: [email protected]
> > Subject: RE: Mimetype detection for JSON
> > 
> > Thanks Sebastian.
> > 
> > mime.type.magic is true.
> > 
> > I don’t have control over the web server, so cannot test with 
> > application/javascript.
> > 
> > Time for some deeper debugging it seems.  Will update the list with 
> > findings.
> > 
> > -----Original Message-----
> > From: Sebastian Nagel [mailto:[email protected]]
> > Sent: Tuesday, April 14, 2015 4:09 PM
> > To: [email protected]
> > Subject: Re: Mimetype detection for JSON
> > 
> > Hi Iain,
> > 
> >> I have copied tika-mimetypes.xml from the tika jar file and installed 
> >> a copy in my configuration directory.  I have updated nutch-site.xml 
> >> to point to this file and the log entries indicate that this is being 
> >> found.
> > 
> > ... and the property mime.type.magic is true (default)?
> > 
> > 
> >> <mime-type type="application/json">
> >>           <sub-class-of type="application/javascript"/>
> > 
> > Just as a trial: What happens if you make the web server return 
> > "application/javascript"
> > as content type?
> > 
> > 
> >> I am still getting the content type detected as text/html and the json 
> >> parser is not being invoked.  Any suggestions as to what to look at next?
> > 
> > The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the 
> > following resources to Tika:
> > - byte stream for magic detection
> > - URL for additional file name patterns
> > - content type sent by server
> > URL and server content type are required as additional hints, e.g., for zip 
> > containers such as .xlsx, etc.
> > 
> > I fear that you have to run a debugger to find out what is going wrong.
> > I would also run first Tika alone with the modified tika-mimetypes.xml, 
> > just to make sure that the mime magic works as expected.
> > 
> > Cheers,
> > Sebastian
> > 
> > On 04/13/2015 04:26 PM, Iain Lopata wrote:
> >> I have a page that I am fetching that contains JSON and I have a 
> >> plugin for parsing JSON.
> >>
> >>  
> >>
> >> The server sets a mimetype of "text/html" and consequently my json 
> >> parser does not get invoked.
> >>
> >>  
> >>
> >> If I run parsechecker from the command line and specify -forceAs 
> >> "application/json" the json parser is invoked and works successfully.
> >>
> >>  
> >>
> >> So, I believe that if I can get tika to give me "application/json" as 
> >> the detected content type for this page, it should work during a crawl.
> >>
> >>  
> >>
> >> I have copied tika-mimetypes.xml from the tika jar file and installed 
> >> a copy in my configuration directory.  I have updated nutch-site.xml 
> >> to point to this file and the log entries indicate that this is being 
> >> found.
> >>
> >>  
> >>
> >> In my copy of tika-mimetypes.xml I have added the match rule shown 
> >> below
> >>
> >>  
> >>
> >> <mime-type type="application/json">
> >>           <sub-class-of type="application/javascript"/>
> >>           <magic priority="100">
> >>                   <match value="{" type="string" offset="0"/>
> >>           </magic>
> >>           <glob pattern="*.json"/>
> >>   </mime-type>
> >>
> >>  
> >>
> >> I know that my match is much too broad, but I am using this just while 
> >> trying to resolve this problem.
> >>
> >>  
> >>
> >> I have also set lang.extraction.policy to identify in nutch-site.xml 
> >> (again primarily for testing purposes).
> >>
> >>  
> >>
> >> I am still getting the content type detected as text/html and the json 
> >> parser is not being invoked.  Any suggestions as to what to look at next?
> >>
> >>  
> >>
> >> Thanks!
> >>
> >>  
> >>
> >> Iain
> >>
> >>
> > 
> > 
> > 
> > 
>
