RE: Mimetype detection for JSON

Iain Lopata Thu, 16 Apr 2015 15:30:56 -0700

Glad we are on the same page now.  So much harder to discuss code over e-mail 
than in person!!  I will open the report in JIRA shortly.  Thanks for sticking 
with me through this discussion!


> Date: Fri, 17 Apr 2015 00:07:08 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: Mimetype detection for JSON
> 
> Hi Iain,
> 
> > > It would also seem simple enough to pass these hints to Tika
> > with the following modification to my previously proposed code:
> > try {
> >   InputStream in = new ByteArrayInputStream(data);
> >   Metadata meta = new Metadata();
> >   meta.set(Metadata.CONTENT_TYPE,typeName);
> >   meta.set(Metadata.RESOURCE_NAME_KEY,url);
> >   magicType = this.mimeTypes.detect(in, meta).toString();
> 
> 
> Ok, that would be quite close to the current state in trunk:
> 
>     if (this.mimeMagic) {
>       String magicType = null;
>       // pass URL (file name) and (cleansed) content type from protocol to Tika
>       Metadata tikaMeta = new Metadata();
>       tikaMeta.add(Metadata.RESOURCE_NAME_KEY, url);
>       tikaMeta.add(Metadata.CONTENT_TYPE,
>           (cleanedMimeType != null ? cleanedMimeType : typeName));
>       try {
>         InputStream stream = TikaInputStream.get(data);
>         try {
>           magicType = tika.detect(stream, tikaMeta);
>         } finally {
> 
> 
> Sorry, looks like I've overlooked this little difference:
> 
>  magicType = tika.detect(stream, tikaMeta);
> vs.
>  magicType = this.mimeTypes.detect(in, meta).toString();
> 
> 
> Is this correct?
> Seems plausible because only "mimeTypes" is explicitly configured to use the
> mime.types.file, while "tika" is not, cf. MimeUtil constructor.
> 
> > In fact, I am left wondering why the entire autoResolveContentType in
> > MimeUtil.java cannot be replaced by this code
> 
> Yes, of course. Please, open a Jira. That's a bug, definitely.
> 
> Thanks,
> Sebastian
> 
> 
> On 04/16/2015 04:13 PM, Iain Lopata wrote:
> > Sebastian,
> > 
> > I am not sure I understand your response.
> > 
> > While you are correct that the call to the detect method in my revised code 
> > below only uses the content, in the broader context of MimeUtil.java both 
> > the mime type returned by the server and the filename are both considered 
> > before MimeUtil returns a final value.
> > 
> > It would also seem simple enough to pass these hints to Tika with the 
> > following modification to my previously proposed code:
> > 
> > try {
> >   InputStream in = new ByteArrayInputStream(data);
> >   Metadata meta = new Metadata();
> >   meta.set(Metadata.CONTENT_TYPE, typeName);
> >   meta.set(Metadata.RESOURCE_NAME_KEY, url);
> >   magicType = this.mimeTypes.detect(in, meta).toString();
> >   LOG.debug("Magic Type for " + url + " is " + magicType);
> > } catch (Exception e) {
> >   // Can't complete magic detection
> > }
> > 
> > In fact, I am left wondering why the entire autoResolveContentType in 
> > MimeUtil.java cannot be replaced by this code, but for now I will be happy 
> > with a solution that allows me to add rules to tika-mimetypes.xml such that 
> > these rules get used by Nutch.
> > 
> > Iain
> > 
> > 
> >> Date: Thu, 16 Apr 2015 00:26:20 +0200
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: Re: Mimetype detection for JSON
> >>
> >> Hi Iain,
> >>
> >> that means mime type detection is done exclusively on content
> >> without URL and server content type. There are examples where
> >> both will definitely add necessary support, cf. NUTCH-1605.
> >>
> >> Maybe it's best to let Tika improve the mime detectors, there
> >> is still some work ongoing, cf. TIKA-1517.
> >>
> >> It could be an option, instead of a binary mime.type.magic
> >> to set a (weighted) hierarchy of heuristics
> >>  magic > URL pattern > HTTP content type
> >> or just a list of hints to be used.
> >>
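The weighted hierarchy of heuristics suggested above (magic > URL pattern > HTTP content type) could be pictured as a small voting scheme. This is only a sketch under stated assumptions: the class MimeHintResolver, the method resolve and the three weights are hypothetical and are not part of Nutch or Tika. Each hint that is present votes for its type with a fixed weight, and the type with the highest total wins.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch or Tika.
class MimeHintResolver {
    // Assumed weights that follow "magic > URL pattern > HTTP content type".
    static final double MAGIC_WEIGHT = 3.0;
    static final double URL_WEIGHT = 2.0;
    static final double HEADER_WEIGHT = 1.5;

    /**
     * Returns the type with the highest summed weight, or null if no hint
     * is available. A null argument means "this heuristic had no opinion".
     */
    static String resolve(String magicType, String urlType, String headerType) {
        Map<String, Double> scores = new LinkedHashMap<>();
        vote(scores, magicType, MAGIC_WEIGHT);
        vote(scores, urlType, URL_WEIGHT);
        vote(scores, headerType, HEADER_WEIGHT);

        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            // Strict comparison: on a tie the hint that voted first wins.
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }

    private static void vote(Map<String, Double> scores, String type, double weight) {
        if (type != null) {
            scores.merge(type, weight, Double::sum);
        }
    }
}
```

With these assumed weights a magic match alone outranks the server header, but a URL pattern and a header that agree with each other (2.0 + 1.5) can still outvote the magic result (3.0). A plain list of hints, the other option mentioned above, would correspond to giving every hint the same weight.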
> >> But it's not as easy because often these are used in combination:
> >> a zip file by signature with extension .xlsx is likely to be an Excel
> >> Office Open XML spreadsheet. JSON is similar or even worse:
> >> a '{' 0x7B in position 0 is only a little hint:
> >> - could be also '[' (but less likely)
> >> - also RTF has a '{' in position 0
> >>
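The ambiguity of a leading '{' can be made concrete with a few lines of plain Java. This is a sketch, not how Tika implements magic matching: the class LeadingByteSniffer and its sniff method are hypothetical, and ignoring leading whitespace is an extra assumption on top of the "position 0" rule discussed above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch or Tika.
class LeadingByteSniffer {
    /**
     * Guesses a type from the first bytes only. Returns null when the
     * leading bytes give no hint. The JSON result is a weak guess that
     * other hints (URL, server content type) should confirm.
     */
    static String sniff(byte[] data) {
        int length = Math.min(data.length, 64);
        // Assumption: surrounding whitespace is ignored before matching.
        String head = new String(data, 0, length, StandardCharsets.US_ASCII).trim();

        // RTF also starts with '{', so it has to be ruled out first.
        if (head.startsWith("{\\rtf")) {
            return "application/rtf";
        }
        // '{' (object) or, less likely, '[' (array) may indicate JSON.
        if (head.startsWith("{") || head.startsWith("[")) {
            return "application/json";
        }
        return null;
    }
}
```

The order of the checks matters: testing for '{' before the RTF prefix would label every RTF document as JSON, which is the kind of overly broad match the thread warns about.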
> >> Sebastian
> >>
> >>
> >> On 04/15/2015 02:05 PM, Iain Lopata wrote:
> >>> The following change to MimeUtil.java seems to solve my problem:
> >>>
> >>> // magicType = tika.detect(data);
> >>> try {
> >>>   InputStream in = new ByteArrayInputStream(data);
> >>>   Metadata meta = new Metadata();
> >>>   magicType = this.mimeTypes.detect(in, meta).toString();
> >>>   LOG.debug("Magic Type for " + url + " is " + magicType);
> >>> } catch (Exception e) {
> >>>   // Can't complete magic detection
> >>> }
> >>>
> >>> However, my confidence that I haven’t broken something else is modest at 
> >>> best.
> >>>
> >>> If this looks like a bug I am happy to create the JIRA entry and submit 
> >>> this as a patch, but before I do so can you tell me if this looks 
> >>> sensible?
> >>>
> >>> -----Original Message-----
> >>> From: Iain Lopata [mailto:[email protected]] 
> >>> Sent: Tuesday, April 14, 2015 8:43 PM
> >>> To: [email protected]
> >>> Subject: RE: Mimetype detection for JSON
> >>>
> >>> It seems to me that setting tika-mimetypes.xml in the Nutch configuration 
> >>> causes MimeUtil.java to use the specified file for initial lookup and for 
> >>> URL resolution.  However, when it comes to magic detection, the 
> >>> tika-mimetypes.xml file in the Tika jar file seems to be used instead.  
> >>>
> >>> If I update the Tika jar with my match rule it works perfectly. If I only 
> >>> place the updated tika-mimetypes.xml file in my Nutch configuration 
> >>> directory, the magic detection does not use my match rule.
> >>>
> >>> Can anyone familiar with the Tika implementation tell me if there is a 
> >>> way to update Nutch's MimeUtil.java to instantiate Tika to use the 
> >>> configuration file from Nutch?  Or would it be better just to update the 
> >>> configuration file in the Tika jar?
> >>>
> >>> -----Original Message-----
> >>> From: Iain Lopata [mailto:[email protected]]
> >>> Sent: Tuesday, April 14, 2015 5:32 PM
> >>> To: [email protected]
> >>> Subject: RE: Mimetype detection for JSON
> >>>
> >>> Thanks Sebastian.
> >>>
> >>> mime.type.magic is true.
> >>>
> >>> I don’t have control over the web server, so cannot test with 
> >>> application/javascript
> >>>
> >>> Time for some deeper debugging it seems.  Will update the list with 
> >>> findings.
> >>>
> >>> -----Original Message-----
> >>> From: Sebastian Nagel [mailto:[email protected]]
> >>> Sent: Tuesday, April 14, 2015 4:09 PM
> >>> To: [email protected]
> >>> Subject: Re: Mimetype detection for JSON
> >>>
> >>> Hi Iain,
> >>>
> >>>> I have copied tika-mimetypes.xml from the tika jar file and installed 
> >>>> a copy in my configuration directory.  I have updated nutch-site.xml 
> >>>> to point to this file and the log entries indicate that this is being 
> >>>> found.
> >>>
> >>> ... and the property mime.type.magic is true (default)?
> >>>
> >>>
> >>>> <mime-type type="application/json">
> >>>>           <sub-class-of type="application/javascript"/>
> >>>
> >>> Just as a trial: What happens if you make the web server return 
> >>> "application/javascript"
> >>> as content type?
> >>>
> >>>
> >>>> I am still getting the content type detected as text/html and the json 
> >>>> parser is not being invoked.  Any suggestions as to what to look at next?
> >>>
> >>> The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the 
> >>> following resources to Tika:
> >>> - byte stream for magic detection
> >>> - URL for additional file name patterns
> >>> - content type sent by server
> >>> URL and server content type are required as additional hints, e.g., for 
> >>> zip containers such as .xlsx, etc.
> >>>
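Why the URL is needed as an additional hint for zip containers can be shown in plain Java. This is a sketch under stated assumptions, not the logic Tika actually uses: the class ZipContainerHint and its methods are hypothetical, and only the .xlsx extension is handled. The byte signature alone can only say "zip"; the file name narrows it down.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch or Tika.
class ZipContainerHint {
    static final String ZIP = "application/zip";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    /** True if the data starts with the zip local file header "PK\3\4". */
    static boolean hasZipSignature(byte[] data) {
        return data.length >= 4
            && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    /**
     * Combines the magic bytes with the URL. Returns null when the content
     * is not a zip container at all.
     */
    static String detect(byte[] data, String url) {
        if (!hasZipSignature(data)) {
            return null;
        }
        String path = url.toLowerCase(Locale.ROOT);
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query); // drop the query string
        }
        // Same signature, more specific type thanks to the file name.
        return path.endsWith(".xlsx") ? XLSX : ZIP;
    }
}
```

For example, the same four leading bytes yield the spreadsheet type for a URL ending in report.xlsx and the generic zip type for one ending in archive.zip.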
> >>> I fear that you have to run a debugger to find out what is going wrong.
> >>> I would also run first Tika alone with the modified tika-mimetypes.xml, 
> >>> just to make sure that the mime magic works as expected.
> >>>
> >>> Cheers,
> >>> Sebastian
> >>>
> >>> On 04/13/2015 04:26 PM, Iain Lopata wrote:
> >>>> I have a page that I am fetching that contains JSON and I have a 
> >>>> plugin for parsing JSON.
> >>>>
> >>>>  
> >>>>
> >>>> The server sets a mimetype of "text/html" and consequently my json 
> >>>> parser does not get invoked.
> >>>>
> >>>>  
> >>>>
> >>>> If I run parsechecker from the command line and specify -forceAs 
> >>>> "application/json" the json parser is invoked and works successfully.
> >>>>
> >>>>  
> >>>>
> >>>> So, I believe that if I can get tika to give me "application/json" as 
> >>>> the detected content type for this page, it should work during a crawl.
> >>>>
> >>>>  
> >>>>
> >>>> I have copied tika-mimetypes.xml from the tika jar file and installed 
> >>>> a copy in my configuration directory.  I have updated nutch-site.xml 
> >>>> to point to this file and the log entries indicate that this is being 
> >>>> found.
> >>>>
> >>>>  
> >>>>
> >>>> In my copy of tika-mimetypes.xml I have added the match rule shown 
> >>>> below
> >>>>
> >>>>  
> >>>>
> >>>> <mime-type type="application/json">
> >>>>   <sub-class-of type="application/javascript"/>
> >>>>   <magic priority="100">
> >>>>     <match value="{" type="string" offset="0"/>
> >>>>   </magic>
> >>>>   <glob pattern="*.json"/>
> >>>> </mime-type>
> >>>>
> >>>>  
> >>>>
> >>>> I know that my match is much too broad, but I am using this just while 
> >>>> trying to resolve this problem.
> >>>>
> >>>>  
> >>>>
> >>>> I have also set lang.extraction.policy to identify in nutch-site.xml 
> >>>> (again primarily for testing purposes).
> >>>>
> >>>>  
> >>>>
> >>>> I am still getting the content type detected as text/html and the json 
> >>>> parser is not being invoked.  Any suggestions as to what to look at next?
> >>>>
> >>>>  
> >>>>
> >>>> Thanks!
> >>>>
> >>>>  
> >>>>
> >>>> Iain
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>
> >                                       
> > 
>
