Glad we are on the same page now. It is so much harder to discuss code over e-mail than in person! I will open the report in JIRA shortly. Thanks for sticking with me through this discussion!
> Date: Fri, 17 Apr 2015 00:07:08 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: Mimetype detection for JSON
>
> Hi Iain,
>
> > It would also seem simple enough to pass these hints to Tika
> > with the following modification to my previously proposed code:
> > try {
> >   InputStream in = new ByteArrayInputStream(data);
> >   Metadata meta = new Metadata();
> >   meta.set(Metadata.CONTENT_TYPE,typeName);
> >   meta.set(Metadata.RESOURCE_NAME_KEY,url);
> >   magicType = this.mimeTypes.detect(in, meta).toString();
>
> Ok, that would be quite close to the current state in trunk:
>
>   if (this.mimeMagic) {
>     String magicType = null;
>     // pass URL (file name) and (cleansed) content type from protocol to Tika
>     Metadata tikaMeta = new Metadata();
>     tikaMeta.add(Metadata.RESOURCE_NAME_KEY, url);
>     tikaMeta.add(Metadata.CONTENT_TYPE,
>         (cleanedMimeType != null ? cleanedMimeType : typeName));
>     try {
>       InputStream stream = TikaInputStream.get(data);
>       try {
>         magicType = tika.detect(stream, tikaMeta);
>       } finally {
>
> Sorry, looks like I've overlooked this little difference:
>
>   magicType = tika.detect(stream, tikaMeta);
> vs.
>   magicType = this.mimeTypes.detect(in, meta).toString();
>
> Is this correct?
> Seems plausible because only "mimeTypes" is explicitly configured to use the
> mime.types.file, while "tika" is not, cf. MimeUtil constructor.
>
> > In fact, I am left wondering why the entire autoResolveContentType in
> > MimeUtil.java cannot be replaced by this code
>
> Yes, of course. Please, open a Jira. That's a bug, definitely.
>
> Thanks,
> Sebastian
>
>
> On 04/16/2015 04:13 PM, Iain Lopata wrote:
> > Sebastian,
> >
> > I am not sure I understand your response.
> >
> > While you are correct that the call to the detect method in my revised code
> > below only uses the content, in the broader context of MimeUtil.java the
> > mime type returned by the server and the filename are both considered
> > before MimeUtil returns a final value.
> >
> > It would also seem simple enough to pass these hints to Tika with the
> > following modification to my previously proposed code:
> >
> > try {
> >   InputStream in = new ByteArrayInputStream(data);
> >   Metadata meta = new Metadata();
> >   meta.set(Metadata.CONTENT_TYPE,typeName);
> >   meta.set(Metadata.RESOURCE_NAME_KEY,url);
> >   magicType = this.mimeTypes.detect(in, meta).toString();
> >
> >   LOG.debug("Magic Type for" + url + " is " + magicType);
> >
> > } catch (Exception e) {
> >   //Can't complete magic detection
> > }
> >
> > In fact, I am left wondering why the entire autoResolveContentType in
> > MimeUtil.java cannot be replaced by this code, but for now I will be happy
> > with a solution that allows me to add rules to tika-mimetypes.xml such that
> > these rules get used by Nutch.
> >
> > Iain
> >
> >
> >> Date: Thu, 16 Apr 2015 00:26:20 +0200
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: Re: Mimetype detection for JSON
> >>
> >> Hi Iain,
> >>
> >> that means mime type detection is done exclusively on content
> >> without URL and server content type. There are examples where
> >> both will definitely add necessary support, cf. NUTCH-1605.
> >>
> >> Maybe it's best to let Tika improve the mime detectors, there
> >> is still some work ongoing, cf. TIKA-1517.
> >>
> >> It could be an option, instead of a binary mime.type.magic
> >> to set a (weighted) hierarchy of heuristics
> >>   magic > URL pattern > HTTP content type
> >> or just a list of hints to be used.
> >>
> >> But it's not as easy because often these are used in combination:
> >> a zip file by signature with extension .xlsx is likely to be an Excel
> >> Office Open XML spreadsheet.
> >> JSON is similar or even worse:
> >> a '{' 0x7B in position 0 is only a little hint:
> >> - could be also '[' (but less likely)
> >> - also RTF has a '{' in position 0
> >>
> >> Sebastian
> >>
> >>
> >> On 04/15/2015 02:05 PM, Iain Lopata wrote:
> >>> The following change to MimeUtil.java seems to solve my problem:
> >>>
> >>> // magicType = tika.detect(data);
> >>> try {
> >>>   InputStream in = new ByteArrayInputStream(data);
> >>>   Metadata meta = new Metadata();
> >>>   magicType = this.mimeTypes.detect(in, meta).toString();
> >>>   LOG.debug("Magic Type for" + url + " is " + magicType);
> >>> } catch (Exception e) {
> >>>   //Can't complete magic detection
> >>> }
> >>>
> >>> However, my confidence that I haven’t broken something else is modest at
> >>> best.
> >>>
> >>> If this looks like a bug I am happy to create the JIRA entry and submit
> >>> this as a patch, but before I do so can you tell me if this looks
> >>> sensible?
> >>>
> >>> -----Original Message-----
> >>> From: Iain Lopata [mailto:[email protected]]
> >>> Sent: Tuesday, April 14, 2015 8:43 PM
> >>> To: [email protected]
> >>> Subject: RE: Mimetype detection for JSON
> >>>
> >>> It seems to me that setting tika-mimetypes.xml in the Nutch configuration
> >>> causes MimeUtil.java to use the specified file for initial lookup and for
> >>> URL resolution. However, when it comes to magic detection, the
> >>> tika-mimetypes.xml file in the Tika jar file seems to be used instead.
> >>>
> >>> If I update the Tika jar with my match rule it works perfectly. If I only
> >>> place the updated tika-mimetypes.xml file in my Nutch configuration
> >>> directory, the magic detection does not use my match rule.
> >>>
> >>> Can anyone familiar with the Tika implementation tell me if there is a
> >>> way to update Nutch's MimeUtil.java to instantiate Tika to use the
> >>> configuration file from Nutch? Or would it be better just to update the
> >>> configuration file in the Tika jar?
> >>>
> >>> -----Original Message-----
> >>> From: Iain Lopata [mailto:[email protected]]
> >>> Sent: Tuesday, April 14, 2015 5:32 PM
> >>> To: [email protected]
> >>> Subject: RE: Mimetype detection for JSON
> >>>
> >>> Thanks Sebastian.
> >>>
> >>> mime.type.magic is true.
> >>>
> >>> I don’t have control over the web server, so cannot test with
> >>> application/javascript
> >>>
> >>> Time for some deeper debugging it seems. Will update the list with
> >>> findings.
> >>>
> >>> -----Original Message-----
> >>> From: Sebastian Nagel [mailto:[email protected]]
> >>> Sent: Tuesday, April 14, 2015 4:09 PM
> >>> To: [email protected]
> >>> Subject: Re: Mimetype detection for JSON
> >>>
> >>> Hi Iain,
> >>>
> >>>> I have copied tika-mimetypes.xml from the tika jar file and installed
> >>>> a copy in my configuration directory. I have updated nutch-site.xml
> >>>> to point to this file and the log entries indicate that this is being
> >>>> found.
> >>>
> >>> ... and the property mime.type.magic is true (default)?
> >>>
> >>>
> >>>> <mime-type type="application/json">
> >>>> <sub-class-of type="application/javascript"/>
> >>>
> >>> Just as a trial: What happens if you make the web server return
> >>> "application/javascript"
> >>> as content type?
> >>>
> >>>
> >>>> I am still getting the content type detected as text/html and the json
> >>>> parser is not being invoked. Any suggestions as to what to look at next?
> >>>
> >>> The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the
> >>> following resources to Tika:
> >>> - byte stream for magic detection
> >>> - URL for additional file name patterns
> >>> - content type sent by server
> >>> URL and server content type are required as additional hints, e.g., for
> >>> zip containers such as .xlsx, etc.
> >>>
> >>> I fear that you have to run a debugger to find out what is going wrong.
> >>> I would also run first Tika alone with the modified tika-mimetypes.xml,
> >>> just to make sure that the mime magic works as expected.
> >>>
> >>> Cheers,
> >>> Sebastian
> >>>
> >>> On 04/13/2015 04:26 PM, Iain Lopata wrote:
> >>>> I have a page that I am fetching that contains JSON and I have a
> >>>> plugin for parsing JSON.
> >>>>
> >>>> The server sets a mimetype of "text/html" and consequently my json
> >>>> parser does not get invoked.
> >>>>
> >>>> If I run parsechecker from the command line and specify -forceAs
> >>>> "application/json" the json parser is invoked and works successfully.
> >>>>
> >>>> So, I believe that if I can get tika to give me "application/json" as
> >>>> the detected content type for this page, it should work during a crawl.
> >>>>
> >>>> I have copied tika-mimetypes.xml from the tika jar file and installed
> >>>> a copy in my configuration directory. I have updated nutch-site.xml
> >>>> to point to this file and the log entries indicate that this is being
> >>>> found.
> >>>>
> >>>> In my copy of tika-mimetypes.xml I have added the match rule shown
> >>>> below
> >>>>
> >>>> <mime-type type="application/json">
> >>>>   <sub-class-of type="application/javascript"/>
> >>>>   <magic priority="100">
> >>>>     <match value="{" type="string" offset="0"/>
> >>>>   </magic>
> >>>>   <glob pattern="*.json"/>
> >>>> </mime-type>
> >>>>
> >>>> I know that my match is much too broad, but I am using this just while
> >>>> trying to resolve this problem.
> >>>>
> >>>> I have also set lang.extraction.policy to identify in nutch-site.xml
> >>>> (again primarily for testing purposes).
> >>>>
> >>>> I am still getting the content type detected as text/html and the json
> >>>> parser is not being invoked. Any suggestions as to what to look at next?
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Iain
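
For reference, the "hierarchy of heuristics" idea raised earlier in the thread (magic > URL pattern > HTTP content type) could look roughly like the sketch below. This is plain Java with no Tika or Nutch classes, and it is not how MimeUtil or Tika actually work. The class and method names (HintedMimeDetector, fromMagic, fromUrl, fromServer, detect) are hypothetical, and the JSON magic check (matching opening and closing bracket, with RTF ruled out first) is only an assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch: try hints in a fixed order and take the first that answers.
class HintedMimeDetector {

    // Magic hint: answer only when the content is reasonably distinctive.
    static String fromMagic(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8).trim();
        if (text.startsWith("{\\rtf")) {
            return "application/rtf"; // RTF also has '{' in position 0
        }
        // Assumption: a matching bracket pair is a stronger hint than '{' alone.
        boolean object = text.startsWith("{") && text.endsWith("}");
        boolean array = text.startsWith("[") && text.endsWith("]");
        return (object || array) ? "application/json" : null;
    }

    // URL hint: a simple file name pattern, ignoring any query string.
    static String fromUrl(String url) {
        if (url == null) {
            return null;
        }
        String path = url.toLowerCase(Locale.ROOT);
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        return path.endsWith(".json") ? "application/json" : null;
    }

    // Server hint: the Content-Type header with parameters such as charset removed.
    static String fromServer(String contentType) {
        if (contentType == null) {
            return null;
        }
        String cleaned = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return cleaned.isEmpty() ? null : cleaned;
    }

    // Hierarchy: magic > URL pattern > HTTP content type.
    static String detect(byte[] data, String url, String contentType) {
        String[] hints = { fromMagic(data), fromUrl(url), fromServer(contentType) };
        for (String hint : hints) {
            if (hint != null) {
                return hint;
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] body = "{\"a\":1}".getBytes(StandardCharsets.UTF_8);
        // The server claims text/html, but the content hint wins.
        System.out.println(detect(body, "http://example.com/data", "text/html"));
    }
}
```

With the case from this thread (JSON body, server says text/html) the sketch prints application/json. A strict first-match order is the simplest choice; a weighted combination would be needed for cases like zip signature plus .xlsx extension, where no single hint is enough on its own.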

