Hi Iain,

> It would also seem simple enough to pass these hints to Tika
> with the following modification to my previously proposed code:
>
> try {
>   InputStream in = new ByteArrayInputStream(data);
>   Metadata meta = new Metadata();
>   meta.set(Metadata.CONTENT_TYPE, typeName);
>   meta.set(Metadata.RESOURCE_NAME_KEY, url);
>   magicType = this.mimeTypes.detect(in, meta).toString();

Ok, that would be quite close to the current state in trunk:

    if (this.mimeMagic) {
      String magicType = null;
      // pass URL (file name) and (cleansed) content type from protocol to Tika
      Metadata tikaMeta = new Metadata();
      tikaMeta.add(Metadata.RESOURCE_NAME_KEY, url);
      tikaMeta.add(Metadata.CONTENT_TYPE,
          (cleanedMimeType != null ? cleanedMimeType : typeName));
      try {
        InputStream stream = TikaInputStream.get(data);
        try {
          magicType = tika.detect(stream, tikaMeta);
        } finally {

Sorry, looks like I've overlooked this little difference:

    magicType = tika.detect(stream, tikaMeta);

vs.

    magicType = this.mimeTypes.detect(in, meta).toString();

Is this correct?
Seems plausible because only "mimeTypes" is explicitly configured to use the
file set in mime.types.file, while "tika" is not, cf. the MimeUtil constructor.
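
To make that reading concrete, here is a toy model in plain Java. It is not Tika's API: ToyDetector, bundledDefaults() and siteConfigured() are made-up names, and the rules are simple prefix matches. It only shows what happens when two detector objects exist and just one of them has loaded the site-specific rule. Whoever calls the defaults-only object never sees the added '{' rule, which is what the tika vs. mimeTypes difference would amount to if the assumption above holds.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class DetectorMismatchSketch {

    /** Toy stand-in for a magic-rule table (leading string -> mime type). Not Tika's API. */
    static final class ToyDetector {
        private final Map<String, String> magicRules = new LinkedHashMap<>();

        ToyDetector addRule(String prefix, String mimeType) {
            magicRules.put(prefix, mimeType);
            return this;
        }

        String detect(byte[] data) {
            // Only the first few bytes matter for this prefix-based toy.
            int len = Math.min(data.length, 16);
            String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
            for (Map.Entry<String, String> rule : magicRules.entrySet()) {
                if (head.startsWith(rule.getKey())) {
                    return rule.getValue();
                }
            }
            return "application/octet-stream";
        }
    }

    /** Plays the role of "tika" in this analogy: bundled default rules only (hypothetical). */
    static ToyDetector bundledDefaults() {
        return new ToyDetector().addRule("<html", "text/html");
    }

    /** Plays the role of "mimeTypes": defaults plus the rule from the site-specific file. */
    static ToyDetector siteConfigured() {
        return bundledDefaults().addRule("{", "application/json");
    }

    public static void main(String[] args) {
        byte[] data = "{\"a\": 1}".getBytes(StandardCharsets.UTF_8);
        System.out.println("defaults only:   " + bundledDefaults().detect(data));
        System.out.println("site configured: " + siteConfigured().detect(data));
    }
}
```

If this is what happens in MimeUtil, the remedy is either to call the configured object for magic detection or to build both objects from the same rule set.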

> In fact, I am left wondering why the entire autoResolveContentType in
> MimeUtil.java cannot be replaced by this code

Yes, of course. Please open a Jira issue. That's definitely a bug.
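
On the '{' match rule itself (see the quoted discussion further down): a small plain-Java sketch of why a single byte at offset 0 is only a weak hint. It is an illustration only, not how Tika evaluates magic rules. LeadingByteSniff, Guess and sniff() are made-up names. It checks for the RTF signature {\rtf first and otherwise treats a leading '{' or '[' (after optional whitespace) as "maybe JSON", never as a definite answer.

```java
import java.nio.charset.StandardCharsets;

public class LeadingByteSniff {

    enum Guess { RTF, MAYBE_JSON, UNKNOWN }

    static Guess sniff(byte[] data) {
        // RTF documents start with the literal bytes {\rtf at offset 0.
        if (startsWith(data, 0, "{\\rtf")) {
            return Guess.RTF;
        }
        int i = 0;
        // Skip leading whitespace that JSON allows (space, tab, CR, LF).
        while (i < data.length
                && (data[i] == ' ' || data[i] == '\t' || data[i] == '\r' || data[i] == '\n')) {
            i++;
        }
        if (i == data.length) {
            return Guess.UNKNOWN;
        }
        // '{' (object) or '[' (array) is consistent with JSON but proves nothing.
        if (data[i] == '{' || data[i] == '[') {
            return Guess.MAYBE_JSON;
        }
        return Guess.UNKNOWN;
    }

    static boolean startsWith(byte[] data, int offset, String asciiPrefix) {
        byte[] prefix = asciiPrefix.getBytes(StandardCharsets.US_ASCII);
        if (data.length - offset < prefix.length) {
            return false;
        }
        for (int k = 0; k < prefix.length; k++) {
            if (data[offset + k] != prefix[k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = { "{\"a\": 1}", "[1, 2]", "{\\rtf1\\ansi Hello}", "<html></html>" };
        for (String s : samples) {
            System.out.println(sniff(s.getBytes(StandardCharsets.UTF_8)) + "  <-  " + s);
        }
    }
}
```

A "maybe" result like this is where the URL pattern and the server content type would have to break the tie.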
Thanks,
Sebastian
On 04/16/2015 04:13 PM, Iain Lopata wrote:
> Sebastian,
>
> I am not sure I understand your response.
>
> While you are correct that the call to the detect method in my revised code
> below only uses the content, in the broader context of MimeUtil.java the
> mime type returned by the server and the filename are both considered before
> MimeUtil returns a final value.
>
> It would also seem simple enough to pass these hints to Tika with the
> following modification to my previously proposed code:
>
> try {
>   InputStream in = new ByteArrayInputStream(data);
>   Metadata meta = new Metadata();
>   meta.set(Metadata.CONTENT_TYPE, typeName);
>   meta.set(Metadata.RESOURCE_NAME_KEY, url);
>   magicType = this.mimeTypes.detect(in, meta).toString();
>
>   LOG.debug("Magic Type for " + url + " is " + magicType);
>
> } catch (Exception e) {
>   // Can't complete magic detection
> }
>
> In fact, I am left wondering why the entire autoResolveContentType in
> MimeUtil.java cannot be replaced by this code, but for now I will be happy
> with a solution that allows me to add rules to tika-mimetypes.xml such that
> these rules get used by Nutch.
>
> Iain
>
>
>> Date: Thu, 16 Apr 2015 00:26:20 +0200
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: Mimetype detection for JSON
>>
>> Hi Iain,
>>
>> that means mime type detection is done exclusively on content
>> without URL and server content type. There are examples where
>> both will definitely add necessary support, cf. NUTCH-1605.
>>
>> Maybe it's best to let Tika improve the mime detectors, there
>> is still some work ongoing, cf. TIKA-1517.
>>
>> It could be an option to replace the binary mime.type.magic
>> with a (weighted) hierarchy of heuristics
>>   magic > URL pattern > HTTP content type
>> or just a list of hints to be used.
>>
>> But it's not that easy because often these are used in combination:
>> a zip file by signature with extension .xlsx is likely to be an Excel
>> Office Open XML spreadsheet. JSON is similar or even worse:
>> a '{' 0x7B in position 0 is only a little hint:
>> - could be also '[' (but less likely)
>> - also RTF has a '{' in position 0
>>
>> Sebastian
>>
>>
>> On 04/15/2015 02:05 PM, Iain Lopata wrote:
>>> The following change to MimeUtil.java seems to solve my problem:
>>>
>>> // magicType = tika.detect(data);
>>> try {
>>>   InputStream in = new ByteArrayInputStream(data);
>>>   Metadata meta = new Metadata();
>>>   magicType = this.mimeTypes.detect(in, meta).toString();
>>>   LOG.debug("Magic Type for " + url + " is " + magicType);
>>> } catch (Exception e) {
>>>   // Can't complete magic detection
>>> }
>>>
>>> However, my confidence that I haven’t broken something else is modest at
>>> best.
>>>
>>> If this looks like a bug I am happy to create the JIRA entry and submit
>>> this as a patch, but before I do so can you tell me if this looks sensible?
>>>
>>> -----Original Message-----
>>> From: Iain Lopata [mailto:[email protected]]
>>> Sent: Tuesday, April 14, 2015 8:43 PM
>>> To: [email protected]
>>> Subject: RE: Mimetype detection for JSON
>>>
>>> It seems to me that setting tika-mimetypes.xml in the Nutch configuration
>>> causes MimeUtil.java to use the specified file for initial lookup and for
>>> URL resolution. However, when it comes to magic detection, the
>>> tika-mimetypes.xml file in the Tika jar file seems to be used instead.
>>>
>>> If I update the Tika jar with my match rule it works perfectly. If I only
>>> place the updated tika-mimetypes.xml file in my Nutch configuration
>>> directory, the magic detection does not use my match rule.
>>>
>>> Can anyone familiar with the Tika implementation tell me if there is a way
>>> to update Nutch's MimeUtil.java to instantiate Tika to use the
>>> configuration file from Nutch? Or would it be better just to update the
>>> configuration file in the Tika jar?
>>>
>>> -----Original Message-----
>>> From: Iain Lopata [mailto:[email protected]]
>>> Sent: Tuesday, April 14, 2015 5:32 PM
>>> To: [email protected]
>>> Subject: RE: Mimetype detection for JSON
>>>
>>> Thanks Sebastian.
>>>
>>> mime.type.magic is true.
>>>
>>> I don’t have control over the web server, so I cannot test with
>>> application/javascript.
>>>
>>> Time for some deeper debugging it seems. Will update the list with
>>> findings.
>>>
>>> -----Original Message-----
>>> From: Sebastian Nagel [mailto:[email protected]]
>>> Sent: Tuesday, April 14, 2015 4:09 PM
>>> To: [email protected]
>>> Subject: Re: Mimetype detection for JSON
>>>
>>> Hi Iain,
>>>
>>>> I have copied tika-mimetypes.xml from the tika jar file and installed
>>>> a copy in my configuration directory. I have updated nutch-site.xml
>>>> to point to this file and the log entries indicate that this is being
>>>> found.
>>>
>>> ... and the property mime.type.magic is true (default)?
>>>
>>>
>>>> <mime-type type="application/json">
>>>> <sub-class-of type="application/javascript"/>
>>>
>>> Just as a trial: What happens if you make the web server return
>>> "application/javascript"
>>> as content type?
>>>
>>>
>>>> I am still getting the content type detected as text/html and the json
>>>> parser is not being invoked. Any suggestions as to what to look at next?
>>>
>>> The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the
>>> following resources to Tika:
>>> - byte stream for magic detection
>>> - URL for additional file name patterns
>>> - content type sent by server
>>> URL and server content type are required as additional hints, e.g., for zip
>>> containers such as .xlsx, etc.
>>>
>>> I fear that you have to run a debugger to find out what is going wrong.
>>> I would also first run Tika alone with the modified tika-mimetypes.xml,
>>> just to make sure that the mime magic works as expected.
>>>
>>> Cheers,
>>> Sebastian
>>>
>>> On 04/13/2015 04:26 PM, Iain Lopata wrote:
>>>> I have a page that I am fetching that contains JSON and I have a
>>>> plugin for parsing JSON.
>>>>
>>>> The server sets a mimetype of "text/html" and consequently my json
>>>> parser does not get invoked.
>>>>
>>>> If I run parsechecker from the command line and specify -forceAs
>>>> "application/json" the json parser is invoked and works successfully.
>>>>
>>>> So, I believe that if I can get tika to give me "application/json" as
>>>> the detected content type for this page, it should work during a crawl.
>>>>
>>>> I have copied tika-mimetypes.xml from the tika jar file and installed
>>>> a copy in my configuration directory. I have updated nutch-site.xml
>>>> to point to this file and the log entries indicate that this is being
>>>> found.
>>>>
>>>> In my copy of tika-mimetypes.xml I have added the match rule shown
>>>> below:
>>>>
>>>> <mime-type type="application/json">
>>>>   <sub-class-of type="application/javascript"/>
>>>>   <magic priority="100">
>>>>     <match value="{" type="string" offset="0"/>
>>>>   </magic>
>>>>   <glob pattern="*.json"/>
>>>> </mime-type>
>>>>
>>>> I know that my match is much too broad, but I am using this just while
>>>> trying to resolve this problem.
>>>>
>>>> I have also set lang.extraction.policy to identify in nutch-site.xml
>>>> (again primarily for testing purposes).
>>>>
>>>> I am still getting the content type detected as text/html and the json
>>>> parser is not being invoked. Any suggestions as to what to look at next?
>>>>
>>>> Thanks!
>>>>
>>>> Iain
>>>>
>>>
>>>
>>>
>>>
>>
>
>