Hi madmurphy,

The 'correct' place for GNU libextractor discussions would be
https://lists.gnu.org/mailman/listinfo/libextractor

Alas, with my libextractor maintainer hat on, I would say this:

On 2/7/22 10:01 PM, madmurphy wrote:
> Hi again, GNUnet people.
>
> Is this the place where to discuss about libextractor? I have two points.
>
> #1 I often see something interesting. Key-value pairs are categorized as
> |EXTRACTOR_METATYPE_UNKNOWN|:
>
>     unknown: chroma-format=4:2:0
>     unknown: bit-depth-chroma=8
>     unknown: colorimetry=bt709
>     unknown: stream-format=avc
>     unknown: stream-format=raw
>     unknown: bit-depth-luma=8
>     unknown: base-profile=lc
>     unknown: mpegversion=4
>     unknown: profile=high
>     unknown: alignment=au
>     unknown: parsed=true
>     unknown: framed=true
>     unknown: variant=iso
>     unknown: profile=lc
>     unknown: level=4.1
>
> But one point is that they are often numerous, and another point is that
> that of a key-value type is a really interesting metatype to have (and
> is not really “unknown”, since the key is self-explanatory). Would it
> not make sense to add an |EXTRACTOR_METATYPE_KEY_VALUE_PAIR| to the list
> of MetaTypes?

We could do that. Sometimes I think it would be better to add new
specific LE types for some of the above, but until that is done, a
key-value pair type would at least be better than 'unknown'.

>     ...
>
>     /* generic attributes */
>     EXTRACTOR_METATYPE_UNKNOWN = 45,
>     EXTRACTOR_METATYPE_DESCRIPTION = 46,
>     EXTRACTOR_METATYPE_COPYRIGHT = 47,
>     EXTRACTOR_METATYPE_RIGHTS = 48,
>     EXTRACTOR_METATYPE_KEYWORDS = 49,
>     EXTRACTOR_METATYPE_ABSTRACT = 50,
>     EXTRACTOR_METATYPE_SUMMARY = 51,
>     EXTRACTOR_METATYPE_SUBJECT = 52,
>     EXTRACTOR_METATYPE_CREATOR = 53,
>     EXTRACTOR_METATYPE_FORMAT = 54,
>     EXTRACTOR_METATYPE_FORMAT_VERSION = 55,
>     *EXTRACTOR_METATYPE_KEY_VALUE_PAIR* = XXX,
>
>     ...
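Until such a type exists, a consumer can split these items on its own.
A minimal sketch in plain C, assuming the values keep the 'key=value'
shape shown in the 'unknown:' lines; split_key_value is a hypothetical
helper for illustration, not a libextractor function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of libextractor.
 * Splits "key=value" at the first '='. The key is copied into key_buf
 * (0-terminated); the returned pointer refers to the value inside item.
 * Returns NULL if there is no '=', the key is empty, or the key does
 * not fit into key_buf. */
const char *
split_key_value (const char *item, char *key_buf, size_t key_buf_len)
{
  const char *eq;
  size_t key_len;

  assert (NULL != item);
  assert (NULL != key_buf);
  eq = strchr (item, '=');
  if (NULL == eq)
    return NULL;
  key_len = (size_t) (eq - item);
  if ( (0 == key_len) || (key_len >= key_buf_len) )
    return NULL;
  memcpy (key_buf, item, key_len);
  key_buf[key_len] = '\0';
  return eq + 1;
}
```

For "chroma-format=4:2:0" this yields the key "chroma-format" and the
value "4:2:0"; items without a '=' are rejected.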
>
> #2 I often see that files get tagged with multiple mime types according
> to libextractor:
>
>     mimetype: video/quicktime
>     mimetype: video/x-h264
>     mimetype: audio/mpeg
>     mimetype: video/mp4

That is because different plugins (using different methods/libraries)
disagree on the 'correct' mime-type. Ideally, we'd identify which plugin
gets it wrong (and why), and unify the mime-types.

> But that never reflects the reality, since files should have only one
> mime type (or at most, multiple mime types that mean the same thing).
> But then I see what happens with file names: there is only one
> |EXTRACTOR_METATYPE_GNUNET_ORIGINAL_FILENAME|, but there can be many
> |EXTRACTOR_METATYPE_FILENAME|s (in the case of archives, for example):
>
>     EXTRACTOR_METATYPE_FILENAME = 2,
>     ...
>     EXTRACTOR_METATYPE_GNUNET_ORIGINAL_FILENAME = 180,
>
> Would it not make sense to do something similar for mime types? Only one
> “original mime type”, and an infinity of secondary mime types…?
>
>     EXTRACTOR_METATYPE_MIMETYPE = 1,
>     ...
>     *EXTRACTOR_METATYPE_GNUNET_ORIGINAL_MIMETYPE* = XXX,

I guess it depends. If this is for archives where files _inside_ the
archive are given mime-types, then a different metatype makes sense
(ditto for FILENAME: here we probably could have two types, one for the
'archive' and one for the 'contents'). But if the different mime-types
are all about the 'original' file, then we should rather figure out
which plugin gets it wrong.

As for the "_GNUNET_" in the "_GNUNET_ORIGINAL_FILENAME" there, IIRC
this is again different because that is NOT a metatype used by GNU
libextractor, but one that GNUnet itself generates and puts with the
'rest' of the metadata.

> So, two simple proposals:
>
> 1. Create |EXTRACTOR_METATYPE_KEY_VALUE_PAIR|
> 2. Create |EXTRACTOR_METATYPE_GNUNET_ORIGINAL_MIMETYPE|
>
> What do you think? Does it make sense?

It should definitely not be "GNUNET_ORIGINAL_MIMETYPE", and the real
question is what is the origin of the different mime-types.
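One way to trace that origin is to record which plugin reported which
mime-type and check whether the reports differ. A rough sketch in plain
C; struct MimeLog, mime_log_add and mime_log_disagrees are hypothetical
names for illustration, not libextractor API, and feeding them from the
metadata callback (which, as far as I recall, is passed the plugin name)
is an assumption left out here:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_REPORTS 16

/* Hypothetical bookkeeping types, NOT part of libextractor. */
struct MimeReport
{
  const char *plugin_name; /* who reported it */
  const char *mime_type;   /* what was reported */
};

struct MimeLog
{
  struct MimeReport reports[MAX_REPORTS];
  size_t count;
};

/* Record one report. Only the pointers are stored, so the strings must
 * outlive the log. Returns 0 on success, -1 if the log is full. */
int
mime_log_add (struct MimeLog *mlog,
              const char *plugin_name,
              const char *mime_type)
{
  assert (NULL != mlog);
  if (mlog->count >= MAX_REPORTS)
    return -1;
  mlog->reports[mlog->count].plugin_name = plugin_name;
  mlog->reports[mlog->count].mime_type = mime_type;
  mlog->count++;
  return 0;
}

/* Returns 1 if any report differs from the first one, 0 otherwise
 * (including for an empty log). */
int
mime_log_disagrees (const struct MimeLog *mlog)
{
  assert (NULL != mlog);
  for (size_t i = 1; i < mlog->count; i++)
    if (0 != strcmp (mlog->reports[0].mime_type,
                     mlog->reports[i].mime_type))
      return 1;
  return 0;
}
```

Once the log shows a disagreement, the stored plugin names tell us which
plugins to look at first.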
If this is from an archive, maybe we should introduce

EXTRACTOR_METATYPE_ARCHIVE_CONTENT_FILENAME
EXTRACTOR_METATYPE_ARCHIVE_CONTENT_MIMETYPE

and reserve

EXTRACTOR_METATYPE_FILENAME
EXTRACTOR_METATYPE_MIMETYPE

for the top-level file. But AFAIK that won't solve your mime-type issue,
which should really be resolved by going over the plugins, finding out
why and where they disagree, and picking the 'right' answer.

My 2 cents

Christian
