Got it! I agree about your solution for the duplicate mime types.

> but until that is done, a key-value pair type would at least be better than 'unknown'.

“Unknown” can continue to exist as an identifier for other cases, just not the key-value ones :)

Also I forgot to mention a third point:

3. Add an EXTRACTOR_METATYPE_NO_METATYPE = -1 to enum EXTRACTOR_MetaType (more or less like NULL if that was a pointer).

Without an EXTRACTOR_METATYPE_NO_METATYPE a programmer is forced to save the have_metatype information in another variable. The fact that it is a negative number is not a problem, because as the name suggests, *it is not a metatype*.

P.S. Sorry for picking the wrong mailing list!

On Tue, Feb 8, 2022 at 9:57 AM Christian Grothoff <groth...@gnunet.org> wrote:

> Hi madmurphy,
>
> The 'correct' place for GNU libextractor discussions would be
>
> https://lists.gnu.org/mailman/listinfo/libextractor
>
> Alas, with my libextractor maintainer hat on, I would say this:
>
> On 2/7/22 10:01 PM, madmurphy wrote:
> > Hi again, GNUnet people.
> >
> > Is this the place where to discuss about libextractor? I have two points.
> >
> > #1 I often see something interesting. Key-value pairs are categorized as
> > |EXTRACTOR_METATYPE_UNKNOWN|:
> >
> >     unknown: chroma-format=4:2:0
> >     unknown: bit-depth-chroma=8
> >     unknown: colorimetry=bt709
> >     unknown: stream-format=avc
> >     unknown: stream-format=raw
> >     unknown: bit-depth-luma=8
> >     unknown: base-profile=lc
> >     unknown: mpegversion=4
> >     unknown: profile=high
> >     unknown: alignment=au
> >     unknown: parsed=true
> >     unknown: framed=true
> >     unknown: variant=iso
> >     unknown: profile=lc
> >     unknown: level=4.1
> >
> > But one point is that they are often numerous, and another point is that
> > that of a key-value type is a really interesting metatype to have (and
> > is not really “unknown”, since the key is self-explanatory). Would it
> > not make sense to add an |EXTRACTOR_METATYPE_KEY_VALUE_PAIR| to the list
> > of MetaTypes?
>
> We could do that. Sometimes I think it would be better to add new
> specific LE types for some of the above, but until that is done, a
> key-value pair type would at least be better than 'unknown'.
>
> > ...
> >
> >     /* generic attributes */
> >     EXTRACTOR_METATYPE_UNKNOWN = 45,
> >     EXTRACTOR_METATYPE_DESCRIPTION = 46,
> >     EXTRACTOR_METATYPE_COPYRIGHT = 47,
> >     EXTRACTOR_METATYPE_RIGHTS = 48,
> >     EXTRACTOR_METATYPE_KEYWORDS = 49,
> >     EXTRACTOR_METATYPE_ABSTRACT = 50,
> >     EXTRACTOR_METATYPE_SUMMARY = 51,
> >     EXTRACTOR_METATYPE_SUBJECT = 52,
> >     EXTRACTOR_METATYPE_CREATOR = 53,
> >     EXTRACTOR_METATYPE_FORMAT = 54,
> >     EXTRACTOR_METATYPE_FORMAT_VERSION = 55,
> >     *EXTRACTOR_METATYPE_KEY_VALUE_PAIR* = XXX,
> >
> > ...
> >
> > #2 I often see that files get tagged with multiple mime types according
> > to libextractor:
> >
> >     mimetype: video/quicktime
> >     mimetype: video/x-h264
> >     mimetype: audio/mpeg
> >     mimetype: video/mp4
>
> That is because different plugins (using different methods/libraries)
> disagree on the 'correct' mime-type. Ideally, we'd identify which plugin
> gets it wrong (and why), and unify the mime-types.
>
> > But that never reflects the reality, since files should have only one
> > mime type (or at most, multiple mime types that mean the same thing).
> > But then I see what happens with file names: there is only one
> > |EXTRACTOR_METATYPE_GNUNET_ORIGINAL_FILENAME|, but there can be many
> > |EXTRACTOR_METATYPE_FILENAME|s (in the case of archives, for example):
> >
> >     EXTRACTOR_METATYPE_FILENAME = 2,
> >     ...
> >     EXTRACTOR_METATYPE_GNUNET_ORIGINAL_FILENAME = 180,
> >
> > Would it not make sense to do something similar for mime types? Only one
> > “original mime type”, and an infinity of secondary mime types…?
> >
> >     EXTRACTOR_METATYPE_MIMETYPE = 1,
> >     ...
> >     *EXTRACTOR_METATYPE_GNUNET_ORIGINAL_MIMETYPE* = XXX,
>
> I guess it depends. If this is for archives where files _inside_ the
> archive are given mime-types, then a different metatype makes sense
> (ditto for FILENAME: here we probably could have two types, one for the
> 'archive' and one for the 'contents'). But if the different mime-types
> are all about the 'original' file, then we should rather figure out
> which plugin gets it wrong. As for the "_GNUNET_" in the
> "_GNUNET_ORIGINAL_FILENAME" there, IIRC this again different because
> that is NOT a metatype used by GNU libextractor, but one that GNUnet
> itself generates and puts with the 'rest' of the metadata.
>
> > So, two simple proposals:
> >
> > 1. Create |EXTRACTOR_METATYPE_KEY_VALUE_PAIR|
> > 2. Create |EXTRACTOR_METATYPE_GNUNET_ORIGINAL_MIMETYPE|
> >
> > What do you think? Does it make sense?
>
> It should definitively not be "GNUNET_ORIGINAL_MIMETYPE", and the real
> question is what is the origin of the different mime-types. If this is
> from an archive, maybe we should introduce
>
>     EXTRACTOR_MIMETYPE_ARCHIVE_CONTENT_FILENAME
>     EXTRACTOR_MIMETYPE_ARCHIVE_CONTENT_MIMETYPE
>
> and reserve
>
>     EXTRACTOR_MIMETYPE_FILENAME
>     EXTRACTOR_MIMETYPE_MIMETYPE
>
> for the top-level file. But AFAIK that won't solve your mime-type issue,
> which should really be resolved by going over the plugins and finding
> out why and where they disagree and picking the 'right' answer.
>
> My 2 cents
>
> Christian
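
To make point 3 a bit more concrete, here is a minimal sketch of how a "not a metatype" sentinel could be used. It does not include extractor.h: enum DEMO_MetaType is a hypothetical stand-in for enum EXTRACTOR_MetaType that copies only a few numeric values quoted in this thread, and the lookup table and the demo_* functions are invented purely for illustration. The -1 entry is the proposal, not something libextractor provides today.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for enum EXTRACTOR_MetaType. Only the numeric
   values quoted in this thread are taken from the real header; the -1
   sentinel is the proposed addition. */
enum DEMO_MetaType
{
  DEMO_METATYPE_NO_METATYPE = -1, /* proposed: "this is not a metatype" */
  DEMO_METATYPE_MIMETYPE = 1,
  DEMO_METATYPE_FILENAME = 2,
  DEMO_METATYPE_UNKNOWN = 45,
  DEMO_METATYPE_DESCRIPTION = 46
};

/* Illustrative label-to-metatype table (not part of libextractor). */
struct demo_name_map
{
  const char *name;
  enum DEMO_MetaType type;
};

static const struct demo_name_map demo_names[] = {
  { "mimetype", DEMO_METATYPE_MIMETYPE },
  { "filename", DEMO_METATYPE_FILENAME },
  { "unknown", DEMO_METATYPE_UNKNOWN },
  { "description", DEMO_METATYPE_DESCRIPTION }
};

/* Return the metatype for a label, or the sentinel when there is none.
   One return value carries both "found?" and "which one?". */
enum DEMO_MetaType
demo_metatype_from_name (const char *name)
{
  assert (name != NULL);
  for (size_t i = 0; i < sizeof demo_names / sizeof demo_names[0]; i++)
    if (strcmp (demo_names[i].name, name) == 0)
      return demo_names[i].type;
  return DEMO_METATYPE_NO_METATYPE;
}

/* Replaces a separate have_metatype variable kept next to the enum. */
int
demo_has_metatype (enum DEMO_MetaType type)
{
  return type != DEMO_METATYPE_NO_METATYPE;
}
```

Without the sentinel, demo_metatype_from_name would need an extra out-parameter or a separate flag to report "nothing found", which is exactly the have_metatype bookkeeping mentioned in point 3.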