Thanks, Professor, for the prompt and kind response. I will keep you updated on the progress and findings.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Wednesday, January 28, 2015 8:17 PM
To: Luke; 'Christian Alan Mattmann'; dev@tika.apache.org
Cc: nsf-polar-usc-stude...@googlegroups.com
Subject: Re: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

Hi Luke,

-----Original Message-----
From: Luke <hanson311...@gmail.com>
Date: Wednesday, January 28, 2015 at 7:15 PM
To: Chris Mattmann <mattm...@usc.edu>, Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>, "dev@tika.apache.org" <dev@tika.apache.org>
Cc: NSF Polar CyberInfrastructure DR Students <nsf-polar-usc-stude...@googlegroups.com>
Subject: RE: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

>Hi Professor and all,
>
>A Bayesian or machine learning Detector is different from the Bayesian
>selection mechanism reported in TIKA-1517.
>It would make sense to implement a machine learning algorithm in a
>separate Detector class. I have not gone far with this design yet, as I
>am still at the research stage, collecting data. Once I have enough
>data to form a model, and especially once I can prove the concept, I
>will come to the machine learning detector implementation and its
>design considerations. (BTW, I have some ideas about data collection
>and training, but it will still take some time to come up with
>something, even quick and dirty, that proves the concept with machine
>learning. I am still working on the data collection. There are also
>some design problems within the learning techniques, which I will come
>to once I have a clear idea of the data. I think I may have to crawl
>the data and label it for training, and there are certain
>preprocessing steps to take care of too....)

+1.

>
>However, my current implementation in TIKA-1517 is solely based on mime
>type "selection" (I cannot find any clearer name distinguishable from
>detection) with probability, which might have nothing to do with a
>genuine machine learning detector; it is a feature for adding weights
>to each Tika mime type detection algorithm.

Gotcha.

>
>
>But I think you are right, and in the future we kinda need it to assign
>weights to a pool of detection algorithms, including machine learning
>techniques or content-based detection algorithms. The current
>implementation of MimeTypes as final has its design purpose, and I
>don’t think it is a good idea to lump detector code into MimeTypes,
>but I will come to this design or architecture problem once I have
>some clear ideas about the machine learning model (not necessarily a
>Bayesian model for detection).
>
>
>BTW, off the top of my head, I would tend to distill the detector
>semantics out of MimeTypes, as mentioned below. What do you think
>about creating, say, a TikaDetector class independent of MimeTypes,
>and removing MimeTypes from the detectors (i.e. getting rid of the
>"implements Detector" in MimeTypes)?

Yes, can you explore doing this?

>
>I will continue to think about this design problem as we move along,
>and I will leave notes on the ticket for sure. It looks like an
>important or big change, so any kind suggestions will be welcomed and
>appreciated.

Thank you Luke, will do. I will read more and comment on it. Thanks for
sharing this with the list!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

>
>-----Original Message-----
>From: Christian Alan Mattmann [mailto:mattm...@usc.edu]
>Sent: Wednesday, January 28, 2015 6:30 PM
>To: Luke; 'Mattmann, Chris A (3980)'
>Cc: nsf-polar-usc-stude...@googlegroups.com
>Subject: Re: [jira] [Commented] (TIKA-1535) Inheritance modification
>for the class MIMETypes
>
>Hi Luke, thanks much. I think we should be having this discussion on
>the dev@tika.apache.org list too, but thanks also for CC’ing the Polar
>students list.
>
>My feeling is that Tyler has a good point and that having a
>BayesianDetector makes a ton of sense. How about we try that as a
>start, and see where it goes?
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Adjunct Associate Professor, Computer Science Department University of
>Southern California Los Angeles, CA 90089 USA
>Email: mattm...@usc.edu
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>-----Original Message-----
>From: Luke <hanson311...@gmail.com>
>Date: Wednesday, January 28, 2015 at 5:48 PM
>To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>Cc: Chris Mattmann <mattm...@usc.edu>, NSF Polar CyberInfrastructure DR
>Students <nsf-polar-usc-stude...@googlegroups.com>
>Subject: FW: [jira] [Commented] (TIKA-1535) Inheritance modification
>for the class MIMETypes
>
>>Hi Professor,
>>
>>I was about to modify the code to work with inheritance and code
>>reuse when Tyler posted the suggestion below, which is a bit
>>enlightening.
>>
>>Declaring the class final in this case seems to tell me that any input
>>stream passed to the class is attached to one fixed MimeTypes (I tend
>>to think a MimeTypes should be tied to one input stream), or it can be
>>interpreted as the MimeTypes of an input stream.
>>If we inherit from it with my implementation,
>>MimeTypesBaysianSelection, that will look weird in the sense of
>>inheritance, as my Bayesian implementation is more like an operation
>>attached to that input stream's MimeTypes.
>>
>>It seems the MimeTypes class is not only used as a MimeType detector
>>(it implements the Detector interface, though), but it also has some
>>other purposes, e.g.
>>users can take a peek at the input stream's mime types, extensions,
>>magics, etc. That is probably why it is called MimeTypes rather than
>>something like Detector. I think it is not a detector, but some of its
>>methods, such as getMagics or something, make it easy to fit into the
>>slot of Detectors, as it is easier to just outfit it with a Detector
>>interface and use it as one of the detectors. I was initially confused
>>why it is not called something with detector in it, and now I am
>>getting the idea....:) But if you have any thoughts, please kindly let
>>me know.
>>
>>It looks like a clearer OO design to me would be a detector class
>>(say TikaDetector) that takes the MimeTypes of an input stream as an
>>argument and executes the detect method with the MimeTypes of the
>>input stream, although the current detect method only takes an input
>>stream as one of its arguments.... We could create a MimeTypes
>>instance inside this detect method. However, this is a premature
>>thought, and if we modify it like this, I am afraid it is highly
>>possible we would violate some of the original design of mime, and
>>this could potentially break some of the semantics..., although I do
>>feel the current design has a few little flaws in this respect.
>>
>>On the other hand, if we stick to the original implementation by
>>attaching the Bayesian selection function to MimeTypes, after digging
>>a bit I personally think this is a bit clearer than inheritance
>>(getting rid of the final). It probably also minimizes the code change
>>and potential impact. Every time I make a code change, I always fear
>>there will be a 'butterfly effect'; thorough testing would be needed
>>for sure.... which does take some time and is quite tedious.... quite
>>important though....
>>
>>Anyway, if you have any advice/ideas/thoughts, please kindly let me
>>know; they will be welcomed and appreciated as usual.
>>
>>Thanks
>>Luke
>>
>>-----Original Message-----
>>From: Tyler Palsulich (JIRA) [mailto:j...@apache.org]
>>Sent: Wednesday, January 28, 2015 3:52 PM
>>To: hanson311...@gmail.com
>>Subject: [jira] [Commented] (TIKA-1535) Inheritance modification for
>>the class MIMETypes
>>
>>
>>    [ https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296084#comment-14296084 ]
>>
>>Tyler Palsulich commented on TIKA-1535:
>>---------------------------------------
>>
>>Maybe someone else can comment on this too. But, I believe MimeTypes
>>is {{final}} because there are restrictions on what can be done with
>>the given {{InputStream}}. If those restrictions are broken, key
>>features of Tika may break. So, we declared the class as final to
>>ensure no one could break those semantics.
>>
>>But, as seen here, it's difficult to predict whether or not there will
>>be a valid {{extend}} use case. So, you have to be careful when
>>marking a class {{final}}.
>>
>>> Inheritance modification for the class MIMETypes
>>> ------------------------------------------------
>>>
>>>                 Key: TIKA-1535
>>>                 URL: https://issues.apache.org/jira/browse/TIKA-1535
>>>             Project: Tika
>>>          Issue Type: Improvement
>>>          Components: mime
>>>            Reporter: Luke sh
>>>            Priority: Trivial
>>>
>>> The class MIMETypes does not currently allow for inheritance.
>>> There are a couple of methods in this class which look independent,
>>> and some of which need to be exposed or overridden for special
>>> needs or use cases; this will give Tika users more flexibility
>>> for new mime detection algorithms.
>>
>>
>>
>>--
>>This message was sent by Atlassian JIRA
>>(v6.3.4#6332)
>>
>
>
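
Note: the design floated in this thread (a detector kept separate from MimeTypes, plus weights assigned to a pool of detection algorithms) could be sketched roughly as below. This is only an illustration in plain Java under loud assumptions: SimpleDetector, TypeRegistry, WeightedSelector and defaultSelector are hypothetical names, the 0.7/0.3 weights are made up, and none of this is Tika's actual API or the TIKA-1517 implementation. A simple weighted vote stands in for the Bayesian selection. The registry stays final and is reached through composition rather than inheritance.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only. None of these types are Tika classes; they are
// simplified stand-ins (String media types, byte[] instead of InputStream).
public class DetectorSketch {

    /** Hypothetical detector contract: return a media type, or null for "no opinion". */
    public interface SimpleDetector {
        String detect(byte[] prefix, String fileName);
    }

    /** Hypothetical final registry: offers lookups but is not itself a detector. */
    public static final class TypeRegistry {
        public String byMagic(byte[] prefix) {
            boolean pdf = prefix.length >= 4 && prefix[0] == '%' && prefix[1] == 'P'
                    && prefix[2] == 'D' && prefix[3] == 'F';
            return pdf ? "application/pdf" : null;
        }

        public String byExtension(String fileName) {
            if (fileName == null) return null;
            String lower = fileName.toLowerCase();
            if (lower.endsWith(".pdf")) return "application/pdf";
            if (lower.endsWith(".txt")) return "text/plain";
            return null;
        }
    }

    /** Composes weighted detectors and picks the type with the highest total weight. */
    public static final class WeightedSelector implements SimpleDetector {
        private final List<SimpleDetector> detectors = new ArrayList<>();
        private final List<Double> weights = new ArrayList<>();

        public WeightedSelector add(SimpleDetector detector, double weight) {
            detectors.add(detector);
            weights.add(weight);
            return this;
        }

        @Override
        public String detect(byte[] prefix, String fileName) {
            // Sum the weights of all detectors that voted for the same type.
            Map<String, Double> scores = new LinkedHashMap<>();
            for (int i = 0; i < detectors.size(); i++) {
                String type = detectors.get(i).detect(prefix, fileName);
                if (type != null) {
                    scores.merge(type, weights.get(i), Double::sum);
                }
            }
            // Fall back to a generic type when no detector had an opinion.
            String best = "application/octet-stream";
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Double> e : scores.entrySet()) {
                if (e.getValue() > bestScore) {
                    best = e.getKey();
                    bestScore = e.getValue();
                }
            }
            return best;
        }
    }

    /** Wires registry lookups into a selector; the weights are invented for the example. */
    public static SimpleDetector defaultSelector() {
        TypeRegistry registry = new TypeRegistry();
        return new WeightedSelector()
                .add((prefix, name) -> registry.byMagic(prefix), 0.7)
                .add((prefix, name) -> registry.byExtension(name), 0.3);
    }

    public static void main(String[] args) {
        byte[] pdfBytes = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(defaultSelector().detect(pdfBytes, "report.txt"));
    }
}
```

Here the PDF magic bytes (weight 0.7) outvote the misleading .txt extension (weight 0.3), so main prints application/pdf. Because the selector only depends on the small detector contract, a machine learning detector could later be added to the pool with its own weight without touching the registry.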