Thanks, Professor, for the prompt and kind response. I will keep you updated on 
the progress and findings.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Wednesday, January 28, 2015 8:17 PM
To: Luke; 'Christian Alan Mattmann'; dev@tika.apache.org
Cc: nsf-polar-usc-stude...@googlegroups.com
Subject: Re: [jira] [Commented] (TIKA-1535) Inheritance modification for the 
class MIMETypes

Hi Luke,


-----Original Message-----
From: Luke <hanson311...@gmail.com>
Date: Wednesday, January 28, 2015 at 7:15 PM
To: Chris Mattmann <mattm...@usc.edu>, Chris Mattmann 
<chris.a.mattm...@jpl.nasa.gov>, "dev@tika.apache.org"
<dev@tika.apache.org>
Cc: NSF Polar CyberInfrastructure DR Students 
<nsf-polar-usc-stude...@googlegroups.com>
Subject: RE: [jira] [Commented] (TIKA-1535) Inheritance modification for the 
class MIMETypes

>Hi Professor and all,
>
>A Bayesian or machine learning Detector is different from the Bayesian 
>selection mechanism reported in TIKA-1517.
>It would make sense to implement a machine learning algorithm in a 
>separate Detector class. I have not gone far with this design yet, as 
>I am still at the research stage, collecting data. Once I have enough 
>data to form a model and can prove the concept, I will come down to 
>the machine learning detector implementation and its design 
>considerations. (BTW, I have some ideas for data collection and 
>training, but it will still take some time to come up with even a 
>quick-and-dirty proof of concept. I am still working on the data 
>collection, and there are some design problems within the learning 
>techniques too; I will come to them once I have a clear idea of the 
>data. I think I may have to crawl the data and label it for training, 
>and there are some preprocessing steps to take care of too....)

+1.

>
>However, my current implementation in TIKA-1517 is based solely on mime 
>type "selection" (I cannot find a clearer name that is distinguishable 
>from detection) with probabilities, which might have nothing to do with 
>a genuine machine learning detector; it is a feature for adding weights 
>to each Tika mime type detection algorithm.

Gotcha.
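
A rough sketch of what "adding weights to each detection algorithm" could 
look like, for readers following the thread. This is not the TIKA-1517 code 
and not a Bayesian model: it is a plain weighted vote. The names Vote and 
WeightedSelection are made up for illustration and do not exist in Tika.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical: one detection method's answer plus the weight given to that method. */
final class Vote {
    final String mimeType;
    final double weight;

    Vote(String mimeType, double weight) {
        this.mimeType = mimeType;
        this.weight = weight;
    }
}

/** Hypothetical helper: sums the weights per candidate type and returns the heaviest one. */
final class WeightedSelection {
    private WeightedSelection() {}

    static String select(List<Vote> votes, String fallback) {
        // Accumulate the total weight behind each candidate type.
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Vote v : votes) {
            totals.merge(v.mimeType, v.weight, Double::sum);
        }
        // Pick the candidate with the largest total; keep the fallback if nothing voted.
        String best = fallback;
        double bestWeight = 0.0;
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            if (e.getValue() > bestWeight) {
                best = e.getKey();
                bestWeight = e.getValue();
            }
        }
        return best;
    }
}
```

In this sketch each method (magic bytes, file extension, metadata hint, and 
so on) would contribute one Vote, and the weights would be tuned by whoever 
configures the selection.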

> 
>
>But I think you are right: in the future we will need to assign 
>weights to a pool of detection algorithms, including machine learning 
>techniques and content-based detection algorithms. The current 
>declaration of MimeTypes as final has its design purpose, and I 
>don’t think it is a good idea to lump detector code into 
>MimeTypes, but I will come back to this design or architecture problem 
>once I have a clear idea of the machine learning model (not 
>necessarily a Bayesian model for detection).
> 
>
>BTW, off the top of my head, I would tend to distill the detector 
>semantics out of MimeTypes, as mentioned below. What do you think 
>about creating, say, a TikaDetector class independent of MimeTypes, 
>and removing MimeTypes from the detectors (i.e. getting rid of the 
>"implements Detector" in MimeTypes)?

Yes, can you explore doing this?
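
For anyone following along, here is a minimal sketch of the TikaDetector 
idea quoted above: the detector holds the type registry by composition 
instead of extending a final class. Everything in it is a simplified 
stand-in. TypeRegistry plays the role of MimeTypes, SimpleDetector is a 
cut-down contract rather than Tika's real Detector interface, and the 
method names are hypothetical.

```java
/** Stand-in for a final registry class such as MimeTypes; hypothetical, not Tika's API. */
final class TypeRegistry {
    /** Returns a type guessed from the leading magic bytes, or null if unknown. */
    String matchMagic(byte[] prefix) {
        if (prefix.length >= 4 && prefix[0] == '%' && prefix[1] == 'P'
                && prefix[2] == 'D' && prefix[3] == 'F') {
            return "application/pdf";
        }
        return null;
    }

    /** Returns a type guessed from the file name, or null if unknown. */
    String matchExtension(String fileName) {
        if (fileName != null && fileName.endsWith(".txt")) {
            return "text/plain";
        }
        return null;
    }
}

/** Simplified detector contract, assumed here for illustration only. */
interface SimpleDetector {
    String detect(byte[] prefix, String fileName);
}

/** Hypothetical TikaDetector: uses the registry without inheriting from it. */
final class TikaDetector implements SimpleDetector {
    private final TypeRegistry registry;

    TikaDetector(TypeRegistry registry) {
        this.registry = registry;
    }

    @Override
    public String detect(byte[] prefix, String fileName) {
        // Content evidence first, then the name, then the generic fallback type.
        String byMagic = registry.matchMagic(prefix);
        if (byMagic != null) {
            return byMagic;
        }
        String byName = registry.matchExtension(fileName);
        return byName != null ? byName : "application/octet-stream";
    }
}
```

With this shape the registry can stay final, and a different selection 
strategy becomes another SimpleDetector implementation rather than a 
subclass.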

>
>I will continue to think about this design problem as we move along, 
>and I will leave notes on the ticket for sure. It looks like an 
>important or big change, so any suggestions will be welcomed and 
>appreciated.

Thank you Luke, will do. I will read more and comment on it. Thanks for sharing 
this with the list!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




>
>-----Original Message-----
>From: Christian Alan Mattmann [mailto:mattm...@usc.edu]
>Sent: Wednesday, January 28, 2015 6:30 PM
>To: Luke; 'Mattmann, Chris A (3980)'
>Cc: nsf-polar-usc-stude...@googlegroups.com
>Subject: Re: [jira] [Commented] (TIKA-1535) Inheritance modification 
>for the class MIMETypes
>
>Hi Luke, thanks much. I think we should be having this discussion on 
>the dev@tika.apache.org list too, but thanks also for CC’ing the Polar 
>students list.
>
>My feeling is that Tyler has a good point and that having a 
>BayesianDetector makes a ton of sense. How about we try that as a 
>start, and see where it goes?
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Adjunct Associate Professor, Computer Science Department
>University of Southern California Los Angeles, CA 90089 USA
>Email: mattm...@usc.edu
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>-----Original Message-----
>From: Luke <hanson311...@gmail.com>
>Date: Wednesday, January 28, 2015 at 5:48 PM
>To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>Cc: Chris Mattmann <mattm...@usc.edu>, NSF Polar CyberInfrastructure DR 
>Students <nsf-polar-usc-stude...@googlegroups.com>
>Subject: FW: [jira] [Commented] (TIKA-1535) Inheritance modification 
>for the class MIMETypes
>
>>Hi Professor,
>>
>>I was about to modify the code to support inheritance and code reuse 
>>when Tyler posted the suggestion below, which is quite enlightening.
>>
>>Declaring the class final in this case seems to tell me that any input 
>>stream passed to the class is attached to one fixed MimeTypes (I tend 
>>to think MimeTypes should be tied to one input stream), or it can be 
>>interpreted as the MimeTypes of an input stream.
>>If we inherit from it with my implementation, 
>>MimeTypesBaysianSelection, that will look odd as inheritance, since my 
>>Bayesian implementation is more like an operation attached to that 
>>input stream's MimeTypes.
>>
>>It seems the MimeTypes class is not only used as a mime type detector 
>>(although it implements the Detector interface); it also has other 
>>purposes, e.g. users can take a peek at the input stream's mime types, 
>>extensions, magics, etc. That is probably why it is called MimeTypes 
>>rather than something like Detector. I think it is not really a 
>>detector, but some of its methods, such as getMagics or similar, make 
>>it easy to fit into the slot of a Detector, since it is simpler to 
>>outfit it with the Detector interface and use it as one of the 
>>detectors. I was initially confused about why its name does not have 
>>"detector" in it, and now I am getting the idea....:) But if you have 
>>any thoughts, please kindly let me know.
>>
>>It looks like a clearer OO design to me would be a detector class (say 
>>TikaDetector) that takes the MimeTypes of an input stream as an 
>>argument and executes the detect method with it, although the current 
>>detect method only takes an input stream as one of its arguments.... 
>>We could create a MimeTypes instance inside this detect method. 
>>However, this is only a premature thought, and if we modify it like 
>>this, I am afraid it is highly possible that we would violate some of 
>>the original mime design and potentially break some of the 
>>semantics..., although I do feel the current design has a few small 
>>flaws in this respect.
>>
>>On the other hand, if we stick to the original implementation and 
>>attach the Bayesian selection function to MimeTypes, then after 
>>digging a bit I personally think this is a bit clearer than 
>>inheritance (getting rid of the final). It probably also minimizes 
>>the code change and potential impact. Every time I make a code change, 
>>I always fear there will be a 'butterfly effect'; thorough testing 
>>would be needed for sure.... which takes some time and is quite 
>>tedious.... quite important though....
>>
>>Anyway, if you have any advice/ideas/thoughts, please kindly let me 
>>know; they will be welcomed and appreciated as usual.
>>
>>Thanks
>>Luke
>>
>>-----Original Message-----
>>From: Tyler Palsulich (JIRA) [mailto:j...@apache.org]
>>Sent: Wednesday, January 28, 2015 3:52 PM
>>To: hanson311...@gmail.com
>>Subject: [jira] [Commented] (TIKA-1535) Inheritance modification for 
>>the class MIMETypes
>>
>>
>>    [ https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296084#comment-14296084 ]
>>
>>Tyler Palsulich commented on TIKA-1535:
>>---------------------------------------
>>
>>Maybe someone else can comment on this too. But, I believe MimeTypes 
>>is {{final}} because there are restrictions of what can be done with 
>>the given {{InputStream}}. If those restrictions are broken, key 
>>features of Tika may break. So, we declared the class as final to 
>>ensure no one could break those semantics.
>>
>>But, as seen here, it's difficult to predict whether or not there will 
>>be a valid {{extend}} use case. So, you have to be careful when 
>>marking a class {{final}}.
>>
>>> Inheritance modification for the class MIMETypes
>>> ------------------------------------------------
>>>
>>>                 Key: TIKA-1535
>>>                 URL: https://issues.apache.org/jira/browse/TIKA-1535
>>>             Project: Tika
>>>          Issue Type: Improvement
>>>          Components: mime
>>>            Reporter: Luke sh
>>>            Priority: Trivial
>>>
>>> The class MIMETypes does not currently allow for inheritance.
>>> There are a couple of methods in this class which look independent, 
>>>and some of them need to be exposed or overridden for special 
>>>needs or use cases; this will give Tika users more flexibility 
>>>for new mime detection algorithms.
>>
>>
>>
>>--
>>This message was sent by Atlassian JIRA
>>(v6.3.4#6332)
>>
>
>

