Re: GSOC 2018 SpamAssassin Statistical Classifier Plugin

Rajkiran Rajkumar Wed, 21 Mar 2018 04:52:40 -0700

@Saahil, kindly make your doc view-only for people with a link to it.
Giving edit permissions to the world is a bad idea.


Thanks,
Rajkiran

On Tue, Mar 20, 2018 at 5:17 PM, Kevin A. McGrail <kmcgr...@apache.org>
wrote:

> +users
>
> All we give is feedback.  The submission to GSoC is what matters.  So if
> you mentioned Perl here, that's not going to carry over to the reviewers.
>
> Can someone with fresh eyes take a look at this?  I read it too recently
> so I will gloss over it too much.
>
> Here are some posts the mentors list thought might be helpful.  The first,
> I believe, covers the point of view of someone who did not get selected.
>
> https://medium.freecodecamp.org/hacking-gsoc-how-to-gain-real-life-experience-and-support-open-source-b1e6a664f6e4?source=linkShare-53ba2bb84284-1521381334
>
> https://sanatt.me/2017/12/30/cracking-google-summer-code-2018/
>
> Regards, KAM
>
> On Tue, Mar 20, 2018, 03:57 Saahil Sirowa <cs16btech11...@iith.ac.in>
> wrote:
>
>> Hi Kevin and Apache SpamAssassin Dev Community,
>>
>> I have resolved all the changes you suggested in the previous draft.
>> 1) I mentioned learning Perl a week before the community bonding
>> period. It will not take much time. I can assure you that language is not
>> going to be an issue.
>> 2) I updated the biography part a bit
>> 3) Significant changes have been made in the Timeline.
>> 4) I'm planning to use CMake/Travis CI for automated testing. If there
>> is a better alternative, please do suggest it.
>> 5) I gave links to research papers that I will be reading in the timeline.
>> 6) I updated the timeline to include gaining advanced knowledge of
>> email traffic and spam. I listed some links for that purpose.
>> 7) I updated the credits
>> 8) There are other changes made in various parts of the proposal.
>>
>> Thanks for your previous detailed feedback.
>>
>> Here is the link to the updated proposal
>> GSoC 2018 proposal
>> <https://docs.google.com/document/d/1-OCNv79sHvVViKwnrRYtlMiKWLCzz4xUW4tNOlmaTmw/edit#heading=h.q7h3lddabdvh>
>> Please rigorously review it and suggest any changes that I should make.
>>
>> Awaiting a favorable response.
>>
>>
>> Thanks...
>> Saahil Sirowa
>> B. Tech Computer Science and Engineering
>> Indian Institute of Technology, Hyderabad
>>
>> On Mon, Mar 19, 2018 at 3:27 AM, Kevin A. McGrail <kmcgr...@apache.org>
>> wrote:
>>
>>> Hi Saahil
>>>
>>> re: Perl. As the project is primarily in Perl and you do not list that
>>> in your Proficiencies or any similar languages like PHP, I would address
>>> that.  The word Perl does not appear a single time.
>>>
>>> Your Biography is a little light on why this is something you feel you
>>> can implement.  The mentors will likely NOT be able to help you with the
>>> science, focusing instead on the community, processes, and open source in
>>> general.
>>>
>>> re: Email and Spam, do you have any experience with email traffic or
>>> spam?  If so, add it.  If not, explain what you plan to do to address that.
>>>
>>> Re: Deliverables, I think you'll need to propose the first draft of
>>> that.  But your goal will likely be a plugin for Apache SpamAssassin that
>>> can be installed and configured to provide multiple configurable
>>> statistical analysis algorithms to better identify ham (good email) and/or
>>> spam (bad email).
>>>
>>> Please use Apache SpamAssassin to properly brand the title.
>>>
>>> Re: I have no input on the scheduling/timelines except that past
>>> proposals I have read have included more phases and did not add "optional"
>>> items.  I'd prefer to see small increments to make sure you stay on
>>> schedule and don't get overwhelmed and find yourself way behind as the time
>>> progresses.
>>>
>>> Re: Testing Methodology, this is likely the most critical missing part.
>>> I am a fan of test-driven development where you set up tests that should
>>> pass and fail and use continuous testing as you add code to confirm your
>>> development is progressing well.
>>>
>>> This is especially important because spam analysis often doesn't work
>>> the way people expect and tests w/statistics can help identify issues.
>>>
>>> For example, there is a hypothesis that these statistical algorithms will
>>> be better than Bayes.  So you'll need a baseline for comparison.
>>>
>>> Additionally, even experts in the field are surprised when they think
>>> something will prove the hamminess of an email but in fact shows the
>>> opposite.  Real-world example: SPF is a policy that, when introduced, was supposed
>>> to allow an automated mechanism that says "this is an email from a
>>> legitimate mail server for my domain".
>>>
>>> However, the FIRST wave of people to adopt it were all spammers.  So it
>>> became a spam indicator more than a ham indicator.  It was a very
>>> interesting outcome.
>>>
>>> Re: Corpora, you'll want a corpus of carefully hand-sorted ham and
>>> spam.  Have you thought about how you'll get that?  I *might* be able to
>>> help but it's 50/50.
>>>
>>> Re: You mention reading research papers on statistical algorithms from a
>>> previous proposal.  You'll want to list them to show which ones you plan to
>>> study.
>>>
>>> re: "Discussions with the SA community regarding the various types of
>>> spams that the present SA can handle." is unclear.  What is a "type of
>>> spam" to you?  Do you have a list of types of spam?
>>>
>>> re: "Brainstorming with the mentors and SA community about the various
>>> input features and parameters that can have a huge impact on the overall
>>> performance of the listed neural nets models." I think this is flawed.
>>> There won't be a ton of people who can discuss this with you.  You'll need
>>> to likely use scientific process to show what has a performance impact.
>>> This is not busy work or school work.  This is an experiment that has not
>>> been tried at the SA project.
>>>
>>> re: "actively involved with the community." is a stretch.  A few emails
>>> do not active involvement make.
>>>
>>> re: Bonding, you might consider raising that to 1-2 major bugs and 10-20
>>> minor bugs.
>>>
>>> Re: Credits/references, I would add more clarity about where each of
>>> those references is used.
>>>
>>> Regards,
>>> KAM
>>>
>>
>>
