[ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944466#comment-15944466 ]
Kenneth Knowles commented on BEAM-1439:
---------------------------------------

And also, please engage with the Beam community early - before applications are reviewed! Here are some ideas for getting engaged:

# Work through Beam's "getting started" materials such as https://beam.apache.org/get-started/quickstart-java/
#* Especially get as familiar as you can with the runner that you are interested in
# Subscribe to d...@beam.apache.org and/or u...@beam.apache.org
# You are welcome to share your applications for early commentary on d...@beam.apache.org to get early feedback and mentorship (this is quite normal for GSoC+Apache; even if you don't get selected by GSoC you will learn and make new acquaintances)
# Pick up starter bugs to get familiar with the codebase beyond our getting started material

> Beam Example(s) exploring public document datasets
> --------------------------------------------------
>
>                 Key: BEAM-1439
>                 URL: https://issues.apache.org/jira/browse/BEAM-1439
>             Project: Beam
>          Issue Type: Wish
>          Components: examples-java
>            Reporter: Kenneth Knowles
>            Assignee: Kenneth Knowles
>            Priority: Minor
>              Labels: gsoc2017, java, mentor, python
>
> In Beam, we have examples illustrating counting the occurrences of words and
> performing a basic TF-IDF analysis on the works of Shakespeare (or whatever
> you point it at). It would be even cooler to do these analyses, and more, on
> a much larger data set that is really the subject of current investigations.
> In chatting with professors at the University of Washington, I've learned
> that scholars of many fields would really like to explore new and highly
> customized ways of processing the growing body of publicly-available
> scholarly documents, such as PubMed Central. Queries like "show me documents
> where chemical compounds X and Y were both used in the 'method' section"
> So I propose a Google Summer of Code project wherein a student writes some
> large-scale Beam pipelines to perform analyses such as term frequency, bigram
> frequency, etc.
> Skills required:
> - Java or Python
> - (nice to have) Working through the Beam getting started materials

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)