[jira] [Updated] (SYSTEMML-1813) Preprocessing simplification and cleanup

Glenn Weidner (JIRA) Fri, 08 Sep 2017 22:09:28 -0700

     [ https://issues.apache.org/jira/browse/SYSTEMML-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Glenn Weidner updated SYSTEMML-1813:
------------------------------------
    Fix Version/s:     (was: SystemML 1.0)
                   SystemML 0.15

> Preprocessing simplification and cleanup
> ----------------------------------------
>
>                 Key: SYSTEMML-1813
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1813
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>             Fix For: SystemML 0.15
>
>
> In anticipation of near-future algorithmic improvements to the preprocessing
> that are intended to improve model training, this issue simplifies and cleans
> up the preprocessing code as follows:
> - Previously, we were processing all slides into one large saved
> DataFrame, and then splitting that DataFrame into train and validation
> DataFrames.  We should simplify this by splitting the slide numbers
> into train and validation sets, and then processing those slides
> separately.  This will effectively skip the creation of the large DataFrame,
> and remove the need to split that large DataFrame into train/val ones,
> which should provide a large performance benefit.  The DataFrame `union`
> method can be used to combine two DataFrames row-wise.
> - Previously, we maintained a list of "broken" slides that were manually
> removed.  We should remove that manual list, and instead add a
> try/except filtering step to automatically remove problematic slides.
> - We should move ad-hoc sampling code into a new `sample` function.
> - We should move code to add row indices to a DataFrame into a new
> `add_row_indices` function.
>
> The benefit is that near-future algorithmic improvements to the
> preprocessing code will be much easier to incorporate.
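
The first two points in the description (splitting slide numbers before processing, and filtering out problematic slides with try/except) can be sketched roughly as follows. This is only an illustrative sketch, not the project's actual code: the names `split_slide_nums`, `process_slides`, and `fake_process_slide` are hypothetical, and plain Python lists stand in for the Spark DataFrames, which in the real pipeline would presumably be combined row-wise with `union`.

```python
import random


def split_slide_nums(slide_nums, val_frac=0.2, seed=42):
    """Split slide numbers (not processed rows) into train and validation lists.

    Hypothetical helper: the fraction and seed are illustrative defaults.
    """
    shuffled = sorted(slide_nums)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_frac))
    # The first n_val shuffled slides go to validation, the rest to train.
    return shuffled[n_val:], shuffled[:n_val]


def process_slides(slide_nums, process_slide):
    """Process each slide, automatically skipping any slide that raises.

    This replaces a manually maintained list of "broken" slides with a
    try/except filtering step. Returns the collected rows and a list of
    (slide_num, error) pairs for the slides that were skipped.
    """
    rows, skipped = [], []
    for num in slide_nums:
        try:
            rows.extend(process_slide(num))
        except Exception as err:
            skipped.append((num, repr(err)))
    return rows, skipped


def fake_process_slide(num):
    """Hypothetical stand-in for real slide processing.

    Pretends that every slide number divisible by 7 is unreadable and
    that every readable slide yields three tiles.
    """
    if num % 7 == 0:
        raise IOError("unreadable slide %d" % num)
    return [(num, tile) for tile in range(3)]


# Split the slide numbers first, then process each subset separately,
# so no single large combined dataset is ever built and re-split.
train_nums, val_nums = split_slide_nums(range(1, 21))
train_rows, train_skipped = process_slides(train_nums, fake_process_slide)
val_rows, val_skipped = process_slides(val_nums, fake_process_slide)
```

Because the split happens on slide numbers, all tiles from a given slide land in the same subset, and a slide that fails to process is simply reported in the skipped list rather than needing a hand-maintained exclusion list.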



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
