[ https://issues.apache.org/jira/browse/SYSTEMML-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Glenn Weidner updated SYSTEMML-1813:
------------------------------------
    Fix Version/s:     (was: SystemML 1.0)
                   SystemML 0.15

> Preprocessing simplification and cleanup
> ----------------------------------------
>
>                 Key: SYSTEMML-1813
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1813
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>             Fix For: SystemML 0.15
>
>
> In anticipation of near-future algorithmic improvements to the preprocessing
> to improve model training, this simplifies and cleans up the preprocessing
> code as follows.
> - Previously, we were processing all slides into one large saved
>   DataFrame, and then splitting that DataFrame into train and validation
>   DataFrames. We should simplify this by splitting the slide numbers
>   into train and validation sets, and then processing those slides
>   separately. This will effectively skip the creation of the large DataFrame,
>   and remove the need to split that large DataFrame into train/val ones,
>   which should provide a large performance benefit. The DataFrame `union`
>   method can be used to combine two DataFrames row-wise.
> - Previously, we maintained a list of "broken" slides that were manually
>   removed. We should remove that manual list, and instead add a
>   try/except filtering step to automatically remove problematic slides.
> - We should move ad-hoc sampling code into a new `sample` function.
> - We should move code to add row indices to a DataFrame into a new
>   `add_row_indices` function.
> The benefit is that near-future algorithmic improvements to the
> preprocessing code will be much easier to incorporate.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
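
The first two items in the description (split the slide numbers before processing, and filter out problematic slides with try/except) can be illustrated with a minimal sketch. This is not the project's code: it uses plain Python lists in place of Spark DataFrames, and the names `split_slide_nums`, `process_slides`, and `fake_process_slide`, as well as the 80/20 split and the seed, are assumptions made for illustration only. In the real preprocessing code the per-slide results would presumably be DataFrames combined row-wise with `union`, as the description notes.

```python
import random


def split_slide_nums(slide_nums, train_frac=0.8, seed=42):
    """Shuffle slide numbers and split them into train and validation lists.

    Hypothetical helper: the split fraction and seed are illustrative defaults.
    """
    nums = sorted(slide_nums)
    random.Random(seed).shuffle(nums)
    cut = int(len(nums) * train_frac)
    return nums[:cut], nums[cut:]


def process_slides(slide_nums, process_slide):
    """Process each slide, skipping any slide whose processing raises.

    Returns the collected rows and the list of skipped slide numbers, so no
    manually maintained list of "broken" slides is needed.
    """
    rows, skipped = [], []
    for num in slide_nums:
        try:
            rows.extend(process_slide(num))
        except Exception:
            # Problematic slide: record it and move on.
            skipped.append(num)
    return rows, skipped


def fake_process_slide(num):
    """Stand-in for real slide processing; slide 3 is treated as unreadable."""
    if num == 3:
        raise IOError("unreadable slide")
    return [(num, tile) for tile in range(2)]


# Split the slide numbers first, then process each subset separately.
train_nums, val_nums = split_slide_nums(range(1, 11))
train_rows, train_skipped = process_slides(train_nums, fake_process_slide)
val_rows, val_skipped = process_slides(val_nums, fake_process_slide)
```

Because the split happens on slide numbers, no combined dataset is ever built and then divided, and the failing slide is dropped from whichever subset it lands in.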