Kevin Marlis created SDAP-406:
---------------------------------
Summary: Time series comparison stats issues
Key: SDAP-406
URL: https://issues.apache.org/jira/browse/SDAP-406
Project: Apache Science Data Analytics Platform
Issue Type: Bug
Components: analysis
Reporter: Kevin Marlis
*In short*: the time series comparison stats compute the linear regression
only over results whose times match exactly across the two datasets. For
example, DS1 and DS2 are both monthly products, but DS1 data falls on the first
of the month and DS2 falls in the middle of the month. With no matching times
across the two datasets, none of the algorithm's result data is passed to the
regression algorithm.
*In detail*: the issue is at this line:
[https://github.com/apache/incubator-sdap-nexus/blob/22b10f661f02e4b8329e3973234b83b188133d8c/analysis/webservice/algorithms_spark/TimeSeriesSpark.py#L314]
{{xy}} is appended to only if there are two result dictionaries in {{item}},
which happens only when the two datasets share an identical time value. The
linear regression algorithm returns NaNs if the x and y arrays contain only one
value, which can be problematic downstream. In the case above, the xs and ys
for the regression are never appended to because the dates never line up
({{if len(item) == 2}} is never satisfied). Empty comparison stats do not
appear to affect the charts on the frontend.
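The pairing behaviour described above can be reproduced with a minimal sketch. The helper names ({{group_by_time}}, {{collect_xy}}) and the {{"ds"}}/{{"mean"}} keys are assumptions made for illustration; this is not the code in TimeSeriesSpark.py, only a model of the {{len(item) == 2}} condition.

```python
from collections import defaultdict
from datetime import date


def group_by_time(ds1, ds2):
    """Group (time, mean) points from two datasets by exact time value.

    Hypothetical stand-in for how results end up as a list of `item`s,
    each holding one dict per dataset that has data at that time.
    """
    grouped = defaultdict(list)
    for ds_index, series in enumerate((ds1, ds2)):
        for time, mean in series:
            grouped[time].append({"ds": ds_index, "mean": mean})
    return [grouped[t] for t in sorted(grouped)]


def collect_xy(results):
    """Collect regression inputs only from times present in both datasets."""
    xy = [[], []]
    for item in results:
        if len(item) == 2:  # both datasets have a value at this exact time
            xy[item[0]["ds"]].append(item[0]["mean"])
            xy[item[1]["ds"]].append(item[1]["mean"])
    return xy


# DS1 falls on the 1st of each month, DS2 on the 15th: no shared times.
ds1 = [(date(2020, m, 1), 10.0 + m) for m in range(1, 4)]
ds2 = [(date(2020, m, 15), 20.0 + m) for m in range(1, 4)]
xy_offset = collect_xy(group_by_time(ds1, ds2))

# Same values, but DS2 also on the 1st: every time is shared.
ds2_aligned = [(date(2020, m, 1), 20.0 + m) for m in range(1, 4)]
xy_aligned = collect_xy(group_by_time(ds1, ds2_aligned))
```

With the offset dates both lists in {{xy_offset}} stay empty, so the regression has nothing to work with; with aligned dates each list receives three values.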
*Possible fixes*:
* Check whether the linear regression results are NaN; if so, set the stats to
an empty dict.
* Normalize dates to make the time steps consistent across multiple datasets.
For now we're going with the first option, although the second option could be
looked into.
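Both options can be sketched roughly as follows. Everything here is an assumption for illustration: {{linear_fit}} is a plain least-squares stand-in rather than the regression routine SDAP actually calls, and {{comparison_stats}} and {{normalize_to_month}} are hypothetical helper names.

```python
import math
from datetime import date

NAN_FIT = {"slope": math.nan, "intercept": math.nan}


def linear_fit(xs, ys):
    """Ordinary least squares; returns NaNs when the fit is undefined.

    Stand-in for the real regression call (an assumption, not SDAP code).
    """
    n = len(xs)
    if n == 0:
        return dict(NAN_FIT)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    ss_xx = sum((x - x_mean) ** 2 for x in xs)
    if ss_xx == 0:  # zero or one distinct x value: slope is undefined
        return dict(NAN_FIT)
    ss_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    return {"slope": slope, "intercept": y_mean - slope * x_mean}


def comparison_stats(xs, ys):
    """Option 1: fall back to an empty dict if any regression result is NaN."""
    stats = linear_fit(xs, ys)
    if any(math.isnan(value) for value in stats.values()):
        return {}
    return stats


def normalize_to_month(d):
    """Option 2: snap a date to the first of its month so monthly data aligns."""
    return d.replace(day=1)


empty_stats = comparison_stats([], [])
single_point_stats = comparison_stats([1.0], [2.0])
good_stats = comparison_stats([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
snapped = normalize_to_month(date(2020, 3, 15))
```

The NaN guard keeps downstream consumers from seeing NaN values, while the normalization step would let mid-month and start-of-month products pair up. The right granularity for normalization (month, day, or something else) depends on the datasets and is left open here.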
--
This message was sent by Atlassian Jira
(v8.20.10#820010)