[jira] [Created] (SDAP-406) Time series comparison stats issues

Kevin Marlis (Jira) Mon, 17 Oct 2022 15:47:09 -0700

Kevin Marlis created SDAP-406:
---------------------------------

             Summary: Time series comparison stats issues
                 Key: SDAP-406
                 URL: https://issues.apache.org/jira/browse/SDAP-406
             Project: Apache Science Data Analytics Platform
          Issue Type: Bug
          Components: analysis
            Reporter: Kevin Marlis



*In short*: the time series comparison stats only compute the linear 
regression over results whose timestamps match exactly. For example, DS1 and 
DS2 are both monthly products, but DS1 data falls on the first of the month and 
DS2 falls in the middle of the month. With no matching times across the two 
datasets, none of the algorithm's result data is passed to the regression.
 
*In detail*: The issue is at this line: 
[https://github.com/apache/incubator-sdap-nexus/blob/22b10f661f02e4b8329e3973234b83b188133d8c/analysis/webservice/algorithms_spark/TimeSeriesSpark.py#L314]

{{xy}} is only appended to when {{item}} contains two dictionaries of results, 
which only happens when the two datasets share an identical time value. In the 
case above, the xs and ys for the regression are never appended to because the 
dates never line up ({{if len(item) == 2}} is never satisfied). The linear 
regression algorithm returns NaNs if the x and y arrays contain only one value, 
which can be problematic downstream. Empty comparison stats do not appear to 
affect the charts on the frontend.
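
The pairing behaviour described above can be reproduced with a small, 
self-contained sketch. This is not the code from TimeSeriesSpark.py: the 
function name {{pair_by_time}}, the {date: mean} input shape, and the "ds" and 
"mean" keys are assumptions made for illustration only.

```python
from datetime import date


def pair_by_time(ds1, ds2):
    """Group two {date: mean} series by timestamp and keep only the
    timestamps that carry a result from both datasets (hypothetical helper)."""
    grouped = {}
    for ds_index, series in enumerate((ds1, ds2)):
        for t, mean in series.items():
            grouped.setdefault(t, []).append({"ds": ds_index, "mean": mean})

    xy = [[], []]
    for t in sorted(grouped):
        item = grouped[t]
        # Only timestamps present in both datasets contribute a point.
        if len(item) == 2:
            for entry in item:
                xy[entry["ds"]].append(entry["mean"])
    return xy


# DS1 falls on the 1st of each month, DS2 on the 15th: no shared timestamps.
ds1 = {date(2020, m, 1): 10.0 + m for m in range(1, 4)}
ds2 = {date(2020, m, 15): 20.0 + m for m in range(1, 4)}
xy_unaligned = pair_by_time(ds1, ds2)

# The same DS2 values placed on the 1st of each month pair up as expected.
ds2_aligned = {date(2020, m, 1): 20.0 + m for m in range(1, 4)}
xy_aligned = pair_by_time(ds1, ds2_aligned)
```

With the unaligned inputs both lists in {{xy}} stay empty, so a regression 
would receive no data at all; with the aligned inputs each list holds three 
values.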
 
*Possible fixes*:
 * Check whether the linear regression results are NaN; if so, set the stats to 
an empty dict
 * Normalize dates so that the time steps are consistent across multiple 
datasets

 

For now we're going with the first option, although the second option could be 
looked into.
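
As a rough sketch of what the two options could look like (not the actual 
patch): the helper names {{stats_or_empty}} and {{normalize_to_month_start}}, 
the stats keys, and the choice of snapping monthly data to the first of the 
month are all assumptions for illustration.

```python
import math
from datetime import date


def stats_or_empty(stats):
    """Option 1: return an empty dict if any regression statistic is NaN."""
    if any(isinstance(v, float) and math.isnan(v) for v in stats.values()):
        return {}
    return stats


def normalize_to_month_start(series):
    """Option 2: snap every date in a {date: value} series to the first of
    its month so that monthly products from different datasets line up."""
    return {t.replace(day=1): v for t, v in series.items()}


# Hypothetical stats dict as a regression over too little data might produce.
bad_stats = {"slope": float("nan"), "intercept": float("nan")}
good_stats = {"slope": 2.0, "intercept": 1.0}

cleaned_bad = stats_or_empty(bad_stats)
cleaned_good = stats_or_empty(good_stats)
normalized = normalize_to_month_start({date(2020, 1, 15): 21.0})
```

Note that the normalization sketch keeps only the last value when two dates 
fall in the same month, so it only suits data that is already monthly.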



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
