[ https://issues.apache.org/jira/browse/SPARK-25996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ignacio Gómez updated SPARK-25996:
----------------------------------
    Description: 
Hi all,

Using PySpark, I count the records that fall within the 5 seconds preceding the 
current row's timestamp, including the current row itself, with the following 
query:

query = """
 select *,* count ( * ) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as 
total_count
 from df3
 """
 df3 = sqlContext.sql(query)
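
For reference, a minimal self-contained reproduction sketch (the session setup 
and the sample rows here are my own illustration, not taken from the original 
run; two of the rows deliberately share the timestamp 10:00:01):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SPARK-25996-repro").getOrCreate()

# Two rows deliberately share TS 2018-01-01 10:00:01.
rows = [
    (1, 100, "2018-01-01 10:00:01"),
    (1, 1000, "2018-01-01 10:00:01"),
    (1, 25, "2018-01-01 10:00:02"),
]
df = (spark.createDataFrame(rows, ["ACCOUNTID", "AMOUNT", "TS"])
      .selectExpr("ACCOUNTID", "AMOUNT", "cast(TS as timestamp) as TS"))
df.createOrReplaceTempView("df3")

spark.sql(query).show()  # `query` as defined above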

It returns the following:

 
|ACCOUNTID|AMOUNT|TS|total_count|
|1|100|2018-01-01 10:00:01|1|
|1|1000|2018-01-01 10:00:01|1|
|1|25|2018-01-01 10:00:02|2|
|1|500|2018-01-01 10:00:03|3|
|1|100|2018-01-01 10:00:04|4|
|1|80|2018-01-01 10:00:05|5|
|1|700|2018-01-01 11:00:04|1|
|1|205|2018-01-02 10:00:02|1|
|1|500|2018-01-02 10:00:03|2|
|3|80|2018-01-02 10:00:05|1|

 

As you can see, in the third row total_count should be 3 instead of 2, because 
there are two previous records within the window, not one. The error then 
propagates through the following rows.
The same happens with the other aggregation functions.

Even though the first two rows share the same timestamp, both of them still 
exist; the window should not behave as if the only row that exists is the last 
one with a given timestamp.
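
For reference, the same window can be expressed with the DataFrame API (my own 
formulation, not from the report). With a RANGE frame, rows whose ordering 
value equals the current row's are peers of the current row and should all 
fall inside the frame, which is why every duplicate timestamp ought to be 
counted:

from pyspark.sql import Window, functions as F

# Order by milliseconds since the epoch so the 5000 ms bound is exact;
# rangeBetween(-5000, Window.currentRow) keeps every row whose ordering
# value lies within 5 seconds before the current row, peers included.
w = (Window.partitionBy("ACCOUNTID")
     .orderBy(F.col("TS").cast("long") * 1000)
     .rangeBetween(-5000, Window.currentRow))

df3.withColumn("total_count", F.count("*").over(w)).show()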

 

Could you help me?

Thank you

  was:
Hi, how's it going?

Using PySpark, I count the records that fall within the 5 seconds preceding the 
current row's timestamp, including the current row itself, with the following 
query:

query = """
 select *,* count ( * ) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as 
total_count
 from df3
 """
 df3 = sqlContext.sql(query)

It returns the following:

 
|ACCOUNTID|AMOUNT|TS|total_count|
|1|100|2018-01-01 10:00:01|1|
|1|1000|2018-01-01 10:00:01|1|
|1|25|2018-01-01 10:00:02|2|
|1|500|2018-01-01 10:00:03|3|
|1|100|2018-01-01 10:00:04|4|
|1|80|2018-01-01 10:00:05|5|
|1|700|2018-01-01 11:00:04|1|
|1|205|2018-01-02 10:00:02|1|
|1|500|2018-01-02 10:00:03|2|
|3|80|2018-01-02 10:00:05|1|

As you can see, in the third row total_count should be 3 instead of 2, because 
there are two previous records within the window, not one. The error then 
propagates through the following rows.
 The same happens with the other aggregation functions.

Even though the first two rows share the same timestamp, both of them still 
exist; the window should not behave as if the only row that exists is the last 
one with a given timestamp.

 

Could you help me?

Thank you very much


> Aggregations do not return correct values for rows with equal timestamps
> -------------------------------------------------------------------------
>
>                 Key: SPARK-25996
>                 URL: https://issues.apache.org/jira/browse/SPARK-25996
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1, 2.4.0
>         Environment: Windows 10
> PyCharm 2018.2.2
> Python 3.6
>  
>            Reporter: Ignacio Gómez
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
