[jira] [Created] (SPARK-26589) proper `median` method for spark dataframe
Jan Gorecki created SPARK-26589: --- Summary: proper `median` method for spark dataframe Key: SPARK-26589 URL: https://issues.apache.org/jira/browse/SPARK-26589 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.0 Reporter: Jan Gorecki I found multiple tickets asking for a median function to be implemented in Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as the issue they duplicate. The thing is that the approximate quantile is a workaround for the lack of a median function. Thus I am filing this feature request for a proper, exact median function, not an approximation of it. I am aware of the difficulties that a distributed environment causes when computing a median; nevertheless, I don't think those difficulties are a good enough reason to leave a `median` function out of Spark's scope. I am not asking for an efficient median, but an exact one. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
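To make the "exact, not approximate" distinction concrete, here is a minimal single-machine sketch of an exact median via quickselect (plain Python, not a Spark API; illustrative only — an exact distributed median needs the same order statistics without a full sort):

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) of values.

    Average O(n): partitions around a random pivot instead of fully
    sorting, recursing only into the side that contains index k.
    """
    pivot = random.choice(values)
    lows = [v for v in values if v < pivot]
    pivots = [v for v in values if v == pivot]
    highs = [v for v in values if v > pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))

def exact_median(values):
    """Exact (not approximate) median: the middle order statistic for
    odd n, the mean of the two middle ones for even n."""
    n = len(values)
    if n == 0:
        raise ValueError("median of empty data")
    if n % 2 == 1:
        return float(quickselect(values, n // 2))
    lo = quickselect(values, n // 2 - 1)
    hi = quickselect(values, n // 2)
    return (lo + hi) / 2.0

# exact_median([3, 1, 2])    -> 2.0
# exact_median([4, 1, 3, 2]) -> 2.5
```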
[jira] [Commented] (SPARK-26433) Tail method for spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730887#comment-16730887 ] Jan Gorecki commented on SPARK-26433: - [~hyukjin.kwon] Thank you for your comment, but I am not sure I understood correctly. Do you mean I should first collect the data to the client and then extract the last few rows of the dataframe? If so, that doesn't seem to be a feasible solution, as the data in Spark is likely not to fit on the client machine. `Tail` is exactly the operation one would want to perform BEFORE collecting data to the client. Could you confirm?
> Tail method for spark DataFrame
> ---
> Key: SPARK-26433
> URL: https://issues.apache.org/jira/browse/SPARK-26433
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 2.4.0
> Reporter: Jan Gorecki
> Priority: Major
>
> There is a head method for spark dataframes which works fine, but there
> doesn't seem to be a tail method.
> ```
> >>> ans
> DataFrame[v1: bigint]
> >>> ans.head(3)
> [Row(v1=299443), Row(v1=299493), Row(v1=300751)]
> >>> ans.tail(3)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 1300, in __getattr__
>     "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
> AttributeError: 'DataFrame' object has no attribute 'tail'
> ```
> I would like to request a `tail` method for Spark DataFrame.
[jira] [Created] (SPARK-26433) Tail method for spark DataFrame
Jan Gorecki created SPARK-26433: --- Summary: Tail method for spark DataFrame Key: SPARK-26433 URL: https://issues.apache.org/jira/browse/SPARK-26433 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 2.4.0 Reporter: Jan Gorecki There is a head method for spark dataframes which works fine, but there doesn't seem to be a tail method.
```
>>> ans
DataFrame[v1: bigint]
>>> ans.head(3)
[Row(v1=299443), Row(v1=299493), Row(v1=300751)]
>>> ans.tail(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 1300, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'tail'
```
I would like to request a `tail` method for Spark DataFrame.
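On Spark 2.4 the usual workaround is a two-pass count-then-skip (count the rows, then filter on a row index to keep only the last n) rather than collecting everything to the client. A plain-Python sketch of that logic, with hypothetical lists standing in for an RDD's partitions:

```python
def tail(partitions, n):
    """Two-pass 'tail' over partitioned data: the first pass counts
    rows, the second skips everything before the last n. This mirrors
    the count-then-filter workaround on a DataFrame with a row index,
    and avoids pulling the whole dataset to the client."""
    total = sum(len(p) for p in partitions)  # pass 1: count
    skip = max(total - n, 0)
    out, seen = [], 0
    for part in partitions:                  # pass 2: skip, then keep
        for row in part:
            if seen >= skip:
                out.append(row)
            seen += 1
    return out

# tail([[1, 2, 3], [4, 5], [6]], 2) -> [5, 6]
```

For reference, later Spark releases did add a built-in `DataFrame.tail(n)` in the 3.x line, so on current versions no workaround is needed.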
[jira] [Comment Edited] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410596#comment-15410596 ] Jan Gorecki edited comment on SPARK-16864 at 8/6/16 11:50 AM: -- Record the exact spark source code reference while processing an ETL workflow, so performance implications can be measured precisely against a point in time in the source code. I doubt that a version number or a date/time is a natural key for spark source code, is it? If you don't have a natural key, you can't build a reliable workflow. How would you automatically git clone, reset, build, deploy and re-run your workflow - based on data collected by spark - if you don't even have the git commit there? Looking up the git commit hash by version and date... sure, it works, but why can't users just access that info directly? I don't see ANY reason not to have that feature. If you have any, I would be glad to read it. And no, even for developers that info is not available at runtime.
> Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isn't > possible.
[jira] [Commented] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410596#comment-15410596 ] Jan Gorecki commented on SPARK-16864: - Record the exact spark source code reference while processing an ETL workflow, so performance implications can be measured precisely against a point in time in the source code. I doubt that a version number or a date/time is a natural key for spark source code, is it? If you don't have a natural key, you can't build a reliable workflow. How would you automatically git clone, reset, build, deploy and re-run your workflow - based on data collected by spark - if you don't even have the git commit there? Looking up the git commit hash by version and date... sure, it works, but why can't users just access that info directly? I don't see ANY reason not to have that feature. If you have any, I would be glad to read it. And no, even for developers that info is not available at runtime. > Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isn't > possible.
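For context: Spark builds do generate a `spark-version-info.properties` resource (packaged in the core jar) carrying the version, git revision, branch, and build date, even though it was not exposed programmatically at the time of this thread. A hedged sketch of reading such key=value properties content — the sample payload below is made up, not taken from a real build:

```python
def parse_version_info(text):
    """Parse simple Java-style 'key=value' properties lines into a
    dict, skipping blank lines and '#' comments."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip()
    return info

# Hypothetical sample content, shaped like spark-version-info.properties.
sample = """
# build metadata
version=2.4.0
revision=0123456789abcdef0123456789abcdef01234567
branch=branch-2.4
"""
info = parse_version_info(sample)
# info["revision"] is the git commit hash the commenter asks for: a
# natural key that identifies the exact source tree, unlike a version.
```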
[jira] [Comment Edited] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410238#comment-15410238 ] Jan Gorecki edited comment on SPARK-16864 at 8/5/16 10:52 PM: -- Hi, the git commit is relevant to applications at runtime as long as the subject of an application (in any dimension) is spark itself. I don't understand why that info would not be included. This may not be a problem for people who build from source; they can eventually put that metadata in a plaintext file (still an overhead). The bigger problem is for those who grab binaries and, for example, just want to track performance in their cluster over the spark git history. A git commit hash is the natural key for a project's source code; you won't find a better field to reference source code. Referencing a release version is a different thing. > Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isn't > possible.
[jira] [Commented] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410238#comment-15410238 ] Jan Gorecki commented on SPARK-16864: - Hi, the git commit is relevant to applications at runtime as long as the subject of an application (in any dimension) is spark itself. I don't understand why that info would not be included. This may not be a problem for people who build from source; they can eventually put that metadata in a plaintext file (still an overhead). The bigger problem is for those who just grab binaries and, for example, want to track performance in their cluster over the spark git history. A git commit hash is the natural key for a project's source code; you won't find a better field to reference source code. Referencing a release version is simply a different thing. > Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isn't > possible.