[jira] [Created] (SPARK-26589) proper `median` method for spark dataframe
Jan Gorecki created SPARK-26589: --- Summary: proper `median` method for spark dataframe Key: SPARK-26589 URL: https://issues.apache.org/jira/browse/SPARK-26589 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.0 Reporter: Jan Gorecki I found multiple tickets asking for a median function to be implemented in Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as the issue they duplicate. The thing is that the approximate quantile is a workaround for the lack of a median function. Thus I am filing this feature request for a proper, exact median function, not an approximation of it. I am aware of the difficulties that a distributed environment causes when computing a median; nevertheless, I don't think those difficulties are a good enough reason to leave a `median` function out of Spark's scope. I am not asking for an efficient median, but an exact one. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
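To make the "exact, not approximate" distinction concrete, here is a minimal single-machine sketch of an exact median via quickselect (plain Python, not a Spark API; illustrative only — an exact distributed median needs the same order statistics without a full sort):

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) of values.

    Average O(n): partitions around a random pivot instead of fully
    sorting, recursing only into the side that contains index k.
    """
    pivot = random.choice(values)
    lows = [v for v in values if v < pivot]
    pivots = [v for v in values if v == pivot]
    highs = [v for v in values if v > pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))

def exact_median(values):
    """Exact (not approximate) median: the middle order statistic for
    odd n, the mean of the two middle ones for even n."""
    n = len(values)
    if n == 0:
        raise ValueError("median of empty data")
    if n % 2 == 1:
        return float(quickselect(values, n // 2))
    lo = quickselect(values, n // 2 - 1)
    hi = quickselect(values, n // 2)
    return (lo + hi) / 2.0

# exact_median([3, 1, 2])    -> 2.0
# exact_median([4, 1, 3, 2]) -> 2.5
```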
[jira] [Commented] (SPARK-26433) Tail method for spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730887#comment-16730887 ] Jan Gorecki commented on SPARK-26433: - [~hyukjin.kwon] Thank you for your comment, but I am not sure I understood correctly. Do you mean I should first collect the data to the client and then extract the last few rows of the dataframe? If so, that doesn't seem to be a feasible solution, as the data in Spark is likely not to fit on the client machine. `Tail` is exactly the operation one would want to perform BEFORE collecting data to the client. Could you confirm?
> Tail method for spark DataFrame
> ---
> Key: SPARK-26433
> URL: https://issues.apache.org/jira/browse/SPARK-26433
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 2.4.0
> Reporter: Jan Gorecki
> Priority: Major
>
> There is a head method for spark dataframes which works fine, but there
> doesn't seem to be a tail method.
> ```
> >>> ans
> DataFrame[v1: bigint]
> >>> ans.head(3)
> [Row(v1=299443), Row(v1=299493), Row(v1=300751)]
> >>> ans.tail(3)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 1300, in __getattr__
>     "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
> AttributeError: 'DataFrame' object has no attribute 'tail'
> ```
> I would like to request a `tail` method for Spark DataFrame.
[jira] [Created] (SPARK-26433) Tail method for spark DataFrame
Jan Gorecki created SPARK-26433: --- Summary: Tail method for spark DataFrame Key: SPARK-26433 URL: https://issues.apache.org/jira/browse/SPARK-26433 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 2.4.0 Reporter: Jan Gorecki There is a head method for spark dataframes which works fine, but there doesn't seem to be a tail method.
```
>>> ans
DataFrame[v1: bigint]
>>> ans.head(3)
[Row(v1=299443), Row(v1=299493), Row(v1=300751)]
>>> ans.tail(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 1300, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'tail'
```
I would like to request a `tail` method for Spark DataFrame.
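On Spark 2.4 the usual workaround is a two-pass count-then-skip (count the rows, then filter on a row index to keep only the last n) rather than collecting everything to the client. A plain-Python sketch of that logic, with hypothetical lists standing in for an RDD's partitions:

```python
def tail(partitions, n):
    """Two-pass 'tail' over partitioned data: the first pass counts
    rows, the second skips everything before the last n. This mirrors
    the count-then-filter workaround on a DataFrame with a row index,
    and avoids pulling the whole dataset to the client."""
    total = sum(len(p) for p in partitions)  # pass 1: count
    skip = max(total - n, 0)
    out, seen = [], 0
    for part in partitions:                  # pass 2: skip, then keep
        for row in part:
            if seen >= skip:
                out.append(row)
            seen += 1
    return out

# tail([[1, 2, 3], [4, 5], [6]], 2) -> [5, 6]
```

For reference, later Spark releases did add a built-in `DataFrame.tail(n)` in the 3.x line, so on current versions no workaround is needed.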
[jira] [Comment Edited] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410596#comment-15410596 ] Jan Gorecki edited comment on SPARK-16864 at 8/6/16 11:50 AM: -- Record the exact spark source code reference while processing an ETL workflow, so performance implications can be measured precisely against a point in time in the source code. I doubt that a version number or a date/time is a natural key for spark source code, is it? If you don't have a natural key, you can't build a reliable workflow. How would you automatically git clone, reset, build, deploy and re-run your workflow - based on data collected by spark - if you don't even have the git commit there? Looking up the git commit hash by version and date... sure, it works, but why can't users just access that info directly? I don't see ANY reason not to have that feature. If you have any, I would be glad to read it. And no, even for developers that info is not available at runtime.
> Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isn't > possible.
[jira] [Commented] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410596#comment-15410596 ] Jan Gorecki commented on SPARK-16864: - Record the exact spark source code reference while processing an ETL workflow, so performance implications can be measured precisely against a point in time in the source code. I doubt that a version number or a date/time is a natural key for spark source code, is it? If you don't have a natural key, you can't build a reliable workflow. How would you automatically git clone, reset, build, deploy and re-run your workflow - based on data collected by spark - if you don't even have the git commit there? Looking up the git commit hash by version and date... sure, it works, but why can't users just access that info directly? I don't see ANY reason not to have that feature. If you have any, I would be glad to read it. And no, even for developers that info is not available at runtime. > Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isn't > possible.
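For context: Spark builds do generate a `spark-version-info.properties` resource (packaged in the core jar) carrying the version, git revision, branch, and build date, even though it was not exposed programmatically at the time of this thread. A hedged sketch of reading such key=value properties content — the sample payload below is made up, not taken from a real build:

```python
def parse_version_info(text):
    """Parse simple Java-style 'key=value' properties lines into a
    dict, skipping blank lines and '#' comments."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip()
    return info

# Hypothetical sample content, shaped like spark-version-info.properties.
sample = """
# build metadata
version=2.4.0
revision=0123456789abcdef0123456789abcdef01234567
branch=branch-2.4
"""
info = parse_version_info(sample)
# info["revision"] is the git commit hash the commenter asks for: a
# natural key that identifies the exact source tree, unlike a version.
```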
[jira] [Comment Edited] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410238#comment-15410238 ] Jan Gorecki edited comment on SPARK-16864 at 8/5/16 10:52 PM: -- Hi, the git commit is relevant to applications at runtime as long as the subject of an application (in any dimension) is spark itself. I don't understand why that info would not be included. This may not be a problem for people who build from source; they can eventually put that metadata in a plaintext file (still an overhead). The bigger problem is for those who grab binaries and, for example, just want to track performance in their cluster over the spark git history. A git commit hash is the natural key for a project's source code; you won't find a better field to reference source code. Referencing a release version is a different thing. > Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isn't > possible.
[jira] [Commented] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410238#comment-15410238 ] Jan Gorecki commented on SPARK-16864: - Hi, the git commit is relevant to applications at runtime as long as the subject of an application (in any dimension) is spark itself. I don't understand why that info would not be included. This may not be a problem for people who build from source; they can eventually put that metadata in a plaintext file (still an overhead). The bigger problem is for those who just grab binaries and, for example, want to track performance in their cluster over the spark git history. A git commit hash is the natural key for a project's source code; you won't find a better field to reference source code. Referencing a release version is simply a different thing. > Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isn't > possible.