[jira] [Commented] (SPARK-14006) Builds of 1.6 branch fail R style check
[ https://issues.apache.org/jira/browse/SPARK-14006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205841#comment-15205841 ] Yin Huai commented on SPARK-14006:
--
The 1.6 branch is broken because of the R style issue. Can you take a look at it? If backporting that PR fixes the problem, then yes, please backport it.

> Builds of 1.6 branch fail R style check
> ---------------------------------------
>
> Key: SPARK-14006
> URL: https://issues.apache.org/jira/browse/SPARK-14006
> Project: Spark
> Issue Type: Bug
> Components: SparkR, Tests
> Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-1.6-test-sbt-hadoop-2.2/152/console

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14006) Builds of 1.6 branch fail R style check
[ https://issues.apache.org/jira/browse/SPARK-14006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205832#comment-15205832 ] Sun Rui commented on SPARK-14006:
-
[~yhuai] Do you mean a backport PR to branch 1.6?
[jira] [Updated] (SPARK-14030) Add parameter check to LBFGS
[ https://issues.apache.org/jira/browse/SPARK-14030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14030:
--
Assignee: zhengruifeng

> Add parameter check to LBFGS
> ----------------------------
>
> Key: SPARK-14030
> URL: https://issues.apache.org/jira/browse/SPARK-14030
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Trivial
>
> Add the missing parameter verification in LBFGS
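MLlib's LBFGS is implemented in Scala, so as an illustration only, here is a minimal Python sketch of the kind of up-front parameter verification the ticket asks for. The function name, parameter names, and valid ranges are assumptions for the sketch, not MLlib's actual API:

```python
def check_lbfgs_params(num_corrections, convergence_tol, max_num_iterations, reg_param):
    """Reject obviously invalid LBFGS hyperparameters up front
    instead of letting them fail deep inside the optimizer."""
    if num_corrections <= 0:
        raise ValueError(f"num_corrections must be positive, got {num_corrections}")
    if not 0.0 <= convergence_tol <= 1.0:
        raise ValueError(f"convergence_tol must be in [0, 1], got {convergence_tol}")
    if max_num_iterations <= 0:
        raise ValueError(f"max_num_iterations must be positive, got {max_num_iterations}")
    if reg_param < 0.0:
        raise ValueError(f"reg_param must be non-negative, got {reg_param}")

check_lbfgs_params(10, 1e-4, 100, 0.0)   # valid settings pass silently
try:
    check_lbfgs_params(10, -1.0, 100, 0.0)
    rejected = False
except ValueError:
    rejected = True
assert rejected                           # invalid tolerance is caught early
```

Failing fast like this gives the caller an error naming the bad parameter, rather than a cryptic failure mid-optimization.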
[jira] [Updated] (SPARK-14030) Add parameter check to LBFGS
[ https://issues.apache.org/jira/browse/SPARK-14030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14030:
--
Target Version/s: 2.0.0
[jira] [Comment Edited] (SPARK-14037) count(df) is very slow for a dataframe constructed using SparkR::createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205790#comment-15205790 ] Sun Rui edited comment on SPARK-14037 at 3/22/16 4:49 AM:
--
If possible, just use read.df() to load a DataFrame directly from the CSV file. Loading a CSV file into a local R data.frame and then calling createDataFrame() on it is more time-consuming because it involves launching external R processes on the worker nodes and two rounds of data serialization/deserialization. Still, 30 seconds is really slow; could you help gather some metrics? Since you are running in standalone mode, you can go to the web UI and find something like the following in the worker stderr logs:

{code}
INFO r.RRDD: Times: boot = 0.518 s, init = 0.009 s, broadcast = 0.000 s, read-input = 0.001 s, compute = 0.002 s, write-output = 0.074 s, total = 0.604 s
{code}

> count(df) is very slow for a dataframe constructed using SparkR::createDataFrame
> --------------------------------------------------------------------------------
>
> Key: SPARK-14037
> URL: https://issues.apache.org/jira/browse/SPARK-14037
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.6.1
> Environment: Ubuntu 12.04
> RAM: 6 GB
> Spark 1.6.1 Standalone
> Reporter: Samuel Alexander
> Labels: performance, sparkR
>
> Any operation on a dataframe created using SparkR::createDataFrame is very slow.
> I have a CSV of size ~6 MB. Below is a sample of its content:
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, sep=","), and then converted it into a Spark dataframe using sp_df <- createDataFrame(sqlContext, r_df).
> Now count(sp_df) took more than 30 seconds.
> When I load the same CSV using spark-csv, i.e. direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = "com.databricks.spark.csv", inferSchema = "false", header="true"), count(direct_df) took below 1 second.
> I know createDataFrame performance was improved in Spark 1.6, but other operations such as count() are still very slow.
> How can I get rid of this performance issue?
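The cost difference described above comes from where the data is materialized. A rough, language-neutral analogy in Python (this is an illustration of the two data paths, not SparkR's implementation; `pickle` stands in for SparkR's R-to-JVM serialization):

```python
import csv
import io
import pickle

# Sample rows shaped like the CSV in the report.
csv_text = "\n".join([
    "12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter",
    "12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter",
    "12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter",
])

# Path 1 (analogue of createDataFrame on a local data.frame):
# materialize every row locally, then push the whole dataset through
# two serialize/deserialize round trips before counting.
rows = list(csv.reader(io.StringIO(csv_text)))
shipped = pickle.loads(pickle.dumps(rows))      # driver -> workers
result = pickle.loads(pickle.dumps(shipped))    # workers -> driver
count_via_local = len(result)

# Path 2 (analogue of read.df reading the file source directly):
# the engine scans the source itself, with no local materialization
# and no extra serialization rounds.
count_via_source = sum(1 for _ in csv.reader(io.StringIO(csv_text)))

assert count_via_local == count_via_source == 3
```

Both paths give the same answer; the first simply pays twice for serialization plus the cost of holding all rows in the driver, which is why read.df() is the recommended route for file-backed data.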
[jira] [Commented] (SPARK-14037) count(df) is very slow for a dataframe constructed using SparkR::createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205790#comment-15205790 ] Sun Rui commented on SPARK-14037:
-
If possible, just use read.df() to load a DataFrame directly from the CSV file. Loading a CSV file into a local R data.frame and then calling createDataFrame() on it is more time-consuming because it involves launching external R processes on the worker nodes and two rounds of data serialization/deserialization. Still, 30 seconds is really slow; could you help gather some metrics? Since you are running in standalone mode, you can go to the web UI and find something like the following in the worker stderr logs:

```
INFO r.RRDD: Times: boot = 0.518 s, init = 0.009 s, broadcast = 0.000 s, read-input = 0.001 s, compute = 0.002 s, write-output = 0.074 s, total = 0.604 s
```
[jira] [Commented] (SPARK-11507) Error thrown when using BlockMatrix.add
[ https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205734#comment-15205734 ] yuhao yang commented on SPARK-11507:
-
Sure, we can do it. About the fix: I assume we should copy first and then invoke compact, right?

> Error thrown when using BlockMatrix.add
> ---------------------------------------
>
> Key: SPARK-11507
> URL: https://issues.apache.org/jira/browse/SPARK-11507
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.3.1, 1.5.0
> Environment: Mac/local machine, EC2
> Scala
> Reporter: Kareem Alhazred
> Priority: Minor
>
> In certain situations when adding two block matrices, I get an error regarding colPtr and the operation fails. The external issue URL includes the full error and code for reproducing the problem.
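The "copy first and invoke compact" idea can be sketched outside of Breeze/Scala. As an illustration only (a toy dict-based sparse format, not BlockMatrix's CSC representation), the point is that addition can produce explicit zero entries, and compacting the copied result removes them so the stored structure stays consistent:

```python
def sparse_add(a, b):
    """Add two sparse matrices stored as {(row, col): value} dicts,
    then 'compact' the result by dropping explicit zero entries
    (the analogue of copying a CSC matrix and calling compact)."""
    out = dict(a)                      # copy first, so the inputs are untouched
    for key, v in b.items():
        out[key] = out.get(key, 0.0) + v
    return {k: v for k, v in out.items() if v != 0.0}   # compact

a = {(0, 0): 1.0, (1, 2): 3.0}
b = {(0, 0): -1.0, (2, 1): 5.0}
s = sparse_add(a, b)
assert (0, 0) not in s                 # cancelled entry removed by compaction
assert s == {(1, 2): 3.0, (2, 1): 5.0}
```

In a CSC layout the uncompacted zeros are what make column-pointer arrays (colPtr) disagree between matrices of the "same" shape; compaction after the copy avoids mutating either input.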
[jira] [Resolved] (SPARK-13883) buildReader implementation for parquet
[ https://issues.apache.org/jira/browse/SPARK-13883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13883.
--
Resolution: Fixed

Issue resolved by pull request 11709
[https://github.com/apache/spark/pull/11709]

> buildReader implementation for parquet
> --------------------------------------
>
> Key: SPARK-13883
> URL: https://issues.apache.org/jira/browse/SPARK-13883
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Michael Armbrust
> Fix For: 2.0.0
>
> Port parquet to the new strategy
[jira] [Commented] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration
[ https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205679#comment-15205679 ] Apache Spark commented on SPARK-14056:
--
User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/11876

> Add s3 configurations and spark.hadoop.* configurations to hive configuration
> -----------------------------------------------------------------------------
>
> Key: SPARK-14056
> URL: https://issues.apache.org/jira/browse/SPARK-14056
> Project: Spark
> Issue Type: Improvement
> Components: EC2, SQL
> Affects Versions: 1.6.1
> Reporter: Sital Kedia
>
> Currently, when creating a HiveConf in TableReader.scala, we are not passing along S3-specific configurations (like AWS S3 credentials) or the spark.hadoop.* configurations set by the user. We should fix this issue.
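The propagation the ticket describes can be sketched generically. This is a hedged Python illustration of the key-copying logic (conf objects modeled as plain dicts; the function name and the exact set of passed-through S3 keys are assumptions for the sketch, not Spark's code):

```python
def hadoop_conf_from_spark(spark_conf):
    """Build a Hadoop/Hive-style conf dict from a Spark conf dict:
    spark.hadoop.* keys are copied with the prefix stripped, and
    fs.s3* keys (e.g. S3 credentials) are passed through as-is."""
    hadoop_conf = {}
    for key, value in spark_conf.items():
        if key.startswith("spark.hadoop."):
            hadoop_conf[key[len("spark.hadoop."):]] = value
        elif key.startswith("fs.s3"):
            hadoop_conf[key] = value
    return hadoop_conf

spark_conf = {
    "spark.hadoop.mapreduce.input.fileinputformat.split.maxsize": "268435456",
    "fs.s3a.access.key": "AKIA-PLACEHOLDER",   # placeholder credential
    "spark.app.name": "demo",                  # unrelated key, not copied
}
hc = hadoop_conf_from_spark(spark_conf)
assert hc == {
    "mapreduce.input.fileinputformat.split.maxsize": "268435456",
    "fs.s3a.access.key": "AKIA-PLACEHOLDER",
}
```

Stripping the `spark.hadoop.` prefix matters: downstream Hadoop code looks the keys up under their bare names, so copying them verbatim would silently have no effect.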
[jira] [Assigned] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration
[ https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14056:
Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration
[ https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14056:
Assignee: Apache Spark
[jira] [Created] (SPARK-14057) sql time stamps do not respect time zones
Andrew Davidson created SPARK-14057:
---
Summary: sql time stamps do not respect time zones
Key: SPARK-14057
URL: https://issues.apache.org/jira/browse/SPARK-14057
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.0
Reporter: Andrew Davidson
Priority: Minor

We have timestamp data, and the timestamps are UTC. However, when we load the data into Spark data frames, the system assumes the timestamps are in the local time zone. This causes problems for our data scientists: they often pull data from our data center onto their local Macs, and while the data centers run UTC, their computers are typically in PST or EST. It is possible to hack around this problem, but it causes a lot of errors in their analysis.

A complete description of this issue can be found in the following mail message: https://www.mail-archive.com/user@spark.apache.org/msg48121.html
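The failure mode described above is the classic naive-vs-aware timestamp trap, which can be shown with plain Python (this illustrates the general problem, not Spark's internal timestamp handling):

```python
from datetime import datetime, timezone, timedelta

ts = "2016-03-14 12:00:00"   # a UTC wall-clock time from the data center

# Naive parse: no tzinfo attached, so downstream code interprets the
# value in whatever zone the local machine happens to be in.
naive = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
assert naive.tzinfo is None

# Explicitly tagging the value as UTC keeps it stable everywhere.
utc = naive.replace(tzinfo=timezone.utc)

# Viewed from a PST-like zone (UTC-8), the same instant reads 04:00.
pst = timezone(timedelta(hours=-8))
assert utc.astimezone(pst).hour == 4
assert utc == utc.astimezone(pst)    # same instant, different rendering
```

A data scientist on a PST laptop who parses the naive string locally gets a value eight hours away from the instant the data center recorded, which is exactly the mismatch the report describes.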
[jira] [Updated] (SPARK-14055) AssertionError may happen if writeLock is not unlocked in the 'removeBlock' method
[ https://issues.apache.org/jira/browse/SPARK-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14055:
---
Affects Version/s: 2.0.0
Target Version/s: 2.0.0
Priority: Critical (was: Minor)

> AssertionError may happen if writeLock is not unlocked in the 'removeBlock' method
> ----------------------------------------------------------------------------------
>
> Key: SPARK-14055
> URL: https://issues.apache.org/jira/browse/SPARK-14055
> Project: Spark
> Issue Type: Bug
> Components: Block Manager, Spark Core
> Affects Versions: 2.0.0
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 2
> Each node has 1 Spark executor configured with 16 cores and 40GB of RAM.
> Reporter: Ernest
> Priority: Critical
>
> We got the following log when running _LiveJournalPageRank_.
> {quote}
> 452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to acquire write lock for rdd_3_183
> 452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write lock for rdd_3_183
> 456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from memory
> 456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size 418784648 dropped from memory (free 3504141600)
> 457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block rdd_3_183
> 457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block rdd_3_183
> 457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to remove block rdd_3_183
> 500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put rdd_3_183
> 500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to acquire read lock for rdd_3_183
> 500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to acquire write lock for rdd_3_183
> 500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write lock for rdd_3_183
> 517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ** taskAttemptId is: 1662, info.writerTask is: 1681, blockID is: rdd_3_183, so the AssertionError happens here *
> 517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage 10.0 (TID 1662)
> 517259-java.lang.AssertionError: assertion failed
> 517260- at scala.Predef$.assert(Predef.scala:151)
> 517261- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356)
> 517262- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351)
> 517263- at scala.Option.foreach(Option.scala:257)
> 517264- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351)
> 517265- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350)
> 517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
> 517267- at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350)
> 517268- at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626)
> 517269- at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238)
> {quote}
> When memory for RDD storage is not sufficient and several partitions have to be evicted, this _AssertionError_ may happen.
> In the example above, while running _Task 1662_, several partitions (including rdd_3_183) needed to be evicted. _Task 1662_ acquired read and write locks first, then ran the _dropBlock_ method inside _MemoryStore.evictBlocksToFreeSpace_ and actually dropped _rdd_3_183_ from memory. Because _newEffectiveStorageLevel.isValid_ is false, the code runs into _BlockInfoManager.removeBlock_, but _writeLocksByTask_ is not updated there.
> Unfortunately, _Task 1681_ had already started and needed to recompute rdd\_3\_183 to produce its target RDD, so it acquired the write lock on rdd\_3\_183. When _Task 1662_ finally calls _releaseAllLocksForTask_, this _AssertionError_ occurs.
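The race described above can be reproduced in a small model. This Python sketch is an analogy only (it mirrors the bookkeeping described in the report, not Spark's actual BlockInfoManager code): removeBlock clears the block's writer but, on the buggy path, forgets the entry in writeLocksByTask, so a later releaseAllLocksForTask finds the lock now held by another task and trips the assertion:

```python
from collections import defaultdict

class BlockInfoManagerModel:
    """Toy model of per-task write-lock bookkeeping."""
    def __init__(self):
        self.writer_task = {}                     # block -> task holding write lock
        self.write_locks_by_task = defaultdict(set)

    def lock_for_writing(self, task, block):
        self.writer_task[block] = task
        self.write_locks_by_task[task].add(block)

    def remove_block(self, task, block, buggy=True):
        self.writer_task.pop(block, None)
        if not buggy:                             # the fix: clear the task's entry too
            self.write_locks_by_task[task].discard(block)

    def release_all_locks_for_task(self, task):
        for block in self.write_locks_by_task.pop(task, set()):
            # mirrors the failing assert at BlockInfoManager.scala:356
            assert self.writer_task.get(block) in (None, task), "assertion failed"
            self.writer_task.pop(block, None)

# Buggy sequence from the log: Task 1662 removes the block but keeps a
# stale entry; Task 1681 re-acquires the write lock; 1662's cleanup fails.
m = BlockInfoManagerModel()
m.lock_for_writing(1662, "rdd_3_183")
m.remove_block(1662, "rdd_3_183")              # stale writeLocksByTask entry
m.lock_for_writing(1681, "rdd_3_183")
try:
    m.release_all_locks_for_task(1662)
    raised = False
except AssertionError:
    raised = True
assert raised

# With removeBlock also clearing the bookkeeping, cleanup is safe.
m2 = BlockInfoManagerModel()
m2.lock_for_writing(1662, "rdd_3_183")
m2.remove_block(1662, "rdd_3_183", buggy=False)
m2.lock_for_writing(1681, "rdd_3_183")
m2.release_all_locks_for_task(1662)            # no stale lock, no assertion
```

The model makes the fix obvious: whatever removes a block must release every piece of per-task lock state for it, not just the block's own info record.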
[jira] [Assigned] (SPARK-14055) AssertionError may happen if writeLock is not unlocked in the 'removeBlock' method
[ https://issues.apache.org/jira/browse/SPARK-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14055:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-14055) AssertionError may happen if writeLock is not unlocked in the 'removeBlock' method
[ https://issues.apache.org/jira/browse/SPARK-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205659#comment-15205659 ] Apache Spark commented on SPARK-14055:
--
User 'Earne' has created a pull request for this issue: https://github.com/apache/spark/pull/11875
[jira] [Assigned] (SPARK-14055) AssertionError may happen if writeLock is not unlocked in the 'removeBlock' method
[ https://issues.apache.org/jira/browse/SPARK-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14055: Assignee: Apache Spark > AssertionError may happeneds if not unlock writeLock when doing 'removeBlock' > method > > > Key: SPARK-14055 > URL: https://issues.apache.org/jira/browse/SPARK-14055 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core > Environment: Spark 2.0-SNAPSHOT > Single Rack > Standalone mode scheduling > 8 node cluster > 16 cores & 64G RAM / node > Data Replication factor of 2 > Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM. >Reporter: Ernest >Assignee: Apache Spark >Priority: Minor > > We got the following log when running _LiveJournalPageRank_. > {quote} > 452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to > acquire write lock for rdd_3_183 > 452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write > lock for rdd_3_183 > 456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from > memory > 456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size > 418784648 dropped from memory (free 3504141600) > 457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block > rdd_3_183 > 457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block > rdd_3_183 > 457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to > remove block rdd_3_183 > 500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put > rdd_3_183 > 500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to > acquire read lock for rdd_3_183 > 500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to > acquire write lock for rdd_3_183 > 500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write > lock for rdd_3_183 > 517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ** taskAttemptId is: > 1662, 
info.writerTask is: 1681, blockID is: rdd_3_183 so AssertionError > happeneds here* > 517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage > 10.0 (TID 1662) > 517259-java.lang.AssertionError: assertion failed > 517260- at scala.Predef$.assert(Predef.scala:151) > 517261- at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356) > 517262- at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351) > 517263- at scala.Option.foreach(Option.scala:257) > 517264- at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351) > 517265- at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350) > 517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > 517267- at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350) > 517268- at > org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626) > 517269- at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238) > {quote} > When memory for RDD storage is not sufficient and several partitions have to > be evicted, this _AssertionError_ may occur. > For the above example, while running _Task 1662_, several partitions > (including rdd_3_183) needed to be evicted. _Task 1662_ acquired the read and > write locks first, then ran the _dropBlock_ method in > _MemoryStore.evictBlocksToFreeSpace_ and actually dropped _rdd_3_183_ from > memory. Since _newEffectiveStorageLevel.isValid_ was false, execution reached > _BlockInfoManager.removeBlock_, but _writeLocksByTask_ was not updated there. > Unfortunately, _Task 1681_ had already started and needed to recompute > rdd\_3\_183 to produce its target RDD, and that task acquired the write lock > on rdd\_3\_183.
When _Task 1662_ finally called _releaseAllLocksForTask_, > this _AssertionError_ occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
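The lock-bookkeeping flaw described above can be modeled in a few lines of plain Python (an illustrative sketch with simplified, hypothetical names; the real logic is Scala in BlockInfoManager.scala): removeBlock drops the block's info record without clearing the evicting task's entry in writeLocksByTask, so a later releaseAllLocksForTask finds the block held by a different task and the assertion fails.

```python
# Toy model of the bookkeeping bug from the report above -- not Spark's API.

class BlockInfo:
    def __init__(self):
        self.writer_task = -1  # -1 stands in for "no writer"

class BlockInfoManager:
    def __init__(self):
        self.infos = {}                 # block_id -> BlockInfo
        self.write_locks_by_task = {}   # task_id -> set of block_ids

    def lock_for_writing(self, task_id, block_id):
        info = self.infos.setdefault(block_id, BlockInfo())
        info.writer_task = task_id
        self.write_locks_by_task.setdefault(task_id, set()).add(block_id)

    def remove_block_buggy(self, block_id):
        # BUG (as in the report): the info record is dropped, but the
        # evicting task's entry in write_locks_by_task is NOT cleaned up.
        del self.infos[block_id]

    def release_all_locks_for_task(self, task_id):
        for block_id in self.write_locks_by_task.pop(task_id, set()):
            info = self.infos.get(block_id)
            if info is not None:
                # Mirrors Spark's assertion that the releasing task
                # still holds the write lock on this block.
                assert info.writer_task == task_id, "assertion failed"
                info.writer_task = -1

m = BlockInfoManager()
m.lock_for_writing(1662, "rdd_3_183")    # Task 1662 evicts the block...
m.remove_block_buggy("rdd_3_183")        # ...removing it without bookkeeping.
m.lock_for_writing(1681, "rdd_3_183")    # Task 1681 re-creates and locks it.
try:
    m.release_all_locks_for_task(1662)   # Task 1662 finishes its task attempt.
    failed = False
except AssertionError:
    failed = True
print(failed)  # True: the stale entry points at a block now locked by 1681
```

The sketch shows why the fix must clear writeLocksByTask (and the read-lock counterpart) whenever removeBlock discards a block's metadata.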
[jira] [Assigned] (SPARK-3000) Drop old blocks to disk in parallel when memory is not large enough for caching new blocks
[ https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3000: --- Assignee: Josh Rosen (was: Apache Spark) > Drop old blocks to disk in parallel when memory is not large enough for > caching new blocks > -- > > Key: SPARK-3000 > URL: https://issues.apache.org/jira/browse/SPARK-3000 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye >Assignee: Josh Rosen > Attachments: Spark-3000 Design Doc.pdf > > > In Spark, an RDD can be cached in memory for later use; the cached memory > size is "*spark.executor.memory * spark.storage.memoryFraction*" for Spark > versions before 1.1.0, and "*spark.executor.memory * > spark.storage.memoryFraction * spark.storage.safetyFraction*" after > [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. > For storage level *MEMORY_AND_DISK*, when free memory is not enough to cache > new blocks, old blocks may be dropped to disk to free up memory for new > blocks. This operation is handled by _ensureFreeSpace_ in > _MemoryStore.scala_; an "*accountingLock*" is always held by the caller to > ensure only one thread is dropping blocks. This approach cannot fully use the > disks' throughput when there are multiple disks on the worker node. When > testing our workload, we found this to be a real bottleneck when the size of > the old blocks to be dropped is large. > We tested the parallel method on Spark 1.0 and the speedup is significant, > so it is worth making the block-dropping operation parallel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
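The proposed direction can be sketched as follows (a hedged Python illustration, not Spark code; the class and lock names are stand-ins): keep the accounting serialized under one lock, but hand the actual disk writes to a thread pool, one task per block, so multiple disks can be driven concurrently.

```python
# Illustrative sketch of "drop evicted blocks in parallel": the global
# accounting lock covers only bookkeeping, while disk writes fan out to a
# pool. Names (MemoryStore, accounting_lock) loosely mirror the issue text.
import threading
from concurrent.futures import ThreadPoolExecutor

class MemoryStore:
    def __init__(self, num_drop_threads=4):
        self.accounting_lock = threading.Lock()
        self.memory = {}         # block_id -> data
        self.disk = {}           # block_id -> data (stand-in for a DiskStore)
        self.disk_lock = threading.Lock()
        self.pool = ThreadPoolExecutor(max_workers=num_drop_threads)

    def _drop_to_disk(self, block_id, data):
        with self.disk_lock:     # guards only the shared dict in this sketch
            self.disk[block_id] = data

    def evict_blocks(self, block_ids):
        futures = []
        with self.accounting_lock:            # bookkeeping stays serialized
            for bid in block_ids:
                data = self.memory.pop(bid)
                futures.append(self.pool.submit(self._drop_to_disk, bid, data))
        for f in futures:                     # writes proceed in parallel
            f.result()

store = MemoryStore()
store.memory = {f"rdd_0_{i}": bytes([i]) for i in range(8)}
store.evict_blocks(list(store.memory))
print(sorted(store.disk))  # all 8 evicted blocks landed on "disk"
```

With real files spread across several disks, the per-block writes would no longer be forced through the single accountingLock critical section, which is the bottleneck the issue describes.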
[jira] [Commented] (SPARK-3000) Drop old blocks to disk in parallel when memory is not large enough for caching new blocks
[ https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205651#comment-15205651 ] Apache Spark commented on SPARK-3000: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11874 > Drop old blocks to disk in parallel when memory is not large enough for > caching new blocks > -- > > Key: SPARK-3000 > URL: https://issues.apache.org/jira/browse/SPARK-3000 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye >Assignee: Josh Rosen > Attachments: Spark-3000 Design Doc.pdf > > > In spark, rdd can be cached in memory for later use, and the cached memory > size is "*spark.executor.memory * spark.storage.memoryFraction*" for spark > version before 1.1.0, and "*spark.executor.memory * > spark.storage.memoryFraction * spark.storage.safetyFraction*" after > [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. > For Storage level *MEMORY_AND_DISK*, when free memory is not enough to cache > new blocks, old blocks might be dropped to disk to free up memory for new > blocks. This operation is processed by _ensureFreeSpace_ in > _MemoryStore.scala_, there will always be a "*accountingLock*" held by the > caller to ensure only one thread is dropping blocks. This method can not > fully used the disks throughput when there are multiple disks on the working > node. When testing our workload, we found this is really a bottleneck when > size of old blocks to be dropped is really large. > We have tested the parallel method on spark 1.0, the speedup is significant. > So it's necessary to make dropping blocks operation in parallel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3000) Drop old blocks to disk in parallel when memory is not large enough for caching new blocks
[ https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3000: --- Assignee: Apache Spark (was: Josh Rosen) > Drop old blocks to disk in parallel when memory is not large enough for > caching new blocks > -- > > Key: SPARK-3000 > URL: https://issues.apache.org/jira/browse/SPARK-3000 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye >Assignee: Apache Spark > Attachments: Spark-3000 Design Doc.pdf > > > In spark, rdd can be cached in memory for later use, and the cached memory > size is "*spark.executor.memory * spark.storage.memoryFraction*" for spark > version before 1.1.0, and "*spark.executor.memory * > spark.storage.memoryFraction * spark.storage.safetyFraction*" after > [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. > For Storage level *MEMORY_AND_DISK*, when free memory is not enough to cache > new blocks, old blocks might be dropped to disk to free up memory for new > blocks. This operation is processed by _ensureFreeSpace_ in > _MemoryStore.scala_, there will always be a "*accountingLock*" held by the > caller to ensure only one thread is dropping blocks. This method can not > fully used the disks throughput when there are multiple disks on the working > node. When testing our workload, we found this is really a bottleneck when > size of old blocks to be dropped is really large. > We have tested the parallel method on spark 1.0, the speedup is significant. > So it's necessary to make dropping blocks operation in parallel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration
[ https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-14056: Affects Version/s: 1.6.1 Component/s: SQL EC2 > Add s3 configurations and spark.hadoop.* configurations to hive configuration > - > > Key: SPARK-14056 > URL: https://issues.apache.org/jira/browse/SPARK-14056 > Project: Spark > Issue Type: Improvement > Components: EC2, SQL >Affects Versions: 1.6.1 >Reporter: Sital Kedia > > Currently, when creating a HiveConf in TableReader.scala, we do not pass > S3-specific configurations (like AWS S3 credentials) or the spark.hadoop.* > configurations set by the user. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration
Sital Kedia created SPARK-14056: --- Summary: Add s3 configurations and spark.hadoop.* configurations to hive configuration Key: SPARK-14056 URL: https://issues.apache.org/jira/browse/SPARK-14056 Project: Spark Issue Type: Improvement Reporter: Sital Kedia Currently, when creating a HiveConf in TableReader.scala, we do not pass S3-specific configurations (like AWS S3 credentials) or the spark.hadoop.* configurations set by the user. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
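The requested propagation can be sketched with a small helper (a hypothetical Python illustration; the actual fix would be Scala code in TableReader.scala, and `fs.s3a.access.key` is just an example Hadoop property): strip the `spark.hadoop.` prefix and copy each entry into the Hadoop/Hive configuration.

```python
# Hypothetical sketch: copy user-set spark.hadoop.* entries into a
# Hadoop/Hive configuration mapping, as the issue requests.
def apply_spark_hadoop_confs(spark_conf, hive_conf):
    prefix = "spark.hadoop."
    for key, value in spark_conf.items():
        if key.startswith(prefix):
            # The Hadoop side sees the key without the spark.hadoop. prefix.
            hive_conf[key[len(prefix):]] = value
    return hive_conf

spark_conf = {
    "spark.hadoop.fs.s3a.access.key": "AKIA...",  # placeholder credential
    "spark.master": "local[4]",                   # non-hadoop key: ignored
}
hive_conf = apply_spark_hadoop_confs(spark_conf, {})
print(hive_conf)  # only the prefixed entry is propagated
```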
[jira] [Created] (SPARK-14055) AssertionError may happeneds if not unlock writeLock when doing 'removeBlock' method
Ernest created SPARK-14055: -- Summary: AssertionError may happeneds if not unlock writeLock when doing 'removeBlock' method Key: SPARK-14055 URL: https://issues.apache.org/jira/browse/SPARK-14055 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Environment: Spark 2.0-SNAPSHOT Single Rack Standalone mode scheduling 8 node cluster 16 cores & 64G RAM / node Data Replication factor of 2 Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM. Reporter: Ernest Priority: Minor We got the following log when running _LiveJournalPageRank_. {quote} 452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to acquire write lock for rdd_3_183 452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write lock for rdd_3_183 456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from memory 456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size 418784648 dropped from memory (free 3504141600) 457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block rdd_3_183 457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block rdd_3_183 457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to remove block rdd_3_183 500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put rdd_3_183 500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to acquire read lock for rdd_3_183 500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to acquire write lock for rdd_3_183 500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write lock for rdd_3_183 517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ** taskAttemptId is: 1662, info.writerTask is: 1681, blockID is: rdd_3_183 so AssertionError happeneds here* 517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage 10.0 (TID 1662) 517259-java.lang.AssertionError: assertion failed 517260- at 
scala.Predef$.assert(Predef.scala:151) 517261- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356) 517262- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351) 517263- at scala.Option.foreach(Option.scala:257) 517264- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351) 517265- at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350) 517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) 517267- at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350) 517268- at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626) 517269- at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238) {quote} When memory for RDD storage is not sufficient and several partitions have to be evicted, this _AssertionError_ may occur. For the above example, while running _Task 1662_, several partitions (including rdd_3_183) needed to be evicted. _Task 1662_ acquired the read and write locks first, then ran the _dropBlock_ method in _MemoryStore.evictBlocksToFreeSpace_ and actually dropped _rdd_3_183_ from memory. Since _newEffectiveStorageLevel.isValid_ was false, execution reached _BlockInfoManager.removeBlock_, but _writeLocksByTask_ was not updated there. Unfortunately, _Task 1681_ had already started and needed to recompute rdd\_3\_183 to produce its target RDD, and that task acquired the write lock on rdd\_3\_183. When _Task 1662_ finally called _releaseAllLocksForTask_, this _AssertionError_ occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14036) Remove mllib.tree.model.Node.build
[ https://issues.apache.org/jira/browse/SPARK-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205590#comment-15205590 ] Apache Spark commented on SPARK-14036: -- User 'rishabhbhardwaj' has created a pull request for this issue: https://github.com/apache/spark/pull/11873 > Remove mllib.tree.model.Node.build > -- > > Key: SPARK-14036 > URL: https://issues.apache.org/jira/browse/SPARK-14036 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Trivial > > mllib.tree.model.Node.build has been deprecated for a year. We should remove > it for 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14036) Remove mllib.tree.model.Node.build
[ https://issues.apache.org/jira/browse/SPARK-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14036: Assignee: (was: Apache Spark) > Remove mllib.tree.model.Node.build > -- > > Key: SPARK-14036 > URL: https://issues.apache.org/jira/browse/SPARK-14036 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Trivial > > mllib.tree.model.Node.build has been deprecated for a year. We should remove > it for 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14036) Remove mllib.tree.model.Node.build
[ https://issues.apache.org/jira/browse/SPARK-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14036: Assignee: Apache Spark > Remove mllib.tree.model.Node.build > -- > > Key: SPARK-14036 > URL: https://issues.apache.org/jira/browse/SPARK-14036 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Trivial > > mllib.tree.model.Node.build has been deprecated for a year. We should remove > it for 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205579#comment-15205579 ] Apache Spark commented on SPARK-14038: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/11872 > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: releasenotes > > Release note update: > {quote} > Starting from 2.0.0, Spark SQL handles views natively by default. When > defining a view, now Spark SQL canonicalizes view definition by generating a > canonical SQL statement from the parsed logical query plan, and then stores > it into the catalog. If you hit any problems, you may try to turn off native > view by setting {{spark.sql.nativeView}} to false. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14016) Support high-precision decimals in vectorized parquet reader
[ https://issues.apache.org/jira/browse/SPARK-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14016. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11869 [https://github.com/apache/spark/pull/11869] > Support high-precision decimals in vectorized parquet reader > > > Key: SPARK-14016 > URL: https://issues.apache.org/jira/browse/SPARK-14016 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14016) Support high-precision decimals in vectorized parquet reader
[ https://issues.apache.org/jira/browse/SPARK-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14016: - Assignee: Sameer Agarwal > Support high-precision decimals in vectorized parquet reader > > > Key: SPARK-14016 > URL: https://issues.apache.org/jira/browse/SPARK-14016 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3000) Drop old blocks to disk in parallel when memory is not large enough for caching new blocks
[ https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3000: -- Target Version/s: 2.0.0 Summary: Drop old blocks to disk in parallel when memory is not large enough for caching new blocks (was: drop old blocks to disk in parallel when memory is not large enough for caching new blocks) > Drop old blocks to disk in parallel when memory is not large enough for > caching new blocks > -- > > Key: SPARK-3000 > URL: https://issues.apache.org/jira/browse/SPARK-3000 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye >Assignee: Josh Rosen > Attachments: Spark-3000 Design Doc.pdf > > > In spark, rdd can be cached in memory for later use, and the cached memory > size is "*spark.executor.memory * spark.storage.memoryFraction*" for spark > version before 1.1.0, and "*spark.executor.memory * > spark.storage.memoryFraction * spark.storage.safetyFraction*" after > [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. > For Storage level *MEMORY_AND_DISK*, when free memory is not enough to cache > new blocks, old blocks might be dropped to disk to free up memory for new > blocks. This operation is processed by _ensureFreeSpace_ in > _MemoryStore.scala_, there will always be a "*accountingLock*" held by the > caller to ensure only one thread is dropping blocks. This method can not > fully used the disks throughput when there are multiple disks on the working > node. When testing our workload, we found this is really a bottleneck when > size of old blocks to be dropped is really large. > We have tested the parallel method on spark 1.0, the speedup is significant. > So it's necessary to make dropping blocks operation in parallel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14054) Support parameters for UDTs
Kevin Chen created SPARK-14054: -- Summary: Support parameters for UDTs Key: SPARK-14054 URL: https://issues.apache.org/jira/browse/SPARK-14054 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.6.1 Reporter: Kevin Chen Priority: Minor Currently, UDTs with parameters (e.g. generic type parameters) are not supported. JSON-serialized UDTs are instantiated via reflection through a parameterless constructor (DataType.fromJson). This means a user needs to create a separate UDT for types that differ only in their generic parameters, e.g. one backed by a list of strings and another backed by a list of integers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-3000) drop old blocks to disk in parallel when memory is not large enough for caching new blocks
[ https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reopened SPARK-3000: --- Assignee: Josh Rosen (was: Zhang, Liye) I'm going to re-open this issue and will submit a significantly simplified patch for it. > drop old blocks to disk in parallel when memory is not large enough for > caching new blocks > -- > > Key: SPARK-3000 > URL: https://issues.apache.org/jira/browse/SPARK-3000 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye >Assignee: Josh Rosen > Attachments: Spark-3000 Design Doc.pdf > > > In spark, rdd can be cached in memory for later use, and the cached memory > size is "*spark.executor.memory * spark.storage.memoryFraction*" for spark > version before 1.1.0, and "*spark.executor.memory * > spark.storage.memoryFraction * spark.storage.safetyFraction*" after > [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. > For Storage level *MEMORY_AND_DISK*, when free memory is not enough to cache > new blocks, old blocks might be dropped to disk to free up memory for new > blocks. This operation is processed by _ensureFreeSpace_ in > _MemoryStore.scala_, there will always be a "*accountingLock*" held by the > caller to ensure only one thread is dropping blocks. This method can not > fully used the disks throughput when there are multiple disks on the working > node. When testing our workload, we found this is really a bottleneck when > size of old blocks to be dropped is really large. > We have tested the parallel method on spark 1.0, the speedup is significant. > So it's necessary to make dropping blocks operation in parallel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3000) drop old blocks to disk in parallel when memory is not large enough for caching new blocks
[ https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3000: -- Component/s: Block Manager > drop old blocks to disk in parallel when memory is not large enough for > caching new blocks > -- > > Key: SPARK-3000 > URL: https://issues.apache.org/jira/browse/SPARK-3000 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye > Attachments: Spark-3000 Design Doc.pdf > > > In spark, rdd can be cached in memory for later use, and the cached memory > size is "*spark.executor.memory * spark.storage.memoryFraction*" for spark > version before 1.1.0, and "*spark.executor.memory * > spark.storage.memoryFraction * spark.storage.safetyFraction*" after > [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. > For Storage level *MEMORY_AND_DISK*, when free memory is not enough to cache > new blocks, old blocks might be dropped to disk to free up memory for new > blocks. This operation is processed by _ensureFreeSpace_ in > _MemoryStore.scala_, there will always be a "*accountingLock*" held by the > caller to ensure only one thread is dropping blocks. This method can not > fully used the disks throughput when there are multiple disks on the working > node. When testing our workload, we found this is really a bottleneck when > size of old blocks to be dropped is really large. > We have tested the parallel method on spark 1.0, the speedup is significant. > So it's necessary to make dropping blocks operation in parallel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14053) Merge absTol and relTol into one in MLlib tests
[ https://issues.apache.org/jira/browse/SPARK-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205514#comment-15205514 ] DB Tsai edited comment on SPARK-14053 at 3/22/16 1:13 AM: -- This makes sense for me. We just need to document it properly. Also, the current code for comparing double is symmetric. We can do If abs(y) > eps / t && abs(x) > eps / t test abs(y - x) < t * math.min(absX, absY) else test abs(y - x) < eps ``` was (Author: dbtsai): This makes sense for me. We just need to document it properly. Also, the current code for comparing double is symmetric. We can do ```If (abs(y) > eps / t && abs(x) > eps / t) test abs(y - x) < t * math.min(absX, absY) else test abs(y - x) < eps ``` > Merge absTol and relTol into one in MLlib tests > --- > > Key: SPARK-14053 > URL: https://issues.apache.org/jira/browse/SPARK-14053 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, Tests >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We have absTol and relTol in MLlib tests to compare values with possible > numerical differences. However, in most cases we should just use relTol. Many > absTol are not used properly. See > https://github.com/apache/spark/search?q=absTol. One corner case relTol > doesn't handle is when the target value is 0. We can make the following > change to relTol to solve the issue. Consider `x ~== y relTol t`. > 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t, > 2. else test abs(y - x) < eps > where eps is a reasonably small value, e.g., 1e-14. Note that the transition > is smooth at abs( y ) = eps / t. > cc [~dbtsai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14053) Merge absTol and relTol into one in MLlib tests
[ https://issues.apache.org/jira/browse/SPARK-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205514#comment-15205514 ] DB Tsai edited comment on SPARK-14053 at 3/22/16 1:14 AM: -- This makes sense for me. We just need to document it properly. Also, the current code for comparing double is symmetric. We can do If abs( y ) > eps / t && abs( x ) > eps / t test abs(y - x) < t * math.min(absX, absY) else test abs(y - x) < eps ``` was (Author: dbtsai): This makes sense for me. We just need to document it properly. Also, the current code for comparing double is symmetric. We can do If abs(y) > eps / t && abs(x) > eps / t test abs(y - x) < t * math.min(absX, absY) else test abs(y - x) < eps ``` > Merge absTol and relTol into one in MLlib tests > --- > > Key: SPARK-14053 > URL: https://issues.apache.org/jira/browse/SPARK-14053 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, Tests >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We have absTol and relTol in MLlib tests to compare values with possible > numerical differences. However, in most cases we should just use relTol. Many > absTol are not used properly. See > https://github.com/apache/spark/search?q=absTol. One corner case relTol > doesn't handle is when the target value is 0. We can make the following > change to relTol to solve the issue. Consider `x ~== y relTol t`. > 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t, > 2. else test abs(y - x) < eps > where eps is a reasonably small value, e.g., 1e-14. Note that the transition > is smooth at abs( y ) = eps / t. > cc [~dbtsai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14053) Merge absTol and relTol into one in MLlib tests
[ https://issues.apache.org/jira/browse/SPARK-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205514#comment-15205514 ] DB Tsai edited comment on SPARK-14053 at 3/22/16 1:13 AM: -- This makes sense for me. We just need to document it properly. Also, the current code for comparing double is symmetric. We can do ```If (abs(y) > eps / t && abs(x) > eps / t) test abs(y - x) < t * math.min(absX, absY) else test abs(y - x) < eps ``` was (Author: dbtsai): This makes sense for me. We just need to document it properly. Also, the current code for comparing double is symmetric. We can do If (abs(y) > eps / t && abs(x) > eps / t) test abs(y - x) < t * math.min(absX, absY) else test abs(y - x) < eps > Merge absTol and relTol into one in MLlib tests > --- > > Key: SPARK-14053 > URL: https://issues.apache.org/jira/browse/SPARK-14053 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, Tests >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We have absTol and relTol in MLlib tests to compare values with possible > numerical differences. However, in most cases we should just use relTol. Many > absTol are not used properly. See > https://github.com/apache/spark/search?q=absTol. One corner case relTol > doesn't handle is when the target value is 0. We can make the following > change to relTol to solve the issue. Consider `x ~== y relTol t`. > 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t, > 2. else test abs(y - x) < eps > where eps is a reasonably small value, e.g., 1e-14. Note that the transition > is smooth at abs( y ) = eps / t. > cc [~dbtsai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14053) Merge absTol and relTol into one in MLlib tests
[ https://issues.apache.org/jira/browse/SPARK-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205514#comment-15205514 ] DB Tsai commented on SPARK-14053: - This makes sense for me. We just need to document it properly. Also, the current code for comparing double is symmetric. We can do If (abs(y) > eps / t && abs(x) > eps / t) test abs(y - x) < t * math.min(absX, absY) else test abs(y - x) < eps > Merge absTol and relTol into one in MLlib tests > --- > > Key: SPARK-14053 > URL: https://issues.apache.org/jira/browse/SPARK-14053 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, Tests >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We have absTol and relTol in MLlib tests to compare values with possible > numerical differences. However, in most cases we should just use relTol. Many > absTol are not used properly. See > https://github.com/apache/spark/search?q=absTol. One corner case relTol > doesn't handle is when the target value is 0. We can make the following > change to relTol to solve the issue. Consider `x ~== y relTol t`. > 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t, > 2. else test abs(y - x) < eps > where eps is a reasonably small value, e.g., 1e-14. Note that the transition > is smooth at abs( y ) = eps / t. > cc [~dbtsai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
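The symmetric comparison DB Tsai sketches can be written out as a small helper (a Python sketch of the proposal, not MLlib's actual `~==` operator): when both magnitudes exceed `eps / t`, compare relatively against the smaller magnitude; otherwise fall back to an absolute check with `eps`.

```python
# Symmetric relative/absolute tolerance check per the comment above.
def approx_equal(x, y, t, eps=1e-14):
    ax, ay = abs(x), abs(y)
    if ay > eps / t and ax > eps / t:
        # Relative branch: symmetric because it divides by the smaller magnitude.
        return abs(y - x) < t * min(ax, ay)
    # Absolute branch handles targets at or near zero, where relTol alone fails.
    return abs(y - x) < eps

print(approx_equal(1.0, 1.0000001, 1e-6))  # relative branch
print(approx_equal(0.0, 1e-15, 1e-6))      # absolute branch near zero
print(approx_equal(1.0, 2.0, 1e-6))        # clearly different values
```

Note the transition between branches is smooth at `abs(y) = eps / t`, matching the property called out in the issue description.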
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205488#comment-15205488 ] Josh Rosen commented on SPARK-6305: --- Hey Sean, did you get very far along with this? I'd like to revisit doing a Log4J 2.x upgrade in Spark 2.0 in order to take advantage of some of the performance improvements in the new Log4J. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars to the > classpath. Since there are shaded jars, this must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
[ https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205446#comment-15205446 ] Jason C Lee commented on SPARK-13802: - I will give it a shot! Working on the PR at the moment. > Fields order in Row(**kwargs) is not consistent with Schema.toInternal method > - > > Key: SPARK-13802 > URL: https://issues.apache.org/jira/browse/SPARK-13802 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Szymon Matejczyk > > When constructing a Row from kwargs, the fields in the underlying tuple are > sorted by name. When the schema reads the row, it does not use the fields in > this order. > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > schema = StructType([ > StructField("id", StringType()), > StructField("first_name", StringType())]) > row = Row(id="39", first_name="Szymon") > schema.toInternal(row) > Out[5]: ('Szymon', '39') > {code} > {code} > df = sqlContext.createDataFrame([row], schema) > df.show(1) > +--+--+ > |id|first_name| > +--+--+ > |Szymon|39| > +--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
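The ordering mismatch reported above can be illustrated without a Spark installation. The sketch below only mimics, in plain Python, the alphabetical sorting that Row(**kwargs) performs; it is not the actual pyspark.sql.Row implementation.

```python
# Sketch (plain Python, not pyspark): Row(**kwargs) sorts field names
# alphabetically, while the schema expects its declared field order.
kwargs = {"id": "39", "first_name": "Szymon"}

row_order = sorted(kwargs)                          # what Row(**kwargs) does
row_values = tuple(kwargs[k] for k in row_order)

schema_order = ["id", "first_name"]                 # order declared in StructType
schema_values = tuple(kwargs[k] for k in schema_order)
```

Because "first_name" sorts before "id", the row's positional tuple is ('Szymon', '39') while the schema expects ('39', 'Szymon'), which is exactly the swap shown in the bug report.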
[jira] [Resolved] (SPARK-13320) Confusing error message for Dataset API when using sum("*")
[ https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-13320. - Resolution: Fixed > Confusing error message for Dataset API when using sum("*") > --- > > Key: SPARK-13320 > URL: https://issues.apache.org/jira/browse/SPARK-13320 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Xiao Li > > {code} > pagecounts4PartitionsDS > .map(line => (line._1, line._3)) > .toDF() > .groupBy($"_1") > .agg(sum("*") as "sumOccurances") > {code} > {code} > org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input > columns _1, _2; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57) > at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213) > {code} > The error is with sum("*"), not the resolution of group by "_1". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-13320) Confusing error message for Dataset API when using sum("*")
[ https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-13320: Assignee: Xiao Li > Confusing error message for Dataset API when using sum("*") > --- > > Key: SPARK-13320 > URL: https://issues.apache.org/jira/browse/SPARK-13320 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Xiao Li > > {code} > pagecounts4PartitionsDS > .map(line => (line._1, line._3)) > .toDF() > .groupBy($"_1") > .agg(sum("*") as "sumOccurances") > {code} > {code} > org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input > columns _1, _2; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57) > at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213) > {code} > The error is with sum("*"), not the resolution of group by "_1". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (SPARK-13990) Automatically pick serializer when caching RDDs
[ https://issues.apache.org/jira/browse/SPARK-13990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13990. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11801 [https://github.com/apache/spark/pull/11801] > Automatically pick serializer when caching RDDs > --- > > Key: SPARK-13990 > URL: https://issues.apache.org/jira/browse/SPARK-13990 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > Building on the SerializerManager infrastructure introduced in SPARK-13926, > we should use RDDs ClassTags to automatically pick serializers when caching > RDDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13822) Follow-ups of DataFrame/Dataset API unification
[ https://issues.apache.org/jira/browse/SPARK-13822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13822. - Resolution: Fixed Fix Version/s: 2.0.0 > Follow-ups of DataFrame/Dataset API unification > --- > > Key: SPARK-13822 > URL: https://issues.apache.org/jira/browse/SPARK-13822 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > Fix For: 2.0.0 > > > This is an umbrella ticket for all follow-up work of DataFrame/Dataset API > unification (SPARK-13244). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13898) Merge DatasetHolder and DataFrameHolder
[ https://issues.apache.org/jira/browse/SPARK-13898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13898. - Resolution: Fixed Fix Version/s: 2.0.0 > Merge DatasetHolder and DataFrameHolder > --- > > Key: SPARK-13898 > URL: https://issues.apache.org/jira/browse/SPARK-13898 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > Not 100% sure yet, but I think maybe they should just be a single class, and > most things in SQLImplicits should probably return Datasets of specific types > instead of DataFrames. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-13587: --- Issue Type: New Feature (was: Improvement) > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for users to add third-party Python packages in > pyspark. > * One way is to use --py-files (suitable for simple dependencies, but not > for complicated dependencies, especially those with transitive dependencies) > * Another way is to install packages manually on each node (time-consuming, and > not easy to switch between environments) > Python now has two different virtualenv implementations: one is the native > virtualenv, the other is through conda. This JIRA aims to bring these two > tools to the distributed environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14050) Add multiple languages support for Stop Words Remover
[ https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14050: Assignee: Apache Spark > Add multiple languages support for Stop Words Remover > - > > Key: SPARK-14050 > URL: https://issues.apache.org/jira/browse/SPARK-14050 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Burak KÖSE >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14050) Add multiple languages support for Stop Words Remover
[ https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205399#comment-15205399 ] Apache Spark commented on SPARK-14050: -- User 'burakkose' has created a pull request for this issue: https://github.com/apache/spark/pull/11871 > Add multiple languages support for Stop Words Remover > - > > Key: SPARK-14050 > URL: https://issues.apache.org/jira/browse/SPARK-14050 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Burak KÖSE > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14050) Add multiple languages support for Stop Words Remover
[ https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14050: Assignee: (was: Apache Spark) > Add multiple languages support for Stop Words Remover > - > > Key: SPARK-14050 > URL: https://issues.apache.org/jira/browse/SPARK-14050 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Burak KÖSE > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13916) For whole stage codegen, measure and add the execution duration as a metric
[ https://issues.apache.org/jira/browse/SPARK-13916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13916. - Resolution: Fixed Assignee: Nong Li Fix Version/s: 2.0.0 > For whole stage codegen, measure and add the execution duration as a metric > --- > > Key: SPARK-13916 > URL: https://issues.apache.org/jira/browse/SPARK-13916 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Nong Li >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14053) Merge absTol and relTol into one in MLlib tests
Xiangrui Meng created SPARK-14053: - Summary: Merge absTol and relTol into one in MLlib tests Key: SPARK-14053 URL: https://issues.apache.org/jira/browse/SPARK-14053 Project: Spark Issue Type: Improvement Components: ML, MLlib, Tests Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We have absTol and relTol in MLlib tests to compare values with possible numerical differences. However, in most cases we should just use relTol. Many absTol are not used properly. See https://github.com/apache/spark/search?q=absTol. One corner case relTol doesn't handle is when the target value is 0. We can make the following change to relTol to solve the issue. Consider `x ~== y relTol t`. 1. If abs(y) > eps / t, test abs(y - x) / abs(y) < t, 2. else test abs(y - x) < eps where eps is a reasonably small value, e.g., 1e-14. Note that the transition is smooth at abs(y) = eps / t. cc [~dbtsai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14053) Merge absTol and relTol into one in MLlib tests
[ https://issues.apache.org/jira/browse/SPARK-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14053: -- Description: We have absTol and relTol in MLlib tests to compare values with possible numerical differences. However, in most cases we should just use relTol. Many absTol are not used properly. See https://github.com/apache/spark/search?q=absTol. One corner case relTol doesn't handle is when the target value is 0. We can make the following change to relTol to solve the issue. Consider `x ~== y relTol t`. 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t, 2. else test abs(y - x) < eps where eps is a reasonably small value, e.g., 1e-14. Note that the transition is smooth at abs( y ) = eps / t. cc [~dbtsai] was: We have absTol and relTol in MLlib tests to compare values with possible numerical differences. However, in most cases we should just use relTol. Many absTol are not used properly. See https://github.com/apache/spark/search?q=absTol. One corner case relTol doesn't handle is when the target value is 0. We can make the following change to relTol to solve the issue. Consider `x ~== y relTol t`. 1. If abs(y) > eps / t, test abs(y - x) / abs(y) < t, 2. else test abs(y - x) < eps where eps is a reasonably small value, e.g., 1e-14. Note that the transition is smooth at abs(y) = eps / t. cc [~dbtsai] > Merge absTol and relTol into one in MLlib tests > --- > > Key: SPARK-14053 > URL: https://issues.apache.org/jira/browse/SPARK-14053 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, Tests >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We have absTol and relTol in MLlib tests to compare values with possible > numerical differences. However, in most cases we should just use relTol. Many > absTol are not used properly. See > https://github.com/apache/spark/search?q=absTol. One corner case relTol > doesn't handle is when the target value is 0. 
We can make the following > change to relTol to solve the issue. Consider `x ~== y relTol t`. > 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t, > 2. else test abs(y - x) < eps > where eps is a reasonably small value, e.g., 1e-14. Note that the transition > is smooth at abs( y ) = eps / t. > cc [~dbtsai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14052) Build BytesToBytesMap in HashedRelation
[ https://issues.apache.org/jira/browse/SPARK-14052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14052: Assignee: Davies Liu (was: Apache Spark) > Build BytesToBytesMap in HashedRelation > --- > > Key: SPARK-14052 > URL: https://issues.apache.org/jira/browse/SPARK-14052 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Currently, for keys that cannot fit in a long, we build a hash map > for UnsafeHashedRelation; it is converted to a BytesToBytesMap after > serialization and deserialization. > We should build a BytesToBytesMap directly for better memory efficiency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14052) Build BytesToBytesMap in HashedRelation
[ https://issues.apache.org/jira/browse/SPARK-14052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14052: Assignee: Apache Spark (was: Davies Liu) > Build BytesToBytesMap in HashedRelation > --- > > Key: SPARK-14052 > URL: https://issues.apache.org/jira/browse/SPARK-14052 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > Currently, for keys that cannot fit in a long, we build a hash map > for UnsafeHashedRelation; it is converted to a BytesToBytesMap after > serialization and deserialization. > We should build a BytesToBytesMap directly for better memory efficiency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14052) Build BytesToBytesMap in HashedRelation
[ https://issues.apache.org/jira/browse/SPARK-14052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205376#comment-15205376 ] Apache Spark commented on SPARK-14052: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11870 > Build BytesToBytesMap in HashedRelation > --- > > Key: SPARK-14052 > URL: https://issues.apache.org/jira/browse/SPARK-14052 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Currently, for keys that cannot fit in a long, we build a hash map > for UnsafeHashedRelation; it is converted to a BytesToBytesMap after > serialization and deserialization. > We should build a BytesToBytesMap directly for better memory efficiency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14016) Support high-precision decimals in vectorized parquet reader
[ https://issues.apache.org/jira/browse/SPARK-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14016: Assignee: (was: Apache Spark) > Support high-precision decimals in vectorized parquet reader > > > Key: SPARK-14016 > URL: https://issues.apache.org/jira/browse/SPARK-14016 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14016) Support high-precision decimals in vectorized parquet reader
[ https://issues.apache.org/jira/browse/SPARK-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14016: Assignee: Apache Spark > Support high-precision decimals in vectorized parquet reader > > > Key: SPARK-14016 > URL: https://issues.apache.org/jira/browse/SPARK-14016 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14016) Support high-precision decimals in vectorized parquet reader
[ https://issues.apache.org/jira/browse/SPARK-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205363#comment-15205363 ] Apache Spark commented on SPARK-14016: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/11869 > Support high-precision decimals in vectorized parquet reader > > > Key: SPARK-14016 > URL: https://issues.apache.org/jira/browse/SPARK-14016 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14051) Implement `Double.NaN==Float.NaN` in `row.equals` for consistency
[ https://issues.apache.org/jira/browse/SPARK-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14051: Assignee: (was: Apache Spark) > Implement `Double.NaN==Float.NaN` in `row.equals` for consistency > - > > Key: SPARK-14051 > URL: https://issues.apache.org/jira/browse/SPARK-14051 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Priority: Minor > > Since SPARK-9079 and SPARK-9145, `NaN = NaN` returns true and works well. The > only exception case is direct comparison between `Row(Float.NaN)` and > `Row(Double.NaN)`. The following is the example: the last expression should > be true for consistency. > {code} > scala> > Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp") > scala> sql("select a,b,a=b from tmp").collect() > res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true]) > scala> val row_a = sql("select a from tmp").collect() > row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN]) > scala> val row_b = sql("select b from tmp").collect() > row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN]) > scala> row_a(0) == row_b(0) > res2: Boolean = true > scala> row_a(1) == row_b(1) > res3: Boolean = false > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14051) Implement `Double.NaN==Float.NaN` in `row.equals` for consistency
[ https://issues.apache.org/jira/browse/SPARK-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205359#comment-15205359 ] Apache Spark commented on SPARK-14051: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11868 > Implement `Double.NaN==Float.NaN` in `row.equals` for consistency > - > > Key: SPARK-14051 > URL: https://issues.apache.org/jira/browse/SPARK-14051 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Priority: Minor > > Since SPARK-9079 and SPARK-9145, `NaN = NaN` returns true and works well. The > only exception case is direct comparison between `Row(Float.NaN)` and > `Row(Double.NaN)`. The following is the example: the last expression should > be true for consistency. > {code} > scala> > Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp") > scala> sql("select a,b,a=b from tmp").collect() > res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true]) > scala> val row_a = sql("select a from tmp").collect() > row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN]) > scala> val row_b = sql("select b from tmp").collect() > row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN]) > scala> row_a(0) == row_b(0) > res2: Boolean = true > scala> row_a(1) == row_b(1) > res3: Boolean = false > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14051) Implement `Double.NaN==Float.NaN` in `row.equals` for consistency
[ https://issues.apache.org/jira/browse/SPARK-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14051: Assignee: Apache Spark > Implement `Double.NaN==Float.NaN` in `row.equals` for consistency > - > > Key: SPARK-14051 > URL: https://issues.apache.org/jira/browse/SPARK-14051 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > Since SPARK-9079 and SPARK-9145, `NaN = NaN` returns true and works well. The > only exception case is direct comparison between `Row(Float.NaN)` and > `Row(Double.NaN)`. The following is the example: the last expression should > be true for consistency. > {code} > scala> > Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp") > scala> sql("select a,b,a=b from tmp").collect() > res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true]) > scala> val row_a = sql("select a from tmp").collect() > row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN]) > scala> val row_b = sql("select b from tmp").collect() > row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN]) > scala> row_a(0) == row_b(0) > res2: Boolean = true > scala> row_a(1) == row_b(1) > res3: Boolean = false > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14052) Build BytesToBytesMap in HashedRelation
Davies Liu created SPARK-14052: -- Summary: Build BytesToBytesMap in HashedRelation Key: SPARK-14052 URL: https://issues.apache.org/jira/browse/SPARK-14052 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Currently, for keys that cannot fit in a long, we build a hash map for UnsafeHashedRelation; it is converted to a BytesToBytesMap after serialization and deserialization. We should build a BytesToBytesMap directly for better memory efficiency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14051) Implement `Double.NaN==Float.NaN` in `row.equals` for consistency
Dongjoon Hyun created SPARK-14051: - Summary: Implement `Double.NaN==Float.NaN` in `row.equals` for consistency Key: SPARK-14051 URL: https://issues.apache.org/jira/browse/SPARK-14051 Project: Spark Issue Type: Bug Components: SQL Reporter: Dongjoon Hyun Priority: Minor Since SPARK-9079 and SPARK-9145, `NaN = NaN` returns true and works well. The only exception case is direct comparison between `Row(Float.NaN)` and `Row(Double.NaN)`. The following is the example: the last expression should be true for consistency. {code} scala> Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp") scala> sql("select a,b,a=b from tmp").collect() res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true]) scala> val row_a = sql("select a from tmp").collect() row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN]) scala> val row_b = sql("select b from tmp").collect() row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN]) scala> row_a(0) == row_b(0) res2: Boolean = true scala> row_a(1) == row_b(1) res3: Boolean = false {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
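The IEEE 754 behavior underlying the report above can be demonstrated outside the JVM. This is a minimal sketch in plain Python (where every float is a double): it mirrors only the NaN != NaN part and Spark's NaN-equals-NaN row semantics, not the Float/Double boxing mismatch that the ticket is actually about.

```python
import math

a = float("nan")
b = float("nan")

# IEEE 754: NaN compares unequal to everything, including itself.
plain_eq = (a == b)  # False

# Spark's Row equality (per SPARK-9079/SPARK-9145) instead treats
# NaN as equal to NaN; a sketch of that semantics for one value:
def spark_like_eq(x, y):
    if math.isnan(x) and math.isnan(y):
        return True
    return x == y
```

The bug in SPARK-14051 is the remaining gap: a Float.NaN and a Double.NaN land in rows as differently-typed boxed values, so the NaN-aware comparison above is never reached and Row(Float.NaN) != Row(Double.NaN).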
[jira] [Commented] (SPARK-14050) Add multiple languages support for Stop Words Remover
[ https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205347#comment-15205347 ] Burak KÖSE commented on SPARK-14050: I am working on this, using nltk's words list. > Add multiple languages support for Stop Words Remover > - > > Key: SPARK-14050 > URL: https://issues.apache.org/jira/browse/SPARK-14050 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Burak KÖSE > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14050) Add multiple languages support for Stop Words Remover
Burak KÖSE created SPARK-14050: -- Summary: Add multiple languages support for Stop Words Remover Key: SPARK-14050 URL: https://issues.apache.org/jira/browse/SPARK-14050 Project: Spark Issue Type: Improvement Components: ML Reporter: Burak KÖSE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
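At its core, multi-language support amounts to a language-keyed lookup of stop-word sets plus case-insensitive filtering. A minimal stand-alone sketch (the tiny word lists are placeholders for real corpora such as NLTK's, and the names are illustrative, not the StopWordsRemover API):

```scala
object MultiLangStopWords {
  // Hypothetical per-language word lists; real ones would come from a
  // resource bundle such as NLTK's stopwords corpus.
  private val stopWords: Map[String, Set[String]] = Map(
    "english" -> Set("the", "a", "an", "of"),
    "turkish" -> Set("ve", "bir", "bu")
  )

  // Case-insensitive removal, mirroring a caseSensitive=false default.
  def remove(language: String, tokens: Seq[String]): Seq[String] = {
    val sw = stopWords.getOrElse(language.toLowerCase, Set.empty)
    tokens.filterNot(t => sw.contains(t.toLowerCase))
  }
}
```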
[jira] [Assigned] (SPARK-13806) SQL round() produces incorrect results for negative values
[ https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-13806: -- Assignee: Davies Liu > SQL round() produces incorrect results for negative values > -- > > Key: SPARK-13806 > URL: https://issues.apache.org/jira/browse/SPARK-13806 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 2.0.0 >Reporter: Mark Hamstra >Assignee: Davies Liu > > Round in catalyst/expressions/mathExpressions.scala appears to be untested > with negative values, and it doesn't handle them correctly. > There are at least two issues here: > First, in the genCode for FloatType and DoubleType with _scale == 0, round() > will not produce the same results as for the BigDecimal.ROUND_HALF_UP > strategy used in all other cases. This is because Math.round is used for > these _scale == 0 cases. For example, Math.round(-3.5) is -3, while > BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. > Even after this bug is fixed with something like... > {code} > if (${ce.value} < 0) { > ${ev.value} = -1 * Math.round(-1 * ${ce.value}); > } else { > ${ev.value} = Math.round(${ce.value}); > } > {code} > ...which will allow an additional test like this to succeed in > MathFunctionsSuite.scala: > {code} > checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow) > {code} > ...there still appears to be a problem on at least the > checkEvalutionWithUnsafeProjection path, where failures like this are > produced: > {code} > Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: > [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
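The half-up discrepancy described above is easy to check in isolation: `Math.round` is defined as `floor(x + 0.5)`, which sends the tie at -3.5 toward positive infinity, while `BigDecimal`'s `ROUND_HALF_UP` rounds ties away from zero. A stand-alone sketch:

```scala
import java.math.{BigDecimal, RoundingMode}

object HalfUpRounding {
  // Math.round computes floor(x + 0.5): ties go toward positive infinity,
  // so -3.5 rounds "up" to -3.
  def mathRound(x: Double): Long = Math.round(x)

  // BigDecimal HALF_UP rounds ties away from zero, so -3.5 goes to -4 --
  // the strategy Round uses for all the non-genCode cases.
  def halfUp(x: Double): Long =
    new BigDecimal(x.toString).setScale(0, RoundingMode.HALF_UP).longValueExact()
}
```

The two agree for positive ties (both send 3.5 to 4) and disagree only for negative ones, which is why the bug went unnoticed without negative-value tests.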
[jira] [Resolved] (SPARK-13019) Replace example code in mllib-statistics.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13019. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11108 [https://github.com/apache/spark/pull/11108] > Replace example code in mllib-statistics.md using include_example > - > > Key: SPARK-13019 > URL: https://issues.apache.org/jira/browse/SPARK-13019 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > Fix For: 2.0.0 > > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example > scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` > and pick code blocks marked "example" and replace code block in > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
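The heart of the proposed tag is marker-based extraction. A minimal Scala sketch of just that step, assuming the `$example on$`/`$example off$` marker convention used in the Spark examples (the Jekyll/Liquid plumbing is omitted):

```scala
object IncludeExample {
  // Picks out the lines between "$example on$" and "$example off$" markers --
  // the slice the include_example tag would highlight in the user guide.
  def extract(source: String): String = {
    val lines = source.linesIterator.toVector
    val start = lines.indexWhere(_.contains("$example on$"))
    val end   = lines.indexWhere(_.contains("$example off$"))
    if (start < 0 || end < start) "" // no complete marker pair: emit nothing
    else lines.slice(start + 1, end).mkString("\n")
  }
}
```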
[jira] [Comment Edited] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete
[ https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205300#comment-15205300 ] Vincent Ohprecio edited comment on SPARK-14031 at 3/21/16 10:40 PM: GC accounts for less than 0.3-1.5% of CPU time. Here is the sampler report for CPU: com.univocity.parsers.common.input.DefaultCharAppender.() ... 64% io.netty.channel.nio.NioEventLoop.select() ... 21% org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect() ... 10% org.apache.spark.sql.execution.datasources.csv.LineCsvWriter.writeRow() ... 4% with a stack trace after detaching VisualVM: https://gist.github.com/bigsnarfdude/9f15fd55da3a6d85582a was (Author: vohprecio): GC accounts for less than 0.3-1.5% of CPU time. Here is the sampler report for CPU: com.univocity.parsers.common.input.DefaultCharAppender.() ... 64% io.netty.channel.nio.NioEventLoop.select() ... 21% org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect() ... 10% org.apache.spark.sql.execution.datasources.csv.LineCsvWriter.writeRow() ... 4% > Dataframe to csv IO, system performance enters high CPU state and write > operation takes 1 hour to complete > -- > > Key: SPARK-14031 > URL: https://issues.apache.org/jira/browse/SPARK-14031 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 > Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 > -1TB and Ubuntu14.04 Vagrant 4 Cores 8g >Reporter: Vincent Ohprecio >Priority: Minor > Attachments: visualVMscreenshot.png > > > Summary > When using spark-assembly-2.0.0/spark-shell trying to write out results of > dataframe to csv, system performance enters high CPU state and write > operation takes 1 hour to complete. > * Affecting: [Stage 5:> (0 + 2) / 21] > * Stage 5 elapsed time 348827227ns > In comparison, tests were conducted using 1.4, 1.5, 1.6 with the same code/data > and Stage 5 csv write times were between 2 - 22 seconds. 
> In addition, Parquet (Stage 3) write tests for 1.4, 1.5, 1.6 and 2.0 were > similar, between 2 - 22 seconds. > Files > 1. Data File is "2008.csv" > 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html > 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb > Observation 1 - Setup > High CPU and 58 minute average completion time > * MACOSX 10.11.2 > * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB > * spark-assembly-2.0.0 > * spark-csv_2.11-1.4 > * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb > Observation 2 - Setup > High CPU; waited over an hour for the csv write but did not let it complete > * Ubuntu14.04 > * 4 cores 8gb > * spark-assembly-2.0.0 > * spark-csv_2.11-1.4 > Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete
[ https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205300#comment-15205300 ] Vincent Ohprecio commented on SPARK-14031: -- GC accounts for less than 0.3-1.5% of CPU time. Here is the hotspot report for CPU: com.univocity.parsers.common.input.DefaultCharAppender.() ... 64% io.netty.channel.nio.NioEventLoop.select() ... 21% org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect() ... 10% org.apache.spark.sql.execution.datasources.csv.LineCsvWriter.writeRow() ... 4% > Dataframe to csv IO, system performance enters high CPU state and write > operation takes 1 hour to complete > -- > > Key: SPARK-14031 > URL: https://issues.apache.org/jira/browse/SPARK-14031 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 > Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 > -1TB and Ubuntu14.04 Vagrant 4 Cores 8g >Reporter: Vincent Ohprecio >Priority: Minor > Attachments: visualVMscreenshot.png > > > Summary > When using spark-assembly-2.0.0/spark-shell trying to write out results of > dataframe to csv, system performance enters high CPU state and write > operation takes 1 hour to complete. > * Affecting: [Stage 5:> (0 + 2) / 21] > * Stage 5 elapsed time 348827227ns > In comparison, tests were conducted using 1.4, 1.5, 1.6 with the same code/data > and Stage 5 csv write times were between 2 - 22 seconds. > In addition, Parquet (Stage 3) write tests for 1.4, 1.5, 1.6 and 2.0 were > similar, between 2 - 22 seconds. > Files > 1. Data File is "2008.csv" > 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html > 3. 
Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb > Observation 1 - Setup > High CPU and 58 minute average completion time > * MACOSX 10.11.2 > * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB > * spark-assembly-2.0.0 > * spark-csv_2.11-1.4 > * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb > Observation 2 - Setup > High CPU; waited over an hour for the csv write but did not let it complete > * Ubuntu14.04 > * 4 cores 8gb > * spark-assembly-2.0.0 > * spark-csv_2.11-1.4 > Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete
[ https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205300#comment-15205300 ] Vincent Ohprecio edited comment on SPARK-14031 at 3/21/16 10:39 PM: GC accounts for less than 0.3-1.5% of CPU time. Here is the sampler report for CPU: com.univocity.parsers.common.input.DefaultCharAppender.() ... 64% io.netty.channel.nio.NioEventLoop.select() ... 21% org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect() ... 10% org.apache.spark.sql.execution.datasources.csv.LineCsvWriter.writeRow() ... 4% was (Author: vohprecio): GC accounts for less than 0.3-1.5% of CPU time. Here is the hotspot report for CPU: com.univocity.parsers.common.input.DefaultCharAppender.() ... 64% io.netty.channel.nio.NioEventLoop.select() ... 21% org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect() ... 10% org.apache.spark.sql.execution.datasources.csv.LineCsvWriter.writeRow() ... 4% > Dataframe to csv IO, system performance enters high CPU state and write > operation takes 1 hour to complete > -- > > Key: SPARK-14031 > URL: https://issues.apache.org/jira/browse/SPARK-14031 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 > Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 > -1TB and Ubuntu14.04 Vagrant 4 Cores 8g >Reporter: Vincent Ohprecio >Priority: Minor > Attachments: visualVMscreenshot.png > > > Summary > When using spark-assembly-2.0.0/spark-shell trying to write out results of > dataframe to csv, system performance enters high CPU state and write > operation takes 1 hour to complete. > * Affecting: [Stage 5:> (0 + 2) / 21] > * Stage 5 elapsed time 348827227ns > In comparison, tests were conducted using 1.4, 1.5, 1.6 with the same code/data > and Stage 5 csv write times were between 2 - 22 seconds. > In addition, Parquet (Stage 3) write tests for 1.4, 1.5, 1.6 and 2.0 were > similar, between 2 - 22 seconds. > Files > 1. Data File is "2008.csv" > 2. 
Data file download http://stat-computing.org/dataexpo/2009/the-data.html > 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb > Observation 1 - Setup > High CPU and 58 minute average completion time > * MACOSX 10.11.2 > * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB > * spark-assembly-2.0.0 > * spark-csv_2.11-1.4 > * Observation 2 - Setup > High CPU; waited over an hour for the csv write but did not let it complete > * Ubuntu14.04 > * 4 cores 8gb > * spark-assembly-2.0.0 > * spark-csv_2.11-1.4 > Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14049) Add functionality in spark history server API to query applications by end time
[ https://issues.apache.org/jira/browse/SPARK-14049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14049: Assignee: (was: Apache Spark) > Add functionality in spark history server API to query applications by end > time > --- > > Key: SPARK-14049 > URL: https://issues.apache.org/jira/browse/SPARK-14049 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Parag Chaudhari > > Currently, the Spark history server provides functionality to query applications > by application start time range based on the minDate and maxDate query > parameters, but it lacks support for querying applications by their end time. In > this JIRA we propose adding optional minEndDate and maxEndDate query > parameters, and filtering based on these parameters, to the Spark > history server. This functionality can be used for the following queries: > 1. Applications finished in the last 'x' minutes > 2. Applications finished before 'y' time > 3. Applications finished between 'x' time and 'y' time > 4. Applications started from 'x' time and finished before 'y' time. > For backward compatibility, we can keep the existing minDate and maxDate query > parameters as they are, and they can continue to support filtering based on the start > time range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14049) Add functionality in spark history server API to query applications by end time
[ https://issues.apache.org/jira/browse/SPARK-14049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14049: Assignee: Apache Spark > Add functionality in spark history server API to query applications by end > time > --- > > Key: SPARK-14049 > URL: https://issues.apache.org/jira/browse/SPARK-14049 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Parag Chaudhari >Assignee: Apache Spark > > Currently, the Spark history server provides functionality to query applications > by application start time range based on the minDate and maxDate query > parameters, but it lacks support for querying applications by their end time. In > this JIRA we propose adding optional minEndDate and maxEndDate query > parameters, and filtering based on these parameters, to the Spark > history server. This functionality can be used for the following queries: > 1. Applications finished in the last 'x' minutes > 2. Applications finished before 'y' time > 3. Applications finished between 'x' time and 'y' time > 4. Applications started from 'x' time and finished before 'y' time. > For backward compatibility, we can keep the existing minDate and maxDate query > parameters as they are, and they can continue to support filtering based on the start > time range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14049) Add functionality in spark history server API to query applications by end time
[ https://issues.apache.org/jira/browse/SPARK-14049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205287#comment-15205287 ] Apache Spark commented on SPARK-14049: -- User 'paragpc' has created a pull request for this issue: https://github.com/apache/spark/pull/11867 > Add functionality in spark history server API to query applications by end > time > --- > > Key: SPARK-14049 > URL: https://issues.apache.org/jira/browse/SPARK-14049 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Parag Chaudhari > > Currently, the Spark history server provides functionality to query applications > by application start time range based on the minDate and maxDate query > parameters, but it lacks support for querying applications by their end time. In > this JIRA we propose adding optional minEndDate and maxEndDate query > parameters, and filtering based on these parameters, to the Spark > history server. This functionality can be used for the following queries: > 1. Applications finished in the last 'x' minutes > 2. Applications finished before 'y' time > 3. Applications finished between 'x' time and 'y' time > 4. Applications started from 'x' time and finished before 'y' time. > For backward compatibility, we can keep the existing minDate and maxDate query > parameters as they are, and they can continue to support filtering based on the start > time range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10433) Gradient boosted trees: increasing input size in 1.4
[ https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10433. --- Resolution: Fixed Fix Version/s: 1.5.0 I'm closing this since it seems to have been fixed in 1.5, but please say if it has occurred again after that. > Gradient boosted trees: increasing input size in 1.4 > > > Key: SPARK-10433 > URL: https://issues.apache.org/jira/browse/SPARK-10433 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1 >Reporter: Sean Owen > Fix For: 1.5.0 > > > (Sorry to say I don't have any leads on a fix, but this was reported by three > different people and I confirmed it at fairly close range, so think it's > legitimate:) > This is probably best explained in the words from the mailing list thread at > http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E > . Matt Forbes says: > {quote} > I am training a boosted trees model on a couple million input samples (with > around 300 features) and am noticing that the input size of each stage is > increasing each iteration. For each new tree, the first step seems to be > building the decision tree metadata, which does a .count() on the input data, > so this is the step I've been using to track the input size changing. Here is > what I'm seeing: > {quote} > {code} > count at DecisionTreeMetadata.scala:111 > 1. Input Size / Records: 726.1 MB / 1295620 > 2. Input Size / Records: 106.9 GB / 64780816 > 3. Input Size / Records: 160.3 GB / 97171224 > 4. Input Size / Records: 214.8 GB / 129680959 > 5. Input Size / Records: 268.5 GB / 162533424 > > Input Size / Records: 1912.6 GB / 1382017686 > > {code} > {quote} > This step goes from taking less than 10s up to 5 minutes by the 15th or so > iteration. I'm not quite sure what could be causing this. 
I am passing a > memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train > {quote} > Johannes Bauer showed me a very similar problem. > Peter Rudenko offers this sketch of a reproduction: > {code} > val boostingStrategy = BoostingStrategy.defaultParams("Classification") > boostingStrategy.setNumIterations(30) > boostingStrategy.setLearningRate(1.0) > boostingStrategy.treeStrategy.setMaxDepth(3) > boostingStrategy.treeStrategy.setMaxBins(128) > boostingStrategy.treeStrategy.setSubsamplingRate(1.0) > boostingStrategy.treeStrategy.setMinInstancesPerNode(1) > boostingStrategy.treeStrategy.setUseNodeIdCache(true) > boostingStrategy.treeStrategy.setCategoricalFeaturesInfo( > > mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, > java.lang.Integer]]) > val model = GradientBoostedTrees.train(instances, boostingStrategy) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14049) Add functionality in spark history server API to query applications by end time
Parag Chaudhari created SPARK-14049: --- Summary: Add functionality in spark history server API to query applications by end time Key: SPARK-14049 URL: https://issues.apache.org/jira/browse/SPARK-14049 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.1, 2.0.0 Reporter: Parag Chaudhari Currently, the Spark history server provides functionality to query applications by application start time range based on the minDate and maxDate query parameters, but it lacks support for querying applications by their end time. In this JIRA we propose adding optional minEndDate and maxEndDate query parameters, and filtering based on these parameters, to the Spark history server. This functionality can be used for the following queries: 1. Applications finished in the last 'x' minutes 2. Applications finished before 'y' time 3. Applications finished between 'x' time and 'y' time 4. Applications started from 'x' time and finished before 'y' time. For backward compatibility, we can keep the existing minDate and maxDate query parameters as they are, and they can continue to support filtering based on the start time range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
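The proposed filter itself is simple interval logic over attempt end times, independent of the existing start-time filter. A stand-alone sketch (the record shape is hypothetical; the history server's real model carries more fields):

```scala
object AppFilters {
  // Hypothetical record shape for an application attempt.
  final case class AppAttempt(id: String, start: java.time.Instant, end: java.time.Instant)

  // Keep attempts whose END time falls in [minEnd, maxEnd] -- the semantics
  // proposed for the minEndDate/maxEndDate query parameters.
  def byEndTime(apps: Seq[AppAttempt],
                minEnd: java.time.Instant,
                maxEnd: java.time.Instant): Seq[AppAttempt] =
    apps.filter(a => !a.end.isBefore(minEnd) && !a.end.isAfter(maxEnd))
}
```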
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205279#comment-15205279 ] Thomas Graves commented on SPARK-1239: -- I do like the idea of broadcast; originally, when I tried it, I hit the issue mentioned in the second bullet point, but as long as we synchronize on the requests so we only broadcast once, we should be OK. It does seem to have some further constraints, though. With a sufficiently large job I don't think it matters, but if we only have a small number of reducers, we broadcast to all executors when only a couple need it. I guess that doesn't hurt much unless the other executors start going to the executors your reducers are on and add more load to them; the impact should be pretty minimal, though. Broadcast also seems to make less sense with dynamic allocation. At least I've seen issues when executors go away: fetches from the missing executor fail, have to be retried, etc., adding time. We recently fixed one issue with this to make it fetch locations again after a certain number of failures. That time should be lower now that we fixed that, but I'll have to run the numbers. I'll do some more analysis/testing of this and see if that really matters. With a sufficient number of threads I don't think a few slow nodes would make much of a difference here; if you have that many slow nodes, the shuffle itself is going to be impacted, which I would see as a larger effect. The slow nodes could just as well affect the broadcast. Hopefully you skip those since it takes longer for them to get a chunk, but it's possible that once a slow node has a chunk or two, more and more executors start going to it for the broadcast data instead of the driver, slowing down more transfers. But it's a good point, and my current method would truly block (for a certain time) rather than just being slow. 
Note that there is a timeout on waiting for the send to happen; when it fires, it closes the connection and the executor retries. You don't have to worry about that with broadcast. I'll do some more analysis with that approach. I wish Netty had some other built-in mechanisms for flow control. > Don't fetch all map output statuses at each reducer during shuffles > --- > > Key: SPARK-1239 > URL: https://issues.apache.org/jira/browse/SPARK-1239 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Patrick Wendell >Assignee: Thomas Graves > > Instead we should modify the way we fetch map output statuses to take both a > mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6362) Broken pipe error when training a RandomForest on a union of two RDDs
[ https://issues.apache.org/jira/browse/SPARK-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-6362. Resolution: Fixed Fix Version/s: 1.3.0 I'm going to close this since it appears to be fixed (based on running it locally just now on master). > Broken pipe error when training a RandomForest on a union of two RDDs > - > > Key: SPARK-6362 > URL: https://issues.apache.org/jira/browse/SPARK-6362 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.2.0 > Environment: Kubuntu 14.04, local driver >Reporter: Pavel Laskov >Priority: Minor > Fix For: 1.3.0 > > > Training a RandomForest classifier on a dataset obtained as a union of two > RDDs throws a broken pipe error: > Traceback (most recent call last): > File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 162, in > manager > code = worker(sock) > File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 64, in > worker > outfile.flush() > IOError: [Errno 32] Broken pipe > Despite an error the job runs to completion. 
> The following code reproduces the error: > from pyspark.context import SparkContext > from pyspark.mllib.rand import RandomRDDs > from pyspark.mllib.tree import RandomForest > from pyspark.mllib.linalg import DenseVector > from pyspark.mllib.regression import LabeledPoint > import random > if __name__ == "__main__": > sc = SparkContext(appName="Union bug test") > data1 = RandomRDDs.normalVectorRDD(sc,numRows=1,numCols=200) > data1 = data1.map(lambda x: LabeledPoint(random.randint(0,1),\ > DenseVector(x))) > data2 = RandomRDDs.normalVectorRDD(sc,numRows=1,numCols=200) > data2 = data2.map(lambda x: LabeledPoint(random.randint(0,1),\ > DenseVector(x))) > training_data = data1.union(data2) > #training_data = training_data.repartition(2) > model = RandomForest.trainClassifier(training_data, numClasses=2, > categoricalFeaturesInfo={}, > numTrees=50, maxDepth=30) > Interestingly, re-partitioning the data after the union operation rectifies > the problem (uncomment the line before training in the code above). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
Simeon Simeonov created SPARK-14048: --- Summary: Aggregation operations on structs fail when the structs have fields with special characters Key: SPARK-14048 URL: https://issues.apache.org/jira/browse/SPARK-14048 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Environment: Databricks w/ 1.6.0 Reporter: Simeon Simeonov Consider a schema where a struct has field names with special characters, e.g., {code} |-- st: struct (nullable = true) ||-- x.y: long (nullable = true) {code} Schemas such as these are frequently generated by the JSON schema generator, which seems never to map JSON data to {{MapType}}, always preferring {{StructType}}. In SparkSQL, referring to these fields requires backticks, e.g., {{st.`x.y`}}. There is no problem manipulating these structs unless one is using an aggregation function. It seems that, under the covers, the code is not escaping fields with special characters correctly. For example, {code} select first(st) as st from tbl group by something {code} generates {code} org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: struct. If you have a struct and a field name of it has any special characters, please use backticks (`) to quote that field name, e.g. `x+y`. Please note that backtick itself is not supported in a field name. 
at org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) at org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) at org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) at org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) at com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) at com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) at com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) at com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) at scala.util.Try$.apply(Try.scala:161) at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) at 
com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
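A fix along the lines the error message suggests is to backtick-quote any field name containing special characters when rendering a struct's type string, so the string can be re-parsed. A hedged sketch (the quoting rule here is illustrative, not Spark's actual parser):

```scala
object FieldQuoting {
  // Wrap a struct field name in backticks when it contains characters that
  // would otherwise break the data-type parser. Backticks themselves cannot
  // appear in a field name, matching the error message above.
  def quote(name: String): String =
    if (name.forall(c => c.isLetterOrDigit || c == '_')) name
    else "`" + name + "`"

  // Render a Hive-style type string like struct<`x.y`:bigint,z:string>.
  def structTypeString(fields: Seq[(String, String)]): String =
    fields.map { case (n, t) => s"${quote(n)}:$t" }.mkString("struct<", ",", ">")
}
```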
[jira] [Updated] (SPARK-4607) Add random seed to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4607: - Affects Version/s: (was: 1.2.0) > Add random seed to GradientBoostedTrees > --- > > Key: SPARK-4607 > URL: https://issues.apache.org/jira/browse/SPARK-4607 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Gradient Boosted Trees does not take a random seed, but it uses randomness if > the subsampling rate is < 1. It should take a random seed parameter. > This update will also help to make unit tests more stable by allowing > determinism (using a small set of fixed random seeds).
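The determinism argument in the ticket can be sketched in plain Java: Bernoulli subsampling driven by a seeded RNG is reproducible, which is what makes seed-parameterized unit tests stable. The names below are illustrative only, not Spark's GBT API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SeededSubsample {
    // Bernoulli row subsampling, as a boosting iteration might do when the
    // subsampling rate is < 1. A fixed seed makes the picked indices reproducible.
    static List<Integer> subsample(int n, double rate, long seed) {
        Random rng = new Random(seed);
        List<Integer> picked = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() < rate) {
                picked.add(i);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<Integer> a = subsample(1000, 0.5, 42L);
        List<Integer> b = subsample(1000, 0.5, 42L);
        List<Integer> c = subsample(1000, 0.5, 43L);
        System.out.println(a.equals(b)); // same seed: identical subsample
        System.out.println(a.equals(c)); // different seed: almost surely different
    }
}
```

Without an exposed seed, every training run draws a different subsample and test assertions on the fitted model can flake; with one, a small fixed set of seeds covers the randomness deterministically.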
[jira] [Updated] (SPARK-4607) Add random seed to GBTClassifier, GBTRegressor
[ https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4607: - Summary: Add random seed to GBTClassifier, GBTRegressor (was: Add random seed to GradientBoostedTrees) > Add random seed to GBTClassifier, GBTRegressor > -- > > Key: SPARK-4607 > URL: https://issues.apache.org/jira/browse/SPARK-4607 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Gradient Boosted Trees does not take a random seed, but it uses randomness if > the subsampling rate is < 1. It should take a random seed parameter. > This update will also help to make unit tests more stable by allowing > determinism (using a small set of fixed random seeds).
[jira] [Updated] (SPARK-4607) Add random seed to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4607: - Component/s: (was: MLlib) ML > Add random seed to GradientBoostedTrees > --- > > Key: SPARK-4607 > URL: https://issues.apache.org/jira/browse/SPARK-4607 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Gradient Boosted Trees does not take a random seed, but it uses randomness if > the subsampling rate is < 1. It should take a random seed parameter. > This update will also help to make unit tests more stable by allowing > determinism (using a small set of fixed random seeds).
[jira] [Updated] (SPARK-4607) Add random seed to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4607: - Target Version/s: 2.0.0 > Add random seed to GradientBoostedTrees > --- > > Key: SPARK-4607 > URL: https://issues.apache.org/jira/browse/SPARK-4607 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Gradient Boosted Trees does not take a random seed, but it uses randomness if > the subsampling rate is < 1. It should take a random seed parameter. > This update will also help to make unit tests more stable by allowing > determinism (using a small set of fixed random seeds).
[jira] [Created] (SPARK-14047) GBT improvement umbrella
Joseph K. Bradley created SPARK-14047: - Summary: GBT improvement umbrella Key: SPARK-14047 URL: https://issues.apache.org/jira/browse/SPARK-14047 Project: Spark Issue Type: Umbrella Components: ML Reporter: Joseph K. Bradley This is an umbrella for improvements to learning Gradient Boosted Trees: GBTClassifier, GBTRegressor. Note: Aspects of GBTs which are related to individual trees should be listed under [SPARK-14045].
[jira] [Created] (SPARK-14046) RandomForest improvement umbrella
Joseph K. Bradley created SPARK-14046: - Summary: RandomForest improvement umbrella Key: SPARK-14046 URL: https://issues.apache.org/jira/browse/SPARK-14046 Project: Spark Issue Type: Umbrella Components: ML Reporter: Joseph K. Bradley This is an umbrella for improvements to learning Random Forests. Note: Aspects of RFs which are related to individual trees should be listed under [SPARK-14045].
[jira] [Updated] (SPARK-14046) RandomForest improvement umbrella
[ https://issues.apache.org/jira/browse/SPARK-14046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14046: -- Description: This is an umbrella for improvements to learning Random Forests: RandomForestClassifier, RandomForestRegressor. Note: Aspects of RFs which are related to individual trees should be listed under [SPARK-14045]. was: This is an umbrella for improvements to learning Random Forests. Note: Aspects of RFs which are related to individual trees should be listed under [SPARK-14045]. > RandomForest improvement umbrella > - > > Key: SPARK-14046 > URL: https://issues.apache.org/jira/browse/SPARK-14046 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Joseph K. Bradley > > This is an umbrella for improvements to learning Random Forests: > RandomForestClassifier, RandomForestRegressor. > Note: Aspects of RFs which are related to individual trees should be listed > under [SPARK-14045].
[jira] [Created] (SPARK-14045) DecisionTree improvement umbrella
Joseph K. Bradley created SPARK-14045: - Summary: DecisionTree improvement umbrella Key: SPARK-14045 URL: https://issues.apache.org/jira/browse/SPARK-14045 Project: Spark Issue Type: Umbrella Components: ML Reporter: Joseph K. Bradley This is an umbrella for improvements to decision tree learning. This includes: * DecisionTreeClassifier * DecisionTreeRegressor * aspects of tree ensembles specific to learning individual trees, i.e., issues which will also affect DecisionTreeClassifier/Regressor
[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205243#comment-15205243 ] Joseph K. Bradley commented on SPARK-3159: -- Sorry for the slow reply. There are several like that. I'll try to check through them and link them under an umbrella, to help drive a bit more attention to them. > Check for reducible DecisionTree > > > Key: SPARK-3159 > URL: https://issues.apache.org/jira/browse/SPARK-3159 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Improvement: test-time computation > Currently, pairs of leaf nodes with the same parent can both output the same > prediction. This happens since the splitting criterion (e.g., Gini) is not > the same as prediction accuracy/MSE; the splitting criterion can sometimes be > improved even when both children would still output the same prediction > (e.g., based on the majority label for classification). > We could check the tree and reduce it if possible after training. > Note: This happens with scikit-learn as well.
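The post-training reduction the ticket proposes can be sketched on a toy tree: a post-order pass that collapses any split whose two children are leaves with the same prediction. This is a minimal illustration, not Spark's tree representation.

```java
public class TreePruner {
    // Toy binary tree node; illustrative, not Spark's Node class.
    static class Node {
        Node left, right;   // both null for a leaf
        double prediction;  // meaningful only at a leaf
        Node(double p) { prediction = p; }
        Node(Node l, Node r) { left = l; right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Post-order reduction: if both children are leaves with the same
    // prediction, the split is redundant at test time, so collapse it.
    static Node prune(Node n) {
        if (n.isLeaf()) return n;
        n.left = prune(n.left);
        n.right = prune(n.right);
        if (n.left.isLeaf() && n.right.isLeaf()
                && n.left.prediction == n.right.prediction) {
            return new Node(n.left.prediction);
        }
        return n;
    }

    static int countLeaves(Node n) {
        return n.isLeaf() ? 1 : countLeaves(n.left) + countLeaves(n.right);
    }

    public static void main(String[] args) {
        // A split whose children both predict 1.0: it may improve Gini
        // during training, but it changes nothing at prediction time.
        Node redundant = new Node(new Node(1.0), new Node(1.0));
        Node root = new Node(redundant, new Node(0.0));
        System.out.println(countLeaves(root));        // 3 before pruning
        System.out.println(countLeaves(prune(root))); // 2 after pruning
    }
}
```

Because the pass is post-order, a collapse can cascade: once two sibling leaves merge, their parent may itself become a collapsible pair with its new sibling.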
[jira] [Commented] (SPARK-11507) Error thrown when using BlockMatrix.add
[ https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205242#comment-15205242 ] Joseph K. Bradley commented on SPARK-11507: --- Good to hear! I am wondering though if it was a mistake to close your original PR (since the Breeze fix won't be put into Spark that quickly). What do you think about re-opening your PR to get the bug fix into 2.0 and a few backports? > Error thrown when using BlockMatrix.add > --- > > Key: SPARK-11507 > URL: https://issues.apache.org/jira/browse/SPARK-11507 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1, 1.5.0 > Environment: Mac/local machine, EC2 > Scala >Reporter: Kareem Alhazred >Priority: Minor > > In certain situations when adding two block matrices, I get an error > regarding colPtr and the operation fails. External issue URL includes full > error and code for reproducing the problem.
[jira] [Updated] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step
[ https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Tiernay updated SPARK-14044: Description: It would be very useful to allow the disabling of this block of code within {{DynamicPartitionWriterContainer#writeRows}} at runtime: https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 The use case is that an upstream {{groupBy}} has already sorted a great many fine grained groups which are the target of the {{partitionBy}}. This {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't even get Spark to succeed due to the sort step and data skew in the partitions. In general, this would make more efficient use of cluster resources. For this to work, there needs to be a way to communicate between the {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. This is very similar in function to Hive's {{hive.optimize.sort.dynamic.partition}} parameter. was: It would be very useful to allow the disabling of this block of code within {{DynamicPartitionWriterContainer#writeRows}} at runtime: https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 The use case is that an upstream {{groupBy}} has already sorted a great many fine grained groups which are the target of the {{partitionBy}}. This {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't even get Spark to succeed due to the sort step and data skew in the partitions. In general, this would make more efficient use of cluster resources. For this to work, there needs to be a way to communicate between the {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. 
This is very similar in function to Hive's {{hive.enforce.bucketing}} parameter. > Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass > sort step > > > Key: SPARK-14044 > URL: https://issues.apache.org/jira/browse/SPARK-14044 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Bob Tiernay > > It would be very useful to allow the disabling of this block of code within > {{DynamicPartitionWriterContainer#writeRows}} at runtime: > https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 > The use case is that an upstream {{groupBy}} has already sorted a great many > fine grained groups which are the target of the {{partitionBy}}. This > {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't > even get Spark to succeed due to the sort step and data skew in the > partitions. In general, this would make more efficient use of cluster > resources. > For this to work, there needs to be a way to communicate between the > {{groupBy}} and the > {{partitionBy}} by way of some runtime configuration. > This is very similar in function to Hive's > {{hive.optimize.sort.dynamic.partition}} parameter.
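The rationale for the sort step being discussed here (and for bypassing it on pre-clustered input) can be shown with a toy model: writing rows in key order lets the writer keep a single output file open, closing one partition's writer before opening the next, while unsorted input forces a writer open on every key change. This sketch is illustrative only and does not use Spark's writer classes.

```java
import java.util.Arrays;
import java.util.List;

public class DynamicPartitionWrite {
    // Counts how many per-partition output files would be opened when rows
    // are written in arrival order while keeping only one writer open at a
    // time -- the single-writer strategy a sort-by-partition-key step enables.
    static int writerOpens(List<String> partitionKeys) {
        int opens = 0;
        String current = null;
        for (String key : partitionKeys) {
            if (!key.equals(current)) { // close current writer, open a new one
                opens++;
                current = key;
            }
        }
        return opens;
    }

    public static void main(String[] args) {
        // Input already clustered by key, e.g. downstream of a groupBy on the
        // same keys: the sort adds no value and could be skipped.
        List<String> clustered = Arrays.asList("a", "a", "b", "b", "c");
        // Interleaved input: without a sort, writers churn on every key change.
        List<String> interleaved = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println(writerOpens(clustered));   // 3: one open per partition
        System.out.println(writerOpens(interleaved)); // 5: one open per key change
    }
}
```

The cost the reporter hits is the sort itself under data skew; when the upstream stage already guarantees clustering by the partition keys, a (hypothetical) configuration flag to skip the sort would preserve the single-open-writer property without paying for it twice.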
[jira] [Updated] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step
[ https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Tiernay updated SPARK-14044: Description: It would be very useful to allow the disabling of this block of code within {{DynamicPartitionWriterContainer#writeRows}} at runtime: https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 The use case is that an upstream {{groupBy}} has already sorted a great many fine grained groups which are the target of the {{partitionBy}}. This {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't even get Spark to succeed due to the sort step and data skew in the partitions. In general, this would make more efficient use of cluster resources. For this to work, there needs to be a way to communicate between the {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. This is very similar in function to Hive's {{hive.enforce.bucketing}} parameter. was: It would be very useful to allow the disabling of this block of code within {{DynamicPartitionWriterContainer#writeRows}} at runtime: https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 The use case is that an upstream {{groupBy}} has already sorted a great many fine grained groups which are the target of the {{partitionBy}}. This {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't even get Spark to succeed due to the sort step and data skew in the partitions. In general, this would make more efficient use of cluster resources. For this to work, there needs to be a way to communicate between the {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. 
> Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass > sort step > > > Key: SPARK-14044 > URL: https://issues.apache.org/jira/browse/SPARK-14044 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Bob Tiernay > > It would be very useful to allow the disabling of this block of code within > {{DynamicPartitionWriterContainer#writeRows}} at runtime: > https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 > The use case is that an upstream {{groupBy}} has already sorted a great many > fine grained groups which are the target of the {{partitionBy}}. This > {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't > even get Spark to succeed due to the sort step and data skew in the > partitions. In general, this would make more efficient use of cluster > resources. > For this to work, there needs to be a way to communicate between the > {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. > This is very similar in function to Hive's {{hive.enforce.bucketing}} > parameter.
[jira] [Resolved] (SPARK-13805) Direct consume ColumnVector in generated code when ColumnarBatch is used
[ https://issues.apache.org/jira/browse/SPARK-13805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13805. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11636 [https://github.com/apache/spark/pull/11636] > Direct consume ColumnVector in generated code when ColumnarBatch is used > > > Key: SPARK-13805 > URL: https://issues.apache.org/jira/browse/SPARK-13805 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > Fix For: 2.0.0 > > > When generated code accesses a {{ColumnarBatch}} object, it is possible to > get values of each column from {{ColumnVector}} instead of calling > {{getRow()}}.
[jira] [Updated] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step
[ https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Tiernay updated SPARK-14044: Description: It would be very useful to allow the disabling of this block of code within {{DynamicPartitionWriterContainer#writeRows}} at runtime: https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 The use case is that an upstream {{groupBy}} has already sorted a great many fine grained groups which are the target of the {{partitionBy}}. This {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't even get Spark to succeed due to the sort step and data skew in the partitions. In general, this would make more efficient use of cluster resources. For this to work, there needs to be a way to communicate between the {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. was: It would be very useful to allow the disabling of this block of code within {{DynamicPartitionWriterContainer#writeRows}} at runtime: https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 The use case is that an upstream {{groupBy}} has already sorted a great many fine grained groups which are the target of the {{partitionBy}}. Currently, we can't even get Spark to succeed due to the sort step and data skew in the partitions. In general, this would make more efficient use of cluster resources. For this to work, there needs to be a way to communicate between the {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. 
> Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass > sort step > > > Key: SPARK-14044 > URL: https://issues.apache.org/jira/browse/SPARK-14044 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Bob Tiernay > > It would be very useful to allow the disabling of this block of code within > {{DynamicPartitionWriterContainer#writeRows}} at runtime: > https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 > The use case is that an upstream {{groupBy}} has already sorted a great many > fine grained groups which are the target of the {{partitionBy}}. This > {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't > even get Spark to succeed due to the sort step and data skew in the > partitions. In general, this would make more efficient use of cluster > resources. > For this to work, there needs to be a way to communicate between the > {{groupBy}} and the > {{partitionBy}} by way of some runtime configuration.
[jira] [Updated] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step
[ https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Tiernay updated SPARK-14044: Description: It would be very useful to allow the disabling of this block of code within {{DynamicPartitionWriterContainer#writeRows}} at runtime: https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 The use case is that an upstream {{groupBy}} has already sorted a great many fine grained groups which are the target of the {{partitionBy}}. Currently, we can't even get Spark to succeed due to the sort step and data skew in the partitions. In general, this would make more efficient use of cluster resources. For this to work, there needs to be a way to communicate between the {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. was: It would be very useful to allow the disabling of this block of code within {{DynamicPartitionWriterContainer}}: https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 The use case is that an upstream {{groupBy}} has already sorted a great many fine grained groups which are the target of the {{partitionBy}}. For this to work, there needs to be a way to communicate between the {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. 
> Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass > sort step > > > Key: SPARK-14044 > URL: https://issues.apache.org/jira/browse/SPARK-14044 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Bob Tiernay > > It would be very useful to allow the disabling of this block of code within > {{DynamicPartitionWriterContainer#writeRows}} at runtime: > https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 > The use case is that an upstream {{groupBy}} has already sorted a great many > fine grained groups which are the target of the {{partitionBy}}. Currently, > we can't even get Spark to succeed due to the sort step and data skew in the > partitions. In general, this would make more efficient use of cluster > resources. > For this to work, there needs to be a way to communicate between the > {{groupBy}} and the {{partitionBy}} by way of some runtime configuration.
[jira] [Created] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer to bypass sort step
Bob Tiernay created SPARK-14044: --- Summary: Allow configuration of DynamicPartitionWriterContainer to bypass sort step Key: SPARK-14044 URL: https://issues.apache.org/jira/browse/SPARK-14044 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.1 Reporter: Bob Tiernay It would be very useful to allow the disabling of this block of code within {{DynamicPartitionWriterContainer}}: https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 The use case is that an upstream {{groupBy}} has already sorted a great many fine grained groups which are the target of the {{partitionBy}}. For this to work, there needs to be a way to communicate between the {{groupBy}} and the {{partitionBy}} by way of some runtime configuration.
[jira] [Updated] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step
[ https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Tiernay updated SPARK-14044: Summary: Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step (was: Allow configuration of DynamicPartitionWriterContainer to bypass sort step) > Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass > sort step > > > Key: SPARK-14044 > URL: https://issues.apache.org/jira/browse/SPARK-14044 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Bob Tiernay > > It would be very useful to allow the disabling of this block of code within > {{DynamicPartitionWriterContainer}}: > https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418 > The use case is that an upstream {{groupBy}} has already sorted a great many > fine grained groups which are the target of the {{partitionBy}}. For this to > work, there needs to be a way to communicate between the {{groupBy}} and the > {{partitionBy}} by way of some runtime configuration.
[jira] [Commented] (SPARK-14023) Make exceptions consistent regarding fields and columns
[ https://issues.apache.org/jira/browse/SPARK-14023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205185#comment-15205185 ] Jacek Laskowski commented on SPARK-14023: - If [~josephkb] or [~srowen] could help me how and where to get started with this, I could look into it and offer a pull req. I'd appreciate any help. Thanks! > Make exceptions consistent regarding fields and columns > --- > > Key: SPARK-14023 > URL: https://issues.apache.org/jira/browse/SPARK-14023 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As you can see below, a column is called a field depending on where an > exception is thrown. I think it should be "column" everywhere (since that's > what has a type from a schema). > {code} > scala> lr > res32: org.apache.spark.ml.regression.LinearRegression = linReg_d9bfe808e743 > scala> lr.fit(ds) > java.lang.IllegalArgumentException: Field "features" does not exist. > at > org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:214) > at > org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:214) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:59) > at org.apache.spark.sql.types.StructType.apply(StructType.scala:213) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40) > at > org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50) > at > org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71) > at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:89) > ... 
51 elided > scala> lr.fit(ds) > java.lang.IllegalArgumentException: requirement failed: Column label must be > of type DoubleType but was actually StringType. > at scala.Predef$.require(Predef.scala:219) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53) > at > org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71) > at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:89) > ... 51 elided > {code}
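The unification the reporter asks for amounts to making both failure paths in the schema check speak of a "Column". A minimal sketch of such a check in plain Java follows; the helper name, schema representation, and messages are hypothetical, not Spark's SchemaUtils API.

```java
import java.util.Map;

public class SchemaCheck {
    // Hypothetical unified check: both the missing-column and wrong-type
    // errors use "Column" wording, unlike the two messages quoted above.
    static void checkColumnType(Map<String, String> schema,
                                String col, String expected) {
        String actual = schema.get(col);
        if (actual == null) {
            throw new IllegalArgumentException(
                "Column \"" + col + "\" does not exist.");
        }
        if (!actual.equals(expected)) {
            throw new IllegalArgumentException("Column " + col
                + " must be of type " + expected
                + " but was actually " + actual + ".");
        }
    }

    public static void main(String[] args) {
        Map<String, String> schema = Map.of("label", "StringType");
        for (String[] check : new String[][] {
                {"features", "VectorUDT"}, {"label", "DoubleType"}}) {
            try {
                checkColumnType(schema, check[0], check[1]);
            } catch (IllegalArgumentException e) {
                System.out.println(e.getMessage()); // both now start with "Column"
            }
        }
    }
}
```

Routing both branches through one helper also guarantees the two messages cannot drift apart again.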
[jira] [Commented] (SPARK-13806) SQL round() produces incorrect results for negative values
[ https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205134#comment-15205134 ] Mark Hamstra commented on SPARK-13806: -- Yes, there is the mostly orthogonal question about which rounding strategy should be used -- see the comments in SPARK-8279. But, assuming that we are adopting the ROUND_HALF_UP strategy, there is the problem with negative values that this JIRA points out: When using ROUND_HALF_UP and scale == 0, -x.5 must round to -(x+1), but Math.round will round it to -x. In addition to this, the code gen for rounding of negative floating point values with negative scales is broken. All of this stems from Spark SQL's implementation of round() being untested with negative values. > SQL round() produces incorrect results for negative values > -- > > Key: SPARK-13806 > URL: https://issues.apache.org/jira/browse/SPARK-13806 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 2.0.0 >Reporter: Mark Hamstra > > Round in catalyst/expressions/mathExpressions.scala appears to be untested > with negative values, and it doesn't handle them correctly. > There are at least two issues here: > First, in the genCode for FloatType and DoubleType with _scale == 0, round() > will not produce the same results as for the BigDecimal.ROUND_HALF_UP > strategy used in all other cases. This is because Math.round is used for > these _scale == 0 cases. For example, Math.round(-3.5) is -3, while > BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. > Even after this bug is fixed with something like... 
> {code} > if (${ce.value} < 0) { > ${ev.value} = -1 * Math.round(-1 * ${ce.value}); > } else { > ${ev.value} = Math.round(${ce.value}); > } > {code} > ...which will allow an additional test like this to succeed in > MathFunctionsSuite.scala: > {code} > checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow) > {code} > ...there still appears to be a problem on at least the > checkEvalutionWithUnsafeProjection path, where failures like this are > produced: > {code} > Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: > [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145) > {code}
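The half-up discrepancy the report describes, and the sign-flip workaround quoted in it, can be checked directly in plain Java:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class RoundCheck {
    public static void main(String[] args) {
        // Math.round rounds half toward positive infinity, so -3.5 -> -3.
        System.out.println(Math.round(-3.5)); // -3
        // HALF_UP rounds half away from zero, the strategy the other code
        // paths in Round use via BigDecimal.ROUND_HALF_UP: -3.5 -> -4.
        System.out.println(
            BigDecimal.valueOf(-3.5).setScale(0, RoundingMode.HALF_UP)); // -4
        // The sign-flip fix makes Math.round agree with HALF_UP for negatives:
        double x = -3.5;
        long fixed = x < 0 ? -Math.round(-x) : Math.round(x);
        System.out.println(fixed); // -4
    }
}
```

This confirms the first issue is purely a semantics mismatch between `Math.round` and `ROUND_HALF_UP` at the half-way points of negative values; the unsafe-projection failure with negative scales is a separate code-gen bug.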
[jira] [Commented] (SPARK-13951) PySpark ml.pipeline support export/import - nested Pipelines
[ https://issues.apache.org/jira/browse/SPARK-13951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205112#comment-15205112 ] Apache Spark commented on SPARK-13951: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/11866 > PySpark ml.pipeline support export/import - nested Pipelines > --- > > Key: SPARK-13951 > URL: https://issues.apache.org/jira/browse/SPARK-13951 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >
[jira] [Commented] (SPARK-13806) SQL round() produces incorrect results for negative values
[ https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205103#comment-15205103 ] Davies Liu commented on SPARK-13806: This is because round() in Java/Scala has different semantics than databases; we should first figure out what the right behavior is. cc [~rxin] > SQL round() produces incorrect results for negative values > -- > > Key: SPARK-13806 > URL: https://issues.apache.org/jira/browse/SPARK-13806 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 2.0.0 >Reporter: Mark Hamstra > > Round in catalyst/expressions/mathExpressions.scala appears to be untested > with negative values, and it doesn't handle them correctly. > There are at least two issues here: > First, in the genCode for FloatType and DoubleType with _scale == 0, round() > will not produce the same results as for the BigDecimal.ROUND_HALF_UP > strategy used in all other cases. This is because Math.round is used for > these _scale == 0 cases. For example, Math.round(-3.5) is -3, while > BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. > Even after this bug is fixed with something like... > {code} > if (${ce.value} < 0) { > ${ev.value} = -1 * Math.round(-1 * ${ce.value}); > } else { > ${ev.value} = Math.round(${ce.value}); > } > {code} > ...which will allow an additional test like this to succeed in > MathFunctionsSuite.scala: > {code} > checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow) > {code} > ...there still appears to be a problem on at least the > checkEvaluationWithUnsafeProjection path, where failures like this are > produced: > {code} > Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: > [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
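The failing case quoted above uses a negative scale, round(-3.141592653589793, -6), where rounding happens at the millions place. A small Java sketch of the HALF_UP reference behavior for negative scales (class name `NegativeScaleRound` is made up for illustration; this is the BigDecimal semantics the SQL path is expected to match, not Spark's implementation):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class NegativeScaleRound {
    public static void main(String[] args) {
        // Scale -6 rounds to the nearest multiple of 10^6 (the nearest million).
        // -3.14... is far from any nonzero multiple of a million, so it rounds to 0:
        BigDecimal a = new BigDecimal("-3.141592653589793").setScale(-6, RoundingMode.HALF_UP);
        System.out.println(a.compareTo(BigDecimal.ZERO) == 0);  // true

        // A non-trivial case: 12,345,678 at scale -6 rounds to 12,000,000:
        BigDecimal b = new BigDecimal("12345678").setScale(-6, RoundingMode.HALF_UP);
        System.out.println(b.longValue());  // 12000000

        // A negative tie: -2,500,000 rounds away from zero to -3,000,000:
        BigDecimal c = new BigDecimal("-2500000").setScale(-6, RoundingMode.HALF_UP);
        System.out.println(c.longValue());  // -3000000
    }
}
```

Under HALF_UP these results are well-defined for negative values and negative scales alike, which is the behavior most databases document for round(); the open question in this comment is whether Spark SQL should follow that convention or Java's.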
[jira] [Commented] (SPARK-13938) word2phrase feature created in ML
[ https://issues.apache.org/jira/browse/SPARK-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205095#comment-15205095 ] Steve Weng commented on SPARK-13938: I looked it over already, but was hoping you had more details. > word2phrase feature created in ML > - > > Key: SPARK-13938 > URL: https://issues.apache.org/jira/browse/SPARK-13938 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Steve Weng >Priority: Critical > Original Estimate: 840h > Remaining Estimate: 840h > > I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf) which > transforms a sentence of words into one where certain individual consecutive > words are concatenated by using a training model/estimator (e.g. "I went to > New York" becomes "I went to new_york"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
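The "I went to New York" → "I went to new_york" transform described in the issue can be illustrated with a toy merger. This Java sketch (`PhraseMerger` and its learned-bigram set are hypothetical stand-ins for the trained model/estimator, not the proposed Spark API) shows only the transform step, not the training that scores which bigrams become phrases:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PhraseMerger {
    // Merge consecutive tokens that form a learned phrase, word2phrase-style.
    // 'bigrams' stands in for the output of the training step described in
    // the paper (http://arxiv.org/pdf/1310.4546.pdf).
    public static List<String> merge(List<String> tokens, Set<String> bigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String joined = i + 1 < tokens.size()
                    ? tokens.get(i).toLowerCase() + "_" + tokens.get(i + 1).toLowerCase()
                    : null;
            if (joined != null && bigrams.contains(joined)) {
                out.add(joined);
                i++;  // skip the second word of the merged pair
            } else {
                out.add(tokens.get(i));
            }
        }
        return out;
    }
}
```

With tokens ["I", "went", "to", "New", "York"] and the learned bigram "new_york", this yields ["I", "went", "to", "new_york"], matching the example in the issue description.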
[jira] [Commented] (SPARK-13938) word2phrase feature created in ML
[ https://issues.apache.org/jira/browse/SPARK-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205085#comment-15205085 ] Sean Owen commented on SPARK-13938: --- Have a look at the link I posted, in particular https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines > word2phrase feature created in ML > - > > Key: SPARK-13938 > URL: https://issues.apache.org/jira/browse/SPARK-13938 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Steve Weng >Priority: Critical > Original Estimate: 840h > Remaining Estimate: 840h > > I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf) which > transforms a sentence of words into one where certain individual consecutive > words are concatenated by using a training model/estimator (e.g. "I went to > New York" becomes "I went to new_york"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14042) Add support for custom coalescers
[ https://issues.apache.org/jira/browse/SPARK-14042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14042: Assignee: Apache Spark > Add support for custom coalescers > - > > Key: SPARK-14042 > URL: https://issues.apache.org/jira/browse/SPARK-14042 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Nezih Yigitbasi >Assignee: Apache Spark > > Per our discussion on the mailing list (please see > [here|http://mail-archives.apache.org/mod_mbox//spark-dev/201602.mbox/%3CCA+g63F7aVRBH=WyyK3nvBSLCMPtSdUuL_Ge9=ww4dnmnvy4...@mail.gmail.com%3E]) > it would be nice to specify a custom coalescing policy as the current > {{coalesce()}} method only allows the user to specify the number of > partitions and we cannot really control much. The need for this feature > popped up when I wanted to merge small files by coalescing them by size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
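The size-based policy motivating this issue (merging small files until a target size is reached) can be sketched independently of Spark. This Java example (`SizeCoalescer` is a hypothetical helper illustrating the grouping logic only, not the Spark coalesce API or the interface being proposed) greedily packs partition indices into groups whose total size stays under a target:

```java
import java.util.ArrayList;
import java.util.List;

public class SizeCoalescer {
    // Greedily group partition indices so each group's total size stays at or
    // under 'target' bytes; a partition larger than 'target' gets its own group.
    public static List<List<Integer>> groupBySize(long[] sizes, long target) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long running = 0;
        for (int i = 0; i < sizes.length; i++) {
            if (!current.isEmpty() && running + sizes[i] > target) {
                groups.add(current);       // close the current group
                current = new ArrayList<>();
                running = 0;
            }
            current.add(i);
            running += sizes[i];
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

For sizes {60, 30, 10, 80, 20} with target 100, this produces two groups, [0, 1, 2] and [3, 4] — exactly the kind of control the current {{coalesce(numPartitions)}} signature cannot express, since it fixes only the partition count.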