[jira] [Commented] (SPARK-14006) Builds of 1.6 branch fail R style check

2016-03-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205841#comment-15205841
 ] 

Yin Huai commented on SPARK-14006:
--

The 1.6 branch is broken because of the R style issue. Can you take a look at it? 
If backporting that PR fixes the problem, then yes, please backport it.

> Builds of 1.6 branch fail R style check
> ---
>
> Key: SPARK-14006
> URL: https://issues.apache.org/jira/browse/SPARK-14006
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-1.6-test-sbt-hadoop-2.2/152/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14006) Builds of 1.6 branch fail R style check

2016-03-21 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205832#comment-15205832
 ] 

Sun Rui commented on SPARK-14006:
-

[~yhuai] Do you mean a backport PR to branch 1.6?

> Builds of 1.6 branch fail R style check
> ---
>
> Key: SPARK-14006
> URL: https://issues.apache.org/jira/browse/SPARK-14006
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-1.6-test-sbt-hadoop-2.2/152/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14030) Add parameter check to LBFGS

2016-03-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14030:
--
Assignee: zhengruifeng

> Add parameter check to LBFGS
> 
>
> Key: SPARK-14030
> URL: https://issues.apache.org/jira/browse/SPARK-14030
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
>
> Add the missing parameter verification in LBFGS
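
As a rough illustration of what such checks could look like (a minimal sketch with made-up class and setter names, not the actual patch), each setter would validate its argument with require() before storing it:
{code}
// Illustrative sketch only: validate optimizer parameters up front with require().
// The class and setter names here are hypothetical, not the MLlib LBFGS API.
class LBFGSParams {
  private var numCorrections: Int = 10
  private var convergenceTol: Double = 1e-4
  private var maxNumIterations: Int = 100

  def setNumCorrections(n: Int): this.type = {
    require(n > 0, s"Number of corrections must be positive but got $n")
    numCorrections = n
    this
  }

  def setConvergenceTol(tol: Double): this.type = {
    require(tol >= 0.0, s"Convergence tolerance must be nonnegative but got $tol")
    convergenceTol = tol
    this
  }

  def setMaxNumIterations(iters: Int): this.type = {
    require(iters >= 0, s"Maximum number of iterations must be nonnegative but got $iters")
    maxNumIterations = iters
    this
  }
}
{code}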



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14030) Add parameter check to LBFGS

2016-03-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14030:
--
Target Version/s: 2.0.0

> Add parameter check to LBFGS
> 
>
> Key: SPARK-14030
> URL: https://issues.apache.org/jira/browse/SPARK-14030
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Trivial
>
> Add the missing parameter verification in LBFGS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14037) count(df) is very slow for dataframe constructed using SparkR::createDataFrame

2016-03-21 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205790#comment-15205790
 ] 

Sun Rui edited comment on SPARK-14037 at 3/22/16 4:49 AM:
--

If possible, just use read.df() to load a DataFrame from a CSV file.
Loading a CSV file into a local R data.frame and calling createDataFrame() on 
it to create a DataFrame is more time-consuming because it involves launching 
external R processes on worker nodes and two rounds of data 
serialization/deserialization.

30 seconds is really slow; could you help get metrics information? Since you 
are running in standalone mode, you can go to the web UI and find something 
like the following in the worker stderr logs:
{code}
INFO r.RRDD: Times: boot = 0.518 s, init = 0.009 s, broadcast = 0.000 s, 
read-input = 0.001 s, compute = 0.002 s, write-output = 0.074 s, total = 0.604 s
{code}


was (Author: sunrui):
If possible, just use read.df() to load a DataFrame from a CSV file.
Loading a CSV file into a local R data.frame and calling createDataFrame() on 
it to create a DataFrame is more time-consuming because it involves launching 
of external R processes on worker nodes and two rounds of data 
serialization/deserialization.

30 seconds is really slow, could you help to get metrics information? Since you 
are running on standalone mode, you can goto the web UI and find something like 
below in the worker stderr logs:
```
INFO r.RRDD: Times: boot = 0.518 s, init = 0.009 s, broadcast = 0.000 s, 
read-input = 0.001 s, compute = 0.002 s, write-output = 0.074 s, total = 0.604 s
```

> count(df) is very slow for dataframe constructed using SparkR::createDataFrame
> --
>
> Key: SPARK-14037
> URL: https://issues.apache.org/jira/browse/SPARK-14037
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Ubuntu 12.04
> RAM : 6 GB
> Spark 1.6.1 Standalone
>Reporter: Samuel Alexander
>  Labels: performance, sparkR
>
> Any operation on a dataframe created using SparkR::createDataFrame is very 
> slow.
> I have a CSV of size ~6 MB. Below is the sample content:
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, 
> sep=","), and then converted it into a Spark dataframe using sp_df <- 
> createDataFrame(sqlContext, r_df).
> Now count(sp_df) took more than 30 seconds.
> When I load the same CSV using spark-csv, e.g. direct_df <- 
> read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = 
> "com.databricks.spark.csv", inferSchema = "false", header="true"),
> count(direct_df) took under 1 second.
> I know createDataFrame performance was improved in Spark 1.6, but other 
> operations, like count(), are still very slow.
> How can I get rid of this performance issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14037) count(df) is very slow for dataframe constructed using SparkR::createDataFrame

2016-03-21 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205790#comment-15205790
 ] 

Sun Rui commented on SPARK-14037:
-

If possible, just use read.df() to load a DataFrame from a CSV file.
Loading a CSV file into a local R data.frame and calling createDataFrame() on 
it to create a DataFrame is more time-consuming because it involves launching 
external R processes on worker nodes and two rounds of data 
serialization/deserialization.

30 seconds is really slow; could you help get metrics information? Since you 
are running in standalone mode, you can go to the web UI and find something 
like the following in the worker stderr logs:
```
INFO r.RRDD: Times: boot = 0.518 s, init = 0.009 s, broadcast = 0.000 s, 
read-input = 0.001 s, compute = 0.002 s, write-output = 0.074 s, total = 0.604 s
```

> count(df) is very slow for dataframe constructed using SparkR::createDataFrame
> --
>
> Key: SPARK-14037
> URL: https://issues.apache.org/jira/browse/SPARK-14037
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Ubuntu 12.04
> RAM : 6 GB
> Spark 1.6.1 Standalone
>Reporter: Samuel Alexander
>  Labels: performance, sparkR
>
> Any operation on a dataframe created using SparkR::createDataFrame is very 
> slow.
> I have a CSV of size ~6 MB. Below is the sample content:
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, 
> sep=","), and then converted it into a Spark dataframe using sp_df <- 
> createDataFrame(sqlContext, r_df).
> Now count(sp_df) took more than 30 seconds.
> When I load the same CSV using spark-csv, e.g. direct_df <- 
> read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = 
> "com.databricks.spark.csv", inferSchema = "false", header="true"),
> count(direct_df) took under 1 second.
> I know createDataFrame performance was improved in Spark 1.6, but other 
> operations, like count(), are still very slow.
> How can I get rid of this performance issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11507) Error thrown when using BlockMatrix.add

2016-03-21 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205734#comment-15205734
 ] 

yuhao yang commented on SPARK-11507:


Sure, we can do it. As for the fix, I assume we should copy first and then 
invoke compact, right?

> Error thrown when using BlockMatrix.add
> ---
>
> Key: SPARK-11507
> URL: https://issues.apache.org/jira/browse/SPARK-11507
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.5.0
> Environment: Mac/local machine, EC2
> Scala
>Reporter: Kareem Alhazred
>Priority: Minor
>
> In certain situations when adding two block matrices, I get an error 
> regarding colPtr and the operation fails.  External issue URL includes full 
> error and code for reproducing the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13883) buildReader implementation for parquet

2016-03-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13883.
--
Resolution: Fixed

Issue resolved by pull request 11709
[https://github.com/apache/spark/pull/11709]

> buildReader implementation for parquet
> --
>
> Key: SPARK-13883
> URL: https://issues.apache.org/jira/browse/SPARK-13883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>
> Port parquet to the new strategy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205679#comment-15205679
 ] 

Apache Spark commented on SPARK-14056:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/11876

> Add s3 configurations and spark.hadoop.* configurations to hive configuration
> -
>
> Key: SPARK-14056
> URL: https://issues.apache.org/jira/browse/SPARK-14056
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>
> Currently when creating a HiveConf in TableReader.scala, we are not passing 
> s3 specific configurations (like aws s3 credentials) and spark.hadoop.* 
> configurations set by the user.  We should fix this issue. 
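
As a rough sketch of the idea (an assumption about the shape of the fix, not the contents of the pull request above), the user's spark.hadoop.* entries could be copied from the SparkConf into the Hadoop configuration that backs the HiveConf:
{code}
// Hedged sketch, not the actual patch: propagate spark.hadoop.* settings
// (which include S3 credentials such as fs.s3a.access.key) into a Hadoop conf.
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

object HadoopConfPropagation {
  def appendSparkHadoopConfigs(sparkConf: SparkConf, hadoopConf: Configuration): Unit = {
    sparkConf.getAll.foreach {
      case (key, value) if key.startsWith("spark.hadoop.") =>
        hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
      case _ => // ignore settings that are not Hadoop overrides
    }
  }
}
{code}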



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14056:


Assignee: (was: Apache Spark)

> Add s3 configurations and spark.hadoop.* configurations to hive configuration
> -
>
> Key: SPARK-14056
> URL: https://issues.apache.org/jira/browse/SPARK-14056
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>
> Currently when creating a HiveConf in TableReader.scala, we are not passing 
> s3 specific configurations (like aws s3 credentials) and spark.hadoop.* 
> configurations set by the user.  We should fix this issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14056:


Assignee: Apache Spark

> Add s3 configurations and spark.hadoop.* configurations to hive configuration
> -
>
> Key: SPARK-14056
> URL: https://issues.apache.org/jira/browse/SPARK-14056
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Assignee: Apache Spark
>
> Currently when creating a HiveConf in TableReader.scala, we are not passing 
> s3 specific configurations (like aws s3 credentials) and spark.hadoop.* 
> configurations set by the user.  We should fix this issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14057) sql time stamps do not respect time zones

2016-03-21 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-14057:
---

 Summary: sql time stamps do not respect time zones
 Key: SPARK-14057
 URL: https://issues.apache.org/jira/browse/SPARK-14057
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Andrew Davidson
Priority: Minor


We have timestamp data. The timestamps are in UTC; however, when we load the 
data into Spark data frames, the system assumes the timestamps are in the local 
time zone. This causes problems for our data scientists: they often pull data 
from our data center onto their local Macs, and while the data centers run UTC, 
their computers are typically in PST or EST.

It is possible to hack around this problem, but it causes a lot of errors in 
their analysis.

A complete description of this issue can be found in the following mail message:

https://www.mail-archive.com/user@spark.apache.org/msg48121.html
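
One common workaround (an assumption on my part, not necessarily the hack described in the thread linked above) is to pin the driver and executor JVMs to UTC so that timestamp parsing no longer depends on the machine's local time zone:
{code}
import java.util.TimeZone

object ForceUtcExample {
  def main(args: Array[String]): Unit = {
    // Assumed workaround, not necessarily the one from the linked thread:
    // pin the driver JVM to UTC before any timestamps are parsed.
    TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

    // Executors need the same setting, for example via
    //   --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
  }
}
{code}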



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14055) AssertionError may happen if writeLock is not unlocked in the 'removeBlock' method

2016-03-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-14055:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0
 Priority: Critical  (was: Minor)

> AssertionError may happen if writeLock is not unlocked in the 'removeBlock' 
> method
> 
>
> Key: SPARK-14055
> URL: https://issues.apache.org/jira/browse/SPARK-14055
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 2
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Ernest
>Priority: Critical
>
> We got the following log when running _LiveJournalPageRank_.
> {quote}
> 452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to 
> acquire write lock for rdd_3_183
> 452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write 
> lock for rdd_3_183
> 456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from 
> memory
> 456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size 
> 418784648 dropped from memory (free 3504141600)
> 457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block 
> rdd_3_183
> 457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block 
> rdd_3_183
> 457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to 
> remove block rdd_3_183
> 500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put 
> rdd_3_183
> 500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to 
> acquire read lock for rdd_3_183
> 500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to 
> acquire write lock for rdd_3_183
> 500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write 
> lock for rdd_3_183
> 517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ** taskAttemptId is: 
> 1662, info.writerTask is: 1681, blockID is: rdd_3_183 so AssertionError 
> happeneds here*
> 517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage 
> 10.0 (TID 1662)
> 517259-java.lang.AssertionError: assertion failed
> 517260- at scala.Predef$.assert(Predef.scala:151)
> 517261- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356)
> 517262- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351)
> 517263- at scala.Option.foreach(Option.scala:257)
> 517264- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351)
> 517265- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350)
> 517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
> 517267- at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350)
> 517268- at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626)
> 517269- at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238)
> {quote}
> When memory for RDD storage is not sufficient and several partitions have to 
> be evicted, this _AssertionError_ may happen.
> For the above example, while running _Task 1662_, several partitions 
> (including rdd_3_183) needed to be evicted. _Task 1662_ first acquired the 
> read and write locks, then ran the _dropBlock_ method in 
> _MemoryStore.evictBlocksToFreeSpace_ and actually dropped _rdd_3_183_ from 
> memory. _newEffectiveStorageLevel.isValid_ is false, so we run into 
> _BlockInfoManager.removeBlock_, but _writeLocksByTask_ is not updated there.
> Unfortunately, _Task 1681_ had already started and needed to recompute 
> rdd\_3\_183 to produce its target RDD, and that task had acquired the write 
> lock on rdd\_3\_183. When _Task 1662_ finally called _releaseAllLocksForTask_, 
> this _AssertionError_ occurred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14055) AssertionError may happen if writeLock is not unlocked in the 'removeBlock' method

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14055:


Assignee: (was: Apache Spark)

> AssertionError may happen if writeLock is not unlocked in the 'removeBlock' 
> method
> 
>
> Key: SPARK-14055
> URL: https://issues.apache.org/jira/browse/SPARK-14055
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 2
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Ernest
>Priority: Minor
>
> We got the following log when running _LiveJournalPageRank_.
> {quote}
> 452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to 
> acquire write lock for rdd_3_183
> 452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write 
> lock for rdd_3_183
> 456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from 
> memory
> 456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size 
> 418784648 dropped from memory (free 3504141600)
> 457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block 
> rdd_3_183
> 457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block 
> rdd_3_183
> 457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to 
> remove block rdd_3_183
> 500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put 
> rdd_3_183
> 500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to 
> acquire read lock for rdd_3_183
> 500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to 
> acquire write lock for rdd_3_183
> 500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write 
> lock for rdd_3_183
> 517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ** taskAttemptId is: 
> 1662, info.writerTask is: 1681, blockID is: rdd_3_183 so AssertionError 
> happeneds here*
> 517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage 
> 10.0 (TID 1662)
> 517259-java.lang.AssertionError: assertion failed
> 517260- at scala.Predef$.assert(Predef.scala:151)
> 517261- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356)
> 517262- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351)
> 517263- at scala.Option.foreach(Option.scala:257)
> 517264- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351)
> 517265- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350)
> 517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
> 517267- at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350)
> 517268- at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626)
> 517269- at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238)
> {quote}
> When memory for RDD storage is not sufficient and several partitions have to 
> be evicted, this _AssertionError_ may happen.
> For the above example, while running _Task 1662_, several partitions 
> (including rdd_3_183) needed to be evicted. _Task 1662_ first acquired the 
> read and write locks, then ran the _dropBlock_ method in 
> _MemoryStore.evictBlocksToFreeSpace_ and actually dropped _rdd_3_183_ from 
> memory. _newEffectiveStorageLevel.isValid_ is false, so we run into 
> _BlockInfoManager.removeBlock_, but _writeLocksByTask_ is not updated there.
> Unfortunately, _Task 1681_ had already started and needed to recompute 
> rdd\_3\_183 to produce its target RDD, and that task had acquired the write 
> lock on rdd\_3\_183. When _Task 1662_ finally called _releaseAllLocksForTask_, 
> this _AssertionError_ occurred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14055) AssertionError may happen if writeLock is not unlocked in the 'removeBlock' method

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205659#comment-15205659
 ] 

Apache Spark commented on SPARK-14055:
--

User 'Earne' has created a pull request for this issue:
https://github.com/apache/spark/pull/11875

> AssertionError may happen if writeLock is not unlocked in the 'removeBlock' 
> method
> 
>
> Key: SPARK-14055
> URL: https://issues.apache.org/jira/browse/SPARK-14055
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 2
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Ernest
>Priority: Minor
>
> We got the following log when running _LiveJournalPageRank_.
> {quote}
> 452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to 
> acquire write lock for rdd_3_183
> 452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write 
> lock for rdd_3_183
> 456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from 
> memory
> 456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size 
> 418784648 dropped from memory (free 3504141600)
> 457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block 
> rdd_3_183
> 457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block 
> rdd_3_183
> 457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to 
> remove block rdd_3_183
> 500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put 
> rdd_3_183
> 500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to 
> acquire read lock for rdd_3_183
> 500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to 
> acquire write lock for rdd_3_183
> 500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write 
> lock for rdd_3_183
> 517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ** taskAttemptId is: 
> 1662, info.writerTask is: 1681, blockID is: rdd_3_183 so AssertionError 
> happeneds here*
> 517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage 
> 10.0 (TID 1662)
> 517259-java.lang.AssertionError: assertion failed
> 517260- at scala.Predef$.assert(Predef.scala:151)
> 517261- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356)
> 517262- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351)
> 517263- at scala.Option.foreach(Option.scala:257)
> 517264- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351)
> 517265- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350)
> 517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
> 517267- at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350)
> 517268- at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626)
> 517269- at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238)
> {quote}
> When memory for RDD storage is not sufficient and several partitions have to 
> be evicted, this _AssertionError_ may happen.
> For the above example, while running _Task 1662_, several partitions 
> (including rdd_3_183) needed to be evicted. _Task 1662_ first acquired the 
> read and write locks, then ran the _dropBlock_ method in 
> _MemoryStore.evictBlocksToFreeSpace_ and actually dropped _rdd_3_183_ from 
> memory. _newEffectiveStorageLevel.isValid_ is false, so we run into 
> _BlockInfoManager.removeBlock_, but _writeLocksByTask_ is not updated there.
> Unfortunately, _Task 1681_ had already started and needed to recompute 
> rdd\_3\_183 to produce its target RDD, and that task had acquired the write 
> lock on rdd\_3\_183. When _Task 1662_ finally called _releaseAllLocksForTask_, 
> this _AssertionError_ occurred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14055) AssertionError may happen if writeLock is not unlocked in the 'removeBlock' method

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14055:


Assignee: Apache Spark

> AssertionError may happen if writeLock is not unlocked in the 'removeBlock' 
> method
> 
>
> Key: SPARK-14055
> URL: https://issues.apache.org/jira/browse/SPARK-14055
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 2
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Ernest
>Assignee: Apache Spark
>Priority: Minor
>
> We got the following log when running _LiveJournalPageRank_.
> {quote}
> 452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to 
> acquire write lock for rdd_3_183
> 452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write 
> lock for rdd_3_183
> 456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from 
> memory
> 456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size 
> 418784648 dropped from memory (free 3504141600)
> 457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block 
> rdd_3_183
> 457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block 
> rdd_3_183
> 457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to 
> remove block rdd_3_183
> 500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put 
> rdd_3_183
> 500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to 
> acquire read lock for rdd_3_183
> 500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to 
> acquire write lock for rdd_3_183
> 500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write 
> lock for rdd_3_183
> 517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ** taskAttemptId is: 
> 1662, info.writerTask is: 1681, blockID is: rdd_3_183 so AssertionError 
> happeneds here*
> 517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage 
> 10.0 (TID 1662)
> 517259-java.lang.AssertionError: assertion failed
> 517260- at scala.Predef$.assert(Predef.scala:151)
> 517261- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356)
> 517262- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351)
> 517263- at scala.Option.foreach(Option.scala:257)
> 517264- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351)
> 517265- at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350)
> 517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
> 517267- at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350)
> 517268- at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626)
> 517269- at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238)
> {quote}
> When memory for RDD storage is not sufficient and several partitions have to 
> be evicted, this _AssertionError_ may happen.
> For the above example, while running _Task 1662_, several partitions 
> (including rdd_3_183) needed to be evicted. _Task 1662_ first acquired the 
> read and write locks, then ran the _dropBlock_ method in 
> _MemoryStore.evictBlocksToFreeSpace_ and actually dropped _rdd_3_183_ from 
> memory. _newEffectiveStorageLevel.isValid_ is false, so we run into 
> _BlockInfoManager.removeBlock_, but _writeLocksByTask_ is not updated there.
> Unfortunately, _Task 1681_ had already started and needed to recompute 
> rdd\_3\_183 to produce its target RDD, and that task had acquired the write 
> lock on rdd\_3\_183. When _Task 1662_ finally called _releaseAllLocksForTask_, 
> this _AssertionError_ occurred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3000) Drop old blocks to disk in parallel when memory is not large enough for caching new blocks

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3000:
---

Assignee: Josh Rosen  (was: Apache Spark)

> Drop old blocks to disk in parallel when memory is not large enough for 
> caching new blocks
> --
>
> Key: SPARK-3000
> URL: https://issues.apache.org/jira/browse/SPARK-3000
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.1.0
>Reporter: Zhang, Liye
>Assignee: Josh Rosen
> Attachments: Spark-3000 Design Doc.pdf
>
>
> In Spark, an RDD can be cached in memory for later use. The cached memory 
> size is "*spark.executor.memory * spark.storage.memoryFraction*" for Spark 
> versions before 1.1.0, and "*spark.executor.memory * 
> spark.storage.memoryFraction * spark.storage.safetyFraction*" after 
> [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. 
> For storage level *MEMORY_AND_DISK*, when free memory is not enough to cache 
> new blocks, old blocks might be dropped to disk to free up memory for new 
> blocks. This operation is handled by _ensureFreeSpace_ in 
> _MemoryStore.scala_, and an "*accountingLock*" is always held by the 
> caller to ensure only one thread is dropping blocks. This approach cannot 
> fully use the disks' throughput when there are multiple disks on the worker 
> node. When testing our workload, we found this is a real bottleneck when 
> the size of the old blocks to be dropped is large. 
> We have tested a parallel method on Spark 1.0, and the speedup is significant, 
> so it's necessary to make the block-dropping operation parallel.
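
To make the idea concrete, here is a minimal sketch (an illustration only, with hypothetical function names, not the design in the attached doc or the pull request): victims are still selected under the lock, but the actual disk writes run on a small thread pool so multiple disks can be used at once.
{code}
// Illustration only: spill evicted blocks to disk in parallel. The function
// spillToDisk is hypothetical; block selection is assumed to already be done.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelDropSketch {
  private val pool = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

  def dropBlocksInParallel(victims: Seq[String], spillToDisk: String => Unit): Unit = {
    // one disk write per victim block, running concurrently on the pool
    val writes = victims.map(blockId => Future(spillToDisk(blockId))(pool))
    writes.foreach(w => Await.result(w, Duration.Inf)) // wait for all spills to finish
  }
}
{code}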



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3000) Drop old blocks to disk in parallel when memory is not large enough for caching new blocks

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205651#comment-15205651
 ] 

Apache Spark commented on SPARK-3000:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11874

> Drop old blocks to disk in parallel when memory is not large enough for 
> caching new blocks
> --
>
> Key: SPARK-3000
> URL: https://issues.apache.org/jira/browse/SPARK-3000
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.1.0
>Reporter: Zhang, Liye
>Assignee: Josh Rosen
> Attachments: Spark-3000 Design Doc.pdf
>
>
> In Spark, an RDD can be cached in memory for later use. The cached memory 
> size is "*spark.executor.memory * spark.storage.memoryFraction*" for Spark 
> versions before 1.1.0, and "*spark.executor.memory * 
> spark.storage.memoryFraction * spark.storage.safetyFraction*" after 
> [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. 
> For storage level *MEMORY_AND_DISK*, when free memory is not enough to cache 
> new blocks, old blocks might be dropped to disk to free up memory for new 
> blocks. This operation is handled by _ensureFreeSpace_ in 
> _MemoryStore.scala_, and an "*accountingLock*" is always held by the 
> caller to ensure only one thread is dropping blocks. This approach cannot 
> fully use the disks' throughput when there are multiple disks on the worker 
> node. When testing our workload, we found this is a real bottleneck when 
> the size of the old blocks to be dropped is large. 
> We have tested a parallel method on Spark 1.0, and the speedup is significant, 
> so it's necessary to make the block-dropping operation parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3000) Drop old blocks to disk in parallel when memory is not large enough for caching new blocks

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3000:
---

Assignee: Apache Spark  (was: Josh Rosen)

> Drop old blocks to disk in parallel when memory is not large enough for 
> caching new blocks
> --
>
> Key: SPARK-3000
> URL: https://issues.apache.org/jira/browse/SPARK-3000
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.1.0
>Reporter: Zhang, Liye
>Assignee: Apache Spark
> Attachments: Spark-3000 Design Doc.pdf
>
>
> In Spark, an RDD can be cached in memory for later use. The cached memory 
> size is "*spark.executor.memory * spark.storage.memoryFraction*" for Spark 
> versions before 1.1.0, and "*spark.executor.memory * 
> spark.storage.memoryFraction * spark.storage.safetyFraction*" after 
> [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. 
> For storage level *MEMORY_AND_DISK*, when free memory is not enough to cache 
> new blocks, old blocks might be dropped to disk to free up memory for new 
> blocks. This operation is handled by _ensureFreeSpace_ in 
> _MemoryStore.scala_, and an "*accountingLock*" is always held by the 
> caller to ensure only one thread is dropping blocks. This approach cannot 
> fully use the disks' throughput when there are multiple disks on the worker 
> node. When testing our workload, we found this is a real bottleneck when 
> the size of the old blocks to be dropped is large. 
> We have tested a parallel method on Spark 1.0, and the speedup is significant, 
> so it's necessary to make the block-dropping operation parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration

2016-03-21 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-14056:

Affects Version/s: 1.6.1
  Component/s: SQL
   EC2

> Add s3 configurations and spark.hadoop.* configurations to hive configuration
> -
>
> Key: SPARK-14056
> URL: https://issues.apache.org/jira/browse/SPARK-14056
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>
> Currently when creating a HiveConf in TableReader.scala, we are not passing 
> s3 specific configurations (like aws s3 credentials) and spark.hadoop.* 
> configurations set by the user.  We should fix this issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration

2016-03-21 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-14056:
---

 Summary: Add s3 configurations and spark.hadoop.* configurations 
to hive configuration
 Key: SPARK-14056
 URL: https://issues.apache.org/jira/browse/SPARK-14056
 Project: Spark
  Issue Type: Improvement
Reporter: Sital Kedia


Currently when creating a HiveConf in TableReader.scala, we are not passing s3 
specific configurations (like aws s3 credentials) and spark.hadoop.* 
configurations set by the user.  We should fix this issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14055) AssertionError may happen if writeLock is not unlocked in the 'removeBlock' method

2016-03-21 Thread Ernest (JIRA)
Ernest created SPARK-14055:
--

 Summary: AssertionError may happen if writeLock is not unlocked in 
the 'removeBlock' method
 Key: SPARK-14055
 URL: https://issues.apache.org/jira/browse/SPARK-14055
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
 Environment: Spark 2.0-SNAPSHOT
Single Rack
Standalone mode scheduling
8 node cluster
16 cores & 64G RAM / node
Data Replication factor of 2

Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
Reporter: Ernest
Priority: Minor


We got the following log when running _LiveJournalPageRank_.
{quote}
452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to 
acquire write lock for rdd_3_183
452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write 
lock for rdd_3_183
456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from 
memory
456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size 
418784648 dropped from memory (free 3504141600)
457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block 
rdd_3_183
457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block 
rdd_3_183
457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to remove 
block rdd_3_183
500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put 
rdd_3_183
500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to 
acquire read lock for rdd_3_183
500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to 
acquire write lock for rdd_3_183
500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write 
lock for rdd_3_183
517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ** taskAttemptId is: 
1662, info.writerTask is: 1681, blockID is: rdd_3_183 so AssertionError 
happeneds here*
517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage 
10.0 (TID 1662)
517259-java.lang.AssertionError: assertion failed
517260- at scala.Predef$.assert(Predef.scala:151)
517261- at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356)
517262- at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351)
517263- at scala.Option.foreach(Option.scala:257)
517264- at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351)
517265- at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350)
517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
517267- at 
org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350)
517268- at 
org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626)
517269- at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238)
{quote}

When memory for RDD storage is not sufficient and several partitions have to be 
evicted, this _AssertionError_ may happen.
For the above example, while running _Task 1662_, several partitions (including 
rdd_3_183) needed to be evicted. _Task 1662_ first acquired the read and write 
locks, then ran the _dropBlock_ method in _MemoryStore.evictBlocksToFreeSpace_ 
and actually dropped _rdd_3_183_ from memory. _newEffectiveStorageLevel.isValid_ 
is false, so we run into _BlockInfoManager.removeBlock_, but _writeLocksByTask_ 
is not updated there.

Unfortunately, _Task 1681_ had already started and needed to recompute 
rdd\_3\_183 to produce its target RDD, and that task had acquired the write lock 
on rdd\_3\_183. When _Task 1662_ finally called _releaseAllLocksForTask_, this 
_AssertionError_ occurred.
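
The race can be seen in isolation with a small, self-contained model (a simplification for illustration only, not the actual BlockInfoManager code):
{code}
// Simplified model of the reported race, for illustration only.
import scala.collection.mutable

object WriteLockModel {
  // blockId -> task currently holding the write lock
  private val writerTask = mutable.Map[String, Long]()
  // task -> blocks it believes it holds write locks on
  private val writeLocksByTask = mutable.Map[Long, mutable.Set[String]]()

  def acquireWriteLock(task: Long, block: String): Unit = {
    writerTask(block) = task
    writeLocksByTask.getOrElseUpdate(task, mutable.Set.empty[String]) += block
  }

  // Models removeBlock forgetting to clear writeLocksByTask for the dropping task.
  def removeBlockBuggy(block: String): Unit = {
    writerTask -= block // the block is gone, but writeLocksByTask still references it
  }

  def releaseAllLocksForTask(task: Long): Unit = {
    writeLocksByTask.getOrElse(task, mutable.Set.empty[String]).foreach { block =>
      writerTask.get(block).foreach { holder =>
        assert(holder == task, s"task $task releasing a lock now held by $holder")
      }
    }
    writeLocksByTask -= task
  }

  def main(args: Array[String]): Unit = {
    acquireWriteLock(1662L, "rdd_3_183")
    removeBlockBuggy("rdd_3_183")        // writeLocksByTask(1662) is not updated
    acquireWriteLock(1681L, "rdd_3_183") // another task recomputes and re-locks the block
    releaseAllLocksForTask(1662L)        // AssertionError: the lock is now held by 1681
  }
}
{code}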



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14036) Remove mllib.tree.model.Node.build

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205590#comment-15205590
 ] 

Apache Spark commented on SPARK-14036:
--

User 'rishabhbhardwaj' has created a pull request for this issue:
https://github.com/apache/spark/pull/11873

> Remove mllib.tree.model.Node.build
> --
>
> Key: SPARK-14036
> URL: https://issues.apache.org/jira/browse/SPARK-14036
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> mllib.tree.model.Node.build has been deprecated for a year.  We should remove 
> it for 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14036) Remove mllib.tree.model.Node.build

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14036:


Assignee: (was: Apache Spark)

> Remove mllib.tree.model.Node.build
> --
>
> Key: SPARK-14036
> URL: https://issues.apache.org/jira/browse/SPARK-14036
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> mllib.tree.model.Node.build has been deprecated for a year.  We should remove 
> it for 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14036) Remove mllib.tree.model.Node.build

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14036:


Assignee: Apache Spark

> Remove mllib.tree.model.Node.build
> --
>
> Key: SPARK-14036
> URL: https://issues.apache.org/jira/browse/SPARK-14036
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Trivial
>
> mllib.tree.model.Node.build has been deprecated for a year.  We should remove 
> it for 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14038) Enable native view by default

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205579#comment-15205579
 ] 

Apache Spark commented on SPARK-14038:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11872

> Enable native view by default
> -
>
> Key: SPARK-14038
> URL: https://issues.apache.org/jira/browse/SPARK-14038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>  Labels: releasenotes
>
> Release note update:
> {quote}
> Starting from 2.0.0, Spark SQL handles views natively by default. When 
> defining a view, Spark SQL now canonicalizes the view definition by generating 
> a canonical SQL statement from the parsed logical query plan, and then stores 
> it in the catalog. If you hit any problems, you can turn off native views 
> by setting {{spark.sql.nativeView}} to false.
> {quote}
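
For reference, the flag mentioned in the note can be flipped at runtime (a minimal, self-contained sketch; the app name and master are placeholders):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object DisableNativeView {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "disable-native-view-example")
    val sqlContext = new SQLContext(sc)
    // Turn native view handling off if it causes problems (see release note above).
    sqlContext.setConf("spark.sql.nativeView", "false")
    sc.stop()
  }
}
{code}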



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14016) Support high-precision decimals in vectorized parquet reader

2016-03-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14016.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11869
[https://github.com/apache/spark/pull/11869]

> Support high-precision decimals in vectorized parquet reader
> 
>
> Key: SPARK-14016
> URL: https://issues.apache.org/jira/browse/SPARK-14016
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14016) Support high-precision decimals in vectorized parquet reader

2016-03-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14016:
-
Assignee: Sameer Agarwal

> Support high-precision decimals in vectorized parquet reader
> 
>
> Key: SPARK-14016
> URL: https://issues.apache.org/jira/browse/SPARK-14016
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3000) Drop old blocks to disk in parallel when memory is not large enough for caching new blocks

2016-03-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3000:
--
Target Version/s: 2.0.0
 Summary: Drop old blocks to disk in parallel when memory is not 
large enough for caching new blocks  (was: drop old blocks to disk in parallel 
when memory is not large enough for caching new blocks)

> Drop old blocks to disk in parallel when memory is not large enough for 
> caching new blocks
> --
>
> Key: SPARK-3000
> URL: https://issues.apache.org/jira/browse/SPARK-3000
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.1.0
>Reporter: Zhang, Liye
>Assignee: Josh Rosen
> Attachments: Spark-3000 Design Doc.pdf
>
>
> In Spark, an RDD can be cached in memory for later use. The cached memory 
> size is "*spark.executor.memory * spark.storage.memoryFraction*" for Spark 
> versions before 1.1.0, and "*spark.executor.memory * 
> spark.storage.memoryFraction * spark.storage.safetyFraction*" after 
> [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. 
> For storage level *MEMORY_AND_DISK*, when free memory is not enough to cache 
> new blocks, old blocks might be dropped to disk to free up memory for new 
> blocks. This operation is handled by _ensureFreeSpace_ in 
> _MemoryStore.scala_, and an "*accountingLock*" is always held by the 
> caller to ensure only one thread is dropping blocks. This approach cannot 
> fully use the disks' throughput when there are multiple disks on the worker 
> node. When testing our workload, we found this is a real bottleneck when 
> the size of the old blocks to be dropped is large. 
> We have tested a parallel method on Spark 1.0, and the speedup is significant, 
> so it's necessary to make the block-dropping operation parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14054) Support parameters for UDTs

2016-03-21 Thread Kevin Chen (JIRA)
Kevin Chen created SPARK-14054:
--

 Summary: Support parameters for UDTs
 Key: SPARK-14054
 URL: https://issues.apache.org/jira/browse/SPARK-14054
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.6.1
Reporter: Kevin Chen
Priority: Minor


Currently, UDTs with parameters (e.g. generic types) are not supported. 
JSON-serialized UDTs are instantiated via reflection through a parameterless 
constructor (DataType.fromJson). This means a user needs to create a separate 
UDT for types that differ only in their generic parameters, e.g. one backed by 
a list of strings and another backed by a list of integers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3000) drop old blocks to disk in parallel when memory is not large enough for caching new blocks

2016-03-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reopened SPARK-3000:
---
  Assignee: Josh Rosen  (was: Zhang, Liye)

I'm going to re-open this issue and will submit a significantly simplified 
patch for it.

> drop old blocks to disk in parallel when memory is not large enough for 
> caching new blocks
> --
>
> Key: SPARK-3000
> URL: https://issues.apache.org/jira/browse/SPARK-3000
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.1.0
>Reporter: Zhang, Liye
>Assignee: Josh Rosen
> Attachments: Spark-3000 Design Doc.pdf
>
>
> In Spark, an RDD can be cached in memory for later use. The cached memory 
> size is "*spark.executor.memory * spark.storage.memoryFraction*" for Spark 
> versions before 1.1.0, and "*spark.executor.memory * 
> spark.storage.memoryFraction * spark.storage.safetyFraction*" after 
> [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. 
> For storage level *MEMORY_AND_DISK*, when free memory is not enough to cache 
> new blocks, old blocks might be dropped to disk to free up memory for new 
> blocks. This operation is handled by _ensureFreeSpace_ in 
> _MemoryStore.scala_, and an "*accountingLock*" is always held by the 
> caller to ensure only one thread is dropping blocks. This approach cannot 
> fully use the disks' throughput when there are multiple disks on the worker 
> node. When testing our workload, we found this is a real bottleneck when 
> the size of the old blocks to be dropped is large. 
> We have tested a parallel method on Spark 1.0, and the speedup is significant, 
> so it's necessary to make the block-dropping operation parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3000) drop old blocks to disk in parallel when memory is not large enough for caching new blocks

2016-03-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3000:
--
Component/s: Block Manager

> drop old blocks to disk in parallel when memory is not large enough for 
> caching new blocks
> --
>
> Key: SPARK-3000
> URL: https://issues.apache.org/jira/browse/SPARK-3000
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.1.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
> Attachments: Spark-3000 Design Doc.pdf
>
>
> In spark, rdd can be cached in memory for later use, and the cached memory 
> size is "*spark.executor.memory * spark.storage.memoryFraction*" for spark 
> version before 1.1.0, and "*spark.executor.memory * 
> spark.storage.memoryFraction * spark.storage.safetyFraction*" after 
> [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. 
> For Storage level *MEMORY_AND_DISK*, when free memory is not enough to cache 
> new blocks, old blocks might be dropped to disk to free up memory for new 
> blocks. This operation is processed by _ensureFreeSpace_ in 
> _MemoryStore.scala_, and there is always an "*accountingLock*" held by the 
> caller to ensure only one thread is dropping blocks. This method cannot 
> fully use the disks' throughput when there are multiple disks on the worker 
> node. When testing our workload, we found this is really a bottleneck when 
> the size of the old blocks to be dropped is large. 
> We have tested the parallel method on Spark 1.0, and the speedup is significant. 
> So it's necessary to make the block-dropping operation parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14053) Merge absTol and relTol into one in MLlib tests

2016-03-21 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205514#comment-15205514
 ] 

DB Tsai edited comment on SPARK-14053 at 3/22/16 1:13 AM:
--

This makes sense for me. We just need to document it properly. Also, the 
current code for comparing double is symmetric. We can do

If abs(y) > eps / t && abs(x) > eps / t  test abs(y - x) < t * math.min(absX, 
absY)
else test abs(y - x) < eps
```


was (Author: dbtsai):
This makes sense for me. We just need to document it properly. Also, the 
current code for comparing double is symmetric. We can do

```If (abs(y) > eps / t && abs(x) > eps / t) test abs(y - x) < t * 
math.min(absX, absY)
else test abs(y - x) < eps
```

> Merge absTol and relTol into one in MLlib tests
> ---
>
> Key: SPARK-14053
> URL: https://issues.apache.org/jira/browse/SPARK-14053
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, Tests
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We have absTol and relTol in MLlib tests to compare values with possible 
> numerical differences. However, in most cases we should just use relTol. Many 
> absTol are not used properly. See 
> https://github.com/apache/spark/search?q=absTol. One corner case relTol 
> doesn't handle is when the target value is 0. We can make the following 
> change to relTol to solve the issue. Consider `x ~== y relTol t`.
> 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t,
> 2. else test abs(y - x) < eps
> where eps is a reasonably small value, e.g., 1e-14. Note that the transition 
> is smooth at abs( y ) = eps / t.
> cc [~dbtsai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14053) Merge absTol and relTol into one in MLlib tests

2016-03-21 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205514#comment-15205514
 ] 

DB Tsai edited comment on SPARK-14053 at 3/22/16 1:14 AM:
--

This makes sense for me. We just need to document it properly. Also, the 
current code for comparing double is symmetric. We can do

If abs( y ) > eps / t && abs( x ) > eps / t  test abs(y - x) < t * 
math.min(absX, absY)
else test abs(y - x) < eps
```


was (Author: dbtsai):
This makes sense for me. We just need to document it properly. Also, the 
current code for comparing double is symmetric. We can do

If abs(y) > eps / t && abs(x) > eps / t  test abs(y - x) < t * math.min(absX, 
absY)
else test abs(y - x) < eps
```

> Merge absTol and relTol into one in MLlib tests
> ---
>
> Key: SPARK-14053
> URL: https://issues.apache.org/jira/browse/SPARK-14053
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, Tests
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We have absTol and relTol in MLlib tests to compare values with possible 
> numerical differences. However, in most cases we should just use relTol. Many 
> absTol are not used properly. See 
> https://github.com/apache/spark/search?q=absTol. One corner case relTol 
> doesn't handle is when the target value is 0. We can make the following 
> change to relTol to solve the issue. Consider `x ~== y relTol t`.
> 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t,
> 2. else test abs(y - x) < eps
> where eps is a reasonably small value, e.g., 1e-14. Note that the transition 
> is smooth at abs( y ) = eps / t.
> cc [~dbtsai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14053) Merge absTol and relTol into one in MLlib tests

2016-03-21 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205514#comment-15205514
 ] 

DB Tsai edited comment on SPARK-14053 at 3/22/16 1:13 AM:
--

This makes sense for me. We just need to document it properly. Also, the 
current code for comparing double is symmetric. We can do

```If (abs(y) > eps / t && abs(x) > eps / t) test abs(y - x) < t * 
math.min(absX, absY)
else test abs(y - x) < eps
```


was (Author: dbtsai):
This makes sense for me. We just need to document it properly. Also, the 
current code for comparing double is symmetric. We can do

If (abs(y) > eps / t && abs(x) > eps / t) test abs(y - x) < t * math.min(absX, 
absY)
else test abs(y - x) < eps


> Merge absTol and relTol into one in MLlib tests
> ---
>
> Key: SPARK-14053
> URL: https://issues.apache.org/jira/browse/SPARK-14053
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, Tests
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We have absTol and relTol in MLlib tests to compare values with possible 
> numerical differences. However, in most cases we should just use relTol. Many 
> absTol are not used properly. See 
> https://github.com/apache/spark/search?q=absTol. One corner case relTol 
> doesn't handle is when the target value is 0. We can make the following 
> change to relTol to solve the issue. Consider `x ~== y relTol t`.
> 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t,
> 2. else test abs(y - x) < eps
> where eps is a reasonably small value, e.g., 1e-14. Note that the transition 
> is smooth at abs( y ) = eps / t.
> cc [~dbtsai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14053) Merge absTol and relTol into one in MLlib tests

2016-03-21 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205514#comment-15205514
 ] 

DB Tsai commented on SPARK-14053:
-

This makes sense for me. We just need to document it properly. Also, the 
current code for comparing double is symmetric. We can do

If (abs(y) > eps / t && abs(x) > eps / t) test abs(y - x) < t * math.min(absX, 
absY)
else test abs(y - x) < eps
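
For reference, that symmetric variant written out as plain Scala (illustrative
only, not the actual TestingUtils code):

{code}
// Symmetric comparison: relative check when both values are far enough from
// zero, absolute check against eps otherwise.
def approxEqualSym(x: Double, y: Double, t: Double, eps: Double = 1e-14): Boolean = {
  val absX = math.abs(x)
  val absY = math.abs(y)
  if (absX > eps / t && absY > eps / t) math.abs(x - y) < t * math.min(absX, absY)
  else math.abs(x - y) < eps
}

// e.g. with t = 1e-3:
// approxEqualSym(1.0, 1.0005, 1e-3) -> true   (relative branch)
// approxEqualSym(0.0, 1e-15, 1e-3)  -> true   (absolute branch)
// approxEqualSym(0.0, 1e-10, 1e-3)  -> false
{code}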


> Merge absTol and relTol into one in MLlib tests
> ---
>
> Key: SPARK-14053
> URL: https://issues.apache.org/jira/browse/SPARK-14053
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, Tests
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We have absTol and relTol in MLlib tests to compare values with possible 
> numerical differences. However, in most cases we should just use relTol. Many 
> absTol are not used properly. See 
> https://github.com/apache/spark/search?q=absTol. One corner case relTol 
> doesn't handle is when the target value is 0. We can make the following 
> change to relTol to solve the issue. Consider `x ~== y relTol t`.
> 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t,
> 2. else test abs(y - x) < eps
> where eps is a reasonably small value, e.g., 1e-14. Note that the transition 
> is smooth at abs( y ) = eps / t.
> cc [~dbtsai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-03-21 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205488#comment-15205488
 ] 

Josh Rosen commented on SPARK-6305:
---

Hey Sean, did you get very far along with this? I'd like to revisit doing a 
Log4J 2.x upgrade in Spark 2.0 in order to benefit from some performance 
benefits in the new Log4J.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-03-21 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205446#comment-15205446
 ] 

Jason C Lee commented on SPARK-13802:
-

I will give it a shot! Working on the PR at the moment. 

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using the Row constructor with kwargs, the fields in the underlying tuple 
> are sorted by name. When the schema reads the row, it does not use the fields 
> in this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13320) Confusing error message for Dataset API when using sum("*")

2016-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-13320.
-
Resolution: Fixed

> Confusing error message for Dataset API when using sum("*")
> ---
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Xiao Li
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input 
> columns _1, _2;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57)
>   at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213)
> {code}
> The error is with sum("*"), not the resolution of group by "_1".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-13320) Confusing error message for Dataset API when using sum("*")

2016-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-13320:

Assignee: Xiao Li

> Confusing error message for Dataset API when using sum("*")
> ---
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Xiao Li
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input 
> columns _1, _2;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57)
>   at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213)
> {code}
> The error is with sum("*"), not the resolution of group by "_1".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (SPARK-13990) Automatically pick serializer when caching RDDs

2016-03-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-13990.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11801
[https://github.com/apache/spark/pull/11801]

> Automatically pick serializer when caching RDDs
> ---
>
> Key: SPARK-13990
> URL: https://issues.apache.org/jira/browse/SPARK-13990
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Building on the SerializerManager infrastructure introduced in SPARK-13926, 
> we should use the RDDs' ClassTags to automatically pick serializers when 
> caching RDDs.
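
As a rough illustration of the idea (a hand-written helper, not the actual
SerializerManager code): the element ClassTag tells us whether the type is a
primitive or String, which Kryo can handle without any user registration;
anything else would fall back to the default serializer.

{code}
import scala.reflect.ClassTag

// Illustrative only: decide a serializer name from an RDD's element ClassTag.
def pickSerializerName[T](implicit ct: ClassTag[T]): String = {
  val kryoSafe: Set[ClassTag[_]] = Set(
    ClassTag.Boolean, ClassTag.Byte, ClassTag.Char, ClassTag.Short,
    ClassTag.Int, ClassTag.Long, ClassTag.Float, ClassTag.Double,
    ClassTag(classOf[String]))
  if (kryoSafe.contains(ct)) "kryo" else "default"
}

// pickSerializerName[Int]              -> "kryo"
// pickSerializerName[String]           -> "kryo"
// pickSerializerName[Map[String, Any]] -> "default"
{code}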



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13822) Follow-ups of DataFrame/Dataset API unification

2016-03-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13822.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Follow-ups of DataFrame/Dataset API unification
> ---
>
> Key: SPARK-13822
> URL: https://issues.apache.org/jira/browse/SPARK-13822
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 2.0.0
>
>
> This is an umbrella ticket for all follow-up work of DataFrame/Dataset API 
> unification (SPARK-13244).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13898) Merge DatasetHolder and DataFrameHolder

2016-03-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13898.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Merge DatasetHolder and DataFrameHolder
> ---
>
> Key: SPARK-13898
> URL: https://issues.apache.org/jira/browse/SPARK-13898
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Not 100% sure yet, but I think maybe they should just be a single class, and 
> most things in SQLImplicits should probably return Datasets of specific types 
> instead of DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13587) Support virtualenv in PySpark

2016-03-21 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-13587:
---
Issue Type: New Feature  (was: Improvement)

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14050) Add multiple languages support for Stop Words Remover

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14050:


Assignee: Apache Spark

> Add multiple languages support for Stop Words Remover
> -
>
> Key: SPARK-14050
> URL: https://issues.apache.org/jira/browse/SPARK-14050
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Burak KÖSE
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14050) Add multiple languages support for Stop Words Remover

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205399#comment-15205399
 ] 

Apache Spark commented on SPARK-14050:
--

User 'burakkose' has created a pull request for this issue:
https://github.com/apache/spark/pull/11871

> Add multiple languages support for Stop Words Remover
> -
>
> Key: SPARK-14050
> URL: https://issues.apache.org/jira/browse/SPARK-14050
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Burak KÖSE
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14050) Add multiple languages support for Stop Words Remover

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14050:


Assignee: (was: Apache Spark)

> Add multiple languages support for Stop Words Remover
> -
>
> Key: SPARK-14050
> URL: https://issues.apache.org/jira/browse/SPARK-14050
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Burak KÖSE
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13916) For whole stage codegen, measure and add the execution duration as a metric

2016-03-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13916.
-
   Resolution: Fixed
 Assignee: Nong Li
Fix Version/s: 2.0.0

> For whole stage codegen, measure and add the execution duration as a metric
> ---
>
> Key: SPARK-13916
> URL: https://issues.apache.org/jira/browse/SPARK-13916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14053) Merge absTol and relTol into one in MLlib tests

2016-03-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14053:
-

 Summary: Merge absTol and relTol into one in MLlib tests
 Key: SPARK-14053
 URL: https://issues.apache.org/jira/browse/SPARK-14053
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, Tests
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


We have absTol and relTol in MLlib tests to compare values with possible 
numerical differences. However, in most cases we should just use relTol. Many 
absTol are not used properly. See 
https://github.com/apache/spark/search?q=absTol. One corner case relTol doesn't 
handle is when the target value is 0. We can make the following change to 
relTol to solve the issue. Consider `x ~== y relTol t`.

1. If abs(y) > eps / t, test abs(y - x) / abs(y) < t,
2. else test abs(y - x) < eps

where eps is a reasonably small value, e.g., 1e-14. Note that the transition is 
smooth at abs(y) = eps / t.

cc [~dbtsai]
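
Written out as plain Scala, the proposed rule would look roughly like this
(illustrative, not the actual TestingUtils code):

{code}
// `x ~== y relTol t`: relative check when y is far enough from zero, absolute
// check against eps otherwise; the branches meet at abs(y) = eps / t.
def relTolEquals(x: Double, y: Double, t: Double, eps: Double = 1e-14): Boolean =
  if (math.abs(y) > eps / t) math.abs(y - x) / math.abs(y) < t
  else math.abs(y - x) < eps

// relTolEquals(1.0001, 1.0, 1e-3) -> true   (relative branch)
// relTolEquals(1e-15, 0.0, 1e-3)  -> true   (absolute branch handles y == 0)
// relTolEquals(1e-10, 0.0, 1e-3)  -> false
{code}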



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14053) Merge absTol and relTol into one in MLlib tests

2016-03-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14053:
--
Description: 
We have absTol and relTol in MLlib tests to compare values with possible 
numerical differences. However, in most cases we should just use relTol. Many 
absTol are not used properly. See 
https://github.com/apache/spark/search?q=absTol. One corner case relTol doesn't 
handle is when the target value is 0. We can make the following change to 
relTol to solve the issue. Consider `x ~== y relTol t`.

1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t,
2. else test abs(y - x) < eps

where eps is a reasonably small value, e.g., 1e-14. Note that the transition is 
smooth at abs( y ) = eps / t.

cc [~dbtsai]

  was:
We have absTol and relTol in MLlib tests to compare values with possible 
numerical differences. However, in most cases we should just use relTol. Many 
absTol are not used properly. See 
https://github.com/apache/spark/search?q=absTol. One corner case relTol doesn't 
handle is when the target value is 0. We can make the following change to 
relTol to solve the issue. Consider `x ~== y relTol t`.

1. If abs(y) > eps / t, test abs(y - x) / abs(y) < t,
2. else test abs(y - x) < eps

where eps is a reasonably small value, e.g., 1e-14. Note that the transition is 
smooth at abs(y) = eps / t.

cc [~dbtsai]


> Merge absTol and relTol into one in MLlib tests
> ---
>
> Key: SPARK-14053
> URL: https://issues.apache.org/jira/browse/SPARK-14053
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, Tests
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We have absTol and relTol in MLlib tests to compare values with possible 
> numerical differences. However, in most cases we should just use relTol. Many 
> absTol are not used properly. See 
> https://github.com/apache/spark/search?q=absTol. One corner case relTol 
> doesn't handle is when the target value is 0. We can make the following 
> change to relTol to solve the issue. Consider `x ~== y relTol t`.
> 1. If abs( y ) > eps / t, test abs(y - x) / abs( y ) < t,
> 2. else test abs(y - x) < eps
> where eps is a reasonably small value, e.g., 1e-14. Note that the transition 
> is smooth at abs( y ) = eps / t.
> cc [~dbtsai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14052) Build BytesToBytesMap in HashedRelation

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14052:


Assignee: Davies Liu  (was: Apache Spark)

> Build BytesToBytesMap in HashedRelation
> ---
>
> Key: SPARK-14052
> URL: https://issues.apache.org/jira/browse/SPARK-14052
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Currently, for keys that cannot fit within a long, we build a hash map for 
> UnsafeHashedRelation; it's converted to a BytesToBytesMap after serialization 
> and deserialization.
> We should build a BytesToBytesMap directly for better memory efficiency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14052) Build BytesToBytesMap in HashedRelation

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14052:


Assignee: Apache Spark  (was: Davies Liu)

> Build BytesToBytesMap in HashedRelation
> ---
>
> Key: SPARK-14052
> URL: https://issues.apache.org/jira/browse/SPARK-14052
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Currently, for keys that cannot fit within a long, we build a hash map for 
> UnsafeHashedRelation; it's converted to a BytesToBytesMap after serialization 
> and deserialization.
> We should build a BytesToBytesMap directly for better memory efficiency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14052) Build BytesToBytesMap in HashedRelation

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205376#comment-15205376
 ] 

Apache Spark commented on SPARK-14052:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11870

> Build BytesToBytesMap in HashedRelation
> ---
>
> Key: SPARK-14052
> URL: https://issues.apache.org/jira/browse/SPARK-14052
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Currently, for keys that cannot fit within a long, we build a hash map for 
> UnsafeHashedRelation; it's converted to a BytesToBytesMap after serialization 
> and deserialization.
> We should build a BytesToBytesMap directly for better memory efficiency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14016) Support high-precision decimals in vectorized parquet reader

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14016:


Assignee: (was: Apache Spark)

> Support high-precision decimals in vectorized parquet reader
> 
>
> Key: SPARK-14016
> URL: https://issues.apache.org/jira/browse/SPARK-14016
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14016) Support high-precision decimals in vectorized parquet reader

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14016:


Assignee: Apache Spark

> Support high-precision decimals in vectorized parquet reader
> 
>
> Key: SPARK-14016
> URL: https://issues.apache.org/jira/browse/SPARK-14016
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14016) Support high-precision decimals in vectorized parquet reader

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205363#comment-15205363
 ] 

Apache Spark commented on SPARK-14016:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/11869

> Support high-precision decimals in vectorized parquet reader
> 
>
> Key: SPARK-14016
> URL: https://issues.apache.org/jira/browse/SPARK-14016
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14051) Implement `Double.NaN==Float.NaN` in `row.equals` for consistency

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14051:


Assignee: (was: Apache Spark)

> Implement `Double.NaN==Float.NaN` in `row.equals` for consistency
> -
>
> Key: SPARK-14051
> URL: https://issues.apache.org/jira/browse/SPARK-14051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since SPARK-9079 and SPARK-9145, `NaN = NaN` returns true and works well. The 
> only exception case is direct comparison between  `Row(Float.NaN)` and 
> `Row(Double.NaN)`. The following is the example: the last expression should 
> be true for consistency.
> {code}
> scala> 
> Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp")
> scala> sql("select a,b,a=b from tmp").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true])
> scala> val row_a = sql("select a from tmp").collect()
> row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
> scala> val row_b = sql("select b from tmp").collect()
> row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
> scala> row_a(0) == row_b(0)
> res2: Boolean = true
> scala> row_a(1) == row_b(1)
> res3: Boolean = false
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14051) Implement `Double.NaN==Float.NaN` in `row.equals` for consistency

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205359#comment-15205359
 ] 

Apache Spark commented on SPARK-14051:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11868

> Implement `Double.NaN==Float.NaN` in `row.equals` for consistency
> -
>
> Key: SPARK-14051
> URL: https://issues.apache.org/jira/browse/SPARK-14051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since SPARK-9079 and SPARK-9145, `NaN = NaN` returns true and works well. The 
> only exception case is direct comparison between  `Row(Float.NaN)` and 
> `Row(Double.NaN)`. The following is the example: the last expression should 
> be true for consistency.
> {code}
> scala> 
> Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp")
> scala> sql("select a,b,a=b from tmp").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true])
> scala> val row_a = sql("select a from tmp").collect()
> row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
> scala> val row_b = sql("select b from tmp").collect()
> row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
> scala> row_a(0) == row_b(0)
> res2: Boolean = true
> scala> row_a(1) == row_b(1)
> res3: Boolean = false
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14051) Implement `Double.NaN==Float.NaN` in `row.equals` for consistency

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14051:


Assignee: Apache Spark

> Implement `Double.NaN==Float.NaN` in `row.equals` for consistency
> -
>
> Key: SPARK-14051
> URL: https://issues.apache.org/jira/browse/SPARK-14051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Since SPARK-9079 and SPARK-9145, `NaN = NaN` returns true and works well. The 
> only exception case is direct comparison between  `Row(Float.NaN)` and 
> `Row(Double.NaN)`. The following is the example: the last expression should 
> be true for consistency.
> {code}
> scala> 
> Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp")
> scala> sql("select a,b,a=b from tmp").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true])
> scala> val row_a = sql("select a from tmp").collect()
> row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
> scala> val row_b = sql("select b from tmp").collect()
> row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
> scala> row_a(0) == row_b(0)
> res2: Boolean = true
> scala> row_a(1) == row_b(1)
> res3: Boolean = false
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14052) Build BytesToBytesMap in HashedRelation

2016-03-21 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14052:
--

 Summary: Build BytesToBytesMap in HashedRelation
 Key: SPARK-14052
 URL: https://issues.apache.org/jira/browse/SPARK-14052
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Currently, for keys that cannot fit within a long, we build a hash map for 
UnsafeHashedRelation; it's converted to a BytesToBytesMap after serialization 
and deserialization.

We should build a BytesToBytesMap directly for better memory efficiency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14051) Implement `Double.NaN==Float.NaN` in `row.equals` for consistency

2016-03-21 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-14051:
-

 Summary: Implement `Double.NaN==Float.NaN` in `row.equals` for 
consistency
 Key: SPARK-14051
 URL: https://issues.apache.org/jira/browse/SPARK-14051
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Dongjoon Hyun
Priority: Minor


Since SPARK-9079 and SPARK-9145, `NaN = NaN` returns true and works well. The 
only exception case is direct comparison between  `Row(Float.NaN)` and 
`Row(Double.NaN)`. The following is the example: the last expression should be 
true for consistency.

{code}
scala> 
Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp")

scala> sql("select a,b,a=b from tmp").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true])

scala> val row_a = sql("select a from tmp").collect()
row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])

scala> val row_b = sql("select b from tmp").collect()
row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])

scala> row_a(0) == row_b(0)
res2: Boolean = true

scala> row_a(1) == row_b(1)
res3: Boolean = false
{code}
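
For illustration, a small stand-alone sketch (not Spark's actual Row.equals) of
a per-cell comparison with the consistent behaviour asked for here, i.e.
Double.NaN and Float.NaN compare as equal:

{code}
// Illustrative cell comparison: any NaN equals any other NaN, across Float/Double.
def cellEquals(a: Any, b: Any): Boolean = (a, b) match {
  case (x: Double, y: Double) => (x.isNaN && y.isNaN) || x == y
  case (x: Float,  y: Float)  => (x.isNaN && y.isNaN) || x == y
  case (x: Double, y: Float)  => (x.isNaN && y.isNaN) || x == y.toDouble
  case (x: Float,  y: Double) => (x.isNaN && y.isNaN) || x.toDouble == y
  case _                      => a == b
}

// cellEquals(Double.NaN, Float.NaN) -> true  (the case that currently fails)
// cellEquals(1.0d, 1.0f)            -> true
// cellEquals(Double.NaN, 1.0f)      -> false
{code}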




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14050) Add multiple languages support for Stop Words Remover

2016-03-21 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205347#comment-15205347
 ] 

Burak KÖSE commented on SPARK-14050:


I am working on this, using nltk's words list.

> Add multiple languages support for Stop Words Remover
> -
>
> Key: SPARK-14050
> URL: https://issues.apache.org/jira/browse/SPARK-14050
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Burak KÖSE
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14050) Add multiple languages support for Stop Words Remover

2016-03-21 Thread JIRA
Burak KÖSE created SPARK-14050:
--

 Summary: Add multiple languages support for Stop Words Remover
 Key: SPARK-14050
 URL: https://issues.apache.org/jira/browse/SPARK-14050
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Burak KÖSE






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13806) SQL round() produces incorrect results for negative values

2016-03-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-13806:
--

Assignee: Davies Liu

> SQL round() produces incorrect results for negative values
> --
>
> Key: SPARK-13806
> URL: https://issues.apache.org/jira/browse/SPARK-13806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Mark Hamstra
>Assignee: Davies Liu
>
> Round in catalyst/expressions/mathExpressions.scala appears to be untested 
> with negative values, and it doesn't handle them correctly.
> There are at least two issues here:
> First, in the genCode for FloatType and DoubleType with _scale == 0, round() 
> will not produce the same results as for the BigDecimal.ROUND_HALF_UP 
> strategy used in all other cases.  This is because Math.round is used for 
> these _scale == 0 cases.  For example, Math.round(-3.5) is -3, while 
> BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. 
> Even after this bug is fixed with something like...
> {code}
> if (${ce.value} < 0) {
>   ${ev.value} = -1 * Math.round(-1 * ${ce.value});
> } else {
>   ${ev.value} = Math.round(${ce.value});
> }
> {code}
> ...which will allow an additional test like this to succeed in 
> MathFunctionsSuite.scala:
> {code}
> checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow)
> {code}
> ...there still appears to be a problem on at least the 
> checkEvalutionWithUnsafeProjection path, where failures like this are 
> produced:
> {code}
> Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
> [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
> {code} 
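
The scale == 0 discrepancy is easy to reproduce with plain JDK calls,
independent of Spark (a quick sanity check in the REPL, not the catalyst code):

{code}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Math.round rounds .5 cases toward positive infinity, while HALF_UP rounds
// them away from zero, so the two disagree for negative half values.
val viaMathRound = Math.round(-3.5)                                           // -3
val viaHalfUp    = new JBigDecimal("-3.5").setScale(0, RoundingMode.HALF_UP)  // -4

println(s"Math.round(-3.5) = $viaMathRound, BigDecimal HALF_UP(-3.5) = $viaHalfUp")
{code}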



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13019) Replace example code in mllib-statistics.md using include_example

2016-03-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13019.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11108
[https://github.com/apache/spark/pull/11108]

> Replace example code in mllib-statistics.md using include_example
> -
>
> Key: SPARK-13019
> URL: https://issues.apache.org/jira/browse/SPARK-13019
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> Goal is to move actual example code to spark/examples and test compilation in 
> Jenkins builds. Then in the markdown, we can reference part of the code to 
> show in the user guide. This requires adding a Jekyll tag that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala`
>  and pick code blocks marked "example" and replace code block in 
> {code}{% highlight %}{code}
>  in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete

2016-03-21 Thread Vincent Ohprecio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205300#comment-15205300
 ] 

Vincent Ohprecio edited comment on SPARK-14031 at 3/21/16 10:40 PM:


GC accounts for less than 0.3-1.5% of CPU time.

Here is the sampler report for CPU:
com.univocity.parsers.common.input.DefaultCharAppender.<init>() ... 64%
io.netty.channel.nio.NioEventLoop.select() ... 21%
org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect() ... 10%
org.apache.spark.sql.execution.datasources.csv.LineCsvWriter.writeRow() ... 4%

with a stack trace after detaching VisualVM:
https://gist.github.com/bigsnarfdude/9f15fd55da3a6d85582a


was (Author: vohprecio):
GC accounts for less than 0.3-1.5% of CPU time.

Here is the sampler report for CPU:
com.univocity.parsers.common.input.DefaultCharAppender.<init>() ... 64%
io.netty.channel.nio.NioEventLoop.select() ... 21%
org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect() ... 10%
org.apache.spark.sql.execution.datasources.csv.LineCsvWriter.writeRow() ... 4%



> Dataframe to csv IO, system performance enters high CPU state and write 
> operation takes 1 hour to complete
> --
>
> Key: SPARK-14031
> URL: https://issues.apache.org/jira/browse/SPARK-14031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 
> -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
>Reporter: Vincent Ohprecio
>Priority: Minor
> Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell trying to write out results of 
> dataframe to csv, system performance enters high CPU state and write 
> operation takes 1 hour to complete. 
> * Affecting: [Stage 5:>  (0 + 2) / 21]
> * Stage 5 elapsed time 348827227ns
> In comparison, tests were conducted using 1.4, 1.5, and 1.6 with the same 
> code/data, and Stage 5 csv write times were between 2 - 22 seconds. 
> In addition, Parquet (Stage 3) write tests on 1.4, 1.5, 1.6 and 2.0 were 
> similar, between 2 - 22 seconds.
> Files 
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time 
> * MACOSX 10.11.2
> * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB 
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the csv write but didn't wait for it to complete 
> * Ubuntu14.04
> * 4cores 8gb
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete

2016-03-21 Thread Vincent Ohprecio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205300#comment-15205300
 ] 

Vincent Ohprecio commented on SPARK-14031:
--

GC accounts for less than 0.3-1.5% of CPU time.

Here is the hotspot report for CPU:
com.univocity.parsers.common.input.DefaultCharAppender.<init>() ... 64%
io.netty.channel.nio.NioEventLoop.select() ... 21%
org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect() ... 10%
org.apache.spark.sql.execution.datasources.csv.LineCsvWriter.writeRow() ... 4%



> Dataframe to csv IO, system performance enters high CPU state and write 
> operation takes 1 hour to complete
> --
>
> Key: SPARK-14031
> URL: https://issues.apache.org/jira/browse/SPARK-14031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 
> -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
>Reporter: Vincent Ohprecio
>Priority: Minor
> Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell trying to write out results of 
> dataframe to csv, system performance enters high CPU state and write 
> operation takes 1 hour to complete. 
> * Affecting: [Stage 5:>  (0 + 2) / 21]
> * Stage 5 elapsed time 348827227ns
> In comparison, tests were conducted using 1.4, 1.5, and 1.6 with the same 
> code/data, and Stage 5 csv write times were between 2 - 22 seconds. 
> In addition, Parquet (Stage 3) write tests on 1.4, 1.5, 1.6 and 2.0 were 
> similar, between 2 - 22 seconds.
> Files 
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time 
> * MACOSX 10.11.2
> * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB 
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the csv write but didn't wait for it to complete 
> * Ubuntu14.04
> * 4cores 8gb
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete

2016-03-21 Thread Vincent Ohprecio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205300#comment-15205300
 ] 

Vincent Ohprecio edited comment on SPARK-14031 at 3/21/16 10:39 PM:


GC accounts for less than 0.3-1.5% of CPU time.

Here is the sampler report for CPU:
com.univocity.parsers.common.input.DefaultCharAppender.<init>() ... 64%
io.netty.channel.nio.NioEventLoop.select() ... 21%
org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect() ... 10%
org.apache.spark.sql.execution.datasources.csv.LineCsvWriter.writeRow() ... 4%




was (Author: vohprecio):
GC accounts for less than 0.3-1.5% of CPU time.

Here is the hotspot report for CPU:
com.univocity.parsers.common.input.DefaultCharAppender.<init>() ... 64%
io.netty.channel.nio.NioEventLoop.select() ... 21%
org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect() ... 10%
org.apache.spark.sql.execution.datasources.csv.LineCsvWriter.writeRow() ... 4%



> Dataframe to csv IO, system performance enters high CPU state and write 
> operation takes 1 hour to complete
> --
>
> Key: SPARK-14031
> URL: https://issues.apache.org/jira/browse/SPARK-14031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 
> -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
>Reporter: Vincent Ohprecio
>Priority: Minor
> Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell trying to write out results of 
> dataframe to csv, system performance enters high CPU state and write 
> operation takes 1 hour to complete. 
> * Affecting: [Stage 5:>  (0 + 2) / 21]
> * Stage 5 elapsed time 348827227ns
> In comparison, tests were conducted using 1.4, 1.5, 1.6 with the same code/data 
> and Stage 5 csv write times were between 2 - 22 seconds. 
> In addition, Parquet (Stage 3) write times for 1.4, 1.5, 1.6 and 2.0 were 
> similar, between 2 - 22 seconds.
> Files 
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time 
> * MACOSX 10.11.2
> * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB 
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the csv write but didn't wait for it to complete 
> * Ubuntu14.04
> * 4cores 8gb
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14049) Add functionality in spark history server API to query applications by end time

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14049:


Assignee: (was: Apache Spark)

> Add functionality in spark history server API to query applications by end 
> time 
> ---
>
> Key: SPARK-14049
> URL: https://issues.apache.org/jira/browse/SPARK-14049
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Parag Chaudhari
>
> Currently, the Spark history server provides functionality to query applications 
> by application start time range, based on the minDate and maxDate query 
> parameters, but it lacks support for querying applications by their end time. In 
> this JIRA we propose adding optional minEndDate and maxEndDate query parameters, 
> and filtering capability based on them, to the Spark history server. This 
> functionality can be used for the following queries:
> 1. Applications finished in the last 'x' minutes
> 2. Applications finished before 'y' time
> 3. Applications finished between 'x' time and 'y' time
> 4. Applications started from 'x' time and finished before 'y' time.
> For backward compatibility, we can keep the existing minDate and maxDate query 
> parameters as they are, and they can continue to support filtering based on the 
> start time range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14049) Add functionality in spark history server API to query applications by end time

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14049:


Assignee: Apache Spark

> Add functionality in spark history server API to query applications by end 
> time 
> ---
>
> Key: SPARK-14049
> URL: https://issues.apache.org/jira/browse/SPARK-14049
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Parag Chaudhari
>Assignee: Apache Spark
>
> Currently, the Spark history server provides functionality to query applications 
> by application start time range, based on the minDate and maxDate query 
> parameters, but it lacks support for querying applications by their end time. In 
> this JIRA we propose adding optional minEndDate and maxEndDate query parameters, 
> and filtering capability based on them, to the Spark history server. This 
> functionality can be used for the following queries:
> 1. Applications finished in the last 'x' minutes
> 2. Applications finished before 'y' time
> 3. Applications finished between 'x' time and 'y' time
> 4. Applications started from 'x' time and finished before 'y' time.
> For backward compatibility, we can keep the existing minDate and maxDate query 
> parameters as they are, and they can continue to support filtering based on the 
> start time range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14049) Add functionality in spark history server API to query applications by end time

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205287#comment-15205287
 ] 

Apache Spark commented on SPARK-14049:
--

User 'paragpc' has created a pull request for this issue:
https://github.com/apache/spark/pull/11867

> Add functionality in spark history server API to query applications by end 
> time 
> ---
>
> Key: SPARK-14049
> URL: https://issues.apache.org/jira/browse/SPARK-14049
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Parag Chaudhari
>
> Currently, the Spark history server provides functionality to query applications 
> by application start time range, based on the minDate and maxDate query 
> parameters, but it lacks support for querying applications by their end time. In 
> this JIRA we propose adding optional minEndDate and maxEndDate query parameters, 
> and filtering capability based on them, to the Spark history server. This 
> functionality can be used for the following queries:
> 1. Applications finished in the last 'x' minutes
> 2. Applications finished before 'y' time
> 3. Applications finished between 'x' time and 'y' time
> 4. Applications started from 'x' time and finished before 'y' time.
> For backward compatibility, we can keep the existing minDate and maxDate query 
> parameters as they are, and they can continue to support filtering based on the 
> start time range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10433) Gradient boosted trees: increasing input size in 1.4

2016-03-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10433.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

I'm closing this since it seems to have been fixed in 1.5, but please say if it 
has occurred again after that.

> Gradient boosted trees: increasing input size in 1.4
> 
>
> Key: SPARK-10433
> URL: https://issues.apache.org/jira/browse/SPARK-10433
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1
>Reporter: Sean Owen
> Fix For: 1.5.0
>
>
> (Sorry to say I don't have any leads on a fix, but this was reported by three 
> different people and I confirmed it at fairly close range, so think it's 
> legitimate:)
> This is probably best explained in the words from the mailing list thread at 
> http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E
>  . Matt Forbes says:
> {quote}
> I am training a boosted trees model on a couple million input samples (with 
> around 300 features) and am noticing that the input size of each stage is 
> increasing each iteration. For each new tree, the first step seems to be 
> building the decision tree metadata, which does a .count() on the input data, 
> so this is the step I've been using to track the input size changing. Here is 
> what I'm seeing: 
> {quote}
> {code}
> count at DecisionTreeMetadata.scala:111 
> 1. Input Size / Records: 726.1 MB / 1295620 
> 2. Input Size / Records: 106.9 GB / 64780816 
> 3. Input Size / Records: 160.3 GB / 97171224 
> 4. Input Size / Records: 214.8 GB / 129680959 
> 5. Input Size / Records: 268.5 GB / 162533424 
>  
> Input Size / Records: 1912.6 GB / 1382017686 
>  
> {code}
> {quote}
> This step goes from taking less than 10s up to 5 minutes by the 15th or so 
> iteration. I'm not quite sure what could be causing this. I am passing a 
> memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train 
> {quote}
> Johannes Bauer showed me a very similar problem.
> Peter Rudenko offers this sketch of a reproduction:
> {code}
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   
> mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer,
>  java.lang.Integer]])
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14049) Add functionality in spark history server API to query applications by end time

2016-03-21 Thread Parag Chaudhari (JIRA)
Parag Chaudhari created SPARK-14049:
---

 Summary: Add functionality in spark history server API to query 
applications by end time 
 Key: SPARK-14049
 URL: https://issues.apache.org/jira/browse/SPARK-14049
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.1, 2.0.0
Reporter: Parag Chaudhari


Currently, the Spark history server provides functionality to query applications by 
application start time range, based on the minDate and maxDate query parameters, but 
it lacks support for querying applications by their end time. In this JIRA we propose 
adding optional minEndDate and maxEndDate query parameters, and filtering capability 
based on them, to the Spark history server. This functionality can be used for the 
following queries:

1. Applications finished in the last 'x' minutes
2. Applications finished before 'y' time
3. Applications finished between 'x' time and 'y' time
4. Applications started from 'x' time and finished before 'y' time.

For backward compatibility, we can keep the existing minDate and maxDate query 
parameters as they are, and they can continue to support filtering based on the 
start time range.
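
To make the proposal concrete, here is a hypothetical sketch of what such queries might look like against the history server's REST API. The minEndDate/maxEndDate parameters do not exist yet (they are what this JIRA proposes); the host, port, and date formats below simply follow the existing minDate/maxDate convention.

{code}
// Hypothetical example URLs; minEndDate/maxEndDate are only proposed here.
val base = "http://historyserver:18080/api/v1/applications"

// 3. Applications finished between 'x' time and 'y' time:
val finishedInWindow =
  s"$base?status=completed&minEndDate=2016-03-21T00:00:00.000GMT&maxEndDate=2016-03-21T12:00:00.000GMT"

// 4. Applications started from 'x' time and finished before 'y' time,
//    combining the existing minDate with the proposed maxEndDate:
val startedAndFinished = s"$base?minDate=2016-03-20&maxEndDate=2016-03-21"

// e.g. fetch the JSON list of matching applications:
scala.io.Source.fromURL(finishedInWindow).mkString
{code}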



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2016-03-21 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205279#comment-15205279
 ] 

Thomas Graves commented on SPARK-1239:
--

I do like the idea of broadcast. When I originally tried it I hit the issue 
mentioned in the second bullet point, but as long as we synchronize on the requests 
so that we only broadcast once, we should be OK.
It does seem to have some further constraints, though. With a sufficiently large 
job I don't think it matters, but what if we only have a small number of reducers? 
We would broadcast to all executors when only a couple need it. I guess that 
doesn't hurt much unless the other executors start going to the executors your 
reducers are on and add more load to them; that should be pretty minimal, though.
Broadcast also seems to make less sense when using dynamic allocation. At least 
I've seen issues when executors go away: the fetch from that one fails, has to 
retry, etc., adding additional time. We recently fixed one issue with this to make 
it go get locations again after a certain number of failures. That time should be 
less now that we fixed that, but I'll have to run the numbers.

I'll do some more analysis/testing of this and see whether that really matters.

With a sufficient number of threads I don't think a few slow nodes would make much 
of a difference here; if you have that many slow nodes, the shuffle itself is going 
to be impacted, which I would see as a larger effect. The slow nodes could just as 
well affect the broadcast. Hopefully you skip those since it takes longer for them 
to get a chunk, but it's possible that once a slow node has a chunk or two, more 
and more executors start going to it for the broadcast data instead of the driver, 
thus slowing down more transfers.

But it's a good point: my current method would truly block (for a certain time) 
rather than just being slow. Note that there is a timeout on waiting for the send 
to happen, and when it fires it closes the connection and the executor will retry. 
You don't have to worry about that with broadcast.

I'll do some more analysis with that approach.

I wish Netty had some other built-in mechanisms for flow control.
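
As a rough illustration of the pattern under discussion (this is not the actual MapOutputTracker code; the types and sizes below are made up for the sketch), the idea is to serialize the statuses once on the driver and let executors pull them through the broadcast mechanism instead of all requesting the full blob from the driver:

{code}
// Illustrative sketch only: broadcast a large lookup structure once so each
// executor fetches it via torrent broadcast rather than from the driver.
case class FakeMapStatus(mapId: Int, location: String, sizes: Array[Long])

val statuses: Array[FakeMapStatus] = Array.tabulate(1000) { i =>
  FakeMapStatus(i, s"host-${i % 50}", Array.fill(200)(1024L))
}

// Done once (and synchronized) on the driver:
val statusesBc = sc.broadcast(statuses)

// Tasks read from the broadcast value instead of asking the driver for the
// whole serialized array:
val reduceSideSizes = sc.parallelize(0 until 200, 200).map { reduceId =>
  statusesBc.value.map(_.sizes(reduceId)).sum
}.collect()
{code}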

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>Assignee: Thomas Graves
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6362) Broken pipe error when training a RandomForest on a union of two RDDs

2016-03-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-6362.

   Resolution: Fixed
Fix Version/s: 1.3.0

I'm going to close this since it appears to be fixed (based on running it 
locally just now on master).

> Broken pipe error when training a RandomForest on a union of two RDDs
> -
>
> Key: SPARK-6362
> URL: https://issues.apache.org/jira/browse/SPARK-6362
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
> Environment: Kubuntu 14.04, local driver
>Reporter: Pavel Laskov
>Priority: Minor
> Fix For: 1.3.0
>
>
> Training a RandomForest classifier on a dataset obtained as a union of two 
> RDDs throws a broken pipe error:
> Traceback (most recent call last):
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 162, in 
> manager
> code = worker(sock)
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 64, in 
> worker
> outfile.flush()
> IOError: [Errno 32] Broken pipe
> Despite an error the job runs to completion. 
> The following code reproduces the error:
> from pyspark.context import SparkContext
> from pyspark.mllib.rand import RandomRDDs
> from pyspark.mllib.tree import RandomForest
> from pyspark.mllib.linalg import DenseVector
> from pyspark.mllib.regression import LabeledPoint
> import random
> if __name__ == "__main__":
> sc = SparkContext(appName="Union bug test")
> data1 = RandomRDDs.normalVectorRDD(sc,numRows=1,numCols=200)
> data1 = data1.map(lambda x: LabeledPoint(random.randint(0,1),\
>  DenseVector(x)))
> data2 = RandomRDDs.normalVectorRDD(sc,numRows=1,numCols=200)
> data2 = data2.map(lambda x: LabeledPoint(random.randint(0,1),\
> DenseVector(x)))
> training_data = data1.union(data2)
> #training_data = training_data.repartition(2)
> model = RandomForest.trainClassifier(training_data, numClasses=2,
>  categoricalFeaturesInfo={},
>  numTrees=50, maxDepth=30)
> Interestingly, re-partitioning the data after the union operation rectifies 
> the problem (uncomment the line before training in the code above). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-03-21 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-14048:
---

 Summary: Aggregation operations on structs fail when the structs 
have fields with special characters
 Key: SPARK-14048
 URL: https://issues.apache.org/jira/browse/SPARK-14048
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
 Environment: Databricks w/ 1.6.0
Reporter: Simeon Simeonov


Consider a schema where a struct has field names with special characters, e.g.,

{code}
 |-- st: struct (nullable = true)
 ||-- x.y: long (nullable = true)
{code}

Schemas such as these are frequently generated by the JSON schema generator, 
which seems never to map JSON data to {{MapType}}, always preferring to use 
{{StructType}}. 

In SparkSQL, referring to these fields requires backticks, e.g., {{st.`x.y`}}. 
There is no problem manipulating these structs unless one is using an 
aggregation function. It seems that, under the covers, the code is not escaping 
fields with special characters correctly.

For example, 

{code}
select first(st) as st from tbl group by something
{code}

generates

{code}
org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
struct. If you have a struct and a field name of it has any special 
characters, please use backticks (`) to quote that field name, e.g. `x+y`. 
Please note that backtick itself is not supported in a field name.
  at 
org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
  at 
org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
  at 
org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
  at 
org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
  at 
com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
  at 
com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
  at 
com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
  at 
com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
  at scala.util.Try$.apply(Try.scala:161)
  at 
com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
  at 
com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
  at 
com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
  at java.lang.Thread.run(Thread.java:745)
{code}
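
A small repro sketch along the lines of the description, runnable in a 1.6 spark-shell. This is an assumption of the setup rather than the reporter's actual code; the JSON literal and table name are illustrative, and the stack trace above suggests the failure may be specific to the environment described.

{code}
// Illustrative sketch: a struct column whose field name contains a dot.
val rdd = sc.parallelize(Seq("""{"something": "a", "st": {"x.y": 1}}"""))
val df = sqlContext.read.json(rdd)
df.registerTempTable("tbl")

// Referring to the field with backticks works:
sqlContext.sql("select st.`x.y` from tbl").show()

// ...while aggregating the whole struct is the case reported to fail:
sqlContext.sql("select first(st) as st from tbl group by something").show()
{code}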



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4607) Add random seed to GradientBoostedTrees

2016-03-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4607:
-
Affects Version/s: (was: 1.2.0)

> Add random seed to GradientBoostedTrees
> ---
>
> Key: SPARK-4607
> URL: https://issues.apache.org/jira/browse/SPARK-4607
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Gradient Boosted Trees does not take a random seed, but it uses randomness if 
> the subsampling rate is < 1.  It should take a random seed parameter.
> This update will also help to make unit tests more stable by allowing 
> determinism (using a small set of fixed random seeds).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4607) Add random seed to GBTClassifier, GBTRegressor

2016-03-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4607:
-
Summary: Add random seed to GBTClassifier, GBTRegressor  (was: Add random 
seed to GradientBoostedTrees)

> Add random seed to GBTClassifier, GBTRegressor
> --
>
> Key: SPARK-4607
> URL: https://issues.apache.org/jira/browse/SPARK-4607
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Gradient Boosted Trees does not take a random seed, but it uses randomness if 
> the subsampling rate is < 1.  It should take a random seed parameter.
> This update will also help to make unit tests more stable by allowing 
> determinism (using a small set of fixed random seeds).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4607) Add random seed to GradientBoostedTrees

2016-03-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4607:
-
Component/s: (was: MLlib)
 ML

> Add random seed to GradientBoostedTrees
> ---
>
> Key: SPARK-4607
> URL: https://issues.apache.org/jira/browse/SPARK-4607
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Gradient Boosted Trees does not take a random seed, but it uses randomness if 
> the subsampling rate is < 1.  It should take a random seed parameter.
> This update will also help to make unit tests more stable by allowing 
> determinism (using a small set of fixed random seeds).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4607) Add random seed to GradientBoostedTrees

2016-03-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4607:
-
Target Version/s: 2.0.0

> Add random seed to GradientBoostedTrees
> ---
>
> Key: SPARK-4607
> URL: https://issues.apache.org/jira/browse/SPARK-4607
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Gradient Boosted Trees does not take a random seed, but it uses randomness if 
> the subsampling rate is < 1.  It should take a random seed parameter.
> This update will also help to make unit tests more stable by allowing 
> determinism (using a small set of fixed random seeds).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14047) GBT improvement umbrella

2016-03-21 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14047:
-

 Summary: GBT improvement umbrella
 Key: SPARK-14047
 URL: https://issues.apache.org/jira/browse/SPARK-14047
 Project: Spark
  Issue Type: Umbrella
  Components: ML
Reporter: Joseph K. Bradley


This is an umbrella for improvements to learning Gradient Boosted Trees: 
GBTClassifier, GBTRegressor.

Note: Aspects of GBTs which are related to individual trees should be listed 
under [SPARK-14045].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14046) RandomForest improvement umbrella

2016-03-21 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14046:
-

 Summary: RandomForest improvement umbrella
 Key: SPARK-14046
 URL: https://issues.apache.org/jira/browse/SPARK-14046
 Project: Spark
  Issue Type: Umbrella
  Components: ML
Reporter: Joseph K. Bradley


This is an umbrella for improvements to learning Random Forests.

Note: Aspects of RFs which are related to individual trees should be listed 
under [SPARK-14045].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14046) RandomForest improvement umbrella

2016-03-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14046:
--
Description: 
This is an umbrella for improvements to learning Random Forests: 
RandomForestClassifier, RandomForestRegressor.

Note: Aspects of RFs which are related to individual trees should be listed 
under [SPARK-14045].

  was:
This is an umbrella for improvements to learning Random Forests.

Note: Aspects of RFs which are related to individual trees should be listed 
under [SPARK-14045].


> RandomForest improvement umbrella
> -
>
> Key: SPARK-14046
> URL: https://issues.apache.org/jira/browse/SPARK-14046
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is an umbrella for improvements to learning Random Forests: 
> RandomForestClassifier, RandomForestRegressor.
> Note: Aspects of RFs which are related to individual trees should be listed 
> under [SPARK-14045].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14045) DecisionTree improvement umbrella

2016-03-21 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14045:
-

 Summary: DecisionTree improvement umbrella
 Key: SPARK-14045
 URL: https://issues.apache.org/jira/browse/SPARK-14045
 Project: Spark
  Issue Type: Umbrella
  Components: ML
Reporter: Joseph K. Bradley


This is an umbrella for improvements to decision tree learning.  This includes:
* DecisionTreeClassifier
* DecisionTreeRegressor
* aspects of tree ensembles specific to learning individual trees, i.e., issues 
which will also affect DecisionTreeClassifier/Regressor




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree

2016-03-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205243#comment-15205243
 ] 

Joseph K. Bradley commented on SPARK-3159:
--

Sorry for the slow reply.  There are several like that.  I'll try to check 
through them and link them under an umbrella, to help drive a bit more 
attention to them.

> Check for reducible DecisionTree
> 
>
> Key: SPARK-3159
> URL: https://issues.apache.org/jira/browse/SPARK-3159
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Improvement: test-time computation
> Currently, pairs of leaf nodes with the same parent can both output the same 
> prediction.  This happens since the splitting criterion (e.g., Gini) is not 
> the same as prediction accuracy/MSE; the splitting criterion can sometimes be 
> improved even when both children would still output the same prediction 
> (e.g., based on the majority label for classification).
> We could check the tree and reduce it if possible after training.
> Note: This happens with scikit-learn as well.
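
For illustration, a sketch of the reduction check on a toy tree type (the classes below are made up for the sketch, not Spark's internal Node representation): if both children of a node are leaves with the same prediction, the parent can be collapsed into a single leaf.

{code}
// Illustrative sketch only, using a hand-rolled tree type.
sealed trait TreeNode
case class Leaf(prediction: Double) extends TreeNode
case class Internal(left: TreeNode, right: TreeNode,
                    featureIndex: Int, threshold: Double) extends TreeNode

def reduce(node: TreeNode): TreeNode = node match {
  case leaf: Leaf => leaf
  case Internal(l, r, f, t) =>
    (reduce(l), reduce(r)) match {
      // Both children predict the same value: the split is useless at test time.
      case (Leaf(p1), Leaf(p2)) if p1 == p2 => Leaf(p1)
      case (rl, rr) => Internal(rl, rr, f, t)
    }
}

// The right subtree collapses into a single leaf:
val tree = Internal(Leaf(0.0), Internal(Leaf(1.0), Leaf(1.0), 3, 0.5), 1, 2.0)
val reduced = reduce(tree)  // Internal(Leaf(0.0), Leaf(1.0), 1, 2.0)
{code}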



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11507) Error thrown when using BlockMatrix.add

2016-03-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205242#comment-15205242
 ] 

Joseph K. Bradley commented on SPARK-11507:
---

Good to hear!  I am wondering though if it was a mistake to close your original 
PR (since the Breeze fix won't be put into Spark that quickly).  What do you 
think about re-opening your PR to get the bug fix into 2.0 and a few backports?

> Error thrown when using BlockMatrix.add
> ---
>
> Key: SPARK-11507
> URL: https://issues.apache.org/jira/browse/SPARK-11507
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.5.0
> Environment: Mac/local machine, EC2
> Scala
>Reporter: Kareem Alhazred
>Priority: Minor
>
> In certain situations when adding two block matrices, I get an error 
> regarding colPtr and the operation fails.  External issue URL includes full 
> error and code for reproducing the problem.
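
For context, a minimal sketch of the API shape involved, in spark-shell. The matrices below are tiny and illustrative; the actual failing input and full error are in the external issue URL.

{code}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Two small block matrices with the same dimensions and block sizes.
val a = new CoordinateMatrix(sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)))).toBlockMatrix(2, 2)
val b = new CoordinateMatrix(sc.parallelize(Seq(
  MatrixEntry(0, 1, 3.0), MatrixEntry(1, 0, 4.0)))).toBlockMatrix(2, 2)

// add() is the call reported to fail with a colPtr error in certain
// situations; this tiny case only shows the call pattern.
val c = a.add(b)
c.toLocalMatrix()
{code}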



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step

2016-03-21 Thread Bob Tiernay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Tiernay updated SPARK-14044:

Description: 
It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer#writeRows}} at runtime:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. This 
{{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't 
even get Spark to succeed due to the sort step and data skew in the partitions. 
In general, this would make more efficient use of cluster resources.

For this to work, there needs to be a way to communicate between the 
{{groupBy}} and the {{partitionBy}} by way of some runtime configuration. This 
is very similar in function to Hive's {{hive.optimize.sort.dynamic.partition}} 
parameter.

  was:
It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer#writeRows}} at runtime:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. This 
{{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't 
even get Spark to succeed due to the sort step and data skew in the partitions. 
In general, this would make more efficient use of cluster resources.

For this to work, there needs to be a way to communicate between the 
{{groupBy}} and the {{partitionBy}} by way of some runtime configuration. This 
is very similar in function to Hive's {{hive.enforce.bucketing}} parameter.


> Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass 
> sort step
> 
>
> Key: SPARK-14044
> URL: https://issues.apache.org/jira/browse/SPARK-14044
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Bob Tiernay
>
> It would be very useful to allow the disabling of this block of code within 
> {{DynamicPartitionWriterContainer#writeRows}} at runtime:
> https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418
> The use case is that an upstream {{groupBy}} has already sorted a great many 
> fine grained groups which are the target of the {{partitionBy}}. This 
> {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't 
> even get Spark to succeed due to the sort step and data skew in the 
> partitions. In general, this would make more efficient use of cluster 
> resources.
> For this to work, there needs to be a way to communicate between the 
> {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. 
> This is very similar in function to Hive's 
> {{hive.optimize.sort.dynamic.partition}} parameter.
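
To make the use case concrete, a hedged sketch of the pattern in which the aggregation keys and the partition columns coincide. The column names, input DataFrame {{df}}, output path, and format are assumptions, not taken from the reporter's job.

{code}
import org.apache.spark.sql.functions.sum

// The upstream aggregation already groups the data by the same keys that are
// used below in partitionBy, so the extra per-partition sort performed inside
// DynamicPartitionWriterContainer#writeRows is redundant for this workload.
val aggregated = df
  .groupBy("year", "month")
  .agg(sum("amount").as("total"))

aggregated.write
  .partitionBy("year", "month")
  .parquet("/tmp/output")   // hypothetical output path
{code}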



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step

2016-03-21 Thread Bob Tiernay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Tiernay updated SPARK-14044:

Description: 
It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer#writeRows}} at runtime:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. This 
{{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't 
even get Spark to succeed due to the sort step and data skew in the partitions. 
In general, this would make more efficient use of cluster resources.

For this to work, there needs to be a way to communicate between the 
{{groupBy}} and the {{partitionBy}} by way of some runtime configuration. This 
is very similar in function to Hive's {{hive.enforce.bucketing}} parameter.

  was:
It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer#writeRows}} at runtime:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. This 
{{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't 
even get Spark to succeed due to the sort step and data skew in the partitions. 
In general, this would make more efficient use of cluster resources.

For this to work, there needs to be a way to communicate between the 
{{groupBy}} and the {{partitionBy}} by way of some runtime configuration.


> Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass 
> sort step
> 
>
> Key: SPARK-14044
> URL: https://issues.apache.org/jira/browse/SPARK-14044
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Bob Tiernay
>
> It would be very useful to allow the disabling of this block of code within 
> {{DynamicPartitionWriterContainer#writeRows}} at runtime:
> https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418
> The use case is that an upstream {{groupBy}} has already sorted a great many 
> fine grained groups which are the target of the {{partitionBy}}. This 
> {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't 
> even get Spark to succeed due to the sort step and data skew in the 
> partitions. In general, this would make more efficient use of cluster 
> resources.
> For this to work, there needs to be a way to communicate between the 
> {{groupBy}} and the {{partitionBy}} by way of some runtime configuration. 
> This is very similar in function to Hive's {{hive.enforce.bucketing}} 
> parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13805) Direct consume ColumnVector in generated code when ColumnarBatch is used

2016-03-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13805.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11636
[https://github.com/apache/spark/pull/11636]

> Direct consume ColumnVector in generated code when ColumnarBatch is used
> 
>
> Key: SPARK-13805
> URL: https://issues.apache.org/jira/browse/SPARK-13805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
> Fix For: 2.0.0
>
>
> When generated code accesses a {{ColumnarBatch}} object, it is possible to 
> get values of each column from {{ColumnVector}} instead of calling 
> {{getRow()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step

2016-03-21 Thread Bob Tiernay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Tiernay updated SPARK-14044:

Description: 
It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer#writeRows}} at runtime:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. This 
{{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't 
even get Spark to succeed due to the sort step and data skew in the partitions. 
In general, this would make more efficient use of cluster resources.

For this to work, there needs to be a way to communicate between the 
{{groupBy}} and the {{partitionBy}} by way of some runtime configuration.

  was:
It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer#writeRows}} at runtime:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. Currently, we 
can't even get Spark to succeed due to the sort step and data skew in the 
partitions. In general, this would make more efficient use of cluster resources.

For this to work, there needs to be a way to communicate between the 
{{groupBy}} and the {{partitionBy}} by way of some runtime configuration.


> Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass 
> sort step
> 
>
> Key: SPARK-14044
> URL: https://issues.apache.org/jira/browse/SPARK-14044
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Bob Tiernay
>
> It would be very useful to allow the disabling of this block of code within 
> {{DynamicPartitionWriterContainer#writeRows}} at runtime:
> https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418
> The use case is that an upstream {{groupBy}} has already sorted a great many 
> fine grained groups which are the target of the {{partitionBy}}. This 
> {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't 
> even get Spark to succeed due to the sort step and data skew in the 
> partitions. In general, this would make more efficient use of cluster 
> resources.
> For this to work, there needs to be a way to communicate between the 
> {{groupBy}} and the {{partitionBy}} by way of some runtime configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step

2016-03-21 Thread Bob Tiernay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Tiernay updated SPARK-14044:

Description: 
It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer#writeRows}} at runtime:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. Currently, we 
can't even get Spark to succeed due to the sort step and data skew in the 
partitions. In general, this would make more efficient use of cluster resources.

For this to work, there needs to be a way to communicate between the 
{{groupBy}} and the {{partitionBy}} by way of some runtime configuration.

  was:
It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer}}:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. For this to 
work, there needs to be a way to communicate between the {{groupBy}} and the 
{{partitionBy}} by way of some runtime configuration.


> Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass 
> sort step
> 
>
> Key: SPARK-14044
> URL: https://issues.apache.org/jira/browse/SPARK-14044
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Bob Tiernay
>
> It would be very useful to allow the disabling of this block of code within 
> {{DynamicPartitionWriterContainer#writeRows}} at runtime:
> https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418
> The use case is that an upstream {{groupBy}} has already sorted a great many 
> fine grained groups which are the target of the {{partitionBy}}. Currently, 
> we can't even get Spark to succeed due to the sort step and data skew in the 
> partitions. In general, this would make more efficient use of cluster 
> resources.
> For this to work, there needs to be a way to communicate between the 
> {{groupBy}} and the {{partitionBy}} by way of some runtime configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer to bypass sort step

2016-03-21 Thread Bob Tiernay (JIRA)
Bob Tiernay created SPARK-14044:
---

 Summary: Allow configuration of DynamicPartitionWriterContainer to 
bypass sort step
 Key: SPARK-14044
 URL: https://issues.apache.org/jira/browse/SPARK-14044
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.1
Reporter: Bob Tiernay


It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer}}:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. For this to 
work, there needs to be a way to communicate between the {{groupBy}} and the 
{{partitionBy}} by way of some runtime configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14044) Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step

2016-03-21 Thread Bob Tiernay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Tiernay updated SPARK-14044:

Summary: Allow configuration of DynamicPartitionWriterContainer#writeRows 
to bypass sort step  (was: Allow configuration of 
DynamicPartitionWriterContainer to bypass sort step)

> Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass 
> sort step
> 
>
> Key: SPARK-14044
> URL: https://issues.apache.org/jira/browse/SPARK-14044
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Bob Tiernay
>
> It would be very useful to allow the disabling of this block of code within 
> {{DynamicPartitionWriterContainer}}:
> https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418
> The use case is that an upstream {{groupBy}} has already sorted a great many 
> fine grained groups which are the target of the {{partitionBy}}. For this to 
> work, there needs to be a way to communicate between the {{groupBy}} and the 
> {{partitionBy}} by way of some runtime configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14023) Make exceptions consistent regarding fields and columns

2016-03-21 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205185#comment-15205185
 ] 

Jacek Laskowski commented on SPARK-14023:
-

If [~josephkb] or [~srowen] could point me to how and where to get started with 
this, I could look into it and offer a pull request. I'd appreciate any help. 
Thanks!

> Make exceptions consistent regarding fields and columns
> ---
>
> Key: SPARK-14023
> URL: https://issues.apache.org/jira/browse/SPARK-14023
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As you can see below, a column is sometimes called a field, depending on where 
> the exception is thrown. I think it should be "column" everywhere (since that's 
> what has a type in a schema).
> {code}
> scala> lr
> res32: org.apache.spark.ml.regression.LinearRegression = linReg_d9bfe808e743
> scala> lr.fit(ds)
> java.lang.IllegalArgumentException: Field "features" does not exist.
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:214)
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:214)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:213)
>   at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
>   at 
> org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
>   at 
> org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
>   at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:89)
>   ... 51 elided
> scala> lr.fit(ds)
> java.lang.IllegalArgumentException: requirement failed: Column label must be 
> of type DoubleType but was actually StringType.
>   at scala.Predef$.require(Predef.scala:219)
>   at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>   at 
> org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
>   at 
> org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
>   at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:89)
>   ... 51 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13806) SQL round() produces incorrect results for negative values

2016-03-21 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205134#comment-15205134
 ] 

Mark Hamstra commented on SPARK-13806:
--

Yes, there is the mostly orthogonal question about which rounding strategy 
should be used -- see the comments in SPARK-8279.  But, assuming that we are 
adopting the ROUND_HALF_UP strategy, there is the problem with negative values 
that this JIRA points out: When using ROUND_HALF_UP and scale == 0, -x.5 must 
round to -(x+1), but Math.round will round it to -x.

In addition to this, the code gen for rounding of negative floating point 
values with negative scales is broken.

All of this stems from Spark SQL's implementation of round() being untested 
with negative values. 

> SQL round() produces incorrect results for negative values
> --
>
> Key: SPARK-13806
> URL: https://issues.apache.org/jira/browse/SPARK-13806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Mark Hamstra
>
> Round in catalyst/expressions/mathExpressions.scala appears to be untested 
> with negative values, and it doesn't handle them correctly.
> There are at least two issues here:
> First, in the genCode for FloatType and DoubleType with _scale == 0, round() 
> will not produce the same results as for the BigDecimal.ROUND_HALF_UP 
> strategy used in all other cases.  This is because Math.round is used for 
> these _scale == 0 cases.  For example, Math.round(-3.5) is -3, while 
> BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. 
> Even after this bug is fixed with something like...
> {code}
> if (${ce.value} < 0) {
>   ${ev.value} = -1 * Math.round(-1 * ${ce.value});
> } else {
>   ${ev.value} = Math.round(${ce.value});
> }
> {code}
> ...which will allow an additional test like this to succeed in 
> MathFunctionsSuite.scala:
> {code}
> checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow)
> {code}
> ...there still appears to be a problem on at least the 
> checkEvalutionWithUnsafeProjection path, where failures like this are 
> produced:
> {code}
> Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
> [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13951) PySpark ml.pipeline support export/import - nested Piplines

2016-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205112#comment-15205112
 ] 

Apache Spark commented on SPARK-13951:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/11866

> PySpark ml.pipeline support export/import - nested Piplines
> ---
>
> Key: SPARK-13951
> URL: https://issues.apache.org/jira/browse/SPARK-13951
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13806) SQL round() produces incorrect results for negative values

2016-03-21 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205103#comment-15205103
 ] 

Davies Liu commented on SPARK-13806:


This is because round() in Java/Scala has different semantics than databases; we 
should figure out what the right behavior is first.  cc [~rxin]

> SQL round() produces incorrect results for negative values
> --
>
> Key: SPARK-13806
> URL: https://issues.apache.org/jira/browse/SPARK-13806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Mark Hamstra
>
> Round in catalyst/expressions/mathExpressions.scala appears to be untested 
> with negative values, and it doesn't handle them correctly.
> There are at least two issues here:
> First, in the genCode for FloatType and DoubleType with _scale == 0, round() 
> will not produce the same results as for the BigDecimal.ROUND_HALF_UP 
> strategy used in all other cases.  This is because Math.round is used for 
> these _scale == 0 cases.  For example, Math.round(-3.5) is -3, while 
> BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. 
> Even after this bug is fixed with something like...
> {code}
> if (${ce.value} < 0) {
>   ${ev.value} = -1 * Math.round(-1 * ${ce.value});
> } else {
>   ${ev.value} = Math.round(${ce.value});
> }
> {code}
> ...which will allow an additional test like this to succeed in 
> MathFunctionsSuite.scala:
> {code}
> checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow)
> {code}
> ...there still appears to be a problem on at least the 
> checkEvalutionWithUnsafeProjection path, where failures like this are 
> produced:
> {code}
> Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
> [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13938) word2phrase feature created in ML

2016-03-21 Thread Steve Weng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205095#comment-15205095
 ] 

Steve Weng commented on SPARK-13938:


I looked it over already, but was hoping you had more details.




> word2phrase feature created in ML
> -
>
> Key: SPARK-13938
> URL: https://issues.apache.org/jira/browse/SPARK-13938
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Steve Weng
>Priority: Critical
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf) which 
> transforms a sentence of words into one where certain individual consecutive 
> words are concatenated by using a training model/estimator (e.g. "I went to 
> New York" becomes "I went to new_york").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13938) word2phrase feature created in ML

2016-03-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205085#comment-15205085
 ] 

Sean Owen commented on SPARK-13938:
---

Have a look at the link I posted, in particular 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines

> word2phrase feature created in ML
> -
>
> Key: SPARK-13938
> URL: https://issues.apache.org/jira/browse/SPARK-13938
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Steve Weng
>Priority: Critical
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf) which 
> transforms a sentence of words into one where certain individual consecutive 
> words are concatenated by using a training model/estimator (e.g. "I went to 
> New York" becomes "I went to new_york").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14042) Add support for custom coalescers

2016-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14042:


Assignee: Apache Spark

> Add support for custom coalescers
> -
>
> Key: SPARK-14042
> URL: https://issues.apache.org/jira/browse/SPARK-14042
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Nezih Yigitbasi
>Assignee: Apache Spark
>
> Per our discussion on the mailing list (please see 
> [here|http://mail-archives.apache.org/mod_mbox//spark-dev/201602.mbox/%3CCA+g63F7aVRBH=WyyK3nvBSLCMPtSdUuL_Ge9=ww4dnmnvy4...@mail.gmail.com%3E])
>  it would be nice to specify a custom coalescing policy as the current 
> {{coalesce()}} method only allows the user to specify the number of 
> partitions and we cannot really control much. The need for this feature 
> popped up when I wanted to merge small files by coalescing them by size.
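
To make the use case above concrete, here is a hedged sketch of the kind of policy being asked for. Today the caller can only pick a partition count via {{coalesce(numPartitions, shuffle)}}; the sketch below shows a toy size-based grouping that packs input partitions toward a target byte size, with each group becoming one output partition. All names and the grouping logic are illustrative, not an existing Spark API.

{code}
// Sketch only: a toy size-based grouping policy (illustrative, not an existing Spark API).
object SizeBasedGrouping {
  // Pack partition indices into groups of roughly `targetBytes`, given each partition's size.
  def group(partitionBytes: Array[Long], targetBytes: Long): Array[Array[Int]] = {
    val groups = scala.collection.mutable.ArrayBuffer(scala.collection.mutable.ArrayBuffer.empty[Int])
    var current = 0L
    for (i <- partitionBytes.indices) {
      if (current >= targetBytes) {  // start a new output partition once the target is reached
        groups += scala.collection.mutable.ArrayBuffer.empty[Int]
        current = 0L
      }
      groups.last += i
      current += partitionBytes(i)
    }
    groups.map(_.toArray).toArray
  }

  def main(args: Array[String]): Unit = {
    // Eight small inputs of 20-60 MB packed toward ~128 MB output partitions.
    val sizes = Array(20L, 40L, 60L, 30L, 50L, 20L, 60L, 40L).map(_ * 1024 * 1024)
    println(group(sizes, 128L * 1024 * 1024).map(_.mkString("[", ",", "]")).mkString(" "))
    // prints: [0,1,2,3] [4,5,6] [7]
  }
}
{code}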



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


