[jira] [Assigned] (SPARK-10624) TakeOrderedAndProjectNode output is not ordered

2015-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10624:


Assignee: Andrew Or  (was: Apache Spark)

> TakeOrderedAndProjectNode output is not ordered
> ---
>
> Key: SPARK-10624
> URL: https://issues.apache.org/jira/browse/SPARK-10624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Input: 1 to 100
> Output: 10, 9, 7, 6, 8, 5, 2, 1, 4, 3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4576) Add concatenation operator

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4576.
---
Resolution: Won't Fix

Going to resolve this as "Won't Fix" for now, since it seems like {{concat}} 
handles this for us and we don't want the behavior of our Hive and non-Hive 
dialects to diverge. Please comment / re-open if I'm mistaken here, if you'd 
like to discuss further, or if you'd just like to voice your support for this 
feature.
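
For reference, a minimal sketch of the {{concat}} alternative (plain Scala against a 1.5-era SQLContext; the {{sqlContext}} name is assumed and not part of this ticket):

{code}
// concat is available both as a SQL function and through the DataFrame API.
val viaSql = sqlContext.sql("SELECT concat('foo', 'bar') AS joined")
viaSql.show()   // shows a single row containing "foobar"

// DataFrame API equivalent:
import org.apache.spark.sql.functions.{concat, lit}
val viaApi = viaSql.select(concat(lit("foo"), lit("bar")).as("joined"))
{code}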

> Add concatenation operator
> --
>
> Key: SPARK-4576
> URL: https://issues.apache.org/jira/browse/SPARK-4576
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>
> Standard SQL defines || as a concatenation operator.
> The operator produces a single concatenated string from its 2 operands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10624) TakeOrderedAndProjectNode output is not ordered

2015-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746419#comment-14746419
 ] 

Apache Spark commented on SPARK-10624:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/8764

> TakeOrderedAndProjectNode output is not ordered
> ---
>
> Key: SPARK-10624
> URL: https://issues.apache.org/jira/browse/SPARK-10624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Input: 1 to 100
> Output: 10, 9, 7, 6, 8, 5, 2, 1, 4, 3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2337) String Interpolation for SparkSQL queries

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2337.
---
Resolution: Won't Fix

Resolving as "Won't Fix" per PR discussion.

> String Interpolation for SparkSQL queries
> -
>
> Key: SPARK-2337
> URL: https://issues.apache.org/jira/browse/SPARK-2337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Ahir Reddy
>Priority: Minor
>
> {code}
> val sqlContext = new SQLContext(...)
> import sqlContext._
> case class Person(name: String, age: Int)
> val people: RDD[Person] = ...
> val srdd = sql"SELECT * FROM $people"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10624) TakeOrderedAndProjectNode output is not ordered

2015-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10624:


Assignee: Apache Spark  (was: Andrew Or)

> TakeOrderedAndProjectNode output is not ordered
> ---
>
> Key: SPARK-10624
> URL: https://issues.apache.org/jira/browse/SPARK-10624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> Input: 1 to 100
> Output: 10, 9, 7, 6, 8, 5, 2, 1, 4, 3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10627) Regularization for artificial neural networks

2015-09-15 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746636#comment-14746636
 ] 

Alexander Ulanov commented on SPARK-10627:
--

Dropout: work-in-progress refactoring for the new ML API is at 
https://github.com/avulanov/spark/tree/dropout-mlp. 
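
For context, a minimal sketch of the dropout idea itself (a hand-rolled illustration, not the API in the linked branch):

{code}
import scala.util.Random

// Inverted dropout: each activation is kept with probability keepProb and scaled
// by 1/keepProb during training, so no rescaling is needed at prediction time.
def dropout(activations: Array[Double], keepProb: Double, rng: Random): Array[Double] =
  activations.map { a =>
    if (rng.nextDouble() < keepProb) a / keepProb else 0.0
  }
{code}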

> Regularization for artificial neural networks
> -
>
> Key: SPARK-10627
> URL: https://issues.apache.org/jira/browse/SPARK-10627
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Add regularization for artificial neural networks. Including, but not limited 
> to:
> 1) L1 and L2 regularization
> 2) Dropout: http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
> 3) DropConnect: 
> http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10584) Documentation about the compatible Hive version is wrong.

2015-09-15 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-10584:
---
Description: 
In Spark 1.5.0, Spark SQL is compatible with Hive 0.12.0 through 1.2.1 but the 
documentation is wrong.

Also, we cannot get the default value by 
`sqlContext.getConf("spark.sql.hive.metastore.version")`.

  was:
The default value of hive metastore version is 1.2.1 but the documentation says 
`spark.sql.hive.metastore.version` is 0.13.1.

Also, we cannot get the default value by 
`sqlContext.getConf("spark.sql.hive.metastore.version")`.


> Documentation about the compatible Hive version is wrong.
> -
>
> Key: SPARK-10584
> URL: https://issues.apache.org/jira/browse/SPARK-10584
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> In Spark 1.5.0, Spark SQL is compatible with Hive 0.12.0 through 1.2.1 but 
> the documentation is wrong.
> Also, we cannot get the default value by 
> `sqlContext.getConf("spark.sql.hive.metastore.version")`.
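
As noted above, getConf(key) does not return the built-in default here; a minimal workaround sketch using the two-argument {{SQLContext.getConf}} overload with a caller-supplied fallback (not part of this ticket's fix):

{code}
// The two-argument overload lets the caller provide the fallback explicitly:
val metastoreVersion =
  sqlContext.getConf("spark.sql.hive.metastore.version", "1.2.1")
{code}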



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10629) Gradient boosted trees: mapPartitions input size increasing

2015-09-15 Thread Wenmin Wu (JIRA)
Wenmin Wu created SPARK-10629:
-

 Summary: Gradient boosted trees: mapPartitions input size 
increasing 
 Key: SPARK-10629
 URL: https://issues.apache.org/jira/browse/SPARK-10629
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Wenmin Wu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10624) TakeOrderedAndProjectNode output is not ordered

2015-09-15 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10624:
-

 Summary: TakeOrderedAndProjectNode output is not ordered
 Key: SPARK-10624
 URL: https://issues.apache.org/jira/browse/SPARK-10624
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Andrew Or
Assignee: Andrew Or


Input: 1 to 100
Output: 10, 9, 7, 6, 8, 5, 2, 1, 4, 3
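
For illustration, a minimal sketch of the expected contract using plain Scala collections (not the LocalNode API):

{code}
// Taking the top k of 1 to 100 under an explicit ordering should yield output
// that is itself sorted by that ordering:
val input = 1 to 100
val expected = input.sorted.take(10)   // 1, 2, ..., 10 -- ordered
// The buggy node returns the right elements, but in an arbitrary order.
{code}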



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6321) Adapt the number of partitions used by the Exchange rule to the cluster specifications

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6321.
---
Resolution: Won't Fix

I'm going to resolve this specific issue as "Won't Fix." While we do plan to do 
some form of dynamic selection of parallelism, I think that it is more likely 
to be driven by the size / characteristics of the input data. If you search 
JIRA you'll be able to find newer tickets describing some 1.6.0-targeted 
proposals for this.

> Adapt the number of partitions used by the Exchange rule to the cluster 
> specifications
> --
>
> Key: SPARK-6321
> URL: https://issues.apache.org/jira/browse/SPARK-6321
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Óscar Puertas
>
> Currently, the exchange rule uses a fixed default of 200 partitions. In my 
> opinion, it would be nice if we could set that default to something derived 
> from the cluster specifications instead of a magic number.
> My proposal is to use the default.parallelism value in the Spark 
> configuration.
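
For reference, a minimal sketch of how the fixed default can already be overridden per application (standard configuration API; not the adaptive behavior proposed here):

{code}
// The Exchange operator's partition count comes from spark.sql.shuffle.partitions
// (default 200); it can be set explicitly instead of relying on the magic number:
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
{code}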



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10300) Use tags to control which tests to run depending on changes being tested

2015-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746429#comment-14746429
 ] 

Apache Spark commented on SPARK-10300:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8775

> Use tags to control which tests to run depending on changes being tested
> 
>
> Key: SPARK-10300
> URL: https://issues.apache.org/jira/browse/SPARK-10300
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.6.0
>
>
> Our unit tests are a little slow, and we could benefit from finer-grained 
> control over which test suites to run depending on what parts of the code 
> base are changed.
> Currently we already have some logic in "run-tests.py" to do this, but it's 
> limited; for example, a minor change in an untracked module is mapped to a 
> "root" module change, and causes really expensive Hive compatibility tests to 
> run when that may not really be necessary.
> Using tags could allow us to be smarter here; this is an idea that has been 
> thrown around before (e.g. SPARK-4746). On top of that, for the cases when we 
> actually do need to run all the tests, we should bump the existing timeout.
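
For illustration, a minimal sketch of how ScalaTest tags could express this (the tag name here is hypothetical, not necessarily what the build ends up using):

{code}
import org.scalatest.{FunSuite, Tag}

// Hypothetical tag marking the expensive Hive compatibility suites.
object ExtendedHiveTest extends Tag("org.apache.spark.tags.ExtendedHiveTest")

class SomeHiveCompatibilitySuite extends FunSuite {
  test("hive compatibility query", ExtendedHiveTest) {
    // ... expensive test body ...
  }
}

// Tagged tests can then be excluded from a run with ScalaTest's -l flag, e.g.:
//   build/sbt "test-only * -- -l org.apache.spark.tags.ExtendedHiveTest"
{code}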



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5919) Enable broadcast joins for Parquet files

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5919.
---
Resolution: Fixed

This is now supported: ParquetRelation implements proper statistics support.
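
For anyone hitting this on older releases, a minimal sketch of the explicit broadcast hint ({{functions.broadcast}}, available since 1.5; the DataFrame names are placeholders):

{code}
import org.apache.spark.sql.functions.broadcast

// Force a broadcast join of the small Parquet-backed DataFrame regardless of
// whether statistics are available:
val joined = largeDF.join(broadcast(smallParquetDF), "key")
{code}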

> Enable broadcast joins for Parquet files
> 
>
> Key: SPARK-5919
> URL: https://issues.apache.org/jira/browse/SPARK-5919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Dima Zhiyanov
>
> Unable to perform broadcast join of Schema RDDs created from Parquet files. 
> Computing statistics is only available for real Hive tables, and it is not 
> always convenient to create a Hive table for every Parquet file
> The issue is discussed here
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-do-broadcast-join-in-SparkSQL-td15298.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10508) incorrect evaluation of searched case expression

2015-09-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746515#comment-14746515
 ] 

Josh Rosen commented on SPARK-10508:


I managed to reproduce this on 1.3.1 as well, using the following code:

{code}
val df = Seq(
  (0, null.asInstanceOf[Double]),
  (1, -1.0),
  (2, 0.0),
  (3, 1.0),
  (4, 0.1),
  (5, 10.0)
).toDF("rnum", "cdec").selectExpr("rnum", "cast(cdec as decimal(7, 2)) as cdec")
df.registerTempTable("TDEC")
sqlContext.sql("select rnum, cdec, case when cdec in ( -1,10,0.1 ) then 'test1' else 'other' end from tdec")
{code}

I was able to confirm that this is fixed in at least 1.4.1 and 1.5.0, where 
this gives the following result:

{code}
0   0   other
1   -1  test1
2   0   other
3   1   other
4   0.1 test1
5   10  test1
{code}

You should be able to work around this by upgrading to a newer Spark version, 
so I'm going to mark this as fixed in 1.4.1 / 1.5.0.

> incorrect evaluation of searched case expression
> 
>
> Key: SPARK-10508
> URL: https://issues.apache.org/jira/browse/SPARK-10508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
> Fix For: 1.4.1, 1.5.0
>
>
> The following case expression never evaluates to 'test1' when cdec is -1 or 
> 10, as it does for Hive 0.13. Instead it returns 'other' for all rows.
> {code}
> select rnum, cdec, case when cdec in ( -1,10,0.1 )  then 'test1' else 'other' 
> end from tdec 
> create table  if not exists TDEC ( RNUM int , CDEC decimal(7, 2 ))
> TERMINATED BY '\n' 
>  STORED AS orc  ;
> 0|\N
> 1|-1.00
> 2|0.00
> 3|1.00
> 4|0.10
> 5|10.00
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10508) incorrect evaluation of searched case expression

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10508.

   Resolution: Fixed
Fix Version/s: 1.4.1
   1.5.0

> incorrect evaluation of searched case expression
> 
>
> Key: SPARK-10508
> URL: https://issues.apache.org/jira/browse/SPARK-10508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
> Fix For: 1.5.0, 1.4.1
>
>
> The following case expression never evaluates to 'test1' when cdec is -1 or 
> 10, as it does for Hive 0.13. Instead it returns 'other' for all rows.
> {code}
> select rnum, cdec, case when cdec in ( -1,10,0.1 )  then 'test1' else 'other' 
> end from tdec 
> create table  if not exists TDEC ( RNUM int , CDEC decimal(7, 2 ))
> TERMINATED BY '\n' 
>  STORED AS orc  ;
> 0|\N
> 1|-1.00
> 2|0.00
> 3|1.00
> 4|0.10
> 5|10.00
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10613) Reduce LocalNode tests dependency on SQLContext

2015-09-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10613.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Reduce LocalNode tests dependency on SQLContext
> ---
>
> Key: SPARK-10613
> URL: https://issues.apache.org/jira/browse/SPARK-10613
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.6.0
>
>
> The whole point of local nodes is that you don't need to do things 
> distributed-ly, meaning RDDs / DataFrames are really not necessary. This 
> allows us to write simpler tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10624) TakeOrderedAndProjectNode output is not ordered

2015-09-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10624.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> TakeOrderedAndProjectNode output is not ordered
> ---
>
> Key: SPARK-10624
> URL: https://issues.apache.org/jira/browse/SPARK-10624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.6.0
>
>
> Input: 1 to 100
> Output: 10, 9, 7, 6, 8, 5, 2, 1, 4, 3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746692#comment-14746692
 ] 

Apache Spark commented on SPARK-10577:
--

User 'Jianfeng-chs' has created a pull request for this issue:
https://github.com/apache/spark/pull/8777

> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>  Labels: starter
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be the possibility to add a hint for broadcast join in:
> - Pyspark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10595) Various ML programming guide cleanups post 1.5

2015-09-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10595.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8752
[https://github.com/apache/spark/pull/8752]

> Various ML programming guide cleanups post 1.5
> --
>
> Key: SPARK-10595
> URL: https://issues.apache.org/jira/browse/SPARK-10595
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 1.6.0
>
>
> Various ML guide cleanups.
> * ml-guide.md: Make it easier to access the algorithm-specific guides.
> * LDA user guide: EM often begins with useless topics, but running longer 
> generally improves them dramatically.  E.g., 10 iterations on a Wikipedia 
> dataset produces useless topics, but 50 iterations produces very meaningful 
> topics.
> * mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be 
> “scalingVec”
> * Clean up Binarizer user guide a little.
> * Document in Pipeline that users should not put an instance into the 
> Pipeline in more than 1 place.
> * spark.ml Word2Vec user guide: clean up grammar/writing
> * Chi Sq Feature Selector docs: Improve text in doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl

2015-09-15 Thread Luvsandondov Lkhamsuren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746813#comment-14746813
 ] 

Luvsandondov Lkhamsuren commented on SPARK-9963:


Implementing 
{code:title=Node.scala|borderStyle=solid}
def predictImpl(binnedFeatures: Array[Int], splits: Array[Array[Split]])
{code}
then using it inside:
{code:title=RandomForest.scala|borderStyle=solid}
def binSeqOp(
    agg: Array[DTStatsAggregator],
    baggedPoint: BaggedPoint[TreePoint]): Array[DTStatsAggregator] = {
  treeToNodeToIndexInfo.foreach { case (treeIndex, nodeIndexToInfo) =>
    val node =
      topNodes(treeIndex).toNode.predictImpl(baggedPoint.datum.binnedFeatures, splits)
    nodeBinSeqOp(treeIndex, nodeIndexToInfo.getOrElse(node.id, null), agg, baggedPoint)
  }
  agg
}
{code}

We need to extract the nodeIndex to retrieve the relevant NodeIndexInfo from  
treeToNodeToIndexInfo. Any insights on how I might be able to get around 
getting nodeIndex in this case?

> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746929#comment-14746929
 ] 

Yi Zhou edited comment on SPARK-10474 at 9/16/15 5:55 AM:
--

BTW, the "spark.shuffle.safetyFraction" is not public parameter for user..


was (Author: jameszhouyi):
BTW, the "spark.shuffle.safetyFraction" is not public ..

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, 

[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746929#comment-14746929
 ] 

Yi Zhou commented on SPARK-10474:
-

BTW, the "spark.shuffle.safetyFraction" is not public ..

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746881#comment-14746881
 ] 

Yi Zhou commented on SPARK-10474:
-

Thanks [~chenghao]. It would be better not to throw such an exception.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4684) Add a script to run JDBC server on Windows

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4684:
--
Component/s: Windows

> Add a script to run JDBC server on Windows
> --
>
> Key: SPARK-4684
> URL: https://issues.apache.org/jira/browse/SPARK-4684
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Windows
>Reporter: Matei Zaharia
>Assignee: Cheng Lian
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7685) Handle high imbalanced data and apply weights to different samples in Logistic Regression

2015-09-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7685.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 7884
[https://github.com/apache/spark/pull/7884]

> Handle high imbalanced data and apply weights to different samples in 
> Logistic Regression
> -
>
> Key: SPARK-7685
> URL: https://issues.apache.org/jira/browse/SPARK-7685
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Critical
> Fix For: 1.6.0
>
>
> In a fraud detection dataset, almost all of the samples are negative while only 
> a couple of them are positive. This kind of highly imbalanced data biases the 
> model toward the negative class, resulting in poor performance. scikit-learn 
> provides a correction that lets users over-/undersample the samples of each 
> class according to given weights; in auto mode, it selects weights inversely 
> proportional to the class frequencies in the training set. This can be done 
> more efficiently by multiplying the weights into the loss and gradient instead 
> of actually over-/undersampling the training dataset, which is very expensive.
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
> On the other hand, some training data may be more important than the rest, 
> e.g. samples from tenured users versus samples from new users. We should be 
> able to provide an additional "weight: Double" field in the LabeledPoint to 
> weight examples differently in the learning algorithm. 
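
For illustration, a minimal sketch of the weighting idea (a hand-rolled logistic loss, not the MLlib implementation):

{code}
// Each example's weight simply scales its contribution to the loss and the
// gradient, which is equivalent to (fractional) over-/undersampling.
def weightedLogisticLossAndGradient(
    coefficients: Array[Double],
    features: Array[Double],
    label: Double,           // 0.0 or 1.0
    weight: Double): (Double, Array[Double]) = {
  val margin = coefficients.zip(features).map { case (c, x) => c * x }.sum
  val p = 1.0 / (1.0 + math.exp(-margin))
  val loss = -weight * (label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
  val gradient = features.map(_ * weight * (p - label))
  (loss, gradient)
}
{code}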



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9642) LinearRegression should supported weighted data

2015-09-15 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-9642:
---
Shepherd: DB Tsai

> LinearRegression should supported weighted data
> ---
>
> Key: SPARK-9642
> URL: https://issues.apache.org/jira/browse/SPARK-9642
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Meihua Wu
>Assignee: Meihua Wu
>  Labels: 1.6
>
> In many modeling applications, data points are not necessarily sampled with 
> equal probabilities. Linear regression should support weighting, which accounts 
> for the over- or under-sampling. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10575) Wrap RDD.takeSample with scope

2015-09-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10575.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Wrap RDD.takeSample with scope
> --
>
> Key: SPARK-10575
> URL: https://issues.apache.org/jira/browse/SPARK-10575
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Minor
> Fix For: 1.6.0
>
>
> Remove return statements in RDD.takeSample and wrap it withScope



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8786) Create a wrapper for BinaryType

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8786.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

This should have been fixed in 1.5.0. I believe that it was addressed via the 
combination of SPARK-9390 and some other patches.

> Create a wrapper for BinaryType
> ---
>
> Key: SPARK-8786
> URL: https://issues.apache.org/jira/browse/SPARK-8786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 1.5.0
>
>
> The hashCode and equals() of Array[Byte] do not check the bytes; we should 
> create a wrapper (internally) to do that.
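
For context, a minimal sketch of the underlying problem (plain Scala, not Spark internals):

{code}
val a = Array[Byte](1, 2, 3)
val b = Array[Byte](1, 2, 3)

a == b                          // false: arrays compare by reference
a.hashCode == b.hashCode        // false in general: identity hash codes
java.util.Arrays.equals(a, b)   // true: the content comparison a wrapper would provide
{code}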



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8786) Create a wrapper for BinaryType

2015-09-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746588#comment-14746588
 ] 

Josh Rosen commented on SPARK-8786:
---

And to confirm, the case given above does work in 1.5.

> Create a wrapper for BinaryType
> ---
>
> Key: SPARK-8786
> URL: https://issues.apache.org/jira/browse/SPARK-8786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 1.5.0
>
>
> The hashCode and equals() of Array[Byte] do not check the bytes; we should 
> create a wrapper (internally) to do that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10628) Add support for arbitrary RandomRDD generation to PySparkAPI

2015-09-15 Thread holdenk (JIRA)
holdenk created SPARK-10628:
---

 Summary: Add support for arbitrary RandomRDD generation to 
PySparkAPI
 Key: SPARK-10628
 URL: https://issues.apache.org/jira/browse/SPARK-10628
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: holdenk
Priority: Minor


SPARK-2724 added support for specific RandomRDDs; add support for arbitrary 
RandomRDD generation to the PySpark API for feature parity with Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746642#comment-14746642
 ] 

Cheng Hao commented on SPARK-4226:
--

Thank you [~brooks], you're right! I meant that it will make the implementation 
more complicated, e.g. resolving and splitting the conjunctions in the 
condition; that's also what I was trying to avoid in my PR by using the 
anti-join. 

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = new HiveContext(sc)
> hc.hql("select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up the lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near-future release (1.3, maybe?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9033) scala.MatchError: interface java.util.Map (of class java.lang.Class) with Spark SQL

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-9033.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

This should have been fixed for maps and arrays in SPARK-6996 and SPARK-6475, 
both of which were included in Spark 1.4.0. If there are additional issues to 
be fixed here then please file a new JIRA describing them.

> scala.MatchError: interface java.util.Map (of class java.lang.Class) with 
> Spark SQL
> ---
>
> Key: SPARK-9033
> URL: https://issues.apache.org/jira/browse/SPARK-9033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.2, 1.3.1
>Reporter: Pavel
> Fix For: 1.4.0
>
>
> I have a java.util.Map field in a POJO class and I'm trying to 
> use it with createDataFrame (1.3.1) / applySchema (1.2.2) on the SQLContext, 
> and I get the following error in both the 1.2.2 and 1.3.1 versions of Spark SQL:
> *sample code:
> {code}
> SQLContext sqlCtx = new SQLContext(sc.sc());
> // the text line is split and assigned to the respective fields of the Event class here
> JavaRDD<Event> rdd = sc.textFile("/path").map(line -> Event.fromString(line));
> DataFrame schemaRDD = sqlCtx.createDataFrame(rdd, Event.class); // <-- error thrown here
> schemaRDD.registerTempTable("events");
> {code}
> Event class is a Serializable containing a field of type  
> java.util.Map. This issue occurs also with Spark streaming 
> when used with SQL.
> {code}
> JavaDStream receiverStream = jssc.receiverStream(new StreamingReceiver());
> JavaDStream windowDStream = receiverStream.window(WINDOW_LENGTH, SLIDE_INTERVAL);
> jssc.checkpoint("event-streaming");
> windowDStream.foreachRDD(evRDD -> {
>   if (evRDD.count() == 0) return null;
>   DataFrame schemaRDD = sqlCtx.createDataFrame(evRDD, Event.class);
>   schemaRDD.registerTempTable("events");
>   ...
> }
> {code}
> *error:
> {code}
> scala.MatchError: interface java.util.Map (of class java.lang.Class)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
>  ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
>  ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  ~[scala-library-2.10.5.jar:na]
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
> ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244) 
> ~[scala-library-2.10.5.jar:na]
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) 
> ~[scala-library-2.10.5.jar:na]
>   at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
> {code}
> **also this occurs for fields of custom POJO classes:
> {code}
> scala.MatchError: class com.test.MyClass (of class java.lang.Class)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
>  ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
>  ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  ~[scala-library-2.10.5.jar:na]
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
> ~[scala-library-2.10.5.jar:na]
>   at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244) 
> ~[scala-library-2.10.5.jar:na]
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) 
> ~[scala-library-2.10.5.jar:na]
>   at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) 
> ~[spark-sql_2.10-1.3.1.jar:1.3.1]
>   at 
> 

[jira] [Commented] (SPARK-6715) Eliminate duplicate filters from pushdown predicates

2015-09-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746442#comment-14746442
 ] 

Josh Rosen commented on SPARK-6715:
---

Per discussion on the PR, I believe that this is "Won't Fix" for 1.x?

> Eliminate duplicate filters from pushdown predicates
> 
>
> Key: SPARK-6715
> URL: https://issues.apache.org/jira/browse/SPARK-6715
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Now in {{DataSourceStrategy}}, the pushdown predicates are duplicates of the 
> original {{Filter}} conditions. Thus, some predicates are evaluated both by 
> the data source relation and by the {{Filter}} plan, which is duplicated work. 
> Once a predicate has been pushed down and evaluated by a data source relation, 
> it can be eliminated from the outer {{Filter}} plan's condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10460) fieldIndex method missing on spark.sql.Row

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10460.

Resolution: Cannot Reproduce

Row definitely has a fieldIndex method as of 1.4.1 
(https://github.com/apache/spark/blob/v1.4.1/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L327).
I tried this out myself and it worked, so I think that there must be some 
other problem. Maybe you're actually running Spark 1.3.x? In any case, please 
comment / re-open this issue if you have additional information that can help 
us to reproduce and debug this problem. Thanks!

> fieldIndex method missing on spark.sql.Row
> --
>
> Key: SPARK-10460
> URL: https://issues.apache.org/jira/browse/SPARK-10460
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: I'm running on an Ubuntu 14.04 32-bit machine, Java 7, 
> spark 1.4.1. Jar was created using sbt-assembly. I've tested both using spark 
> submit and in spark-shell. Both time I had errors in the exact same spot.
>Reporter: FELIPE Q B ALMEIDA
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> {code:title=foo.scala|borderStyle=solid}
> val sc   = new SparkContext(cnf)
>   
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> // initializing the dataframe from json file
> val reviewsDF = sqlContext.jsonFile(inputDir)
> val schema = reviewsDF.schema
> val cleanRDD = reviewsDF.rdd.filter { row: Row =>
>   // ***
>   // error: value fieldIndex is not a member of org.apache.spark.sql.row
>   val unixTimestampIndex = row.fieldIndex("unixReviewTime")
>   // ***
>   val tryLong = Try(row.getLong(unixTimestampIndex))
>   (row.anyNull == false && tryLong.isSuccess)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10548) Concurrent execution in SQL does not work

2015-09-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10548.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

> Concurrent execution in SQL does not work
> -
>
> Key: SPARK-10548
> URL: https://issues.apache.org/jira/browse/SPARK-10548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> From the mailing list:
> {code}
> future { df1.count() } 
> future { df2.count() } 
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> === edit ===
> Simple reproduction:
> {code}
> (1 to 100).par.foreach { _ =>
>   sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10563) SparkContext's local properties should be cloned when inherited

2015-09-15 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746512#comment-14746512
 ] 

Andrew Or commented on SPARK-10563:
---

NOTE: there's a subtle difference in behavior between branch-1.6 (master) and 
branch-1.5. In the latter, this fix only applies to cases where a SQLContext is 
created around the SparkContext. In raw Spark core / streaming for instance, we 
don't clone the properties. In master, however, we *always* clone the 
properties.

> SparkContext's local properties should be cloned when inherited
> ---
>
> Key: SPARK-10563
> URL: https://issues.apache.org/jira/browse/SPARK-10563
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.6.0, 1.5.1
>
>
> Currently, when a child thread inherits local properties from the parent 
> thread, it gets a reference of the parent's local properties and uses them as 
> default values.
> The problem, however, is that when the parent changes the value of an 
> existing property, the changes are reflected in the child thread! This has 
> very confusing semantics, especially in streaming.
> {code}
> private val localProperties = new InheritableThreadLocal[Properties] {
>   override protected def childValue(parent: Properties): Properties = new 
> Properties(parent)
>   override protected def initialValue(): Properties = new Properties()
> }
> {code}
> Instead, we should make a clone of the parent properties rather than passing 
> in a mutable reference.
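
A minimal sketch of the cloning behaviour described above (an illustration only, not necessarily the patch that was merged):

{code}
import java.util.Properties

object LocalPropertiesSketch {
  val localProperties = new InheritableThreadLocal[Properties] {
    override protected def childValue(parent: Properties): Properties = {
      // Snapshot the parent's entries; later mutations in the parent stay invisible here.
      val cloned = new Properties()
      cloned.putAll(parent)
      cloned
    }
    override protected def initialValue(): Properties = new Properties()
  }
}
{code}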



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10504) aggregate where NULL is defined as the value expression aborts when SUM used

2015-09-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746524#comment-14746524
 ] 

Josh Rosen commented on SPARK-10504:


[~marmbrus], I just checked and this is fixed in 1.5 but still broken in 1.4.1.
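
For the still-affected 1.3.x/1.4.x releases, a hedged workaround sketch is to give the literal an explicit type so the aggregate never sees NullType (assuming a sqlContext is in scope and the tversion table from the report):

{code}
// An explicitly typed NULL avoids the scala.MatchError on NullType in the Sum aggregate.
val summed = sqlContext.sql("SELECT SUM(CAST(NULL AS INT)) FROM tversion")
summed.show()
{code}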

> aggregate where NULL is defined as the value expression aborts when SUM used
> 
>
> Key: SPARK-10504
> URL: https://issues.apache.org/jira/browse/SPARK-10504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
>Reporter: N Campbell
>Priority: Minor
>
> In ISO-SQL the context would determine an implicit type for NULL or one might 
> find that a vendor requires an explicit type via CAST ( NULL as INTEGER). It 
> appears that Spark presumes a long type, i.e. select min(NULL), max(NULL), but 
> a query such as the following aborts.
>  
> {{select sum ( null  )  from tversion}}
> {code}
> Operation: execute
> Errors:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 5232.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 5232.0 (TID 18531, sandbox.hortonworks.com): scala.MatchError: NullType (of 
> class org.apache.spark.sql.types.NullType$)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:403)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:422)
>   at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:422)
>   at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:426)
>   at 
> org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:119)
>   at 
> org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:82)
>   at 
> org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:581)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:133)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
>   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
>   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10504) aggregate where NULL is defined as the value expression aborts when SUM used

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10504:
---
Affects Version/s: 1.4.1

> aggregate where NULL is defined as the value expression aborts when SUM used
> 
>
> Key: SPARK-10504
> URL: https://issues.apache.org/jira/browse/SPARK-10504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
>Reporter: N Campbell
>Priority: Minor
>
> In ISO-SQL the context would determine an implicit type for NULL or one might 
> find that a vendor requires an explicit type via CAST ( NULL as INTEGER). It 
> appears that Spark presumes a long type, i.e. select min(NULL), max(NULL), but 
> a query such as the following aborts.
>  
> {{select sum ( null  )  from tversion}}
> {code}
> Operation: execute
> Errors:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 5232.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 5232.0 (TID 18531, sandbox.hortonworks.com): scala.MatchError: NullType (of 
> class org.apache.spark.sql.types.NullType$)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:403)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:422)
>   at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:422)
>   at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:426)
>   at 
> org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:119)
>   at 
> org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:82)
>   at 
> org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:581)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:133)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
>   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
>   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8673) Launcher: add support for monitoring launched applications

2015-09-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8673:
-
Assignee: Marcelo Vanzin

> Launcher: add support for monitoring launched applications
> --
>
> Key: SPARK-8673
> URL: https://issues.apache.org/jira/browse/SPARK-8673
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>
> See parent bug for details.
> This task covers adding the groundwork for being able to communicate with the 
> launched Spark application and provide ways for the code using the launcher 
> library to interact with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10629) Gradient boosted trees: mapPartitions input size increasing

2015-09-15 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-10629:
--
Description: 
First of all, I think my problem is quite different from 
https://issues.apache.org/jira/browse/SPARK-10433, which points out that the input 
size increases at each iteration.

My problem is that the mapPartitions input size increases within one iteration. My 
training samples have 2958359 features in total. Within one iteration, 3 
collectAsMap operations are called. Here is a summary of each call.

| Stage Id | Description                              | Duration | Input    | Shuffle Read | Shuffle Write |
|:--------:|:-----------------------------------------|:--------:|:--------:|:------------:|:-------------:|
| 4        | mapPartitions at DecisionTree.scala:613  | 1.6 h    | 710.2 MB |              | 2.8 GB        |
| 5        | collectAsMap at DecisionTree.scala:642   | 1.8 min  |          | 2.8 GB       |               |
| 6        | mapPartitions at DecisionTree.scala:613  | 1.2 h    | 27.0 GB  |              | 5.6 GB        |
| 7        | collectAsMap at DecisionTree.scala:642   | 2.0 min  |          | 5.6 GB       |               |
| 8        | mapPartitions at DecisionTree.scala:613  | 1.2 h    | 26.5 GB  |              | 11.1 GB       |
| 9        | collectAsMap at DecisionTree.scala:642   | 2.0 min  |          | 8.3 GB       |               |

The mapPartitions stages take far too long. It seems very strange; I wonder whether 
there is a bug here.

  was:
First of all, I think my problem is quite different from 
https://issues.apache.org/jira/browse/SPARK-10433, which point that the input 
size increasing at each iteration.

My problem is the mapPartitions input size increase in one iteration. My 
training samples has 2958359 features in total. Within one iteration, 3 
collectAsMap operation had been called. And here is a summary of each call.

| Tables| Are   | Cool  |
| - |:-:| -:|
| col 3 is  | right-aligned | $1600 |
| col 2 is  | centered  |   $12 |
| zebra stripes | are neat  |$1 |


> Gradient boosted trees: mapPartitions input size increasing 
> 
>
> Key: SPARK-10629
> URL: https://issues.apache.org/jira/browse/SPARK-10629
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1
>Reporter: Wenmin Wu
>
> First of all, I think my problem is quite different from 
> https://issues.apache.org/jira/browse/SPARK-10433, which points out that the input 
> size increases at each iteration.
> My problem is that the mapPartitions input size increases within one iteration. My 
> training samples have 2958359 features in total. Within one iteration, 3 
> collectAsMap operations are called. Here is a summary of each call.
> | Stage Id | Description                              | Duration | Input    | Shuffle Read | Shuffle Write |
> |:--------:|:-----------------------------------------|:--------:|:--------:|:------------:|:-------------:|
> | 4        | mapPartitions at DecisionTree.scala:613  | 1.6 h    | 710.2 MB |              | 2.8 GB        |
> | 5        | collectAsMap at DecisionTree.scala:642   | 1.8 min  |          | 2.8 GB       |               |
> | 6        | mapPartitions at DecisionTree.scala:613  | 1.2 h    | 27.0 GB  |              | 5.6 GB        |
> | 7        | collectAsMap at DecisionTree.scala:642   | 2.0 min  |          | 5.6 GB       |               |
> | 8        | mapPartitions at DecisionTree.scala:613  | 1.2 h    | 26.5 GB  |              | 11.1 GB       |
> | 9        | collectAsMap at DecisionTree.scala:642   | 2.0 min  |          | 8.3 GB       |               |
> The mapPartitions stages take far too long. It seems very strange; I wonder whether 
> there is a bug here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-09-15 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746450#comment-14746450
 ] 

Peng Cheng commented on SPARK-10625:


branch: https://github.com/Schedule1/spark/tree/SPARK-10625
will submit a pull request after I fixed all tests

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by 
> adding new objects into the connection properties, which are then reused by 
> Spark and shipped to workers. When some of these new objects are not 
> serializable, this triggers an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrates the problem 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }
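
A minimal sketch of one possible mitigation (an illustration only, not the change in the linked branch): copy just the string-valued entries before the properties are captured by the task closure, so driver-injected objects never reach the serializer.

{code}
import java.util.Properties

def stringOnlyProperties(props: Properties): Properties = {
  val safe = new Properties()
  // stringPropertyNames() only returns keys whose key and value are both Strings,
  // so entries like the injected driver object are silently dropped.
  val names = props.stringPropertyNames().iterator()
  while (names.hasNext) {
    val key = names.next()
    safe.setProperty(key, props.getProperty(key))
  }
  safe
}
{code}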



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9642) LinearRegression should supported weighted data

2015-09-15 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-9642:
---
Assignee: Meihua Wu

> LinearRegression should supported weighted data
> ---
>
> Key: SPARK-9642
> URL: https://issues.apache.org/jira/browse/SPARK-9642
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Meihua Wu
>Assignee: Meihua Wu
>  Labels: 1.6
>
> In many modeling applications, data points are not necessarily sampled with 
> equal probabilities. Linear regression should support weighting, which accounts 
> for the over- or under-sampling.
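
For illustration, the weighted objective being requested, written as a plain Scala sketch (function and variable names are made up for the example):

{code}
// Weighted least-squares loss: sum_i w_i * (y_i - x_i . beta)^2
def weightedLoss(xs: Array[Array[Double]], ys: Array[Double],
                 ws: Array[Double], beta: Array[Double]): Double =
  xs.indices.map { i =>
    val prediction = xs(i).zip(beta).map { case (x, b) => x * b }.sum
    ws(i) * math.pow(ys(i) - prediction, 2)
  }.sum
{code}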



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10503) incorrect predicate evaluation involving NULL value

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10503:
---
Description: 
Query an ORC table in Hive using the following SQL statement via the SPARKSQL 
thrift-server. The row where rnum=0 has a c1 value of null. The result set 
returned by Spark includes a row where rnum=0 and c1=0, which is incorrect.

{code}
select tint.rnum, tint.rnum from tint where tint.cint in ( tint.cint )

table in Hive
create table  if not exists TINT ( RNUM int , CINT smallint   )
TERMINATED BY '\n' 
 STORED AS orc  ;

data loaded into ORC table is

0|\N
1|-1
2|0
3|1
4|10
{code}

  was:
Query an ORC table in Hive using the following SQL statement via the SPARKSQL 
thrift-server. The row were rnum=0 has a c1 value of null. The resultset 
returned by SPARK includes a row where rnum=0 and c1=0 which is incorrect

select tint.rnum, tint.rnum from tint where tint.cint in ( tint.cint )

table in Hive
create table  if not exists TINT ( RNUM int , CINT smallint   )
TERMINATED BY '\n' 
 STORED AS orc  ;

data loaded into ORC table is

0|\N
1|-1
2|0
3|1
4|10



> incorrect predicate evaluation involving NULL value
> ---
>
> Key: SPARK-10503
> URL: https://issues.apache.org/jira/browse/SPARK-10503
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: N Campbell
>
> Query an ORC table in Hive using the following SQL statement via the SPARKSQL 
> thrift-server. The row were rnum=0 has a c1 value of null. The resultset 
> returned by SPARK includes a row where rnum=0 and c1=0 which is incorrect
> {code}
> select tint.rnum, tint.rnum from tint where tint.cint in ( tint.cint )
> table in Hive
> create table  if not exists TINT ( RNUM int , CINT smallint   )
> TERMINATED BY '\n' 
>  STORED AS orc  ;
> data loaded into ORC table is
> 0|\N
> 1|-1
> 2|0
> 3|1
> 4|10
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10381) Infinite loop when OutputCommitCoordination is enabled and OutputCommitter.commitTask throws exception

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10381.

   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8544
[https://github.com/apache/spark/pull/8544]

> Infinite loop when OutputCommitCoordination is enabled and 
> OutputCommitter.commitTask throws exception
> --
>
> Key: SPARK-10381
> URL: https://issues.apache.org/jira/browse/SPARK-10381
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> When speculative execution is enabled, consider a scenario where the 
> authorized committer of a particular output partition fails during the 
> OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator 
> is supposed to release that committer's exclusive lock on committing once 
> that task fails. However, due to a unit mismatch the lock will not be 
> released, causing Spark to go into an infinite retry loop.
> This bug was masked by the fact that the OutputCommitCoordinator does not 
> have enough end-to-end tests (the current tests use many mocks). Other 
> factors contributing to this bug are the fact that we have many 
> similarly-named identifiers that have different semantics but the same data 
> types (e.g. attemptNumber and taskAttemptId, with inconsistent variable 
> naming which makes them difficult to distinguish).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7192) Pyspark casts hive bigint to int

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-7192.
---
Resolution: Not A Problem

I'm resolving as "Not a Problem." Please comment / re-open if you can explain 
why Python's arbitrary precision integer support is a problem here or tell me 
whether I've overlooked something here.

> Pyspark casts hive bigint to int
> 
>
> Key: SPARK-7192
> URL: https://issues.apache.org/jira/browse/SPARK-7192
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Tamas Jambor
>
> It seems that pyspark reads bigint from hive and stores it as an int:
> >> hive_ctx = HiveContext(sc)
> >> data = hive_ctx.sql("select col1, col2 from dataset1")
> >> data
> DataFrame[col1: int, col2: bigint]
> >> c_t = [type(v) for v in data.collect()[0]]
> >> c_t
> [int, int]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10584) Documentation about the compatible Hive version is wrong.

2015-09-15 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-10584:
---
Summary: Documentation about the compatible Hive version is wrong.  (was: 
Documentation about spark.sql.hive.metastore.version is wrong.)

> Documentation about the compatible Hive version is wrong.
> -
>
> Key: SPARK-10584
> URL: https://issues.apache.org/jira/browse/SPARK-10584
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> The default value of hive metastore version is 1.2.1 but the documentation 
> says `spark.sql.hive.metastore.version` is 0.13.1.
> Also, we cannot get the default value by 
> `sqlContext.getConf("spark.sql.hive.metastore.version")`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10629) Gradient boosted trees: mapPartitions input size increasing

2015-09-15 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-10629:
--
Description: 
First of all, I think my problem is quite different from 
https://issues.apache.org/jira/browse/SPARK-10433, which points out that the input 
size increases at each iteration.

My problem is that the mapPartitions input size increases within one iteration. My 
training samples have 2958359 features in total. Within one iteration, 3 
collectAsMap operations are called. Here is a summary of each call.

| Tables| Are   | Cool  |
| - |:-:| -:|
| col 3 is  | right-aligned | $1600 |
| col 2 is  | centered  |   $12 |
| zebra stripes | are neat  |$1 |

  was:
First of all, I think my problem is quite different from 
https://issues.apache.org/jira/browse/SPARK-10433, which point that the input 
size increasing at each iteration.

My problem is the mapPartitions input size increase in one iteration. My 
training samples has 2958359 features in total. Within one iteration, 3 
collectAsMap operation had been called. And here is a summary of each call.

stage ID 4 mapPartitions at DecisionTree.scala:613 


> Gradient boosted trees: mapPartitions input size increasing 
> 
>
> Key: SPARK-10629
> URL: https://issues.apache.org/jira/browse/SPARK-10629
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1
>Reporter: Wenmin Wu
>
> First of all, I think my problem is quite different from 
> https://issues.apache.org/jira/browse/SPARK-10433, which points out that the input 
> size increases at each iteration.
> My problem is that the mapPartitions input size increases within one iteration. My 
> training samples have 2958359 features in total. Within one iteration, 3 
> collectAsMap operations are called. Here is a summary of each call.
> | Tables| Are   | Cool  |
> | - |:-:| -:|
> | col 3 is  | right-aligned | $1600 |
> | col 2 is  | centered  |   $12 |
> | zebra stripes | are neat  |$1 |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code

2015-09-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9078:
---
Assignee: Suresh Thalamati

> Use of non-standard LIMIT keyword in JDBC tableExists code
> --
>
> Key: SPARK-9078
> URL: https://issues.apache.org/jira/browse/SPARK-9078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Robert Beauchemin
>Assignee: Suresh Thalamati
>Priority: Minor
> Fix For: 1.6.0
>
>
> tableExists in  
> spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses 
> non-standard SQL (specifically, the LIMIT keyword) to determine whether a 
> table exists in a JDBC data source. This will cause an exception in many/most 
> JDBC databases that don't support the LIMIT keyword. See 
> http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql
> To check for table existence or an exception, it could be recrafted around 
> "select 1 from $table where 0 = 1" which isn't the same (it returns an empty 
> resultset rather than the value '1'), but would support more data sources and 
> also support empty tables. Arguably ugly and possibly queries every row on 
> sources that don't support constant folding, but better than failing on JDBC 
> sources that don't support LIMIT. 
> Perhaps "supports LIMIT" could be a field in the JdbcDialect class for 
> databases that support the keyword to override. The ANSI standard is (OFFSET 
> and) FETCH. 
> The standard way to check for table existence would be to use 
> information_schema.tables which is a SQL standard but may not work for other 
> JDBC data sources that support SQL, but not the information_schema. The JDBC 
> DatabaseMetaData interface provides getSchemas()  that allows checking for 
> the information_schema in drivers that support it.
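
A hedged sketch of the "0 = 1" probe proposed above, written against plain JDBC (not the code that was eventually merged for this ticket):

{code}
import java.sql.Connection
import scala.util.Try

def tableExists(conn: Connection, table: String): Boolean =
  Try {
    // Returns an empty result set when the table exists, and throws when it does not,
    // on sources that accept this syntax.
    val stmt = conn.prepareStatement(s"SELECT 1 FROM $table WHERE 0 = 1")
    try stmt.executeQuery() finally stmt.close()
  }.isSuccess
{code}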



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code

2015-09-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9078:

Assignee: Suresh Thalamati

> Use of non-standard LIMIT keyword in JDBC tableExists code
> --
>
> Key: SPARK-9078
> URL: https://issues.apache.org/jira/browse/SPARK-9078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Robert Beauchemin
>Assignee: Suresh Thalamati
>Priority: Minor
> Fix For: 1.6.0
>
>
> tableExists in  
> spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses 
> non-standard SQL (specifically, the LIMIT keyword) to determine whether a 
> table exists in a JDBC data source. This will cause an exception in many/most 
> JDBC databases that don't support the LIMIT keyword. See 
> http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql
> To check for table existence or an exception, it could be recrafted around 
> "select 1 from $table where 0 = 1" which isn't the same (it returns an empty 
> resultset rather than the value '1'), but would support more data sources and 
> also support empty tables. Arguably ugly and possibly queries every row on 
> sources that don't support constant folding, but better than failing on JDBC 
> sources that don't support LIMIT. 
> Perhaps "supports LIMIT" could be a field in the JdbcDialect class for 
> databases that support the keyword to override. The ANSI standard is (OFFSET 
> and) FETCH. 
> The standard way to check for table existence would be to use 
> information_schema.tables which is a SQL standard but may not work for other 
> JDBC data sources that support SQL, but not the information_schema. The JDBC 
> DatabaseMetaData interface provides getSchemas()  that allows checking for 
> the information_schema in drivers that support it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code

2015-09-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9078:

Assignee: (was: Suresh Thalamati)

> Use of non-standard LIMIT keyword in JDBC tableExists code
> --
>
> Key: SPARK-9078
> URL: https://issues.apache.org/jira/browse/SPARK-9078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Robert Beauchemin
>Priority: Minor
> Fix For: 1.6.0
>
>
> tableExists in  
> spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses 
> non-standard SQL (specifically, the LIMIT keyword) to determine whether a 
> table exists in a JDBC data source. This will cause an exception in many/most 
> JDBC databases that don't support the LIMIT keyword. See 
> http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql
> To check for table existence or an exception, it could be recrafted around 
> "select 1 from $table where 0 = 1" which isn't the same (it returns an empty 
> resultset rather than the value '1'), but would support more data sources and 
> also support empty tables. Arguably ugly and possibly queries every row on 
> sources that don't support constant folding, but better than failing on JDBC 
> sources that don't support LIMIT. 
> Perhaps "supports LIMIT" could be a field in the JdbcDialect class for 
> databases that support the keyword to override. The ANSI standard is (OFFSET 
> and) FETCH. 
> The standard way to check for table existence would be to use 
> information_schema.tables which is a SQL standard but may not work for other 
> JDBC data sources that support SQL, but not the information_schema. The JDBC 
> DatabaseMetaData interface provides getSchemas()  that allows checking for 
> the information_schema in drivers that support it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code

2015-09-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9078.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8676
[https://github.com/apache/spark/pull/8676]

> Use of non-standard LIMIT keyword in JDBC tableExists code
> --
>
> Key: SPARK-9078
> URL: https://issues.apache.org/jira/browse/SPARK-9078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Robert Beauchemin
>Priority: Minor
> Fix For: 1.6.0
>
>
> tableExists in  
> spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses 
> non-standard SQL (specifically, the LIMIT keyword) to determine whether a 
> table exists in a JDBC data source. This will cause an exception in many/most 
> JDBC databases that don't support the LIMIT keyword. See 
> http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql
> To check for table existence or an exception, it could be recrafted around 
> "select 1 from $table where 0 = 1" which isn't the same (it returns an empty 
> resultset rather than the value '1'), but would support more data sources and 
> also support empty tables. Arguably ugly and possibly queries every row on 
> sources that don't support constant folding, but better than failing on JDBC 
> sources that don't support LIMIT. 
> Perhaps "supports LIMIT" could be a field in the JdbcDialect class for 
> databases that support the keyword to override. The ANSI standard is (OFFSET 
> and) FETCH. 
> The standard way to check for table existence would be to use 
> information_schema.tables which is a SQL standard but may not work for other 
> JDBC data sources that support SQL, but not the information_schema. The JDBC 
> DatabaseMetaData interface provides getSchemas()  that allows checking for 
> the information_schema in drivers that support it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6102) Create a SparkSQL DataSource API implementation for Redshift

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6102.
---
Resolution: Done

The spark-redshift library did this: 
https://github.com/databricks/spark-redshift
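
A hedged usage sketch of that external package, assuming a sqlContext is in scope; the option names below are recalled from the package's README and should be verified there before use:

{code}
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .load()
{code}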

> Create a SparkSQL DataSource API implementation for Redshift
> 
>
> Key: SPARK-6102
> URL: https://issues.apache.org/jira/browse/SPARK-6102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Chris Fregly
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10584) Documentation about the compatible Hive version is wrong.

2015-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746660#comment-14746660
 ] 

Apache Spark commented on SPARK-10584:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/8776

> Documentation about the compatible Hive version is wrong.
> -
>
> Key: SPARK-10584
> URL: https://issues.apache.org/jira/browse/SPARK-10584
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> In Spark 1.5.0, Spark SQL is compatible with Hive 0.12.0 through 1.2.1 but 
> the documentation is wrong.
> Also, we cannot get the default value by 
> `sqlContext.getConf("spark.sql.hive.metastore.version")`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4450) SparkSQL producing incorrect answer when using --master yarn

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4450:
--
Description: 
A simple summary program using 

spark-submit --master local  MyJob.py

vs.

spark-submit --master yarn MyJob.py

produces different answers--the output produced by local has been independently 
verified and is correct, but the output from yarn is incorrect.

It does not appear to happen with smaller files, only large files.

MyJob.py is 

{code}
from pyspark import SparkContext, SparkConf
from pyspark.sql import *

def maybeFloat(x):
"""Convert NULLs into 0s"""
if x=='': return 0.
else: return float(x)

def maybeInt(x):
"""Convert NULLs into 0s"""
if x=='': return 0
else: return int(x)

def mapColl(p):
return {
"f1": p[0],
"f2": p[1],
"f3": p[2],
"f4": int(p[3]),
"f5": int(p[4]),
"f6": p[5],
"f7": p[6],
"f8": p[7],
"f9": p[8],
"f10": maybeInt(p[9]),
"f11": p[10],
"f12": p[11],
"f13": p[12],
"f14": p[13],
"f15": maybeFloat(p[14]),
"f16": maybeInt(p[15]),
"f17": maybeFloat(p[16]) }

sc = SparkContext()
sqlContext = SQLContext(sc)

lines = sc.textFile("sample.csv")
fields = lines.map(lambda l: mapColl(l.split(",")))

collTable = sqlContext.inferSchema(fields)
collTable.registerAsTable("sample")

test = sqlContext.sql("SELECT f9, COUNT(*) AS rows, SUM(f15) AS f15sum " \
  + "FROM sample " \
  + "GROUP BY f9")
foo = test.collect()
print foo

sc.stop()
{code}


  was:
A simple summary program using 

spark-submit --master local  MyJob.py

vs.

spark-submit --master yarn MyJob.py

produces different answers--the output produced by local has been independently 
verified and is correct, but the output from yarn is incorrect.

It does not appear to happen with smaller files, only large files.

MyJob.py is 

from pyspark import SparkContext, SparkConf
from pyspark.sql import *

def maybeFloat(x):
"""Convert NULLs into 0s"""
if x=='': return 0.
else: return float(x)

def maybeInt(x):
"""Convert NULLs into 0s"""
if x=='': return 0
else: return int(x)

def mapColl(p):
return {
"f1": p[0],
"f2": p[1],
"f3": p[2],
"f4": int(p[3]),
"f5": int(p[4]),
"f6": p[5],
"f7": p[6],
"f8": p[7],
"f9": p[8],
"f10": maybeInt(p[9]),
"f11": p[10],
"f12": p[11],
"f13": p[12],
"f14": p[13],
"f15": maybeFloat(p[14]),
"f16": maybeInt(p[15]),
"f17": maybeFloat(p[16]) }

sc = SparkContext()
sqlContext = SQLContext(sc)

lines = sc.textFile("sample.csv")
fields = lines.map(lambda l: mapColl(l.split(",")))

collTable = sqlContext.inferSchema(fields)
collTable.registerAsTable("sample")

test = sqlContext.sql("SELECT f9, COUNT(*) AS rows, SUM(f15) AS f15sum " \
  + "FROM sample " \
  + "GROUP BY f9")
foo = test.collect()
print foo

sc.stop()





> SparkSQL producing incorrect answer when using --master yarn
> 
>
> Key: SPARK-4450
> URL: https://issues.apache.org/jira/browse/SPARK-4450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: CDH 5.1
>Reporter: Rick Bischoff
>
> A simple summary program using 
> spark-submit --master local  MyJob.py
> vs.
> spark-submit --master yarn MyJob.py
> produces different answers--the output produced by local has been 
> independently verified and is correct, but the output from yarn is incorrect.
> It does not appear to happen with smaller files, only large files.
> MyJob.py is 
> {code}
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import *
> def maybeFloat(x):
> """Convert NULLs into 0s"""
> if x=='': return 0.
> else: return float(x)
> def maybeInt(x):
> """Convert NULLs into 0s"""
> if x=='': return 0
> else: return int(x)
> def mapColl(p):
> return {
> "f1": p[0],
> "f2": p[1],
> "f3": p[2],
> "f4": int(p[3]),
> "f5": int(p[4]),
> "f6": p[5],
> "f7": p[6],
> "f8": p[7],
> "f9": p[8],
> "f10": maybeInt(p[9]),
> "f11": p[10],
> "f12": p[11],
> "f13": p[12],
> "f14": p[13],
> "f15": maybeFloat(p[14]),
> "f16": maybeInt(p[15]),
> "f17": maybeFloat(p[16]) }
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> lines = sc.textFile("sample.csv")
> fields = lines.map(lambda l: mapColl(l.split(",")))
> collTable = sqlContext.inferSchema(fields)
> collTable.registerAsTable("sample")
> test = sqlContext.sql("SELECT f9, COUNT(*) AS rows, SUM(f15) AS 

[jira] [Resolved] (SPARK-4450) SparkSQL producing incorrect answer when using --master yarn

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4450.
---
Resolution: Cannot Reproduce

Resolving as "Cannot Reproduce" since this is targeted against an old Spark 
version and does not describe the actual inconsistency / wrong answer.

> SparkSQL producing incorrect answer when using --master yarn
> 
>
> Key: SPARK-4450
> URL: https://issues.apache.org/jira/browse/SPARK-4450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: CDH 5.1
>Reporter: Rick Bischoff
>
> A simple summary program using 
> spark-submit --master local  MyJob.py
> vs.
> spark-submit --master yarn MyJob.py
> produces different answers--the output produced by local has been 
> independently verified and is correct, but the output from yarn is incorrect.
> It does not appear to happen with smaller files, only large files.
> MyJob.py is 
> {code}
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import *
> def maybeFloat(x):
> """Convert NULLs into 0s"""
> if x=='': return 0.
> else: return float(x)
> def maybeInt(x):
> """Convert NULLs into 0s"""
> if x=='': return 0
> else: return int(x)
> def mapColl(p):
> return {
> "f1": p[0],
> "f2": p[1],
> "f3": p[2],
> "f4": int(p[3]),
> "f5": int(p[4]),
> "f6": p[5],
> "f7": p[6],
> "f8": p[7],
> "f9": p[8],
> "f10": maybeInt(p[9]),
> "f11": p[10],
> "f12": p[11],
> "f13": p[12],
> "f14": p[13],
> "f15": maybeFloat(p[14]),
> "f16": maybeInt(p[15]),
> "f17": maybeFloat(p[16]) }
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> lines = sc.textFile("sample.csv")
> fields = lines.map(lambda l: mapColl(l.split(",")))
> collTable = sqlContext.inferSchema(fields)
> collTable.registerAsTable("sample")
> test = sqlContext.sql("SELECT f9, COUNT(*) AS rows, SUM(f15) AS f15sum " \
>   + "FROM sample " \
>   + "GROUP BY f9")
> foo = test.collect()
> print foo
> sc.stop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-09-15 Thread Peng Cheng (JIRA)
Peng Cheng created SPARK-10625:
--

 Summary: Spark SQL JDBC read/write is unable to handle JDBC 
Drivers that adds unserializable objects into connection properties
 Key: SPARK-10625
 URL: https://issues.apache.org/jira/browse/SPARK-10625
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0, 1.4.1
 Environment: Ubuntu 14.04
Reporter: Peng Cheng


Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by 
adding new objects into the connection properties, which are then reused by 
Spark and shipped to workers. When some of these new objects are not 
serializable, this triggers an org.apache.spark.SparkException: Task not 
serializable. The following test code snippet demonstrates the problem using 
a modified H2 driver:

  test("INSERT to JDBC Datasource with UnserializableH2Driver") {

object UnserializableH2Driver extends org.h2.Driver {

  override def connect(url: String, info: Properties): Connection = {

val result = super.connect(url, info)
info.put("unserializableDriver", this)
result
  }

  override def getParentLogger: Logger = ???
}

import scala.collection.JavaConversions._

val oldDrivers = 
DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
oldDrivers.foreach{
  DriverManager.deregisterDriver
}
DriverManager.registerDriver(UnserializableH2Driver)

sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
properties).collect()(0).length)

DriverManager.deregisterDriver(UnserializableH2Driver)
oldDrivers.foreach{
  DriverManager.registerDriver
}
  }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10563) SparkContext's local properties should be cloned when inherited

2015-09-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10563.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

> SparkContext's local properties should be cloned when inherited
> ---
>
> Key: SPARK-10563
> URL: https://issues.apache.org/jira/browse/SPARK-10563
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.6.0, 1.5.1
>
>
> Currently, when a child thread inherits local properties from the parent 
> thread, it gets a reference of the parent's local properties and uses them as 
> default values.
> The problem, however, is that when the parent changes the value of an 
> existing property, the changes are reflected in the child thread! This has 
> very confusing semantics, especially in streaming.
> {code}
> private val localProperties = new InheritableThreadLocal[Properties] {
>   override protected def childValue(parent: Properties): Properties = new 
> Properties(parent)
>   override protected def initialValue(): Properties = new Properties()
> }
> {code}
> Instead, we should make a clone of the parent properties rather than passing 
> in a mutable reference.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10627) Regularization for artificial neural networks

2015-09-15 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-10627:


 Summary: Regularization for artificial neural networks
 Key: SPARK-10627
 URL: https://issues.apache.org/jira/browse/SPARK-10627
 Project: Spark
  Issue Type: Umbrella
  Components: ML
Affects Versions: 1.5.0
Reporter: Alexander Ulanov
Priority: Minor


Add regularization for artificial neural networks. This includes, but is not limited to:
1) L1 and L2 regularization
2) Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
3) Dropconnect 
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf
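
As a small illustration of item 1), an L2 penalty folded into a loss/gradient pair (plain Scala, unrelated to the actual ML implementation):

{code}
// Adds an L2 penalty 0.5 * lambda * ||w||^2 to a loss and lambda * w to its gradient.
def withL2(loss: Double, grad: Array[Double],
           weights: Array[Double], lambda: Double): (Double, Array[Double]) = {
  val penalty = 0.5 * lambda * weights.map(w => w * w).sum
  val newGrad = grad.zip(weights).map { case (g, w) => g + lambda * w }
  (loss + penalty, newGrad)
}
{code}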



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10629) Gradient boosted trees: mapPartitions input size increasing

2015-09-15 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-10629:
--
Description: 
First of all, I think my problem is quite different from 
https://issues.apache.org/jira/browse/SPARK-10433, which points out that the input 
size increases at each iteration.

My problem is that the mapPartitions input size increases within one iteration. My 
training samples have 2958359 features in total. Within one iteration, 3 
collectAsMap operations are called. Here is a summary of each call.

stage ID 4 mapPartitions at DecisionTree.scala:613 

> Gradient boosted trees: mapPartitions input size increasing 
> 
>
> Key: SPARK-10629
> URL: https://issues.apache.org/jira/browse/SPARK-10629
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1
>Reporter: Wenmin Wu
>
> First of all, I think my problem is quite different from 
> https://issues.apache.org/jira/browse/SPARK-10433, which points out that the input 
> size increases at each iteration.
> My problem is that the mapPartitions input size increases within one iteration. My 
> training samples have 2958359 features in total. Within one iteration, 3 
> collectAsMap operations are called. Here is a summary of each call.
> stage ID 4 mapPartitions at DecisionTree.scala:613 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5624) Can't find new column

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5624.
---
Resolution: Fixed

Resolving as "Fixed" per claim that this works in a newer release.

> Can't find new column 
> --
>
> Key: SPARK-5624
> URL: https://issues.apache.org/jira/browse/SPARK-5624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Alex Liu
>Priority: Minor
>
> The following test fails
> {code}
> 0: jdbc:hive2://localhost:1> DROP TABLE IF EXISTS alter_test_table;
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.175 seconds)
> 0: jdbc:hive2://localhost:1> DROP TABLE IF EXISTS alter_test_table_ctas;
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.155 seconds)
> 0: jdbc:hive2://localhost:1> DROP TABLE IF EXISTS 
> alter_test_table_renamed;
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.162 seconds)
> 0: jdbc:hive2://localhost:1> CREATE TABLE alter_test_table (foo INT, bar 
> STRING) COMMENT 'table to test DDL ops' PARTITIONED BY (ds STRING) STORED AS 
> TEXTFILE;
> +-+
> | result  |
> +-+
> +-+
> No rows selected (0.247 seconds)
> 0: jdbc:hive2://localhost:1> LOAD DATA LOCAL INPATH 
> '/Users/alex/project/automaton/resources/tests/data/files/kv1.txt' OVERWRITE 
> INTO TABLE alter_test_table PARTITION (ds='2008-08-08');  
> +-+
> | result  |
> +-+
> +-+
> No rows selected (0.367 seconds)
> 0: jdbc:hive2://localhost:1> CREATE TABLE alter_test_table_ctas as SELECT 
> * FROM alter_test_table;
> +--+--+-+
> | foo  | bar  | ds  |
> +--+--+-+
> +--+--+-+
> No rows selected (0.641 seconds)
> 0: jdbc:hive2://localhost:1> ALTER TABLE alter_test_table ADD COLUMNS 
> (new_col1 INT);
> +-+
> | result  |
> +-+
> +-+
> No rows selected (0.226 seconds)
> 0: jdbc:hive2://localhost:1> INSERT OVERWRITE TABLE alter_test_table 
> PARTITION (ds='2008-08-15') SELECT foo, bar, 3 FROM 
> alter_test_table_ctas WHERE ds='2008-08-08';
> +--+--+--+
> | foo  | bar  | c_2  |
> +--+--+--+
> +--+--+--+
> No rows selected (0.522 seconds)
> 0: jdbc:hive2://localhost:1> select * from alter_test_table ;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 35.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 35.0 (TID 66, 127.0.0.1): java.lang.RuntimeException: cannot find field 
> new_col1 from [0:foo, 1:bar]
> 
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:367)
> 
> org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldRef(LazySimpleStructObjectInspector.java:168)
> 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$9.apply(TableReader.scala:275)
> 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$9.apply(TableReader.scala:275)
> 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 
> org.apache.spark.sql.hive.HadoopTableReader$.fillObject(TableReader.scala:275)
> 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$3$$anonfun$apply$1.apply(TableReader.scala:193)
> 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$3$$anonfun$apply$1.apply(TableReader.scala:187)
> org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> 

[jira] [Updated] (SPARK-10538) java.lang.NegativeArraySizeException during join

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10538:
---
Target Version/s: 1.5.1

> java.lang.NegativeArraySizeException during join
> 
>
> Key: SPARK-10538
> URL: https://issues.apache.org/jira/browse/SPARK-10538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
> Attachments: screenshot-1.png
>
>
> Hi,
> I've got a problem when joining tables in PySpark (in my example, 20 of 
> them).
> I can observe that during the calculation of the first partition (on one of the 
> consecutive joins) there is a large shuffle read size (294.7 MB / 146 records) 
> versus the other partitions (approx. 272.5 KB / 113 records).
> I can also observe that just before the crash the Python process grows to a few 
> GB of RAM.
> After some time there is an exception:
> {code}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on 2 nodes cluster (12 cores, 64 GB RAM)
> Config:
> {code}
> spark.driver.memory  10g
> spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC 
> -Dfile.encoding=UTF8
> spark.executor.memory   60g
> spark.storage.memoryFraction0.05
> spark.shuffle.memoryFraction0.75
> spark.driver.maxResultSize  10g  
> spark.cores.max 24
> spark.kryoserializer.buffer.max 1g
> spark.default.parallelism   200
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10508) incorrect evaluation of searched case expression

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10508:
---
Description: 
The following case expression never evaluates to 'test1' when cdec is -1 or 10, 
as it does in Hive 0.13. Instead it returns 'other' for all rows.

{code}
select rnum, cdec, case when cdec in ( -1,10,0.1 )  then 'test1' else 'other' 
end from tdec 

create table  if not exists TDEC ( RNUM int , CDEC decimal(7, 2 ))
TERMINATED BY '\n' 
 STORED AS orc  ;


0|\N
1|-1.00
2|0.00
3|1.00
4|0.10
5|10.00
{code}

  was:
The following case expression never evaluates to 'test1' when cdec is -1 or 10 
as it will for Hive 0.13. Instead is returns 'other' for all rows.

select rnum, cdec, case when cdec in ( -1,10,0.1 )  then 'test1' else 'other' 
end from tdec 

create table  if not exists TDEC ( RNUM int , CDEC decimal(7, 2 ))
TERMINATED BY '\n' 
 STORED AS orc  ;


0|\N
1|-1.00
2|0.00
3|1.00
4|0.10
5|10.00



> incorrect evaluation of searched case expression
> 
>
> Key: SPARK-10508
> URL: https://issues.apache.org/jira/browse/SPARK-10508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>
> The following case expression never evaluates to 'test1' when cdec is -1 or 
> 10, as it does in Hive 0.13. Instead it returns 'other' for all rows.
> {code}
> select rnum, cdec, case when cdec in ( -1,10,0.1 )  then 'test1' else 'other' 
> end from tdec 
> create table  if not exists TDEC ( RNUM int , CDEC decimal(7, 2 ))
> TERMINATED BY '\n' 
>  STORED AS orc  ;
> 0|\N
> 1|-1.00
> 2|0.00
> 3|1.00
> 4|0.10
> 5|10.00
> {code}
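
A minimal sketch of the comparison plus a possible workaround, assuming the tdec table above exists in the Hive metastore; the explicit casts are only a hypothetical mitigation for the decimal-vs-numeric-literal comparison, not a confirmed fix:

{code}
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

// Original predicate: mixed integer/decimal literals in the IN list.
hc.sql("select rnum, cdec, case when cdec in (-1, 10, 0.1) then 'test1' else 'other' end from tdec").show()

// Workaround sketch: force the literals to the column's decimal(7,2) type.
hc.sql("""select rnum, cdec,
                 case when cdec in (cast(-1  as decimal(7,2)),
                                    cast(10  as decimal(7,2)),
                                    cast(0.1 as decimal(7,2)))
                      then 'test1' else 'other' end
          from tdec""").show()
{code}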



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10612) Add prepare to LocalNode

2015-09-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10612.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Add prepare to LocalNode
> 
>
> Key: SPARK-10612
> URL: https://issues.apache.org/jira/browse/SPARK-10612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>
> The idea is that we should separate the function call that does memory 
> reservation (i.e. prepare()) from the function call that consumes the input 
> (e.g. open()), so that every operator gets a chance to reserve memory before 
> any input is consumed.
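
An illustrative sketch only (not the actual LocalNode API) of what separating the two calls could look like:

{code}
// Sketch: prepare() reserves memory up front, open() starts consuming input.
trait LocalNodeSketch {
  /** Reserve memory and other resources; called on every operator before any open(). */
  def prepare(): Unit
  /** Start consuming the input; may assume prepare() already ran on the whole tree. */
  def open(): Unit
  def next(): Boolean
  def close(): Unit
}
{code}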



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10626) Create a Java friendly method for randomRDD & RandomDataGenerator on RandomRDDs.

2015-09-15 Thread holdenk (JIRA)
holdenk created SPARK-10626:
---

 Summary: Create a Java friendly method for randomRDD & 
RandomDataGenerator on RandomRDDs.
 Key: SPARK-10626
 URL: https://issues.apache.org/jira/browse/SPARK-10626
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: holdenk
Priority: Minor


SPARK-3136 added a large number of functions for creating Java RandomRDDs, but 
for people who want to use custom RandomDataGenerators we should add a Java 
friendly method as well.
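
A hedged sketch of what such a wrapper could look like; the object and method names and the fake-ClassTag trick below are assumptions for illustration, not the API that eventually lands:

{code}
import scala.reflect.ClassTag

import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
import org.apache.spark.mllib.random.{RandomDataGenerator, RandomRDDs}

object JavaRandomRDDsSketch {
  // Java callers cannot supply a ClassTag, so fake one and delegate to the Scala API.
  def randomJavaRDD[T](jsc: JavaSparkContext,
                       generator: RandomDataGenerator[T],
                       size: Long,
                       numPartitions: Int,
                       seed: Long): JavaRDD[T] = {
    implicit val ct: ClassTag[T] = ClassTag.AnyRef.asInstanceOf[ClassTag[T]]
    RandomRDDs.randomRDD(jsc.sc, generator, size, numPartitions, seed).toJavaRDD()
  }
}
{code}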



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2015-09-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5236.
-
Resolution: Cannot Reproduce

Closing as cannot reproduce, please reopen if you can.

> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
> -
>
> Key: SPARK-5236
> URL: https://issues.apache.org/jira/browse/SPARK-5236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Alex Baretta
>
> {code}
> 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 
> (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
> at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
> at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
> at 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
> at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
> at 
> org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375)
> at 
> org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434)
> at 
> parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237)
> at 
> parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353)
> at 
> parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
> at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
> ... 27 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5194) ADD JAR doesn't update classpath until reconnect

2015-09-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5194.
-
Resolution: Cannot Reproduce

Closing as cannot reproduce.  Please reopen if you can reproduce it on Spark 1.5.

> ADD JAR doesn't update classpath until reconnect
> 
>
> Key: SPARK-5194
> URL: https://issues.apache.org/jira/browse/SPARK-5194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Oleg Danilov
>
> Steps to reproduce:
> beeline>  !connect jdbc:hive2://vmhost-vm0:1  
>  
> 0: jdbc:hive2://vmhost-vm0:1> add jar 
> ./target/nexr-hive-udf-0.2-SNAPSHOT.jar
> 0: jdbc:hive2://vmhost-vm0:1> CREATE TEMPORARY FUNCTION nvl AS 
> 'com.nexr.platform.hive.udf.GenericUDFNVL';
> 0: jdbc:hive2://vmhost-vm0:1> select nvl(imsi,'test') from 
> ps_cei_index_1_week limit 1;
> Error: java.lang.ClassNotFoundException: 
> com.nexr.platform.hive.udf.GenericUDFNVL (state=,code=0)
> 0: jdbc:hive2://vmhost-vm0:1> !reconnect
> Reconnecting to "jdbc:hive2://vmhost-vm0:1"...
> Closing: org.apache.hive.jdbc.HiveConnection@3f18dc75: {1}
> Connected to: Spark SQL (version 1.2.0)
> Driver: null (version null)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://vmhost-vm0:1> select nvl(imsi,'test') from 
> ps_cei_index_1_week limit 1;
> +--+
> | _c0  |
> +--+
> | -1   |
> +--+
> 1 row selected (1.605 seconds)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5305) Using a field in a WHERE clause that is not in the schema does not throw an exception.

2015-09-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5305.
-
Resolution: Cannot Reproduce

Marking as cannot reproduce, please reopen if you can.

> Using a field in a WHERE clause that is not in the schema does not throw an 
> exception.
> --
>
> Key: SPARK-5305
> URL: https://issues.apache.org/jira/browse/SPARK-5305
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Corey J. Nolet
>
> Given a schema:
> key1 = String
> key2 = Integer
> The following sql statement doesn't seem to throw an exception:
> SELECT * FROM myTable WHERE doesntExist = 'val1'
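
A minimal sketch of the reported scenario (table and column names are from the report, the data is made up); on a current analyzer one would expect the query to fail with an unresolved-attribute error rather than run:

{code}
// Assumes Spark 1.3+ and an existing SparkContext `sc`.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key1", "key2")
df.registerTempTable("myTable")

// Should raise an analysis error for the unknown column `doesntExist`.
sqlContext.sql("SELECT * FROM myTable WHERE doesntExist = 'val1'").show()
{code}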



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe

2015-09-15 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746320#comment-14746320
 ] 

David Ross commented on SPARK-5391:
---

Haven't tried native JSON but looks promising, so this ticket is probably lower 
priority.
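
For context, a quick sketch of the "native JSON" path mentioned above (the file path is a placeholder); Spark SQL can infer the schema from JSON files directly, without a Hive SerDe:

{code}
// Spark 1.4+: DataFrameReader; on 1.3 the equivalent is sqlContext.jsonFile(path).
val events = sqlContext.read.json("/tmp/events.json")
events.printSchema()
events.registerTempTable("test_json")
{code}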

> SparkSQL fails to create tables with custom JSON SerDe
> --
>
> Key: SPARK-5391
> URL: https://issues.apache.org/jira/browse/SPARK-5391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: David Ross
>
> - Using Spark built from trunk on this commit: 
> https://github.com/apache/spark/commit/bc20a52b34e826895d0dcc1d783c021ebd456ebd
> - Build for Hive13
> - Using this JSON serde: https://github.com/rcongiu/Hive-JSON-Serde
> First download jar locally:
> {code}
> $ curl 
> http://www.congiu.net/hive-json-serde/1.3/cdh5/json-serde-1.3-jar-with-dependencies.jar
>  > /tmp/json-serde-1.3-jar-with-dependencies.jar
> {code}
> Then add it in SparkSQL session:
> {code}
> add jar /tmp/json-serde-1.3-jar-with-dependencies.jar
> {code}
> Finally create table:
> {code}
> create table test_json (c1 boolean) ROW FORMAT SERDE 
> 'org.openx.data.jsonserde.JsonSerDe';
> {code}
> Logs for add jar:
> {code}
> 15/01/23 23:48:33 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'add jar /tmp/json-serde-1.3-jar-with-dependencies.jar'
> 15/01/23 23:48:34 INFO session.SessionState: No Tez session required at this 
> point. hive.execution.engine=mr.
> 15/01/23 23:48:34 INFO SessionState: Added 
> /tmp/json-serde-1.3-jar-with-dependencies.jar to class path
> 15/01/23 23:48:34 INFO SessionState: Added resource: 
> /tmp/json-serde-1.3-jar-with-dependencies.jar
> 15/01/23 23:48:34 INFO spark.SparkContext: Added JAR 
> /tmp/json-serde-1.3-jar-with-dependencies.jar at 
> http://192.168.99.9:51312/jars/json-serde-1.3-jar-with-dependencies.jar with 
> timestamp 1422056914776
> 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result 
> Schema: List()
> 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result 
> Schema: List()
> {code}
> Logs (with error) for create table:
> {code}
> 15/01/23 23:49:00 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'create table test_json (c1 boolean) ROW FORMAT SERDE 
> 'org.openx.data.jsonserde.JsonSerDe''
> 15/01/23 23:49:00 INFO parse.ParseDriver: Parsing command: create table 
> test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed
> 15/01/23 23:49:01 INFO session.SessionState: No Tez session required at this 
> point. hive.execution.engine=mr.
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO ql.Driver: Concurrency mode is disabled, not creating 
> a lock manager
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO parse.ParseDriver: Parsing command: create table 
> test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941103 end=1422056941104 duration=1 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
> 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Creating table test_json 
> position=13
> 15/01/23 23:49:01 INFO ql.Driver: Semantic Analysis Completed
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941104 end=1422056941240 duration=136 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO ql.Driver: Returning Hive schema: 
> Schema(fieldSchemas:null, properties:null)
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941071 end=1422056941252 duration=181 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO ql.Driver: Starting command: create table test_json 
> (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941067 end=1422056941258 duration=191 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 WARN security.ShellBasedUnixGroupsMapping: got exception 
> trying to get groups for user anonymous
> org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user
>   at 

[jira] [Commented] (SPARK-8152) Dataframe Join Ignores Condition

2015-09-15 Thread Eric Doi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746337#comment-14746337
 ] 

Eric Doi commented on SPARK-8152:
-

Thanks.  Will reopen if I'm able to reproduce in a clearer way.

> Dataframe Join Ignores Condition
> 
>
> Key: SPARK-8152
> URL: https://issues.apache.org/jira/browse/SPARK-8152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Doi
> Attachments: field_choice_pinpointed.png, side-by-side.png
>
>
> When joining two tables A and B on the condition that A.X = B.X, in some cases 
> that condition is not fulfilled in the result.
> I suspect it might be due to duplicate column names in the source tables 
> causing confusion.  Is it possible for hidden fields to exist in a 
> dataframe?
> I will attach a screenshot for more details.  The bug is reproducible but hard 
> to pinpoint.
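
A hedged sketch of how the suspected duplicate-column ambiguity can be ruled out by aliasing both sides of the join (df1, df2 and the column name X are placeholders for the reporter's tables):

{code}
import sqlContext.implicits._

// Alias each side so the join condition unambiguously refers to one column per table.
val a = df1.as("a")
val b = df2.as("b")
val joined = a.join(b, $"a.X" === $"b.X")
{code}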



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2789) Apply names to RDD to becoming SchemaRDD

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen closed SPARK-2789.
-
Resolution: Won't Fix

> Apply names to RDD to becoming SchemaRDD
> 
>
> Key: SPARK-2789
> URL: https://issues.apache.org/jira/browse/SPARK-2789
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> In order to simplify applying a schema, we could add an API called applyNames(), 
> which would infer the types in the RDD and create a schema with the given names, 
> then apply that schema to it to become a SchemaRDD. The names could be provided 
> as a String with the names separated by spaces.
> For example:
> rdd = sc.parallelize([("Alice", 10)])
> srdd = sqlCtx.applyNames(rdd, "name age")
> Users wouldn't need to create a case class or StructType to get the full power 
> of Spark SQL.
> The string representation of the schema could also support nested structures 
> (MapType, ArrayType and StructType), for example:
> "name age address(city zip) likes[title stars] props{[value type]}"
> It would be equivalent to the unnamed schema:
> root
> |--name
> |--age
> |--address
> |--|--city
> |--|--zip
> |--likes
> |--|--element
> |--|--|--title
> |--|--|--stars
> |--props
> |--|--key:
> |--|--value:
> |--|--|--element
> |--|--|--|--value
> |--|--|--|--type
> All field names are separated by spaces; the structure of a field (if it is a 
> nested type) follows the name without a space and should start with "(" 
> (StructType), "[" (ArrayType) or "{" (MapType).
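
For comparison, a sketch of what the flat (non-nested) case from the example looks like today with an explicit StructType; names and types here are illustrative only:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// The boilerplate that applyNames("name age") was meant to remove.
val rdd = sc.parallelize(Seq(("Alice", 10))).map { case (name, age) => Row(name, age) }
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))
val people = sqlContext.createDataFrame(rdd, schema)
people.registerTempTable("people")
{code}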



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6632) Optimize the parquetSchema to metastore schema reconciliation, so that the process is delegated to each map task itself

2015-09-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6632.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Starting with Spark 1.5 I believe all footer reading is delegated to a spark 
job.

> Optimize the parquetSchema to metastore schema reconciliation, so that the 
> process is delegated to each map task itself
> ---
>
> Key: SPARK-6632
> URL: https://issues.apache.org/jira/browse/SPARK-6632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yash Datta
> Fix For: 1.5.0
>
>
> Currently in ParquetRelation2, the schema from all the part files is first 
> merged and then reconciled with the metastore schema. This approach does not 
> scale when a table has thousands of partitions. We can take a different 
> approach: go ahead with the metastore schema, and reconcile the column names 
> within each map task using the ReadSupport hooks provided by Parquet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9188) make open hash set/table APIs public

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-9188.
---
Resolution: Won't Fix

This has been proposed previously. I'm closing this issue as "Won't Fix" 
because building a general-purpose specialized collections library is outside 
of Spark's scope / mission; we don't want to commit to these classes as stable 
public APIs. If you want to use these classes yourself then I recommend that 
you copy their source to your own project.

> make open hash set/table APIs public 
> -
>
> Key: SPARK-9188
> URL: https://issues.apache.org/jira/browse/SPARK-9188
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Reporter: Mohit Jaggi
>
> These data structures will be useful for writing custom aggregations and 
> other code on spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9032) scala.MatchError in DataFrameReader.json(String path)

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9032:
--
Description: 
Executing read().json() on SQLContext (i.e. on the DataFrameReader) raises a 
MatchError with the following stacktrace while trying to read JSON data:

{code}
15/07/14 11:25:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool 
15/07/14 11:25:26 INFO DAGScheduler: Job 0 finished: json at Example.java:23, 
took 6.981330 s
Exception in thread "main" scala.MatchError: StringType (of class 
org.apache.spark.sql.types.StringType$)
at org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58)
at 
org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139)
at 
org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.json.JSONRelation.schema$lzycompute(JSONRelation.scala:137)
at org.apache.spark.sql.json.JSONRelation.schema(JSONRelation.scala:137)
at 
org.apache.spark.sql.sources.LogicalRelation.(LogicalRelation.scala:30)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:213)
at com.hp.sparkdemo.Example.main(Example.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/07/14 11:25:26 INFO SparkContext: Invoking stop() from shutdown hook
15/07/14 11:25:26 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
15/07/14 11:25:26 INFO DAGScheduler: Stopping DAGScheduler
15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Shutting down all executors
15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Asking each executor to 
shut down
15/07/14 11:25:26 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
{code}

Offending code snippet (around line 23):

{code}
   JavaSparkContext sctx = new JavaSparkContext(sparkConf);
SQLContext ctx = new SQLContext(sctx);
DataFrame frame = ctx.read().json(facebookJSON);
frame.printSchema();
{code}

The exception is reproducible using the following JSON:
{code}
{
   "data": [
  {
 "id": "X999_Y999",
 "from": {
"name": "Tom Brady", "id": "X12"
 },
 "message": "Looking forward to 2010!",
 "actions": [
{
   "name": "Comment",
   "link": "http://www.facebook.com/X999/posts/Y999"
},
{
   "name": "Like",
   "link": "http://www.facebook.com/X999/posts/Y999"
}
 ],
 "type": "status",
 "created_time": "2010-08-02T21:27:44+",
 "updated_time": "2010-08-02T21:27:44+"
  },
  {
 "id": "X998_Y998",
 "from": {
"name": "Peyton Manning", "id": "X18"
 },
 "message": "Where's my contract?",
 "actions": [
{
   "name": "Comment",
   "link": "http://www.facebook.com/X998/posts/Y998"
},
{
   "name": "Like",
   "link": "http://www.facebook.com/X998/posts/Y998"
}
 ],
 "type": "status",
 "created_time": "2010-08-02T21:27:44+",
 "updated_time": "2010-08-02T21:27:44+"
  }
   ]
}
{code}

  was:
Executing read().json() on SQLContext (i.e. on the DataFrameReader) raises a 
MatchError with the following stacktrace while trying to read JSON data:

15/07/14 11:25:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool 
15/07/14 11:25:26 INFO DAGScheduler: Job 0 finished: json at Example.java:23, 
took 6.981330 s
Exception in thread "main" scala.MatchError: StringType (of class 
org.apache.spark.sql.types.StringType$)
at org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58)
at 
org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139)
at 

[jira] [Resolved] (SPARK-9032) scala.MatchError in DataFrameReader.json(String path)

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-9032.
---
   Resolution: Fixed
Fix Version/s: 1.4.1

Just confirmed that this is fixed in 1.4.1.

> scala.MatchError in DataFrameReader.json(String path)
> -
>
> Key: SPARK-9032
> URL: https://issues.apache.org/jira/browse/SPARK-9032
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.4.0
> Environment: Ubuntu 15.04
>Reporter: Philipp Poetter
> Fix For: 1.4.1
>
>
> Executing read().json() on SQLContext (i.e. on the DataFrameReader) raises a 
> MatchError with the following stacktrace while trying to read JSON data:
> {code}
> 15/07/14 11:25:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
> have all completed, from pool 
> 15/07/14 11:25:26 INFO DAGScheduler: Job 0 finished: json at Example.java:23, 
> took 6.981330 s
> Exception in thread "main" scala.MatchError: StringType (of class 
> org.apache.spark.sql.types.StringType$)
>   at org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139)
>   at 
> org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.json.JSONRelation.schema$lzycompute(JSONRelation.scala:137)
>   at org.apache.spark.sql.json.JSONRelation.schema(JSONRelation.scala:137)
>   at 
> org.apache.spark.sql.sources.LogicalRelation.(LogicalRelation.scala:30)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:213)
>   at com.hp.sparkdemo.Example.main(Example.java:23)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 15/07/14 11:25:26 INFO SparkContext: Invoking stop() from shutdown hook
> 15/07/14 11:25:26 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
> 15/07/14 11:25:26 INFO DAGScheduler: Stopping DAGScheduler
> 15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Shutting down all 
> executors
> 15/07/14 11:25:26 INFO SparkDeploySchedulerBackend: Asking each executor to 
> shut down
> 15/07/14 11:25:26 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> {code}
> Offending code snippet (around line 23):
> {code}
>JavaSparkContext sctx = new JavaSparkContext(sparkConf);
> SQLContext ctx = new SQLContext(sctx);
> DataFrame frame = ctx.read().json(facebookJSON);
> frame.printSchema();
> {code}
> The exception is reproducible using the following JSON:
> {code}
> {
>"data": [
>   {
>  "id": "X999_Y999",
>  "from": {
> "name": "Tom Brady", "id": "X12"
>  },
>  "message": "Looking forward to 2010!",
>  "actions": [
> {
>"name": "Comment",
>"link": "http://www.facebook.com/X999/posts/Y999"
> },
> {
>"name": "Like",
>"link": "http://www.facebook.com/X999/posts/Y999"
> }
>  ],
>  "type": "status",
>  "created_time": "2010-08-02T21:27:44+",
>  "updated_time": "2010-08-02T21:27:44+"
>   },
>   {
>  "id": "X998_Y998",
>  "from": {
> "name": "Peyton Manning", "id": "X18"
>  },
>  "message": "Where's my contract?",
>  "actions": [
> {
>"name": "Comment",
>"link": "http://www.facebook.com/X998/posts/Y998"
> },
> {
>"name": "Like",
>"link": "http://www.facebook.com/X998/posts/Y998"
> }
>  ],
>  "type": "status",
>  "created_time": "2010-08-02T21:27:44+",
>  "updated_time": "2010-08-02T21:27:44+"
>   }
>]
> }
> {code}



--
This 

[jira] [Commented] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi

2015-09-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746365#comment-14746365
 ] 

Josh Rosen commented on SPARK-6513:
---

[~marmbrus], safe to say that this is "Won't Fix" given the RDDApi removal?

> Add zipWithUniqueId (and other RDD APIs) to RDDApi
> --
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Minor
>
> It would be nice if we could treat a DataFrame just like an RDD (wherever it 
> makes sense).
> *Worked in 1.2.1*
> {code}
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0* 
> {code}   
> jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
> {code}
> **Not working workaround:**
> although this might give me an {{RDD\[Row\]}}:
> {code}
> jsonResult.rdd.zipWithUniqueId()  
> {code}
> Now this won't work obviously since {{RDD\[Row\]}} does not have a 
> {{registerTempTable}} method of course
> {code}
>  foo.registerTempTable("...")
> {code}
> (see related SO question: 
> http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)
> EDIT: changed from issue to enhancement request 
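
A hedged sketch of a possible workaround on 1.3+ (the column name uniqueId is made up): zip on the underlying RDD, append the id to each Row, and rebuild a DataFrame so registerTempTable is available again:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// jsonResult is the DataFrame from the snippet above.
val withId = jsonResult.rdd.zipWithUniqueId().map { case (row, id) =>
  Row.fromSeq(row.toSeq :+ id)
}
val schema = StructType(jsonResult.schema.fields :+ StructField("uniqueId", LongType))
val dfWithId = sqlContext.createDataFrame(withId, schema)
dfWithId.registerTempTable("jsonTableWithId")
{code}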



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6942) Umbrella: UI Visualizations for Core and Dataframes

2015-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6942:
---
Assignee: Andrew Or  (was: Patrick Wendell)

> Umbrella: UI Visualizations for Core and Dataframes 
> 
>
> Key: SPARK-6942
> URL: https://issues.apache.org/jira/browse/SPARK-6942
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, SQL, Web UI
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.5.0
>
>
> This is an umbrella issue for the assorted visualization proposals for 
> Spark's UI. The scope will likely cover Spark 1.4 and 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7175) Upgrade Hive to 1.1.0

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-7175.
---
Resolution: Duplicate

We added support for connecting to Hive 1.1 in SPARK-8067

> Upgrade Hive to 1.1.0
> -
>
> Key: SPARK-7175
> URL: https://issues.apache.org/jira/browse/SPARK-7175
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Punya Biswal
>
> Spark SQL currently supports Hive 0.13 (June 2014), but the latest version of 
> Hive is 1.1.0 (March 2015). Among other improvements, it includes new UDFs 
> for date manipulation that I'd like to avoid rebuilding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9033) scala.MatchError: interface java.util.Map (of class java.lang.Class) with Spark SQL

2015-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9033:
--
Description: 
I have a java.util.Map field in a POJO class and I'm trying to 
use it with createDataFrame (1.3.1) / applySchema (1.2.2) on the SQLContext, and 
I get the following error in both the 1.2.2 and 1.3.1 versions of Spark SQL:

*sample code:

{code}
SQLContext sqlCtx = new SQLContext(sc.sc());
JavaRDD rdd = sc.textFile("/path").map(line-> Event.fromString(line)); 
// the text line is split and assigned to the respective fields of the Event class here
DataFrame schemaRDD = sqlCtx.createDataFrame(rdd, Event.class); // <-- error thrown here
schemaRDD.registerTempTable("events");
{code}


The Event class is Serializable and contains a field of type java.util.Map. This issue also occurs with Spark Streaming when used with SQL.

{code}
JavaDStream receiverStream = jssc.receiverStream(new 
StreamingReceiver());
JavaDStream windowDStream = receiverStream.window(WINDOW_LENGTH, 
SLIDE_INTERVAL);
jssc.checkpoint("event-streaming");

windowDStream.foreachRDD(evRDD -> {
   if(evRDD.count() == 0) return null;

DataFrame schemaRDD = sqlCtx.createDataFrame(evRDD, Event.class);
schemaRDD.registerTempTable("events");
...
}
{code}


*error:
{code}
scala.MatchError: interface java.util.Map (of class java.lang.Class)
at 
org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
 ~[spark-sql_2.10-1.3.1.jar:1.3.1]
at 
org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
 ~[spark-sql_2.10-1.3.1.jar:1.3.1]
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 ~[scala-library-2.10.5.jar:na]
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 ~[scala-library-2.10.5.jar:na]
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 ~[scala-library-2.10.5.jar:na]
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
~[scala-library-2.10.5.jar:na]
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) 
~[scala-library-2.10.5.jar:na]
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) 
~[scala-library-2.10.5.jar:na]
at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) 
~[spark-sql_2.10-1.3.1.jar:1.3.1]
at 
org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) 
~[spark-sql_2.10-1.3.1.jar:1.3.1]
at 
org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) 
~[spark-sql_2.10-1.3.1.jar:1.3.1]
{code}


**also this occurs for fields of custom POJO classes:
{code}
scala.MatchError: class com.test.MyClass (of class java.lang.Class)
at 
org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
 ~[spark-sql_2.10-1.3.1.jar:1.3.1]
at 
org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
 ~[spark-sql_2.10-1.3.1.jar:1.3.1]
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 ~[scala-library-2.10.5.jar:na]
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 ~[scala-library-2.10.5.jar:na]
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 ~[scala-library-2.10.5.jar:na]
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
~[scala-library-2.10.5.jar:na]
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) 
~[scala-library-2.10.5.jar:na]
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) 
~[scala-library-2.10.5.jar:na]
at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192) 
~[spark-sql_2.10-1.3.1.jar:1.3.1]
at 
org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437) 
~[spark-sql_2.10-1.3.1.jar:1.3.1]
at 
org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465) 
~[spark-sql_2.10-1.3.1.jar:1.3.1]
{code}

**also occurs for Calendar  type:

{code}
scala.MatchError: class java.util.Calendar (of class java.lang.Class)
at 
org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
 ~[spark-sql_2.10-1.3.1.jar:1.3.1]
at 
org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
 ~[spark-sql_2.10-1.3.1.jar:1.3.1]
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 ~[scala-library-2.10.5.jar:na]
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 ~[scala-library-2.10.5.jar:na]
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 ~[scala-library-2.10.5.jar:na]
at 

[jira] [Created] (SPARK-10609) Improve task distribution strategy in taskSetManager

2015-09-15 Thread Zhang, Liye (JIRA)
Zhang, Liye created SPARK-10609:
---

 Summary: Improve task distribution strategy in taskSetManager
 Key: SPARK-10609
 URL: https://issues.apache.org/jira/browse/SPARK-10609
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Zhang, Liye


The current strategy cannot handle some bad cases properly, especially when 
reduce-task locality is enabled. We should improve the waiting strategy for 
pending tasks when there are still free cores in the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10610) Using AppName instead AppId in the name of all metrics

2015-09-15 Thread Yi Tian (JIRA)
Yi Tian created SPARK-10610:
---

 Summary: Using AppName instead AppId in the name of all metrics
 Key: SPARK-10610
 URL: https://issues.apache.org/jira/browse/SPARK-10610
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Yi Tian
Priority: Minor


When we use {{JMX}} to monitor a Spark system, we have to configure the names 
of the target metrics in the monitoring system. But the current metric name is 
{{appId}} + {{executorId}} + {{source}}, so whenever the Spark program restarts 
we have to update the metric names in the monitoring system.

We should add an optional configuration property to control whether the appName 
is used instead of the appId in the Spark metrics system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Nalia Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744929#comment-14744929
 ] 

Nalia Tang commented on SPARK-10466:


I'm not sure whether this is the right place to ask, or whether asking in Chinese is appropriate.
When I use decode, the program always reports:
 org.apache.spark.sql.AnalysisException: Invalid number of arguments for 
function decode;


Here is how I am using it:
DataFrame iamMx2 = sqlContext.sql("select DECODE(BILL_REF_NO,3,1,1) from 
iamMx0 ");

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if we have a data spill it causes an assertion failure; 
> the following code can reproduce it:
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> 

[jira] [Commented] (SPARK-10590) Spark with YARN build is broken

2015-09-15 Thread Kevin Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744927#comment-14744927
 ] 

Kevin Tsai commented on SPARK-10590:


Hi Sean, Reynold, 
Thanks, It's my bad.
This PR solved this issue.
https://github.com/apache/spark/pull/2883/files


> Spark with YARN build is broken
> ---
>
> Key: SPARK-10590
> URL: https://issues.apache.org/jira/browse/SPARK-10590
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: CentOS 6.5
> Oracle JDK 1.7.0_75
> Maven 3.3.3
> Hadoop 2.6.0
> Spark 1.5.0
>Reporter: Kevin Tsai
>
> Hi, after upgrading to v1.5.0 and trying to build it, it shows:
> [ERROR] missing or invalid dependency detected while loading class file 
> 'WebUI.class'
> It was working on Spark 1.4.1
> Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
> -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
> Hope it helps.
> Regards,
> Kevin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10611) Configuration object thread safety issue in NewHadoopRDD

2015-09-15 Thread Mingyu Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingyu Kim updated SPARK-10611:
---
Attachment: Screen Shot 2015-09-09 at 12.58.13 PM.png

> Configuration object thread safety issue in NewHadoopRDD
> 
>
> Key: SPARK-10611
> URL: https://issues.apache.org/jira/browse/SPARK-10611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Mingyu Kim
>Priority: Critical
>
> SPARK-2546 fixed the issue for HadoopRDD, but the fix is not ported over to 
> NewHadoopRDD. The screenshot of the stacktrace is attached, and it's very 
> similar to what's reported in SPARK-2546.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10611) Configuration object thread safety issue in NewHadoopRDD

2015-09-15 Thread Mingyu Kim (JIRA)
Mingyu Kim created SPARK-10611:
--

 Summary: Configuration object thread safety issue in NewHadoopRDD
 Key: SPARK-10611
 URL: https://issues.apache.org/jira/browse/SPARK-10611
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Mingyu Kim
Priority: Critical


SPARK-2546 fixed the issue for HadoopRDD, but the fix is not ported over to 
NewHadoopRDD. The screenshot of the stacktrace is attached, and it's very 
similar to what's reported in SPARK-2546.
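
For illustration only (not the actual HadoopRDD/NewHadoopRDD code), the shape of the fix SPARK-2546 applied: clone the shared Configuration under a global lock so concurrent tasks never mutate one instance:

{code}
import org.apache.hadoop.conf.Configuration

object ConfCloneSketch {
  // Cloning a Configuration is itself not thread-safe when the source is being
  // modified concurrently, hence the global lock around the copy constructor.
  private val CONFIGURATION_INSTANTIATION_LOCK = new Object()

  def cloneConf(conf: Configuration): Configuration =
    CONFIGURATION_INSTANTIATION_LOCK.synchronized { new Configuration(conf) }
}
{code}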



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10611) Configuration object thread safety issue in NewHadoopRDD

2015-09-15 Thread Mingyu Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingyu Kim updated SPARK-10611:
---
Attachment: (was: Screen Shot 2015-09-09 at 12.58.13 PM.png)

> Configuration object thread safety issue in NewHadoopRDD
> 
>
> Key: SPARK-10611
> URL: https://issues.apache.org/jira/browse/SPARK-10611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Mingyu Kim
>Priority: Critical
>
> SPARK-2546 fixed the issue for HadoopRDD, but the fix is not ported over to 
> NewHadoopRDD. The screenshot of the stacktrace is attached, and it's very 
> similar to what's reported in SPARK-2546.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10612) Add prepare to LocalNode

2015-09-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-10612:
---

 Summary: Add prepare to LocalNode
 Key: SPARK-10612
 URL: https://issues.apache.org/jira/browse/SPARK-10612
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


The idea is that we should separate the function call that does memory 
reservation (i.e. prepare()) from the function call that consumes the input (e.g. 
open()), so that every operator gets a chance to reserve memory before any input 
is consumed.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10608) turn off reduce tasks locality as default to avoid bad cases

2015-09-15 Thread Zhang, Liye (JIRA)
Zhang, Liye created SPARK-10608:
---

 Summary: turn off reduce tasks locality as default to avoid bad 
cases
 Key: SPARK-10608
 URL: https://issues.apache.org/jira/browse/SPARK-10608
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Zhang, Liye
Priority: Critical


After [SPARK-2774|https://issues.apache.org/jira/browse/SPARK-2774], which aims 
to reduce network transfer, reduce tasks have their own locality instead of 
following the map-side locality. This leads to some bad cases when data skew 
happens: in some cases tasks keep being scheduled on the same few nodes and are 
never distributed evenly.
e.g. If we do not set *spark.scheduler.minRegisteredExecutorsRatio*, the input 
data may only be loaded on part of the nodes, say 4 nodes out of 10. This makes 
the first batch of reduce tasks run on those 4 nodes, with many pending tasks 
waiting to be distributed. It might be fine if the tasks run for a long time, 
but if the tasks are short, for example shorter than *spark.locality.wait*, then 
the locality level never drops to a lower level and the following batches of 
tasks still run on the 4 nodes. This ends with all following tasks running on 
the 4 nodes instead of 10. Even if the tasks become evenly distributed after 
several stages, the unbalanced task distribution at the beginning exhausts 
resources on some nodes first and causes more frequent GC, which leads to bad 
performance.
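
A hedged sketch of the mitigations this implies (values are illustrative; the reduce-locality switch named below is the flag I believe SPARK-2774 introduced, so verify it exists in your build before relying on it):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Turn off reduce-task locality so reducers are not pinned to the map-heavy nodes.
  .set("spark.shuffle.reduceLocality.enabled", "false")
  // Optional: a shorter locality wait so short pending tasks fall back to other nodes sooner.
  .set("spark.locality.wait", "1s")
{code}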



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10590) Spark with YARN build is broken

2015-09-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744931#comment-14744931
 ] 

Reynold Xin commented on SPARK-10590:
-

Ok glad it worked!


> Spark with YARN build is broken
> ---
>
> Key: SPARK-10590
> URL: https://issues.apache.org/jira/browse/SPARK-10590
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: CentOS 6.5
> Oracle JDK 1.7.0_75
> Maven 3.3.3
> Hadoop 2.6.0
> Spark 1.5.0
>Reporter: Kevin Tsai
>
> Hi, after upgrading to v1.5.0 and trying to build it, it shows:
> [ERROR] missing or invalid dependency detected while loading class file 
> 'WebUI.class'
> It was working on Spark 1.4.1
> Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
> -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
> Hope it helps.
> Regards,
> Kevin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744969#comment-14744969
 ] 

Cheng Hao commented on SPARK-10474:
---

The root cause of the exception is that the executor doesn't have enough memory 
for external sorting (UnsafeXXXSorter).
The memory available for the sorting is MAX_JVM_HEAP * 
spark.shuffle.memoryFraction * spark.shuffle.safetyFraction.

So a workaround is to give the JVM more memory, or to raise the Spark conf keys 
"spark.shuffle.memoryFraction" (0.2 by default) and 
"spark.shuffle.safetyFraction" (0.8 by default).


> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.

[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744966#comment-14744966
 ] 

Cheng Hao commented on SPARK-10466:
---

[~naliazheli] This is an unrelated issue; you'd better subscribe to the Spark 
mailing list and ask your question there in English. 
See http://spark.apache.org/community.html

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if there is a data spill it will cause an assertion 
> failure; the following code can reproduce it:
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
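> A standalone approximation of the snippet above ({{withSparkConf}}, {{withSQLConf}} 
> and {{withTempTable}} are helpers from Spark's SQL test harness, so they are not 
> available in a plain shell); the setup below is only a sketch for a 1.5.x build, 
> not code taken from the ticket.
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
>
> // Mirror the two conf overrides used by the test helpers above: force the
> // sort-based shuffle path and disable broadcast joins.
> val conf = new SparkConf()
>   .setMaster("local[2]")
>   .setAppName("SPARK-10466-repro")
>   .set("spark.shuffle.sort.bypassMergeThreshold", "2")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "0")
> import sqlContext.implicits._
>
> sc.parallelize(1 to 1000, 3).map(i => (i, i)).toDF("key", "value").registerTempTable("mytemp")
> sqlContext.sql("select key, value as v1 from mytemp where key > 1").registerTempTable("l")
> sqlContext.sql("select key, value as v2 from mytemp where key > 3").registerTempTable("r")
> sqlContext.sql("select v1, v2 from l left join r on l.key = r.key").count()
> {code}
> The resulting assertion failure: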
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> 

[jira] [Updated] (SPARK-10611) Configuration object thread safety issue in NewHadoopRDD

2015-09-15 Thread Mingyu Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingyu Kim updated SPARK-10611:
---
Attachment: Screen Shot 2015-09-15 at 12.20.38 AM.png

> Configuration object thread safety issue in NewHadoopRDD
> 
>
> Key: SPARK-10611
> URL: https://issues.apache.org/jira/browse/SPARK-10611
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Mingyu Kim
>Priority: Critical
> Attachments: Screen Shot 2015-09-15 at 12.20.38 AM.png
>
>
> SPARK-2546 fixed the issue for HadoopRDD, but the fix was never ported over to 
> NewHadoopRDD. A screenshot of the stack trace is attached; it is very similar to 
> what is reported in SPARK-2546.
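> A minimal sketch of the kind of change being asked for, assuming the NewHadoopRDD 
> fix follows the same approach SPARK-2546 used for HadoopRDD (clone the shared 
> {{Configuration}} under a global lock so concurrent tasks never race on one 
> instance); the object and method names here are illustrative, not Spark's code.
> {code}
> import org.apache.hadoop.conf.Configuration
>
> object ConfCloner {
>   // Configuration's constructor (including the copy constructor) is not
>   // thread-safe, so clones are serialized behind a single lock.
>   private val CONFIGURATION_INSTANTIATION_LOCK = new Object()
>
>   def threadSafeClone(shared: Configuration): Configuration =
>     CONFIGURATION_INSTANTIATION_LOCK.synchronized {
>       new Configuration(shared)
>     }
> }
>
> // Per-partition usage sketch: hand each task its own copy instead of the shared
> // Configuration (broadcastConf is a placeholder name).
> // val conf = ConfCloner.threadSafeClone(broadcastConf.value)
> {code}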



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745008#comment-14745008
 ] 

Cheng Hao commented on SPARK-10474:
---

But given the current implementation, we'd better not throw an exception when the 
requested (off-heap) memory cannot be acquired; maybe we should use fixed memory 
allocations for both the data page and the pointer array. What do you think?
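A rough sketch of that fixed-allocation idea, just to make the suggestion concrete: 
reserve both the data page and the pointer array when the sorter is created, so a 
later spill can never fail half-way with "Could not acquire N bytes of memory". All 
names below are hypothetical; this is not the actual TungstenAggregate/sorter code.

{code}
// Hypothetical helper: both regions are sized once, up front.
class FixedSorterBuffers(pageSizeBytes: Int, pointerCapacity: Int) {
  // Allocating at construction time moves any failure to a single, predictable
  // point instead of surfacing it inside spill() under memory pressure.
  val dataPage: Array[Byte] = new Array[Byte](pageSizeBytes)
  val pointerArray: Array[Long] = new Array[Long](pointerCapacity)
}

// Usage sketch: size the buffers from configuration once per task.
// 64 KB matches the allocation that failed in the stack trace above.
val buffers = new FixedSorterBuffers(pageSizeBytes = 64 * 1024, pointerCapacity = 8192)
{code}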

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In an aggregation case, a lost task failed with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.
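> For reference, the conditional-count pattern used above (count(CASE WHEN ... THEN 1 
> ELSE NULL END)) can be written with the DataFrame API roughly as follows; this is 
> only an illustrative fragment assuming a {{sqlContext}} with the two tables 
> registered, not code from the ticket.
> {code}
> import org.apache.spark.sql.functions.{col, count, when}
>
> val ss = sqlContext.table("store_sales")
> val i  = sqlContext.table("item")
>
> // when(...) with no otherwise() yields null when the condition is false,
> // and count(...) skips nulls, so each column counts one i_class_id bucket.
> val result = ss.join(i, ss("ss_item_sk") === i("i_item_sk"))
>   .where(col("i_category").isin("Books") && col("ss_customer_sk").isNotNull)
>   .groupBy(col("ss_customer_sk"))
>   .agg(
>     count(when(col("i_class_id") === 1, 1)).as("id1"),
>     count(when(col("i_class_id") === 3, 1)).as("id3")
>     // ... remaining i_class_id buckets, the cid alias and the HAVING filter elided ...
>   )
> {code}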



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10598.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> RoutingTablePartition toMessage method refers to bytes instead of bits
> --
>
> Key: SPARK-10598
> URL: https://issues.apache.org/jira/browse/SPARK-10598
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Robin East
>Assignee: Robin East
>Priority: Trivial
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8426) Add blacklist mechanism for YARN container allocation

2015-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8426:
---

Assignee: Apache Spark

> Add blacklist mechanism for YARN container allocation
> -
>
> Key: SPARK-8426
> URL: https://issues.apache.org/jira/browse/SPARK-8426
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8426) Add blacklist mechanism for YARN container allocation

2015-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8426:
---

Assignee: (was: Apache Spark)

> Add blacklist mechanism for YARN container allocation
> -
>
> Key: SPARK-8426
> URL: https://issues.apache.org/jira/browse/SPARK-8426
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8426) Add blacklist mechanism for YARN container allocation

2015-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744952#comment-14744952
 ] 

Apache Spark commented on SPARK-8426:
-

User 'mwws' has created a pull request for this issue:
https://github.com/apache/spark/pull/8760

> Add blacklist mechanism for YARN container allocation
> -
>
> Key: SPARK-8426
> URL: https://issues.apache.org/jira/browse/SPARK-8426
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744967#comment-14744967
 ] 

Cheng Hao commented on SPARK-10466:
---

[~naliazheli] This is an unrelated issue; you'd better subscribe to the Spark 
mailing list and ask your question there in English. 
See http://spark.apache.org/community.html

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if there is a data spill it will cause an assertion 
> failure; the following code can reproduce it:
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-10466:
--
Comment: was deleted

(was: [~naliazheli] This is an unrelated issue; you'd better subscribe to the 
Spark mailing list and ask your question there in English. 
See http://spark.apache.org/community.html)

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if there is a data spill it will cause an assertion 
> failure; the following code can reproduce it:
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> 

[jira] [Assigned] (SPARK-10612) Add prepare to LocalNode

2015-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10612:


Assignee: Apache Spark  (was: Reynold Xin)

> Add prepare to LocalNode
> 
>
> Key: SPARK-10612
> URL: https://issues.apache.org/jira/browse/SPARK-10612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> The idea is that we should separate the function call that does memory 
> reservation (i.e. prepare()) from the function call that consumes the input 
> (e.g. open()), so all operators get a chance to reserve memory before any of 
> them starts consuming its input.
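> A simplified sketch of that two-phase contract (illustrative only, not the actual 
> LocalNode code — the real class has more methods and uses Spark's memory manager 
> rather than plain arrays):
> {code}
> trait TwoPhaseNode {
>   // Phase 1: walk the whole operator tree and let every node reserve the memory
>   // it will need, before a single input row is consumed.
>   def prepare(): Unit
>
>   // Phase 2: start consuming input, using only memory reserved in prepare().
>   def open(): Unit
> }
>
> class SortNode(child: TwoPhaseNode) extends TwoPhaseNode {
>   private var buffer: Array[Long] = _
>
>   override def prepare(): Unit = {
>     child.prepare()                    // downstream operators reserve first
>     buffer = new Array[Long](1 << 20)  // then this operator's own reservation
>   }
>
>   override def open(): Unit = {
>     child.open()
>     // pull rows from the child and sort them into the pre-reserved buffer ...
>   }
> }
> {code}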



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10612) Add prepare to LocalNode

2015-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745015#comment-14745015
 ] 

Apache Spark commented on SPARK-10612:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8761

> Add prepare to LocalNode
> 
>
> Key: SPARK-10612
> URL: https://issues.apache.org/jira/browse/SPARK-10612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The idea is that we should separate the function call that does memory 
> reservation (i.e. prepare()) from the function call that consumes the input 
> (e.g. open()), so all operators get a chance to reserve memory before any of 
> them starts consuming its input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10612) Add prepare to LocalNode

2015-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10612:


Assignee: Reynold Xin  (was: Apache Spark)

> Add prepare to LocalNode
> 
>
> Key: SPARK-10612
> URL: https://issues.apache.org/jira/browse/SPARK-10612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The idea is that we should separate the function call that does memory 
> reservation (i.e. prepare()) from the function call that consumes the input 
> (e.g. open()), so all operators get a chance to reserve memory before any of 
> them starts consuming its input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


