[jira] [Commented] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2016-02-05 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135630#comment-15135630
 ] 

Jason C Lee commented on SPARK-12519:
-

The memory leak in question comes from a TungstenAggregationIterator's hashmap, 
which is normally freed at the end of iterating through the list. However, show() 
by default takes only the first 20 items and usually never reaches the end of the 
list, so the hashmap never gets freed. Any ideas on how to fix this are welcome.
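
To illustrate the pattern with a minimal, self-contained Scala sketch (a 
hypothetical iterator, not Spark's actual TungstenAggregationIterator): cleanup 
that is tied to exhausting the iterator never runs when the consumer stops early, 
which is exactly what show()'s take-the-first-20 behaviour does.

{code}
// Hypothetical illustration only.
class LeakyIterator(rows: Seq[String]) extends Iterator[String] {
  private var buffer: Array[Byte] = new Array[Byte](16 * 1024 * 1024) // stand-in for managed memory
  private val underlying = rows.iterator

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) free()          // freed only when the caller actually reaches the end
    more
  }
  override def next(): String = underlying.next()
  def free(): Unit = { buffer = null }
}

val it = new LeakyIterator(Seq.fill(100)("row"))
val first20 = it.take(20).toList  // like show(): stops after 20, hasNext never sees the end
// `buffer` is still referenced here -> the iterator itself never releases the memory
{code}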

> "Managed memory leak detected" when using distinct on PySpark DataFrame
> ---
>
> Key: SPARK-12519
> URL: https://issues.apache.org/jira/browse/SPARK-12519
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: OS X 10.9.5, Java 1.8.0_66
>Reporter: Paul Shearer
>
> After running the distinct() method to transform a DataFrame, subsequent 
> actions like count() and show() may report a managed memory leak. Here is a 
> minimal example that reproduces the bug on my machine:
> h1. Script
> {noformat}
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org"). setLevel( logger.Level.WARN )
> logger.LogManager.getLogger("akka").setLevel( logger.Level.WARN )
> import string
> import random
> def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
>     return ''.join(random.choice(chars) for _ in range(size))
> nrow   = 8
> ncol   = 20
> ndrow  = 4 # number distinct rows
> tmp = [id_generator() for i in xrange(ndrow*ncol)]
> tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]
> dat = sc.parallelize(tmp,1000).toDF()
> dat = dat.distinct() # if this line is commented out, no memory leak will be reported
> # dat = dat.rdd.distinct().toDF() # if this line is used instead of the above, no leak
> ct = dat.count()
> print ct  
> # memory leak warning prints at this point in the code
> dat.show()  
> {noformat}
> h1. Output
> When this script is run in PySpark (with IPython kernel), I get this error:
> {noformat}
> $ pyspark --executor-memory 12G --driver-memory 12G
> Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
> Type "copyright", "credits" or "license" for more information.
> IPython 4.0.0 -- An enhanced Interactive Python.
> ? -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help  -> Python's own help system.
> object?   -> Details about 'object', use 'object??' for extra details.
> <<<... usual loading info...>>>
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
>       /_/
> Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
> SparkContext available as sc, SQLContext available as sqlContext.
> In [1]: execfile('bugtest.py')
> 4
> 15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 
> 16777216 bytes, TID = 2202
> +------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
> |    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
> +------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
> |83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
> |BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
> |IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
> |S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
> |U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
> |3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
> |XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
> |4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
> |USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
> |0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B

[jira] [Updated] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown

2016-02-05 Thread Mao, Wei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mao, Wei updated SPARK-13222:
-
Description: The checkpoint interval of a DStream can be customized to a 
multiple of the batch interval, so when the user calls shutdown it may fall 
between two RDD checkpoints. If the input source is not repeatable and the user 
does not want to enable WAL because of its extra cost, the application could be 
unrecoverable.  (was: In case the input source is not repeatable and the user 
does not want to enable WAL because of the overhead, we need to make sure the 
latest state of the stateful RDD is checkpointed.)
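
For context, here is a minimal Scala sketch of the scenario under discussion; 
the checkpoint directory, the socket source, and the state function are 
illustrative placeholders. The checkpoint interval is a multiple of the batch 
interval, so a graceful stop can land between two RDD checkpoints.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-interval-example")
val ssc  = new StreamingContext(conf, Seconds(5))      // batch interval = 5s
ssc.checkpoint("/tmp/checkpoints")                     // illustrative checkpoint dir

val lines  = ssc.socketTextStream("localhost", 9999)   // a non-repeatable input
val counts = lines.map((_, 1L)).updateStateByKey[Long] { (values, state) =>
  Some(state.getOrElse(0L) + values.size)
}
counts.checkpoint(Seconds(50))                         // checkpoint every 10th batch
counts.print()

ssc.start()
// A graceful shutdown may happen anywhere inside the 50s window, i.e. after the
// state has moved on but before the next RDD checkpoint was written.
ssc.stop(stopSparkContext = true, stopGracefully = true)
{code}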

> checkpoint latest stateful RDD on graceful shutdown
> ---
>
> Key: SPARK-13222
> URL: https://issues.apache.org/jira/browse/SPARK-13222
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Mao, Wei
>
> The checkpoint interval of a DStream can be customized to a multiple of the 
> batch interval, so when the user calls shutdown it may fall between two RDD 
> checkpoints. If the input source is not repeatable and the user does not want 
> to enable WAL because of its extra cost, the application could be 
> unrecoverable.






[jira] [Assigned] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13222:


Assignee: (was: Apache Spark)

> checkpoint latest stateful RDD on graceful shutdown
> ---
>
> Key: SPARK-13222
> URL: https://issues.apache.org/jira/browse/SPARK-13222
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Mao, Wei
>
> In case the input source is not repeatable and the user does not want to enable 
> WAL because of the overhead, we need to make sure the latest state of the 
> stateful RDD is checkpointed.






[jira] [Commented] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown

2016-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135611#comment-15135611
 ] 

Apache Spark commented on SPARK-13222:
--

User 'mwws' has created a pull request for this issue:
https://github.com/apache/spark/pull/11101

> checkpoint latest stateful RDD on graceful shutdown
> ---
>
> Key: SPARK-13222
> URL: https://issues.apache.org/jira/browse/SPARK-13222
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Mao, Wei
>
> In case the input source is not repeatable and the user does not want to enable 
> WAL because of the overhead, we need to make sure the latest state of the 
> stateful RDD is checkpointed.






[jira] [Assigned] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13222:


Assignee: Apache Spark

> checkpoint latest stateful RDD on graceful shutdown
> ---
>
> Key: SPARK-13222
> URL: https://issues.apache.org/jira/browse/SPARK-13222
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Mao, Wei
>Assignee: Apache Spark
>
> In case the input source is not repeatable and the user does not want to enable 
> WAL because of the overhead, we need to make sure the latest state of the 
> stateful RDD is checkpointed.






[jira] [Updated] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown

2016-02-05 Thread Mao, Wei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mao, Wei updated SPARK-13222:
-
Description: In case the input source is not repeatable and the user does not 
want to enable WAL because of the overhead, we need to make sure the latest 
state of the stateful RDD is checkpointed.

> checkpoint latest stateful RDD on graceful shutdown
> ---
>
> Key: SPARK-13222
> URL: https://issues.apache.org/jira/browse/SPARK-13222
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Mao, Wei
>
> In case the input source is not repeatable and the user does not want to enable 
> WAL because of the overhead, we need to make sure the latest state of the 
> stateful RDD is checkpointed.






[jira] [Updated] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown

2016-02-05 Thread Mao, Wei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mao, Wei updated SPARK-13222:
-
Component/s: Streaming

> checkpoint latest stateful RDD on graceful shutdown
> ---
>
> Key: SPARK-13222
> URL: https://issues.apache.org/jira/browse/SPARK-13222
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Mao, Wei
>







[jira] [Created] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown

2016-02-05 Thread Mao, Wei (JIRA)
Mao, Wei created SPARK-13222:


 Summary: checkpoint latest stateful RDD on graceful shutdown
 Key: SPARK-13222
 URL: https://issues.apache.org/jira/browse/SPARK-13222
 Project: Spark
  Issue Type: Bug
Reporter: Mao, Wei









[jira] [Resolved] (SPARK-13171) Update promise & future to Promise and Future as the old ones are deprecated

2016-02-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13171.
-
   Resolution: Fixed
 Assignee: Jakob Odersky
Fix Version/s: 2.0.0

> Update promise & future to Promise and Future as the old ones are deprecated
> 
>
> Key: SPARK-13171
> URL: https://issues.apache.org/jira/browse/SPARK-13171
> Project: Spark
>  Issue Type: Sub-task
>Reporter: holdenk
>Assignee: Jakob Odersky
>Priority: Trivial
> Fix For: 2.0.0
>
>
> We use the promise and future functions on the concurrent object, both of 
> which have been deprecated in Scala 2.11. The full traits are present in Scala 
> 2.10 as well, so this should be a safe migration.






[jira] [Commented] (SPARK-11725) Let UDF to handle null value

2016-02-05 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135447#comment-15135447
 ] 

Cristian Opris commented on SPARK-11725:


You're right, sorry, it seems I lost track of which Scala weirdness applies 
here:

{code}
scala> def i : Int = (null : java.lang.Integer)
i: Int

scala> i == null
<console>:22: warning: comparing values of types Int and Null using `==' will always yield false
       i == null
         ^
java.lang.NullPointerException
at scala.Predef$.Integer2int(Predef.scala:392)
{code}

At least the previous behaviour of passing 0 as default value was consistent 
with this specific Scala weirdness:

{code}
scala> val i: Int  = null.asInstanceOf[Int]
i: Int = 0
{code}

But the current behaviour of "null propagation" seems even more confusing 
semantically: for example, one can have a UDF with multiple arguments, and 
simply not calling the UDF when any of the arguments is null hardly makes 
sense semantically.

{code}
def udf(i : Int, l: Long, s: String) = ...

sql("select udf(i, l, s) from df") 
{code}

I understand the need not to break existing code, but a clearer and better 
documented UDF spec, perhaps using Option, would really help here.

> Let UDF to handle null value
> 
>
> Key: SPARK-11725
> URL: https://issues.apache.org/jira/browse/SPARK-11725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jeff Zhang
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> I notice that currently spark will take the long field as -1 if it is null.
> Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
>  Output ///
> ++---++
> | age|   name|age2|
> ++---++
> |null|Michael|   0|
> |  30|   Andy|  31|
> |  19| Justin|  20|
> ++---++
> {code}
> I think for the null value we have 3 options:
> * Use a special value to represent it (what Spark does now)
> * Always return null if any UDF input argument is null
> * Let the UDF itself handle null
> I would prefer the third option.






[jira] [Assigned] (SPARK-13221) GroupingSets Returns an Incorrect Results

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13221:


Assignee: Apache Spark

> GroupingSets Returns an Incorrect Results
> -
>
> Key: SPARK-13221
> URL: https://issues.apache.org/jira/browse/SPARK-13221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>
> The following query returns a wrong result:
> {code}
> sql("select course, sum(earnings) as sum from courseSales group by course, 
> earnings" +
>  " grouping sets((), (course), (course, earnings))" +
>  " order by course, sum").show()
> {code}
> Before the fix, the results are like
> {code}
> [null,null]
> [Java,null]
> [Java,20000.0]
> [Java,30000.0]
> [dotNET,null]
> [dotNET,5000.0]
> [dotNET,10000.0]
> [dotNET,48000.0]
> {code}
> After the fix, the results are corrected:
> {code}
> [null,113000.0]
> [Java,20000.0]
> [Java,30000.0]
> [Java,50000.0]
> [dotNET,5000.0]
> [dotNET,10000.0]
> [dotNET,48000.0]
> [dotNET,63000.0]
> {code}






[jira] [Assigned] (SPARK-13221) GroupingSets Returns an Incorrect Results

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13221:


Assignee: (was: Apache Spark)

> GroupingSets Returns an Incorrect Results
> -
>
> Key: SPARK-13221
> URL: https://issues.apache.org/jira/browse/SPARK-13221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xiao Li
>Priority: Critical
>
> The following query returns a wrong result:
> {code}
> sql("select course, sum(earnings) as sum from courseSales group by course, 
> earnings" +
>  " grouping sets((), (course), (course, earnings))" +
>  " order by course, sum").show()
> {code}
> Before the fix, the results are like
> {code}
> [null,null]
> [Java,null]
> [Java,20000.0]
> [Java,30000.0]
> [dotNET,null]
> [dotNET,5000.0]
> [dotNET,10000.0]
> [dotNET,48000.0]
> {code}
> After the fix, the results are corrected:
> {code}
> [null,113000.0]
> [Java,20000.0]
> [Java,30000.0]
> [Java,50000.0]
> [dotNET,5000.0]
> [dotNET,10000.0]
> [dotNET,48000.0]
> [dotNET,63000.0]
> {code}






[jira] [Commented] (SPARK-13221) GroupingSets Returns an Incorrect Results

2016-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135446#comment-15135446
 ] 

Apache Spark commented on SPARK-13221:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11100

> GroupingSets Returns an Incorrect Results
> -
>
> Key: SPARK-13221
> URL: https://issues.apache.org/jira/browse/SPARK-13221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xiao Li
>Priority: Critical
>
> The following query returns a wrong result:
> {code}
> sql("select course, sum(earnings) as sum from courseSales group by course, 
> earnings" +
>  " grouping sets((), (course), (course, earnings))" +
>  " order by course, sum").show()
> {code}
> Before the fix, the results are like
> {code}
> [null,null]
> [Java,null]
> [Java,20000.0]
> [Java,30000.0]
> [dotNET,null]
> [dotNET,5000.0]
> [dotNET,10000.0]
> [dotNET,48000.0]
> {code}
> After the fix, the results are corrected:
> {code}
> [null,113000.0]
> [Java,20000.0]
> [Java,30000.0]
> [Java,50000.0]
> [dotNET,5000.0]
> [dotNET,10000.0]
> [dotNET,48000.0]
> [dotNET,63000.0]
> {code}






[jira] [Created] (SPARK-13221) GroupingSets Returns an Incorrect Results

2016-02-05 Thread Xiao Li (JIRA)
Xiao Li created SPARK-13221:
---

 Summary: GroupingSets Returns an Incorrect Results
 Key: SPARK-13221
 URL: https://issues.apache.org/jira/browse/SPARK-13221
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0, 2.0.0
Reporter: Xiao Li
Priority: Critical


The following query returns a wrong result:
{code}
sql("select course, sum(earnings) as sum from courseSales group by course, 
earnings" +
 " grouping sets((), (course), (course, earnings))" +
 " order by course, sum").show()
{code}
Before the fix, the results are like
{code}
[null,null]
[Java,null]
[Java,20000.0]
[Java,30000.0]
[dotNET,null]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]
{code}
After the fix, the results are corrected:
{code}
[null,113000.0]
[Java,20000.0]
[Java,30000.0]
[Java,50000.0]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]
[dotNET,63000.0]
{code}
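
For reference, a sketch that reproduces the query against locally constructed 
data in a standard shell session (sqlContext assumed); the individual earnings 
values are an assumption chosen so that the per-course subtotals (Java = 50000.0, 
dotNET = 63000.0) and the grand total (113000.0) match the output above, they are 
not taken from the actual test data.

{code}
import sqlContext.implicits._

// Assumed data, consistent with the sums shown above.
val courseSales = Seq(
  ("Java",   2012, 20000.0),
  ("Java",   2013, 30000.0),
  ("dotNET", 2012,  5000.0),
  ("dotNET", 2012, 10000.0),
  ("dotNET", 2013, 48000.0)
).toDF("course", "year", "earnings")
courseSales.registerTempTable("courseSales")

sqlContext.sql("select course, sum(earnings) as sum from courseSales group by course, earnings" +
  " grouping sets((), (course), (course, earnings))" +
  " order by course, sum").show()
{code}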







[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-02-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135386#comment-15135386
 ] 

Xiao Li commented on SPARK-12720:
-

The solution is done, but it hits a bug in grouping sets and is therefore 
blocked. I will open a JIRA and a PR to resolve that bug first.

{code}
sql("SELECT `key`, `value`, count( value ) AS `_c2` FROM `default`.`t1` " +
  "GROUP BY `key`, `value` GROUPING SETS ((), (`key`), (`key`, `value`))").show()
{code}
{code}
sql("SELECT `key`, `value`, count( value ) AS `_c2` FROM `default`.`t1` " +
  "GROUP BY `key`, `value` with rollup").show()
{code}

These two queries should return identical results, since ROLLUP on `key`, `value` 
is equivalent to GROUPING SETS ((), (`key`), (`key`, `value`)); however, they 
currently return different results.

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.






[jira] [Commented] (SPARK-11725) Let UDF to handle null value

2016-02-05 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135357#comment-15135357
 ] 

Jeff Zhang commented on SPARK-11725:


+1 for Option[T]

> Let UDF to handle null value
> 
>
> Key: SPARK-11725
> URL: https://issues.apache.org/jira/browse/SPARK-11725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jeff Zhang
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> I notice that currently spark will take the long field as -1 if it is null.
> Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
>  Output ///
> ++---++
> | age|   name|age2|
> ++---++
> |null|Michael|   0|
> |  30|   Andy|  31|
> |  19| Justin|  20|
> ++---++
> {code}
> I think for the null value we have 3 options:
> * Use a special value to represent it (what Spark does now)
> * Always return null if any UDF input argument is null
> * Let the UDF itself handle null
> I would prefer the third option.






[jira] [Assigned] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13068:


Assignee: (was: Apache Spark)

> Extend pyspark ml paramtype conversion to support lists
> ---
>
> Key: SPARK-13068
> URL: https://issues.apache.org/jira/browse/SPARK-13068
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> In SPARK-7675 we added type conversion for PySpark ML params. We should 
> follow up and support param type conversion for lists and nested structures 
> as required. This blocks having type information on all PySpark ML params.






[jira] [Assigned] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13068:


Assignee: Apache Spark

> Extend pyspark ml paramtype conversion to support lists
> ---
>
> Key: SPARK-13068
> URL: https://issues.apache.org/jira/browse/SPARK-13068
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Trivial
>
> In SPARK-7675 we added type conversion for PySpark ML params. We should 
> follow up and support param type conversion for lists and nested structures 
> as required. This blocks having type information on all PySpark ML params.






[jira] [Commented] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists

2016-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135330#comment-15135330
 ] 

Apache Spark commented on SPARK-13068:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/11099

> Extend pyspark ml paramtype conversion to support lists
> ---
>
> Key: SPARK-13068
> URL: https://issues.apache.org/jira/browse/SPARK-13068
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> In SPARK-7675 we added type conversion for PySpark ML params. We should 
> follow up and support param type conversion for lists and nested structures 
> as required. This blocks having type information on all PySpark ML params.






[jira] [Created] (SPARK-13220) Deprecate "yarn-client" and "yarn-cluster"

2016-02-05 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13220:
-

 Summary: Deprecate "yarn-client" and "yarn-cluster"
 Key: SPARK-13220
 URL: https://issues.apache.org/jira/browse/SPARK-13220
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Andrew Or


We currently allow `--master yarn-client`. Instead, the user should do 
`--master yarn --deploy-mode client` to be more explicit. This is more 
consistent with other cluster managers and obviates the need to do special 
parsing of the master string.
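
A programmatic sketch of the explicit form for comparison with the deprecated 
shorthand; the jar path and main class are placeholders, and this uses the 
launcher API rather than the spark-submit command line.

{code}
import org.apache.spark.launcher.SparkLauncher

// Explicit form: master and deploy mode are separate settings.
val app = new SparkLauncher()
  .setAppResource("/path/to/app.jar")   // illustrative
  .setMainClass("com.example.Main")     // illustrative
  .setMaster("yarn")
  .setDeployMode("client")              // instead of master = "yarn-client"
  .launch()
app.waitFor()
{code}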






[jira] [Commented] (SPARK-11725) Let UDF to handle null value

2016-02-05 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135264#comment-15135264
 ] 

Michael Armbrust commented on SPARK-11725:
--

[~onetoinfin...@yahoo.com], unfortunately I don't think that that is true.  
{{Long}} and other primitive types inherit from {{AnyVal}}, while only things 
that inherit from {{AnyRef}} can be {{null}}.  If you attempt a comparison 
between a primitive and null, the compiler will tell you this.

{code}
scala> 1 == null
<console>:8: warning: comparing values of types Int and Null using `==' will always yield false
       1 == null
{code}

{quote}
Throwing an error when a null value cannot be passed to a UDF that has been 
compiled to only accept nulls.
{quote}

While this might be reasonable in a greenfield situation, I don't think we can 
change semantics on our users like that.  We chose this semantic because it's 
pretty common for databases to use null for error conditions, rather than 
failing the whole query.

{quote}
Using {{Option\[T\]}} as a UDF arg to signal that the function accepts nulls.
{quote}

I like this idea and I actually expected it to work.  As you can see it already 
works in datasets:

{code}
Seq((1, new Integer(1)), (2, null)).toDF().as[(Int, Option[Int])].collect()
res0: Array[(Int, Option[Int])] = Array((1,Some(1)), (2,None))
{code}

We should definitely be using the same logic when converting arguments for UDFs.
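
In the meantime, a typed-Dataset sketch of those semantics, building on the 
example above (column names are illustrative, and this is a workaround rather 
than the registered-UDF code path): the function itself receives an Option and 
decides what a missing value means.

{code}
import sqlContext.implicits._

val ds = Seq((1, new Integer(1)), (2, null)).toDF("a", "b").as[(Int, Option[Int])]

// The "UDF" body handles null explicitly via Option.
val result = ds.map { case (a, b) => (a, b.map(_ + 1)) }
result.collect()  // expected: Array((1,Some(2)), (2,None))
{code}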

> Let UDF to handle null value
> 
>
> Key: SPARK-11725
> URL: https://issues.apache.org/jira/browse/SPARK-11725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jeff Zhang
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> I notice that currently spark will take the long field as -1 if it is null.
> Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
>  Output ///
> ++---++
> | age|   name|age2|
> ++---++
> |null|Michael|   0|
> |  30|   Andy|  31|
> |  19| Justin|  20|
> ++---++
> {code}
> I think for the null value we have 3 options:
> * Use a special value to represent it (what Spark does now)
> * Always return null if any UDF input argument is null
> * Let the UDF itself handle null
> I would prefer the third option.






[jira] [Commented] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!

2016-02-05 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135254#comment-15135254
 ] 

Jakob Odersky commented on SPARK-13176:
---

The PR I submitted uses java.nio.Files; it does not fix the underlying problem 
of ignoring specific deprecation warnings.

> Ignore deprecation warning for ProcessBuilder lines_!
> -
>
> Key: SPARK-13176
> URL: https://issues.apache.org/jira/browse/SPARK-13176
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> The replacement, stream_! & lineStream_!, is not present in the 2.10 API.
> Note: @SuppressWarnings for deprecation doesn't appear to work 
> (https://issues.scala-lang.org/browse/SI-7934), so suppressing the warnings 
> might involve wrapping or similar.






[jira] [Assigned] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13176:


Assignee: Apache Spark

> Ignore deprecation warning for ProcessBuilder lines_!
> -
>
> Key: SPARK-13176
> URL: https://issues.apache.org/jira/browse/SPARK-13176
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Trivial
>
> The replacement, stream_! & lineStream_!, is not present in the 2.10 API.
> Note: @SuppressWarnings for deprecation doesn't appear to work 
> (https://issues.scala-lang.org/browse/SI-7934), so suppressing the warnings 
> might involve wrapping or similar.






[jira] [Commented] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!

2016-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135251#comment-15135251
 ] 

Apache Spark commented on SPARK-13176:
--

User 'jodersky' has created a pull request for this issue:
https://github.com/apache/spark/pull/11098

> Ignore deprecation warning for ProcessBuilder lines_!
> -
>
> Key: SPARK-13176
> URL: https://issues.apache.org/jira/browse/SPARK-13176
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> The replacement, stream_! & lineStream_!, is not present in the 2.10 API.
> Note: @SuppressWarnings for deprecation doesn't appear to work 
> (https://issues.scala-lang.org/browse/SI-7934), so suppressing the warnings 
> might involve wrapping or similar.






[jira] [Assigned] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13176:


Assignee: (was: Apache Spark)

> Ignore deprecation warning for ProcessBuilder lines_!
> -
>
> Key: SPARK-13176
> URL: https://issues.apache.org/jira/browse/SPARK-13176
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> The replacement, stream_! & lineStream_!, is not present in the 2.10 API.
> Note: @SuppressWarnings for deprecation doesn't appear to work 
> (https://issues.scala-lang.org/browse/SI-7934), so suppressing the warnings 
> might involve wrapping or similar.






[jira] [Commented] (SPARK-13216) Spark streaming application not honoring --num-executors in restarting of an application from a checkpoint

2016-02-05 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135233#comment-15135233
 ] 

Hari Shreedharan commented on SPARK-13216:
--

I think this brings up a broader question - the set of parameters that we 
currently load from the new configuration is very limited. Perhaps we should go 
back and look at which parameters should be loaded from the checkpoint and 
which should not.

Eventually we would want to "tag" parameters as loadable from the checkpoint or 
not. Either way, we might want to override any parameters from the checkpoint 
(other than the ones we really cannot and should not) using the params the user 
supplies in the current run of the app.

> Spark streaming application not honoring --num-executors in restarting of an 
> application from a checkpoint
> --
>
> Key: SPARK-13216
> URL: https://issues.apache.org/jira/browse/SPARK-13216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Streaming
>Affects Versions: 1.5.0
>Reporter: Neelesh Srinivas Salian
>  Labels: Streaming
>
> Scenario to help understand:
> 1) The Spark streaming job with 12 executors was initiated with checkpointing 
> enabled.
> 2) In version 1.3, the user was able to increase the number of executors to 20 
> using --num-executors but was unable to do so in version 1.5.
> In 1.5, the Spark application still runs with 13 containers (1 for the driver 
> and 12 executors).
> There is a need to start from the checkpoint itself, rather than restarting the 
> application, to avoid the loss of information.
> 3) Checked the code in 1.3 and 1.5, which shows that the "--num-executors" 
> option has been deprecated.
> Any thoughts on this? Not sure if anyone hit this one specifically before.






[jira] [Commented] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!

2016-02-05 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135222#comment-15135222
 ] 

Jakob Odersky commented on SPARK-13176:
---

One of the places the process API is used is in creating symlinks. Since Spark 
requires at least Java 1.7, we can drop the use of external commands and rely 
on the java.nio.file.Files API instead.
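
For example, a small sketch of the replacement (the paths are illustrative), 
using the JDK API instead of shelling out to `ln -sf`:

{code}
import java.nio.file.{Files, Paths}

val target = Paths.get("/tmp/spark-example/target-file")
val link   = Paths.get("/tmp/spark-example/link-name")

Files.deleteIfExists(link)               // mimic `ln -sf`: replace an existing link
Files.createSymbolicLink(link, target)   // fails on filesystems without symlink support
{code}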

> Ignore deprecation warning for ProcessBuilder lines_!
> -
>
> Key: SPARK-13176
> URL: https://issues.apache.org/jira/browse/SPARK-13176
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> The replacement,  stream_! & lineStream_! is not present in 2.10 API.
> Note @SupressWarnings for deprecation doesn't appear to work 
> https://issues.scala-lang.org/browse/SI-7934 so suppressing the warnings 
> might involve wrapping or similar.






[jira] [Updated] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join

2016-02-05 Thread Abhinav Chawade (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhinav Chawade updated SPARK-13219:

Description: 
When 2 or more tables are joined in SparkSQL and the query has an equality 
clause on the attributes used to perform the join, it is useful to apply that 
clause to the scans of both tables. If this is not done, one of the tables ends 
up being fully scanned, which can slow the query down dramatically. Consider the 
following example with 2 tables being joined.

{code}
CREATE TABLE assets (
assetid int PRIMARY KEY,
address text,
propertyname text
)
CREATE TABLE tenants (
assetid int PRIMARY KEY,
name text
)
spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
t.assetid and t.assetid='1201';
WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
== Physical Plan ==
Project [name#14]
 ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
  Exchange (HashPartitioning 200)
   Filter (CAST(assetid#13, DoubleType) = 1201.0)
    HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None
  Exchange (HashPartitioning 200)
   HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None
Time taken: 1.354 seconds, Fetched 8 row(s)
{code}

The simple workaround is to add another equality condition for each table but 
it becomes cumbersome. It will be helpful if the query planner could improve 
filter propagation.

  was:
When 2 or more tables are joined in SparkSQL and the query has an equality 
clause on the attributes used to perform the join, it is useful to apply that 
clause to the scans of both tables. If this is not done, one of the tables ends 
up being fully scanned, which can slow the query down dramatically. Consider the 
following example with 2 tables being joined.

{code}
CREATE TABLE assets (
assetid int PRIMARY KEY,
address text,
propertyname text
)
CREATE TABLE element22082.tenants (
assetid int PRIMARY KEY,
name text
)
spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
t.assetid and t.assetid='1201';
WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
== Physical Plan ==
Project [name#14]
 ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
  Exchange (HashPartitioning 200)
   Filter (CAST(assetid#13, DoubleType) = 1201.0)
    HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None
  Exchange (HashPartitioning 200)
   HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None
Time taken: 1.354 seconds, Fetched 8 row(s)
{code}

The simple workaround is to add another equality condition for each table but 
it becomes cumbersome. It will be helpful if the query planner could improve 
filter propagation.


> Pushdown predicate propagation in SparkSQL with join
> 
>
> Key: SPARK-13219
> URL: https://issues.apache.org/jira/browse/SPARK-13219
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.6.0
> Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>Reporter: Abhinav Chawade
>
> When 2 or more tables are joined in SparkSQL and the query has an equality 
> clause on the attributes used to perform the join, it is useful to apply that 
> clause to the scans of both tables. If this is not done, one of the tables 
> ends up being fully scanned, which can slow the query down dramatically. 
> Consider the following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
> assetid int PRIMARY KEY,
> address text,
> propertyname text
> )
> CREATE TABLE tenants (
> assetid int PRIMARY KEY,
> name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
> t.assetid and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>    Filter (CAST(assetid#13, DoubleType) = 1201.0)
>     HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None
>   Exchange (HashPartitioning 200)
>    HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table but 
> it becomes cumbersome. It wil

[jira] [Updated] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join

2016-02-05 Thread Abhinav Chawade (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhinav Chawade updated SPARK-13219:

Description: 
When 2 or more tables are joined in SparkSQL and the query has an equality 
clause on the attributes used to perform the join, it is useful to apply that 
clause to the scans of both tables. If this is not done, one of the tables ends 
up being fully scanned, which can slow the query down dramatically. Consider the 
following example with 2 tables being joined.

{code}
CREATE TABLE assets (
assetid int PRIMARY KEY,
address text,
propertyname text
)
CREATE TABLE tenants (
assetid int PRIMARY KEY,
name text
)
spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
t.assetid and t.assetid='1201';
WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
== Physical Plan ==
Project [name#14]
 ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
  Exchange (HashPartitioning 200)
   Filter (CAST(assetid#13, DoubleType) = 1201.0)
    HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, Some(t)), None
  Exchange (HashPartitioning 200)
   HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), None
Time taken: 1.354 seconds, Fetched 8 row(s)
{code}

The simple workaround is to add another equality condition for each table but 
it becomes cumbersome. It will be helpful if the query planner could improve 
filter propagation.

  was:
When 2 or more tables are joined in SparkSQL and the query has an equality 
clause on the attributes used to perform the join, it is useful to apply that 
clause to the scans of both tables. If this is not done, one of the tables ends 
up being fully scanned, which can slow the query down dramatically. Consider the 
following example with 2 tables being joined.

{code}
CREATE TABLE assets (
assetid int PRIMARY KEY,
address text,
propertyname text
)
CREATE TABLE tenants (
assetid int PRIMARY KEY,
name text
)
spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
t.assetid and t.assetid='1201';
WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
== Physical Plan ==
Project [name#14]
 ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
  Exchange (HashPartitioning 200)
   Filter (CAST(assetid#13, DoubleType) = 1201.0)
    HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None
  Exchange (HashPartitioning 200)
   HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None
Time taken: 1.354 seconds, Fetched 8 row(s)
{code}

The simple workaround is to add another equality condition for each table but 
it becomes cumbersome. It will be helpful if the query planner could improve 
filter propagation.


> Pushdown predicate propagation in SparkSQL with join
> 
>
> Key: SPARK-13219
> URL: https://issues.apache.org/jira/browse/SPARK-13219
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.6.0
> Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>Reporter: Abhinav Chawade
>
> When 2 or more tables are joined in SparkSQL and the query has an equality 
> clause on the attributes used to perform the join, it is useful to apply that 
> clause to the scans of both tables. If this is not done, one of the tables 
> ends up being fully scanned, which can slow the query down dramatically. 
> Consider the following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
> assetid int PRIMARY KEY,
> address text,
> propertyname text
> )
> CREATE TABLE tenants (
> assetid int PRIMARY KEY,
> name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
> t.assetid and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>    Filter (CAST(assetid#13, DoubleType) = 1201.0)
>     HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, Some(t)), None
>   Exchange (HashPartitioning 200)
>    HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table but 
> it becomes cumbersome. It will be helpful if the query planner

[jira] [Resolved] (SPARK-13215) Remove fallback in codegen

2016-02-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13215.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11097
[https://github.com/apache/spark/pull/11097]

> Remove fallback in codegen
> --
>
> Key: SPARK-13215
> URL: https://issues.apache.org/jira/browse/SPARK-13215
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> In newMutableProjection, we fall back to InterpretedMutableProjection if 
> compilation fails.
> Since we removed the configuration for codegen, we rely heavily on codegen 
> (TungstenAggregate also requires the generated MutableProjection to update 
> UnsafeRow), so we should remove the fallback, which could otherwise confuse 
> users; see the discussion in SPARK-13116.






[jira] [Updated] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join

2016-02-05 Thread Abhinav Chawade (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhinav Chawade updated SPARK-13219:

Description: 
When 2 or more tables are joined in SparkSQL and the query has an equality 
clause on the attributes used to perform the join, it is useful to apply that 
clause to the scans of both tables. If this is not done, one of the tables ends 
up being fully scanned, which can slow the query down dramatically. Consider the 
following example with 2 tables being joined.

{code}
CREATE TABLE assets (
assetid int PRIMARY KEY,
address text,
propertyname text
)
CREATE TABLE element22082.tenants (
assetid int PRIMARY KEY,
name text
)
spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
t.assetid and t.assetid='1201';
WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
== Physical Plan ==
Project [name#14]
 ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
  Exchange (HashPartitioning 200)
   Filter (CAST(assetid#13, DoubleType) = 1201.0)
    HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None
  Exchange (HashPartitioning 200)
   HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None
Time taken: 1.354 seconds, Fetched 8 row(s)
{code}

The simple workaround is to add another equality condition for each table but 
it becomes cumbersome. It will be helpful if the query planner could improve 
filter propagation.

  was:
When 2 or more tables are joined in SparkSQL and the query has an equality 
clause on the attributes used to perform the join, it is useful to apply that 
clause to the scans of both tables. If this is not done, one of the tables ends 
up being fully scanned, which can slow the query down dramatically. Consider the 
following example with 2 tables being joined.

{code}
CREATE TABLE assets (
assetid int PRIMARY KEY,
address text,
propertyname text
)
CREATE TABLE element22082.tenants (
assetid int PRIMARY KEY,
name text
)
spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
t.assetid and t.assetid='1201'
 > ;
WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
== Physical Plan ==
Project [name#14]
 ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
  Exchange (HashPartitioning 200)
   Filter (CAST(assetid#13, DoubleType) = 1201.0)
    HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None
  Exchange (HashPartitioning 200)
   HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None
Time taken: 1.354 seconds, Fetched 8 row(s)
{code}

The simple workaround is to add another equality condition for each table but 
it becomes cumbersome. It will be helpful if the query planner could improve 
filter propagation.


> Pushdown predicate propagation in SparkSQL with join
> 
>
> Key: SPARK-13219
> URL: https://issues.apache.org/jira/browse/SPARK-13219
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.6.0
> Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>Reporter: Abhinav Chawade
>
> When 2 or more tables are joined in SparkSQL and the query has an equality 
> clause on the attributes used to perform the join, it is useful to apply that 
> clause to the scans of both tables. If this is not done, one of the tables 
> ends up being fully scanned, which can slow the query down dramatically. 
> Consider the following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
> assetid int PRIMARY KEY,
> address text,
> propertyname text
> )
> CREATE TABLE element22082.tenants (
> assetid int PRIMARY KEY,
> name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
> t.assetid and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>    Filter (CAST(assetid#13, DoubleType) = 1201.0)
>     HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None
>   Exchange (HashPartitioning 200)
>    HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each tabl

[jira] [Created] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join

2016-02-05 Thread Abhinav Chawade (JIRA)
Abhinav Chawade created SPARK-13219:
---

 Summary: Pushdown predicate propagation in SparkSQL with join
 Key: SPARK-13219
 URL: https://issues.apache.org/jira/browse/SPARK-13219
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0, 1.4.1
 Environment: Spark 1.4
Datastax Spark connector 1.4
Cassandra. 2.1.12
Centos 6.6
Reporter: Abhinav Chawade


When 2 or more tables are joined in SparkSQL and the query has an equality 
clause on the attributes used to perform the join, it is useful to apply that 
clause to the scans of both tables. If this is not done, one of the tables ends 
up being fully scanned, which can slow the query down dramatically. Consider the 
following example with 2 tables being joined.

{code}
CREATE TABLE assets (
assetid int PRIMARY KEY,
address text,
propertyname text
)
CREATE TABLE element22082.tenants (
assetid int PRIMARY KEY,
name text
)
spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
t.assetid and t.assetid='1201'
 > ;
WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
== Physical Plan ==
Project [name#14]
 ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
  Exchange (HashPartitioning 200)
   Filter (CAST(assetid#13, DoubleType) = 1201.0)
    HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None
  Exchange (HashPartitioning 200)
   HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None
Time taken: 1.354 seconds, Fetched 8 row(s)
{code}

The simple workaround is to add another equality condition for each table but 
it becomes cumbersome. It will be helpful if the query planner could improve 
filter propagation.
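
The workaround mentioned above, sketched against the same schema (sqlContext 
assumed from a standard session): repeating the constant predicate for each 
table lets the filter be pushed into both scans rather than only the tenants 
scan.

{code}
sqlContext.sql(
  """select t.name
    |from tenants t, assets a
    |where a.assetid = t.assetid
    |  and t.assetid = '1201'
    |  and a.assetid = '1201'
  """.stripMargin).explain()
{code}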






[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-02-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135206#comment-15135206
 ] 

Xiao Li commented on SPARK-12720:
-

OK, I think we should not use a subquery to do it.

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.






[jira] [Issue Comment Deleted] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-02-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12720:

Comment: was deleted

(was: I guess the best way is to add another `Project` above `Aggregate`)

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.






[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-02-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135187#comment-15135187
 ] 

Xiao Li commented on SPARK-12720:
-

I guess the best way is to add another `Project` above `Aggregate`

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-02-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135177#comment-15135177
 ] 

Xiao Li commented on SPARK-12720:
-

{code}
SELECT key, value, count(value) FROM t1 GROUP BY key, value WITH ROLLUP
{code}
is converted to
{code}
SELECT `gen_subquery_0`.`key`, `gen_subquery_0`.`value`, count(`gen_subquery_0`.`value`) AS `_c2`
FROM (SELECT `t1`.`key`, `t1`.`value`, `t1`.`key` AS `key`, `t1`.`value` AS `value` FROM `default`.`t1`) AS gen_subquery_0
GROUP BY `gen_subquery_0`.`key`, `gen_subquery_0`.`value`
GROUPING SETS ((), (`gen_subquery_0`.`key`), (`gen_subquery_0`.`key`, `gen_subquery_0`.`value`))
{code}

However, the subquery ends up with two `key` attributes: {{`t1`.`key` AS `key`}} 
and {{`t1`.`key`}}. It looks weird, but they are generated by the rule 
`ResolveGroupingAnalytics`, so I have to rewrite that rule. I tried renaming the 
alias to `key#23`, but the alias is used in the GROUP BY and the aggregate 
functions, which introduces more issues. I will switch to using the original 
attributes instead of the alias names.



> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13002) Mesos scheduler backend does not follow the property spark.dynamicAllocation.initialExecutors

2016-02-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-13002.
---
   Resolution: Fixed
 Assignee: Luc Bourlier
Fix Version/s: 2.0.0

> Mesos scheduler backend does not follow the property 
> spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-13002
> URL: https://issues.apache.org/jira/browse/SPARK-13002
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Luc Bourlier
>Assignee: Luc Bourlier
>  Labels: dynamic_allocation, mesos
> Fix For: 2.0.0
>
>
> When starting a Spark job on a Mesos cluster, all available cores are 
> reserved (up to {{spark.cores.max}}), creating one executor per Mesos node, 
> and as many executors as needed.
> This is the case even when dynamic allocation is enabled.
> When dynamic allocation is enabled, the number of executors launched at 
> startup should be limited to the value of 
> {{spark.dynamicAllocation.initialExecutors}}.
> The Mesos scheduler backend already follows the value computed by the 
> {{ExecutorAllocationManager}} for the number of executors that should be up 
> and running, except at startup, when it just creates all the executors it can.
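
For reference, a minimal sketch of the kind of configuration involved (the 
master URL, app name, and values are assumptions; the property names are 
standard Spark settings):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// With dynamic allocation enabled, startup should request only
// spark.dynamicAllocation.initialExecutors executors instead of grabbing
// all cores up to spark.cores.max.
val conf = new SparkConf()
  .setAppName("dyn-alloc-sketch")                     // hypothetical app name
  .setMaster("mesos://zk://mesos-master:2181/mesos")  // hypothetical master URL
  .set("spark.mesos.coarse", "true")
  .set("spark.shuffle.service.enabled", "true")       // required for dynamic allocation
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "2")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.cores.max", "40")
val sc = new SparkContext(conf)
{code}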



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13214) Fix dynamic allocation docs

2016-02-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13214:
--
Assignee: Bill Chambers

> Fix dynamic allocation docs
> ---
>
> Key: SPARK-13214
> URL: https://issues.apache.org/jira/browse/SPARK-13214
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Trivial
> Fix For: 1.6.1, 2.0.0
>
>
> Update docs to reflect that dynamic allocation is available for all cluster 
> managers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13214) Fix dynamic allocation docs

2016-02-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-13214.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.1

> Fix dynamic allocation docs
> ---
>
> Key: SPARK-13214
> URL: https://issues.apache.org/jira/browse/SPARK-13214
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Trivial
> Fix For: 1.6.1, 2.0.0
>
>
> Update docs to reflect that dynamic allocation is available for all cluster 
> managers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12939) migrate encoder resolution to Analyzer

2016-02-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12939:
-
Assignee: Wenchen Fan

> migrate encoder resolution to Analyzer
> --
>
> Key: SPARK-12939
> URL: https://issues.apache.org/jira/browse/SPARK-12939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> In https://github.com/apache/spark/pull/10747, we decided to use expressions 
> instead of encoders in query plans, to make the Dataset easier to optimize 
> and make encoder resolution less hacky.
> However, that PR didn't finish the migration completely and caused some 
> regressions which are hidden due to the lack of end-to-end tests.
> We should fix it by completing the migration and adding end-to-end tests for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12939) migrate encoder resolution to Analyzer

2016-02-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12939.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10852
[https://github.com/apache/spark/pull/10852]

> migrate encoder resolution to Analyzer
> --
>
> Key: SPARK-12939
> URL: https://issues.apache.org/jira/browse/SPARK-12939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.0.0
>
>
> In https://github.com/apache/spark/pull/10747, we decided to use expressions 
> instead of encoders in query plans, to make the Dataset easier to optimize 
> and make encoder resolution less hacky.
> However, that PR didn't finish the migration completely and caused some 
> regressions which are hidden due to the lack of end-to-end tests.
> We should fix it by completing the migration and adding end-to-end tests for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13202) Jars specified with --jars do not exist on the worker classpath.

2016-02-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135163#comment-15135163
 ] 

Sean Owen commented on SPARK-13202:
---

Actually, try spark-submit. This is a class-loader visibility problem: Spark 
has to load app classes from its own class loader, and I remember this 
situation is more complex in the shell. You may also try the userClassPathFirst 
option to see if that matters. The worst case is that you are indeed already 
doing what's needed, but these might shed some light or confirm it.
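
As a rough sketch of the second suggestion (the property names are real Spark 
settings; the jar path is hypothetical, and whether this helps here is exactly 
what is being asked):

{code}
import org.apache.spark.SparkConf

// Prefer user-provided jars over Spark's own classpath when resolving classes
// on the executors (and optionally on the driver).
val conf = new SparkConf()
  .set("spark.jars", "/path/to/myjar.jar")          // hypothetical path; same jars as --jars
  .set("spark.executor.userClassPathFirst", "true")
  .set("spark.driver.userClassPathFirst", "true")
{code}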

> Jars specified with --jars do not exist on the worker classpath.
> 
>
> Key: SPARK-13202
> URL: https://issues.apache.org/jira/browse/SPARK-13202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Michael Schmitz
>
> I have a Spark Scala 2.11 application.  To deploy it to the cluster, I create 
> a jar of the dependencies and a jar of the project (although this problem 
> still manifests if I create a single jar with everything).  I will focus on 
> problems specific to Spark Shell, but I'm pretty sure they also apply to 
> Spark Submit.
> I can get Spark Shell to work with my application; however, I need to set 
> spark.executor.extraClassPath.  From reading the documentation 
> (http://spark.apache.org/docs/latest/configuration.html#runtime-environment) 
> it sounds like I shouldn't need to set this option ("Users typically should 
> not need to set this option.") After reading about --jars, I understand that 
> this should set the classpath for the workers to use the jars that are synced 
> to those machines.
> When I don't set spark.executor.extraClassPath, I get a kryo registrator 
> exception with the root cause being that a class is not found.
> java.io.IOException: org.apache.spark.SparkException: Failed to register 
> classes with Kryo
> java.lang.ClassNotFoundException: org.allenai.common.Enum
> If I SSH into the workers, I can see that we did create directories that 
> contain the jars specified by --jars.
> /opt/data/spark/worker/app-20160204212742-0002/0
> /opt/data/spark/worker/app-20160204212742-0002/1
> Now, if I re-run spark-shell but with `--conf 
> spark.executor.extraClassPath=/opt/data/spark/worker/app-20160204212742-0002/0/myjar.jar`,
>  my job will succeed.  In other words, if I put my jars at a location that is 
> available to all the workers and specify that as an extra executor class 
> path, the job succeeds.
> Unfortunately, this means that the jars are being copied to the workers for 
> no reason.  How can I get --jars to add the jars it copies to the workers to 
> the classpath?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13218) Executor failed after SparkContext start and start again

2016-02-05 Thread leo wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135153#comment-15135153
 ] 

leo wu commented on SPARK-13218:


One more point:

If spark-defaults.conf and spark-env.sh are configured manually for a remote 
Spark standalone cluster, and the IPython notebook or pyspark is then launched, 
everything works fine.

So I strongly suspect it's a problem with SparkContext/SparkConf initialization 
after stopping and starting again.


> Executor failed  after SparkContext start and start again 
> --
>
> Key: SPARK-13218
> URL: https://issues.apache.org/jira/browse/SPARK-13218
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Run IPython/Jupyter along with Spark on ubuntu 14.04
>Reporter: leo wu
>
> In a Python notebook, I am trying to stop a SparkContext that was initialized 
> with a local master and then start it again with a conf pointing at a remote 
> Spark standalone cluster, like:
> import sys
> from random import random
> import atexit
> import os
> import platform
> import py4j
> import pyspark
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext, HiveContext
> from pyspark.storagelevel import StorageLevel
> os.environ["SPARK_HOME"] = "/home/notebook/spark-1.6.0-bin-hadoop2.6"
> os.environ["PYSPARK_SUBMIT_ARGS"] = "--master spark://10.115.89.219:7077"
> os.environ["SPARK_LOCAL_HOSTNAME"] = "wzymaster2011"
> SparkContext.setSystemProperty("spark.master", "spark://10.115.89.219:7077")
> SparkContext.setSystemProperty("spark.cores.max", "4")
> SparkContext.setSystemProperty("spark.driver.host", "wzymaster2011")
> SparkContext.setSystemProperty("spark.driver.port", "9000")
> SparkContext.setSystemProperty("spark.blockManager.port", "9001")
> SparkContext.setSystemProperty("spark.fileserver.port", "9002") 
> conf = SparkConf().setAppName("Python-Test")
> sc = SparkContext(conf=conf)
> However, I always get an error in the executor like:
> 16/02/05 14:37:32 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> from BlockManagerId(driver, localhost, 9002)
> 16/02/05 14:37:32 DEBUG TransportClientFactory: Creating new connection to 
> localhost/127.0.0.1:9002
> 16/02/05 14:37:32 ERROR RetryingBlockFetcher: Exception while beginning fetch 
> of 1 outstanding blocks
> java.io.IOException: Failed to connect to localhost/127.0.0.1:9002
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
> at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> I suspect that the new SparkConf isn't properly passed to the executor through 
> the Spark master for some reason.
> Please advise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13218) Executor failed after SparkContext start and start again

2016-02-05 Thread leo wu (JIRA)
leo wu created SPARK-13218:
--

 Summary: Executor failed  after SparkContext start and start again 
 Key: SPARK-13218
 URL: https://issues.apache.org/jira/browse/SPARK-13218
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.0
 Environment: Run IPython/Jupyter along with Spark on ubuntu 14.04
Reporter: leo wu


In a Python notebook, I am trying to stop a SparkContext that was initialized 
with a local master and then start it again with a conf pointing at a remote 
Spark standalone cluster, like:

import sys
from random import random
import atexit
import os
import platform

import py4j

import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext
from pyspark.storagelevel import StorageLevel


os.environ["SPARK_HOME"] = "/home/notebook/spark-1.6.0-bin-hadoop2.6"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master spark://10.115.89.219:7077"
os.environ["SPARK_LOCAL_HOSTNAME"] = "wzymaster2011"

SparkContext.setSystemProperty("spark.master", "spark://10.115.89.219:7077")
SparkContext.setSystemProperty("spark.cores.max", "4")
SparkContext.setSystemProperty("spark.driver.host", "wzymaster2011")
SparkContext.setSystemProperty("spark.driver.port", "9000")
SparkContext.setSystemProperty("spark.blockManager.port", "9001")
SparkContext.setSystemProperty("spark.fileserver.port", "9002") 

conf = SparkConf().setAppName("Python-Test")
sc = SparkContext(conf=conf)

However, I always get an error in the executor like:

16/02/05 14:37:32 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
from BlockManagerId(driver, localhost, 9002)
16/02/05 14:37:32 DEBUG TransportClientFactory: Creating new connection to 
localhost/127.0.0.1:9002
16/02/05 14:37:32 ERROR RetryingBlockFetcher: Exception while beginning fetch 
of 1 outstanding blocks
java.io.IOException: Failed to connect to localhost/127.0.0.1:9002
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)

I suspect that the new SparkConf isn't properly passed to the executor through 
the Spark master for some reason.

Please advise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13217) UI of visualization of stage

2016-02-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-13217.
--
Resolution: Not A Problem

It works after restarting the driver.

> UI of visualization of stage 
> -
>
> Key: SPARK-13217
> URL: https://issues.apache.org/jira/browse/SPARK-13217
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>
> http://localhost:4041/stages/stage/?id=36&attempt=0&expandDagViz=true
> {code}
> HTTP ERROR 500
> Problem accessing /stages/stage/. Reason:
> Server Error
> Caused by:
> java.lang.IncompatibleClassChangeError: vtable stub
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:102)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
>   at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:79)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at java.lang.Thread.run(Thread.java:745)
> Powered by Jetty://
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13217) UI of visualization of stage

2016-02-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13217:
--

 Summary: UI of visualization of stage 
 Key: SPARK-13217
 URL: https://issues.apache.org/jira/browse/SPARK-13217
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Davies Liu


http://localhost:4041/stages/stage/?id=36&attempt=0&expandDagViz=true

{code}
HTTP ERROR 500

Problem accessing /stages/stage/. Reason:

Server Error
Caused by:

java.lang.IncompatibleClassChangeError: vtable stub
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:102)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:79)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at 
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
Powered by Jetty://
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.

2016-02-05 Thread leo wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135036#comment-15135036
 ] 

leo wu commented on SPARK-13198:


Hi Sean,

I am trying to do something similar in an IPython/Jupyter notebook by stopping 
a SparkContext and then starting a new one, with a new SparkConf pointing at a 
remote Spark standalone cluster instead of the local master it was originally 
initialized with, like:

import sys
from random import random
from operator import add
import atexit
import os
import platform

import py4j

import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext
from pyspark.storagelevel import StorageLevel


os.environ["SPARK_HOME"] = "/home/notebook/spark-1.6.0-bin-hadoop2.6"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master spark://10.115.89.219:7077"
os.environ["SPARK_LOCAL_HOSTNAME"] = "wzymaster2011"

SparkContext.setSystemProperty("spark.master", "spark://10.115.89.219:7077")
SparkContext.setSystemProperty("spark.cores.max", "4")
SparkContext.setSystemProperty("spark.driver.host", "wzymaster2011")
SparkContext.setSystemProperty("spark.driver.port", "9000")
SparkContext.setSystemProperty("spark.blockManager.port", "9001")
SparkContext.setSystemProperty("spark.fileserver.port", "9002")

conf = SparkConf().setAppName("Leo-Python-Test")
sc = SparkContext(conf=conf)

However, I always get an error on the executor because it tries to fetch the 
BlockManager info from the driver at "localhost" instead of the host set in 
"spark.driver.host" ("wzymaster2011"):

16/02/05 14:37:32 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
from BlockManagerId(driver, localhost, 9002)
16/02/05 14:37:32 DEBUG TransportClientFactory: Creating new connection to 
localhost/127.0.0.1:9002
16/02/05 14:37:32 ERROR RetryingBlockFetcher: Exception while beginning fetch 
of 1 outstanding blocks
java.io.IOException: Failed to connect to localhost/127.0.0.1:9002
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)


So I strongly suspect there is a bug where SparkContext.stop() does not clean 
up all of its state, and resetting SparkConf doesn't work well within one app.

Please advise.
Many thanks.


> sc.stop() does not clean up on driver, causes Java heap OOM.
> 
>
> Key: SPARK-13198
> URL: https://issues.apache.org/jira/browse/SPARK-13198
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Herman Schistad
> Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot 
> 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png
>
>
> When starting and stopping multiple SparkContext's linearly eventually the 
> driver stops working with a "io.netty.handler.codec.EncoderException: 
> java.lang.OutOfMemoryError: Java heap space" error.
> Reproduce by running the following code and loading in ~7MB parquet data each 
> time. The driver heap space is not changed and thus defaults to 1GB:
> {code:java}
> def main(args: Array[String]) {
>   val conf = new SparkConf().setMaster("MASTER_URL").setAppName("")
>   conf.set("spark.mesos.coarse", "true")
>   conf.set("spark.cores.max", "10")
>   for (i <- 1 until 100) {
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val events = sqlContext.read.parquet("hdfs://locahost/tmp/something")
> println(s"Context ($i), number of events: " + events.count)
> sc.stop()
>   }
> }
> {code}
> The heap space fills up within 20 loops on my cluster. Increasing the number 
> of cores to 50 in the above example results in heap space error after 12 
> contexts.
> Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" 
> objects (see attachments). Digging into the inner objects tells me that the 
> `executorDataMap` is where 99% of the data in said object is stored. I do 
> believe though that this is beside the point as I'd expect this whole object 
> to be garbage collected or freed on sc.stop(). 
> Additionally I can see in the Spark web UI that each time a new context is 
> created the number of the "SQL" tab increments by one (i.e. last iteration 
> would have SQL99). After doing stop and creating a completely new context I 
> was expecting this number to be reset to 1 ("SQL").
> I'm submitting the jar file with `spark-submit` and no special flags. The 
> cluster is running Mesos 0.23. I'm r

[jira] [Assigned] (SPARK-13215) Remove fallback in codegen

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13215:


Assignee: Apache Spark

> Remove fallback in codegen
> --
>
> Key: SPARK-13215
> URL: https://issues.apache.org/jira/browse/SPARK-13215
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> In newMutableProjection, we fall back to InterpretedMutableProjection if the 
> generated code fails to compile.
> Since we removed the configuration for codegen, we rely heavily on codegen 
> (and TungstenAggregate requires the generated MutableProjection to update 
> UnsafeRow), so we should remove the fallback, which could otherwise confuse 
> users; see the discussion in SPARK-13116.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13215) Remove fallback in codegen

2016-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135009#comment-15135009
 ] 

Apache Spark commented on SPARK-13215:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11097

> Remove fallback in codegen
> --
>
> Key: SPARK-13215
> URL: https://issues.apache.org/jira/browse/SPARK-13215
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>
> In newMutableProjection, we fall back to InterpretedMutableProjection if the 
> generated code fails to compile.
> Since we removed the configuration for codegen, we rely heavily on codegen 
> (and TungstenAggregate requires the generated MutableProjection to update 
> UnsafeRow), so we should remove the fallback, which could otherwise confuse 
> users; see the discussion in SPARK-13116.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13215) Remove fallback in codegen

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13215:


Assignee: (was: Apache Spark)

> Remove fallback in codegen
> --
>
> Key: SPARK-13215
> URL: https://issues.apache.org/jira/browse/SPARK-13215
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>
> In newMutableProjection, we fall back to InterpretedMutableProjection if the 
> generated code fails to compile.
> Since we removed the configuration for codegen, we rely heavily on codegen 
> (and TungstenAggregate requires the generated MutableProjection to update 
> UnsafeRow), so we should remove the fallback, which could otherwise confuse 
> users; see the discussion in SPARK-13116.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13216) Spark streaming application not honoring --num-executors in restarting of an application from a checkpoint

2016-02-05 Thread Neelesh Srinivas Salian (JIRA)
Neelesh Srinivas Salian created SPARK-13216:
---

 Summary: Spark streaming application not honoring --num-executors 
in restarting of an application from a checkpoint
 Key: SPARK-13216
 URL: https://issues.apache.org/jira/browse/SPARK-13216
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit, Streaming
Affects Versions: 1.5.0
Reporter: Neelesh Srinivas Salian


Scenario to help understand:
1) The Spark streaming job was initiated with 12 executors and with 
checkpointing enabled.
2) In version 1.3, the user was able to increase the number of executors to 20 
using --num-executors, but was unable to do so in version 1.5.

In 1.5, the Spark application still runs with 13 containers (1 for the driver 
and 12 executors).
There is a need to restart from the checkpoint itself rather than restarting 
the application from scratch, to avoid losing information.

3) Checked the code in 1.3 and 1.5, which shows that the "--num-executors" 
option has been deprecated.

Any thoughts on this? Not sure if anyone has hit this one specifically before.
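
For context, a minimal sketch of the restart-from-checkpoint pattern in question 
(paths and names are assumptions). On restart, much of the configuration is 
recovered from the checkpoint rather than from the new submission arguments, 
which may be related to why the new --num-executors value is not honored:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // hypothetical path

def createContext(): StreamingContext = {
  // Only runs when no checkpoint exists; on restart the context (and much of
  // its configuration) is rebuilt from the checkpoint data instead.
  val conf = new SparkConf().setAppName("streaming-sketch")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define the streaming computation here ...
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}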



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-05 Thread Asif Hussain Shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134988#comment-15134988
 ] 

Asif Hussain Shahid commented on SPARK-13116:
-

The issue will show up if you throw an exception in the place mentioned below.

In SparkPlan.newMutableProjection, if an exception sends the code path down the 
route of creating an InterpretedMutableProjection, the issue will show up. Just 
add

throw new UnsupportedOperationException("")

after the GenerateMutableProjection.generate call to simulate it.

At this point I can only reproduce it artificially, by throwing an exception in 
the source code below.

SparkPlan:
protected def newMutableProjection(
    expressions: Seq[Expression],
    inputSchema: Seq[Attribute],
    useSubexprElimination: Boolean = false): () => MutableProjection = {
  log.debug(s"Creating MutableProj: $expressions, inputSchema: $inputSchema")
  try {
    GenerateMutableProjection.generate(expressions, inputSchema, useSubexprElimination)
    throw new UnsupportedOperationException("TEST")
  } catch {
    case e: Exception =>
      if (isTesting) {
        throw e
      } else {
        log.error("Failed to generate mutable projection, fallback to interpreted", e)
        () => new InterpretedMutableProjection(expressions, inputSchema)
      }
  }
}



> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow, while the target is 
> an UnsafeRow ,  the current code will try to set the fields in the UnsafeRow 
> using the update method in UnSafeRow. 
> This method is called via TunsgtenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprA

[jira] [Created] (SPARK-13215) Remove fallback in codegen

2016-02-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13215:
--

 Summary: Remove fallback in codegen
 Key: SPARK-13215
 URL: https://issues.apache.org/jira/browse/SPARK-13215
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


In newMutableProjection, we fall back to InterpretedMutableProjection if the 
generated code fails to compile.

Since we removed the configuration for codegen, we rely heavily on codegen 
(and TungstenAggregate requires the generated MutableProjection to update 
UnsafeRow), so we should remove the fallback, which could otherwise confuse 
users; see the discussion in SPARK-13116.
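
A sketch (not the actual patch) of what removing the fallback amounts to, based 
on the newMutableProjection shape quoted in SPARK-13116:

{code}
// Sketch only: let a codegen compilation failure surface instead of silently
// falling back to InterpretedMutableProjection.
protected def newMutableProjection(
    expressions: Seq[Expression],
    inputSchema: Seq[Attribute],
    useSubexprElimination: Boolean = false): () => MutableProjection = {
  log.debug(s"Creating MutableProj: $expressions, inputSchema: $inputSchema")
  GenerateMutableProjection.generate(expressions, inputSchema, useSubexprElimination)
}
{code}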



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134979#comment-15134979
 ] 

Davies Liu commented on SPARK-13116:


I tried the test you attached; it works well on master.

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow, while the target is 
> an UnsafeRow ,  the current code will try to set the fields in the UnsafeRow 
> using the update method in UnSafeRow. 
> This method is called via TunsgtenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-05 Thread Asif Hussain Shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134969#comment-15134969
 ] 

Asif Hussain Shahid commented on SPARK-13116:
-

The unary expression was our code, not Spark's. It was a mistake in our code 
generation snippet, which resulted in this issue. Yes, the cleanest fix would 
be to remove the fallback.




> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow, while the target is 
> an UnsafeRow ,  the current code will try to set the fields in the UnsafeRow 
> using the update method in UnSafeRow. 
> This method is called via TunsgtenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-05 Thread Asif Hussain Shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134968#comment-15134968
 ] 

Asif Hussain Shahid commented on SPARK-13116:
-

No, the unary expression was our code, not Spark's. It was a mistake in our 
code generation snippet, which resulted in this issue. Yes, the cleanest fix 
would be to remove the fallback.




> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow, while the target is 
> an UnsafeRow ,  the current code will try to set the fields in the UnsafeRow 
> using the update method in UnSafeRow. 
> This method is called via TunsgtenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134959#comment-15134959
 ] 

Davies Liu commented on SPARK-13116:


Which UnaryExpression generates incorrect code? I think we may just remove the 
fallback.

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow, while the target is 
> an UnsafeRow ,  the current code will try to set the fields in the UnsafeRow 
> using the update method in UnSafeRow. 
> This method is called via TunsgtenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-05 Thread Asif Hussain Shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134929#comment-15134929
 ] 

Asif Hussain Shahid commented on SPARK-13116:
-


I debugged the issue further and found the following.

The problem appears only if SparkPlan.newMutableProjection throws an exception 
while generating the mutable projection through code generation, in which case 
the code falls back to creating an InterpretedMutableProjection. This is what 
causes the problem. It looks like, in our case, one of our UnaryExpressions had 
incorrect code generation, which caused an exception during bytecode generation 
and resulted in an InterpretedMutableProjection being created.
I am attaching a test which will fail if the generated mutable projection fails 
for some reason.

At this point I can only reproduce it artificially, by throwing an exception in 
the source code below.
SparkPlan:
protected def newMutableProjection(
  expressions: Seq[Expression],
  inputSchema: Seq[Attribute],
  useSubexprElimination: Boolean = false): () => MutableProjection = {
log.debug(s"Creating MutableProj: $expressions, inputSchema: $inputSchema")
try {
  GenerateMutableProjection.generate(expressions, inputSchema, 
useSubexprElimination)
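  // injected here only to simulate a codegen failure so that the interpreted
  // fallback is exercised (not part of the real SparkPlan code)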
  throw new UnsupportedOperationException("TEST")
} catch {
  case e: Exception =>
if (isTesting) {
  throw e
} else {
  log.error("Failed to generate mutable projection, fallback to 
interpreted", e)
  () => new InterpretedMutableProjection(expressions, inputSchema)
}
}
  }

I suppose it is not a high-priority bug, or something which requires a fix; 
just something I came across during our development.
Just to note: I found this issue to be present in the latest Spark code, too.

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate, though supposedly capable of processing both unsafe and 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a safe row while the target is 
> an UnsafeRow, the current code will try to set the fields in the UnsafeRow 
> using the update method of UnsafeRow. 
> This method is called via TungstenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> safe row.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I made on our forked branch, in the class 
> InterpretedMutableProjection, is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (targe

[jira] [Updated] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-05 Thread Asif Hussain Shahid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif Hussain Shahid updated SPARK-13116:

Attachment: SPARK_13116_Test.scala

A test which reproduces the issue.
Please note that to reproduce the issue you will have to tweak the SparkPlan code 
a little, as explained in the ticket.

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate, though supposedly capable of processing both unsafe and 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a safe row while the target is 
> an UnsafeRow, the current code will try to set the fields in the UnsafeRow 
> using the update method of UnsafeRow. 
> This method is called via TungstenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> safe row.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I made on our forked branch, in the class 
> InterpretedMutableProjection, is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13166) Remove DataStreamReader/Writer

2016-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134890#comment-15134890
 ] 

Apache Spark commented on SPARK-13166:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11096

> Remove DataStreamReader/Writer
> --
>
> Key: SPARK-13166
> URL: https://issues.apache.org/jira/browse/SPARK-13166
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> They seem redundant and we can simply use DataFrameReader/Writer. 
> The usage looks like:
> {code}
> val df = sqlContext.read.stream("...")
> val handle = df.write.stream("...")
> handle.stop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.

2016-02-05 Thread Fede Bar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134874#comment-15134874
 ] 

Fede Bar commented on SPARK-12430:
--

Very well explained, thanks.
I'll try the new build and post feedback as soon as possible.
Have a good weekend.

> Temporary folders do not get deleted after Task completes causing problems 
> with disk space.
> ---
>
> Key: SPARK-12430
> URL: https://issues.apache.org/jira/browse/SPARK-12430
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.5.2, 1.6.0
> Environment: Ubuntu server
>Reporter: Fede Bar
>
> We are experiencing an issue with automatic /tmp folder deletion after 
> framework completes. Completing a M/R job using Spark 1.5.2 (same behavior as 
> Spark 1.5.1) over Mesos will not delete some temporary folders causing free 
> disk space on server to exhaust. 
> Behavior of M/R job using Spark 1.4.1 over Mesos cluster:
> - Launched using spark-submit on one cluster node.
> - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/*  , 
>  */tmp/spark-#/blockmgr-#*
> - When task is completed */tmp/spark-#/* gets deleted along with 
> */tmp/spark-#/blockmgr-#* sub-folder.
> Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job):
> - Launched using spark-submit on one cluster node.
> - Following folders are created: */tmp/mesos/mesos/slaves/id** * , 
> */tmp/spark-***/ *  ,{color:red} /tmp/blockmgr-***{color}
> - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle 
> container folder {color:red} /tmp/blockmgr-***{color}
> Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several 
> GB depending on the job that ran. Over time this causes disk space to become 
> full with consequences that we all know. 
> Running a shell script would probably work but it is difficult to identify 
> folders in use by a running M/R or stale folders. I did notice similar issues 
> opened by other users marked as "resolved", but none seems to exactly match 
> the above behavior. 
> I really hope someone has insights on how to fix it.
> Thank you very much!
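
As a stopgap while this is open, a cleanup along the lines the description mentions (it 
talks about a shell script; here is the same idea in Scala using only the JDK) could look 
roughly like the sketch below. The /tmp location, the blockmgr- prefix, and the one-day 
age threshold are assumptions, and it cannot reliably tell a stale folder from one still 
in use by a running job:

{code}
import java.io.File
import java.nio.file.Files
import java.util.concurrent.TimeUnit
import scala.collection.JavaConverters._

val tmpDir = new File("/tmp")                    // assumed local dir
val maxAgeMs = TimeUnit.DAYS.toMillis(1)         // assumed "stale" threshold
val now = System.currentTimeMillis()

// Candidate directories: blockmgr-* folders untouched for longer than the threshold.
val stale = Option(tmpDir.listFiles()).getOrElse(Array.empty[File])
  .filter(d => d.isDirectory && d.getName.startsWith("blockmgr-"))
  .filter(d => now - d.lastModified() > maxAgeMs)

// Walk each directory bottom-up so files are removed before their parent directories.
stale.foreach { dir =>
  Files.walk(dir.toPath).iterator().asScala.toSeq.reverse.foreach(p => Files.deleteIfExists(p))
}
{code}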



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13210) NPE in Sort

2016-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134828#comment-15134828
 ] 

Apache Spark commented on SPARK-13210:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11095

> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> When run TPCDS query Q78 with scale 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13214) Fix dynamic allocation docs

2016-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134777#comment-15134777
 ] 

Apache Spark commented on SPARK-13214:
--

User 'anabranch' has created a pull request for this issue:
https://github.com/apache/spark/pull/11094

> Fix dynamic allocation docs
> ---
>
> Key: SPARK-13214
> URL: https://issues.apache.org/jira/browse/SPARK-13214
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Bill Chambers
>Priority: Trivial
>
> Update docs to reflect dynamicAllocation to be available for all cluster 
> managers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13214) Fix dynamic allocation docs

2016-02-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13214:
--
Affects Version/s: 1.6.0
 Target Version/s: 1.6.1, 2.0.0
  Component/s: Documentation

> Fix dynamic allocation docs
> ---
>
> Key: SPARK-13214
> URL: https://issues.apache.org/jira/browse/SPARK-13214
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Bill Chambers
>Priority: Trivial
>
> Update docs to reflect dynamicAllocation to be available for all cluster 
> managers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13214) Fix dynamic allocation docs

2016-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134746#comment-15134746
 ] 

Apache Spark commented on SPARK-13214:
--

User 'anabranch' has created a pull request for this issue:
https://github.com/apache/spark/pull/11093

> Fix dynamic allocation docs
> ---
>
> Key: SPARK-13214
> URL: https://issues.apache.org/jira/browse/SPARK-13214
> Project: Spark
>  Issue Type: Documentation
>Reporter: Bill Chambers
>Priority: Trivial
>
> Update docs to reflect dynamicAllocation to be available for all cluster 
> managers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13214) Fix dynamic allocation docs

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13214:


Assignee: Apache Spark

> Fix dynamic allocation docs
> ---
>
> Key: SPARK-13214
> URL: https://issues.apache.org/jira/browse/SPARK-13214
> Project: Spark
>  Issue Type: Documentation
>Reporter: Bill Chambers
>Assignee: Apache Spark
>Priority: Trivial
>
> Update docs to reflect dynamicAllocation to be available for all cluster 
> managers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13214) Fix dynamic allocation docs

2016-02-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13214:


Assignee: (was: Apache Spark)

> Fix dynamic allocation docs
> ---
>
> Key: SPARK-13214
> URL: https://issues.apache.org/jira/browse/SPARK-13214
> Project: Spark
>  Issue Type: Documentation
>Reporter: Bill Chambers
>Priority: Trivial
>
> Update docs to reflect dynamicAllocation to be available for all cluster 
> managers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13214) Update docs to reflect dynamicAllocation to be true

2016-02-05 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-13214:
-

 Summary: Update docs to reflect dynamicAllocation to be true
 Key: SPARK-13214
 URL: https://issues.apache.org/jira/browse/SPARK-13214
 Project: Spark
  Issue Type: Documentation
Reporter: Bill Chambers
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13214) Fix dynamic allocation docs

2016-02-05 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-13214:
--
Description: Update docs to reflect dynamicAllocation to be available for 
all cluster managers
Summary: Fix dynamic allocation docs  (was: Update docs to reflect 
dynamicAllocation to be true)

> Fix dynamic allocation docs
> ---
>
> Key: SPARK-13214
> URL: https://issues.apache.org/jira/browse/SPARK-13214
> Project: Spark
>  Issue Type: Documentation
>Reporter: Bill Chambers
>Priority: Trivial
>
> Update docs to reflect dynamicAllocation to be available for all cluster 
> managers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13213) BroadcastNestedLoopJoin is very slow

2016-02-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13213:
--

 Summary: BroadcastNestedLoopJoin is very slow
 Key: SPARK-13213
 URL: https://issues.apache.org/jira/browse/SPARK-13213
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


Since we have improved the performance of CartesianProduct, which should be 
faster and more robust than BroadcastNestedLoopJoin, we should use CartesianProduct 
instead of BroadcastNestedLoopJoin, especially when the broadcasted table is 
not that small.

Today we hit a query that had run for a very long time without finishing; once we 
decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), it 
finished in seconds.
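
For reference, the workaround amounts to lowering the broadcast threshold so the planner 
stops picking BroadcastNestedLoopJoin; a minimal sketch (setting the threshold to -1 
disables broadcasting entirely):

{code}
// Disable broadcast joins so the planner falls back to a shuffle / cartesian plan.
// Any sufficiently small byte threshold would have the same effect as -1.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

// ... re-run the slow query ...
{code}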



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13180) Protect against SessionState being null when accessing HiveClientImpl#conf

2016-02-05 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-13180:
---
Attachment: spark-13180-util.patch

Patch with HiveConfUtil.scala, for the record.

> Protect against SessionState being null when accessing HiveClientImpl#conf
> --
>
> Key: SPARK-13180
> URL: https://issues.apache.org/jira/browse/SPARK-13180
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-13180-util.patch
>
>
> See this thread http://search-hadoop.com/m/q3RTtFoTDi2HVCrM1
> {code}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
> at 
> org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
> at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
> at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
> at org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:457)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456)
> at org.apache.spark.sql.hive.HiveContext$$anon$3.(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473)
> at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at 
> org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
> at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13189) Cleanup build references to Scala 2.10

2016-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134632#comment-15134632
 ] 

Apache Spark commented on SPARK-13189:
--

User 'lresende' has created a pull request for this issue:
https://github.com/apache/spark/pull/11092

> Cleanup build references to Scala 2.10
> --
>
> Key: SPARK-13189
> URL: https://issues.apache.org/jira/browse/SPARK-13189
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Luciano Resende
>
> There are still few places referencing scala 2.10/2.10.5 while it should be 
> 2.11/2.11.7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-02-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134414#comment-15134414
 ] 

Steve Loughran commented on SPARK-12807:


Mixing dependency versions in a single project is dangerous, and it still runs 
the risk of being brittle against whatever version of Jackson the Hadoop namenodes 
run with. While I'm not a fan of shading, until we get better classpath/process 
isolation in node managers, shading here avoids problems and will ensure that 
ASF Spark releases work with ASF Hadoop releases.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to try to use dynamic allocation on a Hadoop 2.6-based cluster, 
> you get to see a stack trace in the NM logs, indicating a jackson 2.x version 
> mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-02-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134412#comment-15134412
 ] 

Steve Loughran commented on SPARK-12807:


It's good to hear that things work with older versions, but they do need to be 
compiled in sync. The risk with downgrading jackson versions  is that someone 
who has upgraded will find their code won't link any more. This is the same 
dilemma that HADOOP-10104 created: revert or tell others "sorry, time to 
upgrade". We went with the latter, but have added jackson to the list of 
dependencies whose upgrades are traumatic: Guava, protobuf (which will never be 
upgraded on the 2.x line)

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to try to use dynamic allocation on a Hadoop 2.6-based cluster, 
> you get to see a stack trace in the NM logs, indicating a jackson 2.x version 
> mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13202) Jars specified with --jars do not exist on the worker classpath.

2016-02-05 Thread Michael Schmitz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134332#comment-15134332
 ] 

Michael Schmitz edited comment on SPARK-13202 at 2/5/16 3:40 PM:
-

[~srowen] in that case the documentation is confusing, as it claims 
spark.executor.extraClassPath is deprecated (specifically: "This exists 
primarily for backwards-compatibility with older versions of Spark. Users 
typically should not need to set this option.").  It also means that the 
`--jars` option isn't as useful as it could be: it copies the jars to the 
worker nodes, but I cannot use those jars for the Kryo serialization because I 
don't know where they will end up.


was (Author: schmmd):
[~srowen] in that case the documentation is confusing as it claims 
spark.executor.extraClassPath is deprecated.  It also means that the `--jars` 
options isn't as useful as it could be.  It copies the jars to the worker 
nodes, but I cannot use those jars for the Kryo serialization because I don't 
know where they will end up.

> Jars specified with --jars do not exist on the worker classpath.
> 
>
> Key: SPARK-13202
> URL: https://issues.apache.org/jira/browse/SPARK-13202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Michael Schmitz
>
> I have a Spark Scala 2.11 application.  To deploy it to the cluster, I create 
> a jar of the dependencies and a jar of the project (although this problem 
> still manifests if I create a single jar with everything).  I will focus on 
> problems specific to Spark Shell, but I'm pretty sure they also apply to 
> Spark Submit.
> I can get Spark Shell to work with my application, however I need to set 
> spark.executor.extraClassPath.  From reading the documentation 
> (http://spark.apache.org/docs/latest/configuration.html#runtime-environment) 
> it sounds like I shouldn't need to set this option ("Users typically should 
> not need to set this option.") After reading about --jars, I understand that 
> this should set the classpath for the workers to use the jars that are synced 
> to those machines.
> When I don't set spark.executor.extraClassPath, I get a kryo registrator 
> exception with the root cause being that a class is not found.
> java.io.IOException: org.apache.spark.SparkException: Failed to register 
> classes with Kryo
> java.lang.ClassNotFoundException: org.allenai.common.Enum
> If I SSH into the workers, I can see that we did create directories that 
> contain the jars specified by --jars.
> /opt/data/spark/worker/app-20160204212742-0002/0
> /opt/data/spark/worker/app-20160204212742-0002/1
> Now, if I re-run spark-shell but with `--conf 
> spark.executor.extraClassPath=/opt/data/spark/worker/app-20160204212742-0002/0/myjar.jar`,
>  my job will succeed.  In other words, if I put my jars at a location that is 
> available to all the workers and specify that as an extra executor class 
> path, the job succeeds.
> Unfortunately, this means that the jars are being copied to the workers for 
> no reason.  How can I get --jars to add the jars it copies to the workers to 
> the classpath?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13202) Jars specified with --jars do not exist on the worker classpath.

2016-02-05 Thread Michael Schmitz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134332#comment-15134332
 ] 

Michael Schmitz commented on SPARK-13202:
-

[~srowen] in that case the documentation is confusing, as it claims 
spark.executor.extraClassPath is deprecated.  It also means that the `--jars` 
option isn't as useful as it could be: it copies the jars to the worker 
nodes, but I cannot use those jars for the Kryo serialization because I don't 
know where they will end up.

> Jars specified with --jars do not exist on the worker classpath.
> 
>
> Key: SPARK-13202
> URL: https://issues.apache.org/jira/browse/SPARK-13202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Michael Schmitz
>
> I have a Spark Scala 2.11 application.  To deploy it to the cluster, I create 
> a jar of the dependencies and a jar of the project (although this problem 
> still manifests if I create a single jar with everything).  I will focus on 
> problems specific to Spark Shell, but I'm pretty sure they also apply to 
> Spark Submit.
> I can get Spark Shell to work with my application, however I need to set 
> spark.executor.extraClassPath.  From reading the documentation 
> (http://spark.apache.org/docs/latest/configuration.html#runtime-environment) 
> it sounds like I shouldn't need to set this option ("Users typically should 
> not need to set this option.") After reading about --jars, I understand that 
> this should set the classpath for the workers to use the jars that are synced 
> to those machines.
> When I don't set spark.executor.extraClassPath, I get a kryo registrator 
> exception with the root cause being that a class is not found.
> java.io.IOException: org.apache.spark.SparkException: Failed to register 
> classes with Kryo
> java.lang.ClassNotFoundException: org.allenai.common.Enum
> If I SSH into the workers, I can see that we did create directories that 
> contain the jars specified by --jars.
> /opt/data/spark/worker/app-20160204212742-0002/0
> /opt/data/spark/worker/app-20160204212742-0002/1
> Now, if I re-run spark-shell but with `--conf 
> spark.executor.extraClassPath=/opt/data/spark/worker/app-20160204212742-0002/0/myjar.jar`,
>  my job will succeed.  In other words, if I put my jars at a location that is 
> available to all the workers and specify that as an extra executor class 
> path, the job succeeds.
> Unfortunately, this means that the jars are being copied to the workers for 
> no reason.  How can I get --jars to add the jars it copies to the workers to 
> the classpath?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11725) Let UDF to handle null value

2016-02-05 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134322#comment-15134322
 ] 

Cristian Opris commented on SPARK-11725:


I would argue this is a rather inappropriate solution. Scala does not normally 
distinguish between primitive and boxed types, with Long, for example, 
representing both.

So having Long args in a function and then testing them for null looks like a valid 
thing to do in Scala.

This is made worse by the fact that the behaviour is not documented anywhere, so it 
results in very strange and unexpected behaviour.

By the principle of least surprise, I would suggest either (or both) of:

- Throwing an error when a null value is passed to a UDF whose argument type has been 
compiled to not accept nulls.
- Using Option[T] as a UDF arg to signal that the function accepts nulls.
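
Until one of those is in place, a workaround is to keep NULLs away from the UDF at the 
SQL level. A minimal sketch (not the proposed fix), reusing the "f" UDF, the "age" 
column, and the df DataFrame from the description:

{code}
import org.apache.spark.sql.functions._

// Guard the UDF call so NULL never reaches the primitive argument.
val guarded = df.withColumn("age2", expr("CASE WHEN age IS NULL THEN NULL ELSE f(age) END"))

// Equivalent with the Column API:
val guarded2 = df.withColumn("age2",
  when(col("age").isNull, lit(null)).otherwise(callUDF("f", col("age"))))

guarded.show()
{code}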






> Let UDF to handle null value
> 
>
> Key: SPARK-11725
> URL: https://issues.apache.org/jira/browse/SPARK-11725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jeff Zhang
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> I notice that currently spark will take the long field as -1 if it is null.
> Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
>  Output ///
> ++---++
> | age|   name|age2|
> ++---++
> |null|Michael|   0|
> |  30|   Andy|  31|
> |  19| Justin|  20|
> ++---++
> {code}
> I think for the null value we have 3 options
> * Use a special value to represent it (what spark does now)
> * Always return null if the udf input has null value argument 
> * Let udf itself to handle null
> I would prefer the third option 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12969) Exception while casting a spark supported date formatted "string" to "date" data type.

2016-02-05 Thread Jais Sebastian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jais Sebastian updated SPARK-12969:
---
Description: 
Getting an exception while converting a string column (the column has the 
Spark-supported date format yyyy-MM-dd) to the date data type. Below is the code snippet:

List<String> jsonData = Arrays.asList( 
"{\"d\":\"2015-02-01\",\"n\":1}");
JavaRDD<String> dataRDD = this.getSparkContext().parallelize(jsonData);
DataFrame data = this.getSqlContext().read().json(dataRDD);
DataFrame newData = data.select(data.col("d").cast("date"));
newData.show();


The above code will give the error:

failed to compile: org.codehaus.commons.compiler.CompileException: File 
generated.java, Line 95, Column 28: Expression "scala.Option < Long > 
longOpt16" is not an lvalue

This happens only if we execute the program in client mode; it works if we 
execute it through spark-submit. Here is the sample project: 
https://github.com/uhonnavarkar/spark_test

  was:
Getting exception while  converting a string column( column is having spark 
supported date format -MM-dd ) to date data type. Below is the code snippet 

List jsonData = Arrays.asList( 
"{\"d\":\"2015-02-01\",\"n\":1}");
JavaRDD dataRDD = this.getSparkContext().parallelize(jsonData);
DataFrame data = this.getSqlContext().read().json(dataRDD);
DataFrame newData = data.select(data.col("d").cast("date"));
newData.show();


Above code will give the error

failed to compile: org.codehaus.commons.compiler.CompileException: File 
generated.java, Line 95, Column 28: Expression "scala.Option < Long > 
longOpt16" is not an lvalue



> Exception while  casting a spark supported date formatted "string" to "date" 
> data type.
> ---
>
> Key: SPARK-12969
> URL: https://issues.apache.org/jira/browse/SPARK-12969
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
> Environment: Spark Java 
>Reporter: Jais Sebastian
>
> Getting exception while  converting a string column( column is having spark 
> supported date format -MM-dd ) to date data type. Below is the code 
> snippet 
> List jsonData = Arrays.asList( 
> "{\"d\":\"2015-02-01\",\"n\":1}");
> JavaRDD dataRDD = 
> this.getSparkContext().parallelize(jsonData);
> DataFrame data = this.getSqlContext().read().json(dataRDD);
> DataFrame newData = data.select(data.col("d").cast("date"));
> newData.show();
> Above code will give the error
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> generated.java, Line 95, Column 28: Expression "scala.Option < Long > 
> longOpt16" is not an lvalue
> This happens only if we execute the program in client mode , it works if we 
> execute through spark submit. Here is the sample project : 
> https://github.com/uhonnavarkar/spark_test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12969) Exception while casting a spark supported date formatted "string" to "date" data type.

2016-02-05 Thread Jais Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134287#comment-15134287
 ] 

Jais Sebastian commented on SPARK-12969:


Hi,

It looks like this issue happens only if the code is deployed in client mode. It 
works if we execute the same code using spark-submit.

Here is the sample project for your reference:
https://github.com/uhonnavarkar/spark_test

Regards,
Jais

> Exception while  casting a spark supported date formatted "string" to "date" 
> data type.
> ---
>
> Key: SPARK-12969
> URL: https://issues.apache.org/jira/browse/SPARK-12969
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
> Environment: Spark Java 
>Reporter: Jais Sebastian
>
> Getting exception while  converting a string column( column is having spark 
> supported date format -MM-dd ) to date data type. Below is the code 
> snippet 
> List jsonData = Arrays.asList( 
> "{\"d\":\"2015-02-01\",\"n\":1}");
> JavaRDD dataRDD = 
> this.getSparkContext().parallelize(jsonData);
> DataFrame data = this.getSqlContext().read().json(dataRDD);
> DataFrame newData = data.select(data.col("d").cast("date"));
> newData.show();
> Above code will give the error
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> generated.java, Line 95, Column 28: Expression "scala.Option < Long > 
> longOpt16" is not an lvalue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13212) Provide a way to unregister data sources from a SQLContext

2016-02-05 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-13212:
--

 Summary: Provide a way to unregister data sources from a SQLContext
 Key: SPARK-13212
 URL: https://issues.apache.org/jira/browse/SPARK-13212
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Daniel Darabos


We allow our users to run SQL queries on their data via a web interface. We 
create an isolated SQLContext with {{sqlContext.newSession()}}, create their 
DataFrames in this context, register them with {{registerTempTable}}, then 
execute the query with {{isolatedContext.sql(query)}}.

The issue is that they have the full power of Spark SQL at their disposal. They 
can run {{SELECT * FROM csv.`/etc/passwd`}}. This specific syntax can be 
disabled by setting {{spark.sql.runSQLOnFiles}} (a private, undocumented 
configuration) to {{false}}. But creating a temporary table 
(http://spark.apache.org/docs/latest/sql-programming-guide.html#loading-data-programmatically)
 would still work, if we had a HiveContext.

As long as all DataSources on the classpath are readily available, I don't 
think we can be reassured about the security implications. So I think a nice 
solution would be to make the list of available DataSources a property of the 
SQLContext. Then for the isolated SQLContext we could simply remove all 
DataSources. This would allow more fine-grained use cases too.

What do you think?
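
For context, a minimal sketch of the setup described above (1.6 API; the path, the table 
name, and {{userQuery}} are placeholders):

{code}
// Isolated session so one user's temp tables don't leak into another's.
val isolated = sqlContext.newSession()

// Only blocks the SELECT * FROM csv.`...` syntax, and it is a private setting.
isolated.setConf("spark.sql.runSQLOnFiles", "false")

// DataFrames are created in the isolated context and exposed under fixed names.
val events = isolated.read.parquet("/data/events")   // placeholder path
events.registerTempTable("events")

// The user-supplied query still has every DataSource on the classpath available to it.
val result = isolated.sql(userQuery)                 // userQuery comes from the web UI
{code}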



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10548) Concurrent execution in SQL does not work

2016-02-05 Thread Akshay Harale (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134146#comment-15134146
 ] 

Akshay Harale commented on SPARK-10548:
---

We are still facing this issue while repeatedly querying a Cassandra database 
using spark-cassandra-connector.

Spark version: 1.5.1
spark-cassandra-connector: 1.5.0-M3

> Concurrent execution in SQL does not work
> -
>
> Key: SPARK-10548
> URL: https://issues.apache.org/jira/browse/SPARK-10548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> From the mailing list:
> {code}
> future { df1.count() } 
> future { df2.count() } 
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> === edit ===
> Simple reproduction:
> {code}
> (1 to 100).par.foreach { _ =>
>   sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12739) Details of batch in Streaming tab uses two Duration columns

2016-02-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12739:
--
Assignee: Mario Briggs

> Details of batch in Streaming tab uses two Duration columns
> ---
>
> Key: SPARK-12739
> URL: https://issues.apache.org/jira/browse/SPARK-12739
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Assignee: Mario Briggs
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
> Attachments: SPARK-12739.png
>
>
> "Details of batch" screen in Streaming tab in web UI uses two Duration 
> columns. I think one should be "Processing Time" while the other "Job 
> Duration".
> See the attachment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.

2016-02-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134028#comment-15134028
 ] 

Sean Owen commented on SPARK-13198:
---

I don't see evidence of a problem here yet. Objects stay on the heap until they are 
GCed. Are you sure you triggered a GC, for example with a profiler, and then measured?

You also would generally never stop and start a context repeatedly in an app.

> sc.stop() does not clean up on driver, causes Java heap OOM.
> 
>
> Key: SPARK-13198
> URL: https://issues.apache.org/jira/browse/SPARK-13198
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Herman Schistad
> Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot 
> 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png
>
>
> When starting and stopping multiple SparkContext's linearly eventually the 
> driver stops working with a "io.netty.handler.codec.EncoderException: 
> java.lang.OutOfMemoryError: Java heap space" error.
> Reproduce by running the following code and loading in ~7MB parquet data each 
> time. The driver heap space is not changed and thus defaults to 1GB:
> {code:java}
> def main(args: Array[String]) {
>   val conf = new SparkConf().setMaster("MASTER_URL").setAppName("")
>   conf.set("spark.mesos.coarse", "true")
>   conf.set("spark.cores.max", "10")
>   for (i <- 1 until 100) {
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val events = sqlContext.read.parquet("hdfs://locahost/tmp/something")
> println(s"Context ($i), number of events: " + events.count)
> sc.stop()
>   }
> }
> {code}
> The heap space fills up within 20 loops on my cluster. Increasing the number 
> of cores to 50 in the above example results in heap space error after 12 
> contexts.
> Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" 
> objects (see attachments). Digging into the inner objects tells me that the 
> `executorDataMap` is where 99% of the data in said object is stored. I do 
> believe though that this is beside the point as I'd expect this whole object 
> to be garbage collected or freed on sc.stop(). 
> Additionally I can see in the Spark web UI that each time a new context is 
> created the number of the "SQL" tab increments by one (i.e. last iteration 
> would have SQL99). After doing stop and creating a completely new context I 
> was expecting this number to be reset to 1 ("SQL").
> I'm submitting the jar file with `spark-submit` and no special flags. The 
> cluster is running Mesos 0.23. I'm running Spark 1.6.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13199) Upgrade apache httpclient version to the latest 4.5 for security

2016-02-05 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134022#comment-15134022
 ] 

Ted Yu commented on SPARK-13199:


This task should be done when there is a Hadoop release that upgrades 
httpclient to 4.5+.

I logged this as a Bug since it deals with CVEs; see HADOOP-12767.

> Upgrade apache httpclient version to the latest 4.5 for security
> 
>
> Key: SPARK-13199
> URL: https://issues.apache.org/jira/browse/SPARK-13199
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Ted Yu
>Priority: Minor
>
> Various SSL security fixes are needed. 
> See:  CVE-2012-6153, CVE-2011-4461, CVE-2014-3577, CVE-2015-5262.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!

2016-02-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134010#comment-15134010
 ] 

Sean Owen commented on SPARK-13176:
---

Can we possibly not use this API at all? Sorry, I haven't investigated whether 
that's realistic. I don't have any other bright ideas.

> Ignore deprecation warning for ProcessBuilder lines_!
> -
>
> Key: SPARK-13176
> URL: https://issues.apache.org/jira/browse/SPARK-13176
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> The replacement,  stream_! & lineStream_! is not present in 2.10 API.
> Note @SupressWarnings for deprecation doesn't appear to work 
> https://issues.scala-lang.org/browse/SI-7934 so suppressing the warnings 
> might involve wrapping or similar.
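
A rough sketch of the "wrapping" idea: confine the deprecated call to a single 
compatibility shim so only one file produces the warning (this is an assumption about 
how the cleanup might look, not the agreed approach):

{code}
import scala.sys.process.ProcessBuilder

object ProcessStreamCompat {
  // Single call site for the deprecated API; once 2.10 support is dropped this
  // can switch to lineStream_! without touching the callers.
  def lines(pb: ProcessBuilder): Stream[String] = pb.lines_!
}
{code}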



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13204) Replace use of mutable.SynchronizedMap with ConcurrentHashMap

2016-02-05 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134006#comment-15134006
 ] 

Ted Yu commented on SPARK-13204:


Just noticed that the parent JIRA was marked Trivial.

I should have done the search first.

> Replace use of mutable.SynchronizedMap with ConcurrentHashMap
> -
>
> Key: SPARK-13204
> URL: https://issues.apache.org/jira/browse/SPARK-13204
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Trivial
>
> From 
> http://www.scala-lang.org/api/2.11.1/index.html#scala.collection.mutable.SynchronizedMap
>  :
> Synchronization via traits is deprecated as it is inherently unreliable. 
> Consider java.util.concurrent.ConcurrentHashMap as an alternative.
> This issue is to replace the use of mutable.SynchronizedMap and add 
> scalastyle rule banning mutable.SynchronizedMap
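
A minimal before/after sketch of the replacement being proposed (names are illustrative):

{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

// Before: deprecated synchronization via trait mixing.
val oldMap = new mutable.HashMap[String, Long]() with mutable.SynchronizedMap[String, Long]

// After: ConcurrentHashMap with atomic per-key operations.
val newMap = new ConcurrentHashMap[String, Long]()
newMap.putIfAbsent("stages", 0L)
newMap.put("tasks", 3L)
val tasks = newMap.getOrDefault("tasks", 0L)
{code}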



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated

2016-02-05 Thread sachin aggarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133963#comment-15133963
 ] 

sachin aggarwal commented on SPARK-13172:
-

Since Scala also recommends the same (see 
http://www.scala-lang.org/api/2.11.1/index.html#scala.runtime.RichException), 
it should be changed. I will go ahead and make the change.
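
For reference, a minimal before/after sketch of the change (assuming the deprecated call 
in question is the getStackTraceString enrichment that RichException provides):

{code}
val e = new RuntimeException("boom")

// Before (deprecated RichException enrichment):
//   val trace = e.getStackTraceString
// After, using Throwable#getStackTrace directly:
val trace = e.getStackTrace.mkString("", System.lineSeparator(), System.lineSeparator())
{code}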

> Stop using RichException.getStackTrace it is deprecated
> ---
>
> Key: SPARK-13172
> URL: https://issues.apache.org/jira/browse/SPARK-13172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> Throwable getStackTrace is the recommended alternative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13188) There is OMM when a receiver stores block

2016-02-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13188.
---
  Resolution: Not A Problem
   Fix Version/s: (was: 2.0.0)
Target Version/s:   (was: 2.0.0)

Based on this alone, you merely didn't have enough memory or it was too tight. 
Not a bug.

Please, again, don't set Target or Fix Version.

> There is OMM when a receiver stores block
> -
>
> Key: SPARK-13188
> URL: https://issues.apache.org/jira/browse/SPARK-13188
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: KaiXinXIaoLei
>
> 2016-02-04 15:13:27,349 | ERROR | [Thread-35] | Uncaught exception in thread 
> Thread[Thread-35,5,main] 
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:173)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:156)
> at 
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127)
> at 
> org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:124)
> at 
> org.apache.spark.streaming.kafka.ReliableKafkaReceiver.org$apache$spark$streaming$kafka$ReliableKafkaReceiver$$storeBlockAndCommitOffset(ReliableKafkaReceiver.scala:215)
> at 
> org.apache.spark.streaming.kafka.ReliableKafkaReceiver$GeneratedBlockHandler.onPushBlock(ReliableKafkaReceiver.scala:294)
> at 
> org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:294)
> at 
> org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:266)
> at 
> org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13039) Spark Streaming with Mesos shutdown without any reason on logs

2016-02-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-13039:
---

> Spark Streaming with Mesos shutdown without any reason on logs
> --
>
> Key: SPARK-13039
> URL: https://issues.apache.org/jira/browse/SPARK-13039
> Project: Spark
>  Issue Type: Question
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Luis Alves
>Priority: Minor
>
> I've a Spark Application running with Mesos that is being killed (this 
> happens every 2 days). When I see the logs, this is what I have in the spark 
> driver:
> {quote}
> 16/01/27 05:24:24 INFO JobScheduler: Starting job streaming job 1453872264000 ms.0 from job set of time 1453872264000 ms
> 16/01/27 05:24:24 INFO JobScheduler: Added jobs for time 1453872264000 ms
> 16/01/27 05:24:24 INFO SparkContext: Starting job: foreachRDD at StreamingApplication.scala:59
> 16/01/27 05:24:24 INFO DAGScheduler: Got job 40085 (foreachRDD at StreamingApplication.scala:59) with 1 output partitions
> 16/01/27 05:24:24 INFO DAGScheduler: Final stage: ResultStage 40085(foreachRDD at StreamingApplication.scala:59)
> 16/01/27 05:24:24 INFO DAGScheduler: Parents of final stage: List()
> 16/01/27 05:24:24 INFO DAGScheduler: Missing parents: List()
> 16/01/27 05:24:24 INFO DAGScheduler: Submitting ResultStage 40085 (MapPartitionsRDD[80171] at map at StreamingApplication.scala:59), which has no missing parents
> 16/01/27 05:24:24 INFO MemoryStore: ensureFreeSpace(4720) called with curMem=147187, maxMem=560497950
> 16/01/27 05:24:24 INFO MemoryStore: Block broadcast_40085 stored as values in memory (estimated size 4.6 KB, free 534.4 MB)
> Killed
> {quote}
> And this is what I see in the spark slaves:
> {quote}
> 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80167
> 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80166
> 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80166
> I0127 05:24:24.070618 11142 exec.cpp:381] Executor asked to shutdown
> 16/01/27 05:24:24 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
> 16/01/27 05:24:24 ERROR CoarseGrainedExecutorBackend: Driver 10.241.10.13:51810 disassociated! Shutting down.
> 16/01/27 05:24:24 INFO DiskBlockManager: Shutdown hook called
> 16/01/27 05:24:24 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@10.241.10.13:51810] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 16/01/27 05:24:24 INFO ShutdownHookManager: Shutdown hook called
> 16/01/27 05:24:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f80464b5-1de2-461e-b78b-8ddbd077682a
> {quote}
> As you can see, this doesn't give any information about why the driver was killed.
> The Mesos version I'm using is 0.25.0.
> How can I get more information about why it is being killed?
> Curious fact: I also have a Spark Jobserver cluster running without any problem (same versions).






[jira] [Resolved] (SPARK-13039) Spark Streaming with Mesos shutdown without any reason on logs

2016-02-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13039.
---
Resolution: Not A Problem

> Spark Streaming with Mesos shutdown without any reason on logs
> --
>
> Key: SPARK-13039
> URL: https://issues.apache.org/jira/browse/SPARK-13039
> Project: Spark
>  Issue Type: Question
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Luis Alves
>Priority: Minor
>
> I have a Spark application running on Mesos that is being killed (this happens every 2 days). When I look at the logs, this is what I see in the Spark driver:
> {quote}
> 16/01/27 05:24:24 INFO JobScheduler: Starting job streaming job 1453872264000 ms.0 from job set of time 1453872264000 ms
> 16/01/27 05:24:24 INFO JobScheduler: Added jobs for time 1453872264000 ms
> 16/01/27 05:24:24 INFO SparkContext: Starting job: foreachRDD at StreamingApplication.scala:59
> 16/01/27 05:24:24 INFO DAGScheduler: Got job 40085 (foreachRDD at StreamingApplication.scala:59) with 1 output partitions
> 16/01/27 05:24:24 INFO DAGScheduler: Final stage: ResultStage 40085(foreachRDD at StreamingApplication.scala:59)
> 16/01/27 05:24:24 INFO DAGScheduler: Parents of final stage: List()
> 16/01/27 05:24:24 INFO DAGScheduler: Missing parents: List()
> 16/01/27 05:24:24 INFO DAGScheduler: Submitting ResultStage 40085 (MapPartitionsRDD[80171] at map at StreamingApplication.scala:59), which has no missing parents
> 16/01/27 05:24:24 INFO MemoryStore: ensureFreeSpace(4720) called with curMem=147187, maxMem=560497950
> 16/01/27 05:24:24 INFO MemoryStore: Block broadcast_40085 stored as values in memory (estimated size 4.6 KB, free 534.4 MB)
> Killed
> {quote}
> And this is what I see in the spark slaves:
> {quote}
> 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80167
> 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80166
> 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80166
> I0127 05:24:24.070618 11142 exec.cpp:381] Executor asked to shutdown
> 16/01/27 05:24:24 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
> 16/01/27 05:24:24 ERROR CoarseGrainedExecutorBackend: Driver 10.241.10.13:51810 disassociated! Shutting down.
> 16/01/27 05:24:24 INFO DiskBlockManager: Shutdown hook called
> 16/01/27 05:24:24 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@10.241.10.13:51810] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 16/01/27 05:24:24 INFO ShutdownHookManager: Shutdown hook called
> 16/01/27 05:24:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f80464b5-1de2-461e-b78b-8ddbd077682a
> {quote}
> As you can see, this doesn't give any information about why the driver was killed.
> The Mesos version I'm using is 0.25.0.
> How can I get more information about why it is being killed?
> Curious fact: I also have a Spark Jobserver cluster running without any problem (same versions).






[jira] [Resolved] (SPARK-13202) Jars specified with --jars do not exist on the worker classpath.

2016-02-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13202.
---
Resolution: Not A Problem

You have a case where you need your classes available at the same level as the Spark and Kryo classes. That's not a bug. You seem to have done it correctly.

> Jars specified with --jars do not exist on the worker classpath.
> 
>
> Key: SPARK-13202
> URL: https://issues.apache.org/jira/browse/SPARK-13202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Michael Schmitz
>
> I have a Spark Scala 2.11 application.  To deploy it to the cluster, I create 
> a jar of the dependencies and a jar of the project (although this problem 
> still manifests if I create a single jar with everything).  I will focus on 
> problems specific to Spark Shell, but I'm pretty sure they also apply to 
> Spark Submit.
> I can get Spark Shell to work with my application, however I need to set 
> spark.executor.extraClassPath.  From reading the documentation 
> (http://spark.apache.org/docs/latest/configuration.html#runtime-environment) 
> it sounds like I shouldn't need to set this option ("Users typically should 
> not need to set this option.") After reading about --jars, I understand that 
> this should set the classpath for the workers to use the jars that are synced 
> to those machines.
> When I don't set spark.executor.extraClassPath, I get a kryo registrator 
> exception with the root cause being that a class is not found.
> java.io.IOException: org.apache.spark.SparkException: Failed to register 
> classes with Kryo
> java.lang.ClassNotFoundException: org.allenai.common.Enum
> If I SSH into the workers, I can see that we did create directories that 
> contain the jars specified by --jars.
> /opt/data/spark/worker/app-20160204212742-0002/0
> /opt/data/spark/worker/app-20160204212742-0002/1
> Now, if I re-run spark-shell but with `--conf 
> spark.executor.extraClassPath=/opt/data/spark/worker/app-20160204212742-0002/0/myjar.jar`,
>  my job will succeed.  In other words, if I put my jars at a location that is 
> available to all the workers and specify that as an extra executor class 
> path, the job succeeds.
> Unfortunately, this means that the jars are being copied to the workers for 
> no reason.  How can I get --jars to add the jars it copies to the workers to 
> the classpath?
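
For context, a minimal sketch (assumed names, not the reporter's code) of the kind of Kryo registrator involved. Consistent with the discussion above, the jar containing the registrator and the classes it registers has to be visible to the executor's class loader, which in this setup meant spark.executor.extraClassPath rather than --jars alone; it would be enabled via spark.serializer=org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator.

{code}
// Illustrative custom Kryo registrator (MyRegistrator and the registered class
// are stand-ins). A class that cannot be loaded on the executor surfaces as
// "Failed to register classes with Kryo" with a ClassNotFoundException cause.
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[String]]) // register application classes here
  }
}
{code}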






[jira] [Comment Edited] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.

2016-02-05 Thread sachin aggarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133954#comment-15133954
 ] 

sachin aggarwal edited comment on SPARK-13177 at 2/5/16 10:23 AM:
--

Hi [~holdenk],

I would like to work on this.

thanks
Sachin


was (Author: sachin aggarwal):
Hi [~AlHolden],

I would like to work on this.

thanks
Sachin

> Update ActorWordCount example to not directly use low level linked list as it 
> is deprecated.
> 
>
> Key: SPARK-13177
> URL: https://issues.apache.org/jira/browse/SPARK-13177
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: holdenk
>Priority: Minor
>
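
For reference, a hedged sketch of the kind of change the title describes: moving off the deprecated scala.collection.mutable.LinkedList onto a supported mutable collection. The names below are illustrative and are not the actual ActorWordCount code.

{code}
import scala.collection.mutable

// Before (deprecated since Scala 2.11):
//   var receivers = new mutable.LinkedList[String]()
//   receivers = receivers :+ "worker-1"

// After: a supported mutable collection with equivalent usage.
val receivers = mutable.ArrayBuffer.empty[String]
receivers += "worker-1"
receivers.foreach(println)
{code}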







[jira] [Commented] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.

2016-02-05 Thread sachin aggarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133954#comment-15133954
 ] 

sachin aggarwal commented on SPARK-13177:
-

Hi [~AlHolden],

I would like to work on this.

thanks
Sachin

> Update ActorWordCount example to not directly use low level linked list as it 
> is deprecated.
> 
>
> Key: SPARK-13177
> URL: https://issues.apache.org/jira/browse/SPARK-13177
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: holdenk
>Priority: Minor
>







[jira] [Resolved] (SPARK-13192) Some memory leak problem?

2016-02-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13192.
---
Resolution: Invalid

Questions go to user@

> Some memory leak problem?
> -
>
> Key: SPARK-13192
> URL: https://issues.apache.org/jira/browse/SPARK-13192
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: uncleGen
>
> In my Spark Streaming job, the executor container was killed by YARN. The NodeManager log showed that container memory was constantly increasing and eventually exceeded the container memory limit. Specifically, only the container that hosts the receiver has this problem. Is there a memory leak here, or is this a known memory leak issue?
> My code snippet:
> {code}
> dStream.foreachRDD(new Function<JavaRDD<T>, Void>() {
>   @Override
>   public Void call(JavaRDD<T> rdd) throws Exception {
>     T card = rdd.first();
>     String time = DateUtil.formatToDay(DateUtil.strLong(card.getTime()));
>     System.out.println("time:" + time + ", count:" + rdd.count());
>     return null;
>   }
> });
> {code}
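
A side note on the snippet above: first() and count() are two separate actions, so each batch RDD is computed twice unless it is cached. A minimal Scala sketch of the same per-batch logging, assuming a helper extractTime that stands in for the reporter's DateUtil/card logic (illustrative only, not a fix for the reported container growth):

{code}
import org.apache.spark.streaming.dstream.DStream

// Log the first record's timestamp and the batch size for each micro-batch.
// `extractTime` is an assumed stand-in for DateUtil.formatToDay(DateUtil.strLong(...)).
def logBatch[T](stream: DStream[T])(extractTime: T => String): Unit = {
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      rdd.cache()                                   // reuse the batch across the two actions below
      val time = extractTime(rdd.first())           // timestamp of the first record
      println(s"time:$time, count:${rdd.count()}")  // number of records in the batch
      rdd.unpersist()
    }
  }
}
{code}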






[jira] [Commented] (SPARK-13191) Update LICENSE with Scala 2.11 dependencies

2016-02-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133952#comment-15133952
 ] 

Sean Owen commented on SPARK-13191:
---

It is really not necessary to make so many PRs and JIRAs! These are tiny and 
logically related changes. It makes more noise than it is worth. 

> Update LICENSE with Scala 2.11 dependencies
> ---
>
> Key: SPARK-13191
> URL: https://issues.apache.org/jira/browse/SPARK-13191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Luciano Resende
>







[jira] [Updated] (SPARK-13199) Upgrade apache httpclient version to the latest 4.5 for security

2016-02-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13199:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

This also was not filled out correctly.

Ted, as you probably know, it is not necessarily that simple. You need to investigate the impact of making all transitive dependencies use it, and what happens when an earlier version is provided by Hadoop. Have you done that?

> Upgrade apache httpclient version to the latest 4.5 for security
> 
>
> Key: SPARK-13199
> URL: https://issues.apache.org/jira/browse/SPARK-13199
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Ted Yu
>Priority: Minor
>
> Various SSL security fixes are needed. 
> See:  CVE-2012-6153, CVE-2011-4461, CVE-2014-3577, CVE-2015-5262.
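
As a rough illustration of the kind of version pinning and conflict checking being discussed (Spark's own build uses Maven, so this sbt-style sketch is only an analogue; the coordinates are the standard Apache httpclient ones):

{code}
// build.sbt sketch (illustrative): pin httpclient to a 4.5.x release and force
// transitive dependencies onto the same version, then verify nothing breaks when
// an earlier version is provided by the Hadoop classpath at runtime.
libraryDependencies += "org.apache.httpcomponents" % "httpclient" % "4.5.1"
dependencyOverrides += "org.apache.httpcomponents" % "httpclient" % "4.5.1"
{code}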






[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-02-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133946#comment-15133946
 ] 

Sean Owen commented on SPARK-12807:
---

Why vary this to use an earlier version? It seems like all downside.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)






[jira] [Resolved] (SPARK-13204) Replace use of mutable.SynchronizedMap with ConcurrentHashMap

2016-02-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13204.
---
Resolution: Duplicate

> Replace use of mutable.SynchronizedMap with ConcurrentHashMap
> -
>
> Key: SPARK-13204
> URL: https://issues.apache.org/jira/browse/SPARK-13204
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Trivial
>
> From 
> http://www.scala-lang.org/api/2.11.1/index.html#scala.collection.mutable.SynchronizedMap
>  :
> Synchronization via traits is deprecated as it is inherently unreliable. 
> Consider java.util.concurrent.ConcurrentHashMap as an alternative.
> This issue is to replace the use of mutable.SynchronizedMap and add a scalastyle rule banning mutable.SynchronizedMap.
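
A short before/after sketch of the replacement the issue calls for (illustrative key and value types; the actual call sites in Spark vary):

{code}
import java.util.concurrent.ConcurrentHashMap

// Before (deprecated since Scala 2.11):
//   val cache = new scala.collection.mutable.HashMap[String, String]
//     with scala.collection.mutable.SynchronizedMap[String, String]

// After: the alternative the scaladoc recommends.
val cache = new ConcurrentHashMap[String, String]()
cache.put("stage-1", "RUNNING")
val state = Option(cache.get("stage-1")).getOrElse("UNKNOWN") // thread-safe read with a default
{code}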






[jira] [Updated] (SPARK-13204) Replace use of mutable.SynchronizedMap with ConcurrentHashMap

2016-02-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13204:
--
   Priority: Trivial  (was: Major)
Component/s: Spark Core
 Issue Type: Improvement  (was: Bug)

> Replace use of mutable.SynchronizedMap with ConcurrentHashMap
> -
>
> Key: SPARK-13204
> URL: https://issues.apache.org/jira/browse/SPARK-13204
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Trivial
>
> From 
> http://www.scala-lang.org/api/2.11.1/index.html#scala.collection.mutable.SynchronizedMap
>  :
> Synchronization via traits is deprecated as it is inherently unreliable. 
> Consider java.util.concurrent.ConcurrentHashMap as an alternative.
> This issue is to replace the use of mutable.SynchronizedMap and add a scalastyle rule banning mutable.SynchronizedMap.






[jira] [Commented] (SPARK-13204) Replace use of mutable.SynchronizedMap with ConcurrentHashMap

2016-02-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133940#comment-15133940
 ] 

Sean Owen commented on SPARK-13204:
---

I think you have been around long enough to know that you need to set the JIRA fields correctly if you're going to open one. This can't be a 'major bug', and it has no component. We also just fixed this, right? It's a duplicate.

> Replace use of mutable.SynchronizedMap with ConcurrentHashMap
> -
>
> Key: SPARK-13204
> URL: https://issues.apache.org/jira/browse/SPARK-13204
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>
> From 
> http://www.scala-lang.org/api/2.11.1/index.html#scala.collection.mutable.SynchronizedMap
>  :
> Synchronization via traits is deprecated as it is inherently unreliable. 
> Consider java.util.concurrent.ConcurrentHashMap as an alternative.
> This issue is to replace the use of mutable.SynchronizedMap and add a scalastyle rule banning mutable.SynchronizedMap.





