[jira] [Commented] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135630#comment-15135630 ] Jason C Lee commented on SPARK-12519: - The memory leak in question comes from a TungstenAggregationIterator's hashmap that is usually freed at the end of iterating through the list. However, show() by default takes the first 20 items and does not usually get to the end of the list. Therefore, the hashmap does not get freed. Any idea how to fix it is welcome. > "Managed memory leak detected" when using distinct on PySpark DataFrame > --- > > Key: SPARK-12519 > URL: https://issues.apache.org/jira/browse/SPARK-12519 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 > Environment: OS X 10.9.5, Java 1.8.0_66 >Reporter: Paul Shearer > > After running the distinct() method to transform a DataFrame, subsequent > actions like count() and show() may report a managed memory leak. Here is a > minimal example that reproduces the bug on my machine: > h1. Script > {noformat} > logger = sc._jvm.org.apache.log4j > logger.LogManager.getLogger("org"). setLevel( logger.Level.WARN ) > logger.LogManager.getLogger("akka").setLevel( logger.Level.WARN ) > import string > import random > def id_generator(size=6, chars=string.ascii_uppercase + string.digits): > return ''.join(random.choice(chars) for _ in range(size)) > nrow = 8 > ncol = 20 > ndrow = 4 # number distinct rows > tmp = [id_generator() for i in xrange(ndrow*ncol)] > tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in > xrange(nrow)] > dat = sc.parallelize(tmp,1000).toDF() > dat = dat.distinct() # if this line is commented out, no memory leak will be > reported > # dat = dat.rdd.distinct().toDF() # if this line is used instead of the > above, no leak > ct = dat.count() > print ct > # memory leak warning prints at this point in the code > dat.show() > {noformat} > h1. Output > When this script is run in PySpark (with IPython kernel), I get this error: > {noformat} > $ pyspark --executor-memory 12G --driver-memory 12G > Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) > Type "copyright", "credits" or "license" for more information. > IPython 4.0.0 -- An enhanced Interactive Python. > ? -> Introduction and overview of IPython's features. > %quickref -> Quick reference. > help -> Python's own help system. > object? -> Details about 'object', use 'object??' for extra details. > <<<... usual loading info...>>> > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12) > SparkContext available as sc, SQLContext available as sqlContext. 
> In [1]: execfile('bugtest.py') > 4 > 15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = > 16777216 bytes, TID = 2202 > +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ > |_1|_2|_3|_4|_5|_6|_7|_8|_9| _10| > _11| _12| _13| _14| _15| _16| _17| _18| _19| _20| > +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ > |83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK| > |BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5| > |IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ| > |S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR| > |U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV| > |3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L| > |XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL| > |4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL| > |USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH| > |0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B
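To illustrate the mechanism described in the comment above, here is a minimal sketch in plain Scala (class and field names are invented for illustration; this is not Spark's actual TungstenAggregationIterator): a buffer that is released only when the iterator is fully drained will never be released by a take(20)-style consumer such as show(), and one commonly suggested remedy is to also free it from a task-completion callback.
{code}
// Sketch only: an iterator that frees its backing buffer when it is exhausted.
final class AggIterator(rows: Iterator[Long]) extends Iterator[Long] {
  // Stands in for the aggregation hash map backed by managed memory.
  private var buffer: Array[Byte] = new Array[Byte](16 * 1024 * 1024)

  def hasNext: Boolean = {
    val more = rows.hasNext
    if (!more) free()          // only freed once the caller reaches the end
    more
  }

  def next(): Long = rows.next()

  def free(): Unit = { buffer = null }   // release the "managed" memory
}

val it = new AggIterator(Iterator.range(0, 1000).map(_.toLong))
it.take(20).foreach(println)   // like show(): stops after 20 rows, so free() never runs

// A possible fix (hedged): also release the buffer from a task-completion hook,
// e.g. TaskContext.get().addTaskCompletionListener(_ => it.free()) inside a task,
// so partial consumption no longer triggers the "managed memory leak" warning.
{code}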
[jira] [Updated] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mao, Wei updated SPARK-13222: - Description: The checkpoint interval of a DStream can be customized to a multiple of the batch interval, so a shutdown request can arrive in between two RDD checkpoints. In case the input source is not repeatable and the user does not want to enable the WAL because of its extra cost, the application could be unrecoverable. (was: In case the input source is not repeatable, and the user does not want to enable the WAL because of its overhead, we need to make sure the latest state of the stateful RDD is checkpointed.) > checkpoint latest stateful RDD on graceful shutdown > --- > > Key: SPARK-13222 > URL: https://issues.apache.org/jira/browse/SPARK-13222 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Mao, Wei > > The checkpoint interval of a DStream can be customized to a multiple of the > batch interval, so a shutdown request can arrive in between two RDD > checkpoints. In case the input source is not repeatable and the user does not > want to enable the WAL because of its extra cost, the application could be > unrecoverable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
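To make the timing gap concrete, here is a hedged sketch (the stream source, paths, and intervals are assumptions, not taken from the ticket): with a 10-second batch interval and a 50-second checkpoint interval, a graceful stop can land up to four batches after the last checkpoint.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-gap-sketch")
val ssc = new StreamingContext(conf, Seconds(10))      // batch interval: 10s
ssc.checkpoint("/tmp/checkpoints")                     // assumed checkpoint dir

val counts = ssc.socketTextStream("localhost", 9999)   // assumed input source
  .map(word => (word, 1L))
  .updateStateByKey((values: Seq[Long], state: Option[Long]) =>
    Some(values.sum + state.getOrElse(0L)))
counts.checkpoint(Seconds(50))                         // checkpoint every 5th batch
counts.print()

ssc.start()
// A graceful stop issued here may fall between two checkpoints; if the source
// is not replayable and the WAL is disabled, the un-checkpointed state is lost.
ssc.stop(stopSparkContext = true, stopGracefully = true)
{code}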
[jira] [Assigned] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13222: Assignee: (was: Apache Spark) > checkpoint latest stateful RDD on graceful shutdown > --- > > Key: SPARK-13222 > URL: https://issues.apache.org/jira/browse/SPARK-13222 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Mao, Wei > > In case the input source is not repeatable, and user don't want to enable WAL > because of the overhead costing, we need to make sure the latest status of > stateful RDD is checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135611#comment-15135611 ] Apache Spark commented on SPARK-13222: -- User 'mwws' has created a pull request for this issue: https://github.com/apache/spark/pull/11101 > checkpoint latest stateful RDD on graceful shutdown > --- > > Key: SPARK-13222 > URL: https://issues.apache.org/jira/browse/SPARK-13222 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Mao, Wei > > In case the input source is not repeatable, and user don't want to enable WAL > because of the overhead costing, we need to make sure the latest status of > stateful RDD is checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13222: Assignee: Apache Spark > checkpoint latest stateful RDD on graceful shutdown > --- > > Key: SPARK-13222 > URL: https://issues.apache.org/jira/browse/SPARK-13222 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Mao, Wei >Assignee: Apache Spark > > In case the input source is not repeatable, and user don't want to enable WAL > because of the overhead costing, we need to make sure the latest status of > stateful RDD is checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mao, Wei updated SPARK-13222: - Description: In case the input source is not repeatable, and user don't want to enable WAL because of the overhead costing, we need to make sure the latest status of stateful RDD is checkpointed. > checkpoint latest stateful RDD on graceful shutdown > --- > > Key: SPARK-13222 > URL: https://issues.apache.org/jira/browse/SPARK-13222 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Mao, Wei > > In case the input source is not repeatable, and user don't want to enable WAL > because of the overhead costing, we need to make sure the latest status of > stateful RDD is checkpointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mao, Wei updated SPARK-13222: - Component/s: Streaming > checkpoint latest stateful RDD on graceful shutdown > --- > > Key: SPARK-13222 > URL: https://issues.apache.org/jira/browse/SPARK-13222 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Mao, Wei > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13222) checkpoint latest stateful RDD on graceful shutdown
Mao, Wei created SPARK-13222: Summary: checkpoint latest stateful RDD on graceful shutdown Key: SPARK-13222 URL: https://issues.apache.org/jira/browse/SPARK-13222 Project: Spark Issue Type: Bug Reporter: Mao, Wei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13171) Update promise & future to Promise and Future as the old ones are deprecated
[ https://issues.apache.org/jira/browse/SPARK-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13171. - Resolution: Fixed Assignee: Jakob Odersky Fix Version/s: 2.0.0 > Update promise & future to Promise and Future as the old ones are deprecated > > > Key: SPARK-13171 > URL: https://issues.apache.org/jira/browse/SPARK-13171 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Assignee: Jakob Odersky >Priority: Trivial > Fix For: 2.0.0 > > > We use the promise and future functions on the concurrent object, both of > which have been deprecated in 2.11 . The full traits are present in Scala > 2.10 as well so this should be a safe migration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
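For reference, the change amounts to swapping the deprecated lower-case helpers for their companion objects; a small sketch (not code from the PR):
{code}
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global

// Deprecated in Scala 2.11 (still compiles in 2.10):
//   val f = scala.concurrent.future { 1 + 1 }
//   val p = scala.concurrent.promise[Int]()

// Non-deprecated equivalents, available in both 2.10 and 2.11:
val f: Future[Int] = Future { 1 + 1 }
val p: Promise[Int] = Promise[Int]()
p.success(42)
{code}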
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135447#comment-15135447 ] Cristian Opris commented on SPARK-11725: You're right; sorry, it seems I lost track of which Scala weirdness applies here:
{code}
scala> def i : Int = (null : java.lang.Integer)
i: Int

scala> i == null
:22: warning: comparing values of types Int and Null using `==' will always yield false
i == null
^
java.lang.NullPointerException
  at scala.Predef$.Integer2int(Predef.scala:392)
{code}
At least the previous behaviour of passing 0 as the default value was consistent with this specific Scala weirdness:
{code}
scala> val i: Int = null.asInstanceOf[Int]
i: Int = 0
{code}
But the current "null propagation" behaviour seems even more confusing semantically: for a UDF with multiple arguments, simply not calling the UDF when any of the arguments is null can hardly make sense semantically.
{code}
def udf(i : Int, l: Long, s: String) = ...
sql("select udf(i, l, s) from df")
{code}
I understand the need not to break existing code, but a clearer, documented UDF spec, perhaps based on Option, would really help here. > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jeff Zhang >Assignee: Wenchen Fan >Priority: Blocker > Labels: releasenotes > Fix For: 1.6.0 > > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. > {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > ++---++ > | age| name|age2| > ++---++ > |null|Michael| 0| > | 30| Andy| 31| > | 19| Justin| 20| > ++---++ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if the udf input has null value argument > * Let udf itself to handle null > I would prefer the third option -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13221) GroupingSets Returns an Incorrect Results
[ https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13221: Assignee: Apache Spark > GroupingSets Returns an Incorrect Results > - > > Key: SPARK-13221 > URL: https://issues.apache.org/jira/browse/SPARK-13221 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > > The following query returns a wrong result: > {code} > sql("select course, sum(earnings) as sum from courseSales group by course, > earnings" + > " grouping sets((), (course), (course, earnings))" + > " order by course, sum").show() > {code} > Before the fix, the results are like > {code} > [null,null] > [Java,null] > [Java,2.0] > [Java,3.0] > [dotNET,null] > [dotNET,5000.0] > [dotNET,1.0] > [dotNET,48000.0] > {code} > After the fix, the results are corrected: > {code} > [null,113000.0] > [Java,2.0] > [Java,3.0] > [Java,5.0] > [dotNET,5000.0] > [dotNET,1.0] > [dotNET,48000.0] > [dotNET,63000.0] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13221) GroupingSets Returns an Incorrect Results
[ https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13221: Assignee: (was: Apache Spark) > GroupingSets Returns an Incorrect Results > - > > Key: SPARK-13221 > URL: https://issues.apache.org/jira/browse/SPARK-13221 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xiao Li >Priority: Critical > > The following query returns a wrong result: > {code} > sql("select course, sum(earnings) as sum from courseSales group by course, > earnings" + > " grouping sets((), (course), (course, earnings))" + > " order by course, sum").show() > {code} > Before the fix, the results are like > {code} > [null,null] > [Java,null] > [Java,2.0] > [Java,3.0] > [dotNET,null] > [dotNET,5000.0] > [dotNET,1.0] > [dotNET,48000.0] > {code} > After the fix, the results are corrected: > {code} > [null,113000.0] > [Java,2.0] > [Java,3.0] > [Java,5.0] > [dotNET,5000.0] > [dotNET,1.0] > [dotNET,48000.0] > [dotNET,63000.0] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13221) GroupingSets Returns an Incorrect Results
[ https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135446#comment-15135446 ] Apache Spark commented on SPARK-13221: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/11100 > GroupingSets Returns an Incorrect Results > - > > Key: SPARK-13221 > URL: https://issues.apache.org/jira/browse/SPARK-13221 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xiao Li >Priority: Critical > > The following query returns a wrong result: > {code} > sql("select course, sum(earnings) as sum from courseSales group by course, > earnings" + > " grouping sets((), (course), (course, earnings))" + > " order by course, sum").show() > {code} > Before the fix, the results are like > {code} > [null,null] > [Java,null] > [Java,2.0] > [Java,3.0] > [dotNET,null] > [dotNET,5000.0] > [dotNET,1.0] > [dotNET,48000.0] > {code} > After the fix, the results are corrected: > {code} > [null,113000.0] > [Java,2.0] > [Java,3.0] > [Java,5.0] > [dotNET,5000.0] > [dotNET,1.0] > [dotNET,48000.0] > [dotNET,63000.0] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13221) GroupingSets Returns an Incorrect Results
Xiao Li created SPARK-13221: --- Summary: GroupingSets Returns an Incorrect Results Key: SPARK-13221 URL: https://issues.apache.org/jira/browse/SPARK-13221 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0, 2.0.0 Reporter: Xiao Li Priority: Critical The following query returns a wrong result: {code} sql("select course, sum(earnings) as sum from courseSales group by course, earnings" + " grouping sets((), (course), (course, earnings))" + " order by course, sum").show() {code} Before the fix, the results are like {code} [null,null] [Java,null] [Java,2.0] [Java,3.0] [dotNET,null] [dotNET,5000.0] [dotNET,1.0] [dotNET,48000.0] {code} After the fix, the results are corrected: {code} [null,113000.0] [Java,2.0] [Java,3.0] [Java,5.0] [dotNET,5000.0] [dotNET,1.0] [dotNET,48000.0] [dotNET,63000.0] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135386#comment-15135386 ] Xiao Li commented on SPARK-12720: - The solution is done, but hitting a bug in grouping set. Thus, it is blocked. Will open a JIRA and PR for resolving the bug at first. {code} sql("SELECT `key`, `value`, count( value ) AS `_c2` FROM `default`.`t1` " + "GROUP BY `key`, `value` GROUPING SETS ((), (`key`), (`key`, `value`))").show() {code} {code} sql("SELECT `key`, `value`, count( value ) AS `_c2` FROM `default`.`t1` " + "GROUP BY `key`, `value` with rollup").show() {code} These two queries return different results, however, they should be identical. > SQL generation support for cube, rollup, and grouping set > - > > Key: SPARK-12720 > URL: https://issues.apache.org/jira/browse/SPARK-12720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135357#comment-15135357 ] Jeff Zhang commented on SPARK-11725: +1 for Option[T] > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jeff Zhang >Assignee: Wenchen Fan >Priority: Blocker > Labels: releasenotes > Fix For: 1.6.0 > > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. > {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > ++---++ > | age| name|age2| > ++---++ > |null|Michael| 0| > | 30| Andy| 31| > | 19| Justin| 20| > ++---++ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if the udf input has null value argument > * Let udf itself to handle null > I would prefer the third option -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists
[ https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13068: Assignee: (was: Apache Spark) > Extend pyspark ml paramtype conversion to support lists > --- > > Key: SPARK-13068 > URL: https://issues.apache.org/jira/browse/SPARK-13068 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > In SPARK-7675 we added type conversion for PySpark ML params. We should > follow up and support param type conversion for lists and nested structures > as required. This blocks having all PySpark ML params having type information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists
[ https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13068: Assignee: Apache Spark > Extend pyspark ml paramtype conversion to support lists > --- > > Key: SPARK-13068 > URL: https://issues.apache.org/jira/browse/SPARK-13068 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > > In SPARK-7675 we added type conversion for PySpark ML params. We should > follow up and support param type conversion for lists and nested structures > as required. This blocks having all PySpark ML params having type information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists
[ https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135330#comment-15135330 ] Apache Spark commented on SPARK-13068: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/11099 > Extend pyspark ml paramtype conversion to support lists > --- > > Key: SPARK-13068 > URL: https://issues.apache.org/jira/browse/SPARK-13068 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > In SPARK-7675 we added type conversion for PySpark ML params. We should > follow up and support param type conversion for lists and nested structures > as required. This blocks having all PySpark ML params having type information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13220) Deprecate "yarn-client" and "yarn-cluster"
Andrew Or created SPARK-13220: - Summary: Deprecate "yarn-client" and "yarn-cluster" Key: SPARK-13220 URL: https://issues.apache.org/jira/browse/SPARK-13220 Project: Spark Issue Type: Sub-task Components: YARN Reporter: Andrew Or We currently allow `--master yarn-client`. Instead, the user should do `--master yarn --deploy-mode client` to be more explicit. This is more consistent with other cluster managers and obviates the need to do special parsing of the master string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135264#comment-15135264 ] Michael Armbrust commented on SPARK-11725: -- [~onetoinfin...@yahoo.com], unfortunately I don't think that is true. {{Long}} and other primitive types inherit from {{AnyVal}}, while only things that inherit from {{AnyRef}} can be {{null}}. If you attempt a comparison between a primitive and null, the compiler will tell you this.
{code}
scala> 1 == null
:8: warning: comparing values of types Int and Null using `==' will always yield false
1 == null
{code}
{quote} Throwing an error when a null value cannot be passed to a UDF that has been compiled to only accept nulls. {quote} While this might be reasonable in a greenfield situation, I don't think we can change semantics on our users like that. We chose this semantic because it's pretty common for databases to use null for error conditions, rather than failing the whole query. {quote} Using {{Option\[T\]}} as a UDF arg to signal that the function accepts nulls. {quote} I like this idea and I actually expected it to work. As you can see, it already works in datasets:
{code}
Seq((1, new Integer(1)), (2, null)).toDF().as[(Int, Option[Int])].collect()
res0: Array[(Int, Option[Int])] = Array((1,Some(1)), (2,None))
{code}
We should definitely be using the same logic when converting arguments for UDFs. > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jeff Zhang >Assignee: Wenchen Fan >Priority: Blocker > Labels: releasenotes > Fix For: 1.6.0 > > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. > {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > ++---++ > | age| name|age2| > ++---++ > |null|Michael| 0| > | 30| Andy| 31| > | 19| Justin| 20| > ++---++ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if the udf input has null value argument > * Let udf itself to handle null > I would prefer the third option -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
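To make the Option idea concrete, a hedged sketch in plain Scala (this is not an existing Spark UDF API, just an illustration of the proposed semantics): declaring the argument as Option[Int] lets the function body decide what a null input means, mirroring how the Dataset example above decodes a null column to None.
{code}
// Sketch of the proposed semantics, not current Spark API.
val addOneProposed: Option[Int] => Option[Int] = {
  case Some(age) => Some(age + 1)   // normal case
  case None      => None            // null input: the UDF author decides the result
}

addOneProposed(Some(30))   // Some(31)
addOneProposed(None)       // None, rather than 0 or a silently skipped call
{code}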
[jira] [Commented] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!
[ https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135254#comment-15135254 ] Jakob Odersky commented on SPARK-13176: --- The PR I submitted uses java.nio.Files, it does not fix the underlying problem of ignoring specific deprecation warnings. > Ignore deprecation warning for ProcessBuilder lines_! > - > > Key: SPARK-13176 > URL: https://issues.apache.org/jira/browse/SPARK-13176 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > The replacement, stream_! & lineStream_! is not present in 2.10 API. > Note @SupressWarnings for deprecation doesn't appear to work > https://issues.scala-lang.org/browse/SI-7934 so suppressing the warnings > might involve wrapping or similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!
[ https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13176: Assignee: Apache Spark > Ignore deprecation warning for ProcessBuilder lines_! > - > > Key: SPARK-13176 > URL: https://issues.apache.org/jira/browse/SPARK-13176 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > > The replacement, stream_! & lineStream_! is not present in 2.10 API. > Note @SupressWarnings for deprecation doesn't appear to work > https://issues.scala-lang.org/browse/SI-7934 so suppressing the warnings > might involve wrapping or similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!
[ https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135251#comment-15135251 ] Apache Spark commented on SPARK-13176: -- User 'jodersky' has created a pull request for this issue: https://github.com/apache/spark/pull/11098 > Ignore deprecation warning for ProcessBuilder lines_! > - > > Key: SPARK-13176 > URL: https://issues.apache.org/jira/browse/SPARK-13176 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > The replacement, stream_! & lineStream_! is not present in 2.10 API. > Note @SupressWarnings for deprecation doesn't appear to work > https://issues.scala-lang.org/browse/SI-7934 so suppressing the warnings > might involve wrapping or similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!
[ https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13176: Assignee: (was: Apache Spark) > Ignore deprecation warning for ProcessBuilder lines_! > - > > Key: SPARK-13176 > URL: https://issues.apache.org/jira/browse/SPARK-13176 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > The replacement, stream_! & lineStream_! is not present in 2.10 API. > Note @SupressWarnings for deprecation doesn't appear to work > https://issues.scala-lang.org/browse/SI-7934 so suppressing the warnings > might involve wrapping or similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13216) Spark streaming application not honoring --num-executors in restarting of an application from a checkpoint
[ https://issues.apache.org/jira/browse/SPARK-13216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135233#comment-15135233 ] Hari Shreedharan commented on SPARK-13216: -- I think this brings up a broader question: the set of parameters that we currently load from the new configuration is very limited. Perhaps we should go back and decide which parameters to load from the checkpoint and which not to. Eventually we would want to "tag" parameters as loadable from the checkpoint or not. Either way, we might want to override any parameters from the checkpoint (other than the ones we really cannot and should not) with the parameters the user supplies in the current run of the app. > Spark streaming application not honoring --num-executors in restarting of an > application from a checkpoint > -- > > Key: SPARK-13216 > URL: https://issues.apache.org/jira/browse/SPARK-13216 > Project: Spark > Issue Type: Bug > Components: Spark Submit, Streaming >Affects Versions: 1.5.0 >Reporter: Neelesh Srinivas Salian > Labels: Streaming > > Scenario to help understand: > 1) The Spark streaming job with 12 executors was initiated with checkpointing > enabled. > 2) In version 1.3, the user was able to append the number of executors to 20 > using --num-executors but was unable to do so in version 1.5. > In 1.5, the spark application still runs with 13 executors (1 for driver and > 12 executors). > There is a need to start from the checkpoint itself and not restart the > application to avoid the loss of information. > 3) Checked the code in 1.3 and 1.5, which shows the command > ''--num-executors" has been deprecated. > Any thoughts on this? Not sure if anyone hit this one specifically before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!
[ https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135222#comment-15135222 ] Jakob Odersky commented on SPARK-13176: --- One of the places the process api is used is in creating symlinks. Since Spark requires at least Java 1.7, we can drop the use of external commands and rely on the nio.Files api instead. > Ignore deprecation warning for ProcessBuilder lines_! > - > > Key: SPARK-13176 > URL: https://issues.apache.org/jira/browse/SPARK-13176 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > The replacement, stream_! & lineStream_! is not present in 2.10 API. > Note @SupressWarnings for deprecation doesn't appear to work > https://issues.scala-lang.org/browse/SI-7934 so suppressing the warnings > might involve wrapping or similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
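A minimal sketch of that replacement (paths are placeholders): java.nio.file.Files, available since Java 7, creates the symlink directly, so no external `ln -s` process and no deprecated ProcessBuilder call is needed.
{code}
import java.nio.file.{Files, Paths}

// Equivalent to `ln -s /tmp/spark-target /tmp/spark-link`; throws if the link
// already exists or the filesystem does not support symbolic links.
val link   = Paths.get("/tmp/spark-link")
val target = Paths.get("/tmp/spark-target")
Files.createSymbolicLink(link, target)
{code}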
[jira] [Updated] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhinav Chawade updated SPARK-13219: Description: When 2 or more tables are joined in SparkSQL and there is an equality clause in query on attributes used to perform the join, it is useful to apply that clause on scans for both table. If this is not done, one of the tables results in full scan which can reduce the query dramatically. Consider following example with 2 tables being joined. {code} CREATE TABLE assets ( assetid int PRIMARY KEY, address text, propertyname text ) CREATE TABLE tenants ( assetid int PRIMARY KEY, name text ) spark-sql> explain select t.name from tenants t, assets a where a.assetid = t.assetid and t.assetid='1201'; WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable == Physical Plan == Project [name#14] ShuffledHashJoin [assetid#13], [assetid#15], BuildRight Exchange (HashPartitioning 200) Filter (CAST(assetid#13, DoubleType) = 1201.0) HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None Exchange (HashPartitioning 200) HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None Time taken: 1.354 seconds, Fetched 8 row(s) {code} The simple workaround is to add another equality condition for each table but it becomes cumbersome. It will be helpful if the query planner could improve filter propagation. was: When 2 or more tables are joined in SparkSQL and there is an equality clause in query on attributes used to perform the join, it is useful to apply that clause on scans for both table. If this is not done, one of the tables results in full scan which can reduce the query dramatically. Consider following example with 2 tables being joined. {code} CREATE TABLE assets ( assetid int PRIMARY KEY, address text, propertyname text ) CREATE TABLE element22082.tenants ( assetid int PRIMARY KEY, name text ) spark-sql> explain select t.name from tenants t, assets a where a.assetid = t.assetid and t.assetid='1201'; WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable == Physical Plan == Project [name#14] ShuffledHashJoin [assetid#13], [assetid#15], BuildRight Exchange (HashPartitioning 200) Filter (CAST(assetid#13, DoubleType) = 1201.0) HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None Exchange (HashPartitioning 200) HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None Time taken: 1.354 seconds, Fetched 8 row(s) {code} The simple workaround is to add another equality condition for each table but it becomes cumbersome. It will be helpful if the query planner could improve filter propagation. > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. 
If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, > tenants, Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, > Some(a)), None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It wil
[jira] [Updated] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhinav Chawade updated SPARK-13219: Description: When 2 or more tables are joined in SparkSQL and there is an equality clause in query on attributes used to perform the join, it is useful to apply that clause on scans for both table. If this is not done, one of the tables results in full scan which can reduce the query dramatically. Consider following example with 2 tables being joined. {code} CREATE TABLE assets ( assetid int PRIMARY KEY, address text, propertyname text ) CREATE TABLE tenants ( assetid int PRIMARY KEY, name text ) spark-sql> explain select t.name from tenants t, assets a where a.assetid = t.assetid and t.assetid='1201'; WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable == Physical Plan == Project [name#14] ShuffledHashJoin [assetid#13], [assetid#15], BuildRight Exchange (HashPartitioning 200) Filter (CAST(assetid#13, DoubleType) = 1201.0) HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, Some(t)), None Exchange (HashPartitioning 200) HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), None Time taken: 1.354 seconds, Fetched 8 row(s) {code} The simple workaround is to add another equality condition for each table but it becomes cumbersome. It will be helpful if the query planner could improve filter propagation. was: When 2 or more tables are joined in SparkSQL and there is an equality clause in query on attributes used to perform the join, it is useful to apply that clause on scans for both table. If this is not done, one of the tables results in full scan which can reduce the query dramatically. Consider following example with 2 tables being joined. {code} CREATE TABLE assets ( assetid int PRIMARY KEY, address text, propertyname text ) CREATE TABLE tenants ( assetid int PRIMARY KEY, name text ) spark-sql> explain select t.name from tenants t, assets a where a.assetid = t.assetid and t.assetid='1201'; WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable == Physical Plan == Project [name#14] ShuffledHashJoin [assetid#13], [assetid#15], BuildRight Exchange (HashPartitioning 200) Filter (CAST(assetid#13, DoubleType) = 1201.0) HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None Exchange (HashPartitioning 200) HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None Time taken: 1.354 seconds, Fetched 8 row(s) {code} The simple workaround is to add another equality condition for each table but it becomes cumbersome. It will be helpful if the query planner could improve filter propagation. > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. 
If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner
[jira] [Resolved] (SPARK-13215) Remove fallback in codegen
[ https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13215. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11097 [https://github.com/apache/spark/pull/11097] > Remove fallback in codegen > -- > > Key: SPARK-13215 > URL: https://issues.apache.org/jira/browse/SPARK-13215 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > Fix For: 2.0.0 > > > In newMutableProjection, we fall back to InterpretedMutableProjection if > compilation fails. > Since we removed the configuration for codegen, we rely heavily on codegen > (and TungstenAggregate requires the generated MutableProjection to update > UnsafeRow), so we should remove the fallback, which could confuse users; see > the discussion in SPARK-13116. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhinav Chawade updated SPARK-13219: Description: When 2 or more tables are joined in SparkSQL and there is an equality clause in query on attributes used to perform the join, it is useful to apply that clause on scans for both table. If this is not done, one of the tables results in full scan which can reduce the query dramatically. Consider following example with 2 tables being joined. {code} CREATE TABLE assets ( assetid int PRIMARY KEY, address text, propertyname text ) CREATE TABLE element22082.tenants ( assetid int PRIMARY KEY, name text ) spark-sql> explain select t.name from tenants t, assets a where a.assetid = t.assetid and t.assetid='1201'; WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable == Physical Plan == Project [name#14] ShuffledHashJoin [assetid#13], [assetid#15], BuildRight Exchange (HashPartitioning 200) Filter (CAST(assetid#13, DoubleType) = 1201.0) HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None Exchange (HashPartitioning 200) HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None Time taken: 1.354 seconds, Fetched 8 row(s) {code} The simple workaround is to add another equality condition for each table but it becomes cumbersome. It will be helpful if the query planner could improve filter propagation. was: When 2 or more tables are joined in SparkSQL and there is an equality clause in query on attributes used to perform the join, it is useful to apply that clause on scans for both table. If this is not done, one of the tables results in full scan which can reduce the query dramatically. Consider following example with 2 tables being joined. {code} CREATE TABLE assets ( assetid int PRIMARY KEY, address text, propertyname text ) CREATE TABLE element22082.tenants ( assetid int PRIMARY KEY, name text ) spark-sql> explain select t.name from tenants t, assets a where a.assetid = t.assetid and t.assetid='1201' > ; WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable == Physical Plan == Project [name#14] ShuffledHashJoin [assetid#13], [assetid#15], BuildRight Exchange (HashPartitioning 200) Filter (CAST(assetid#13, DoubleType) = 1201.0) HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None Exchange (HashPartitioning 200) HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None Time taken: 1.354 seconds, Fetched 8 row(s) {code} The simple workaround is to add another equality condition for each table but it becomes cumbersome. It will be helpful if the query planner could improve filter propagation. > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. 
If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE element22082.tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, > tenants, Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, > Some(a)), None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each tabl
[jira] [Created] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
Abhinav Chawade created SPARK-13219: --- Summary: Pushdown predicate propagation in SparkSQL with join Key: SPARK-13219 URL: https://issues.apache.org/jira/browse/SPARK-13219 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.0, 1.4.1 Environment: Spark 1.4 Datastax Spark connector 1.4 Cassandra. 2.1.12 Centos 6.6 Reporter: Abhinav Chawade When two or more tables are joined in SparkSQL and the query has an equality clause on the attributes used to perform the join, it is useful to apply that clause to the scans of both tables. If this is not done, one of the tables is scanned in full, which can slow the query down dramatically. Consider the following example with two tables being joined.
{code}
CREATE TABLE assets (
  assetid int PRIMARY KEY,
  address text,
  propertyname text
)

CREATE TABLE element22082.tenants (
  assetid int PRIMARY KEY,
  name text
)

spark-sql> explain select t.name from tenants t, assets a where a.assetid = t.assetid and t.assetid='1201';
WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
== Physical Plan ==
Project [name#14]
 ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
  Exchange (HashPartitioning 200)
   Filter (CAST(assetid#13, DoubleType) = 1201.0)
    HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, tenants, Some(t)), None
  Exchange (HashPartitioning 200)
   HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, Some(a)), None
Time taken: 1.354 seconds, Fetched 8 row(s)
{code}
The simple workaround is to add another equality condition for each table, but it becomes cumbersome. It would be helpful if the query planner could improve filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
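The manual workaround mentioned in the description looks roughly like this (a sketch against the schema above, run from a SQLContext/HiveContext such as the `sqlContext` provided by spark-shell): the constant predicate is duplicated onto both join keys so that each table scan receives a pushed-down filter.
{code}
// Sketch of the workaround: repeat the constant predicate on both join keys
// by hand until the planner can infer it through the equi-join condition.
sqlContext.sql(
  """select t.name
    |from tenants t, assets a
    |where a.assetid = t.assetid
    |  and t.assetid = '1201'
    |  and a.assetid = '1201'
  """.stripMargin).explain()
{code}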
[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135206#comment-15135206 ] Xiao Li commented on SPARK-12720: - Ok, I think we should not use subquery to do it. > SQL generation support for cube, rollup, and grouping set > - > > Key: SPARK-12720 > URL: https://issues.apache.org/jira/browse/SPARK-12720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12720: Comment: was deleted (was: I guess the best way is to add another `Project` above `Aggregate`) > SQL generation support for cube, rollup, and grouping set > - > > Key: SPARK-12720 > URL: https://issues.apache.org/jira/browse/SPARK-12720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135187#comment-15135187 ] Xiao Li commented on SPARK-12720: - I guess the best way is to add another `Project` above `Aggregate` > SQL generation support for cube, rollup, and grouping set > - > > Key: SPARK-12720 > URL: https://issues.apache.org/jira/browse/SPARK-12720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135177#comment-15135177 ] Xiao Li commented on SPARK-12720: -
{code}
SELECT key, value, count(value) FROM t1 GROUP BY key, value with rollup
{code}
is converted to
{code}
SELECT `gen_subquery_0`.`key`, `gen_subquery_0`.`value`, count( `gen_subquery_0`.`value`) AS `_c2`
FROM (SELECT `t1`.`key`, `t1`.`value`, `t1`.`key` AS `key`, `t1`.`value` AS `value` FROM `default`.`t1`) AS gen_subquery_0
GROUP BY `gen_subquery_0`.`key`, `gen_subquery_0`.`value`
GROUPING SETS ((), (`gen_subquery_0`.`key`), (`gen_subquery_0`.`key`, `gen_subquery_0`.`value`))
{code}
However, the subquery now has two `key` attributes: {{`t1`.`key` AS `key`}} and {{`t1`.`key`}}. It looks weird, but they are generated by the rule `ResolveGroupingAnalytics`, so I have to rewrite that rule. I tried renaming the alias to `key#23`, but the alias is used in the group-by and aggregate functions, which introduces more issues. I will instead switch them to use the original variables rather than the alias names. > SQL generation support for cube, rollup, and grouping set > - > > Key: SPARK-12720 > URL: https://issues.apache.org/jira/browse/SPARK-12720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13002) Mesos scheduler backend does not follow the property spark.dynamicAllocation.initialExecutors
[ https://issues.apache.org/jira/browse/SPARK-13002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-13002. --- Resolution: Fixed Assignee: Luc Bourlier Fix Version/s: 2.0.0 > Mesos scheduler backend does not follow the property > spark.dynamicAllocation.initialExecutors > - > > Key: SPARK-13002 > URL: https://issues.apache.org/jira/browse/SPARK-13002 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.5.2, 1.6.0 >Reporter: Luc Bourlier >Assignee: Luc Bourlier > Labels: dynamic_allocation, mesos > Fix For: 2.0.0 > > > When starting a Spark job on a Mesos cluster, all available cores are > reserved (up to {{spark.cores.max}}), creating one executor per Mesos node, > and as many executors as needed. > This is the case even when dynamic allocation is enabled. > When dynamic allocation is enabled, the number of executors launched at > startup should be limited to the value of > {{spark.dynamicAllocation.initialExecutors}}. > The Mesos scheduler backend already follows the value computed by the > {{ExecutorAllocationManager}} for the number of executors that should be up > and running, except at startup, when it just creates all the executors it can. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
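For reference, a minimal sketch of the configuration this report is about: dynamic allocation on the coarse-grained Mesos backend with an explicit initial executor count. The master URL, application name, and numeric values below are placeholders; the property names are the standard ones.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: with dynamic allocation enabled, the scheduler backend is
// expected to start spark.dynamicAllocation.initialExecutors executors
// instead of immediately reserving every core up to spark.cores.max.
val conf = new SparkConf()
  .setMaster("mesos://zk://host:2181/mesos")      // placeholder master URL
  .setAppName("dynamic-allocation-check")         // placeholder app name
  .set("spark.mesos.coarse", "true")
  .set("spark.shuffle.service.enabled", "true")   // external shuffle service is required
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "2")
  .set("spark.cores.max", "40")
val sc = new SparkContext(conf)
{code}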
[jira] [Updated] (SPARK-13214) Fix dynamic allocation docs
[ https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13214: -- Assignee: Bill Chambers > Fix dynamic allocation docs > --- > > Key: SPARK-13214 > URL: https://issues.apache.org/jira/browse/SPARK-13214 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Bill Chambers >Assignee: Bill Chambers >Priority: Trivial > Fix For: 1.6.1, 2.0.0 > > > Update docs to reflect dynamicAllocation to be available for all cluster > managers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13214) Fix dynamic allocation docs
[ https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-13214. --- Resolution: Fixed Fix Version/s: 2.0.0 1.6.1 > Fix dynamic allocation docs > --- > > Key: SPARK-13214 > URL: https://issues.apache.org/jira/browse/SPARK-13214 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Bill Chambers >Assignee: Bill Chambers >Priority: Trivial > Fix For: 1.6.1, 2.0.0 > > > Update docs to reflect dynamicAllocation to be available for all cluster > managers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12939) migrate encoder resolution to Analyzer
[ https://issues.apache.org/jira/browse/SPARK-12939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12939: - Assignee: Wenchen Fan > migrate encoder resolution to Analyzer > -- > > Key: SPARK-12939 > URL: https://issues.apache.org/jira/browse/SPARK-12939 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > in https://github.com/apache/spark/pull/10747, we decided to use expressions > instead of encoders in query plans, to make the dataset easier to optimize > and make encoder resolution less hacky. > However, that PR didn't finish the migration completely and cause some > regressions which are hidden due to the lack of end-to-end test. > We should fix it by completing the migration and add end-to-end test for it -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12939) migrate encoder resolution to Analyzer
[ https://issues.apache.org/jira/browse/SPARK-12939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12939. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10852 [https://github.com/apache/spark/pull/10852] > migrate encoder resolution to Analyzer > -- > > Key: SPARK-12939 > URL: https://issues.apache.org/jira/browse/SPARK-12939 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.0.0 > > > in https://github.com/apache/spark/pull/10747, we decided to use expressions > instead of encoders in query plans, to make the dataset easier to optimize > and make encoder resolution less hacky. > However, that PR didn't finish the migration completely and cause some > regressions which are hidden due to the lack of end-to-end test. > We should fix it by completing the migration and add end-to-end test for it -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13202) Jars specified with --jars do not exist on the worker classpath.
[ https://issues.apache.org/jira/browse/SPARK-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135163#comment-15135163 ] Sean Owen commented on SPARK-13202: --- Actually, try spark-submit. This is a class loader visibility problem. Spark has to load app classes from its class loader. I remember this situation is more complex in the shell. You may also try the userClassPathFirst option to see if that matters. I think the worst case is that indeed you're doing what's needed. But these might shed some light or confirm > Jars specified with --jars do not exist on the worker classpath. > > > Key: SPARK-13202 > URL: https://issues.apache.org/jira/browse/SPARK-13202 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.2, 1.6.0 >Reporter: Michael Schmitz > > I have a Spark Scala 2.11 application. To deploy it to the cluster, I create > a jar of the dependencies and a jar of the project (although this problem > still manifests if I create a single jar with everything). I will focus on > problems specific to Spark Shell, but I'm pretty sure they also apply to > Spark Submit. > I can get Spark Shell to work with my application, however I need to set > spark.executor.extraClassPath. From reading the documentation > (http://spark.apache.org/docs/latest/configuration.html#runtime-environment) > it sounds like I shouldn't need to set this option ("Users typically should > not need to set this option.") After reading about --jars, I understand that > this should set the classpath for the workers to use the jars that are synced > to those machines. > When I don't set spark.executor.extraClassPath, I get a kryo registrator > exception with the root cause being that a class is not found. > java.io.IOException: org.apache.spark.SparkException: Failed to register > classes with Kryo > java.lang.ClassNotFoundException: org.allenai.common.Enum > If I SSH into the workers, I can see that we did create directories that > contain the jars specified by --jars. > /opt/data/spark/worker/app-20160204212742-0002/0 > /opt/data/spark/worker/app-20160204212742-0002/1 > Now, if I re-run spark-shell but with `--conf > spark.executor.extraClassPath=/opt/data/spark/worker/app-20160204212742-0002/0/myjar.jar`, > my job will succeed. In other words, if I put my jars at a location that is > available to all the workers and specify that as an extra executor class > path, the job succeeds. > Unfortunately, this means that the jars are being copied to the workers for > no reason. How can I get --jars to add the jars it copies to the workers to > the classpath? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
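As a concrete sketch of the second suggestion in the comment above (the configuration keys are real; the jar path and app name are placeholders), the user-classpath-first behaviour can also be requested programmatically rather than on the spark-shell command line. Whether it actually resolves the Kryo registration failure is exactly what the comment asks to verify:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: spark.jars is the programmatic equivalent of --jars, and
// spark.executor.userClassPathFirst asks executors to prefer user-added
// jars over Spark's own classpath when loading classes.
val conf = new SparkConf()
  .setAppName("kryo-classpath-check")                 // placeholder
  .set("spark.jars", "/path/to/myjar.jar")            // placeholder path
  .set("spark.executor.userClassPathFirst", "true")
val sc = new SparkContext(conf)
{code}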
[jira] [Commented] (SPARK-13218) Executor failed after SparkContext start and start again
[ https://issues.apache.org/jira/browse/SPARK-13218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135153#comment-15135153 ] leo wu commented on SPARK-13218: One more point : If spark-default.conf and spark-env.sh are configured over a remote Spark standalone cluster manually, and then launch iPython notebook or pyspark, everything works fine. So I strongly suspect it's a problem with sparkContext/sparkconf initialization after stop and start again. > Executor failed after SparkContext start and start again > -- > > Key: SPARK-13218 > URL: https://issues.apache.org/jira/browse/SPARK-13218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 > Environment: Run IPython/Jupyter along with Spark on ubuntu 14.04 >Reporter: leo wu > > In a python notebook, I am trying to stop SparkContext which is initialized > with local master and then start again with conf over a remote Spark > standalone cluster, like : > import sys > from random import random > import atexit > import os > import platform > import py4j > import pyspark > from pyspark import SparkContext, SparkConf > from pyspark.sql import SQLContext, HiveContext > from pyspark.storagelevel import StorageLevel > os.environ["SPARK_HOME"] = "/home/notebook/spark-1.6.0-bin-hadoop2.6" > os.environ["PYSPARK_SUBMIT_ARGS"] = "--master spark://10.115.89.219:7077" > os.environ["SPARK_LOCAL_HOSTNAME"] = "wzymaster2011" > SparkContext.setSystemProperty("spark.master", "spark://10.115.89.219:7077") > SparkContext.setSystemProperty("spark.cores.max", "4") > SparkContext.setSystemProperty("spark.driver.host", "wzymaster2011") > SparkContext.setSystemProperty("spark.driver.port", "9000") > SparkContext.setSystemProperty("spark.blockManager.port", "9001") > SparkContext.setSystemProperty("spark.fileserver.port", "9002") > conf = SparkConf().setAppName("Python-Test") > sc = SparkContext(conf=conf) > However, I always get error in Executor like : > 16/02/05 14:37:32 DEBUG BlockManager: Getting remote block broadcast_0_piece0 > from BlockManagerId(driver, localhost, 9002) > 16/02/05 14:37:32 DEBUG TransportClientFactory: Creating new connection to > localhost/127.0.0.1:9002 > 16/02/05 14:37:32 ERROR RetryingBlockFetcher: Exception while beginning fetch > of 1 outstanding blocks > java.io.IOException: Failed to connect to localhost/127.0.0.1:9002 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > I suspect that new SparkConf isn't properly passed to executor through Spark > Master for some reason. > Please advise it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13218) Executor failed after SparkContext start and start again
leo wu created SPARK-13218: -- Summary: Executor failed after SparkContext start and start again Key: SPARK-13218 URL: https://issues.apache.org/jira/browse/SPARK-13218 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.0 Environment: Run IPython/Jupyter along with Spark on ubuntu 14.04 Reporter: leo wu In a python notebook, I am trying to stop SparkContext which is initialized with local master and then start again with conf over a remote Spark standalone cluster, like : import sys from random import random import atexit import os import platform import py4j import pyspark from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext, HiveContext from pyspark.storagelevel import StorageLevel os.environ["SPARK_HOME"] = "/home/notebook/spark-1.6.0-bin-hadoop2.6" os.environ["PYSPARK_SUBMIT_ARGS"] = "--master spark://10.115.89.219:7077" os.environ["SPARK_LOCAL_HOSTNAME"] = "wzymaster2011" SparkContext.setSystemProperty("spark.master", "spark://10.115.89.219:7077") SparkContext.setSystemProperty("spark.cores.max", "4") SparkContext.setSystemProperty("spark.driver.host", "wzymaster2011") SparkContext.setSystemProperty("spark.driver.port", "9000") SparkContext.setSystemProperty("spark.blockManager.port", "9001") SparkContext.setSystemProperty("spark.fileserver.port", "9002") conf = SparkConf().setAppName("Python-Test") sc = SparkContext(conf=conf) However, I always get error in Executor like : 16/02/05 14:37:32 DEBUG BlockManager: Getting remote block broadcast_0_piece0 from BlockManagerId(driver, localhost, 9002) 16/02/05 14:37:32 DEBUG TransportClientFactory: Creating new connection to localhost/127.0.0.1:9002 16/02/05 14:37:32 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks java.io.IOException: Failed to connect to localhost/127.0.0.1:9002 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) I suspect that new SparkConf isn't properly passed to executor through Spark Master for some reason. Please advise it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-13217) UI of visualization of stage
[ https://issues.apache.org/jira/browse/SPARK-13217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-13217. -- Resolution: Not A Problem It works after restart the driver. > UI of visualization of stage > - > > Key: SPARK-13217 > URL: https://issues.apache.org/jira/browse/SPARK-13217 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Davies Liu > > http://localhost:4041/stages/stage/?id=36&attempt=0&expandDagViz=true > {code} > HTTP ERROR 500 > Problem accessing /stages/stage/. Reason: > Server Error > Caused by: > java.lang.IncompatibleClassChangeError: vtable stub > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:102) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79) > at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:79) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) > at > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) > at > org.eclipse.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) > at org.eclipse.jetty.server.Server.handle(Server.java:370) > at > org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) > at > org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) > at > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) > at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) > at > org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) > at > org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) > at > org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > at java.lang.Thread.run(Thread.java:745) > Powered by Jetty:// > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13217) UI of visualization of stage
Davies Liu created SPARK-13217: -- Summary: UI of visualization of stage Key: SPARK-13217 URL: https://issues.apache.org/jira/browse/SPARK-13217 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.0 Reporter: Davies Liu http://localhost:4041/stages/stage/?id=36&attempt=0&expandDagViz=true {code} HTTP ERROR 500 Problem accessing /stages/stage/. Reason: Server Error Caused by: java.lang.IncompatibleClassChangeError: vtable stub at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:102) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:79) at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745) Powered by Jetty:// {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135036#comment-15135036 ] leo wu commented on SPARK-13198: Hi, Sean I am trying to do this similar in an IPython/Jupyter notebook by stopping a sparkContext and then start a new one with new sparkconf over a remote Spark standalone cluster, instead of local master which is originally initialized, like : import sys from random import random from operator import add import atexit import os import platform import py4j import pyspark from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext, HiveContext from pyspark.storagelevel import StorageLevel os.environ["SPARK_HOME"] = "/home/notebook/spark-1.6.0-bin-hadoop2.6" os.environ["PYSPARK_SUBMIT_ARGS"] = "--master spark://10.115.89.219:7077" os.environ["SPARK_LOCAL_HOSTNAME"] = "wzymaster2011" SparkContext.setSystemProperty("spark.master", "spark://10.115.89.219:7077") SparkContext.setSystemProperty("spark.cores.max", "4") SparkContext.setSystemProperty("spark.driver.host", "wzymaster2011") SparkContext.setSystemProperty("spark.driver.port", "9000") SparkContext.setSystemProperty("spark.blockManager.port", "9001") SparkContext.setSystemProperty("spark.fileserver.port", "9002") conf = SparkConf().setAppName("Leo-Python-Test") sc = SparkContext(conf=conf) However, I always get error on executor due to failing to load BlockManager info from driver but "localhost" , instead of setting in "spark.driver.host" like "wzymaster2011": 16/02/05 14:37:32 DEBUG BlockManager: Getting remote block broadcast_0_piece0 from BlockManagerId(driver, localhost, 9002) 16/02/05 14:37:32 DEBUG TransportClientFactory: Creating new connection to localhost/127.0.0.1:9002 16/02/05 14:37:32 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks java.io.IOException: Failed to connect to localhost/127.0.0.1:9002 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) So, I strongly suspect if there is a bug in SparkContext.stop() to clean up all data and resetting SparkConf() doesn't work well within one app. Please advise it. Millions of thanks > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. 
The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm r
[jira] [Assigned] (SPARK-13215) Remove fallback in codegen
[ https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13215: Assignee: Apache Spark > Remove fallback in codegen > -- > > Key: SPARK-13215 > URL: https://issues.apache.org/jira/browse/SPARK-13215 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Apache Spark > > in newMutableProjection, it will fallback to InterpretedMutableProjection if > failed to compile. > Since we remove the configuration for codegen, we are heavily reply on > codegen (also TungstenAggregate require the generated MutableProjection to > update UnsafeRow), should remove the fallback, which could make user > confusing, see the discussion in SPARK-13116. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13215) Remove fallback in codegen
[ https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135009#comment-15135009 ] Apache Spark commented on SPARK-13215: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11097 > Remove fallback in codegen > -- > > Key: SPARK-13215 > URL: https://issues.apache.org/jira/browse/SPARK-13215 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > > in newMutableProjection, it will fallback to InterpretedMutableProjection if > failed to compile. > Since we remove the configuration for codegen, we are heavily reply on > codegen (also TungstenAggregate require the generated MutableProjection to > update UnsafeRow), should remove the fallback, which could make user > confusing, see the discussion in SPARK-13116. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13215) Remove fallback in codegen
[ https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13215: Assignee: (was: Apache Spark) > Remove fallback in codegen > -- > > Key: SPARK-13215 > URL: https://issues.apache.org/jira/browse/SPARK-13215 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > > in newMutableProjection, it will fallback to InterpretedMutableProjection if > failed to compile. > Since we remove the configuration for codegen, we are heavily reply on > codegen (also TungstenAggregate require the generated MutableProjection to > update UnsafeRow), should remove the fallback, which could make user > confusing, see the discussion in SPARK-13116. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13216) Spark streaming application not honoring --num-executors in restarting of an application from a checkpoint
Neelesh Srinivas Salian created SPARK-13216: --- Summary: Spark streaming application not honoring --num-executors in restarting of an application from a checkpoint Key: SPARK-13216 URL: https://issues.apache.org/jira/browse/SPARK-13216 Project: Spark Issue Type: Bug Components: Spark Submit, Streaming Affects Versions: 1.5.0 Reporter: Neelesh Srinivas Salian Scenario to help understand: 1) A Spark streaming job with 12 executors was initiated with checkpointing enabled. 2) In version 1.3, the user was able to increase the number of executors to 20 using --num-executors, but is unable to do so in version 1.5. In 1.5, the Spark application still runs with the original 12 executors (13 containers including the driver). The job needs to resume from the checkpoint itself, rather than being restarted from scratch, to avoid losing information. 3) Checked the code in 1.3 and 1.5, which shows that the "--num-executors" option has been deprecated. Any thoughts on this? Not sure if anyone has hit this one specifically before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
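For context, the restart path in question looks roughly like the sketch below (Scala; the checkpoint directory, source, batch interval, and names are placeholders). When a checkpoint already exists, {{StreamingContext.getOrCreate}} rebuilds the context from the checkpointed state, which appears to be why an increased executor count on resubmission is not picked up:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder

// Only called when no checkpoint exists yet; on restart the DStream graph and
// the checkpointed SparkConf are recovered from checkpointDir instead.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-streaming-app")  // placeholder
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  val lines = ssc.socketTextStream("localhost", 9999)    // placeholder source
  lines.count().print()
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}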
[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows
[ https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134988#comment-15134988 ] Asif Hussain Shahid commented on SPARK-13116: - It will show up if you throw an exception in the below mentioned place. In SparkPlan. newMutableProjection, if the code path for some exception, takes to creating InterpretedMutableProjection, the issue will show up. Just add throw new UnsupportedOperationException("") after the GenerateMutableProjection.generate, to simulate . At this point I can only reproduce it artificially, by throwing an Exception in the code below in source code. SparkPlan: protected def newMutableProjection( expressions: Seq[Expression], inputSchema: Seq[Attribute], useSubexprElimination: Boolean = false): () => MutableProjection = { log.debug(s"Creating MutableProj: $expressions, inputSchema: $inputSchema") try { GenerateMutableProjection.generate(expressions, inputSchema, useSubexprElimination) throw new UnsupportedOperationException("TEST") } catch { case e: Exception => if (isTesting) { throw e } else { log.error("Failed to generate mutable projection, fallback to interpreted", e) () => new InterpretedMutableProjection(expressions, inputSchema) } } } > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows > --- > > Key: SPARK-13116 > URL: https://issues.apache.org/jira/browse/SPARK-13116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Asif Hussain Shahid > Attachments: SPARK_13116_Test.scala > > > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows. > If the input to TungstenAggregateIterator is a SafeRow, while the target is > an UnsafeRow , the current code will try to set the fields in the UnsafeRow > using the update method in UnSafeRow. > This method is called via TunsgtenAggregateIterator on the > InterpretedMutableProjection. The target row in the > InterpretedMutableProjection is an UnsafeRow, while the current row is a > SafeRow. > In the InterpretedMutableProjection's apply method, it invokes > mutableRow(i) = exprArray(i).eval(input) > Now for UnsafeRow, the update method throws UnsupportedOperationException. 
> The proposed fix I did for our forked branch , on the class > InterpretedProjection is: > + private var targetUnsafe = false > + type UnsafeSetter = (UnsafeRow, Any ) => Unit > + private var setters : Array[UnsafeSetter] = _ > private[this] val exprArray = expressions.toArray > private[this] var mutableRow: MutableRow = new > GenericMutableRow(exprArray.length) > def currentValue: InternalRow = mutableRow > > + > override def target(row: MutableRow): MutableProjection = { > mutableRow = row > +targetUnsafe = row match { > + case _:UnsafeRow =>{ > +if(setters == null) { > + setters = Array.ofDim[UnsafeSetter](exprArray.length) > + for(i <- 0 until exprArray.length) { > +setters(i) = exprArray(i).dataType match { > + case IntegerType => (target: UnsafeRow, value: Any ) => > +target.setInt(i,value.asInstanceOf[Int]) > + case LongType => (target: UnsafeRow, value: Any ) => > +target.setLong(i,value.asInstanceOf[Long]) > + case DoubleType => (target: UnsafeRow, value: Any ) => > +target.setDouble(i,value.asInstanceOf[Double]) > + case FloatType => (target: UnsafeRow, value: Any ) => > +target.setFloat(i,value.asInstanceOf[Float]) > + > + case NullType => (target: UnsafeRow, value: Any ) => > +target.setNullAt(i) > + > + case BooleanType => (target: UnsafeRow, value: Any ) => > +target.setBoolean(i,value.asInstanceOf[Boolean]) > + > + case ByteType => (target: UnsafeRow, value: Any ) => > +target.setByte(i,value.asInstanceOf[Byte]) > + case ShortType => (target: UnsafeRow, value: Any ) => > +target.setShort(i,value.asInstanceOf[Short]) > + > +} > + } > +} > +true > + } > + case _ => false > +} > + > this > } > > override def apply(input: InternalRow): InternalRow = { > var i = 0 > while (i < exprArray.length) { > - mutableRow(i) = exprArray(i).eval(input) > + if(targetUnsafe) { > +setters(i)(mutableRow.asInstanceOf[UnsafeRow], > exprArray(i).eval(input)) > + }else { > +mutableRow(i) = exprA
[jira] [Created] (SPARK-13215) Remove fallback in codegen
Davies Liu created SPARK-13215: -- Summary: Remove fallback in codegen Key: SPARK-13215 URL: https://issues.apache.org/jira/browse/SPARK-13215 Project: Spark Issue Type: Improvement Reporter: Davies Liu In newMutableProjection, we fall back to InterpretedMutableProjection if compilation fails. Since the configuration for codegen was removed, we rely heavily on codegen (and TungstenAggregate requires the generated MutableProjection to update UnsafeRow), so we should remove the fallback, which can be confusing to users; see the discussion in SPARK-13116. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
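For readers of the linked discussion, a rough sketch of what the proposed change amounts to, based on the version of {{SparkPlan.newMutableProjection}} quoted in the SPARK-13116 comments above (an illustration of the intent only, not the actual patch):
{code}
// Sketch only: drop the try/catch fallback so a codegen failure surfaces
// immediately, instead of silently switching to InterpretedMutableProjection
// (which cannot update an UnsafeRow target, as described in SPARK-13116).
protected def newMutableProjection(
    expressions: Seq[Expression],
    inputSchema: Seq[Attribute],
    useSubexprElimination: Boolean = false): () => MutableProjection = {
  log.debug(s"Creating MutableProj: $expressions, inputSchema: $inputSchema")
  GenerateMutableProjection.generate(expressions, inputSchema, useSubexprElimination)
}
{code}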
[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows
[ https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134979#comment-15134979 ] Davies Liu commented on SPARK-13116: I tried the test you attached, it work well on master. > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows > --- > > Key: SPARK-13116 > URL: https://issues.apache.org/jira/browse/SPARK-13116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Asif Hussain Shahid > Attachments: SPARK_13116_Test.scala > > > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows. > If the input to TungstenAggregateIterator is a SafeRow, while the target is > an UnsafeRow , the current code will try to set the fields in the UnsafeRow > using the update method in UnSafeRow. > This method is called via TunsgtenAggregateIterator on the > InterpretedMutableProjection. The target row in the > InterpretedMutableProjection is an UnsafeRow, while the current row is a > SafeRow. > In the InterpretedMutableProjection's apply method, it invokes > mutableRow(i) = exprArray(i).eval(input) > Now for UnsafeRow, the update method throws UnsupportedOperationException. > The proposed fix I did for our forked branch , on the class > InterpretedProjection is: > + private var targetUnsafe = false > + type UnsafeSetter = (UnsafeRow, Any ) => Unit > + private var setters : Array[UnsafeSetter] = _ > private[this] val exprArray = expressions.toArray > private[this] var mutableRow: MutableRow = new > GenericMutableRow(exprArray.length) > def currentValue: InternalRow = mutableRow > > + > override def target(row: MutableRow): MutableProjection = { > mutableRow = row > +targetUnsafe = row match { > + case _:UnsafeRow =>{ > +if(setters == null) { > + setters = Array.ofDim[UnsafeSetter](exprArray.length) > + for(i <- 0 until exprArray.length) { > +setters(i) = exprArray(i).dataType match { > + case IntegerType => (target: UnsafeRow, value: Any ) => > +target.setInt(i,value.asInstanceOf[Int]) > + case LongType => (target: UnsafeRow, value: Any ) => > +target.setLong(i,value.asInstanceOf[Long]) > + case DoubleType => (target: UnsafeRow, value: Any ) => > +target.setDouble(i,value.asInstanceOf[Double]) > + case FloatType => (target: UnsafeRow, value: Any ) => > +target.setFloat(i,value.asInstanceOf[Float]) > + > + case NullType => (target: UnsafeRow, value: Any ) => > +target.setNullAt(i) > + > + case BooleanType => (target: UnsafeRow, value: Any ) => > +target.setBoolean(i,value.asInstanceOf[Boolean]) > + > + case ByteType => (target: UnsafeRow, value: Any ) => > +target.setByte(i,value.asInstanceOf[Byte]) > + case ShortType => (target: UnsafeRow, value: Any ) => > +target.setShort(i,value.asInstanceOf[Short]) > + > +} > + } > +} > +true > + } > + case _ => false > +} > + > this > } > > override def apply(input: InternalRow): InternalRow = { > var i = 0 > while (i < exprArray.length) { > - mutableRow(i) = exprArray(i).eval(input) > + if(targetUnsafe) { > +setters(i)(mutableRow.asInstanceOf[UnsafeRow], > exprArray(i).eval(input)) > + }else { > +mutableRow(i) = exprArray(i).eval(input) > + } > i += 1 > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows
[ https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134969#comment-15134969 ] Asif Hussain Shahid commented on SPARK-13116: - Now the unary expression was our code , not of spark's. It was mistake in our code generation snippet, which resulted in this issue. Yes, the cleanest would be to remove the fallback. > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows > --- > > Key: SPARK-13116 > URL: https://issues.apache.org/jira/browse/SPARK-13116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Asif Hussain Shahid > Attachments: SPARK_13116_Test.scala > > > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows. > If the input to TungstenAggregateIterator is a SafeRow, while the target is > an UnsafeRow , the current code will try to set the fields in the UnsafeRow > using the update method in UnSafeRow. > This method is called via TunsgtenAggregateIterator on the > InterpretedMutableProjection. The target row in the > InterpretedMutableProjection is an UnsafeRow, while the current row is a > SafeRow. > In the InterpretedMutableProjection's apply method, it invokes > mutableRow(i) = exprArray(i).eval(input) > Now for UnsafeRow, the update method throws UnsupportedOperationException. > The proposed fix I did for our forked branch , on the class > InterpretedProjection is: > + private var targetUnsafe = false > + type UnsafeSetter = (UnsafeRow, Any ) => Unit > + private var setters : Array[UnsafeSetter] = _ > private[this] val exprArray = expressions.toArray > private[this] var mutableRow: MutableRow = new > GenericMutableRow(exprArray.length) > def currentValue: InternalRow = mutableRow > > + > override def target(row: MutableRow): MutableProjection = { > mutableRow = row > +targetUnsafe = row match { > + case _:UnsafeRow =>{ > +if(setters == null) { > + setters = Array.ofDim[UnsafeSetter](exprArray.length) > + for(i <- 0 until exprArray.length) { > +setters(i) = exprArray(i).dataType match { > + case IntegerType => (target: UnsafeRow, value: Any ) => > +target.setInt(i,value.asInstanceOf[Int]) > + case LongType => (target: UnsafeRow, value: Any ) => > +target.setLong(i,value.asInstanceOf[Long]) > + case DoubleType => (target: UnsafeRow, value: Any ) => > +target.setDouble(i,value.asInstanceOf[Double]) > + case FloatType => (target: UnsafeRow, value: Any ) => > +target.setFloat(i,value.asInstanceOf[Float]) > + > + case NullType => (target: UnsafeRow, value: Any ) => > +target.setNullAt(i) > + > + case BooleanType => (target: UnsafeRow, value: Any ) => > +target.setBoolean(i,value.asInstanceOf[Boolean]) > + > + case ByteType => (target: UnsafeRow, value: Any ) => > +target.setByte(i,value.asInstanceOf[Byte]) > + case ShortType => (target: UnsafeRow, value: Any ) => > +target.setShort(i,value.asInstanceOf[Short]) > + > +} > + } > +} > +true > + } > + case _ => false > +} > + > this > } > > override def apply(input: InternalRow): InternalRow = { > var i = 0 > while (i < exprArray.length) { > - mutableRow(i) = exprArray(i).eval(input) > + if(targetUnsafe) { > +setters(i)(mutableRow.asInstanceOf[UnsafeRow], > exprArray(i).eval(input)) > + }else { > +mutableRow(i) = exprArray(i).eval(input) > + } > i += 1 > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows
[ https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134968#comment-15134968 ] Asif Hussain Shahid commented on SPARK-13116: - No, the unary expression was our code , not of spark's. It was mistake in our code generation snippet, which resulted in this issue. Yes, the cleanest would be to remove the fallback. > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows > --- > > Key: SPARK-13116 > URL: https://issues.apache.org/jira/browse/SPARK-13116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Asif Hussain Shahid > Attachments: SPARK_13116_Test.scala > > > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows. > If the input to TungstenAggregateIterator is a SafeRow, while the target is > an UnsafeRow , the current code will try to set the fields in the UnsafeRow > using the update method in UnSafeRow. > This method is called via TunsgtenAggregateIterator on the > InterpretedMutableProjection. The target row in the > InterpretedMutableProjection is an UnsafeRow, while the current row is a > SafeRow. > In the InterpretedMutableProjection's apply method, it invokes > mutableRow(i) = exprArray(i).eval(input) > Now for UnsafeRow, the update method throws UnsupportedOperationException. > The proposed fix I did for our forked branch , on the class > InterpretedProjection is: > + private var targetUnsafe = false > + type UnsafeSetter = (UnsafeRow, Any ) => Unit > + private var setters : Array[UnsafeSetter] = _ > private[this] val exprArray = expressions.toArray > private[this] var mutableRow: MutableRow = new > GenericMutableRow(exprArray.length) > def currentValue: InternalRow = mutableRow > > + > override def target(row: MutableRow): MutableProjection = { > mutableRow = row > +targetUnsafe = row match { > + case _:UnsafeRow =>{ > +if(setters == null) { > + setters = Array.ofDim[UnsafeSetter](exprArray.length) > + for(i <- 0 until exprArray.length) { > +setters(i) = exprArray(i).dataType match { > + case IntegerType => (target: UnsafeRow, value: Any ) => > +target.setInt(i,value.asInstanceOf[Int]) > + case LongType => (target: UnsafeRow, value: Any ) => > +target.setLong(i,value.asInstanceOf[Long]) > + case DoubleType => (target: UnsafeRow, value: Any ) => > +target.setDouble(i,value.asInstanceOf[Double]) > + case FloatType => (target: UnsafeRow, value: Any ) => > +target.setFloat(i,value.asInstanceOf[Float]) > + > + case NullType => (target: UnsafeRow, value: Any ) => > +target.setNullAt(i) > + > + case BooleanType => (target: UnsafeRow, value: Any ) => > +target.setBoolean(i,value.asInstanceOf[Boolean]) > + > + case ByteType => (target: UnsafeRow, value: Any ) => > +target.setByte(i,value.asInstanceOf[Byte]) > + case ShortType => (target: UnsafeRow, value: Any ) => > +target.setShort(i,value.asInstanceOf[Short]) > + > +} > + } > +} > +true > + } > + case _ => false > +} > + > this > } > > override def apply(input: InternalRow): InternalRow = { > var i = 0 > while (i < exprArray.length) { > - mutableRow(i) = exprArray(i).eval(input) > + if(targetUnsafe) { > +setters(i)(mutableRow.asInstanceOf[UnsafeRow], > exprArray(i).eval(input)) > + }else { > +mutableRow(i) = exprArray(i).eval(input) > + } > i += 1 > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows
[ https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134959#comment-15134959 ] Davies Liu commented on SPARK-13116: Which UnaryExpression generate incorrect code? I think we may just remove the fallback. > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows > --- > > Key: SPARK-13116 > URL: https://issues.apache.org/jira/browse/SPARK-13116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Asif Hussain Shahid > Attachments: SPARK_13116_Test.scala > > > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows. > If the input to TungstenAggregateIterator is a SafeRow, while the target is > an UnsafeRow , the current code will try to set the fields in the UnsafeRow > using the update method in UnSafeRow. > This method is called via TunsgtenAggregateIterator on the > InterpretedMutableProjection. The target row in the > InterpretedMutableProjection is an UnsafeRow, while the current row is a > SafeRow. > In the InterpretedMutableProjection's apply method, it invokes > mutableRow(i) = exprArray(i).eval(input) > Now for UnsafeRow, the update method throws UnsupportedOperationException. > The proposed fix I did for our forked branch , on the class > InterpretedProjection is: > + private var targetUnsafe = false > + type UnsafeSetter = (UnsafeRow, Any ) => Unit > + private var setters : Array[UnsafeSetter] = _ > private[this] val exprArray = expressions.toArray > private[this] var mutableRow: MutableRow = new > GenericMutableRow(exprArray.length) > def currentValue: InternalRow = mutableRow > > + > override def target(row: MutableRow): MutableProjection = { > mutableRow = row > +targetUnsafe = row match { > + case _:UnsafeRow =>{ > +if(setters == null) { > + setters = Array.ofDim[UnsafeSetter](exprArray.length) > + for(i <- 0 until exprArray.length) { > +setters(i) = exprArray(i).dataType match { > + case IntegerType => (target: UnsafeRow, value: Any ) => > +target.setInt(i,value.asInstanceOf[Int]) > + case LongType => (target: UnsafeRow, value: Any ) => > +target.setLong(i,value.asInstanceOf[Long]) > + case DoubleType => (target: UnsafeRow, value: Any ) => > +target.setDouble(i,value.asInstanceOf[Double]) > + case FloatType => (target: UnsafeRow, value: Any ) => > +target.setFloat(i,value.asInstanceOf[Float]) > + > + case NullType => (target: UnsafeRow, value: Any ) => > +target.setNullAt(i) > + > + case BooleanType => (target: UnsafeRow, value: Any ) => > +target.setBoolean(i,value.asInstanceOf[Boolean]) > + > + case ByteType => (target: UnsafeRow, value: Any ) => > +target.setByte(i,value.asInstanceOf[Byte]) > + case ShortType => (target: UnsafeRow, value: Any ) => > +target.setShort(i,value.asInstanceOf[Short]) > + > +} > + } > +} > +true > + } > + case _ => false > +} > + > this > } > > override def apply(input: InternalRow): InternalRow = { > var i = 0 > while (i < exprArray.length) { > - mutableRow(i) = exprArray(i).eval(input) > + if(targetUnsafe) { > +setters(i)(mutableRow.asInstanceOf[UnsafeRow], > exprArray(i).eval(input)) > + }else { > +mutableRow(i) = exprArray(i).eval(input) > + } > i += 1 > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows
[ https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134929#comment-15134929 ] Asif Hussain Shahid commented on SPARK-13116: - I further debugged the issue & found the following The problem appears only if the SparkPlan.newMutableProjection , throws an exception in generating mutable projection through code generation. In which case, the code defaults to creating InterpretedMutableProjection. This is what causes the problem. Looks like in our case, one of our UnaryExpression had incorrect code generation which caused exception in bytes generation, resulting in InterpretedMutableProjection being created. I am attaching a Test , which will fail, if the generated mutable projection fails for some reason. At this point I can only reproduce it artificially, by throwing an Exception in the code below in source code. SparkPlan: protected def newMutableProjection( expressions: Seq[Expression], inputSchema: Seq[Attribute], useSubexprElimination: Boolean = false): () => MutableProjection = { log.debug(s"Creating MutableProj: $expressions, inputSchema: $inputSchema") try { GenerateMutableProjection.generate(expressions, inputSchema, useSubexprElimination) throw new UnsupportedOperationException("TEST") } catch { case e: Exception => if (isTesting) { throw e } else { log.error("Failed to generate mutable projection, fallback to interpreted", e) () => new InterpretedMutableProjection(expressions, inputSchema) } } } I suppose it is not a high priority bug , or something which requires fix , just something which I came across in our development. Just to note: I found this issue to be present in the latest code of spark , too. > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows > --- > > Key: SPARK-13116 > URL: https://issues.apache.org/jira/browse/SPARK-13116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Asif Hussain Shahid > Attachments: SPARK_13116_Test.scala > > > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows. > If the input to TungstenAggregateIterator is a SafeRow, while the target is > an UnsafeRow , the current code will try to set the fields in the UnsafeRow > using the update method in UnSafeRow. > This method is called via TunsgtenAggregateIterator on the > InterpretedMutableProjection. The target row in the > InterpretedMutableProjection is an UnsafeRow, while the current row is a > SafeRow. > In the InterpretedMutableProjection's apply method, it invokes > mutableRow(i) = exprArray(i).eval(input) > Now for UnsafeRow, the update method throws UnsupportedOperationException. 
> The proposed fix I did for our forked branch , on the class > InterpretedProjection is: > + private var targetUnsafe = false > + type UnsafeSetter = (UnsafeRow, Any ) => Unit > + private var setters : Array[UnsafeSetter] = _ > private[this] val exprArray = expressions.toArray > private[this] var mutableRow: MutableRow = new > GenericMutableRow(exprArray.length) > def currentValue: InternalRow = mutableRow > > + > override def target(row: MutableRow): MutableProjection = { > mutableRow = row > +targetUnsafe = row match { > + case _:UnsafeRow =>{ > +if(setters == null) { > + setters = Array.ofDim[UnsafeSetter](exprArray.length) > + for(i <- 0 until exprArray.length) { > +setters(i) = exprArray(i).dataType match { > + case IntegerType => (target: UnsafeRow, value: Any ) => > +target.setInt(i,value.asInstanceOf[Int]) > + case LongType => (target: UnsafeRow, value: Any ) => > +target.setLong(i,value.asInstanceOf[Long]) > + case DoubleType => (target: UnsafeRow, value: Any ) => > +target.setDouble(i,value.asInstanceOf[Double]) > + case FloatType => (target: UnsafeRow, value: Any ) => > +target.setFloat(i,value.asInstanceOf[Float]) > + > + case NullType => (target: UnsafeRow, value: Any ) => > +target.setNullAt(i) > + > + case BooleanType => (target: UnsafeRow, value: Any ) => > +target.setBoolean(i,value.asInstanceOf[Boolean]) > + > + case ByteType => (target: UnsafeRow, value: Any ) => > +target.setByte(i,value.asInstanceOf[Byte]) > + case ShortType => (targe
[jira] [Updated] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows
[ https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif Hussain Shahid updated SPARK-13116: Attachment: SPARK_13116_Test.scala A test which reproduces the issue. Pls note that to reproduce the issue you will have to tweak the SparkPlan code a little as explained in the ticket. > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows > --- > > Key: SPARK-13116 > URL: https://issues.apache.org/jira/browse/SPARK-13116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Asif Hussain Shahid > Attachments: SPARK_13116_Test.scala > > > TungstenAggregate though it is supposedly capable of all processing unsafe & > safe rows, fails if the input is safe rows. > If the input to TungstenAggregateIterator is a SafeRow, while the target is > an UnsafeRow , the current code will try to set the fields in the UnsafeRow > using the update method in UnSafeRow. > This method is called via TunsgtenAggregateIterator on the > InterpretedMutableProjection. The target row in the > InterpretedMutableProjection is an UnsafeRow, while the current row is a > SafeRow. > In the InterpretedMutableProjection's apply method, it invokes > mutableRow(i) = exprArray(i).eval(input) > Now for UnsafeRow, the update method throws UnsupportedOperationException. > The proposed fix I did for our forked branch , on the class > InterpretedProjection is: > + private var targetUnsafe = false > + type UnsafeSetter = (UnsafeRow, Any ) => Unit > + private var setters : Array[UnsafeSetter] = _ > private[this] val exprArray = expressions.toArray > private[this] var mutableRow: MutableRow = new > GenericMutableRow(exprArray.length) > def currentValue: InternalRow = mutableRow > > + > override def target(row: MutableRow): MutableProjection = { > mutableRow = row > +targetUnsafe = row match { > + case _:UnsafeRow =>{ > +if(setters == null) { > + setters = Array.ofDim[UnsafeSetter](exprArray.length) > + for(i <- 0 until exprArray.length) { > +setters(i) = exprArray(i).dataType match { > + case IntegerType => (target: UnsafeRow, value: Any ) => > +target.setInt(i,value.asInstanceOf[Int]) > + case LongType => (target: UnsafeRow, value: Any ) => > +target.setLong(i,value.asInstanceOf[Long]) > + case DoubleType => (target: UnsafeRow, value: Any ) => > +target.setDouble(i,value.asInstanceOf[Double]) > + case FloatType => (target: UnsafeRow, value: Any ) => > +target.setFloat(i,value.asInstanceOf[Float]) > + > + case NullType => (target: UnsafeRow, value: Any ) => > +target.setNullAt(i) > + > + case BooleanType => (target: UnsafeRow, value: Any ) => > +target.setBoolean(i,value.asInstanceOf[Boolean]) > + > + case ByteType => (target: UnsafeRow, value: Any ) => > +target.setByte(i,value.asInstanceOf[Byte]) > + case ShortType => (target: UnsafeRow, value: Any ) => > +target.setShort(i,value.asInstanceOf[Short]) > + > +} > + } > +} > +true > + } > + case _ => false > +} > + > this > } > > override def apply(input: InternalRow): InternalRow = { > var i = 0 > while (i < exprArray.length) { > - mutableRow(i) = exprArray(i).eval(input) > + if(targetUnsafe) { > +setters(i)(mutableRow.asInstanceOf[UnsafeRow], > exprArray(i).eval(input)) > + }else { > +mutableRow(i) = exprArray(i).eval(input) > + } > i += 1 > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13166) Remove DataStreamReader/Writer
[ https://issues.apache.org/jira/browse/SPARK-13166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134890#comment-15134890 ] Apache Spark commented on SPARK-13166: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11096 > Remove DataStreamReader/Writer > -- > > Key: SPARK-13166 > URL: https://issues.apache.org/jira/browse/SPARK-13166 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > They seem redundant and we can simply use DataFrameReader/Writer. > The usage looks like: > {code} > val df = sqlContext.read.stream("...") > val handle = df.write.stream("...") > handle.stop() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
[ https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134874#comment-15134874 ] Fede Bar commented on SPARK-12430: -- Very well explained thanks. I'll try the new build and post a feedback as soon as possible. Have a good weekend. > Temporary folders do not get deleted after Task completes causing problems > with disk space. > --- > > Key: SPARK-12430 > URL: https://issues.apache.org/jira/browse/SPARK-12430 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2, 1.6.0 > Environment: Ubuntu server >Reporter: Fede Bar > > We are experiencing an issue with automatic /tmp folder deletion after > framework completes. Completing a M/R job using Spark 1.5.2 (same behavior as > Spark 1.5.1) over Mesos will not delete some temporary folders causing free > disk space on server to exhaust. > Behavior of M/R job using Spark 1.4.1 over Mesos cluster: > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/* , > */tmp/spark-#/blockmgr-#* > - When task is completed */tmp/spark-#/* gets deleted along with > */tmp/spark-#/blockmgr-#* sub-folder. > Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job): > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/mesos/slaves/id** * , > */tmp/spark-***/ * ,{color:red} /tmp/blockmgr-***{color} > - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle > container folder {color:red} /tmp/blockmgr-***{color} > Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several > GB depending on the job that ran. Over time this causes disk space to become > full with consequences that we all know. > Running a shell script would probably work but it is difficult to identify > folders in use by a running M/R or stale folders. I did notice similar issues > opened by other users marked as "resolved", but none seems to exactly match > the above behavior. > I really hope someone has insights on how to fix it. > Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134828#comment-15134828 ] Apache Spark commented on SPARK-13210: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11095 > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13214) Fix dynamic allocation docs
[ https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134777#comment-15134777 ] Apache Spark commented on SPARK-13214: -- User 'anabranch' has created a pull request for this issue: https://github.com/apache/spark/pull/11094 > Fix dynamic allocation docs > --- > > Key: SPARK-13214 > URL: https://issues.apache.org/jira/browse/SPARK-13214 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Bill Chambers >Priority: Trivial > > Update docs to reflect dynamicAllocation to be available for all cluster > managers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13214) Fix dynamic allocation docs
[ https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13214: -- Affects Version/s: 1.6.0 Target Version/s: 1.6.1, 2.0.0 Component/s: Documentation > Fix dynamic allocation docs > --- > > Key: SPARK-13214 > URL: https://issues.apache.org/jira/browse/SPARK-13214 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Bill Chambers >Priority: Trivial > > Update docs to reflect dynamicAllocation to be available for all cluster > managers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13214) Fix dynamic allocation docs
[ https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134746#comment-15134746 ] Apache Spark commented on SPARK-13214: -- User 'anabranch' has created a pull request for this issue: https://github.com/apache/spark/pull/11093 > Fix dynamic allocation docs > --- > > Key: SPARK-13214 > URL: https://issues.apache.org/jira/browse/SPARK-13214 > Project: Spark > Issue Type: Documentation >Reporter: Bill Chambers >Priority: Trivial > > Update docs to reflect dynamicAllocation to be available for all cluster > managers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13214) Fix dynamic allocation docs
[ https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13214: Assignee: Apache Spark > Fix dynamic allocation docs > --- > > Key: SPARK-13214 > URL: https://issues.apache.org/jira/browse/SPARK-13214 > Project: Spark > Issue Type: Documentation >Reporter: Bill Chambers >Assignee: Apache Spark >Priority: Trivial > > Update docs to reflect dynamicAllocation to be available for all cluster > managers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13214) Fix dynamic allocation docs
[ https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13214: Assignee: (was: Apache Spark) > Fix dynamic allocation docs > --- > > Key: SPARK-13214 > URL: https://issues.apache.org/jira/browse/SPARK-13214 > Project: Spark > Issue Type: Documentation >Reporter: Bill Chambers >Priority: Trivial > > Update docs to reflect dynamicAllocation to be available for all cluster > managers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13214) Update docs to reflect dynamicAllocation to be true
Bill Chambers created SPARK-13214: - Summary: Update docs to reflect dynamicAllocation to be true Key: SPARK-13214 URL: https://issues.apache.org/jira/browse/SPARK-13214 Project: Spark Issue Type: Documentation Reporter: Bill Chambers Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13214) Fix dynamic allocation docs
[ https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-13214: -- Description: Update docs to reflect dynamicAllocation to be available for all cluster managers Summary: Fix dynamic allocation docs (was: Update docs to reflect dynamicAllocation to be true) > Fix dynamic allocation docs > --- > > Key: SPARK-13214 > URL: https://issues.apache.org/jira/browse/SPARK-13214 > Project: Spark > Issue Type: Documentation >Reporter: Bill Chambers >Priority: Trivial > > Update docs to reflect dynamicAllocation to be available for all cluster > managers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
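As background for the documentation change tracked above, a minimal sketch of how dynamic allocation is typically enabled, assuming the external shuffle service has been set up on the cluster; the executor bounds are illustrative values, not recommendations from the ticket.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object DynamicAllocationSketch {
  def main(args: Array[String]): Unit = {
    // Sketch only: dynamic allocation relies on the external shuffle service
    // so that shuffle files survive executors being released.
    val conf = new SparkConf()
      .setAppName("dynamic-allocation-sketch")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")   // illustrative
      .set("spark.dynamicAllocation.maxExecutors", "10")  // illustrative
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 100).count())
    } finally {
      sc.stop()
    }
  }
}
{code}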
[jira] [Created] (SPARK-13213) BroadcastNestedLoopJoin is very slow
Davies Liu created SPARK-13213: -- Summary: BroadcastNestedLoopJoin is very slow Key: SPARK-13213 URL: https://issues.apache.org/jira/browse/SPARK-13213 Project: Spark Issue Type: Improvement Reporter: Davies Liu Since we have improved the performance of CartesianProduct, which should be faster and more robust than BroadcastNestedLoopJoin, we should use CartesianProduct instead of BroadcastNestedLoopJoin, especially when the broadcasted table is not that small. Today we hit a query that ran for a very long time without finishing; once we decreased the broadcast threshold (disabling BroadcastNestedLoopJoin), it finished in seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
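A hedged sketch of the workaround mentioned above: lowering the broadcast threshold so the planner cannot pick BroadcastNestedLoopJoin. The configuration key is the standard Spark SQL setting; the DataFrames in the usage comment are placeholders, not the tables from the slow query.

{code}
import org.apache.spark.sql.SQLContext

// Sketch only: -1 disables automatic broadcasting entirely, forcing the
// planner onto a non-broadcast (e.g. Cartesian-based) plan for non-equi joins.
object DisableBroadcastSketch {
  def disableBroadcastJoins(sqlContext: SQLContext): Unit = {
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
  }

  // Hypothetical usage, dfA/dfB standing in for the tables being joined:
  // disableBroadcastJoins(sqlContext)
  // val joined = dfA.join(dfB, dfA("k") > dfB("k"))
}
{code}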
[jira] [Updated] (SPARK-13180) Protect against SessionState being null when accessing HiveClientImpl#conf
[ https://issues.apache.org/jira/browse/SPARK-13180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-13180: --- Attachment: spark-13180-util.patch Patch with HiveConfUtil.scala For record. > Protect against SessionState being null when accessing HiveClientImpl#conf > -- > > Key: SPARK-13180 > URL: https://issues.apache.org/jira/browse/SPARK-13180 > Project: Spark > Issue Type: Bug >Reporter: Ted Yu >Priority: Minor > Attachments: spark-13180-util.patch > > > See this thread http://search-hadoop.com/m/q3RTtFoTDi2HVCrM1 > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205) > at > org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552) > at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:457) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456) > at org.apache.spark.sql.hive.HiveContext$$anon$3.(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473) > at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at > org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
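The attached HiveConfUtil.scala is not reproduced in this digest. Purely as a hypothetical sketch of the kind of guard the ticket asks for (the object name and fallback behaviour below are assumptions, not the actual patch):

{code}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Sketch only: avoid dereferencing a null SessionState on threads that have
// no Hive session attached, falling back to a freshly built HiveConf.
object HiveConfUtilSketch {
  def currentConf(): HiveConf = {
    val state = SessionState.get()   // may be null on this thread
    if (state != null) state.getConf else new HiveConf(classOf[SessionState])
  }
}
{code}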
[jira] [Commented] (SPARK-13189) Cleanup build references to Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134632#comment-15134632 ] Apache Spark commented on SPARK-13189: -- User 'lresende' has created a pull request for this issue: https://github.com/apache/spark/pull/11092 > Cleanup build references to Scala 2.10 > -- > > Key: SPARK-13189 > URL: https://issues.apache.org/jira/browse/SPARK-13189 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.0 >Reporter: Luciano Resende > > There are still few places referencing scala 2.10/2.10.5 while it should be > 2.11/2.11.7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134414#comment-15134414 ] Steve Loughran commented on SPARK-12807: mixing dependency versions in a single project is dangerous. And it still runs the risk of being brittle against whatever version of jackson Hadoop namenodes run with. While I'm not a fan of shading, until we get better classpath/process isolation in nodemanagers, shading here avoids problems and will ensure that ASF spark releases work with ASF Hadoop releases. > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to try to use dynamic allocation on a Hadoop 2.6-based cluster, > you get to see a stack trace in the NM logs, indicating a jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134412#comment-15134412 ] Steve Loughran commented on SPARK-12807: It's good to hear that things work with older versions, but they do need to be compiled in sync. The risk with downgrading jackson versions is that someone who has upgraded will find their code won't link any more. This is the same dilemma that HADOOP-10104 created: revert or tell others "sorry, time to upgrade". We went with the latter, but have added jackson to the list of dependencies whose upgrades are traumatic: Guava, protobuf (which will never be upgraded on the 2.x line) > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to try to use dynamic allocation on a Hadoop 2.6-based cluster, > you get to see a stack trace in the NM logs, indicating a jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13202) Jars specified with --jars do not exist on the worker classpath.
[ https://issues.apache.org/jira/browse/SPARK-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134332#comment-15134332 ] Michael Schmitz edited comment on SPARK-13202 at 2/5/16 3:40 PM: - [~srowen] in that case the documentation is confusing as it claims spark.executor.extraClassPath is deprecated (specifically "This exists primarily for backwards-compatibility with older versions of Spark. Users typically should not need to set this option."). It also means that the `--jars` options isn't as useful as it could be. It copies the jars to the worker nodes, but I cannot use those jars for the Kryo serialization because I don't know where they will end up. was (Author: schmmd): [~srowen] in that case the documentation is confusing as it claims spark.executor.extraClassPath is deprecated. It also means that the `--jars` options isn't as useful as it could be. It copies the jars to the worker nodes, but I cannot use those jars for the Kryo serialization because I don't know where they will end up. > Jars specified with --jars do not exist on the worker classpath. > > > Key: SPARK-13202 > URL: https://issues.apache.org/jira/browse/SPARK-13202 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.2, 1.6.0 >Reporter: Michael Schmitz > > I have a Spark Scala 2.11 application. To deploy it to the cluster, I create > a jar of the dependencies and a jar of the project (although this problem > still manifests if I create a single jar with everything). I will focus on > problems specific to Spark Shell, but I'm pretty sure they also apply to > Spark Submit. > I can get Spark Shell to work with my application, however I need to set > spark.executor.extraClassPath. From reading the documentation > (http://spark.apache.org/docs/latest/configuration.html#runtime-environment) > it sounds like I shouldn't need to set this option ("Users typically should > not need to set this option.") After reading about --jars, I understand that > this should set the classpath for the workers to use the jars that are synced > to those machines. > When I don't set spark.executor.extraClassPath, I get a kryo registrator > exception with the root cause being that a class is not found. > java.io.IOException: org.apache.spark.SparkException: Failed to register > classes with Kryo > java.lang.ClassNotFoundException: org.allenai.common.Enum > If I SSH into the workers, I can see that we did create directories that > contain the jars specified by --jars. > /opt/data/spark/worker/app-20160204212742-0002/0 > /opt/data/spark/worker/app-20160204212742-0002/1 > Now, if I re-run spark-shell but with `--conf > spark.executor.extraClassPath=/opt/data/spark/worker/app-20160204212742-0002/0/myjar.jar`, > my job will succeed. In other words, if I put my jars at a location that is > available to all the workers and specify that as an extra executor class > path, the job succeeds. > Unfortunately, this means that the jars are being copied to the workers for > no reason. How can I get --jars to add the jars it copies to the workers to > the classpath? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13202) Jars specified with --jars do not exist on the worker classpath.
[ https://issues.apache.org/jira/browse/SPARK-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134332#comment-15134332 ] Michael Schmitz commented on SPARK-13202: - [~srowen] in that case the documentation is confusing as it claims spark.executor.extraClassPath is deprecated. It also means that the `--jars` options isn't as useful as it could be. It copies the jars to the worker nodes, but I cannot use those jars for the Kryo serialization because I don't know where they will end up. > Jars specified with --jars do not exist on the worker classpath. > > > Key: SPARK-13202 > URL: https://issues.apache.org/jira/browse/SPARK-13202 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.2, 1.6.0 >Reporter: Michael Schmitz > > I have a Spark Scala 2.11 application. To deploy it to the cluster, I create > a jar of the dependencies and a jar of the project (although this problem > still manifests if I create a single jar with everything). I will focus on > problems specific to Spark Shell, but I'm pretty sure they also apply to > Spark Submit. > I can get Spark Shell to work with my application, however I need to set > spark.executor.extraClassPath. From reading the documentation > (http://spark.apache.org/docs/latest/configuration.html#runtime-environment) > it sounds like I shouldn't need to set this option ("Users typically should > not need to set this option.") After reading about --jars, I understand that > this should set the classpath for the workers to use the jars that are synced > to those machines. > When I don't set spark.executor.extraClassPath, I get a kryo registrator > exception with the root cause being that a class is not found. > java.io.IOException: org.apache.spark.SparkException: Failed to register > classes with Kryo > java.lang.ClassNotFoundException: org.allenai.common.Enum > If I SSH into the workers, I can see that we did create directories that > contain the jars specified by --jars. > /opt/data/spark/worker/app-20160204212742-0002/0 > /opt/data/spark/worker/app-20160204212742-0002/1 > Now, if I re-run spark-shell but with `--conf > spark.executor.extraClassPath=/opt/data/spark/worker/app-20160204212742-0002/0/myjar.jar`, > my job will succeed. In other words, if I put my jars at a location that is > available to all the workers and specify that as an extra executor class > path, the job succeeds. > Unfortunately, this means that the jars are being copied to the workers for > no reason. How can I get --jars to add the jars it copies to the workers to > the classpath? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
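For context on the Kryo failure described above, a hedged sketch of the pieces involved: a registrator class plus configuration that makes both the registrator and the classes it registers visible on the executor classpath. The registered class, registrator name, and jar path are placeholders, not the reporter's actual values.

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Sketch only: the registrator runs inside each executor, so the jar that
// contains it (and the classes it registers) must be loadable there.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[java.util.ArrayList[_]])  // placeholder class
  }
}

object KryoConfSketch {
  // Registrator name and jar path are hypothetical.
  def kryoConf(): SparkConf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "MyRegistrator")
    .set("spark.executor.extraClassPath", "/opt/data/spark/jars/myjar.jar")
}
{code}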
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134322#comment-15134322 ] Cristian Opris commented on SPARK-11725: I would argue this is a rather inappropriate solution. Scala does not normally distinguish between primitive and boxed types, with Long for example representing both. So having Long args in a function and then testing for null is a valid thing to do in Scala. This is made worse by the fact that the behaviour is actually not documented anywhere, so results in very strange and unexpected behaviour. By the principle of least surprise, I would suggest either (or both) of: - Throwing an error when a null value cannot be passed to a UDF that has been compiled to only accept nulls. - Using Option[T] as a UDF arg to signal that the function accepts nulls. > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jeff Zhang >Assignee: Wenchen Fan >Priority: Blocker > Labels: releasenotes > Fix For: 1.6.0 > > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. > {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > ++---++ > | age| name|age2| > ++---++ > |null|Michael| 0| > | 30| Andy| 31| > | 19| Justin| 20| > ++---++ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if the udf input has null value argument > * Let udf itself to handle null > I would prefer the third option -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
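As a hedged illustration of one common way around the primitive-default behaviour discussed above (not the fix adopted by the ticket): keep null inputs away from the UDF at the DataFrame level, so the output column stays null instead of silently picking up a default value. The DataFrame is assumed to be the example one with a nullable "age" column.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf, when}

// Sketch only: when() without otherwise() yields null for non-matching rows,
// so null ages stay null and the UDF only ever sees real values.
object NullSafeUdfSketch {
  val inc = udf((x: Int) => x + 1)

  def addAge2(df: DataFrame): DataFrame =
    df.withColumn("age2", when(col("age").isNotNull, inc(col("age"))))
}
{code}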
[jira] [Updated] (SPARK-12969) Exception while casting a spark supported date formatted "string" to "date" data type.
[ https://issues.apache.org/jira/browse/SPARK-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jais Sebastian updated SPARK-12969: --- Description: Getting exception while converting a string column( column is having spark supported date format -MM-dd ) to date data type. Below is the code snippet List jsonData = Arrays.asList( "{\"d\":\"2015-02-01\",\"n\":1}"); JavaRDD dataRDD = this.getSparkContext().parallelize(jsonData); DataFrame data = this.getSqlContext().read().json(dataRDD); DataFrame newData = data.select(data.col("d").cast("date")); newData.show(); Above code will give the error failed to compile: org.codehaus.commons.compiler.CompileException: File generated.java, Line 95, Column 28: Expression "scala.Option < Long > longOpt16" is not an lvalue This happens only if we execute the program in client mode , it works if we execute through spark submit. Here is the sample project : https://github.com/uhonnavarkar/spark_test was: Getting exception while converting a string column( column is having spark supported date format -MM-dd ) to date data type. Below is the code snippet List jsonData = Arrays.asList( "{\"d\":\"2015-02-01\",\"n\":1}"); JavaRDD dataRDD = this.getSparkContext().parallelize(jsonData); DataFrame data = this.getSqlContext().read().json(dataRDD); DataFrame newData = data.select(data.col("d").cast("date")); newData.show(); Above code will give the error failed to compile: org.codehaus.commons.compiler.CompileException: File generated.java, Line 95, Column 28: Expression "scala.Option < Long > longOpt16" is not an lvalue > Exception while casting a spark supported date formatted "string" to "date" > data type. > --- > > Key: SPARK-12969 > URL: https://issues.apache.org/jira/browse/SPARK-12969 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.0 > Environment: Spark Java >Reporter: Jais Sebastian > > Getting exception while converting a string column( column is having spark > supported date format -MM-dd ) to date data type. Below is the code > snippet > List jsonData = Arrays.asList( > "{\"d\":\"2015-02-01\",\"n\":1}"); > JavaRDD dataRDD = > this.getSparkContext().parallelize(jsonData); > DataFrame data = this.getSqlContext().read().json(dataRDD); > DataFrame newData = data.select(data.col("d").cast("date")); > newData.show(); > Above code will give the error > failed to compile: org.codehaus.commons.compiler.CompileException: File > generated.java, Line 95, Column 28: Expression "scala.Option < Long > > longOpt16" is not an lvalue > This happens only if we execute the program in client mode , it works if we > execute through spark submit. Here is the sample project : > https://github.com/uhonnavarkar/spark_test -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12969) Exception while casting a spark supported date formatted "string" to "date" data type.
[ https://issues.apache.org/jira/browse/SPARK-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134287#comment-15134287 ] Jais Sebastian commented on SPARK-12969: Hi , Looks like this issue happens only if the code is deployed as client mode.This works if we execute the same using Spark submit ! Here is the sample project for your reference https://github.com/uhonnavarkar/spark_test Regards, Jais > Exception while casting a spark supported date formatted "string" to "date" > data type. > --- > > Key: SPARK-12969 > URL: https://issues.apache.org/jira/browse/SPARK-12969 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.0 > Environment: Spark Java >Reporter: Jais Sebastian > > Getting exception while converting a string column( column is having spark > supported date format -MM-dd ) to date data type. Below is the code > snippet > List jsonData = Arrays.asList( > "{\"d\":\"2015-02-01\",\"n\":1}"); > JavaRDD dataRDD = > this.getSparkContext().parallelize(jsonData); > DataFrame data = this.getSqlContext().read().json(dataRDD); > DataFrame newData = data.select(data.col("d").cast("date")); > newData.show(); > Above code will give the error > failed to compile: org.codehaus.commons.compiler.CompileException: File > generated.java, Line 95, Column 28: Expression "scala.Option < Long > > longOpt16" is not an lvalue -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
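Not a fix for the reported compile error itself, but a hedged sketch of the usual alternative to cast("date") for this pattern, using the built-in to_date function (shown in Scala rather than the ticket's Java; the column names follow the JSON in the report):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.to_date

// Sketch only: "d" is the yyyy-MM-dd string column from the ticket's JSON.
object ToDateSketch {
  def withDateColumn(df: DataFrame): DataFrame =
    df.select(to_date(df("d")).as("d"), df("n"))
}
{code}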
[jira] [Created] (SPARK-13212) Provide a way to unregister data sources from a SQLContext
Daniel Darabos created SPARK-13212: -- Summary: Provide a way to unregister data sources from a SQLContext Key: SPARK-13212 URL: https://issues.apache.org/jira/browse/SPARK-13212 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.0 Reporter: Daniel Darabos We allow our users to run SQL queries on their data via a web interface. We create an isolated SQLContext with {{sqlContext.newSession()}}, create their DataFrames in this context, register them with {{registerTempTable}}, then execute the query with {{isolatedContext.sql(query)}}. The issue is that they have the full power of Spark SQL at their disposal. They can run {{SELECT * FROM csv.`/etc/passwd`}}. This specific syntax can be disabled by setting {{spark.sql.runSQLOnFiles}} (a private, undocumented configuration) to {{false}}. But creating a temporary table (http://spark.apache.org/docs/latest/sql-programming-guide.html#loading-data-programmatically) would still work, if we had a HiveContext. As long as all DataSources on the classpath are readily available, I don't think we can be reassured about the security implications. So I think a nice solution would be to make the list of available DataSources a property of the SQLContext. Then for the isolated SQLContext we could simply remove all DataSources. This would allow more fine-grained use cases too. What do you think? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
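A hedged sketch of the isolation setup the ticket describes (an isolated session plus the private, undocumented runSQLOnFiles flag); it does not implement the requested ability to unregister data sources.

{code}
import org.apache.spark.sql.SQLContext

// Sketch only: an isolated session for untrusted SQL. Temp tables registered
// here are not visible to other sessions, and file-backed SELECTs are off.
object IsolatedSessionSketch {
  def isolatedSession(root: SQLContext): SQLContext = {
    val session = root.newSession()
    session.setConf("spark.sql.runSQLOnFiles", "false")
    session
  }
}
{code}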
[jira] [Commented] (SPARK-10548) Concurrent execution in SQL does not work
[ https://issues.apache.org/jira/browse/SPARK-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134146#comment-15134146 ] Akshay Harale commented on SPARK-10548: --- We are still facing this issue while repeatedly querying cassandra database using spark-cassandra-connector. Spark version 1.5.1 spark-cassandra-connector 1.5.0-M3 > Concurrent execution in SQL does not work > - > > Key: SPARK-10548 > URL: https://issues.apache.org/jira/browse/SPARK-10548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > From the mailing list: > {code} > future { df1.count() } > future { df2.count() } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > === edit === > Simple reproduction: > {code} > (1 to 100).par.foreach { _ => > sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12739) Details of batch in Streaming tab uses two Duration columns
[ https://issues.apache.org/jira/browse/SPARK-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12739: -- Assignee: Mario Briggs > Details of batch in Streaming tab uses two Duration columns > --- > > Key: SPARK-12739 > URL: https://issues.apache.org/jira/browse/SPARK-12739 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 1.6.0 >Reporter: Jacek Laskowski >Assignee: Mario Briggs >Priority: Minor > Fix For: 1.6.1, 2.0.0 > > Attachments: SPARK-12739.png > > > "Details of batch" screen in Streaming tab in web UI uses two Duration > columns. I think one should be "Processing Time" while the other "Job > Duration". > See the attachment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134028#comment-15134028 ] Sean Owen commented on SPARK-13198: --- I don't see evidence of a problem here yet. Stuff stays on heap until it is GCed. Are you sure you triggered one like with a profiler and then measured? You also generally would never stop and start a context in an app > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
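Following the comment above, a hedged restructuring of the reproduction so that the context is created once and reused; the master URL and input path are the placeholders from the ticket.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReuseContextSketch {
  def main(args: Array[String]): Unit = {
    // Sketch only: one SparkContext for the whole application instead of a
    // stop/start cycle per iteration.
    val conf = new SparkConf().setMaster("MASTER_URL").setAppName("reuse-context")
    conf.set("spark.mesos.coarse", "true")
    conf.set("spark.cores.max", "10")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    try {
      for (i <- 1 until 100) {
        val events = sqlContext.read.parquet("hdfs://localhost/tmp/something")
        println(s"Iteration ($i), number of events: " + events.count())
      }
    } finally {
      sc.stop()
    }
  }
}
{code}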
[jira] [Commented] (SPARK-13199) Upgrade apache httpclient version to the latest 4.5 for security
[ https://issues.apache.org/jira/browse/SPARK-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134022#comment-15134022 ] Ted Yu commented on SPARK-13199: This task should be done when there is a Hadoop release that upgrades httpclient to 4.5+ I logged this as Bug since it deals with CVEs. See HADOOP-12767. > Upgrade apache httpclient version to the latest 4.5 for security > > > Key: SPARK-13199 > URL: https://issues.apache.org/jira/browse/SPARK-13199 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Ted Yu >Priority: Minor > > Various SSL security fixes are needed. > See: CVE-2012-6153, CVE-2011-4461, CVE-2014-3577, CVE-2015-5262. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!
[ https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134010#comment-15134010 ] Sean Owen commented on SPARK-13176: --- Can we possibly not use this API at all? Sorry have not investigated whether that's realistic. I don't have other bright ideas. > Ignore deprecation warning for ProcessBuilder lines_! > - > > Key: SPARK-13176 > URL: https://issues.apache.org/jira/browse/SPARK-13176 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > The replacement, stream_! & lineStream_! is not present in 2.10 API. > Note @SupressWarnings for deprecation doesn't appear to work > https://issues.scala-lang.org/browse/SI-7934 so suppressing the warnings > might involve wrapping or similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
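For reference, a hedged sketch of the deprecated call and its Scala 2.11 replacement; as the ticket notes, lineStream_! is absent from the 2.10 API, which is why a plain swap is not enough while 2.10 is still supported.

{code}
import scala.sys.process._

// Sketch only: both expressions run the command and stream its stdout lines,
// ignoring the exit code.
object LinesDeprecationSketch {
  def deprecatedStyle(): Stream[String] = Seq("ls", "-1").lines_!       // deprecated in 2.11
  def replacementStyle(): Stream[String] = Seq("ls", "-1").lineStream_! // 2.11 only
}
{code}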
[jira] [Commented] (SPARK-13204) Replace use of mutable.SynchronizedMap with ConcurrentHashMap
[ https://issues.apache.org/jira/browse/SPARK-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134006#comment-15134006 ] Ted Yu commented on SPARK-13204: Just noticed that parent JIRA was marked trivial. Should have done the search first. > Replace use of mutable.SynchronizedMap with ConcurrentHashMap > - > > Key: SPARK-13204 > URL: https://issues.apache.org/jira/browse/SPARK-13204 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu >Priority: Trivial > > From > http://www.scala-lang.org/api/2.11.1/index.html#scala.collection.mutable.SynchronizedMap > : > Synchronization via traits is deprecated as it is inherently unreliable. > Consider java.util.concurrent.ConcurrentHashMap as an alternative. > This issue is to replace the use of mutable.SynchronizedMap and add > scalastyle rule banning mutable.SynchronizedMap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
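A hedged sketch of the replacement pattern the ticket proposes, for a map with arbitrary (illustrative) key/value types:

{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._
import scala.collection.{concurrent, mutable}

object ConcurrentMapSketch {
  // Deprecated style: synchronization mixed in via a trait.
  val oldStyle: mutable.Map[String, Int] =
    new mutable.HashMap[String, Int]() with mutable.SynchronizedMap[String, Int]

  // Suggested replacement: a ConcurrentHashMap, wrapped as a Scala map.
  val newStyle: concurrent.Map[String, Int] =
    new ConcurrentHashMap[String, Int]().asScala

  newStyle.putIfAbsent("stage-0", 1)
}
{code}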
[jira] [Commented] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated
[ https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133963#comment-15133963 ] sachin aggarwal commented on SPARK-13172: - Since Scala also recommends the same (http://www.scala-lang.org/api/2.11.1/index.html#scala.runtime.RichException), it should be changed. I will go ahead and make the change. > Stop using RichException.getStackTrace it is deprecated > --- > > Key: SPARK-13172 > URL: https://issues.apache.org/jira/browse/SPARK-13172 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > Throwable getStackTrace is the recommended alternative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
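A hedged sketch of what this change typically looks like at a call site: replacing the deprecated RichException helper with plain java.lang.Throwable APIs.

{code}
import java.io.{PrintWriter, StringWriter}

object StackTraceSketch {
  // Sketch only: instead of the deprecated e.getStackTraceString, format the
  // trace using Throwable methods.
  def formatStackTrace(e: Throwable): String = {
    val sw = new StringWriter()
    e.printStackTrace(new PrintWriter(sw))
    sw.toString
  }
}
{code}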
[jira] [Resolved] (SPARK-13188) There is OMM when a receiver stores block
[ https://issues.apache.org/jira/browse/SPARK-13188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13188. --- Resolution: Not A Problem Fix Version/s: (was: 2.0.0) Target Version/s: (was: 2.0.0) Based on this alone - you merely didn't have enough memory or it was too tight. Not a bug. Please again dont set target or fix version > There is OMM when a receiver stores block > - > > Key: SPARK-13188 > URL: https://issues.apache.org/jira/browse/SPARK-13188 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 >Reporter: KaiXinXIaoLei > > 2016-02-04 15:13:27,349 | ERROR | [Thread-35] | Uncaught exception in thread > Thread[Thread-35,5,main] > java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:3236) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) > at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43) > at > org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153) > at > org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190) > at > org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199) > at > org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:173) > at > org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:156) > at > org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) > at > org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:124) > at > org.apache.spark.streaming.kafka.ReliableKafkaReceiver.org$apache$spark$streaming$kafka$ReliableKafkaReceiver$$storeBlockAndCommitOffset(ReliableKafkaReceiver.scala:215) > at > org.apache.spark.streaming.kafka.ReliableKafkaReceiver$GeneratedBlockHandler.onPushBlock(ReliableKafkaReceiver.scala:294) > at > org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:294) > at > org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:266) > at > org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:108) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13039) Spark Streaming with Mesos shutdown without any reason on logs
[ https://issues.apache.org/jira/browse/SPARK-13039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-13039: --- > Spark Streaming with Mesos shutdown without any reason on logs > -- > > Key: SPARK-13039 > URL: https://issues.apache.org/jira/browse/SPARK-13039 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.5.1 >Reporter: Luis Alves >Priority: Minor > > I've a Spark Application running with Mesos that is being killed (this > happens every 2 days). When I see the logs, this is what I have in the spark > driver: > {quote} > 16/01/27 05:24:24 INFO JobScheduler: Starting job streaming job 1453872264000 > ms.0 from job set of time 1453872264000 ms > 16/01/27 05:24:24 INFO JobScheduler: Added jobs for time 1453872264000 ms > 16/01/27 05:24:24 INFO SparkContext: Starting job: foreachRDD at > StreamingApplication.scala:59 > 16/01/27 05:24:24 INFO DAGScheduler: Got job 40085 (foreachRDD at > StreamingApplication.scala:59) with 1 output partitions > 16/01/27 05:24:24 INFO DAGScheduler: Final stage: ResultStage > 40085(foreachRDD at StreamingApplication.scala:59) > 16/01/27 05:24:24 INFO DAGScheduler: Parents of final stage: List() > 16/01/27 05:24:24 INFO DAGScheduler: Missing parents: List() > 16/01/27 05:24:24 INFO DAGScheduler: Submitting ResultStage 40085 > (MapPartitionsRDD[80171] at map at StreamingApplication.scala:59), which has > no missing parents > 16/01/27 05:24:24 INFO MemoryStore: ensureFreeSpace(4720) called with > curMem=147187, maxMem=560497950 > 16/01/27 05:24:24 INFO MemoryStore: Block broadcast_40085 stored as values in > memory (estimated size 4.6 KB, free 534.4 MB) > Killed > {quote} > And this is what I see in the spark slaves: > {quote} > 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80167 > 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80166 > 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80166 > I0127 05:24:24.070618 11142 exec.cpp:381] Executor asked to shutdown > 16/01/27 05:24:24 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: > SIGTERM > 16/01/27 05:24:24 ERROR CoarseGrainedExecutorBackend: Driver > 10.241.10.13:51810 disassociated! Shutting down. > 16/01/27 05:24:24 INFO DiskBlockManager: Shutdown hook called > 16/01/27 05:24:24 WARN ReliableDeliverySupervisor: Association with remote > system [akka.tcp://sparkDriver@10.241.10.13:51810] has failed, address is now > gated for [5000] ms. Reason: [Disassociated] > 16/01/27 05:24:24 INFO ShutdownHookManager: Shutdown hook called > 16/01/27 05:24:24 INFO ShutdownHookManager: Deleting directory > /tmp/spark-f80464b5-1de2-461e-b78b-8ddbd077682a > {quote} > As you can see, this doesn't give any information about the reason why the > driver was killed. > The mesos version I'm using is 0.25.0. > How can I get more information about why it is being killed? > Curious fact: I also have a Spark Jobserver clustering running and without > any problem (same versions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13039) Spark Streaming with Mesos shutdown without any reason on logs
[ https://issues.apache.org/jira/browse/SPARK-13039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13039. --- Resolution: Not A Problem > Spark Streaming with Mesos shutdown without any reason on logs > -- > > Key: SPARK-13039 > URL: https://issues.apache.org/jira/browse/SPARK-13039 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.5.1 >Reporter: Luis Alves >Priority: Minor > > I've a Spark Application running with Mesos that is being killed (this > happens every 2 days). When I see the logs, this is what I have in the spark > driver: > {quote} > 16/01/27 05:24:24 INFO JobScheduler: Starting job streaming job 1453872264000 > ms.0 from job set of time 1453872264000 ms > 16/01/27 05:24:24 INFO JobScheduler: Added jobs for time 1453872264000 ms > 16/01/27 05:24:24 INFO SparkContext: Starting job: foreachRDD at > StreamingApplication.scala:59 > 16/01/27 05:24:24 INFO DAGScheduler: Got job 40085 (foreachRDD at > StreamingApplication.scala:59) with 1 output partitions > 16/01/27 05:24:24 INFO DAGScheduler: Final stage: ResultStage > 40085(foreachRDD at StreamingApplication.scala:59) > 16/01/27 05:24:24 INFO DAGScheduler: Parents of final stage: List() > 16/01/27 05:24:24 INFO DAGScheduler: Missing parents: List() > 16/01/27 05:24:24 INFO DAGScheduler: Submitting ResultStage 40085 > (MapPartitionsRDD[80171] at map at StreamingApplication.scala:59), which has > no missing parents > 16/01/27 05:24:24 INFO MemoryStore: ensureFreeSpace(4720) called with > curMem=147187, maxMem=560497950 > 16/01/27 05:24:24 INFO MemoryStore: Block broadcast_40085 stored as values in > memory (estimated size 4.6 KB, free 534.4 MB) > Killed > {quote} > And this is what I see in the spark slaves: > {quote} > 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80167 > 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80166 > 16/01/27 05:24:20 INFO BlockManager: Removing RDD 80166 > I0127 05:24:24.070618 11142 exec.cpp:381] Executor asked to shutdown > 16/01/27 05:24:24 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: > SIGTERM > 16/01/27 05:24:24 ERROR CoarseGrainedExecutorBackend: Driver > 10.241.10.13:51810 disassociated! Shutting down. > 16/01/27 05:24:24 INFO DiskBlockManager: Shutdown hook called > 16/01/27 05:24:24 WARN ReliableDeliverySupervisor: Association with remote > system [akka.tcp://sparkDriver@10.241.10.13:51810] has failed, address is now > gated for [5000] ms. Reason: [Disassociated] > 16/01/27 05:24:24 INFO ShutdownHookManager: Shutdown hook called > 16/01/27 05:24:24 INFO ShutdownHookManager: Deleting directory > /tmp/spark-f80464b5-1de2-461e-b78b-8ddbd077682a > {quote} > As you can see, this doesn't give any information about the reason why the > driver was killed. > The mesos version I'm using is 0.25.0. > How can I get more information about why it is being killed? > Curious fact: I also have a Spark Jobserver clustering running and without > any problem (same versions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13202) Jars specified with --jars do not exist on the worker classpath.
[ https://issues.apache.org/jira/browse/SPARK-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13202. --- Resolution: Not A Problem You have a case where you need your classes available at the same level as Spark and Kryo classes. That's not a bug. You seem to have done it correctly > Jars specified with --jars do not exist on the worker classpath. > > > Key: SPARK-13202 > URL: https://issues.apache.org/jira/browse/SPARK-13202 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.2, 1.6.0 >Reporter: Michael Schmitz > > I have a Spark Scala 2.11 application. To deploy it to the cluster, I create > a jar of the dependencies and a jar of the project (although this problem > still manifests if I create a single jar with everything). I will focus on > problems specific to Spark Shell, but I'm pretty sure they also apply to > Spark Submit. > I can get Spark Shell to work with my application, however I need to set > spark.executor.extraClassPath. From reading the documentation > (http://spark.apache.org/docs/latest/configuration.html#runtime-environment) > it sounds like I shouldn't need to set this option ("Users typically should > not need to set this option.") After reading about --jars, I understand that > this should set the classpath for the workers to use the jars that are synced > to those machines. > When I don't set spark.executor.extraClassPath, I get a kryo registrator > exception with the root cause being that a class is not found. > java.io.IOException: org.apache.spark.SparkException: Failed to register > classes with Kryo > java.lang.ClassNotFoundException: org.allenai.common.Enum > If I SSH into the workers, I can see that we did create directories that > contain the jars specified by --jars. > /opt/data/spark/worker/app-20160204212742-0002/0 > /opt/data/spark/worker/app-20160204212742-0002/1 > Now, if I re-run spark-shell but with `--conf > spark.executor.extraClassPath=/opt/data/spark/worker/app-20160204212742-0002/0/myjar.jar`, > my job will succeed. In other words, if I put my jars at a location that is > available to all the workers and specify that as an extra executor class > path, the job succeeds. > Unfortunately, this means that the jars are being copied to the workers for > no reason. How can I get --jars to add the jars it copies to the workers to > the classpath? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.
[ https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133954#comment-15133954 ] sachin aggarwal edited comment on SPARK-13177 at 2/5/16 10:23 AM:
---

Hi [~holdenk], I would like to work on this.
Thanks,
Sachin

was (Author: sachin aggarwal):
Hi [~AlHolden], I would like to work on this.
Thanks,
Sachin

> Update ActorWordCount example to not directly use low level linked list as it is deprecated.
> ---
>
> Key: SPARK-13177
> URL: https://issues.apache.org/jira/browse/SPARK-13177
> Project: Spark
> Issue Type: Sub-task
> Components: Examples
> Reporter: holdenk
> Priority: Minor
[jira] [Commented] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.
[ https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133954#comment-15133954 ] sachin aggarwal commented on SPARK-13177:
---

Hi [~AlHolden], I would like to work on this.
Thanks,
Sachin

> Update ActorWordCount example to not directly use low level linked list as it is deprecated.
> ---
>
> Key: SPARK-13177
> URL: https://issues.apache.org/jira/browse/SPARK-13177
> Project: Spark
> Issue Type: Sub-task
> Components: Examples
> Reporter: holdenk
> Priority: Minor
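For context on what this sub-task asks for: the deprecated type in question is scala.collection.mutable.LinkedList (deprecated since Scala 2.11). The sketch below shows one plausible shape of the change, assuming the example keeps a mutable collection of subscribed actor references; the class, field, and method names here are illustrative rather than copied from the actual ActorWordCount source.

{code}
import akka.actor.ActorRef
import scala.collection.mutable

// Illustrative stand-in for the feeder-side state in the example.
class SubscriberRegistry {
  // Before (deprecated since Scala 2.11):
  //   var receivers = new mutable.LinkedList[ActorRef]()
  // After: a non-deprecated mutable buffer with the same usage pattern.
  private val receivers = mutable.ArrayBuffer.empty[ActorRef]

  def subscribe(ref: ActorRef): Unit = receivers += ref

  def unsubscribe(ref: ActorRef): Unit = receivers -= ref

  // Push a message to every currently subscribed receiver.
  def broadcast(msg: String): Unit = receivers.foreach(_ ! msg)
}
{code}

Any non-deprecated mutable collection (ArrayBuffer, ListBuffer) would do here; the JIRA only asks that the example stop reaching for the deprecated LinkedList directly.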
[jira] [Resolved] (SPARK-13192) Some memory leak problem?
[ https://issues.apache.org/jira/browse/SPARK-13192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13192.
---
    Resolution: Invalid

Questions go to user@

> Some memory leak problem?
> ---
>
> Key: SPARK-13192
> URL: https://issues.apache.org/jira/browse/SPARK-13192
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.6.0
> Reporter: uncleGen
>
> In my Spark Streaming job, the executor container was killed by YARN. The NodeManager log showed that the container's memory was constantly increasing until it exceeded the container memory limit. Specifically, only the container that hosts the receiver shows this problem. Is there a memory leak here, or is this a known memory leak issue?
> My code snippet:
> {code}
> dStream.foreachRDD(new Function<JavaRDD<T>, Void>() {
>     @Override
>     public Void call(JavaRDD<T> rdd) throws Exception {
>         T card = rdd.first();
>         String time = DateUtil.formatToDay(DateUtil.strLong(card.getTime()));
>         System.out.println("time: " + time + ", count: " + rdd.count());
>         return null;
>     }
> });
> {code}
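This does not explain the container memory growth being reported, but one detail of the snippet is worth noting: rdd.first() and rdd.count() are separate actions, so each batch launches two jobs, and first() throws on an empty batch. A rough Scala equivalent that tolerates empty batches is sketched below; Card and the date formatting are placeholders standing in for the reporter's own T and DateUtil, not anything from the report.

{code}
import org.apache.spark.streaming.dstream.DStream

object BatchLogger {
  // Placeholder element type and date helper in place of the reporter's T and DateUtil.
  case class Card(time: Long)
  private def formatToDay(ts: Long): String =
    new java.text.SimpleDateFormat("yyyy-MM-dd").format(new java.util.Date(ts))

  def logBatchSummary(dStream: DStream[Card]): Unit = {
    dStream.foreachRDD { rdd =>
      // take(1) returns an empty array on an empty batch instead of throwing, unlike first().
      rdd.take(1).headOption.foreach { card =>
        println(s"time: ${formatToDay(card.time)}, count: ${rdd.count()}")
      }
    }
  }
}
{code}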
[jira] [Commented] (SPARK-13191) Update LICENSE with Scala 2.11 dependencies
[ https://issues.apache.org/jira/browse/SPARK-13191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133952#comment-15133952 ] Sean Owen commented on SPARK-13191:
---

It is really not necessary to make so many PRs and JIRAs! These are tiny and logically related changes. It makes more noise than it is worth.

> Update LICENSE with Scala 2.11 dependencies
> ---
>
> Key: SPARK-13191
> URL: https://issues.apache.org/jira/browse/SPARK-13191
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 2.0.0
> Reporter: Luciano Resende
[jira] [Updated] (SPARK-13199) Upgrade apache httpclient version to the latest 4.5 for security
[ https://issues.apache.org/jira/browse/SPARK-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13199:
---
    Priority: Minor  (was: Major)
    Issue Type: Improvement  (was: Bug)

This also was not filled out correctly. Ted, as you probably know, it is not necessarily that simple. You need to investigate the impact of making all transitive dependencies use it, and what happens when an earlier version is provided by Hadoop. Have you done that?

> Upgrade apache httpclient version to the latest 4.5 for security
> ---
>
> Key: SPARK-13199
> URL: https://issues.apache.org/jira/browse/SPARK-13199
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Reporter: Ted Yu
> Priority: Minor
>
> Various SSL security fixes are needed. See: CVE-2012-6153, CVE-2011-4461, CVE-2014-3577, CVE-2015-5262.
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133946#comment-15133946 ] Sean Owen commented on SPARK-12807:
---

Why vary this to use an earlier version? It seems like all downside.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> ---
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, YARN
> Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with dynamic allocation enabled
> Reporter: Steve Loughran
> Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (Reported on the Spark dev list.)
[jira] [Resolved] (SPARK-13204) Replace use of mutable.SynchronizedMap with ConcurrentHashMap
[ https://issues.apache.org/jira/browse/SPARK-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13204.
---
    Resolution: Duplicate

> Replace use of mutable.SynchronizedMap with ConcurrentHashMap
> ---
>
> Key: SPARK-13204
> URL: https://issues.apache.org/jira/browse/SPARK-13204
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Ted Yu
> Priority: Trivial
>
> From http://www.scala-lang.org/api/2.11.1/index.html#scala.collection.mutable.SynchronizedMap :
> "Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentHashMap as an alternative."
> This issue is to replace the use of mutable.SynchronizedMap and add a scalastyle rule banning mutable.SynchronizedMap.
[jira] [Updated] (SPARK-13204) Replace use of mutable.SynchronizedMap with ConcurrentHashMap
[ https://issues.apache.org/jira/browse/SPARK-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13204:
---
    Priority: Trivial  (was: Major)
    Component/s: Spark Core
    Issue Type: Improvement  (was: Bug)

> Replace use of mutable.SynchronizedMap with ConcurrentHashMap
> ---
>
> Key: SPARK-13204
> URL: https://issues.apache.org/jira/browse/SPARK-13204
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Ted Yu
> Priority: Trivial
>
> From http://www.scala-lang.org/api/2.11.1/index.html#scala.collection.mutable.SynchronizedMap :
> "Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentHashMap as an alternative."
> This issue is to replace the use of mutable.SynchronizedMap and add a scalastyle rule banning mutable.SynchronizedMap.
[jira] [Commented] (SPARK-13204) Replace use of mutable.SynchronizedMap with ConcurrentHashMap
[ https://issues.apache.org/jira/browse/SPARK-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133940#comment-15133940 ] Sean Owen commented on SPARK-13204:
---

I think you have been around long enough to know that you need to set JIRA fields correctly if you're going to open one. This can't be a 'major bug', and it has no component. We also just fixed this, right? It's a duplicate.

> Replace use of mutable.SynchronizedMap with ConcurrentHashMap
> ---
>
> Key: SPARK-13204
> URL: https://issues.apache.org/jira/browse/SPARK-13204
> Project: Spark
> Issue Type: Bug
> Reporter: Ted Yu
>
> From http://www.scala-lang.org/api/2.11.1/index.html#scala.collection.mutable.SynchronizedMap :
> "Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentHashMap as an alternative."
> This issue is to replace the use of mutable.SynchronizedMap and add a scalastyle rule banning mutable.SynchronizedMap.
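For readers who have not seen the pattern this JIRA refers to, a minimal sketch of the migration is shown below; the map name and contents are illustrative and not taken from Spark's actual call sites.

{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

object SynchronizedMapMigration {
  // Before: the deprecated trait mixin, flagged as unreliable in the scaladoc.
  val before = new mutable.HashMap[String, Int] with mutable.SynchronizedMap[String, Int]

  // After: a thread-safe map from java.util.concurrent.
  val after = new ConcurrentHashMap[String, Int]()

  def demo(): Unit = {
    before.put("activeTasks", 1)
    after.put("activeTasks", 1)
    // ConcurrentHashMap also provides atomic compound operations the mixin never did.
    after.putIfAbsent("activeTasks", 2)
  }
}
{code}

Where Scala-style map operations are wanted at the call site, wrapping the ConcurrentHashMap with scala.collection.JavaConverters' asScala yields a scala.collection.concurrent.Map view, which keeps the surrounding code close to its original shape.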