[jira] [Commented] (SPARK-11960) User guide section for streaming a/b testing
[ https://issues.apache.org/jira/browse/SPARK-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025593#comment-15025593 ]

Feynman Liang commented on SPARK-11960:
---------------------------------------

[~josephkb] happy to work on it, when is the 1.6 QA deadline?

> User guide section for streaming a/b testing
> --------------------------------------------
>
>                 Key: SPARK-11960
>                 URL: https://issues.apache.org/jira/browse/SPARK-11960
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation, MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Feynman Liang
>
> [~fliang] Assigning since you added the feature. Will you have a chance to
> do this soon?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11969) SQL UI does not work with PySpark
[ https://issues.apache.org/jira/browse/SPARK-11969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11969:
------------------------------------

    Assignee: Davies Liu  (was: Apache Spark)

> SQL UI does not work with PySpark
> ---------------------------------
>
>                 Key: SPARK-11969
>                 URL: https://issues.apache.org/jira/browse/SPARK-11969
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>            Priority: Critical
[jira] [Commented] (SPARK-11969) SQL UI does not work with PySpark
[ https://issues.apache.org/jira/browse/SPARK-11969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025732#comment-15025732 ]

Apache Spark commented on SPARK-11969:
--------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9949

> SQL UI does not work with PySpark
> ---------------------------------
>
>                 Key: SPARK-11969
>                 URL: https://issues.apache.org/jira/browse/SPARK-11969
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>            Priority: Critical
[jira] [Assigned] (SPARK-11969) SQL UI does not work with PySpark
[ https://issues.apache.org/jira/browse/SPARK-11969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11969:
------------------------------------

    Assignee: Apache Spark  (was: Davies Liu)

> SQL UI does not work with PySpark
> ---------------------------------
>
>                 Key: SPARK-11969
>                 URL: https://issues.apache.org/jira/browse/SPARK-11969
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Davies Liu
>            Assignee: Apache Spark
>            Priority: Critical
[jira] [Updated] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-11601:
--------------------------------------

    Assignee: Timothy Hunter  (was: Tim Hunter)

> ML 1.6 QA: API: Binary incompatible changes
> -------------------------------------------
>
>                 Key: SPARK-11601
>                 URL: https://issues.apache.org/jira/browse/SPARK-11601
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Timothy Hunter
>
> Generate a list of binary incompatible changes using MiMa and create new
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, ping [~mengxr] for advice since he did it for
> 1.5.
[jira] [Commented] (SPARK-11970) Add missing APIs in Dataset
[ https://issues.apache.org/jira/browse/SPARK-11970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025795#comment-15025795 ]

Xiao Li commented on SPARK-11970:
---------------------------------

Working on it. Thanks!

> Add missing APIs in Dataset
> ---------------------------
>
>                 Key: SPARK-11970
>                 URL: https://issues.apache.org/jira/browse/SPARK-11970
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>
> We should add the following functions to Dataset:
> 1. show
> 2. cache / persist / unpersist
> 3. sample
> 4. join with outer join support
[jira] [Resolved] (SPARK-11805) SpillableIterator should free the in-memory sorter while spilling
[ https://issues.apache.org/jira/browse/SPARK-11805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-11805.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

Issue resolved by pull request 9793
[https://github.com/apache/spark/pull/9793]

> SpillableIterator should free the in-memory sorter while spilling
> -----------------------------------------------------------------
>
>                 Key: SPARK-11805
>                 URL: https://issues.apache.org/jira/browse/SPARK-11805
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>             Fix For: 1.6.0
>
> This array buffer will not be used after spilling, should be freed.
[jira] [Commented] (SPARK-6518) Add example code and user guide for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025717#comment-15025717 ]

Yu Ishikawa commented on SPARK-6518:
------------------------------------

All right. I'll send a PR soon. Thanks!

> Add example code and user guide for bisecting k-means
> -----------------------------------------------------
>
>                 Key: SPARK-6518
>                 URL: https://issues.apache.org/jira/browse/SPARK-6518
>             Project: Spark
>          Issue Type: Documentation
>          Components: MLlib
>            Reporter: Yu Ishikawa
>            Assignee: Yu Ishikawa
[jira] [Updated] (SPARK-9670) Examples: Check for new APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-9670:
-------------------------------------

    Assignee: Timothy Hunter  (was: Tim Hunter)

> Examples: Check for new APIs requiring example code
> ---------------------------------------------------
>
>                 Key: SPARK-9670
>                 URL: https://issues.apache.org/jira/browse/SPARK-9670
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Timothy Hunter
>            Priority: Minor
>
> Audit list of new features added to MLlib, and see which major items are
> missing example code (in the examples folder). We do not need examples for
> everything, only for major items such as new ML algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related
> to") and (b) to this JIRA ("requires").
[jira] [Updated] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-8517:
-------------------------------------

    Assignee: Timothy Hunter  (was: Tim Hunter)

> Improve the organization and style of MLlib's user guide
> --------------------------------------------------------
>
>                 Key: SPARK-8517
>                 URL: https://issues.apache.org/jira/browse/SPARK-8517
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page,
> doesn't have a nice style. We could update it and re-organize the content to
> make it easier to navigate.
[jira] [Commented] (SPARK-11960) User guide section for streaming a/b testing
[ https://issues.apache.org/jira/browse/SPARK-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025750#comment-15025750 ]

Joseph K. Bradley commented on SPARK-11960:
-------------------------------------------

Would you be able to do this by next Monday? I appreciate it!

> User guide section for streaming a/b testing
> --------------------------------------------
>
>                 Key: SPARK-11960
>                 URL: https://issues.apache.org/jira/browse/SPARK-11960
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation, MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Feynman Liang
>
> [~fliang] Assigning since you added the feature. Will you have a chance to
> do this soon?
[jira] [Commented] (SPARK-6518) Add example code and user guide for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025751#comment-15025751 ]

Joseph K. Bradley commented on SPARK-6518:
------------------------------------------

Great, thanks!

> Add example code and user guide for bisecting k-means
> -----------------------------------------------------
>
>                 Key: SPARK-6518
>                 URL: https://issues.apache.org/jira/browse/SPARK-6518
>             Project: Spark
>          Issue Type: Documentation
>          Components: MLlib
>            Reporter: Yu Ishikawa
>            Assignee: Yu Ishikawa
[jira] [Updated] (SPARK-11934) [SQL] Adding joinType into joinWith
[ https://issues.apache.org/jira/browse/SPARK-11934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-11934:
--------------------------------

    Issue Type: Sub-task  (was: Improvement)
        Parent: SPARK-

> [SQL] Adding joinType into joinWith
> -----------------------------------
>
>                 Key: SPARK-11934
>                 URL: https://issues.apache.org/jira/browse/SPARK-11934
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>
> Adding the joinType into the existing joinWith function call in Dataset APIs.
> When using joinWith function, users can specify the following join type:
> `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.
[jira] [Assigned] (SPARK-10621) Audit function names in FunctionRegistry and corresponding method names shown in functions.scala and functions.py
[ https://issues.apache.org/jira/browse/SPARK-10621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10621:
------------------------------------

    Assignee: Apache Spark

> Audit function names in FunctionRegistry and corresponding method names shown
> in functions.scala and functions.py
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-10621
>                 URL: https://issues.apache.org/jira/browse/SPARK-10621
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Apache Spark
>            Priority: Critical
>
> Right now, there are a few places that we are not very consistent.
> * There are a few functions that are registered in {{FunctionRegistry}}, but
> not provided in {{functions.scala}} and {{functions.py}}. Examples are
> {{isnull}} and {{get_json_object}}.
> * There are a few functions that we have different names in FunctionRegistry
> and method name in DataFrame API. {{spark_partition_id}} is an example. In
> FunctionRegistry, it is called {{spark_partition_id}}. But, DataFrame API,
> the method is called {{sparkPartitionId}}.
[jira] [Assigned] (SPARK-10621) Audit function names in FunctionRegistry and corresponding method names shown in functions.scala and functions.py
[ https://issues.apache.org/jira/browse/SPARK-10621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10621:
------------------------------------

    Assignee:     (was: Apache Spark)

> Audit function names in FunctionRegistry and corresponding method names shown
> in functions.scala and functions.py
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-10621
>                 URL: https://issues.apache.org/jira/browse/SPARK-10621
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
>
> Right now, there are a few places that we are not very consistent.
> * There are a few functions that are registered in {{FunctionRegistry}}, but
> not provided in {{functions.scala}} and {{functions.py}}. Examples are
> {{isnull}} and {{get_json_object}}.
> * There are a few functions that we have different names in FunctionRegistry
> and method name in DataFrame API. {{spark_partition_id}} is an example. In
> FunctionRegistry, it is called {{spark_partition_id}}. But, DataFrame API,
> the method is called {{sparkPartitionId}}.
[jira] [Created] (SPARK-11968) ALS recommend all methods spend most of time in GC
Joseph K. Bradley created SPARK-11968:
-------------------------------------

             Summary: ALS recommend all methods spend most of time in GC
                 Key: SPARK-11968
                 URL: https://issues.apache.org/jira/browse/SPARK-11968
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 1.5.2, 1.6.0
            Reporter: Joseph K. Bradley


After adding recommendUsersForProducts and recommendProductsForUsers to ALS
in spark-perf, I noticed that it takes much longer than ALS itself. Looking
at the monitoring page, I can see it is spending about 8min doing GC for each
10min task. That sounds fixable.

Looking at the implementation, there is clearly an opportunity to avoid extra
allocations:
[https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]

CC: [~mengxr]
[jira] [Commented] (SPARK-10621) Audit function names in FunctionRegistry and corresponding method names shown in functions.scala and functions.py
[ https://issues.apache.org/jira/browse/SPARK-10621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025539#comment-15025539 ]

Apache Spark commented on SPARK-10621:
--------------------------------------

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9948

> Audit function names in FunctionRegistry and corresponding method names shown
> in functions.scala and functions.py
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-10621
>                 URL: https://issues.apache.org/jira/browse/SPARK-10621
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
>
> Right now, there are a few places that we are not very consistent.
> * There are a few functions that are registered in {{FunctionRegistry}}, but
> not provided in {{functions.scala}} and {{functions.py}}. Examples are
> {{isnull}} and {{get_json_object}}.
> * There are a few functions that we have different names in FunctionRegistry
> and method name in DataFrame API. {{spark_partition_id}} is an example. In
> FunctionRegistry, it is called {{spark_partition_id}}. But, DataFrame API,
> the method is called {{sparkPartitionId}}.
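As an aside, the snake_case/camelCase mismatch this issue describes is easy to audit mechanically. The sketch below is standalone Python, not Spark code; the `registry` and `methods` sets are hypothetical stand-ins for the names in FunctionRegistry and the DataFrame API.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case registry name to the expected camelCase method name."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)

def audit(registry_names, dataframe_methods):
    """Return registry functions whose expected DataFrame method is missing."""
    return sorted(n for n in registry_names
                  if snake_to_camel(n) not in dataframe_methods)

# Hypothetical sample data mirroring the examples in the issue text.
registry = {"spark_partition_id", "isnull", "get_json_object"}
methods = {"sparkPartitionId"}  # isnull / get_json_object have no counterpart
print(audit(registry, methods))
```

With the sample sets above, `spark_partition_id` maps cleanly to `sparkPartitionId` and is skipped, while the other two names are flagged.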
[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data
[ https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025541#comment-15025541 ]

Joseph K. Bradley commented on SPARK-10802:
-------------------------------------------

Linking related issue: recommend all could be faster.

> Let ALS recommend for subset of data
> ------------------------------------
>
>                 Key: SPARK-10802
>                 URL: https://issues.apache.org/jira/browse/SPARK-10802
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Tomasz Bartczak
>            Priority: Minor
>
> Currently MatrixFactorizationModel allows to get recommendations for
> - single user
> - single product
> - all users
> - all products
> recommendation for all users/products do a cartesian join inside.
> It would be useful in some cases to get recommendations for subset of
> users/products by providing an RDD with which MatrixFactorizationModel could
> do an intersection before doing a cartesian join. This would make it much
> faster in situation where recommendations are needed only for subset of
> users/products, and when the subset is still too large to make it feasible to
> recommend one-by-one.
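The proposal above (intersect with the requested subset before scoring, instead of the full user x product cartesian join) can be sketched in plain Python. The factor vectors and the `recommend_for_subset` helper below are illustrative toys under assumed names, not MatrixFactorizationModel's actual API.

```python
import heapq

def recommend_for_subset(user_factors, product_factors, user_subset, k):
    """Score only the requested users against the products and keep the top k,
    rather than materializing recommendations for every user."""
    recs = {}
    for u in user_subset:
        uf = user_factors[u]
        scored = ((sum(a * b for a, b in zip(uf, pf)), p)
                  for p, pf in product_factors.items())
        recs[u] = [p for _, p in heapq.nlargest(k, scored)]
    return recs

# Hypothetical toy factors (not real ALS output).
users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
products = {"p1": [2.0, 0.0], "p2": [0.0, 3.0]}
print(recommend_for_subset(users, products, ["u1"], 1))
```

In an actual RDD implementation the subset filter would be a join against the user-factor RDD before the scoring step, which is exactly the intersection the reporter asks for.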
[jira] [Commented] (SPARK-9328) Netty IO layer should implement read timeouts
[ https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025711#comment-15025711 ]

Josh Rosen commented on SPARK-9328:
-----------------------------------

Actually, I spoke slightly too soon: there were some timeouts that had to be
lowered in order for the master branch test to pass (my test was originally
created for Spark 1.2.x for a backport). It looks like SPARK-7003 has
addressed this for Spark 1.4.x+, so I'm going to resolve this as fixed in
1.4.0+.

> Netty IO layer should implement read timeouts
> ---------------------------------------------
>
>                 Key: SPARK-9328
>                 URL: https://issues.apache.org/jira/browse/SPARK-9328
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.1, 1.3.1
>            Reporter: Josh Rosen
>            Priority: Blocker
>             Fix For: 1.4.0
>
> Spark's network layer does not implement read timeouts which may lead to
> stalls during shuffle: if a remote shuffle server stalls while responding to
> a shuffle block fetch request but does not close the socket then the job may
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
> The tricky part of working on this will be figuring out the right place to
> add the handler and ensuring that we don't introduce performance issues by
> not re-using sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a
> request - it only cares whether data has been read from the socket. If your
> connection is persistent, and you only want read timeouts to fire when a
> request has been sent, you'll need to build a request / response aware
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may
> have to do something like this.
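The stall described in the issue can be reproduced in miniature with plain sockets rather than Netty: a peer that holds the connection open but never replies blocks the reader forever unless a read timeout is set. This is only an analogy sketch in Python; in a Netty pipeline, `ReadTimeoutHandler` plays the role that `settimeout` plays here.

```python
import socket
import threading

def silent_server(state):
    """Accept one connection and then stall without replying, mimicking a
    shuffle server that hangs but keeps the socket open."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    state["port"] = srv.getsockname()[1]
    state["ready"].set()
    conn, _ = srv.accept()
    state["done"].wait()          # hold the connection open, send nothing
    conn.close()
    srv.close()

state = {"ready": threading.Event(), "done": threading.Event()}
threading.Thread(target=silent_server, args=(state,), daemon=True).start()
state["ready"].wait()

cli = socket.socket()
cli.connect(("127.0.0.1", state["port"]))
cli.settimeout(0.2)               # the read timeout the issue asks for
try:
    cli.recv(1)                   # blocks: the server never responds
    timed_out = False
except socket.timeout:
    timed_out = True              # timeout fires instead of hanging the job
state["done"].set()
cli.close()
print(timed_out)
```

Without the `settimeout` call, `recv` would block until an OS-level timeout, which is exactly the job-stall behavior the report describes.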
[jira] [Assigned] (SPARK-9328) Netty IO layer should implement read timeouts
[ https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen reassigned SPARK-9328:
---------------------------------

    Assignee: Josh Rosen

> Netty IO layer should implement read timeouts
> ---------------------------------------------
>
>                 Key: SPARK-9328
>                 URL: https://issues.apache.org/jira/browse/SPARK-9328
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.1, 1.3.1
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Blocker
>             Fix For: 1.4.0
>
> Spark's network layer does not implement read timeouts which may lead to
> stalls during shuffle: if a remote shuffle server stalls while responding to
> a shuffle block fetch request but does not close the socket then the job may
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
> The tricky part of working on this will be figuring out the right place to
> add the handler and ensuring that we don't introduce performance issues by
> not re-using sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a
> request - it only cares whether data has been read from the socket. If your
> connection is persistent, and you only want read timeouts to fire when a
> request has been sent, you'll need to build a request / response aware
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may
> have to do something like this.
[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results
[ https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025739#comment-15025739 ]

Yin Huai commented on SPARK-11885:
----------------------------------

I tried
{code}
val q1 = sql("""
select store_country, store_region, gm(amount)
from receipts
where amount > 50
and store_country = 'italy'
group by store_country, store_region
""")
{code}
Seems the result is good. Looks like using built-in functions and UDAF
together somehow triggers the problem.

> UDAF may nondeterministically generate wrong results
> ----------------------------------------------------
>
>                 Key: SPARK-11885
>                 URL: https://issues.apache.org/jira/browse/SPARK-11885
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Critical
>
> I could not reproduce it in 1.6 branch (it can be easily reproduced in 1.5).
> I think it is an issue in 1.5 branch.
> Try the following in spark 1.5 (with a cluster) and you can see the problem.
> {code}
> import java.math.BigDecimal
> import org.apache.spark.sql.expressions.MutableAggregationBuffer
> import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType, StructField, DataType,
> DoubleType, LongType}
>
> class GeometricMean extends UserDefinedAggregateFunction {
>   def inputSchema: StructType =
>     StructType(StructField("value", DoubleType) :: Nil)
>   def bufferSchema: StructType = StructType(
>     StructField("count", LongType) ::
>     StructField("product", DoubleType) :: Nil
>   )
>   def dataType: DataType = DoubleType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
>     buffer(0) = 0L
>     buffer(1) = 1.0
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
>     buffer(0) = buffer.getAs[Long](0) + 1
>     buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
>     buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
>     buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
>   }
>   def evaluate(buffer: Row): Any = {
>     math.pow(buffer.getDouble(1), 1.0d / buffer.getLong(0))
>   }
> }
>
> sqlContext.udf.register("gm", new GeometricMean)
>
> val df = Seq(
>   (1, "italy", "emilia", 42, BigDecimal.valueOf(100, 0), "john"),
>   (2, "italy", "toscana", 42, BigDecimal.valueOf(505, 1), "jim"),
>   (3, "italy", "puglia", 42, BigDecimal.valueOf(70, 0), "jenn"),
>   (4, "italy", "emilia", 42, BigDecimal.valueOf(75, 0), "jack"),
>   (5, "uk", "london", 42, BigDecimal.valueOf(200, 0), "carl"),
>   (6, "italy", "emilia", 42, BigDecimal.valueOf(42, 0), "john")).
>   toDF("receipt_id", "store_country", "store_region", "store_id", "amount",
>   "seller_name")
> df.registerTempTable("receipts")
>
> val q = sql("""
> select store_country,
>        store_region,
>        avg(amount),
>        sum(amount),
>        gm(amount)
> from receipts
> where amount > 50
> and store_country = 'italy'
> group by store_country, store_region
> """)
> q.show
> {code}
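The buffer arithmetic of the GeometricMean UDAF quoted above can be checked in isolation. The sketch below is a plain-Python mirror of the initialize/update/merge/evaluate contract (two "partitions" aggregated independently, then merged), not Spark code.

```python
def initialize():
    # (count, product) -- mirrors the UDAF's two-field aggregation buffer
    return (0, 1.0)

def update(buf, value):
    # one input row folded into a partition-local buffer
    return (buf[0] + 1, buf[1] * value)

def merge(b1, b2):
    # combine two partition-local buffers
    return (b1[0] + b2[0], b1[1] * b2[1])

def evaluate(buf):
    # geometric mean: product ** (1 / count)
    return buf[1] ** (1.0 / buf[0])

# Two partitions aggregated independently, then merged -- the code path the
# report says misbehaves in 1.5 when mixed with built-in aggregates.
left = update(update(initialize(), 2.0), 8.0)   # count=2, product=16.0
right = update(initialize(), 4.0)               # count=1, product=4.0
print(evaluate(merge(left, right)))             # geometric mean of 2, 8, 4
```

Since merging is just pairwise addition of counts and multiplication of products, the result should be independent of partitioning; the bug report is about Spark 1.5 violating that expectation, not about the math itself.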
[jira] [Updated] (SPARK-11914) [SQL] Support coalesce and repartition in Dataset APIs
[ https://issues.apache.org/jira/browse/SPARK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-11914:
--------------------------------

    Assignee: Xiao Li

> [SQL] Support coalesce and repartition in Dataset APIs
> ------------------------------------------------------
>
>                 Key: SPARK-11914
>                 URL: https://issues.apache.org/jira/browse/SPARK-11914
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>            Assignee: Xiao Li
>             Fix For: 1.6.0
>
> repartition: Returns a new [[Dataset]] that has exactly `numPartitions`
> partitions.
> coalesce: Returns a new [[Dataset]] that has exactly `numPartitions`
> partitions. Similar to coalesce defined on an [[RDD]], this operation results
> in a narrow dependency, e.g. if you go from 1000 partitions to 100
> partitions, there will not be a shuffle, instead each of the 100 new
> partitions will claim 10 of the current partitions.
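The narrow-dependency coalesce described above (1000 partitions to 100, each new partition claiming 10 old ones, no shuffle) amounts to grouping contiguous blocks of old partition indices. A minimal standalone sketch, not Spark's actual partition-coalescing code:

```python
def coalesce_assignment(num_old, num_new):
    """Assign old partitions to new ones the way a shuffle-free coalesce can:
    each new partition claims a contiguous block of the old ones."""
    per_new, extra = divmod(num_old, num_new)
    groups, start = [], 0
    for i in range(num_new):
        # spread any remainder over the first `extra` new partitions
        size = per_new + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

groups = coalesce_assignment(1000, 100)
print(len(groups), len(groups[0]))  # 100 new partitions, 10 old ones each
```

Because every old partition feeds exactly one new partition, the dependency is narrow and no data needs to move between executors for the regrouping itself.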
[jira] [Comment Edited] (SPARK-9325) Support `collect` on DataFrame columns
[ https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025611#comment-15025611 ]

Hossein Falaki edited comment on SPARK-9325 at 11/24/15 10:50 PM:
------------------------------------------------------------------

To help R users and not open up the API, how about adding head and collect
functions for Column, that just print out a warning that explains how to do
it the "right" way:
{code}
> collect(df$Col)
Warning: DataFrame Column may not be materialized. Please use
collect(select(df, df$Col))
{code}
Right now, users will get a Java exception which is pretty confusing for most
R users.

was (Author: falaki):
To help R users and not open up the API, how about adding head and collect
functions for Column, that just print out a warning that explains how to do
it the "right" way:
{code}
collect(df$Col)
Warning: DataFrame Column may not be materialized. Please use
collect(select(df, df$Col))
{code}

> Support `collect` on DataFrame columns
> --------------------------------------
>
>                 Key: SPARK-9325
>                 URL: https://issues.apache.org/jira/browse/SPARK-9325
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>            Reporter: Shivaram Venkataraman
>
> This is to support code of the form
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.
[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns
[ https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025611#comment-15025611 ]

Hossein Falaki commented on SPARK-9325:
---------------------------------------

To help R users and not open up the API, how about adding head and collect
functions for Column, that just print out a warning that explains how to do
it the "right" way:
{code}
collect(df$Col)
Warning: DataFrame Column may not be materialized. Please use
collect(select(df, df$Col))
{code}

> Support `collect` on DataFrame columns
> --------------------------------------
>
>                 Key: SPARK-9325
>                 URL: https://issues.apache.org/jira/browse/SPARK-9325
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>            Reporter: Shivaram Venkataraman
>
> This is to support code of the form
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.
[jira] [Updated] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-8517:
-------------------------------------

    Assignee: Tim Hunter

> Improve the organization and style of MLlib's user guide
> --------------------------------------------------------
>
>                 Key: SPARK-8517
>                 URL: https://issues.apache.org/jira/browse/SPARK-8517
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Tim Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page,
> doesn't have a nice style. We could update it and re-organize the content to
> make it easier to navigate.
[jira] [Resolved] (SPARK-11783) When deployed against remote Hive metastore, HiveContext.executionHive points to wrong metastore
[ https://issues.apache.org/jira/browse/SPARK-11783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-11783.
------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

Issue resolved by pull request 9895
[https://github.com/apache/spark/pull/9895]

> When deployed against remote Hive metastore, HiveContext.executionHive points
> to wrong metastore
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-11783
>                 URL: https://issues.apache.org/jira/browse/SPARK-11783
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1, 1.6.0, 1.7.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Critical
>             Fix For: 1.6.0
>
> When using remote metastore, execution Hive client somehow is initialized to
> point to the actual remote metastore instead of the dummy local Derby
> metastore.
> To reproduce this issue:
> # Configure {{conf/hive-site.xml}} to point to a remote Hive 1.2.1
> metastore.
> # Set {{hive.metastore.uris}} to {{thrift://localhost:9083}}.
> # Start metastore service using {{$HIVE_HOME/bin/hive --service metastore}}
> # Start Thrift server with remote debugging options
> # Attach the debugger to the Thrift server driver process, we can verify that
> {{executionHive}} points to the remote metastore rather than the local
> execution Derby metastore.
[jira] [Commented] (SPARK-11953) CLONE - Sparksql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append Bug
[ https://issues.apache.org/jira/browse/SPARK-11953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025538#comment-15025538 ] Siva Gudavalli commented on SPARK-11953: I agree. It depends on how we define SaveMode.Append. Looking for an option similar to InsertIntoJdbc in 1.4.1 > CLONE - Sparksql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append Bug > > > Key: SPARK-11953 > URL: https://issues.apache.org/jira/browse/SPARK-11953 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Submit, SQL >Affects Versions: 1.4.1, 1.5.1 > Environment: Spark stand alone cluster >Reporter: Siva Gudavalli > > In Spark 1.3.1 we have two methods, CreateJdbcTable and InsertIntoJdbc. > They are replaced with write.jdbc() in Spark 1.4.1 > When we specify SaveMode.Append we are letting the application know that there is > a table in the database, which means "tableExists = true", and we do not need > to perform "JdbcUtils.tableExists(conn, table)". > Please let me know if you think differently. > Regards > Shiv > {code}
> def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
>   val conn = JdbcUtils.createConnection(url, connectionProperties)
>   try {
>     var tableExists = JdbcUtils.tableExists(conn, table)
>     if (mode == SaveMode.Ignore && tableExists) {
>       return
>     }
>     if (mode == SaveMode.ErrorIfExists && tableExists) {
>       sys.error(s"Table $table already exists.")
>     }
>     if (mode == SaveMode.Overwrite && tableExists) {
>       JdbcUtils.dropTable(conn, table)
>       tableExists = false
>     }
>     // Create the table if the table didn't exist.
>     if (!tableExists) {
>       val schema = JDBCWriteDetails.schemaString(df, url)
>       val sql = s"CREATE TABLE $table ($schema)"
>       conn.prepareStatement(sql).executeUpdate()
>     }
>   } finally {
>     conn.close()
>   }
>   JDBCWriteDetails.saveTable(df, url, table, connectionProperties)
> }
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
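The mode handling quoted above reduces to a small decision table. The sketch below is a self-contained, simplified model (the `Mode`/`Action` types and `decide` are hypothetical names for illustration, not Spark's actual internals); under the proposal in this issue, SaveMode.Append would let the writer assume `tableExists = true` and skip the `JdbcUtils.tableExists` probe entirely.

```scala
// Simplified model of the SaveMode handling in the quoted jdbc() method.
// All names here are hypothetical; the real logic lives in Spark's JDBC writer.
sealed trait Mode
case object Append extends Mode
case object Overwrite extends Mode
case object Ignore extends Mode
case object ErrorIfExists extends Mode

sealed trait Action
case object Skip extends Action // Ignore + table exists: no-op
case object Fail extends Action // ErrorIfExists + table exists: raise an error
case class Write(createTable: Boolean, dropFirst: Boolean) extends Action

def decide(mode: Mode, tableExists: Boolean): Action = mode match {
  case Ignore if tableExists        => Skip
  case ErrorIfExists if tableExists => Fail
  case Overwrite if tableExists     => Write(createTable = true, dropFirst = true)
  case _                            => Write(createTable = !tableExists, dropFirst = false)
}
```

In this model, `decide(Append, tableExists = true)` yields `Write(createTable = false, dropFirst = false)`: the existence probe only influences the `createTable` flag, which is what motivates skipping it when Append is specified.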
[jira] [Created] (SPARK-11970) Add missing APIs in Dataset
Reynold Xin created SPARK-11970: --- Summary: Add missing APIs in Dataset Key: SPARK-11970 URL: https://issues.apache.org/jira/browse/SPARK-11970 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11970) Add missing APIs in Dataset
[ https://issues.apache.org/jira/browse/SPARK-11970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11970: Description: We should add the following functions to Dataset: 1. show 2. cache / persist / unpersist 3. sample 4. join with outer join support > Add missing APIs in Dataset > --- > > Key: SPARK-11970 > URL: https://issues.apache.org/jira/browse/SPARK-11970 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > We should add the following functions to Dataset: > 1. show > 2. cache / persist / unpersist > 3. sample > 4. join with outer join support -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
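The semantics of the proposed additions can be illustrated with a toy in-memory stand-in. `LocalDataset` below is hypothetical (plain Scala, not Spark's Dataset); it only sketches what `sample` and a join with outer-join support would mean:

```scala
import scala.util.Random

// Toy, in-memory stand-in (hypothetical; not Spark's Dataset) sketching the
// semantics of the proposed `sample` and an outer-join-capable join.
final case class LocalDataset[T](rows: Seq[T]) {
  def show(numRows: Int = 20): Unit = rows.take(numRows).foreach(println)

  // Bernoulli sampling without replacement.
  def sample(fraction: Double, seed: Long): LocalDataset[T] = {
    val rng = new Random(seed)
    LocalDataset(rows.filter(_ => rng.nextDouble() < fraction))
  }

  // Full outer join on key functions: unmatched rows surface as None.
  def fullOuterJoin[U, K](other: LocalDataset[U])(leftKey: T => K, rightKey: U => K): LocalDataset[(Option[T], Option[U])] = {
    val left  = rows.groupBy(leftKey)
    val right = other.rows.groupBy(rightKey)
    val joined: Seq[(Option[T], Option[U])] = (left.keySet ++ right.keySet).toSeq.flatMap { k =>
      val ls = left.getOrElse(k, Seq.empty).map(Option(_))
      val rs = right.getOrElse(k, Seq.empty).map(Option(_))
      if (ls.isEmpty) rs.map((None, _))
      else if (rs.isEmpty) ls.map((_, None))
      else for (l <- ls; r <- rs) yield (l, r)
    }
    LocalDataset(joined)
  }
}
```

For example, outer-joining `LocalDataset(Seq(1, 2))` with `LocalDataset(Seq(2, 3))` on the identity key keeps the unmatched 1 and 3 with a `None` partner, which is the behavior the issue asks the real Dataset join to support.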
[jira] [Updated] (SPARK-9328) Netty IO layer should implement read timeouts
[ https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9328: -- Affects Version/s: (was: 1.4.1) (was: 1.5.0) > Netty IO layer should implement read timeouts > - > > Key: SPARK-9328 > URL: https://issues.apache.org/jira/browse/SPARK-9328 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.1, 1.3.1 >Reporter: Josh Rosen >Priority: Blocker > Fix For: 1.4.0 > > > Spark's network layer does not implement read timeouts which may lead to > stalls during shuffle: if a remote shuffle server stalls while responding to > a shuffle block fetch request but does not close the socket then the job may > block until an OS-level socket timeout occurs. > I think that we can fix this using Netty's ReadTimeoutHandler > (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler). > The tricky part of working on this will be figuring out the right place to > add the handler and ensuring that we don't introduce performance issues by > not re-using sockets. > Quoting from that linked StackOverflow question: > {quote} > Note that the ReadTimeoutHandler is also unaware of whether you have sent a > request - it only cares whether data has been read from the socket. If your > connection is persistent, and you only want read timeouts to fire when a > request has been sent, you'll need to build a request / response aware > timeout handler. > {quote} > If we want to avoid tearing down connections between shuffles then we may > have to do something like this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
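The request-aware timeout handler described in the quoted StackOverflow answer can be sketched independently of Netty. The class below is a hypothetical, clock-driven model (not Netty's actual `ReadTimeoutHandler`): it only flags a stall while a request is outstanding, so idle persistent connections are not torn down between shuffles.

```scala
// Hypothetical sketch of a request-aware read-timeout tracker (plain Scala,
// not Netty). Timestamps are passed in explicitly to keep it deterministic.
final class ReadTimeoutTracker(timeoutMillis: Long) {
  private var lastReadAt: Long = 0L
  private var requestInFlight: Boolean = false

  def onRequestSent(now: Long): Unit = { requestInFlight = true; lastReadAt = now }
  def onBytesRead(now: Long): Unit = { lastReadAt = now }
  def onResponseComplete(): Unit = { requestInFlight = false }

  // Stalled only if a request is outstanding and no bytes arrived within the timeout.
  def isStalled(now: Long): Boolean =
    requestInFlight && (now - lastReadAt > timeoutMillis)
}
```

A plain `ReadTimeoutHandler` would drop the `requestInFlight` check and fire on any read silence, which is exactly the behavior the quote warns about for persistent connections.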
[jira] [Resolved] (SPARK-9328) Netty IO layer should implement read timeouts
[ https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-9328. --- Resolution: Fixed Fix Version/s: 1.4.0 > Netty IO layer should implement read timeouts > - > > Key: SPARK-9328 > URL: https://issues.apache.org/jira/browse/SPARK-9328 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.1, 1.3.1 >Reporter: Josh Rosen >Priority: Blocker > Fix For: 1.4.0 > > > Spark's network layer does not implement read timeouts which may lead to > stalls during shuffle: if a remote shuffle server stalls while responding to > a shuffle block fetch request but does not close the socket then the job may > block until an OS-level socket timeout occurs. > I think that we can fix this using Netty's ReadTimeoutHandler > (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler). > The tricky part of working on this will be figuring out the right place to > add the handler and ensuring that we don't introduce performance issues by > not re-using sockets. > Quoting from that linked StackOverflow question: > {quote} > Note that the ReadTimeoutHandler is also unaware of whether you have sent a > request - it only cares whether data has been read from the socket. If your > connection is persistent, and you only want read timeouts to fire when a > request has been sent, you'll need to build a request / response aware > timeout handler. > {quote} > If we want to avoid tearing down connections between shuffles then we may > have to do something like this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9328) Netty IO layer should implement read timeouts
[ https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9328: -- Target Version/s: (was: 1.6.0) > Netty IO layer should implement read timeouts > - > > Key: SPARK-9328 > URL: https://issues.apache.org/jira/browse/SPARK-9328 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.1, 1.3.1 >Reporter: Josh Rosen >Priority: Blocker > Fix For: 1.4.0 > > > Spark's network layer does not implement read timeouts which may lead to > stalls during shuffle: if a remote shuffle server stalls while responding to > a shuffle block fetch request but does not close the socket then the job may > block until an OS-level socket timeout occurs. > I think that we can fix this using Netty's ReadTimeoutHandler > (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler). > The tricky part of working on this will be figuring out the right place to > add the handler and ensuring that we don't introduce performance issues by > not re-using sockets. > Quoting from that linked StackOverflow question: > {quote} > Note that the ReadTimeoutHandler is also unaware of whether you have sent a > request - it only cares whether data has been read from the socket. If your > connection is persistent, and you only want read timeouts to fire when a > request has been sent, you'll need to build a request / response aware > timeout handler. > {quote} > If we want to avoid tearing down connections between shuffles then we may > have to do something like this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11969) SQL UI does not work with PySpark
Davies Liu created SPARK-11969: -- Summary: SQL UI does not work with PySpark Key: SPARK-11969 URL: https://issues.apache.org/jira/browse/SPARK-11969 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11966) Spark API for UDTFs
Jaka Jancar created SPARK-11966: --- Summary: Spark API for UDTFs Key: SPARK-11966 URL: https://issues.apache.org/jira/browse/SPARK-11966 Project: Spark Issue Type: New Feature Reporter: Jaka Jancar Defining UDFs is easy using sqlContext.udf.register, but not table-generating functions. For those you still have to use these horrendous Hive interfaces: https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11966) Spark API for UDTFs
[ https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jaka Jancar updated SPARK-11966: Priority: Minor (was: Major) > Spark API for UDTFs > --- > > Key: SPARK-11966 > URL: https://issues.apache.org/jira/browse/SPARK-11966 > Project: Spark > Issue Type: New Feature >Reporter: Jaka Jancar >Priority: Minor > > Defining UDFs is easy using sqlContext.udf.register, but not table-generating > functions. For those you still have to use these horrendous Hive interfaces: > https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11967) Use varargs for multiple paths in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11967: Assignee: Reynold Xin (was: Apache Spark) > Use varargs for multiple paths in DataFrameReader > - > > Key: SPARK-11967 > URL: https://issues.apache.org/jira/browse/SPARK-11967 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11967) Use varargs for multiple paths in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025416#comment-15025416 ] Reynold Xin commented on SPARK-11967: - It was consistency with only one function :) varargs is much easier to use here. > Use varargs for multiple paths in DataFrameReader > - > > Key: SPARK-11967 > URL: https://issues.apache.org/jira/browse/SPARK-11967 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"
[ https://issues.apache.org/jira/browse/SPARK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4944: --- Assignee: Apache Spark > Table Not Found exception in "Create Table Like registered RDD table" > - > > Key: SPARK-4944 > URL: https://issues.apache.org/jira/browse/SPARK-4944 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Assignee: Apache Spark > > {code} > rdd_table.saveAsParquetFile("/user/spark/my_data.parquet") > hiveContext.registerRDDAsTable(rdd_table, "rdd_table") > hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION > '/user/spark/my_data.parquet'") > {code} > {noformat} > org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution > Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not > found rdd_table > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284) > at > org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) > at > org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) > at > org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11140) Replace file server in driver with RPC-based alternative
[ https://issues.apache.org/jira/browse/SPARK-11140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025524#comment-15025524 ] Apache Spark commented on SPARK-11140: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/9947 > Replace file server in driver with RPC-based alternative > > > Key: SPARK-11140 > URL: https://issues.apache.org/jira/browse/SPARK-11140 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.7.0 > > > As part of making configuring encryption easy in Spark, it would be better to > use the existing RPC channel between driver and executors to transfer files > and jars added to the application. > This would remove the need to start the HTTP server currently used for that > purpose, which needs to be configured to use SSL if encryption is wanted. SSL > is kinda hard to configure correctly in a multi-user, distributed environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11872) Prevent the call to SparkContext#stop() in the listener bus's thread
[ https://issues.apache.org/jira/browse/SPARK-11872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-11872. -- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Prevent the call to SparkContext#stop() in the listener bus's thread > > > Key: SPARK-11872 > URL: https://issues.apache.org/jira/browse/SPARK-11872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Ted Yu > Fix For: 1.6.0 > > > This is continuation of SPARK-11761 > Andrew suggested adding this protection. See tail of PR #9741 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10911) Executors should System.exit on clean shutdown
[ https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-10911: -- Assignee: Zhuo Liu > Executors should System.exit on clean shutdown > -- > > Key: SPARK-10911 > URL: https://issues.apache.org/jira/browse/SPARK-10911 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Zhuo Liu >Priority: Minor > > Executors should call System.exit on clean shutdown to make sure all user > threads exit and jvm shuts down. > We ran into a case where an Executor was left around for days trying to > shutdown because the user code was using a non-daemon thread pool and one of > those threads wasn't exiting. We should force the jvm to go away with > System.exit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10911) Executors should System.exit on clean shutdown
[ https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10911: Assignee: Apache Spark (was: Zhuo Liu) > Executors should System.exit on clean shutdown > -- > > Key: SPARK-10911 > URL: https://issues.apache.org/jira/browse/SPARK-10911 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Apache Spark >Priority: Minor > > Executors should call System.exit on clean shutdown to make sure all user > threads exit and jvm shuts down. > We ran into a case where an Executor was left around for days trying to > shutdown because the user code was using a non-daemon thread pool and one of > those threads wasn't exiting. We should force the jvm to go away with > System.exit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10911) Executors should System.exit on clean shutdown
[ https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10911: Assignee: Zhuo Liu (was: Apache Spark) > Executors should System.exit on clean shutdown > -- > > Key: SPARK-10911 > URL: https://issues.apache.org/jira/browse/SPARK-10911 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Zhuo Liu >Priority: Minor > > Executors should call System.exit on clean shutdown to make sure all user > threads exit and jvm shuts down. > We ran into a case where an Executor was left around for days trying to > shutdown because the user code was using a non-daemon thread pool and one of > those threads wasn't exiting. We should force the jvm to go away with > System.exit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10911) Executors should System.exit on clean shutdown
[ https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025441#comment-15025441 ] Apache Spark commented on SPARK-10911: -- User 'zhuoliu' has created a pull request for this issue: https://github.com/apache/spark/pull/9946 > Executors should System.exit on clean shutdown > -- > > Key: SPARK-10911 > URL: https://issues.apache.org/jira/browse/SPARK-10911 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Zhuo Liu >Priority: Minor > > Executors should call System.exit on clean shutdown to make sure all user > threads exit and jvm shuts down. > We ran into a case where an Executor was left around for days trying to > shutdown because the user code was using a non-daemon thread pool and one of > those threads wasn't exiting. We should force the jvm to go away with > System.exit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11872) Prevent the call to SparkContext#stop() in the listener bus's thread
[ https://issues.apache.org/jira/browse/SPARK-11872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-11872: - Assignee: Ted Yu (was: Shixiong Zhu) > Prevent the call to SparkContext#stop() in the listener bus's thread > > > Key: SPARK-11872 > URL: https://issues.apache.org/jira/browse/SPARK-11872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Ted Yu >Assignee: Ted Yu > Fix For: 1.6.0 > > > This is continuation of SPARK-11761 > Andrew suggested adding this protection. See tail of PR #9741 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11872) Prevent the call to SparkContext#stop() in the listener bus's thread
[ https://issues.apache.org/jira/browse/SPARK-11872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-11872: Assignee: Shixiong Zhu > Prevent the call to SparkContext#stop() in the listener bus's thread > > > Key: SPARK-11872 > URL: https://issues.apache.org/jira/browse/SPARK-11872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Ted Yu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > > This is continuation of SPARK-11761 > Andrew suggested adding this protection. See tail of PR #9741 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11946) Audit pivot API for 1.6
[ https://issues.apache.org/jira/browse/SPARK-11946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11946. - Resolution: Fixed Fix Version/s: 1.6.0 > Audit pivot API for 1.6 > --- > > Key: SPARK-11946 > URL: https://issues.apache.org/jira/browse/SPARK-11946 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.6.0 > > > Currently pivot's signature looks like > {code} > @scala.annotation.varargs > def pivot(pivotColumn: Column, values: Column*): GroupedData > @scala.annotation.varargs > def pivot(pivotColumn: String, values: Any*): GroupedData > {code} > I think we can remove the one that takes "Column" types, since callers should > always be passing in literals. It'd also be more clear if the values are not > varargs, but rather Seq or java.util.List. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11967) Use varargs for multiple paths in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11967: Assignee: Apache Spark (was: Reynold Xin) > Use varargs for multiple paths in DataFrameReader > - > > Key: SPARK-11967 > URL: https://issues.apache.org/jira/browse/SPARK-11967 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11967) Use varargs for multiple paths in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025373#comment-15025373 ] Apache Spark commented on SPARK-11967: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/9945 > Use varargs for multiple paths in DataFrameReader > - > > Key: SPARK-11967 > URL: https://issues.apache.org/jira/browse/SPARK-11967 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11929) spark-shell log level customization is lost if user provides a log4j.properties file
[ https://issues.apache.org/jira/browse/SPARK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-11929. -- Resolution: Fixed Fix Version/s: 1.7.0 Issue resolved by pull request 9816 [https://github.com/apache/spark/pull/9816] > spark-shell log level customization is lost if user provides a > log4j.properties file > > > Key: SPARK-11929 > URL: https://issues.apache.org/jira/browse/SPARK-11929 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Shell >Reporter: Marcelo Vanzin >Priority: Minor > Fix For: 1.7.0 > > > {{Logging.scala}} has code that defines the default log level for the > spark-shell to WARN, to avoid lots of noise in the output. > But if a user provides a log4j.properties file in the Spark configuration, > that customization is lost. That means that without a log4j.properties, there > are two different configurations (one for regular apps, one for the shell). > But if you have a custom file, you lose the ability to easily differentiate > between those two, and you're stuck with a single config for both. > It would be nice to allow different configurations also in the second case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11967) Use varargs for multiple paths in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025421#comment-15025421 ] koert kuipers commented on SPARK-11967: --- i found the comment in my pullreq: * calling the function load(paths: Array[String]) would be more consistent with the rest of the reader API. This precludes using varargs, but that is probably not the most common use of this function. anyhow, i like varargs > Use varargs for multiple paths in DataFrameReader > - > > Key: SPARK-11967 > URL: https://issues.apache.org/jira/browse/SPARK-11967 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
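The tradeoff discussed in this thread (call-site ergonomics of varargs versus an explicit `Array[String]` parameter) is easy to see in isolation. `ReaderSketch.load` below is a hypothetical stand-in, not the real `DataFrameReader.load`:

```scala
// Hypothetical stand-in for DataFrameReader.load, showing both call styles.
object ReaderSketch {
  // @varargs also generates an array-taking overload for Java callers.
  @scala.annotation.varargs
  def load(paths: String*): Seq[String] = paths.toVector
}

val direct   = ReaderSketch.load("a.json", "b.json")     // lightweight varargs call
val existing = Seq("a.json", "b.json")
val expanded = ReaderSketch.load(existing: _*)           // expanding an existing collection
```

A Scala varargs parameter still accepts an existing collection via `: _*`, so the `Array[String]` signature buys consistency at the cost of the lighter call site, which is the point made in the comments above.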
[jira] [Updated] (SPARK-11926) unify GetStructField and GetInternalRowField
[ https://issues.apache.org/jira/browse/SPARK-11926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11926: Assignee: Wenchen Fan > unify GetStructField and GetInternalRowField > > > Key: SPARK-11926 > URL: https://issues.apache.org/jira/browse/SPARK-11926 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11967) Use varargs for multiple paths in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025424#comment-15025424 ] koert kuipers commented on SPARK-11967: --- agreed that varargs is easier. thanks > Use varargs for multiple paths in DataFrameReader > - > > Key: SPARK-11967 > URL: https://issues.apache.org/jira/browse/SPARK-11967 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"
[ https://issues.apache.org/jira/browse/SPARK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4944: --- Assignee: (was: Apache Spark) > Table Not Found exception in "Create Table Like registered RDD table" > - > > Key: SPARK-4944 > URL: https://issues.apache.org/jira/browse/SPARK-4944 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao > > {code} > rdd_table.saveAsParquetFile("/user/spark/my_data.parquet") > hiveContext.registerRDDAsTable(rdd_table, "rdd_table") > hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION > '/user/spark/my_data.parquet'") > {code} > {noformat} > org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution > Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not > found rdd_table > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284) > at > org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) > at > org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) > at > org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"
[ https://issues.apache.org/jira/browse/SPARK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025194#comment-15025194 ] Apache Spark commented on SPARK-4944: - User 'dereksabryfb' has created a pull request for this issue: https://github.com/apache/spark/pull/9944 > Table Not Found exception in "Create Table Like registered RDD table" > - > > Key: SPARK-4944 > URL: https://issues.apache.org/jira/browse/SPARK-4944 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao > > {code} > rdd_table.saveAsParquetFile("/user/spark/my_data.parquet") > hiveContext.registerRDDAsTable(rdd_table, "rdd_table") > hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION > '/user/spark/my_data.parquet'") > {code} > {noformat} > org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution > Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not > found rdd_table > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284) > at > org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) > at > org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) > at > org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11967) Use varargs for multiple paths in DataFrameReader
Reynold Xin created SPARK-11967: --- Summary: Use varargs for multiple paths in DataFrameReader Key: SPARK-11967 URL: https://issues.apache.org/jira/browse/SPARK-11967 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11967) Use varargs for multiple paths in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025413#comment-15025413 ] koert kuipers commented on SPARK-11967: --- i think i had varargs originally, and then someone asked to change it to Array for API consistency? > Use varargs for multiple paths in DataFrameReader > - > > Key: SPARK-11967 > URL: https://issues.apache.org/jira/browse/SPARK-11967 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11929) spark-shell log level customization is lost if user provides a log4j.properties file
[ https://issues.apache.org/jira/browse/SPARK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-11929: - Assignee: Marcelo Vanzin > spark-shell log level customization is lost if user provides a > log4j.properties file > > > Key: SPARK-11929 > URL: https://issues.apache.org/jira/browse/SPARK-11929 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Shell >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 1.7.0 > > > {{Logging.scala}} has code that defines the default log level for the > spark-shell to WARN, to avoid lots of noise in the output. > But if a user provides a log4j.properties file in the Spark configuration, > that customization is lost. That means that without a log4j.properties, there > are two different configurations (one for regular apps, one for the shell). > But if you have a custom file, you lose the ability to easily differentiate > between those two, and you're stuck with a single config for both. > It would be nice to allow different configurations in the second case as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
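For reference, a minimal sketch of the kind of log4j.properties a user might supply, with a separate level for the shell's REPL logger. This is only an illustration of how the two cases could be differentiated; the issue does not prescribe this mechanism, and the REPL logger name shown is an assumption:

{code}
# Root level for regular applications
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Quieter level for the interactive shell -- the customization that is
# currently lost as soon as a custom log4j.properties file is supplied
log4j.logger.org.apache.spark.repl.Main=WARN
{code}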
[jira] [Created] (SPARK-11982) Improve performance of CartesianProduct
Davies Liu created SPARK-11982: -- Summary: Improve performance of CartesianProduct Key: SPARK-11982 URL: https://issues.apache.org/jira/browse/SPARK-11982 Project: Spark Issue Type: Improvement Reporter: Davies Liu RDD.cartesian() is very slow; we should improve it or create a specialized version for SQL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
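For context, a cartesian product of datasets with n and m elements always yields n * m pairs, so the cost is inherently quadratic; what a specialized operator can improve is how often each side is recomputed or re-read. A toy, Spark-free Python sketch of that block-at-a-time idea (the block_cartesian helper is purely illustrative, not a Spark API):

```python
from itertools import product

# Two small "datasets": any cartesian product of them has
# len(left) * len(right) pairs, no matter how it is produced.
left = list(range(4))
right = ["a", "b", "c"]

naive = list(product(left, right))
assert len(naive) == len(left) * len(right)

# Block variant: materialize one side in chunks and scan the other side
# once per chunk instead of once per element, mimicking how a specialized
# operator could reuse a cached partition rather than recompute it.
def block_cartesian(left, right, block=2):
    for i in range(0, len(left), block):
        chunk = left[i:i + block]
        for r in right:          # right side scanned once per block
            for l in chunk:
                yield (l, r)

# Same pairs, fewer scans of the right side.
assert sorted(block_cartesian(left, right)) == sorted(naive)
```

The pair count is identical either way; the win comes from scanning the non-blocked side len(left)/block times instead of len(left) times.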
[jira] [Issue Comment Deleted] (SPARK-9141) DataFrame recomputed instead of using cached parent.
[ https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-9141: --- Comment: was deleted (was: When u run these codes, you can find the UDF was called 20 times, not the expected 10 times. ) > DataFrame recomputed instead of using cached parent. > > > Key: SPARK-9141 > URL: https://issues.apache.org/jira/browse/SPARK-9141 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1 >Reporter: Nick Pritchard >Assignee: Michael Armbrust >Priority: Blocker > Labels: cache, dataframe > Fix For: 1.5.0 > > > As I understand, DataFrame.cache() is supposed to work the same as > RDD.cache(), so that repeated operations on it will use the cached results > and not recompute the entire lineage. However, it seems that some DataFrame > operations (e.g. withColumn) change the underlying RDD lineage so that cache > doesn't work as expected. > Below is a Scala example that demonstrates this. First, I define two UDF's > that use println so that it is easy to see when they are being called. Next, > I create a simple data frame with one row and two columns. Next, I add a > column, cache it, and call count() to force the computation. Lastly, I add > another column, cache it, and call count(). > I would have expected the last statement to only compute the last column, > since everything else was cached. However, because withColumn() changes the > lineage, the whole data frame is recomputed. 
> {code} > // Examples udf's that println when called > val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 } > val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 } > // Initial dataset > val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value") > // Add column by applying twice udf > val df2 = df1.withColumn("twice", twice($"value")) > df2.cache() > df2.count() //prints Computed: twice(1) > // Add column by applying triple udf > val df3 = df2.withColumn("triple", triple($"value")) > df3.cache() > df3.count() //prints Computed: twice(1)\nComputed: triple(1) > {code} > I found a workaround, which helped me understand what was going on behind the > scenes, but doesn't seem like an ideal solution. Basically, I convert to RDD > then back DataFrame, which seems to freeze the lineage. The code below shows > the workaround for creating the second data frame so cache will work as > expected. > {code} > val df2 = { > val tmp = df1.withColumn("twice", twice($"value")) > sqlContext.createDataFrame(tmp.rdd, tmp.schema) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9141) DataFrame recomputed instead of using cached parent.
[ https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026105#comment-15026105 ] Yi Tian edited comment on SPARK-9141 at 11/25/15 7:00 AM: -- [~marmbrus] Here is my codes: {code} val rdd = sc.parallelize(1 to 10).map{line => new GenericRow(Array[Any]("a","b")).asInstanceOf[Row]} val df = hc.createDataFrame(rdd, StructType(Seq(StructField("a",StringType),StructField("b",StringType val mkArrayUDF = org.apache.spark.sql.functions.udf[Array[String],String,String] ((s1: String, s2: String) => { println("udf called") Array[String](s1, s2) }) val df2 = df.withColumn("arr",mkArrayUDF(df("a"),df("b"))) val df3 = df2.withColumn("e0", df2("arr")(0)).withColumn("e1", df2("arr")(1)) df3.collect().foreach(println) {code} was (Author: tianyi): [~marmbrus] Here is my codes: {code:scala} val rdd = sc.parallelize(1 to 10).map{line => new GenericRow(Array[Any]("a","b")).asInstanceOf[Row]} val df = hc.createDataFrame(rdd, StructType(Seq(StructField("a",StringType),StructField("b",StringType val mkArrayUDF = org.apache.spark.sql.functions.udf[Array[String],String,String] ((s1: String, s2: String) => { println("udf called") Array[String](s1, s2) }) val df2 = df.withColumn("arr",mkArrayUDF(df("a"),df("b"))) val df3 = df2.withColumn("e0", df2("arr")(0)).withColumn("e1", df2("arr")(1)) df3.collect().foreach(println) {code} > DataFrame recomputed instead of using cached parent. > > > Key: SPARK-9141 > URL: https://issues.apache.org/jira/browse/SPARK-9141 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1 >Reporter: Nick Pritchard >Assignee: Michael Armbrust >Priority: Blocker > Labels: cache, dataframe > Fix For: 1.5.0 > > > As I understand, DataFrame.cache() is supposed to work the same as > RDD.cache(), so that repeated operations on it will use the cached results > and not recompute the entire lineage. However, it seems that some DataFrame > operations (e.g. 
withColumn) change the underlying RDD lineage so that cache > doesn't work as expected. > Below is a Scala example that demonstrates this. First, I define two UDF's > that use println so that it is easy to see when they are being called. Next, > I create a simple data frame with one row and two columns. Next, I add a > column, cache it, and call count() to force the computation. Lastly, I add > another column, cache it, and call count(). > I would have expected the last statement to only compute the last column, > since everything else was cached. However, because withColumn() changes the > lineage, the whole data frame is recomputed. > {code} > // Examples udf's that println when called > val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 } > val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 } > // Initial dataset > val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value") > // Add column by applying twice udf > val df2 = df1.withColumn("twice", twice($"value")) > df2.cache() > df2.count() //prints Computed: twice(1) > // Add column by applying triple udf > val df3 = df2.withColumn("triple", triple($"value")) > df3.cache() > df3.count() //prints Computed: twice(1)\nComputed: triple(1) > {code} > I found a workaround, which helped me understand what was going on behind the > scenes, but doesn't seem like an ideal solution. Basically, I convert to RDD > then back DataFrame, which seems to freeze the lineage. The code below shows > the workaround for creating the second data frame so cache will work as > expected. > {code} > val df2 = { > val tmp = df1.withColumn("twice", twice($"value")) > sqlContext.createDataFrame(tmp.rdd, tmp.schema) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11961) User guide section for ChiSqSelector transformer
[ https://issues.apache.org/jira/browse/SPARK-11961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11961: Assignee: Xusen Yin (was: Apache Spark) > User guide section for ChiSqSelector transformer > > > Key: SPARK-11961 > URL: https://issues.apache.org/jira/browse/SPARK-11961 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > > [~yinxusen] Assigning this to you since you added the feature. Will you have > time to add a section? Thank you! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11961) User guide section for ChiSqSelector transformer
[ https://issues.apache.org/jira/browse/SPARK-11961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11961: Assignee: Apache Spark (was: Xusen Yin) > User guide section for ChiSqSelector transformer > > > Key: SPARK-11961 > URL: https://issues.apache.org/jira/browse/SPARK-11961 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > [~yinxusen] Assigning this to you since you added the feature. Will you have > time to add a section? Thank you! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11961) User guide section for ChiSqSelector transformer
[ https://issues.apache.org/jira/browse/SPARK-11961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026329#comment-15026329 ] Apache Spark commented on SPARK-11961: -- User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/9965 > User guide section for ChiSqSelector transformer > > > Key: SPARK-11961 > URL: https://issues.apache.org/jira/browse/SPARK-11961 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > > [~yinxusen] Assigning this to you since you added the feature. Will you have > time to add a section? Thank you! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11979) Empty TrackStateRDD cannot be checkpointed and recovered from checkpoint file
[ https://issues.apache.org/jira/browse/SPARK-11979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-11979. -- Resolution: Fixed Fix Version/s: 1.6.0 > Empty TrackStateRDD cannot be checkpointed and recovered from checkpoint file > -- > > Key: SPARK-11979 > URL: https://issues.apache.org/jira/browse/SPARK-11979 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > Fix For: 1.6.0 > > > {code} > Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 6.0 (TID 20, localhost): > java.lang.IllegalArgumentException: requirement failed: Invalid initial > capacity > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.streaming.util.OpenHashMapBasedStateMap.<init>(StateMap.scala:96) > at > org.apache.spark.streaming.util.OpenHashMapBasedStateMap.<init>(StateMap.scala:86) > at > org.apache.spark.streaming.util.OpenHashMapBasedStateMap.readObject(StateMap.scala:291) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:181) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:921) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:921) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Driver stacktrace: > 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 > (TID 20, localhost): java.lang.IllegalArgumentException: requirement failed: > Invalid initial capacity > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.streaming.util.OpenHashMapBasedStateMap.<init>(StateMap.scala:96) > at > org.apache.spark.streaming.util.OpenHashMapBasedStateMap.<init>(StateMap.scala:86) > at >
[jira] [Comment Edited] (SPARK-9141) DataFrame recomputed instead of using cached parent.
[ https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026105#comment-15026105 ] Yi Tian edited comment on SPARK-9141 at 11/25/15 7:24 AM: -- [~marmbrus] Sorry, it's our fault. was (Author: tianyi): [~marmbrus] Here is my codes: {code} val rdd = sc.parallelize(1 to 10).map{line => new GenericRow(Array[Any]("a","b")).asInstanceOf[Row]} val df = hc.createDataFrame(rdd, StructType(Seq(StructField("a",StringType),StructField("b",StringType val mkArrayUDF = org.apache.spark.sql.functions.udf[Array[String],String,String] ((s1: String, s2: String) => { println("udf called") Array[String](s1, s2) }) val df2 = df.withColumn("arr",mkArrayUDF(df("a"),df("b"))) val df3 = df2.withColumn("e0", df2("arr")(0)).withColumn("e1", df2("arr")(1)) df3.collect().foreach(println) {code} > DataFrame recomputed instead of using cached parent. > > > Key: SPARK-9141 > URL: https://issues.apache.org/jira/browse/SPARK-9141 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1 >Reporter: Nick Pritchard >Assignee: Michael Armbrust >Priority: Blocker > Labels: cache, dataframe > Fix For: 1.5.0 > > > As I understand, DataFrame.cache() is supposed to work the same as > RDD.cache(), so that repeated operations on it will use the cached results > and not recompute the entire lineage. However, it seems that some DataFrame > operations (e.g. withColumn) change the underlying RDD lineage so that cache > doesn't work as expected. > Below is a Scala example that demonstrates this. First, I define two UDF's > that use println so that it is easy to see when they are being called. Next, > I create a simple data frame with one row and two columns. Next, I add a > column, cache it, and call count() to force the computation. Lastly, I add > another column, cache it, and call count(). > I would have expected the last statement to only compute the last column, > since everything else was cached. 
However, because withColumn() changes the > lineage, the whole data frame is recomputed. > {code} > // Examples udf's that println when called > val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 } > val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 } > // Initial dataset > val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value") > // Add column by applying twice udf > val df2 = df1.withColumn("twice", twice($"value")) > df2.cache() > df2.count() //prints Computed: twice(1) > // Add column by applying triple udf > val df3 = df2.withColumn("triple", triple($"value")) > df3.cache() > df3.count() //prints Computed: twice(1)\nComputed: triple(1) > {code} > I found a workaround, which helped me understand what was going on behind the > scenes, but doesn't seem like an ideal solution. Basically, I convert to RDD > then back DataFrame, which seems to freeze the lineage. The code below shows > the workaround for creating the second data frame so cache will work as > expected. > {code} > val df2 = { > val tmp = df1.withColumn("twice", twice($"value")) > sqlContext.createDataFrame(tmp.rdd, tmp.schema) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11329) Expand Star when creating a struct
[ https://issues.apache.org/jira/browse/SPARK-11329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026372#comment-15026372 ] Maciej Bryński commented on SPARK-11329: I'm using Spark 1.6. I did some additional tests. This select uses TungstenAggregate: {code} sqlCtx.sql('select id, max(data) as max from table group by id').collect() {code} When I add struct(), it changes to the ConvertToSafe path. So I think the problem lies in the struct() function. > Expand Star when creating a struct > -- > > Key: SPARK-11329 > URL: https://issues.apache.org/jira/browse/SPARK-11329 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Nong Li > Fix For: 1.6.0 > > > It is pretty common for customers to do regular extractions of update data > from an external datasource (e.g. mysql or postgres). While this is possible > today, the syntax is a little onerous. With some small improvements to the > analyzer I think we could make this much easier. > Goal: Allow users to execute the following two queries as well as their > dataframe equivalents > to find the most recent record for each key > {{SELECT max(struct(timestamp, *)) as mostRecentRecord GROUP BY key}} > to unnest the struct from above. > {{SELECT mostRecentRecord.* FROM data}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11818) ExecutorClassLoader cannot see any resources from parent class loader
[ https://issues.apache.org/jira/browse/SPARK-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11818. Resolution: Fixed Assignee: Jungtaek Lim Fix Version/s: 1.6.0 > ExecutorClassLoader cannot see any resources from parent class loader > - > > Key: SPARK-11818 > URL: https://issues.apache.org/jira/browse/SPARK-11818 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.4.1 > Environment: CentOS 6, spark 1.4.1-hadoop2.4, mesos 0.22.1 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim > Fix For: 1.6.0 > > > This issue started from tracking down the root cause of a strange problem in > spark-shell (and Zeppelin) that does not occur with spark-submit. > https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAF5108jMXyOjiGmCgr%3Ds%2BNvTMcyKWMBVM1GsrH7Pz4xUj48LfA%40mail.gmail.com%3E > After some hours (over days) of digging into the details, I found that > ExecutorClassLoader cannot see any resource files that are visible to its > parent class loader. > ExecutorClassLoader itself doesn't need to look up resource files, since the REPL > doesn't generate any, but it should delegate the lookup to its parent class loader. > I'll soon provide a pull request that includes tests which fail on master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11955) Mark one side fields in merging schema for safely pushdowning filters in parquet
[ https://issues.apache.org/jira/browse/SPARK-11955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024893#comment-15024893 ] Apache Spark commented on SPARK-11955: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/9940 > Mark one side fields in merging schema for safely pushdowning filters in > parquet > > > Key: SPARK-11955 > URL: https://issues.apache.org/jira/browse/SPARK-11955 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > Currently we simply skip pushdowning filters in parquet if we enable schema > merging. > However, we can actually mark one side fields in merging schema for safely > pushdowning filters in parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11955) Mark one side fields in merging schema for safely pushdowning filters in parquet
Liang-Chi Hsieh created SPARK-11955: --- Summary: Mark one side fields in merging schema for safely pushdowning filters in parquet Key: SPARK-11955 URL: https://issues.apache.org/jira/browse/SPARK-11955 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently we simply skip pushdowning filters in parquet if we enable schema merging. However, we can actually mark one side fields in merging schema for safely pushdowning filters in parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11942) fix encoder life cycle for CoGroup
[ https://issues.apache.org/jira/browse/SPARK-11942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11942: - Assignee: Wenchen Fan > fix encoder life cycle for CoGroup > -- > > Key: SPARK-11942 > URL: https://issues.apache.org/jira/browse/SPARK-11942 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11955) Mark one side fields in merging schema for safely pushdowning filters in parquet
[ https://issues.apache.org/jira/browse/SPARK-11955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11955: Assignee: (was: Apache Spark) > Mark one side fields in merging schema for safely pushdowning filters in > parquet > > > Key: SPARK-11955 > URL: https://issues.apache.org/jira/browse/SPARK-11955 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > Currently we simply skip pushdowning filters in parquet if we enable schema > merging. > However, we can actually mark one side fields in merging schema for safely > pushdowning filters in parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11955) Mark one side fields in merging schema for safely pushdowning filters in parquet
[ https://issues.apache.org/jira/browse/SPARK-11955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11955: Assignee: Apache Spark > Mark one side fields in merging schema for safely pushdowning filters in > parquet > > > Key: SPARK-11955 > URL: https://issues.apache.org/jira/browse/SPARK-11955 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > Currently we simply skip pushdowning filters in parquet if we enable schema > merging. > However, we can actually mark one side fields in merging schema for safely > pushdowning filters in parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11942) fix encoder life cycle for CoGroup
[ https://issues.apache.org/jira/browse/SPARK-11942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11942. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9928 [https://github.com/apache/spark/pull/9928] > fix encoder life cycle for CoGroup > -- > > Key: SPARK-11942 > URL: https://issues.apache.org/jira/browse/SPARK-11942 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11956) Test failures potentially related to SPARK-11140
Marcelo Vanzin created SPARK-11956: -- Summary: Test failures potentially related to SPARK-11140 Key: SPARK-11956 URL: https://issues.apache.org/jira/browse/SPARK-11956 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.7.0 Reporter: Marcelo Vanzin [~joshrosen] pointed out that some YARN tests started failing intermittently after that change went in. Here's a suspicious excerpt from one of the logs on Jenkins: {noformat} 15/11/24 08:58:18 DEBUG TransportClient: Sending stream request for /jars/sparkJar2657865636759819960.tmp to /192.168.10.27:53256 15/11/24 08:58:18 INFO Utils: Fetching spark://192.168.10.27:53256/jars/sparkJar2657865636759819960.tmp to /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/fetchFileTemp4632692089398180695.tmp 15/11/24 09:00:00 WARN CoarseGrainedExecutorBackend: An unknown (amp-jenkins-worker-07.amp:53256) driver disconnected. 15/11/24 09:00:00 ERROR NettyRpcEnv: Error downloading stream /jars/sparkJar2657865636759819960.tmp. 
java.nio.channels.ClosedChannelException at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:61) at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:123) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208) at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194) at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828) at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) 15/11/24 09:00:00 INFO Utils: Copying /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/-4120705061448384283661_cache to /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp 15/11/24 09:00:00 INFO Executor: Adding 
file:/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp to class loader {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9141) DataFrame recomputed instead of using cached parent.
[ https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024927#comment-15024927 ] Michael Armbrust commented on SPARK-9141: - [~tianyi] please provide a reproduction of the issue you are hitting. The example from the description works for me. In particular, please include the explain() output for both the cached and the failing DataFrame. > DataFrame recomputed instead of using cached parent. > > > Key: SPARK-9141 > URL: https://issues.apache.org/jira/browse/SPARK-9141 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1 >Reporter: Nick Pritchard >Assignee: Michael Armbrust >Priority: Blocker > Labels: cache, dataframe > Fix For: 1.5.0 > > > As I understand it, DataFrame.cache() is supposed to work the same as > RDD.cache(), so that repeated operations on it will use the cached results > and not recompute the entire lineage. However, it seems that some DataFrame > operations (e.g. withColumn) change the underlying RDD lineage so that cache > doesn't work as expected. > Below is a Scala example that demonstrates this. First, I define two UDFs > that use println so that it is easy to see when they are being called. Next, > I create a simple data frame with one row and two columns. Then I add a > column, cache it, and call count() to force the computation. Lastly, I add > another column, cache it, and call count(). > I would have expected the last statement to only compute the last column, > since everything else was cached. However, because withColumn() changes the > lineage, the whole data frame is recomputed.
> {code}
> // Example UDFs that println when called
> val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
> val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }
>
> // Initial dataset
> val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value")
>
> // Add a column by applying the twice udf
> val df2 = df1.withColumn("twice", twice($"value"))
> df2.cache()
> df2.count() // prints Computed: twice(1)
>
> // Add a column by applying the triple udf
> val df3 = df2.withColumn("triple", triple($"value"))
> df3.cache()
> df3.count() // prints Computed: twice(1)\nComputed: triple(1)
> {code}
> I found a workaround, which helped me understand what was going on behind the scenes, but it doesn't seem like an ideal solution: I convert to an RDD and then back to a DataFrame, which seems to freeze the lineage. The code below shows the workaround for creating the second data frame so cache will work as expected.
> {code}
> val df2 = {
>   val tmp = df1.withColumn("twice", twice($"value"))
>   sqlContext.createDataFrame(tmp.rdd, tmp.schema)
> }
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
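Spark SQL reuses cached data by matching a new DataFrame's query plan against plans that were cached earlier, so a transformation that rewrites the plan can miss the cached parent and recompute the full lineage. A toy Python sketch of that plan-keyed lookup (illustrative only, not Spark's actual cache manager; plan tuples are made-up stand-ins):

```python
# Toy model of a plan-keyed cache: a cached result is found only when
# the incoming plan exactly equals a plan that was cached earlier.
cache = {}

def cache_plan(plan, result):
    cache[plan] = result

def lookup(plan):
    # A cache hit requires exact plan equality; any rewrite misses.
    return cache.get(plan)

# Cache the plan for df2 = df1.withColumn("twice", ...)
cache_plan(("scan", "withColumn:twice"), [("a", 1, 2)])

# df3 = df2.withColumn("triple", ...) produces a new, larger plan; if
# the cached sub-plan no longer appears verbatim inside it, the lookup
# misses and the whole lineage is recomputed.
hit = lookup(("scan", "withColumn:twice", "withColumn:triple"))
print(hit)  # None -> cache miss, full recomputation
```

The round-trip workaround in the description "works" under this model because rebuilding the DataFrame from `tmp.rdd` severs the logical plan, so later operations no longer rewrite the cached parent's plan.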
[jira] [Resolved] (SPARK-11952) Remove duplicate ml examples
[ https://issues.apache.org/jira/browse/SPARK-11952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-11952. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9933 [https://github.com/apache/spark/pull/9933] > Remove duplicate ml examples > > > Key: SPARK-11952 > URL: https://issues.apache.org/jira/browse/SPARK-11952 > Project: Spark > Issue Type: Sub-task > Components: Examples, ML >Reporter: Yanbo Liang >Priority: Minor > Fix For: 1.6.0 > > > Remove duplicate ml examples (only for ML) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11952) Remove duplicate ml examples (GBT/RF/logistic regression in Python)
[ https://issues.apache.org/jira/browse/SPARK-11952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11952: -- Assignee: Yanbo Liang > Remove duplicate ml examples (GBT/RF/logistic regression in Python) > --- > > Key: SPARK-11952 > URL: https://issues.apache.org/jira/browse/SPARK-11952 > Project: Spark > Issue Type: Sub-task > Components: Examples, ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 1.6.0 > > > Remove duplicate ml examples (only for ML) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11521) LinearRegressionSummary needs to clarify which metrics are weighted in the documentation
[ https://issues.apache.org/jira/browse/SPARK-11521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-11521. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9927 [https://github.com/apache/spark/pull/9927] > LinearRegressionSummary needs to clarify which metrics are weighted in the > documentation > > > Key: SPARK-11521 > URL: https://issues.apache.org/jira/browse/SPARK-11521 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > Fix For: 1.6.0 > > > Some metrics in the summary are weighted (e.g., devianceResiduals), but the > ones computed via RegressionMetrics are not. This should be documented very > clearly (unless this gets fixed before the next release in [SPARK-11520]). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11952) Remove duplicate ml examples (GBT/RF/logistic regression in Python)
[ https://issues.apache.org/jira/browse/SPARK-11952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11952: -- Summary: Remove duplicate ml examples (GBT/RF/logistic regression in Python) (was: Remove duplicate ml examples) > Remove duplicate ml examples (GBT/RF/logistic regression in Python) > --- > > Key: SPARK-11952 > URL: https://issues.apache.org/jira/browse/SPARK-11952 > Project: Spark > Issue Type: Sub-task > Components: Examples, ML >Reporter: Yanbo Liang >Priority: Minor > Fix For: 1.6.0 > > > Remove duplicate ml examples (only for ML) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11730) Feature Importance for GBT
[ https://issues.apache.org/jira/browse/SPARK-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025084#comment-15025084 ] Joseph K. Bradley commented on SPARK-11730: --- OK, following Friedman sounds good. : ) I agree it'd be nice to wait for GBT to be moved to spark.ml. > Feature Importance for GBT > -- > > Key: SPARK-11730 > URL: https://issues.apache.org/jira/browse/SPARK-11730 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Brian Webb > > Random Forests have feature importance, but GBTs do not. It would be great if > we could add feature importance to GBTs as well. Perhaps the code in Random > Forests can be refactored to apply to both types of ensembles. > See https://issues.apache.org/jira/browse/SPARK-5133 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10022: -- Assignee: Yanbo Liang > Scala-Python method/parameter inconsistency check for ML during 1.5 QA > -- > > Key: SPARK-10022 > URL: https://issues.apache.org/jira/browse/SPARK-10022 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 1.6.0 > > Attachments: python-1.4.txt, python-1.5.txt, python1.4-to-1.5.diff > > > The missing classes for PySpark were listed at SPARK-9663. > Here we check and list the missing method/parameter for ML of PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11604) ML 1.6 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025086#comment-15025086 ] Joseph K. Bradley commented on SPARK-11604: --- Great thank you! > ML 1.6 QA: API: Python API coverage > --- > > Key: SPARK-11604 > URL: https://issues.apache.org/jira/browse/SPARK-11604 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below) for this list of to-do items. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10022. --- Resolution: Fixed Fix Version/s: 1.6.0 > Scala-Python method/parameter inconsistency check for ML during 1.5 QA > -- > > Key: SPARK-10022 > URL: https://issues.apache.org/jira/browse/SPARK-10022 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 1.6.0 > > Attachments: python-1.4.txt, python-1.5.txt, python1.4-to-1.5.diff > > > The missing classes for PySpark were listed at SPARK-9663. > Here we check and list the missing method/parameter for ML of PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11884) Drop multiple columns in the DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-11884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024837#comment-15024837 ] Ted Yu commented on SPARK-11884: Is there interest in moving forward with the PR? > Drop multiple columns in the DataFrame API > -- > > Key: SPARK-11884 > URL: https://issues.apache.org/jira/browse/SPARK-11884 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu >Priority: Minor > > See the thread Ben started: > http://search-hadoop.com/m/q3RTtveEuhjsr7g/ > This issue adds a drop() method to DataFrame that accepts multiple column names -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11602) ML 1.6 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-11602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11602: Assignee: yuhao yang (was: Apache Spark) > ML 1.6 QA: API: New Scala APIs, docs > > > Key: SPARK-11602 > URL: https://issues.apache.org/jira/browse/SPARK-11602 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > Audit new public Scala APIs added to MLlib. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please comment here, or better yet create JIRAs and link > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11602) ML 1.6 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-11602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024838#comment-15024838 ] Apache Spark commented on SPARK-11602: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/9939 > ML 1.6 QA: API: New Scala APIs, docs > > > Key: SPARK-11602 > URL: https://issues.apache.org/jira/browse/SPARK-11602 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > Audit new public Scala APIs added to MLlib. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please comment here, or better yet create JIRAs and link > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11855) Catalyst breaks backwards compatibility in branch-1.6
[ https://issues.apache.org/jira/browse/SPARK-11855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11855: Assignee: (was: Apache Spark) > Catalyst breaks backwards compatibility in branch-1.6 > - > > Key: SPARK-11855 > URL: https://issues.apache.org/jira/browse/SPARK-11855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Santiago M. Mola >Priority: Critical > > There are a number of APIs broken in Catalyst 1.6.0. I'm trying to compile most > cases: > *UnresolvedRelation*'s constructor has been changed from taking a Seq to a > TableIdentifier. A deprecated constructor taking a Seq would be needed to be > backwards compatible.
> {code}
> case class UnresolvedRelation(
> -    tableIdentifier: Seq[String],
> +    tableIdentifier: TableIdentifier,
>     alias: Option[String] = None) extends LeafNode {
> {code}
> It is similar with *UnresolvedStar*:
> {code}
> -case class UnresolvedStar(table: Option[String]) extends Star with Unevaluable {
> +case class UnresolvedStar(target: Option[Seq[String]]) extends Star with Unevaluable {
> {code}
> *Catalog* got a lot of signatures changed too (because of > TableIdentifier). Providing the older methods as deprecated also seems viable > here. > Spark 1.5 already broke backwards compatibility of part of the Catalyst API with > respect to 1.4. I understand there are good reasons for some cases, but we > should try to minimize backwards-compatibility breakage for 1.x, especially > now that 2.x is on the horizon and there will be a near opportunity to remove > deprecated stuff. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
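The fix proposed here — keep the old Seq-based constructor as a deprecated alternative alongside the new TableIdentifier one — is a standard compatibility pattern: old call sites keep compiling while a warning steers users to the new signature. A rough Python analogue (hypothetical names; the real change would be a Scala auxiliary constructor annotated `@deprecated`):

```python
import warnings

class UnresolvedRelation:
    """Toy stand-in: the new API takes a structured identifier."""

    def __init__(self, table_identifier, alias=None):
        # New-style: table_identifier is a (database, table) tuple.
        self.table_identifier = table_identifier
        self.alias = alias

    @classmethod
    def from_name_parts(cls, parts, alias=None):
        # Old-style entry point kept for backwards compatibility:
        # accepts the legacy list-of-name-parts form, warns, converts.
        warnings.warn(
            "list-of-parts form is deprecated; pass an identifier tuple",
            DeprecationWarning,
        )
        database = parts[0] if len(parts) == 2 else None
        return cls((database, parts[-1]), alias=alias)

rel = UnresolvedRelation.from_name_parts(["db", "t"])
print(rel.table_identifier)  # ('db', 't')
```

The deprecated path pays a small conversion cost once, then delegates to the new constructor, so behavior stays in one place and the legacy form can be deleted in the next major release.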
[jira] [Commented] (SPARK-11956) Test failures potentially related to SPARK-11140
[ https://issues.apache.org/jira/browse/SPARK-11956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024974#comment-15024974 ] Marcelo Vanzin commented on SPARK-11956: This seems to be an issue I identified as part of working on SPARK-11563; there are fixes as part of that PR, I'll try to pull them out so we can unblock tests without having to push everything. > Test failures potentially related to SPARK-11140 > > > Key: SPARK-11956 > URL: https://issues.apache.org/jira/browse/SPARK-11956 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.7.0 >Reporter: Marcelo Vanzin > > [~joshrosen] pointed out that some YARN tests started failing intermittently > after that change went in. Here's a suspicious excerpt from one of the logs > on Jenkins: > {noformat} > 15/11/24 08:58:18 DEBUG TransportClient: Sending stream request for > /jars/sparkJar2657865636759819960.tmp to /192.168.10.27:53256 > 15/11/24 08:58:18 INFO Utils: Fetching > spark://192.168.10.27:53256/jars/sparkJar2657865636759819960.tmp to > /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/fetchFileTemp4632692089398180695.tmp > 15/11/24 09:00:00 WARN CoarseGrainedExecutorBackend: An unknown > (amp-jenkins-worker-07.amp:53256) driver disconnected. > 15/11/24 09:00:00 ERROR NettyRpcEnv: Error downloading stream > /jars/sparkJar2657865636759819960.tmp. 
> java.nio.channels.ClosedChannelException > at > org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:61) > at > org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:123) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194) > at > io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > 15/11/24 09:00:00 INFO Utils: Copying > /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/-4120705061448384283661_cache > to > /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp > 15/11/24 09:00:00 INFO Executor: Adding > 
file:/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp > to class loader > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11953) CLONE - Sparksql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append Bug
[ https://issues.apache.org/jira/browse/SPARK-11953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024991#comment-15024991 ] Huaxin Gao commented on SPARK-11953: My understanding is that SaveMode.Append doesn't mean the table is guaranteed to exist. When SaveMode is Append, if the table exists, append to the existing table; if it doesn't exist, create the table and append to the newly created table. So the code looks right to me. > CLONE - Sparksql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append Bug > > > Key: SPARK-11953 > URL: https://issues.apache.org/jira/browse/SPARK-11953 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Submit, SQL >Affects Versions: 1.4.1, 1.5.1 > Environment: Spark stand alone cluster >Reporter: Siva Gudavalli > > In Spark 1.3.1 we had two methods, i.e. CreateJdbcTable and InsertIntoJdbc. > They were replaced with write.jdbc() in Spark 1.4.1. > When we specify SaveMode.Append we are letting the application know that there is > a table in the database, which means "tableExists = true", and we do not need > to perform "JdbcUtils.tableExists(conn, table)". > Please let me know if you think differently. > Regards > Shiv
> {code}
> def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
>   val conn = JdbcUtils.createConnection(url, connectionProperties)
>   try {
>     var tableExists = JdbcUtils.tableExists(conn, table)
>     if (mode == SaveMode.Ignore && tableExists) {
>       return
>     }
>     if (mode == SaveMode.ErrorIfExists && tableExists) {
>       sys.error(s"Table $table already exists.")
>     }
>     if (mode == SaveMode.Overwrite && tableExists) {
>       JdbcUtils.dropTable(conn, table)
>       tableExists = false
>     }
>     // Create the table if the table didn't exist.
>     if (!tableExists) {
>       val schema = JDBCWriteDetails.schemaString(df, url)
>       val sql = s"CREATE TABLE $table ($schema)"
>       conn.prepareStatement(sql).executeUpdate()
>     }
>   } finally {
>     conn.close()
>   }
>   JDBCWriteDetails.saveTable(df, url, table, connectionProperties)
> }
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
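The Append semantics described in the comment can be sketched as plain decision logic. A toy Python model of the mode handling in the quoted method (not Spark's implementation; function and action names are illustrative):

```python
# Toy model of DataFrameWriter.jdbc save-mode handling. Append must
# tolerate a missing table by creating it first, which is why the
# tableExists check is still needed even when the mode is Append.
def plan_jdbc_write(mode, table_exists):
    """Return the actions the writer would take, in order."""
    actions = []
    if mode == "Ignore" and table_exists:
        return actions                      # silently do nothing
    if mode == "ErrorIfExists" and table_exists:
        raise ValueError("Table already exists.")
    if mode == "Overwrite" and table_exists:
        actions.append("drop")
        table_exists = False
    if not table_exists:
        actions.append("create")            # Append may still create
    actions.append("insert")
    return actions

print(plan_jdbc_write("Append", table_exists=False))  # ['create', 'insert']
print(plan_jdbc_write("Append", table_exists=True))   # ['insert']
```

Under this reading, removing the existence check would make Append fail on a fresh database, so the check is a feature rather than a bug.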
[jira] [Resolved] (SPARK-11847) Model export/import for spark.ml: LDA
[ https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-11847. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9894 [https://github.com/apache/spark/pull/9894] > Model export/import for spark.ml: LDA > - > > Key: SPARK-11847 > URL: https://issues.apache.org/jira/browse/SPARK-11847 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: yuhao yang > Fix For: 1.6.0 > > > Add read/write support to LDA, similar to ALS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11956) Test failures potentially related to SPARK-11140
[ https://issues.apache.org/jira/browse/SPARK-11956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025006#comment-15025006 ] Apache Spark commented on SPARK-11956: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/9941 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11956) Test failures potentially related to SPARK-11140
[ https://issues.apache.org/jira/browse/SPARK-11956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11956: Assignee: (was: Apache Spark) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11956) Test failures potentially related to SPARK-11140
[ https://issues.apache.org/jira/browse/SPARK-11956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11956: Assignee: Apache Spark > Test failures potentially related to SPARK-11140 > > > Key: SPARK-11956 > URL: https://issues.apache.org/jira/browse/SPARK-11956 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.7.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > [~joshrosen] pointed out that some YARN tests started failing intermittently > after that change went in. Here's a suspicious excerpt from one of the logs > on Jenkins: > {noformat} > 15/11/24 08:58:18 DEBUG TransportClient: Sending stream request for > /jars/sparkJar2657865636759819960.tmp to /192.168.10.27:53256 > 15/11/24 08:58:18 INFO Utils: Fetching > spark://192.168.10.27:53256/jars/sparkJar2657865636759819960.tmp to > /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/fetchFileTemp4632692089398180695.tmp > 15/11/24 09:00:00 WARN CoarseGrainedExecutorBackend: An unknown > (amp-jenkins-worker-07.amp:53256) driver disconnected. > 15/11/24 09:00:00 ERROR NettyRpcEnv: Error downloading stream > /jars/sparkJar2657865636759819960.tmp. 
> java.nio.channels.ClosedChannelException > at > org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:61) > at > org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:123) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194) > at > io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > 15/11/24 09:00:00 INFO Utils: Copying > /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/-4120705061448384283661_cache > to > /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp > 15/11/24 09:00:00 INFO Executor: Adding > 
file:/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp > to class loader > {noformat}
[jira] [Commented] (SPARK-11855) Catalyst breaks backwards compatibility in branch-1.6
[ https://issues.apache.org/jira/browse/SPARK-11855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024832#comment-15024832 ] Apache Spark commented on SPARK-11855: -- User 'smola' has created a pull request for this issue: https://github.com/apache/spark/pull/9938 > Catalyst breaks backwards compatibility in branch-1.6 > - > > Key: SPARK-11855 > URL: https://issues.apache.org/jira/browse/SPARK-11855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Santiago M. Mola >Priority: Critical > > There are a number of broken APIs in Catalyst 1.6.0. I'm trying to compile the > main cases: > *UnresolvedRelation*'s constructor has been changed from taking a Seq to a > TableIdentifier. A deprecated constructor taking a Seq would be needed for > backwards compatibility. > {code} > case class UnresolvedRelation( > -tableIdentifier: Seq[String], > +tableIdentifier: TableIdentifier, > alias: Option[String] = None) extends LeafNode { > {code} > The situation is similar with *UnresolvedStar*: > {code} > -case class UnresolvedStar(table: Option[String]) extends Star with > Unevaluable { > +case class UnresolvedStar(target: Option[Seq[String]]) extends Star with > Unevaluable { > {code} > *Catalog* also had many signatures changed (because of > TableIdentifier). Providing the older methods as deprecated seems viable > here as well. > Spark 1.5 already broke backwards compatibility of part of the Catalyst API with > respect to 1.4. I understand there are good reasons for some cases, but we > should try to minimize backwards compatibility breakages for 1.x. Especially > now that 2.x is on the horizon and there will be a near-term opportunity to remove > deprecated APIs.
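One way to restore source compatibility for callers of the old signature (a sketch only, using assumed import paths, and not necessarily what the linked pull request does) is a deprecated overload that converts the legacy Seq[String] into a TableIdentifier:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

// Hypothetical compatibility shim: accept the pre-1.6 Seq[String] form and
// translate it to the new TableIdentifier-based constructor.
object UnresolvedRelationCompat {
  @deprecated("Use TableIdentifier instead of Seq[String]", "1.6.0")
  def apply(
      tableIdentifier: Seq[String],
      alias: Option[String] = None): UnresolvedRelation = {
    // A one- or two-part identifier: the last element is the table name,
    // the element before it (if any) the database.
    val table = tableIdentifier.last
    val database = tableIdentifier.dropRight(1).lastOption
    UnresolvedRelation(TableIdentifier(table, database), alias)
  }
}
```

Placing the deprecated `apply` directly on UnresolvedRelation's own companion object would make the change transparent to existing source-compatible callers.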
[jira] [Assigned] (SPARK-11602) ML 1.6 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-11602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11602: Assignee: Apache Spark (was: yuhao yang) > ML 1.6 QA: API: New Scala APIs, docs > > > Key: SPARK-11602 > URL: https://issues.apache.org/jira/browse/SPARK-11602 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Audit new public Scala APIs added to MLlib. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please comment here, or better yet create JIRAs and link > them.
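As a quick illustration of the non-sealed-trait audit point (type and member names here are made up, not from MLlib): leaving a public trait unsealed invites user subclasses, which turns every member into frozen API surface, whereas sealing keeps all implementations in-project:

```scala
// Unsealed: any user can extend this, so adding an abstract member later
// is a source-breaking change for them.
trait ResultStatus

// Sealed: all subtypes must live in this source file, so the hierarchy can
// evolve, and match expressions on it get exhaustiveness checking.
sealed trait ParamValidity
case object ParamValid extends ParamValidity
final case class ParamInvalid(reason: String) extends ParamValidity
```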
[jira] [Assigned] (SPARK-11855) Catalyst breaks backwards compatibility in branch-1.6
[ https://issues.apache.org/jira/browse/SPARK-11855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11855: Assignee: Apache Spark > Catalyst breaks backwards compatibility in branch-1.6 > - > > Key: SPARK-11855 > URL: https://issues.apache.org/jira/browse/SPARK-11855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Santiago M. Mola >Assignee: Apache Spark >Priority: Critical
[jira] [Commented] (SPARK-9328) Netty IO layer should implement read timeouts
[ https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025107#comment-15025107 ] Michael Armbrust commented on SPARK-9328: - [~joshrosen] is this actually a 1.6 blocker? > Netty IO layer should implement read timeouts > - > > Key: SPARK-9328 > URL: https://issues.apache.org/jira/browse/SPARK-9328 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.1, 1.3.1, 1.4.1, 1.5.0 >Reporter: Josh Rosen >Priority: Blocker > > Spark's network layer does not implement read timeouts, which may lead to > stalls during shuffle: if a remote shuffle server stalls while responding to > a shuffle block fetch request but does not close the socket, then the job may > block until an OS-level socket timeout occurs. > I think that we can fix this using Netty's ReadTimeoutHandler > (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler). > The tricky part of working on this will be figuring out the right place to > add the handler and ensuring that we don't introduce performance issues by > not re-using sockets. > Quoting from that linked StackOverflow question: > {quote} > Note that the ReadTimeoutHandler is also unaware of whether you have sent a > request - it only cares whether data has been read from the socket. If your > connection is persistent, and you only want read timeouts to fire when a > request has been sent, you'll need to build a request / response aware > timeout handler. > {quote} > If we want to avoid tearing down connections between shuffles, then we may > have to do something like this.
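For reference, wiring Netty's ReadTimeoutHandler into a channel pipeline looks roughly like this (the timeout value, handler name, and placement are illustrative assumptions, not Spark's actual configuration):

```scala
import io.netty.channel.ChannelInitializer
import io.netty.channel.socket.SocketChannel
import io.netty.handler.timeout.ReadTimeoutHandler

class FetchChannelInitializer extends ChannelInitializer[SocketChannel] {
  // Assumed window; in Spark this would presumably come from a spark.*
  // configuration property.
  private val readTimeoutSeconds = 120

  override def initChannel(ch: SocketChannel): Unit = {
    // Fires a ReadTimeoutException if no inbound bytes arrive within the
    // window; a downstream handler can then close the channel and fail any
    // outstanding block fetches instead of hanging indefinitely.
    ch.pipeline().addLast("readTimeout", new ReadTimeoutHandler(readTimeoutSeconds))
    // ... frame decoder and the application message handler follow here
  }
}
```

As the quoted answer notes, this handler fires on any read inactivity, so for persistent connections it would need to be paired with request/response awareness to avoid tearing down idle-but-healthy sockets.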
[jira] [Commented] (SPARK-11382) Replace example code in mllib-decision-tree.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-11382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025109#comment-15025109 ] Apache Spark commented on SPARK-11382: -- User 'nongli' has created a pull request for this issue: https://github.com/apache/spark/pull/9942 > Replace example code in mllib-decision-tree.md using include_example > > > Key: SPARK-11382 > URL: https://issues.apache.org/jira/browse/SPARK-11382 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: starter > Fix For: 1.6.0 > > > This is similar to SPARK-11289 but for the example code in > mllib-decision-tree.md.
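For context, the include_example Jekyll tag pulls a source file from the examples module into the generated documentation page at build time, replacing hand-maintained inline snippets; the exact file path below is illustrative, not taken from the pull request:

```
{% include_example scala/org/apache/spark/examples/mllib/DecisionTreeClassificationExample.scala %}
```

This keeps doc snippets compiling alongside the codebase instead of drifting out of date inside the markdown.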
[jira] [Created] (SPARK-11957) SQLTransformer docs are unclear about generality of SQL statements
Joseph K. Bradley created SPARK-11957: - Summary: SQLTransformer docs are unclear about generality of SQL statements Key: SPARK-11957 URL: https://issues.apache.org/jira/browse/SPARK-11957 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Priority: Minor See the discussion in [SPARK-11234] for context. The Scala doc needs to be clearer about which SQL statements are supported.
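To make the ambiguity concrete: SQLTransformer executes a statement against its input DataFrame via the __THIS__ placeholder, but the docs should spell out which statement forms (e.g. joins, aggregates, subqueries) are actually supported. Basic usage looks like this (column names v1/v2 are illustrative):

```scala
import org.apache.spark.ml.feature.SQLTransformer

// The statement must reference the input DataFrame as __THIS__.
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

// sqlTrans.transform(df) then yields df with the derived columns appended.
```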