[jira] [Updated] (SPARK-13591) Remove Back-ticks in Attribute/Alias Names
[ https://issues.apache.org/jira/browse/SPARK-13591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13591: Description: When calling .sql, back-ticks are automatically added. When using .sql as AttributeReference/Alias names, we hit a couple of issues when converting a logical plan to SQL. For example, the name could be converted to {{`sum(`name`)`}}, which the parser is unable to recognize. (was: When calling .sql, back-ticks are automatically added. When using .sql as AttributeReference/Alias names, we hit a couple of issues when doing logicalPlan to SQL. The name could be converted to {{`sum(`name`)`}}. Parser is unable to recognize it.) > Remove Back-ticks in Attribute/Alias Names > -- > > Key: SPARK-13591 > URL: https://issues.apache.org/jira/browse/SPARK-13591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > When calling .sql, back-ticks are automatically added. When using .sql as > AttributeReference/Alias names, we hit a couple of issues when converting a > logical plan to SQL. For example, the name could be converted to > {{`sum(`name`)`}}, which the parser is unable to recognize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
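For context, the failure mode is easy to reproduce outside Spark: wrapping a name that already contains back-ticks in another pair of back-ticks yields a token no SQL parser can tokenize. A minimal plain-Python sketch (illustrative only, not Spark's actual fix, which removes the back-ticks from attribute names; the escape-by-doubling convention is Hive-style):

```python
def quote_identifier(name: str) -> str:
    """Quote a name for Hive-style SQL, doubling any embedded back-ticks,
    so the result stays parseable even if `name` already contains them."""
    return "`" + name.replace("`", "``") + "`"

# Naive quoting of an already-quoted name produces nested back-ticks
# that a parser cannot tokenize:
naive = "`" + "sum(`name`)" + "`"       # the broken form from the ticket
safe = quote_identifier("sum(`name`)")  # escaped form with doubled back-ticks
print(naive)
print(safe)
```

The ticket instead takes the simpler route of never putting back-ticks into AttributeReference/Alias names in the first place.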
[jira] [Created] (SPARK-13591) Remove Back-ticks in Attribute/Alias Names
Xiao Li created SPARK-13591: --- Summary: Remove Back-ticks in Attribute/Alias Names Key: SPARK-13591 URL: https://issues.apache.org/jira/browse/SPARK-13591 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li When calling .sql, back-ticks are automatically added. When using .sql as AttributeReference/Alias names, we hit a couple of issues when converting a logical plan to SQL. The name could be converted to {{`sum(`name`)`}}, which the parser is unable to recognize.
[jira] [Resolved] (SPARK-13550) Add java example for ml.clustering.BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13550. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11428 [https://github.com/apache/spark/pull/11428] > Add java example for ml.clustering.BisectingKMeans > -- > > Key: SPARK-13550 > URL: https://issues.apache.org/jira/browse/SPARK-13550 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.0.0 > > > Add java example for ml.clustering.BisectingKMeans
[jira] [Updated] (SPARK-13550) Add java example for ml.clustering.BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13550: -- Assignee: zhengruifeng > Add java example for ml.clustering.BisectingKMeans > -- > > Key: SPARK-13550 > URL: https://issues.apache.org/jira/browse/SPARK-13550 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > > Add java example for ml.clustering.BisectingKMeans
[jira] [Commented] (SPARK-13581) LibSVM throws MatchError
[ https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173383#comment-15173383 ] Jeff Zhang commented on SPARK-13581: I suspect it is an issue in code generation. The root cause is that it should read the features column but actually reads the label column, which causes the match error. And df.show() is successful without any selection. The stacktrace shows the error comes from the code generator. Can anyone familiar with code generation help with this? {code} Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost): scala.MatchError: 0.0 (of class java.lang.Double) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207) at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192) at org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:63) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:60) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:40) at org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$5$$anon$1.hasNext(WholeStageCodegen.scala:305) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) {code} > LibSVM throws MatchError > > > Key: SPARK-13581 > URL: https://issues.apache.org/jira/browse/SPARK-13581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jakob Odersky >Assignee: Jeff Zhang >Priority: Minor > > When running an action on a DataFrame obtained by reading from a libsvm file > a MatchError is thrown, however doing the same on a cached DataFrame works > fine. > {code} > val df = > sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") > //file is in spark repository > df.select(df("features")).show() //MatchError > df.cache() > df.select(df("features")).show() //OK > {code} > The exception stack trace is the following: > {code} > scala.MatchError: 1.0 (of class java.lang.Double) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56) > {code} > This issue first appeared in commit {{1dac964c1}}, in PR > 
[#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622. > [~jeffzhang], do you have any insight into what could be going on? > cc [~iyounus]
[jira] [Created] (SPARK-13590) Document the behavior of spark.ml logistic regression when there are constant features
Xiangrui Meng created SPARK-13590: - Summary: Document the behavior of spark.ml logistic regression when there are constant features Key: SPARK-13590 URL: https://issues.apache.org/jira/browse/SPARK-13590 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.0.0 Reporter: Xiangrui Meng As discussed in SPARK-13029, we decided to keep the current behavior that sets all coefficients associated with constant feature columns to zero, regardless of intercept, regularization, and standardization settings. This is the same behavior as in glmnet. Since this is different from LIBSVM, we should document the behavior correctly, add tests, and generate warning messages if there are constant columns and `addIntercept` is false. cc [~coderxiang] [~dbtsai]
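The warning the ticket asks for is straightforward to sketch: scan the design matrix for feature columns whose value never varies and warn that their coefficients will be pinned to zero. A hypothetical plain-Python illustration (Spark's actual check would live in the ML estimator, not here):

```python
def constant_columns(rows):
    """Return indices of feature columns whose value never changes
    across all rows (the columns whose coefficients get set to zero)."""
    if not rows:
        return []
    n = len(rows[0])
    return [j for j in range(n) if all(r[j] == rows[0][j] for r in rows)]

# Column 1 is constant (always 3.0), so it would be flagged.
X = [[1.0, 3.0, 5.0],
     [2.0, 3.0, 6.0],
     [4.0, 3.0, 7.0]]
cols = constant_columns(X)
if cols:
    print("warning: constant feature columns", cols,
          "will receive zero coefficients")
```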
[jira] [Resolved] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13029. --- Resolution: Won't Fix As discussed on the PR page, we decided to keep the current behavior, which is the same as glmnet. I'll create a PR for the documentation. > Logistic regression returns inaccurate results when there is a column with > identical value, and fit_intercept=false > --- > > Key: SPARK-13029 > URL: https://issues.apache.org/jira/browse/SPARK-13029 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.2, 1.6.0 >Reporter: Shuo Xiang >Assignee: Shuo Xiang > > This is a bug that appears while fitting a Logistic Regression model with > `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix > has one column with identical value, the resulting model is not correct. > Specifically, the special column will always get a weight of 0, due to the > special check inside the code. However, the correct solution, which is unique > for L2 logistic regression, usually has non-zero weight. > I use the heart_scale data > (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and > manually augmented the data matrix with a column of one (available in the > PR). The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the > following tools: > - libsvm > - scikit-learn > - sparkml > (Notice libsvm and scikit-learn use a slightly different formulation, so > their regularizer is equivalently set to 1/270). > The first two will have an objective value 0.7275 and give a solution vector: > [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, > 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 > 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, > 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, > 0.1801661775839843, -0.01248615347419409]. 
> Spark will produce an objective value 0.7278 and give a solution vector: > [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] > Notice the last element of the weight vector is 0. > An even simpler example is: > {code:title=benchmark.py|borderStyle=solid} > import numpy as np > from sklearn.datasets import load_svmlight_file > from sklearn.linear_model import LogisticRegression > x_train = np.array([[1, 1], [0, 1]]) > y_train = np.array([1, 0]) > model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000, > fit_intercept=False).fit(x_train, y_train) > print model.coef_ > [[ 0.22478867 -0.02241016]] > {code} > The same data trained by the current solver also gives a different result, > see the unit test in the PR.
[jira] [Updated] (SPARK-13551) Fix wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample
[ https://issues.apache.org/jira/browse/SPARK-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13551: -- Target Version/s: 2.0.0 > Fix wrong comment and remove meaningless lines in > mllib.JavaBisectingKMeansExample > --- > > Key: SPARK-13551 > URL: https://issues.apache.org/jira/browse/SPARK-13551 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.0.0 > > > This description is wrong: > /** > * Java example for graph clustering using power iteration clustering (PIC). > */ > This for loop is meaningless: > for (Vector center: model.clusterCenters()) { > System.out.println(""); > }
[jira] [Updated] (SPARK-13551) Fix wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample
[ https://issues.apache.org/jira/browse/SPARK-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13551: -- Assignee: zhengruifeng > Fix wrong comment and remove meaningless lines in > mllib.JavaBisectingKMeansExample > --- > > Key: SPARK-13551 > URL: https://issues.apache.org/jira/browse/SPARK-13551 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.0.0 > > > This description is wrong: > /** > * Java example for graph clustering using power iteration clustering (PIC). > */ > This for loop is meaningless: > for (Vector center: model.clusterCenters()) { > System.out.println(""); > }
[jira] [Resolved] (SPARK-13551) Fix wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample
[ https://issues.apache.org/jira/browse/SPARK-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13551. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11429 [https://github.com/apache/spark/pull/11429] > Fix wrong comment and remove meaningless lines in > mllib.JavaBisectingKMeansExample > --- > > Key: SPARK-13551 > URL: https://issues.apache.org/jira/browse/SPARK-13551 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: zhengruifeng >Priority: Trivial > Fix For: 2.0.0 > > > This description is wrong: > /** > * Java example for graph clustering using power iteration clustering (PIC). > */ > This for loop is meaningless: > for (Vector center: model.clusterCenters()) { > System.out.println(""); > }
[jira] [Commented] (SPARK-13546) GBT with many trees consistently giving java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-13546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173261#comment-15173261 ] Glen Maisey commented on SPARK-13546: - As a workaround I've thrown checkpointing on. It seems to truncate the DAG such that it doesn't crash. {code} // first line we add a checkpoint directory sc.setCheckpointDir("mydir") ... // we add CheckpointInterval and cache node ids val gbt = new GBTRegressor().setMaxIter(1000).setStepSize(0.001).setMaxDepth(3).setMaxMemoryInMB(256).setMinInstancesPerNode(5).setSubsamplingRate(0.8).setMaxBins(10) .setCheckpointInterval(50).setCacheNodeIds(true) ... {code} Feels like the checkpointing might need to be compulsory for very iterative algorithms like the gbt. > GBT with many trees consistently giving java.lang.StackOverflowError > > > Key: SPARK-13546 > URL: https://issues.apache.org/jira/browse/SPARK-13546 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 >Reporter: Glen Maisey > Attachments: iris.csv > > > I've found that creating a GBT in SparkML with a large number of trees is > causing a stackoverflow error. The same code works with a smaller number of > trees. This was occurring on a large dataset however I've reproduced the > error using the tiny Fisher's Iris dataset to make it a bit easier to debug > (https://en.wikipedia.org/wiki/Iris_flower_data_set). > Unfortunately I do not have the skills to fix the issue, just reproduce it. > Let me know if there are things I could test. Thanks! > Edit: I've already tried increasing the JVM stack size significantly and that > doesn't solve the problem. 
> {code} > // Import > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.evaluation.RegressionEvaluator > import org.apache.spark.ml.feature.RFormula > import org.apache.spark.ml.regression.GBTRegressor > // Load Data & specify model > val data = > sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load("/user/glenm/iris.csv") > val formulaString = "SepalLength ~ SepalWidth + PetalLength + PetalWidth + > Species" > // Setup R Formula > val formula = new RFormula().setFormula(formulaString) > > // Setup evaluation metric > val evaluator = new > RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("rmse") > > val pipeline = new Pipeline().setStages(Array(formula)) > // Transform the data > val pipelineModel = pipeline.fit(data) > val transformedData = pipelineModel.transform(data).select("features","label") > // Split our data into test and train > val Array(train, test) = transformedData.randomSplit(Array(0.9, 0.1)) > // Cache our training data > val trainingData = train.repartition(2).cache() > trainingData.count() > // Setup GBT > val gbt = new > GBTRegressor().setMaxIter(1000).setStepSize(0.001).setMaxDepth(3).setMaxMemoryInMB(256).setMinInstancesPerNode(5).setSubsamplingRate(0.8).setMaxBins(10) > // Create the GBT > val myGBM = gbt.fit(trainingData) > {code} > Stack trace: > java.lang.StackOverflowError > at > java.io.ObjectStreamClass.getPrimFieldValues(ObjectStreamClass.java:1233) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1532) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at > scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:137) 
> at > scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:135) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at > scala.collection.mutable.HashTable$class.serializeTo(HashTable.scala:124) > at scala.collection.mutable.HashMap.serializeTo(HashMap.scala:39) > at scala.collection.mutable.HashMap.writeObject(HashMap.scala:135) > at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutput
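The checkpointing workaround discussed in this thread helps because each boosting iteration extends the RDD lineage, and serializing a very deep object graph recurses once per link. The effect can be sketched with plain Python objects (the Node chain below is a toy stand-in for lineage, not Spark API): a long un-truncated chain overflows the recursion limit when pickled, while periodically dropping the parent pointer, as a checkpoint does, keeps the graph shallow.

```python
import pickle

class Node:
    """Toy stand-in for an RDD whose lineage is a chain of parent references."""
    def __init__(self, parent=None):
        self.parent = parent

def build(iterations, checkpoint_every=None):
    node = Node()
    for i in range(iterations):
        node = Node(node)  # each iteration adds one link to the "lineage"
        if checkpoint_every and (i + 1) % checkpoint_every == 0:
            node.parent = None  # "checkpoint": forget the lineage so far
    return node

# Pickling recurses once per link; a 10,000-deep chain exceeds the
# default recursion limit, while the checkpointed chain stays shallow.
try:
    pickle.dumps(build(10_000))
    deep_ok = True
except RecursionError:
    deep_ok = False
shallow_ok = len(pickle.dumps(build(10_000, checkpoint_every=50))) > 0
```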
[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0
[ https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173254#comment-15173254 ] Jay Panicker commented on SPARK-13117: -- Sorry, [~devaraj.k], I did not check the full thread - I ran into this issue and fixed it the way I mentioned. Now, digging a bit deeper: {code} protected val publicHostName = Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName) {code} and localHostName comes from: {code} protected val localHostName = Utils.localHostNameForURI() {code} and localHostNameForURI is defined as: {code} def localHostNameForURI(): String = { customHostname.getOrElse(InetAddresses.toUriString(localIpAddress)) } {code} OK: {code} def setCustomHostname(hostname: String) { // DEBUG code Utils.checkHost(hostname) customHostname = Some(hostname) } {code} and {code} private lazy val localIpAddress: InetAddress = findLocalInetAddress() {code} And findLocalInetAddress does this: {code} val defaultIpOverride = System.getenv("SPARK_LOCAL_IP") {code} and, among other things, also changes the loopback IP to the real IP... While using "0.0.0.0" instead of localHostName certainly fixes the loopback/localhost issue, it bypasses the sequence of steps above, so it might have other side effects. > WebUI should use the local ip not 0.0.0.0 > - > > Key: SPARK-13117 > URL: https://issues.apache.org/jira/browse/SPARK-13117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jeremiah Jordan >Priority: Minor > > When SPARK_LOCAL_IP is set, everything seems to correctly bind and use that IP > except the WebUI. The WebUI should use SPARK_LOCAL_IP, not always > 0.0.0.0. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137
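The precedence the comment traces through can be condensed: SPARK_PUBLIC_DNS wins, then an explicitly set custom hostname, then SPARK_LOCAL_IP, and only then the interface-detected address. A hypothetical Python restatement of that fall-through (illustrative only; the real logic lives in org.apache.spark.util.Utils, and 192.0.2.10 below is a documentation-range placeholder):

```python
import os

def public_host_name(custom_hostname=None, detected_address="192.0.2.10"):
    # Each source is consulted only if every earlier one is unset.
    return (os.environ.get("SPARK_PUBLIC_DNS")
            or custom_hostname
            or os.environ.get("SPARK_LOCAL_IP")
            or detected_address)

# With no overrides set, the detected interface address wins.
os.environ.pop("SPARK_PUBLIC_DNS", None)
os.environ.pop("SPARK_LOCAL_IP", None)
no_override = public_host_name()
# SPARK_LOCAL_IP applies only when nothing higher-priority is set.
os.environ["SPARK_LOCAL_IP"] = "10.0.0.5"
local_ip = public_host_name()
# SPARK_PUBLIC_DNS outranks everything, including a custom hostname.
os.environ["SPARK_PUBLIC_DNS"] = "pub.example.com"
public_dns = public_host_name(custom_hostname="myhost")
```

This also makes the commenter's point concrete: binding to "0.0.0.0" short-circuits the whole chain rather than resolving through it.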
[jira] [Updated] (SPARK-13588) Unable to map Parquet file to Hive Table using HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akshat Thakar updated SPARK-13588: -- Summary: Unable to map Parquet file to Hive Table using HiveContext (was: Unable to Map Parquet file to Hive Table using HiveContext) > Unable to map Parquet file to Hive Table using HiveContext > -- > > Key: SPARK-13588 > URL: https://issues.apache.org/jira/browse/SPARK-13588 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 > Environment: Linux Red Hat 6.7, Hortonworks HDP 2.3 Distribution >Reporter: Akshat Thakar >Priority: Minor > > I am trying to map an existing Parquet file to an external table using a > PySpark script. I am using HiveContext to execute Hive SQL. > I was able to execute the same SQL command using the Hive shell. > I get the error below while doing so: > >>> hive.sql('create external table temp_inserts like new_inserts stored as > >>> parquet LOCATION "/user/hive/warehouse/temp_inserts') > 16/03/01 06:24:01 INFO ParseDriver: Parsing command: create external table > temp_inserts like new_inserts stored as parquet LOCATION > "/user/hive/warehouse/temp_inserts > Traceback (most recent call last): > File "", line 1, in > File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/sql/context.py", line 488, > in sql > return DataFrame(self._ssql_ctx.sql(sqlQuery), self) > File > "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql. 
> : org.apache.spark.sql.AnalysisException: missing EOF at 'stored' near > 'new_inserts'; line 1 pos 52 > at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:254) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) > at > scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at > scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) > at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) > at 
org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) > at > org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) > at > scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at > scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala
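Two things are worth noting about the failing statement. First, the parser error at 'stored' suggests that Spark 1.3's HiveQL parser rejects STORED AS/LOCATION clauses on a CREATE TABLE ... LIKE statement, even though the reporter says the same DDL worked in the Hive shell. Second, and independent of that, the LOCATION string literal in the quoted statement is never closed. A trivial sanity check (plain Python, illustrative only) catches the second problem:

```python
def double_quotes_balanced(sql: str) -> bool:
    """Crude check: an odd count of '"' characters means an unterminated
    string literal somewhere in the statement (ignores escape sequences)."""
    return sql.count('"') % 2 == 0

# The statement from the traceback: the LOCATION literal is opened but
# never closed.
broken = ('create external table temp_inserts like new_inserts '
          'stored as parquet LOCATION "/user/hive/warehouse/temp_inserts')
fixed = broken + '"'
```

Fixing the quote alone would not clear the 'missing EOF at stored' error, but it is a prerequisite for the statement being valid at all.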
[jira] [Updated] (SPARK-13588) Unable to Map Parquet file to Hive Table using HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akshat Thakar updated SPARK-13588: -- Description: I am trying to map an existing Parquet file to an external table using a PySpark script. I am using HiveContext to execute Hive SQL. I was able to execute the same SQL command using the Hive shell. I get the error below while doing so: >>> hive.sql('create external table temp_inserts like new_inserts stored as >>> parquet LOCATION "/user/hive/warehouse/temp_inserts') 16/03/01 06:24:01 INFO ParseDriver: Parsing command: create external table temp_inserts like new_inserts stored as parquet LOCATION "/user/hive/warehouse/temp_inserts Traceback (most recent call last): File "", line 1, in File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/sql/context.py", line 488, in sql return DataFrame(self._ssql_ctx.sql(sqlQuery), self) File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql. 
: org.apache.spark.sql.AnalysisException: missing EOF at 'stored' near 'new_inserts'; line 1 pos 52 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:254) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) a
[jira] [Updated] (SPARK-13588) Unable to Map Parquet file to Hive Table using HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akshat Thakar updated SPARK-13588: -- Description: I am trying to map an existing Parquet file to an external table using a PySpark script. I am using HiveContext to execute Hive SQL. I was able to execute the same SQL command using the Hive shell. I get the error below while doing so: >>> hive.sql('create external table temp_inserts like new_inserts stored as >>> parquet LOCATION "/user/hive/warehouse/temp_inserts') 16/03/01 06:16:35 INFO ParseDriver: Parsing command: create external table cdc_new like cdc_new stored as parquet LOCATION "/user/hive/warehouse/cdc_temp_inserts Traceback (most recent call last): File "", line 1, in File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/sql/context.py", line 488, in sql return DataFrame(self._ssql_ctx.sql(sqlQuery), self) File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql. 
: org.apache.spark.sql.AnalysisException: missing EOF at 'stored' near 'temp_inserts'; line 1 pos 44 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:254) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at
[jira] [Commented] (SPARK-13589) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType
[ https://issues.apache.org/jira/browse/SPARK-13589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173249#comment-15173249 ] Cheng Lian commented on SPARK-13589: [~nongli] Is SPARK-13533 related to this one? > Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType > --- > > Key: SPARK-13589 > URL: https://issues.apache.org/jira/browse/SPARK-13589 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > Labels: flaky-test > > Here are a few sample build failures caused by this test case: > # > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52164/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ > # > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52154/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ > # > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52153/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ > (I've pinned these builds on Jenkins so that they won't be cleaned up.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13589) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType
Cheng Lian created SPARK-13589: -- Summary: Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType Key: SPARK-13589 URL: https://issues.apache.org/jira/browse/SPARK-13589 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.0.0 Reporter: Cheng Lian Here are a few sample build failures caused by this test case: # https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52164/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ # https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52154/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ # https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52153/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ (I've pinned these builds on Jenkins so that they won't be cleaned up.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13588) Unable to Map Parquet file to Hive Table using HiveContext
Akshat Thakar created SPARK-13588: - Summary: Unable to Map Parquet file to Hive Table using HiveContext Key: SPARK-13588 URL: https://issues.apache.org/jira/browse/SPARK-13588 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: Linux Red Hat 6.7, Hortonworks HDP 2.3 Distribution Reporter: Akshat Thakar Priority: Minor I am trying to map an existing Parquet file to an external table using a PySpark script. I am using HiveContext to execute Hive SQL. I was able to execute the same SQL command using the Hive shell. I get the below error while doing so: >>> hive.sql('create external table temp_inserts like new_inserts stored as >>> parquet LOCATION "/user/hive/warehouse/temp_inserts') 16/03/01 06:16:35 INFO ParseDriver: Parsing command: create external table cdc_new like cdc_new stored as parquet LOCATION "/user/hive/warehouse/cdc_temp_inserts Traceback (most recent call last): File "", line 1, in File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/sql/context.py", line 488, in sql return DataFrame(self._ssql_ctx.sql(sqlQuery), self) File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql. 
: org.apache.spark.sql.AnalysisException: missing EOF at 'stored' near 'cdc_new'; line 1 pos 44 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:254) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:
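[Editor's note] Independent of the parser limitation the AnalysisException points at (the {{stored}} keyword after {{like}}), the command as pasted also has an unterminated double quote in the LOCATION string literal. A quick sanity check for that kind of quoting slip, sketched in Python (the helper name is made up for illustration):

```python
# The CREATE TABLE command exactly as reported in this issue. The double
# quote opening the LOCATION path is never closed, so the quote count is odd.
sql = ('create external table temp_inserts like new_inserts stored as '
       'parquet LOCATION "/user/hive/warehouse/temp_inserts')

def has_unbalanced_quotes(s):
    # Odd number of double quotes means at least one string literal is unclosed.
    return s.count('"') % 2 == 1
```

Checking the pasted command with this helper flags the unclosed literal, which is worth fixing before drawing conclusions from the parser error.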
[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173228#comment-15173228 ] Jeff Zhang edited comment on SPARK-13587 at 3/1/16 5:04 AM: I have implemented a POC for this feature. Here's one simple command showing how to use virtualenv in pyspark {code} bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" ~/work/virtualenv/spark.py {code} There are 4 properties that need to be set * spark.pyspark.virtualenv.enabled (enable virtualenv) * spark.pyspark.virtualenv.type (default/conda are supported, default is native) * spark.pyspark.virtualenv.requirements (requirement file for the dependencies) * spark.pyspark.virtualenv.path (path to the executable for virtualenv/conda) Comments and feedback are welcome about how to improve it and whether it's valuable for users. was (Author: zjffdu): I have implemented a POC for this feature. 
Here's one simple command showing how to use virtualenv in pyspark {code} bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" ~/work/virtualenv/spark.py {code} There are 4 properties that need to be set * spark.pyspark.virtualenv.enabled (enable virtualenv) * spark.pyspark.virtualenv.type (default/conda are supported, default is native) * spark.pyspark.virtualenv.requirements (requirement file for the dependencies) * spark.pyspark.virtualenv.path (path to the executable for virtualenv/conda) Comments and feedback are welcome about how to improve it and whether it's valuable for users. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
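[Editor's note] The four properties from the POC can be collected as a plain configuration mapping; a minimal sketch (the property names come from the comment above, while the requirement-file and interpreter paths are placeholders, not real paths):

```python
# Hypothetical sketch: the spark.pyspark.virtualenv.* property names are taken
# from the POC described in the comment; the two paths are placeholders.
virtualenv_conf = {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "conda",  # "default"/"conda" supported
    "spark.pyspark.virtualenv.requirements": "/path/to/conda.txt",
    "spark.pyspark.virtualenv.path": "/path/to/conda",
}

def as_submit_args(conf):
    # Render the mapping as the --conf flags shown in the spark-submit command.
    return [arg for k, v in sorted(conf.items()) for arg in ("--conf", f"{k}={v}")]
```

This is only a convenience for building the command line shown above, not part of the POC itself.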
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173228#comment-15173228 ] Jeff Zhang commented on SPARK-13587: I have implemented a POC for this feature. Here's one simple command showing how to use virtualenv in pyspark {code} bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" ~/work/virtualenv/spark.py {code} There are 4 properties that need to be set * spark.pyspark.virtualenv.enabled (enable virtualenv) * spark.pyspark.virtualenv.type (default/conda are supported, default is native) * spark.pyspark.virtualenv.requirements (requirement file for the dependencies) * spark.pyspark.virtualenv.path (path to the executable for virtualenv/conda) Comments and feedback are welcome about how to improve it and whether it's valuable for users. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. 
This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173228#comment-15173228 ] Jeff Zhang edited comment on SPARK-13587 at 3/1/16 5:02 AM: I have implemented a POC for this feature. Here's one simple command showing how to use virtualenv in pyspark {code} bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" ~/work/virtualenv/spark.py {code} There are 4 properties that need to be set * spark.pyspark.virtualenv.enabled (enable virtualenv) * spark.pyspark.virtualenv.type (default/conda are supported, default is native) * spark.pyspark.virtualenv.requirements (requirement file for the dependencies) * spark.pyspark.virtualenv.path (path to the executable for virtualenv/conda) Comments and feedback are welcome about how to improve it and whether it's valuable for users. was (Author: zjffdu): I have implemented a POC for this feature. 
Here's one simple command showing how to use virtualenv in pyspark {code} bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" ~/work/virtualenv/spark.py {code} There are 4 properties that need to be set * spark.pyspark.virtualenv.enabled (enable virtualenv) * spark.pyspark.virtualenv.type (default/conda are supported, default is native) * spark.pyspark.virtualenv.requirements (requirement file for the dependencies) * spark.pyspark.virtualenv.path (path to the executable for virtualenv/conda) Comments and feedback are welcome about how to improve it and whether it's valuable for users. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173226#comment-15173226 ] Xiao Li commented on SPARK-12720: - [~yhuai] I made a few attempts. Unfortunately, the Expand it produces is different from the one GroupingSet forms. We need a separate function for multi-distinct aggregation queries. : ( > SQL generation support for cube, rollup, and grouping set > - > > Key: SPARK-12720 > URL: https://issues.apache.org/jira/browse/SPARK-12720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiao Li > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
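[Editor's note] For readers unfamiliar with the Expand operator discussed above: a grouping-set query is evaluated by replicating every input row once per grouping set, with the columns outside that grouping set nulled out. A rough, simplified sketch of that behaviour (not Catalyst's actual implementation, which also tracks a grouping id):

```python
def expand(rows, columns, grouping_sets):
    # Replicate each input row once per grouping set; columns not in the
    # current grouping set are replaced by None (the SQL NULL analogue).
    out = []
    for row in rows:
        for gs in grouping_sets:
            out.append({c: (row[c] if c in gs else None) for c in columns})
    return out
```

The point of the comment above is that the Expand generated for multi-distinct aggregation does not match this grouping-set shape, so SQL generation cannot treat the two cases uniformly.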
[jira] [Created] (SPARK-13587) Support virtualenv in PySpark
Jeff Zhang created SPARK-13587: -- Summary: Support virtualenv in PySpark Key: SPARK-13587 URL: https://issues.apache.org/jira/browse/SPARK-13587 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Jeff Zhang Currently, it's not easy for users to add third-party Python packages in PySpark. * One way is to use --py-files (suitable for a simple dependency, but not for a complicated dependency, especially one with transitive dependencies) * Another way is to install packages manually on each node (time-wasting, and not easy to switch between environments) Python now has 2 different virtualenv implementations: one is native virtualenv, the other is through conda. This JIRA is trying to bring these 2 tools to the distributed environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0
[ https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173224#comment-15173224 ] Devaraj K commented on SPARK-13117: --- I think we can start the Jetty server with "0.0.0.0" as the default value and have it pick up the configured value of SPARK_PUBLIC_DNS if it is set. This would change only the Web UI and wouldn't impact anything else. The changes would be something like the below, {code:xml} protected val publicHostName = Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse("0.0.0.0") {code} {code:xml} try { serverInfo = Some(startJettyServer(publicHostName, port, sslOptions, handlers, conf, name)) logInfo("Started %s at http://%s:%d".format(className, publicHostName, boundPort)) } catch { {code} [~srowen], any suggestions? Thanks > WebUI should use the local ip not 0.0.0.0 > - > > Key: SPARK-13117 > URL: https://issues.apache.org/jira/browse/SPARK-13117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jeremiah Jordan >Priority: Minor > > When SPARK_LOCAL_IP is set everything seems to correctly bind and use that IP > except the WebUI. The WebUI should use the SPARK_LOCAL_IP not always use > 0.0.0.0 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
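[Editor's note] The fallback proposed above (bind address taken from SPARK_PUBLIC_DNS when set, "0.0.0.0" otherwise) behaves like this small Python sketch of the Scala `Option(...).getOrElse(...)` idiom; the function name is made up for illustration:

```python
import os

def public_host_name(env=None):
    # Mirrors the proposed Scala line:
    #   Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse("0.0.0.0")
    # A missing environment variable falls back to binding on all interfaces.
    env = os.environ if env is None else env
    return env.get("SPARK_PUBLIC_DNS", "0.0.0.0")
```

The interesting design question, raised in the linked PR discussion, is what happens to access via localhost/127.0.0.1 once the server binds to a single configured address instead of all interfaces.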
[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0
[ https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173218#comment-15173218 ] Devaraj K commented on SPARK-13117: --- [~jaypanicker], the proposed PR does the same but there is a problem while accessing web UI using the localhost or 127.0.0.1. Please have look into this comment https://github.com/apache/spark/pull/11133#issuecomment-188937933. > WebUI should use the local ip not 0.0.0.0 > - > > Key: SPARK-13117 > URL: https://issues.apache.org/jira/browse/SPARK-13117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jeremiah Jordan >Priority: Minor > > When SPARK_LOCAL_IP is set everything seems to correctly bind and use that IP > except the WebUI. The WebUI should use the SPARK_LOCAL_IP not always use > 0.0.0.0 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13117) WebUI should use the local ip not 0.0.0.0
[ https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173214#comment-15173214 ] Jay Panicker edited comment on SPARK-13117 at 3/1/16 4:40 AM: -- On systems with multiple interfaces, ability to select the bind ip will be nice, instead of binding to all. Actually, the code has everything needed to do it, except one change. Relevant lines from spark-1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala: {code} ... ... protected val publicHostName = Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName) ... .. try { serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name)) logInfo("Started %s at http://%s:%d".format(className, publicHostName, boundPort)) } {code} Note the "0.0.0.0", even though publicHostName is available as a configuration option. Making the following change and an export SPARK_PUBLIC_DNS= in spark-config.sh solved the problem: serverInfo = Some(startJettyServer(publicHostName, port, handlers, conf, name)) was (Author: jaypanicker): On systems with multiple interfaces, ability to select the bind ip will be nice, instead of binding to all. Actually, the code has everything needed to do it, except one change. Relevant lines from spark-1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala: ... ... protected val publicHostName = Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName) ... .. try { serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name)) logInfo("Started %s at http://%s:%d".format(className, publicHostName, boundPort)) } Note the "0.0.0.0", even though publicHostName is available as a configuration option. 
Making the following change and an export SPARK_PUBLIC_DNS= in spark-config.sh solved the problem: serverInfo = Some(startJettyServer(publicHostName, port, handlers, conf, name)) > WebUI should use the local ip not 0.0.0.0 > - > > Key: SPARK-13117 > URL: https://issues.apache.org/jira/browse/SPARK-13117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jeremiah Jordan >Priority: Minor > > When SPARK_LOCAL_IP is set everything seems to correctly bind and use that IP > except the WebUI. The WebUI should use the SPARK_LOCAL_IP not always use > 0.0.0.0 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13117) WebUI should use the local ip not 0.0.0.0
[ https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173214#comment-15173214 ] Jay Panicker edited comment on SPARK-13117 at 3/1/16 4:40 AM: -- On systems with multiple interfaces, ability to select the bind ip will be nice, instead of binding to all. Actually, the code has everything needed to do it, except one change. Relevant lines from spark-1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala: {code} ... ... protected val publicHostName = Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName) ... .. try { serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name)) logInfo("Started %s at http://%s:%d".format(className, publicHostName, boundPort)) } {code} Note the "0.0.0.0", even though publicHostName is available as a configuration option. Making the following change and an export SPARK_PUBLIC_DNS= in spark-config.sh solved the problem: {code} serverInfo = Some(startJettyServer(publicHostName, port, handlers, conf, name)) {code} was (Author: jaypanicker): On systems with multiple interfaces, ability to select the bind ip will be nice, instead of binding to all. Actually, the code has everything needed to do it, except one change. Relevant lines from spark-1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala: {code} ... ... protected val publicHostName = Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName) ... .. try { serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name)) logInfo("Started %s at http://%s:%d".format(className, publicHostName, boundPort)) } {code} Note the "0.0.0.0", even though publicHostName is available as a configuration option. 
Making the following change and an export SPARK_PUBLIC_DNS= in spark-config.sh solved the problem: serverInfo = Some(startJettyServer(publicHostName, port, handlers, conf, name)) > WebUI should use the local ip not 0.0.0.0 > - > > Key: SPARK-13117 > URL: https://issues.apache.org/jira/browse/SPARK-13117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jeremiah Jordan >Priority: Minor > > When SPARK_LOCAL_IP is set everything seems to correctly bind and use that IP > except the WebUI. The WebUI should use the SPARK_LOCAL_IP not always use > 0.0.0.0 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0
[ https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173214#comment-15173214 ] Jay Panicker commented on SPARK-13117: -- On systems with multiple interfaces, the ability to select the bind IP would be nice, instead of binding to all. Actually, the code has everything needed to do it, except one change. Relevant lines from spark-1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala: {code} ... protected val publicHostName = Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName) ... try { serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name)) logInfo("Started %s at http://%s:%d".format(className, publicHostName, boundPort)) } {code} Note the "0.0.0.0", even though publicHostName is available as a configuration option. Making the following change and adding an export SPARK_PUBLIC_DNS= in spark-config.sh solved the problem: {code} serverInfo = Some(startJettyServer(publicHostName, port, handlers, conf, name)) {code} > WebUI should use the local ip not 0.0.0.0 > - > > Key: SPARK-13117 > URL: https://issues.apache.org/jira/browse/SPARK-13117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jeremiah Jordan >Priority: Minor > > When SPARK_LOCAL_IP is set everything seems to correctly bind and use that IP > except the WebUI. The WebUI should use the SPARK_LOCAL_IP not always use > 0.0.0.0 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173163#comment-15173163 ] Gayathri Murali edited comment on SPARK-6160 at 3/1/16 3:57 AM: [~josephkb] How should the test stat results be persisted? Option 1: persist the test stats as a data member of the ChiSqSelectorModel class, in which case they need to be initialized through the constructor. This would mean modifying the code wherever objects of the class are instantiated. Option 2: create an auxiliary class that just stores the test stat results. was (Author: gayathrimurali): [~josephkb] Should the test statistics result be stored as a text/parquet file? or Can it just be stored in a local array? > ChiSqSelector should keep test statistic info > - > > Key: SPARK-6160 > URL: https://issues.apache.org/jira/browse/SPARK-6160 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > It is useful to have the test statistics explaining selected features, but > these data are thrown out when constructing the ChiSqSelectorModel. The data > are expensive to recompute, so the ChiSqSelectorModel should store and expose > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
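[Editor's note] Option 1 above amounts to threading the statistics through the model constructor so they survive model construction. A hypothetical illustration of the shape of that change (class and field names are made up; the real ChiSqSelectorModel is Scala):

```python
class ChiSqSelectorModelSketch:
    # Hypothetical Option 1: the chi-squared test statistics ride alongside
    # the selected feature indices instead of being thrown away.
    def __init__(self, selected_features, test_statistics):
        self.selected_features = list(selected_features)
        self.test_statistics = list(test_statistics)
```

The cost noted in the comment is that every instantiation site must now supply the statistics; Option 2 (an auxiliary results class) avoids touching those call sites at the price of a second object to keep in sync.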
[jira] [Comment Edited] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173163#comment-15173163 ] Gayathri Murali edited comment on SPARK-6160 at 3/1/16 3:53 AM: [~josephkb] Should the test statistics result be stored as a text/parquet file? or Can it just be stored in a local array? was (Author: gayathrimurali): [~josephkb] Should the test statistics result be stored as a text/parquet file? or Can it just be stored in a local array? > ChiSqSelector should keep test statistic info > - > > Key: SPARK-6160 > URL: https://issues.apache.org/jira/browse/SPARK-6160 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > It is useful to have the test statistics explaining selected features, but > these data are thrown out when constructing the ChiSqSelectorModel. The data > are expensive to recompute, so the ChiSqSelectorModel should store and expose > them.
[jira] [Updated] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jeanlyn updated SPARK-13586: Priority: Minor (was: Major) > add config to skip generate down time batch when restart StreamingContext > - > > Key: SPARK-13586 > URL: https://issues.apache.org/jira/browse/SPARK-13586 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: jeanlyn >Priority: Minor > > If we restart a streaming job that uses checkpointing and has been stopped for hours, it > will generate a lot of batches into the queue, and it takes a while to > handle these batches. So I propose adding a config to control whether the > down-time batches are generated.
[jira] [Assigned] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13586: Assignee: Apache Spark > add config to skip generate down time batch when restart StreamingContext > - > > Key: SPARK-13586 > URL: https://issues.apache.org/jira/browse/SPARK-13586 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: jeanlyn >Assignee: Apache Spark > > If we restart a streaming job that uses checkpointing and has been stopped for hours, it > will generate a lot of batches into the queue, and it takes a while to > handle these batches. So I propose adding a config to control whether the > down-time batches are generated.
[jira] [Assigned] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13586: Assignee: (was: Apache Spark) > add config to skip generate down time batch when restart StreamingContext > - > > Key: SPARK-13586 > URL: https://issues.apache.org/jira/browse/SPARK-13586 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: jeanlyn > > If we restart a streaming job that uses checkpointing and has been stopped for hours, it > will generate a lot of batches into the queue, and it takes a while to > handle these batches. So I propose adding a config to control whether the > down-time batches are generated.
[jira] [Commented] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173177#comment-15173177 ] Apache Spark commented on SPARK-13586: -- User 'jeanlyn' has created a pull request for this issue: https://github.com/apache/spark/pull/11440 > add config to skip generate down time batch when restart StreamingContext > - > > Key: SPARK-13586 > URL: https://issues.apache.org/jira/browse/SPARK-13586 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: jeanlyn > > If we restart a streaming job that uses checkpointing and has been stopped for hours, it > will generate a lot of batches into the queue, and it takes a while to > handle these batches. So I propose adding a config to control whether the > down-time batches are generated.
[jira] [Created] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext
jeanlyn created SPARK-13586: --- Summary: add config to skip generate down time batch when restart StreamingContext Key: SPARK-13586 URL: https://issues.apache.org/jira/browse/SPARK-13586 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.6.0 Reporter: jeanlyn If we restart a streaming job that uses checkpointing and has been stopped for hours, it will generate a lot of batches into the queue, and it takes a while to handle these batches. So I propose adding a config to control whether the down-time batches are generated.
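For illustration, enabling such a flag from user code might look like the following. Note that the config key name here is purely hypothetical — the real key is whatever the PR for this issue introduces:
{code}
// Hypothetical key name for illustration only; see the PR for the actual config.
val conf = new SparkConf()
  .set("spark.streaming.skipDownTimeBatches", "true")
{code}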
[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173163#comment-15173163 ] Gayathri Murali commented on SPARK-6160: [~josephkb] Should the test statistics result be stored as a text/parquet file? or Can it just be stored in a local array? > ChiSqSelector should keep test statistic info > - > > Key: SPARK-6160 > URL: https://issues.apache.org/jira/browse/SPARK-6160 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > It is useful to have the test statistics explaining selected features, but > these data are thrown out when constructing the ChiSqSelectorModel. The data > are expensive to recompute, so the ChiSqSelectorModel should store and expose > them.
[jira] [Commented] (SPARK-12719) SQL generation support for generators (including UDTF)
[ https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173140#comment-15173140 ] Xiao Li commented on SPARK-12719: - Actually, I asked my teammate to work on this. The PR is close to finished; he will submit it this week. : ) Thanks! > SQL generation support for generators (including UDTF) > -- > > Key: SPARK-12719 > URL: https://issues.apache.org/jira/browse/SPARK-12719 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details.
[jira] [Commented] (SPARK-12721) SQL generation support for script transformation
[ https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173138#comment-15173138 ] Xiao Li commented on SPARK-12721: - Sorry, this is delayed until SPARK-13535 is resolved. Thanks! > SQL generation support for script transformation > > > Key: SPARK-12721 > URL: https://issues.apache.org/jira/browse/SPARK-12721 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details.
[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173133#comment-15173133 ] Xiao Li commented on SPARK-12720: - Yeah. This is what I hope. Let me use this PR to do a quick proof of concept. Will submit a PR for you to review. Maybe tonight. : ) > SQL generation support for cube, rollup, and grouping set > - > > Key: SPARK-12720 > URL: https://issues.apache.org/jira/browse/SPARK-12720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiao Li > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details.
[jira] [Updated] (SPARK-13580) Driver makes no progress when Executor's akka thread exits due to OOM.
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated SPARK-13580: --- Summary: Driver makes no progress when Executor's akka thread exits due to OOM. (was: Driver makes no progress after failed to remove broadcast on Executor) > Driver makes no progress when Executor's akka thread exits due to OOM. > -- > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, > stderrfiltered.txt.gz > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack.
[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173070#comment-15173070 ] Yin Huai commented on SPARK-12720: -- [~smilegator] Will the approach of handling expand in the PR be also applicable to handle multi-distinct aggregation queries? > SQL generation support for cube, rollup, and grouping set > - > > Key: SPARK-12720 > URL: https://issues.apache.org/jira/browse/SPARK-12720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiao Li > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details.
[jira] [Updated] (SPARK-13581) LibSVM throws MatchError
[ https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakob Odersky updated SPARK-13581: -- Description: When running an action on a DataFrame obtained by reading from a libsvm file a MatchError is thrown, however doing the same on a cached DataFrame works fine. {code} val df = sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") //file is in spark repository df.select(df("features")).show() //MatchError df.cache() df.select(df("features")).show() //OK {code} The exception stack trace is the following: {code} scala.MatchError: 1.0 (of class java.lang.Double) [info] at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207) [info] at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192) [info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142) [info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) [info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401) [info] at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) [info] at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56) {code} This issue first appeared in commit {{1dac964c1}}, in PR [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622. [~jeffzhang], do you have any insight of what could be going on? cc [~iyounus] was: When running an action on a DataFrame obtained by reading from a libsvm file a MatchError is thrown, however doing the same on a cached DataFrame works fine. 
{code} val df = sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") //file is df.select(df("features")).show() //MatchError df.cache() df.select(df("features")).show() //OK {code} The exception stack trace is the following: {code} scala.MatchError: 1.0 (of class java.lang.Double) [info] at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207) [info] at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192) [info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142) [info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) [info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401) [info] at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) [info] at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56) {code} This issue first appeared in commit {{1dac964c1}}, in PR [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622. [~jeffzhang], do you have any insight of what could be going on? cc [~iyounus] > LibSVM throws MatchError > > > Key: SPARK-13581 > URL: https://issues.apache.org/jira/browse/SPARK-13581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jakob Odersky >Assignee: Jeff Zhang >Priority: Minor > > When running an action on a DataFrame obtained by reading from a libsvm file > a MatchError is thrown, however doing the same on a cached DataFrame works > fine. 
> {code} > val df = > sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") > //file is in spark repository > df.select(df("features")).show() //MatchError > df.cache() > df.select(df("features")).show() //OK > {code} > The exception stack trace is the following: > {code} > scala.MatchError: 1.0 (of class java.lang.Double) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$
[jira] [Commented] (SPARK-13581) LibSVM throws MatchError
[ https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173058#comment-15173058 ] Jakob Odersky commented on SPARK-13581: --- It's in spark "data/mllib/sample_libsvm_data.txt" > LibSVM throws MatchError > > > Key: SPARK-13581 > URL: https://issues.apache.org/jira/browse/SPARK-13581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jakob Odersky >Assignee: Jeff Zhang >Priority: Minor > > When running an action on a DataFrame obtained by reading from a libsvm file > a MatchError is thrown, however doing the same on a cached DataFrame works > fine. > {code} > val df = > sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") > //file is > df.select(df("features")).show() //MatchError > df.cache() > df.select(df("features")).show() //OK > {code} > The exception stack trace is the following: > {code} > scala.MatchError: 1.0 (of class java.lang.Double) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56) > {code} > This issue first appeared in commit {{1dac964c1}}, in PR > [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622. 
> [~jeffzhang], do you have any insight of what could be going on? > cc [~iyounus]
[jira] [Commented] (SPARK-13581) LibSVM throws MatchError
[ https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173052#comment-15173052 ] Jeff Zhang commented on SPARK-13581: [~jodersky] Can you attach the data file? I guess it is small. > LibSVM throws MatchError > > > Key: SPARK-13581 > URL: https://issues.apache.org/jira/browse/SPARK-13581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jakob Odersky >Assignee: Jeff Zhang >Priority: Minor > > When running an action on a DataFrame obtained by reading from a libsvm file > a MatchError is thrown, however doing the same on a cached DataFrame works > fine. > {code} > val df = > sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") > //file is > df.select(df("features")).show() //MatchError > df.cache() > df.select(df("features")).show() //OK > {code} > The exception stack trace is the following: > {code} > scala.MatchError: 1.0 (of class java.lang.Double) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56) > {code} > This issue first appeared in commit {{1dac964c1}}, in PR > [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622. 
> [~jeffzhang], do you have any insight of what could be going on? > cc [~iyounus]
[jira] [Commented] (SPARK-13583) Support `UnusedImports` Java checkstyle rule
[ https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173040#comment-15173040 ] Dongjoon Hyun commented on SPARK-13583: --- I changed the content of this issue since `lint-java` is not executed automatically by Jenkins as of today. Thanks, [~zsxwing] > Support `UnusedImports` Java checkstyle rule > > > Key: SPARK-13583 > URL: https://issues.apache.org/jira/browse/SPARK-13583 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Priority: Trivial > > After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review > by saving much time. > This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` > rule to `checkstyle.xml` and fixing all existing unused imports. > {code:title=checkstyle.xml|borderStyle=solid} > + <module name="UnusedImports"/> > {code} > Unfortunately, `dev/lint-java` is not tested by Jenkins. ( > https://github.com/apache/spark/blob/master/dev/run-tests.py#L546 ) > This will also help Spark contributors to check by themselves before > submitting their PRs.
[jira] [Updated] (SPARK-13583) Support `UnusedImports` Java checkstyle rule
[ https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-13583: -- Description: After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time. This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` rule to `checkstyle.xml` and fixing all existing unused imports. {code:title=checkstyle.xml|borderStyle=solid} + <module name="UnusedImports"/> {code} Unfortunately, `dev/lint-java` is not tested by Jenkins. ( https://github.com/apache/spark/blob/master/dev/run-tests.py#L546 ) This will also help Spark contributors to check by themselves before submitting their PRs. was: After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time. This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` rule to `checkstyle.xml` and fixing all existing unused imports. {code:title=checkstyle.xml|borderStyle=solid} + <module name="UnusedImports"/> {code} This will also prevent the upcoming PR from having unused imports. Summary: Support `UnusedImports` Java checkstyle rule (was: Enforce `UnusedImports` Java checkstyle rule) > Support `UnusedImports` Java checkstyle rule > > > Key: SPARK-13583 > URL: https://issues.apache.org/jira/browse/SPARK-13583 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Priority: Trivial > > After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review > by saving much time. > This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` > rule to `checkstyle.xml` and fixing all existing unused imports. > {code:title=checkstyle.xml|borderStyle=solid} > + <module name="UnusedImports"/> > {code} > Unfortunately, `dev/lint-java` is not tested by Jenkins. ( > https://github.com/apache/spark/blob/master/dev/run-tests.py#L546 ) > This will also help Spark contributors to check by themselves before > submitting their PRs. 
[jira] [Created] (SPARK-13585) addPyFile behavior change between 1.6 and before
Santhosh Gorantla Ramakrishna created SPARK-13585: - Summary: addPyFile behavior change between 1.6 and before Key: SPARK-13585 URL: https://issues.apache.org/jira/browse/SPARK-13585 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.0 Reporter: Santhosh Gorantla Ramakrishna Priority: Minor addPyFile in earlier versions would remove the .py file if it already existed. In 1.6, it throws an exception "__.py exists and does not match contents of __.py". This might be because the underlying Scala code takes an overwrite parameter, which is defaulted to false when called from Python:
{code}
private def copyFile(
    url: String,
    sourceFile: File,
    destFile: File,
    fileOverwrite: Boolean,
    removeSourceFile: Boolean = false): Unit = {
{code}
It would be good if addPyFile took a parameter to set the overwrite, defaulting to false.
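A minimal sketch of the guard that produces the reported behavior follows. The parameter names mirror the copyFile signature quoted above, but the body is illustrative only — it is not the actual Spark Utils implementation:
{code}
import java.io.File
import java.nio.file.Files

object CopyFileSketch {
  // Illustrative overwrite check: if the destination exists with different
  // contents and overwriting is not allowed, fail; otherwise replace it.
  def copyFile(sourceFile: File, destFile: File, fileOverwrite: Boolean): Unit = {
    if (destFile.exists()) {
      val same = Files.readAllBytes(sourceFile.toPath)
        .sameElements(Files.readAllBytes(destFile.toPath))
      if (same) return // identical contents: nothing to do
      if (!fileOverwrite) {
        throw new IllegalStateException(
          s"File $destFile exists and does not match contents of $sourceFile")
      }
      destFile.delete()
    }
    Files.copy(sourceFile.toPath, destFile.toPath)
  }
}
{code}
With fileOverwrite hard-wired to false on the addPyFile path, any changed file re-added under the same name hits the exception branch, which matches the reported 1.6 behavior.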
[jira] [Comment Edited] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172786#comment-15172786 ] Shixiong Zhu edited comment on SPARK-13580 at 3/1/16 12:06 AM: --- It happens in an Akka thread. You can add {{-XX:OnOutOfMemoryError='kill %p'}} to the executor java options to force the executor to exit on OOM. Or just upgrade to 1.6.0, which doesn't use Akka any more. was (Author: zsxwing): It hapens in an Akka thread. You can add {{-XX:OnOutOfMemoryError='kill -9 %p'}} to the executor java options to force the executor exit for OOM Or just upgrade to 1.6.0 which doesn't use Akka any more. > Driver makes no progress after failed to remove broadcast on Executor > - > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, > stderrfiltered.txt.gz > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack.
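Concretely, the suggested executor option can be set through {{spark.executor.extraJavaOptions}}, which is a standard Spark configuration key; the placement in spark-defaults.conf shown here is one common form, and exact quoting may vary by launcher:
{code}
# in conf/spark-defaults.conf
spark.executor.extraJavaOptions  -XX:OnOutOfMemoryError='kill %p'
{code}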
[jira] [Closed] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu closed SPARK-13580. Resolution: Not A Bug > Driver makes no progress after failed to remove broadcast on Executor > - > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, > stderrfiltered.txt.gz > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack.
[jira] [Assigned] (SPARK-13584) ContinuousQueryManagerSuite floods the logs with garbage
[ https://issues.apache.org/jira/browse/SPARK-13584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13584: Assignee: (was: Apache Spark) > ContinuousQueryManagerSuite floods the logs with garbage > > > Key: SPARK-13584 > URL: https://issues.apache.org/jira/browse/SPARK-13584 > Project: Spark > Issue Type: Test >Reporter: Shixiong Zhu > > We should clean up the following outputs > {code} > [info] ContinuousQueryManagerSuite: > 16:30:20.473 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 > in stage 0.0 (TID 1) > java.lang.ArithmeticException: / by zero > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303) > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303) > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308) > at scala.collection.AbstractIterator.to(Iterator.scala:1194) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287) > at 
scala.collection.AbstractIterator.toArray(Iterator.scala:1194) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > 16:30:20.506 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in > stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303) > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303) > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308) > at 
scala.collection.AbstractIterator.to(Iterator.scala:1194) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1194) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847) > at > org.apache.spark.SparkContext$$anonfun$ru
[jira] [Commented] (SPARK-13584) ContinuousQueryManagerSuite floods the logs with garbage
[ https://issues.apache.org/jira/browse/SPARK-13584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172910#comment-15172910 ] Apache Spark commented on SPARK-13584: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11439
[jira] [Assigned] (SPARK-13584) ContinuousQueryManagerSuite floods the logs with garbage
[ https://issues.apache.org/jira/browse/SPARK-13584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13584: Assignee: Apache Spark
[jira] [Created] (SPARK-13584) ContinuousQueryManagerSuite floods the logs with garbage
Shixiong Zhu created SPARK-13584: Summary: ContinuousQueryManagerSuite floods the logs with garbage Key: SPARK-13584 URL: https://issues.apache.org/jira/browse/SPARK-13584 Project: Spark Issue Type: Test Reporter: Shixiong Zhu We should clean up the following outputs {code} [info] ContinuousQueryManagerSuite: 16:30:20.473 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1) java.lang.ArithmeticException: / by zero at org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303) ... 16:30:20.506 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero ... {code}
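A common way to keep such intentionally-thrown test exceptions out of the console is to drop the root log level while the noisy block runs. A minimal sketch against the log4j 1.x API that Spark of this era bundles — the `quietly` helper name is hypothetical, not Spark's actual test utility:

```scala
import org.apache.log4j.{Level, Logger}

// Run `body` with all log output suppressed, restoring the previous level after.
def quietly[T](body: => T): T = {
  val rootLogger = Logger.getRootLogger
  val oldLevel = rootLogger.getLevel
  rootLogger.setLevel(Level.OFF)
  try body
  finally rootLogger.setLevel(oldLevel)
}
```

A test that deliberately divides by zero to exercise failure handling could then wrap the offending action, e.g. `quietly { intercept[SparkException](df.collect()) }`, so the executor's expected stack traces never flood the suite output. In local mode the executor shares the JVM, so the root logger level covers its threads too.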
[jira] [Assigned] (SPARK-13583) Enforce `UnusedImports` Java checkstyle rule
[ https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13583: Assignee: (was: Apache Spark) > Enforce `UnusedImports` Java checkstyle rule > > > Key: SPARK-13583 > URL: https://issues.apache.org/jira/browse/SPARK-13583 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Priority: Trivial > > After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review > by saving much time. > This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` > rule to `checkstyle.xml` and fixing all existing unused imports. > {code:title=checkstyle.xml|borderStyle=solid} > + > {code} > This will also prevent the upcoming PR from having unused imports. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13583) Enforce `UnusedImports` Java checkstyle rule
[ https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172882#comment-15172882 ] Apache Spark commented on SPARK-13583: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11438 > Enforce `UnusedImports` Java checkstyle rule > > > Key: SPARK-13583 > URL: https://issues.apache.org/jira/browse/SPARK-13583 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Priority: Trivial > > After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review > by saving much time. > This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` > rule to `checkstyle.xml` and fixing all existing unused imports. > {code:title=checkstyle.xml|borderStyle=solid} > + > {code} > This will also prevent the upcoming PR from having unused imports. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172881#comment-15172881 ] Zhong Wang edited comment on SPARK-13337 at 2/29/16 11:40 PM: -- It doesn't help in my case, because it doesn't support null-safe joins. It would be great if there were an interface like: {code} def join(right: DataFrame, usingColumns: Seq[String], joinType: String, nullSafe: Boolean): DataFrame {code} The current join-using-columns interface works great if the joined tables don't contain null values: it can eliminate the null columns generated from outer joins automatically. The general join methods in your example support null-safe joins perfectly, but they cannot automatically eliminate the null columns, which are generated from outer joins. Sorry that it is a little bit complicated here. Please let me know if you need a concrete example. was (Author: zwang): It doesn't help in my case, because it doesn't support null-safe joins. It would be great if there is an interface like: {code} def join(right: DataFrame, usingColumns: Seq[String], joinType: String, nullSafe:Boolean): DataFrame {code} It works great if the joining tables doesn't contain null values: it can eliminate the null columns generated from outer joins automatically. The general joining methods in your example support null-safe joins perfectly, but it cannot automatically eliminate the null columns, which are generated from outer joins. Sorry that it is a little bit complicated here. Please let me know if you need a concrete example. 
> DataFrame join-on-columns function should support null-safe equal > - > > Key: SPARK-13337 > URL: https://issues.apache.org/jira/browse/SPARK-13337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Zhong Wang >Priority: Minor > > Currently, the join-on-columns function: > {code} > def join(right: DataFrame, usingColumns: Seq[String], joinType: String): > DataFrame > {code} > performs a null-insafe join. It would be great if there is an option for > null-safe join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
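Until such an option exists, the null-safe comparison can already be spelled out with the expression-based join form, since `Column` has a null-safe equality operator (`<=>`). A sketch against the 1.6-era DataFrame API — `df1`, `df2`, and the column names `k`, `v1`, `v2` are hypothetical:

```scala
import org.apache.spark.sql.functions.coalesce

// Null-safe outer join on column "k": NULL <=> NULL evaluates to true,
// so rows whose keys are both null still match each other.
val joined = df1.join(df2, df1("k") <=> df2("k"), "outer")
  // Emulate what join-using-columns does for the join key: collapse the
  // two "k" columns into one, preferring whichever side is non-null.
  .select(coalesce(df1("k"), df2("k")).as("k"), df1("v1"), df2("v2"))
```

This reproduces the column deduplication that the join-using-columns form performs automatically, at the cost of listing the output columns by hand — which is exactly the convenience the proposed `nullSafe` flag would preserve.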
[jira] [Assigned] (SPARK-13583) Enforce `UnusedImports` Java checkstyle rule
[ https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13583: Assignee: Apache Spark > Enforce `UnusedImports` Java checkstyle rule > > > Key: SPARK-13583 > URL: https://issues.apache.org/jira/browse/SPARK-13583 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Trivial > > After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review > by saving much time. > This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` > rule to `checkstyle.xml` and fixing all existing unused imports. > {code:title=checkstyle.xml|borderStyle=solid} > + > {code} > This will also prevent the upcoming PR from having unused imports. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172881#comment-15172881 ] Zhong Wang commented on SPARK-13337: It doesn't help in my case, because it doesn't support null-safe joins. It would be great if there were an interface like: {code} def join(right: DataFrame, usingColumns: Seq[String], joinType: String, nullSafe: Boolean): DataFrame {code} It works great if the joined tables don't contain null values: it can eliminate the null columns generated from outer joins automatically. The general join methods in your example support null-safe joins perfectly, but they cannot automatically eliminate the null columns, which are generated from outer joins. Sorry that it is a little bit complicated here. Please let me know if you need a concrete example. > DataFrame join-on-columns function should support null-safe equal > - > > Key: SPARK-13337 > URL: https://issues.apache.org/jira/browse/SPARK-13337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Zhong Wang >Priority: Minor > > Currently, the join-on-columns function: > {code} > def join(right: DataFrame, usingColumns: Seq[String], joinType: String): > DataFrame > {code} > performs a null-insafe join. It would be great if there is an option for > null-safe join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13583) Enforce `UnusedImports` Java checkstyle rule
Dongjoon Hyun created SPARK-13583: - Summary: Enforce `UnusedImports` Java checkstyle rule Key: SPARK-13583 URL: https://issues.apache.org/jira/browse/SPARK-13583 Project: Spark Issue Type: Task Reporter: Dongjoon Hyun Priority: Trivial After SPARK-6990, `dev/lint-java` keeps the Java code healthy and saves a lot of PR review time. This issue aims to enforce the `UnusedImports` rule by adding it to `checkstyle.xml` and fixing all existing unused imports. {code:title=checkstyle.xml|borderStyle=solid} + <module name="UnusedImports"/> {code} This will also prevent upcoming PRs from introducing unused imports. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13582) Improve performance of parquet reader with dictionary encoding
[ https://issues.apache.org/jira/browse/SPARK-13582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13582: Assignee: Apache Spark (was: Davies Liu) > Improve performance of parquet reader with dictionary encoding > -- > > Key: SPARK-13582 > URL: https://issues.apache.org/jira/browse/SPARK-13582 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > Right now, we replace the ids with value from a dictionary before accessing a > column. We could defer that, especially when some rows are filtered out, we > will not lookup this dictionary for those rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13582) Improve performance of parquet reader with dictionary encoding
[ https://issues.apache.org/jira/browse/SPARK-13582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13582: Assignee: Davies Liu (was: Apache Spark) > Improve performance of parquet reader with dictionary encoding > -- > > Key: SPARK-13582 > URL: https://issues.apache.org/jira/browse/SPARK-13582 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we replace the ids with value from a dictionary before accessing a > column. We could defer that, especially when some rows are filtered out, we > will not lookup this dictionary for those rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13582) Improve performance of parquet reader with dictionary encoding
[ https://issues.apache.org/jira/browse/SPARK-13582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172841#comment-15172841 ] Apache Spark commented on SPARK-13582: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11437 > Improve performance of parquet reader with dictionary encoding > -- > > Key: SPARK-13582 > URL: https://issues.apache.org/jira/browse/SPARK-13582 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we replace the ids with value from a dictionary before accessing a > column. We could defer that, especially when some rows are filtered out, we > will not lookup this dictionary for those rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13582) Improve performance of parquet reader with dictionary encoding
Davies Liu created SPARK-13582: -- Summary: Improve performance of parquet reader with dictionary encoding Key: SPARK-13582 URL: https://issues.apache.org/jira/browse/SPARK-13582 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Right now, we replace the ids with values from a dictionary before accessing a column. We could defer that: in particular, when some rows are filtered out, we will not need to look up the dictionary for those rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
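The deferral the issue describes can be illustrated outside Parquet: a dictionary-encoded column stores small integer ids plus a lookup table of distinct values, and decoding can happen per surviving row instead of up front. A toy sketch — deliberately not Spark's actual reader code:

```scala
// Toy model of a dictionary-encoded string column: ids index into dict.
val dict = Array("ham", "spam", "eggs")
val ids  = Array(1, 0, 1, 2, 1)

// Eager decoding (current behaviour): materialize every value, then filter.
val eager = ids.map(dict(_)).filter(_ == "spam")

// Deferred decoding (the proposal): translate the predicate to an id with a
// single dictionary probe, filter on the ids, and decode only the survivors.
val spamId   = dict.indexOf("spam")
val deferred = ids.filter(_ == spamId).map(dict(_))

// Both yield the same rows, but the deferred form never decodes the rows
// that the filter throws away.
assert(eager.sameElements(deferred))
```

For selective filters the saving grows with the fraction of rows discarded, since the dictionary lookup per filtered-out row disappears entirely.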
[jira] [Issue Comment Deleted] (SPARK-13571) Track current database in SQL/HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13571: -- Comment: was deleted (was: User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11433) > Track current database in SQL/HiveContext > - > > Key: SPARK-13571 > URL: https://issues.apache.org/jira/browse/SPARK-13571 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > We already have internal APIs for Hive to do this. We should do it for > SQLContext too so we can merge these code paths one day. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13571) Track current database in SQL/HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172825#comment-15172825 ] Apache Spark commented on SPARK-13571: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11433 > Track current database in SQL/HiveContext > - > > Key: SPARK-13571 > URL: https://issues.apache.org/jira/browse/SPARK-13571 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > We already have internal APIs for Hive to do this. We should do it for > SQLContext too so we can merge these code paths one day. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13581) LibSVM throws MatchError
[ https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13581: -- Assignee: Jeff Zhang I guess the input it's hoping to treat as a vector type is just a double from the SVM input. As to why, I don't know; seems like a legit bug though. > LibSVM throws MatchError > > > Key: SPARK-13581 > URL: https://issues.apache.org/jira/browse/SPARK-13581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jakob Odersky >Assignee: Jeff Zhang >Priority: Minor > > When running an action on a DataFrame obtained by reading from a libsvm file > a MatchError is thrown, however doing the same on a cached DataFrame works > fine. > {code} > val df = > sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") > //file is > df.select(df("features")).show() //MatchError > df.cache() > df.select(df("features")).show() //OK > {code} > The exception stack trace is the following: > {code} > scala.MatchError: 1.0 (of class java.lang.Double) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56) > {code} > This issue first appeared in commit {{1dac964c1}}, in PR > 
[#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622. > [~jeffzhang], do you have any insight of what could be going on? > cc [~iyounus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172810#comment-15172810 ] Liyin Tang commented on SPARK-13580: Thanks [~zsxwing] for the investigation! That's very helpful ! > Driver makes no progress after failed to remove broadcast on Executor > - > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, > stderrfiltered.txt.gz > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172796#comment-15172796 ] Xiao Li commented on SPARK-13337: - Sorry, I do not get your point. Join-using-columns does not help in your case, right? It just removes the overlapping columns but it does not filter the values in the results. > DataFrame join-on-columns function should support null-safe equal > - > > Key: SPARK-13337 > URL: https://issues.apache.org/jira/browse/SPARK-13337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Zhong Wang >Priority: Minor > > Currently, the join-on-columns function: > {code} > def join(right: DataFrame, usingColumns: Seq[String], joinType: String): > DataFrame > {code} > performs a null-insafe join. It would be great if there is an option for > null-safe join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172786#comment-15172786 ] Shixiong Zhu commented on SPARK-13580: -- It happens in an Akka thread. You can add {{-XX:OnOutOfMemoryError='kill -9 %p'}} to the executor Java options to force the executor to exit on OOM. Or just upgrade to 1.6.0, which doesn't use Akka any more. > Driver makes no progress after failed to remove broadcast on Executor > - > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, > stderrfiltered.txt.gz > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
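The suggested executor option can be passed through the standard {{spark.executor.extraJavaOptions}} setting; a sketch of the spark-submit invocation (the class and jar names are placeholders):

```shell
# Kill the executor JVM outright on OOM instead of leaving it wedged in
# an Akka thread; the single quotes around 'kill -9 %p' must survive
# shell parsing, hence the outer double quotes.
spark-submit \
  --class com.example.StreamingApp \
  --conf "spark.executor.extraJavaOptions=-XX:OnOutOfMemoryError='kill -9 %p'" \
  streaming-app.jar
```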
[jira] [Created] (SPARK-13581) LibSVM throws MatchError
Jakob Odersky created SPARK-13581: - Summary: LibSVM throws MatchError Key: SPARK-13581 URL: https://issues.apache.org/jira/browse/SPARK-13581 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Jakob Odersky Priority: Minor When running an action on a DataFrame obtained by reading from a libsvm file, a MatchError is thrown; however, doing the same on a cached DataFrame works fine. {code} val df = sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") //file is df.select(df("features")).show() //MatchError df.cache() df.select(df("features")).show() //OK {code} The exception stack trace is the following: {code} scala.MatchError: 1.0 (of class java.lang.Double) [info] at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207) [info] at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192) [info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142) [info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) [info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401) [info] at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) [info] at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56) {code} This issue first appeared in commit {{1dac964c1}}, in PR [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622. [~jeffzhang], do you have any insight into what could be going on? cc [~iyounus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13430) Expose ml summary function in PySpark for classification and regression models
[ https://issues.apache.org/jira/browse/SPARK-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172775#comment-15172775 ] Bryan Cutler commented on SPARK-13430: -- I can work on adding this > Expose ml summary function in PySpark for classification and regression models > -- > > Key: SPARK-13430 > URL: https://issues.apache.org/jira/browse/SPARK-13430 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Reporter: Shubhanshu Mishra > Labels: classification, java, ml, mllib, pyspark, regression, > scala, sparkr > > I think model summary interface which is available in Spark's scala, Java and > R interfaces should also be available in the python interface. > Similar to #SPARK-11494 > https://issues.apache.org/jira/browse/SPARK-11494 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12817) Remove CacheManager and replace it with new BlockManager.getOrElseUpdate method
[ https://issues.apache.org/jira/browse/SPARK-12817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172773#comment-15172773 ] Apache Spark commented on SPARK-12817: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11436 > Remove CacheManager and replace it with new BlockManager.getOrElseUpdate > method > --- > > Key: SPARK-12817 > URL: https://issues.apache.org/jira/browse/SPARK-12817 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > > CacheManager directly calls MemoryStore.unrollSafely() and has its own logic > for handling graceful fallback to disk when cached data does not fit in > memory. However, this logic also exists inside of the MemoryStore itself, so > this appears to be unnecessary duplication. > Thanks to the addition of block-level read/write locks, we can refactor the > code to remove the CacheManager and replace it with an atomic getOrElseUpdate > BlockManager method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12817) Remove CacheManager and replace it with new BlockManager.getOrElseUpdate method
[ https://issues.apache.org/jira/browse/SPARK-12817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-12817: --- Description: CacheManager directly calls MemoryStore.unrollSafely() and has its own logic for graceful fallback to disk when cached data does not fit in memory. However, this logic also exists inside of the MemoryStore itself, so this appears to be unnecessary duplication. Thanks to the addition of block-level read/write locks, we can refactor the code to remove the CacheManager and replace it with an atomic getOrElseUpdate BlockManager method. was: CacheManager directly calls MemoryStore.unrollSafely() and has its own logic for handling graceful fallback to disk when cached data does not fit in memory. However, this logic also exists inside of the MemoryStore itself, so this appears to be unnecessary duplication. We can remove this duplication and delete a significant amount of BlockManager code which existed only to support this CacheManager code. > Remove CacheManager and replace it with new BlockManager.getOrElseUpdate > method > --- > > Key: SPARK-12817 > URL: https://issues.apache.org/jira/browse/SPARK-12817 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > > CacheManager directly calls MemoryStore.unrollSafely() and has its own logic > for handling graceful fallback to disk when cached data does not fit in > memory. However, this logic also exists inside of the MemoryStore itself, so > this appears to be unnecessary duplication. > Thanks to the addition of block-level read/write locks, we can refactor the > code to remove the CacheManager and replace it with an atomic getOrElseUpdate > BlockManager method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
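The atomic getOrElseUpdate contract can be sketched in plain Scala (this {{BlockStore}} is a toy stand-in, not Spark's actual {{BlockManager}} API): the first caller for a key computes and caches the block, and any concurrent caller for the same key waits and reads the cached value rather than recomputing it.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Toy stand-in for the proposed BlockManager.getOrElseUpdate. The
// compute-if-absent step is atomic per key, which is the guarantee
// CacheManager previously provided by hand with per-block locks.
class BlockStore[K, V] {
  private val blocks = new ConcurrentHashMap[K, V]()

  def getOrElseUpdate(key: K)(compute: => V): V =
    blocks.computeIfAbsent(key, new JFunction[K, V] {
      def apply(k: K): V = compute // runs at most once per absent key
    })
}
```

A second lookup for the same key returns the cached value without invoking the compute block again, which is exactly the duplication the refactoring removes.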
[jira] [Updated] (SPARK-12817) Remove CacheManager and replace it with new BlockManager.getOrElseUpdate method
[ https://issues.apache.org/jira/browse/SPARK-12817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-12817: --- Summary: Remove CacheManager and replace it with new BlockManager.getOrElseUpdate method (was: Simplify CacheManager code and remove unused BlockManager methods) > Remove CacheManager and replace it with new BlockManager.getOrElseUpdate > method > --- > > Key: SPARK-12817 > URL: https://issues.apache.org/jira/browse/SPARK-12817 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > > CacheManager directly calls MemoryStore.unrollSafely() and has its own logic > for handling graceful fallback to disk when cached data does not fit in > memory. However, this logic also exists inside of the MemoryStore itself, so > this appears to be unnecessary duplication. > We can remove this duplication and delete a significant amount of > BlockManager code which existed only to support this CacheManager code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172709#comment-15172709 ] Zhong Wang edited comment on SPARK-13337 at 2/29/16 10:05 PM: -- For an outer join, it is difficult to eliminate the null columns from the result, because the null columns can come from both tables. The `join-using-column` interface can automatically eliminate those columns, which is very convenient. Sorry that I missed this point in my last reply. was (Author: zwang): For an outer join, it is difficult to eliminate the null columns from the result. The `join-using-column` interface can automatically eliminate those columns, which is very convenient. Sorry that I missed this point in my last reply. > DataFrame join-on-columns function should support null-safe equal > - > > Key: SPARK-13337 > URL: https://issues.apache.org/jira/browse/SPARK-13337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Zhong Wang >Priority: Minor > > Currently, the join-on-columns function: > {code} > def join(right: DataFrame, usingColumns: Seq[String], joinType: String): > DataFrame > {code} > performs a null-unsafe join. It would be great if there were an option for > null-safe join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172709#comment-15172709 ] Zhong Wang commented on SPARK-13337: For an outer join, it is difficult to eliminate the null columns from the result. The `join-using-column` interface can automatically eliminate those columns, which is very convenient. Sorry that I missed this point in my last reply. > DataFrame join-on-columns function should support null-safe equal > - > > Key: SPARK-13337 > URL: https://issues.apache.org/jira/browse/SPARK-13337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Zhong Wang >Priority: Minor > > Currently, the join-on-columns function: > {code} > def join(right: DataFrame, usingColumns: Seq[String], joinType: String): > DataFrame > {code} > performs a null-unsafe join. It would be great if there were an option for > null-safe join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172689#comment-15172689 ] Shixiong Zhu commented on SPARK-13580: -- OOM on the executor side: Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkExecutor-akka.actor.default-dispatcher-22" > Driver makes no progress after failed to remove broadcast on Executor > - > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, > stderrfiltered.txt.gz > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172672#comment-15172672 ] Jingwei Lu commented on SPARK-13580: Attached the executor log for #11. [~zsxwing] > Driver makes no progress after failed to remove broadcast on Executor > - > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, > stderrfiltered.txt.gz > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jingwei Lu updated SPARK-13580: --- Attachment: stderrfiltered.txt.gz > Driver makes no progress after failed to remove broadcast on Executor > - > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, > stderrfiltered.txt.gz > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172658#comment-15172658 ] Shixiong Zhu edited comment on SPARK-13580 at 2/29/16 9:38 PM: --- Could you post the executor #11 log? was (Author: zsxwing): Could you post the executor log? > Driver makes no progress after failed to remove broadcast on Executor > - > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172658#comment-15172658 ] Shixiong Zhu commented on SPARK-13580: -- Could you post the executor log? > Driver makes no progress after failed to remove broadcast on Executor > - > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter
[ https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172654#comment-15172654 ] Steve Loughran commented on SPARK-10063: sorry! HADOOP-9565 > Remove DirectParquetOutputCommitter > --- > > Key: SPARK-10063 > URL: https://issues.apache.org/jira/browse/SPARK-10063 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > When we use DirectParquetOutputCommitter on S3 and speculation is enabled, > there is a chance that we can lose data. > Here is the code to reproduce the problem. > {code} > import org.apache.spark.sql.functions._ > val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: > Int, partitionId: Int, attemptNumber: Int) => { > if (partitionId == 0 && i == 5) { > if (attemptNumber > 0) { > Thread.sleep(15000) > throw new Exception("new exception") > } else { > Thread.sleep(1) > } > } > > i > }) > val df = sc.parallelize((1 to 100), 20).mapPartitions { iter => > val context = org.apache.spark.TaskContext.get() > val partitionId = context.partitionId > val attemptNumber = context.attemptNumber > iter.map(i => (i, partitionId, attemptNumber)) > }.toDF("i", "partitionId", "attemptNumber") > df > .select(failSpeculativeTask($"i", $"partitionId", > $"attemptNumber").as("i"), $"partitionId", $"attemptNumber") > .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter") > sqlContext.read.load("/home/yin/outputCommitter").count > // The result is 99 and 5 is missing from the output. > {code} > What happened is that the original task finishes first and uploads its output > file to S3, then the speculative task somehow fails. Because we have to call > output stream's close method, which uploads data to S3, we actually upload > the partial result generated by the failed speculative task to S3 and this > file overwrites the correct file generated by the original task.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
[ https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated SPARK-13580: --- Attachment: executor_jstack driver_log.txt driver_jstack.txt > Driver makes no progress after failed to remove broadcast on Executor > - > > Key: SPARK-13580 > URL: https://issues.apache.org/jira/browse/SPARK-13580 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liyin Tang > Attachments: driver_jstack.txt, driver_log.txt, executor_jstack > > > From Driver's log: it failed to remove broadcast data due to RPC timeout > exception from executor #11. And it also failed to get thread dump from > executor #11 due to akka.actor.ActorNotFound exception. > After that, driver waited for executor #11 to finish one task for that job. > All the other tasks are finished for that job. > However, from the executor#11's log, it didn't get that task (it got 9 other > tasks and finished them) > Since then, there is no progress in the streaming job. > I have attached the driver's log and jstack, executor's jstack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor
Liyin Tang created SPARK-13580: -- Summary: Driver makes no progress after failed to remove broadcast on Executor Key: SPARK-13580 URL: https://issues.apache.org/jira/browse/SPARK-13580 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.5.2 Reporter: Liyin Tang From Driver's log: it failed to remove broadcast data due to RPC timeout exception from executor #11. And it also failed to get thread dump from executor #11 due to akka.actor.ActorNotFound exception. After that, driver waited for executor #11 to finish one task for that job. All the other tasks are finished for that job. However, from the executor#11's log, it didn't get that task (it got 9 other tasks and finished them) Since then, there is no progress in the streaming job. I have attached the driver's log and jstack, executor's jstack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13255) Integrate vectorized parquet scan with whole stage codegen.
[ https://issues.apache.org/jira/browse/SPARK-13255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172632#comment-15172632 ] Apache Spark commented on SPARK-13255: -- User 'nongli' has created a pull request for this issue: https://github.com/apache/spark/pull/11435 > Integrate vectorized parquet scan with whole stage codegen. > --- > > Key: SPARK-13255 > URL: https://issues.apache.org/jira/browse/SPARK-13255 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Nong Li > > The generated whole stage codegen is intended to be run over batches of rows. > This task is to integrate ColumnarBatches with whole stage codegen. > The resulting generated code should look something like: > {code} > Iterator input; > void process() { > while (input.hasNext()) { > ColumnarBatch batch = input.next(); > for (Row: batch) { > // Current function > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
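The intended shape of the generated loop in the description above can be sketched in ordinary Scala (this {{ColumnarBatch}} is a placeholder type, not Spark's real class): an outer loop consuming batches and a tight inner while-loop scanning the rows of each batch, which is where the fused per-row code produced by whole-stage codegen would run.

```scala
// Placeholder batch type: a real ColumnarBatch holds column vectors,
// not a flat array of values.
final case class ColumnarBatch(values: Array[Int])

// Batch-then-row loop: the inner while-loop is the hot path where the
// generated per-row operator code ("// Current function") would be inlined.
def process(input: Iterator[ColumnarBatch]): Long = {
  var sum = 0L
  while (input.hasNext) {
    val batch = input.next()
    var i = 0
    while (i < batch.values.length) {
      sum += batch.values(i) // stand-in for the fused row-level work
      i += 1
    }
  }
  sum
}
```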
[jira] [Resolved] (SPARK-13478) Fetching delegation tokens for Hive fails when using proxy users
[ https://issues.apache.org/jira/browse/SPARK-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-13478. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.0.0 > Fetching delegation tokens for Hive fails when using proxy users > > > Key: SPARK-13478 > URL: https://issues.apache.org/jira/browse/SPARK-13478 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0, 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.0.0 > > > If you use spark-submit's proxy user support, the code that fetches > delegation tokens for the Hive Metastore fails. It seems like the Hive > library tries to connect to the Metastore as the proxy user, and it doesn't > have a Kerberos TGT for that user, so it fails. > I don't know whether the same issue exists in the HBase code, but I'll make a > similar change so that both behave similarly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13123) Add wholestage codegen for sort
[ https://issues.apache.org/jira/browse/SPARK-13123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-13123: - Assignee: Sameer Agarwal (was: Nong Li) > Add wholestage codegen for sort > --- > > Key: SPARK-13123 > URL: https://issues.apache.org/jira/browse/SPARK-13123 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Nong Li >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > > It should just implement CodegenSupport. It's future work to have this > operator use codegen more effectively. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13579) Stop building assemblies for Spark
Marcelo Vanzin created SPARK-13579: -- Summary: Stop building assemblies for Spark Key: SPARK-13579 URL: https://issues.apache.org/jira/browse/SPARK-13579 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 2.0.0 Reporter: Marcelo Vanzin See parent bug for more details. This change needs to wait for the other sub-tasks to be finished, so that the code knows what to do when there's only a bunch of jars to work with. This should cover both maven and sbt builds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13123) Add wholestage codegen for sort
[ https://issues.apache.org/jira/browse/SPARK-13123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-13123. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11359 [https://github.com/apache/spark/pull/11359] > Add wholestage codegen for sort > --- > > Key: SPARK-13123 > URL: https://issues.apache.org/jira/browse/SPARK-13123 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Nong Li >Assignee: Nong Li > Fix For: 2.0.0 > > > It should just implement CodegenSupport. It's future work to have this > operator use codegen more effectively. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13578) Make launcher lib and user scripts handle jar directories instead of single assembly file
Marcelo Vanzin created SPARK-13578: -- Summary: Make launcher lib and user scripts handle jar directories instead of single assembly file Key: SPARK-13578 URL: https://issues.apache.org/jira/browse/SPARK-13578 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin See parent bug for details. This step is necessary before we can remove the assembly from the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13577) Allow YARN to handle multiple jars, archive when uploading Spark dependencies
Marcelo Vanzin created SPARK-13577: -- Summary: Allow YARN to handle multiple jars, archive when uploading Spark dependencies Key: SPARK-13577 URL: https://issues.apache.org/jira/browse/SPARK-13577 Project: Spark Issue Type: Sub-task Components: YARN Reporter: Marcelo Vanzin See parent bug for more details. Before we remove assemblies from Spark, we need the YARN backend to understand how to find and upload multiple jars containing the Spark code. As a feature request made during spec review, we should also allow the Spark code to be provided as an archive that would be uploaded as a single file to the cluster, but exploded when downloaded to the containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13575) Remove streaming backends' assemblies
[ https://issues.apache.org/jira/browse/SPARK-13575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-13575: --- Component/s: (was: YARN) (was: Spark Core) Streaming > Remove streaming backends' assemblies > - > > Key: SPARK-13575 > URL: https://issues.apache.org/jira/browse/SPARK-13575 > Project: Spark > Issue Type: Sub-task > Components: Build, Streaming >Reporter: Marcelo Vanzin > > See parent bug for details. This task covers removing assemblies for > streaming backends. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13576) Make examples jar not be an assembly
Marcelo Vanzin created SPARK-13576: -- Summary: Make examples jar not be an assembly Key: SPARK-13576 URL: https://issues.apache.org/jira/browse/SPARK-13576 Project: Spark Issue Type: Sub-task Components: Examples Reporter: Marcelo Vanzin See parent bug for details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13575) Remove streaming backends' assemblies
Marcelo Vanzin created SPARK-13575:
--------------------------------------

             Summary: Remove streaming backends' assemblies
                 Key: SPARK-13575
                 URL: https://issues.apache.org/jira/browse/SPARK-13575
             Project: Spark
          Issue Type: Sub-task
            Reporter: Marcelo Vanzin


See parent bug for details. This task covers removing assemblies for streaming backends.
[jira] [Updated] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype
[ https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-12941:
-----------------------------
    Fix Version/s: 1.5.3
                   1.4.2

> Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR
> datatype
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-12941
>                 URL: https://issues.apache.org/jira/browse/SPARK-12941
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1
>         Environment: Apache Spark 1.4.2.2
>            Reporter: Jose Martinez Poblete
>            Assignee: Thomas Sebastian
>             Fix For: 1.4.2, 1.5.3, 2.0.0, 1.6.2
>
> When exporting data from Spark to Oracle, string datatypes are translated to
> TEXT for Oracle, which leads to the following error:
> {noformat}
> java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype
> {noformat}
> As per the following code:
> https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144
> See also:
> http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0
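The shape of the fix is a per-database type override consulted before the generic JDBC mapping. The sketch below is not Spark's actual dialect code (Spark's is Scala, in `JdbcDialect.getJDBCType`); the table names and the `VARCHAR2(255)` length here are illustrative assumptions:

```python
# Generic JDBC mapping: this is where StringType -> TEXT goes wrong for
# Oracle, which has no TEXT type (hence ORA-00902: invalid datatype).
GENERIC_JDBC_TYPES = {"StringType": "TEXT", "IntegerType": "INTEGER"}

# Dialect-specific overrides, checked before the generic table.
ORACLE_OVERRIDES = {
    "StringType": "VARCHAR2(255)",  # length chosen for illustration only
}

def ddl_type(catalyst_type, dialect_overrides=None):
    """Resolve a Catalyst type to a DDL type, dialect overrides first."""
    overrides = dialect_overrides or {}
    return overrides.get(catalyst_type, GENERIC_JDBC_TYPES[catalyst_type])

print(ddl_type("StringType"))                     # TEXT
print(ddl_type("StringType", ORACLE_OVERRIDES))   # VARCHAR2(255)
print(ddl_type("IntegerType", ORACLE_OVERRIDES))  # INTEGER
```

Types without an override (like `IntegerType` above) fall through to the generic mapping, so a dialect only has to list the cases where the target database disagrees with the default.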
[jira] [Resolved] (SPARK-7253) Add example of belief propagation with GraphX
[ https://issues.apache.org/jira/browse/SPARK-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-7253.
----------------------------------
    Resolution: Workaround

An example was provided using the GraphFrames API (https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/examples/BeliefPropagation.scala). I marked this issue as "Workaround".

> Add example of belief propagation with GraphX
> ---------------------------------------------
>
>                 Key: SPARK-7253
>                 URL: https://issues.apache.org/jira/browse/SPARK-7253
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX
>            Reporter: Joseph K. Bradley
>
> It would be nice to document (via an example) how to use GraphX to do belief
> propagation. It's probably too much right now to talk about a full-fledged
> graphical model library (and that would belong in MLlib anyways), but a
> simple example of a graphical model + BP would be nice to add to GraphX.
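For readers unfamiliar with the requested example, here is the smallest possible instance of sum-product belief propagation — a two-variable chain A -- B with made-up potentials, checked against brute-force marginalization. This is plain Python, not GraphX or GraphFrames:

```python
# Sum-product BP on a chain A -- B of binary variables.
unary_a = [0.7, 0.3]              # phi_A(a), unary potential at A
unary_b = [0.4, 0.6]              # phi_B(b), unary potential at B
pairwise = [[0.9, 0.1],           # psi(a, b), pairwise potential
            [0.2, 0.8]]

# Message from A to B: m(b) = sum_a phi_A(a) * psi(a, b)
msg_a_to_b = [sum(unary_a[a] * pairwise[a][b] for a in range(2))
              for b in range(2)]

# Belief at B: its unary potential times the incoming message, normalized.
raw = [unary_b[b] * msg_a_to_b[b] for b in range(2)]
z = sum(raw)
belief_b = [r / z for r in raw]

# Brute-force check: marginalize the full joint distribution over A.
joint = [[unary_a[a] * unary_b[b] * pairwise[a][b] for b in range(2)]
         for a in range(2)]
total = sum(sum(row) for row in joint)
brute = [sum(joint[a][b] for a in range(2)) / total for b in range(2)]

print(belief_b)
assert all(abs(x - y) < 1e-12 for x, y in zip(belief_b, brute))
```

On a tree-structured graph this message-passing scheme is exact, which is why it maps naturally onto a Pregel-style vertex program — the framing the GraphFrames example linked above uses.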
[jira] [Resolved] (SPARK-3665) Java API for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-3665.
----------------------------------
    Resolution: Workaround

GraphFrames (https://github.com/graphframes/graphframes) wraps GraphX algorithms under the DataFrames API and its Scala interface is compatible with Java. The project is in active development, currently as a 3rd-party package. I'm marking this issue as "Workaround" because it is much easier to support Java under the DataFrames API.

> Java API for GraphX
> -------------------
>
>                 Key: SPARK-3665
>                 URL: https://issues.apache.org/jira/browse/SPARK-3665
>             Project: Spark
>          Issue Type: Improvement
>          Components: GraphX, Java API
>    Affects Versions: 1.0.0
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>
> The Java API will wrap the Scala API in a similar manner as JavaRDD.
> Components will include:
> # JavaGraph
> #- removes optional param from persist, subgraph, mapReduceTriplets,
> Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
> #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
> #- merges multiple parameters lists
> #- incorporates GraphOps
> # JavaVertexRDD
> # JavaEdgeRDD
[jira] [Resolved] (SPARK-3789) [GRAPHX] Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-3789.
----------------------------------
    Resolution: Workaround

GraphFrames (https://github.com/graphframes/graphframes) wraps GraphX algorithms under the DataFrames API and provides a Python interface. The project is in active development, currently as a 3rd-party package. So I'm marking this issue as "Workaround" because it is much easier to support Python under the DataFrames API.

> [GRAPHX] Python bindings for GraphX
> -----------------------------------
>
>                 Key: SPARK-3789
>                 URL: https://issues.apache.org/jira/browse/SPARK-3789
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, PySpark
>            Reporter: Ameet Talwalkar
>            Assignee: Kushal Datta
>         Attachments: PyGraphX_design_doc.pdf
>
[jira] [Resolved] (SPARK-7256) Add Graph abstraction which uses DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-7256.
----------------------------------
    Resolution: Later

GraphFrames (https://github.com/graphframes/graphframes) implemented this idea and is in active development. So I marked this issue as "Later".

> Add Graph abstraction which uses DataFrame
> ------------------------------------------
>
>                 Key: SPARK-7256
>                 URL: https://issues.apache.org/jira/browse/SPARK-7256
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, SQL
>            Reporter: Joseph K. Bradley
>            Priority: Critical
>              Labels: dataframe, graphx
>
> RDD is to DataFrame as Graph is to ??? (this JIRA).
> It would be very useful long-term to have a Graph type which uses 2
> DataFrames instead of 2 RDDs.
> The immediate benefit I have in mind is taking advantage of Spark SQL
> datasources and storage formats.
> This could also be an opportunity to make an API which is more Java- and
> Python-friendly.
> CC: [~ankurdave] [~rxin]
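The core idea in this issue — a property graph as two tables rather than two RDDs — can be sketched without Spark at all. Below, plain lists of dicts stand in for the vertex and edge DataFrames, and the schema (`id`, `src`, `dst`) mirrors the convention GraphFrames later adopted; everything else is illustrative:

```python
from collections import Counter

# Vertex "table": one row per vertex, arbitrary property columns.
vertices = [{"id": "a", "name": "Alice"},
            {"id": "b", "name": "Bob"},
            {"id": "c", "name": "Carol"}]

# Edge "table": src/dst reference vertex ids, plus property columns.
edges = [{"src": "a", "dst": "b", "rel": "follows"},
         {"src": "b", "dst": "c", "rel": "follows"},
         {"src": "a", "dst": "c", "rel": "follows"}]

# A relational groupBy("dst").count() yields in-degrees with no
# graph-specific machinery — the appeal of a table-backed graph type
# is that datasources, storage formats, and SQL all apply for free.
in_degrees = Counter(e["dst"] for e in edges)
print(dict(in_degrees))  # {'b': 1, 'c': 2}
```

The same representation also makes the API language-neutral: tables of rows are equally natural from Java and Python, which is the "more Java- and Python-friendly" opportunity the issue mentions.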
[jira] [Commented] (SPARK-13574) Improve parquet dictionary decoding for strings
[ https://issues.apache.org/jira/browse/SPARK-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172518#comment-15172518 ]

Apache Spark commented on SPARK-13574:
--------------------------------------

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/11434

> Improve parquet dictionary decoding for strings
> -----------------------------------------------
>
>                 Key: SPARK-13574
>                 URL: https://issues.apache.org/jira/browse/SPARK-13574
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Nong Li
>            Priority: Minor
>
> Currently, the parquet reader will copy the dictionary value for each data
> value. This is bad for string columns as we explode the dictionary during
> decode. We should instead have the data values point to the safe backing
> memory.