[jira] [Created] (SPARK-18847) PageRank gives incorrect results for graphs with sinks
Andrew Ray created SPARK-18847:
--

    Summary: PageRank gives incorrect results for graphs with sinks
    Key: SPARK-18847
    URL: https://issues.apache.org/jira/browse/SPARK-18847
    Project: Spark
    Issue Type: Bug
    Components: GraphX
    Affects Versions: 2.0.2, 1.6.3, 1.5.2, 1.4.1, 1.3.1, 1.2.2, 1.1.1, 1.0.2
    Reporter: Andrew Ray

Sink vertices (those with no outgoing edges) should evenly distribute their rank to the entire graph but in the current implementation it is just lost.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
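The behaviour requested in SPARK-18847 can be sketched in plain Python. This is only an illustration, not the GraphX code: the function name `pagerank_with_sinks`, the dict-of-lists graph, and the normalized formulation (ranks sum to 1) are all assumptions made for the example.

```python
def pagerank_with_sinks(out_edges, reset_prob=0.15, num_iter=50):
    """Power-iteration PageRank where sinks spread their rank over every vertex.

    out_edges maps each vertex to the list of vertices it links to.
    Ranks are normalized so that they sum to 1 (an assumption of this sketch).
    """
    vertices = sorted(set(out_edges) | {d for ds in out_edges.values() for d in ds})
    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(num_iter):
        # Rank held by sinks would be lost if it were not redistributed.
        sink_mass = sum(ranks[v] for v in vertices if not out_edges.get(v))
        incoming = {v: 0.0 for v in vertices}
        for src in vertices:
            dsts = out_edges.get(src, [])
            for dst in dsts:
                incoming[dst] += ranks[src] / len(dsts)
        ranks = {
            v: reset_prob / n + (1.0 - reset_prob) * (incoming[v] + sink_mass / n)
            for v in vertices
        }
    return ranks


# Vertex 3 is a sink: it has no outgoing edges.
graph = {1: [2, 3], 2: [3], 3: []}
ranks = pagerank_with_sinks(graph)
total = sum(ranks.values())
```

Because the sink's rank is handed back to all vertices, the total rank stays at 1 on every iteration; dropping the `sink_mass / n` term would let the total shrink, which is the loss the report describes.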
[jira] [Updated] (SPARK-18471) In treeAggregate, generate (big) zeros instead of sending them.
[ https://issues.apache.org/jira/browse/SPARK-18471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-18471:
--
    Assignee: Anthony Truchet

> In treeAggregate, generate (big) zeros instead of sending them.
> ---
>
> Key: SPARK-18471
> URL: https://issues.apache.org/jira/browse/SPARK-18471
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, Spark Core
> Reporter: Anthony Truchet
> Assignee: Anthony Truchet
> Priority: Minor
> Fix For: 2.2.0
>
>
> When using an optimization routine like LBFGS, treeAggregate currently sends the
> zero vector as part of the closure. This zero can be huge (e.g. ML vectors
> with millions of zeros) but can be easily generated.
> Several options are possible (patches for some of them are coming soon).
> One is to provide a treeAggregateWithZeroGenerator method (either in core or
> in MLlib) which wraps treeAggregate in an Option and generates the zero if None.
> Another is to rewrite treeAggregate to wrap an underlying implementation
> which uses a zero generator directly.
> There might be other, better alternatives we have not spotted...
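The zero-generator idea from SPARK-18471 can be illustrated without Spark. The sketch below is plain Python over in-memory lists; `aggregate_with_zero_generator`, `make_zero` and `add_in_place` are hypothetical names, and the single-pass merge only stands in for the multi-level merge a real tree aggregation performs.

```python
from functools import reduce


def aggregate_with_zero_generator(partitions, make_zero, seq_op, comb_op):
    """Aggregate partitions, building the zero value where it is needed.

    make_zero is a cheap callable (for example a closure over a vector length),
    so only that callable would have to be shipped, never the big zero itself.
    """
    # Each partition creates its own zero locally, then folds its elements in.
    partials = [reduce(seq_op, part, make_zero()) for part in partitions]
    # Partial results are then merged; a real tree aggregation would do this
    # in several levels rather than in one pass.
    return reduce(comb_op, partials, make_zero())


def add_in_place(acc, vec):
    # Accumulate vec into acc coordinate by coordinate and return acc.
    for i, x in enumerate(vec):
        acc[i] += x
    return acc


dim = 4  # stands in for "millions" of coordinates
partitions = [
    [[1.0, 0.0, 0.0, 2.0]],
    [[0.0, 3.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]],
    [],
]
total = aggregate_with_zero_generator(
    partitions, lambda: [0.0] * dim, add_in_place, add_in_place
)
```

The closure `lambda: [0.0] * dim` captures only an integer, which is the point of the proposal: the serialized payload no longer grows with the size of the zero vector.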
[jira] [Resolved] (SPARK-18471) In treeAggregate, generate (big) zeros instead of sending them.
[ https://issues.apache.org/jira/browse/SPARK-18471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-18471.
---
    Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 16037
[https://github.com/apache/spark/pull/16037]

> In treeAggregate, generate (big) zeros instead of sending them.
> ---
>
> Key: SPARK-18471
> URL: https://issues.apache.org/jira/browse/SPARK-18471
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, Spark Core
> Reporter: Anthony Truchet
> Priority: Minor
> Fix For: 2.2.0
[jira] [Updated] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM
[ https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-18715:
--
    Assignee: Wayne Zhang
    Priority: Major (was: Critical)

> Fix wrong AIC calculation in Binomial GLM
> -
>
> Key: SPARK-18715
> URL: https://issues.apache.org/jira/browse/SPARK-18715
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.0.2
> Reporter: Wayne Zhang
> Assignee: Wayne Zhang
> Labels: patch
> Fix For: 2.2.0
>
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> The AIC calculation in Binomial GLM seems to be wrong when there are weights.
> The result is different from that in R.
> The current implementation is:
> {code}
> -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
>   weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
> }.sum()
> {code}
> Suggest changing this to
> {code}
> -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
>   val wt = math.round(weight).toInt
>   if (wt == 0){
>     0.0
>   } else {
>     dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>   }
> }.sum()
> {code}
>
>
> The following is an example to illustrate the problem.
> {code}
> val dataset = Seq(
>   LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(12, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(16, 1.0))
> ).toDF().withColumn("weight", col("label") + 1.0)
> val glr = new GeneralizedLinearRegression()
>   .setFamily("binomial")
>   .setWeightCol("weight")
>   .setRegParam(0)
> val model = glr.fit(dataset)
> model.summary.aic
> {code}
> This calculation shows the AIC is 14.189026847171382. To verify whether this
> is correct, I ran the same analysis in R but got AIC = 11.66092, -2 * LogLik
> = 5.660918.
> {code}
> da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
> 0,18,1,1
> 0.5,12,0,1.5
> 1,15,0,2
> 0,13,2,1
> 0,15,1,1
> 0.5,16,1,1.5
> da <- as.data.frame(da)
> f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w)
> AIC(f)
> -2 * logLik(f)
> {code}
> Now, I check whether the proposed change is correct. The following calculates
> -2 * LogLik manually and gets 5.6609177228379055, the same as that in R.
> {code}
> val predictions = model.transform(dataset)
> -2.0 * predictions.select("label", "prediction", "weight").rdd.map {case
>   Row(y: Double, mu: Double, weight: Double) =>
>   val wt = math.round(weight).toInt
>   if (wt == 0){
>     0.0
>   } else {
>     dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>   }
> }.sum()
> {code}
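The proposed weighted log-likelihood in SPARK-18715 can be re-derived in plain Python as a rough cross-check. This is a sketch under assumptions: the helper names are hypothetical, the fitted means in `rows` are made up (the report does not list them), and `round_half_up` is used because Scala's `math.round` rounds halves up while Python's built-in `round` does not.

```python
import math


def round_half_up(x):
    # Scala's math.round rounds halves up; Python's built-in round() does not.
    return int(math.floor(x + 0.5))


def binomial_log_pmf(k, n, p):
    """log P(K = k) for K ~ Binomial(n, p), with 0 < p < 1 and 0 <= k <= n."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def minus_two_log_lik(rows):
    """rows holds (y, mu, weight) triples, mirroring the proposed Scala change."""
    total = 0.0
    for y, mu, weight in rows:
        wt = round_half_up(weight)
        if wt != 0:
            # y is a proportion, so y * weight is the number of successes.
            total += binomial_log_pmf(round_half_up(y * weight), wt, mu)
    return -2.0 * total


# Hypothetical fitted means, chosen only to exercise the function.
rows = [(0.0, 0.2, 1.0), (0.5, 0.4, 1.5), (1.0, 0.7, 2.0)]
result = minus_two_log_lik(rows)
```

The AIC adds twice the number of estimated parameters to this quantity. The R figures quoted in the report are consistent with that: 11.66092 - 5.660918 is 6 to within rounding, i.e. 2 * 3 for an intercept and two coefficients.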
[jira] [Resolved] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM
[ https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-18715.
---
    Resolution: Fixed

Issue resolved by pull request 16149
[https://github.com/apache/spark/pull/16149]

> Fix wrong AIC calculation in Binomial GLM
> -
>
> Key: SPARK-18715
> URL: https://issues.apache.org/jira/browse/SPARK-18715
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.0.2
> Reporter: Wayne Zhang
> Priority: Critical
> Labels: patch
> Fix For: 2.2.0
>
> Original Estimate: 120h
> Remaining Estimate: 120h
[jira] [Commented] (SPARK-18845) PageRank has incorrect initialization value that leads to slow convergence
[ https://issues.apache.org/jira/browse/SPARK-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746316#comment-15746316 ]

Sean Owen commented on SPARK-18845:
---

See https://issues.apache.org/jira/browse/SPARK-7005 ?

> PageRank has incorrect initialization value that leads to slow convergence
> --
>
> Key: SPARK-18845
> URL: https://issues.apache.org/jira/browse/SPARK-18845
> Project: Spark
> Issue Type: Bug
> Components: GraphX
> Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2
> Reporter: Andrew Ray
[jira] [Created] (SPARK-18846) Fix flakiness in SchedulerIntegrationSuite
Imran Rashid created SPARK-18846:

    Summary: Fix flakiness in SchedulerIntegrationSuite
    Key: SPARK-18846
    URL: https://issues.apache.org/jira/browse/SPARK-18846
    Project: Spark
    Issue Type: Test
    Components: Scheduler
    Affects Versions: 2.1.0
    Reporter: Imran Rashid
    Assignee: Imran Rashid
    Priority: Minor

There is a small race in {{SchedulerIntegrationSuite}} that has failed at least one set of tests. The test assumes that the task scheduler thread processing the last task will finish before the DAGScheduler processes the task event and notifies the job waiter, but that is not 100% guaranteed.
[jira] [Created] (SPARK-18845) PageRank has incorrect initialization value that leads to slow convergence
Andrew Ray created SPARK-18845:
--

    Summary: PageRank has incorrect initialization value that leads to slow convergence
    Key: SPARK-18845
    URL: https://issues.apache.org/jira/browse/SPARK-18845
    Project: Spark
    Issue Type: Bug
    Components: GraphX
    Affects Versions: 2.0.2, 1.6.3, 1.5.2, 1.4.1, 1.3.1, 1.2.2
    Reporter: Andrew Ray

All variants of PageRank in GraphX have an incorrect initialization value that leads to slow convergence. In the current implementations ranks are seeded with the reset probability when they should be seeded with 1. This appears to have been introduced a long time ago in https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90

This also hides the fact that source vertices (vertices with no incoming edges) are not updated. This is because source vertices generally* have pagerank equal to the reset probability. Therefore both need to be fixed at once.

PR will be added shortly

*when there are no sinks -- but that's a separate bug
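The effect of the seed value described in SPARK-18845 can be seen on a toy graph. The sketch below is plain Python, not GraphX: `iterate_pagerank` is a hypothetical helper, and it assumes the un-normalized update rank = reset_prob + (1 - reset_prob) * incoming on a graph without sinks.

```python
def iterate_pagerank(out_edges, initial_rank, reset_prob=0.15, tol=1e-9, max_iter=1000):
    """Run the un-normalized update rank = reset_prob + (1 - reset_prob) * incoming.

    Returns the final ranks and the number of iterations needed for the largest
    per-vertex change to drop below tol. The graph must have no sinks.
    """
    ranks = {v: initial_rank for v in out_edges}
    for iteration in range(1, max_iter + 1):
        incoming = {v: 0.0 for v in out_edges}
        for src, dsts in out_edges.items():
            for dst in dsts:
                incoming[dst] += ranks[src] / len(dsts)
        new_ranks = {v: reset_prob + (1.0 - reset_prob) * incoming[v] for v in out_edges}
        delta = max(abs(new_ranks[v] - ranks[v]) for v in out_edges)
        ranks = new_ranks
        if delta < tol:
            return ranks, iteration
    return ranks, max_iter


# A three-vertex cycle: no sources and no sinks, so every rank converges to 1.
cycle = {1: [2], 2: [3], 3: [1]}
ranks_from_one, iters_from_one = iterate_pagerank(cycle, initial_rank=1.0)
ranks_from_reset, iters_from_reset = iterate_pagerank(cycle, initial_rank=0.15)
```

Both runs reach the same ranks, but the run seeded with 1.0 is already at the fixed point while the run seeded with the reset probability has to climb towards it, shrinking its error by a factor of 0.85 per step. On less symmetric graphs the gap will differ; the cycle is only the simplest case that shows it.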
[jira] [Commented] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746293#comment-15746293 ]

Zak Patterson commented on SPARK-18844:
---

I'm not very familiar with the Python API, but it seems to me that the two methods available in Scala (precision and recall) are not available in Python?

https://github.com/apache/spark/blob/v2.1.0-rc2/python/pyspark/mllib/evaluation.py#L29

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.0.2
> Reporter: Zak Patterson
> Priority: Minor
> Labels: evaluation
> Fix For: 2.0.2
>
> Original Estimate: 5h
> Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive
> value) and recall (true positive rate). It should implement more
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private,
> and there are no accessors for them. So if one desired to add this
> functionality, one would have to duplicate this calculation, which is not
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
>
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
>
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
>
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the
> trait BinaryClassificationMetricComputer were also marked public, then it
> would be easy to define new computers to get whatever the user wanted.
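For reference, the metrics listed in SPARK-18844 all follow from the four confusion-matrix counts at a given threshold. The sketch below is plain Python, independent of the MLlib classes; `binary_metrics` and `safe_div` are hypothetical helpers and the counts are made up.

```python
def safe_div(num, den):
    # Return NaN rather than raising when a denominator is zero.
    return num / den if den else float("nan")


def binary_metrics(tp, fp, tn, fn):
    """Derive the rates listed in the issue from confusion-matrix counts."""
    return {
        "precision": safe_div(tp, tp + fp),    # positive predictive value
        "recall": safe_div(tp, tp + fn),       # true positive rate
        "for": safe_div(fn, fn + tn),          # false omission rate
        "fdr": safe_div(fp, fp + tp),          # false discovery rate
        "npv": safe_div(tn, tn + fn),          # negative predictive value
        "fnr": safe_div(fn, fn + tp),          # false negative rate
        "specificity": safe_div(tn, tn + fp),  # true negative rate
        "fpr": safe_div(fp, fp + tn),          # false positive rate
    }


# Made-up counts for a single threshold.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

The metrics come in complementary pairs (precision and FDR, NPV and FOR, recall and FNR, specificity and FPR each sum to 1), which is one reason exposing the counts or the computer trait would cover every variant at once.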
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746217#comment-15746217 ]

Shixiong Zhu commented on SPARK-18838:
--

[~sitalke...@gmail.com] Instead of your proposal, I prefer to make ExecutorAllocationManager not depend on the listener bus, which may drop critical events.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 2.0.0
> Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the
> driver's `ListenerBus` for large jobs with many tasks. Many critical
> components of the scheduler, like `ExecutorAllocationManager` and
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and these delays are
> causing job failures. For example, a significant delay in receiving the
> `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to
> remove an executor which is not idle. The event processor in `ListenerBus`
> is a single thread which loops through all the listeners for each event and
> processes each event synchronously
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>
> The single-threaded processor often becomes the bottleneck for large jobs.
> In addition to that, if one of the listeners is very slow, all the listeners
> will pay the price of the delay incurred by the slow listener.
> To solve the above problems, we plan to have a per-listener single-threaded
> executor service and a separate event queue. That way we are not bottlenecked
> by the single-threaded event processor, and critical listeners will not
> be penalized by the slow listeners. The downside of this approach is that a separate
> event queue per listener will increase the driver memory footprint.
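The per-listener design proposed in SPARK-18838 can be sketched with the Python standard library. This is only an illustration of the idea, not Spark's `LiveListenerBus`: the class name `PerListenerBus` and its methods are hypothetical, and the queues are unbounded for simplicity.

```python
import queue
import threading
import time


class PerListenerBus:
    """Dispatch events to listeners, each with its own queue and worker thread."""

    _STOP = object()  # sentinel that tells a worker to exit

    def __init__(self, listeners):
        self._queues = []
        self._threads = []
        for listener in listeners:
            q = queue.Queue()
            t = threading.Thread(target=self._drain, args=(q, listener), daemon=True)
            t.start()
            self._queues.append(q)
            self._threads.append(t)

    @classmethod
    def _drain(cls, q, listener):
        # One worker per listener: events are handled in order, one at a time.
        while True:
            event = q.get()
            if event is cls._STOP:
                return
            listener(event)

    def post(self, event):
        # Posting only enqueues, so a slow listener never blocks the caller
        # or the other listeners. The cost is one queue per listener.
        for q in self._queues:
            q.put(event)

    def stop(self):
        # Let every worker finish its backlog, then wait for it to exit.
        for q in self._queues:
            q.put(self._STOP)
        for t in self._threads:
            t.join()


fast_seen, slow_seen = [], []


def slow_listener(event):
    time.sleep(0.01)  # simulate an expensive listener
    slow_seen.append(event)


bus = PerListenerBus([fast_seen.append, slow_listener])
for i in range(5):
    bus.post(i)
bus.stop()
```

Each listener still sees every event in posting order, but the fast listener no longer waits behind the slow one. The backlog of the slow listener now sits in its own queue, which is the memory cost the description mentions.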
[jira] [Commented] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746212#comment-15746212 ]

Sean Owen commented on SPARK-18844:
---

Yeah I think we discussed something like this before, and the drawback was just filling up the API with variations that are mostly not used. False positive and specificity might see some use. This would have to go in MulticlassMetrics too, and the Python API of both, for completeness. Still that doesn't mean it's not doable.

I tend to agree that it makes sense if anything to expose the 'computer' API, but then it's not clear how to translate that to multiclass and Python.

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.0.2
> Reporter: Zak Patterson
> Priority: Minor
> Labels: evaluation
> Fix For: 2.0.2
>
> Original Estimate: 5h
> Remaining Estimate: 5h
[jira] [Comment Edited] (SPARK-18281) toLocalIterator yields time out error on pyspark2
[ https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746023#comment-15746023 ]

Mike Dusenberry edited comment on SPARK-18281 at 12/13/16 7:56 PM:
---

Here's another interesting finding. The first (original) example fails with the timeout. However, if you create the DataFrame, do something with it, and then create the iterator, it will work.

{code}
df = spark.createDataFrame([[1],[2],[3]])
it = df.toLocalIterator()  # FAILS HERE
row = next(it)
{code}

{code}
df = spark.createDataFrame([[1],[2],[3]])
df.count()
it = df.toLocalIterator()  # No longer fails
row = next(it)
{code}

This leads me to believe there may be something wrong with the creation of the DataFrame.

was (Author: mwdus...@us.ibm.com):
Here's another interesting finding. The first (original) example fails with the timeout. However, if you create the DataFrame, do something with it, and then create the iterator, it will work.

{code}
df = spark.createDataFrame([[1],[2],[3]])
it = df.toLocalIterator()  # FAILS HERE
row = next(it)
{code}

{code}
df = spark.createDataFrame([[1],[2],[3]])
df.count()
it = df.toLocalIterator()  # No longer fails
row = next(it)
{code}

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
> Reporter: Luke Miner
>
> I ran the example straight out of the API docs for toLocalIterator and it
> gives a timeout exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory 16G
> spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse false
> spark.mesos.constraints priority:1
> spark.network.timeout 600
> spark.rpc.message.maxSize 500
> spark.speculation false
> spark.sql.parquet.mergeSchema false
> spark.sql.planner.externalSort true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout Traceback (most recent call last)
> in ()
> 2 sc = SparkContext()
> 3 rdd = sc.parallelize(range(10))
> ----> 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in _read_with_length(self, stream)
> 154
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in read_int(stream)
> 541
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}
[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2
[ https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746023#comment-15746023 ]

Mike Dusenberry commented on SPARK-18281:
-

Here's another interesting finding. The first (original) example fails with the timeout. However, if you create the DataFrame, do something with it, and then create the iterator, it will work.

{code}
df = spark.createDataFrame([[1],[2],[3]])
it = df.toLocalIterator()  # FAILS HERE
row = next(it)
{code}

{code}
df = spark.createDataFrame([[1],[2],[3]])
df.count()
it = df.toLocalIterator()  # No longer fails
row = next(it)
{code}

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
> Reporter: Luke Miner
[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2
[ https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746002#comment-15746002 ]

Mike Dusenberry commented on SPARK-18281:
-

[~viirya] Thanks for taking on this bug! I tried out the PR, and I'm still running into a socket timeout error for the example I gave above:

{code}
Traceback (most recent call last):
  File "", line 1, in
  File "/home/mwdusenb/spark/python/pyspark/sql/dataframe.py", line 416, in toLocalIterator
    peek = next(iter)
  File "/home/mwdusenb/spark/python/pyspark/rdd.py", line 140, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/home/mwdusenb/spark/python/pyspark/serializers.py", line 144, in load_stream
    yield self._read_with_length(stream)
  File "/home/mwdusenb/spark/python/pyspark/serializers.py", line 161, in _read_with_length
    length = read_int(stream)
  File "/home/mwdusenb/spark/python/pyspark/serializers.py", line 555, in read_int
    length = stream.read(4)
  File "/opt/anaconda3/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
{code}

Interestingly, it looks like the {{it = df.toLocalIterator()}} line launches a very large number of {{toLocalIterator}} jobs, and then the Python socket times out while those jobs are running.

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
> Reporter: Luke Miner
[jira] [Updated] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zak Patterson updated SPARK-18844: -- Description: BinaryClassificationMetrics only implements Precision (positive predictive value) and recall (true positive rate). It should implement more comprehensive metrics. Moreover, the instance variables storing computed counts are marked private, and there are no accessors for them. So if one desired to add this functionality, one would have to duplicate this calculation, which is not trivial: https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144 Currently Implemented Metrics --- * Precision (PPV): `precisionByThreshold` * Recall (Sensitivity, true positive rate): `recallByThreshold` Desired additional metrics --- * False omission rate: `forByThreshold` * False discovery rate: `fdrByThreshold` * Negative predictive value: `npvByThreshold` * False negative rate: `fnrByThreshold` * True negative rate (Specificity): `specificityByThreshold` * False positive rate: `fprByThreshold` Alternatives --- The `createCurve` method is marked private. If it were marked public, and the trait BinaryClassificationMetricComputer were also marked public, then it would be easy to define new computers to get whatever the user wanted. was: BinaryClassificationMetrics only implements Precision (positive predictive value) and recall (true positive rate). It should implement more comprehensive metrics. Moreover, the instance variables storing computed counts are marked private, and there are no accessors for them. 
So if one desired to add this functionality, one would have to duplicate this calculation, which is not trivial: https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144 Currently Implemented Metrics --- * Precision (PPV): `precisionByThreshold` * Recall (Sensitivity, true positive rate): `recallByThreshold` Desired additional metrics --- * False omission rate: `fprByThreshold` * False discovery rate: `fdrByThreshold` * Negative predictive value: `npvByThreshold` * False negative rate: `fnrByThreshold` * True negative rate (Specificity): `specificityByThreshold` * False positive rate: `fprByThreshold` Alternatives --- The `createCurve` method is marked private. If it were marked public, and the trait BinaryClassificationMetricComputer were also marked public, then it would be easy to define new computers to get whatever the user wanted. > Add more binary classification metrics to BinaryClassificationMetrics > - > > Key: SPARK-18844 > URL: https://issues.apache.org/jira/browse/SPARK-18844 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.2 >Reporter: Zak Patterson >Priority: Minor > Labels: evaluation > Fix For: 2.0.2 > > Original Estimate: 5h > Remaining Estimate: 5h > > BinaryClassificationMetrics only implements Precision (positive predictive > value) and recall (true positive rate). It should implement more > comprehensive metrics. > Moreover, the instance variables storing computed counts are marked private, > and there are no accessors for them. 
So if one desired to add this > functionality, one would have to duplicate this calculation, which is not > trivial: > https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144 > Currently Implemented Metrics > --- > * Precision (PPV): `precisionByThreshold` > * Recall (Sensitivity, true positive rate): `recallByThreshold` > Desired additional metrics > --- > * False omission rate: `forByThreshold` > * False discovery rate: `fdrByThreshold` > * Negative predictive value: `npvByThreshold` > * False negative rate: `fnrByThreshold` > * True negative rate (Specificity): `specificityByThreshold` > * False positive rate: `fprByThreshold` > Alternatives > --- > The `createCurve` method is marked private. If it were marked public, and the > trait BinaryClassificationMetricComputer were also marked public, then it > would be easy to define new computers to get whatever the user wanted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
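For reference, the requested metrics can all be derived from the four confusion-matrix counts at a given threshold. The following is a minimal plain-Python sketch of those formulas, not Spark code: `ConfusionCounts`, `_ratio` and `extra_metrics` are hypothetical names used only for illustration, and returning 0.0 for an empty denominator is an assumption of this sketch rather than documented MLlib behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion-matrix counts at one threshold (hypothetical helper, not a Spark class)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator, denominator):
    # Assumption of this sketch: report 0.0 when the denominator is empty.
    return numerator / denominator if denominator else 0.0


def extra_metrics(c):
    """Return the six metrics listed in the issue description, keyed by short name."""
    return {
        "for": _ratio(c.fn, c.fn + c.tn),          # false omission rate
        "fdr": _ratio(c.fp, c.fp + c.tp),          # false discovery rate
        "npv": _ratio(c.tn, c.tn + c.fn),          # negative predictive value
        "fnr": _ratio(c.fn, c.fn + c.tp),          # false negative rate
        "specificity": _ratio(c.tn, c.tn + c.fp),  # true negative rate
        "fpr": _ratio(c.fp, c.fp + c.tn),          # false positive rate
    }
```

Note that each pair (FOR, NPV), (FDR, precision), (FNR, recall) and (FPR, specificity) sums to 1 whenever the shared denominator is non-zero, which is a cheap sanity check for any implementation.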
[jira] [Updated] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zak Patterson updated SPARK-18844: -- Remaining Estimate: 5h (was: 5m) Original Estimate: 5h (was: 5m) > Add more binary classification metrics to BinaryClassificationMetrics > - > > Key: SPARK-18844 > URL: https://issues.apache.org/jira/browse/SPARK-18844 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.2 >Reporter: Zak Patterson >Priority: Minor > Labels: evaluation > Fix For: 2.0.2 > > Original Estimate: 5h > Remaining Estimate: 5h > > BinaryClassificationMetrics only implements Precision (positive predictive > value) and recall (true positive rate). It should implement more > comprehensive metrics. > Moreover, the instance variables storing computed counts are marked private, > and there are no accessors for them. So if one desired to add this > functionality, one would have to duplicate this calculation, which is not > trivial: > https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144 > Currently Implemented Metrics > --- > * Precision (PPV): `precisionByThreshold` > * Recall (Sensitivity, true positive rate): `recallByThreshold` > Desired additional metrics > --- > * False omission rate: `fprByThreshold` > * False discovery rate: `fdrByThreshold` > * Negative predictive value: `npvByThreshold` > * False negative rate: `fnrByThreshold` > * True negative rate (Specificity): `specificityByThreshold` > * False positive rate: `fprByThreshold` > Alternatives > --- > The `createCurve` method is marked private. If it were marked public, and the > trait BinaryClassificationMetricComputer were also marked public, then it > would be easy to define new computers to get whatever the user wanted. 
[jira] [Created] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics
Zak Patterson created SPARK-18844: - Summary: Add more binary classification metrics to BinaryClassificationMetrics Key: SPARK-18844 URL: https://issues.apache.org/jira/browse/SPARK-18844 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.0.2 Reporter: Zak Patterson Priority: Minor Fix For: 2.0.2 BinaryClassificationMetrics only implements Precision (positive predictive value) and recall (true positive rate). It should implement more comprehensive metrics. Moreover, the instance variables storing computed counts are marked private, and there are no accessors for them. So if one desired to add this functionality, one would have to duplicate this calculation, which is not trivial: https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144 Currently Implemented Metrics --- * Precision (PPV): `precisionByThreshold` * Recall (Sensitivity, true positive rate): `recallByThreshold` Desired additional metrics --- * False omission rate: `fprByThreshold` * False discovery rate: `fdrByThreshold` * Negative predictive value: `npvByThreshold` * False negative rate: `fnrByThreshold` * True negative rate (Specificity): `specificityByThreshold` * False positive rate: `fprByThreshold` Alternatives --- The `createCurve` method is marked private. If it were marked public, and the trait BinaryClassificationMetricComputer were also marked public, then it would be easy to define new computers to get whatever the user wanted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745944#comment-15745944 ] Matt Cheah commented on SPARK-18278: [~rxin] - thanks for thinking about this! The concerns around testing and support burden are certainly valid. The alternatives also come with their own sets of concerns though. If we publish the scheduler as a library: The current code in the schedulers is not marked as public API. We would need to refactor the scheduler code to make an API the Apache project would support for third party use (like the K8s integration). There are (at least) two places this needs to be done: * CoarseGrainedSchedulerBackend would need to become extendable, since all of the schedulers (standalone, Mesos, yarn-client, and yarn-cluster) currently extend this fairly complex class. The CoarseGrainedSchedulerBackend code invokes its pluggable methods (doRequestTotalExecutors and doKillExecutors) with particular expectations, and hence these expectations would also have to remain stable as long as it is a public API. * SparkSubmit would need to support 3rd party cluster managers. Currently SparkSubmit's code includes special case handling for the bundled cluster managers, for example in YARN mode spark-submit accepts --queue to specify the queue to run the job with. Thus we would need to make the spark-submit argument handling pluggable as well for other cluster managers parameters. Off the top of my head, I could think of numerous ways we could expose both of these as plugins, but it's not immediately obvious what the best option is. If we fork the project: Maintaining a fork places burden on the fork maintainers to keep the fork up to date with the mainline releases. It also makes it unclear what the relationship between this feature and its associated fork is with the direction of the Spark project as a whole, and what the timeline is for eventual re-integration of the fork. 
Is there a prior example of this approach working in practice in the Spark community? In either case (library or forking), there's also the question of how we encourage alpha testing and early usage of this feature. If the code is not on the mainline branch, there needs to be alternative channels outside of the Spark releases themselves to announce that this feature is available and that we would like feedback on it. It would also be ideal for the code reviews to be visible early on, so that everyone that watches the Spark repository can catch the updates and progress of this feature. Having said all of this, I think these issues can be navigated. If I had to choose between maintaining a fork versus cleaning up the scheduler to make a public API, I would choose the latter in the interest of clarifying the relationship between the K8s effort and the mainline project, as well as for making the scheduler code cleaner in general. However it's not immediately clear if the effort required to make these refactors is worthwhile when we could include the K8s scheduler in the Apache releases as an experimental feature, ignore its bugs and test failures for the next few releases (that is, problems in the K8s-related code should never block releases), and ship this as we currently do with YARN and Mesos. I'd like to hear everyone's thoughts regarding the tradeoffs we are making between these different approaches of pushing this feature forward. > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. 
The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment
[ https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745879#comment-15745879 ] Marcelo Vanzin commented on SPARK-18840: [~jerryshao] are you planning to fix this for 2.0/1.6 also? Or should I close the bug? > HDFSCredentialProvider throws exception in non-HDFS security environment > > > Key: SPARK-18840 > URL: https://issues.apache.org/jira/browse/SPARK-18840 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.3, 2.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > Currently, the logic in {{HDFSCredentialProvider}} assumes that an HDFS delegation > token always exists. This is fine in an HDFS environment, but in some cloud > environments such as Azure, HDFS is not required, so it throws an exception: > {code} > java.util.NoSuchElementException: head of empty list > at scala.collection.immutable.Nil$.head(List.scala:337) > at scala.collection.immutable.Nil$.head(List.scala:334) > at > org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627) > {code} > We should also handle this situation.
[jira] [Updated] (SPARK-18835) Do not expose shaded types in JavaTypeInference API
[ https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-18835: --- Fix Version/s: 2.2.0 > Do not expose shaded types in JavaTypeInference API > --- > > Key: SPARK-18835 > URL: https://issues.apache.org/jira/browse/SPARK-18835 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > Currently, {{inferDataType(TypeToken)}} is called from a different Maven > module, and because we shade Guava, that sometimes leads to errors (e.g. when > running tests using Maven): > {noformat} > udf3Test(test.org.apache.spark.sql.JavaUDFSuite) Time elapsed: 0.084 sec > <<< ERROR! > java.lang.NoSuchMethodError: > org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2; > at > test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107) > Results : > Tests in error: > JavaUDFSuite.udf3Test:107 » NoSuchMethod > org.apache.spark.sql.catalyst.JavaTyp... > {noformat} > Instead, we shouldn't expose Guava types in these APIs.
[jira] [Updated] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment
[ https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-18840: --- Fix Version/s: 2.2.0 2.1.1 > HDFSCredentialProvider throws exception in non-HDFS security environment > > > Key: SPARK-18840 > URL: https://issues.apache.org/jira/browse/SPARK-18840 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.3, 2.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > Currently, the logic in {{HDFSCredentialProvider}} assumes that an HDFS delegation > token always exists. This is fine in an HDFS environment, but in some cloud > environments such as Azure, HDFS is not required, so it throws an exception: > {code} > java.util.NoSuchElementException: head of empty list > at scala.collection.immutable.Nil$.head(List.scala:337) > at scala.collection.immutable.Nil$.head(List.scala:334) > at > org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627) > {code} > We should also handle this situation.
[jira] [Updated] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment
[ https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-18840: --- Assignee: Saisai Shao > HDFSCredentialProvider throws exception in non-HDFS security environment > > > Key: SPARK-18840 > URL: https://issues.apache.org/jira/browse/SPARK-18840 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.3, 2.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > Currently, the logic in {{HDFSCredentialProvider}} assumes that an HDFS delegation > token always exists. This is fine in an HDFS environment, but in some cloud > environments such as Azure, HDFS is not required, so it throws an exception: > {code} > java.util.NoSuchElementException: head of empty list > at scala.collection.immutable.Nil$.head(List.scala:337) > at scala.collection.immutable.Nil$.head(List.scala:334) > at > org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627) > {code} > We should also handle this situation.
[jira] [Assigned] (SPARK-18843) Fix timeout in awaitResultInForkJoinSafely
[ https://issues.apache.org/jira/browse/SPARK-18843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18843: Assignee: Shixiong Zhu (was: Apache Spark) > Fix timeout in awaitResultInForkJoinSafely > -- > > Key: SPARK-18843 > URL: https://issues.apache.org/jira/browse/SPARK-18843 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Master has the fix in https://github.com/apache/spark/pull/16230. However, > since that PR will not be merged into the 2.0 and 2.1 branches because it's too risky, we should at > least fix the timeout value for 2.0 and 2.1.
[jira] [Commented] (SPARK-18843) Fix timeout in awaitResultInForkJoinSafely
[ https://issues.apache.org/jira/browse/SPARK-18843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745871#comment-15745871 ] Apache Spark commented on SPARK-18843: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16268 > Fix timeout in awaitResultInForkJoinSafely > -- > > Key: SPARK-18843 > URL: https://issues.apache.org/jira/browse/SPARK-18843 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Master has the fix in https://github.com/apache/spark/pull/16230. However, > since that PR will not be merged into the 2.0 and 2.1 branches because it's too risky, we should at > least fix the timeout value for 2.0 and 2.1.
[jira] [Assigned] (SPARK-18843) Fix timeout in awaitResultInForkJoinSafely
[ https://issues.apache.org/jira/browse/SPARK-18843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18843: Assignee: Apache Spark (was: Shixiong Zhu) > Fix timeout in awaitResultInForkJoinSafely > -- > > Key: SPARK-18843 > URL: https://issues.apache.org/jira/browse/SPARK-18843 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Master has the fix in https://github.com/apache/spark/pull/16230. However, > since that PR will not be merged into the 2.0 and 2.1 branches because it's too risky, we should at > least fix the timeout value for 2.0 and 2.1.
[jira] [Commented] (SPARK-18786) pySpark SQLContext.getOrCreate(sc) take stopped sparkContext
[ https://issues.apache.org/jira/browse/SPARK-18786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745867#comment-15745867 ] Bryan Cutler commented on SPARK-18786: -- The problem is that {{SQLContext.getOrCreate(sc)}} will not reset even though a different {{SparkContext}} is used. With Spark 2.0, we are moving towards {{SparkSession}} so I don't think this is worth fixing. Using {{SparkSession}} like below doesn't seem to have this problem. {noformat} import sys sys.path.insert(1, 'spark/python/') sys.path.insert(1, 'spark/python/lib/py4j-0.9-src.zip') from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() spark.read.json(spark.sparkContext.parallelize(['{{ "name": "Adam" }}'])) spark.stop() spark = SparkSession.builder.getOrCreate() spark.read.json(spark.sparkContext.parallelize(['{{ "name": "Adam" }}'])) {noformat} > pySpark SQLContext.getOrCreate(sc) take stopped sparkContext > > > Key: SPARK-18786 > URL: https://issues.apache.org/jira/browse/SPARK-18786 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Alex Liu > > The following steps to reproduce the issue > {code} > import sys > sys.path.insert(1, 'spark/python/') > sys.path.insert(1, 'spark/python/lib/py4j-0.9-src.zip') > from pyspark import SparkContext, SQLContext > sc = SparkContext.getOrCreate() > sqlContext = SQLContext.getOrCreate(sc) > sqlContext.read.json(sc.parallelize(['{{ "name": "Adam" }}'])).show() > sc.stop() > sc = SparkContext.getOrCreate() > sqlContext = SQLContext.getOrCreate(sc) > sqlContext.read.json(sc.parallelize(['{{ "name": "Adam" }}'])).show() > {code} > It has the following errors after the last command > {code} > >>> sqlContext.read.json(sc.parallelize(['{{ "name": "Adam" }}'])).show() > Traceback (most recent call last): > > File "", line 1, in > File "spark/python/pyspark/sql/dataframe.py", line 257, in show > print(self._jdf.showString(n, truncate)) > File 
"spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in > __call__ > File "spark/python/pyspark/sql/utils.py", line 45, in deco > return f(*a, **kw) > File "spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in > get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o435.showString. > : java.lang.IllegalStateException: Cannot call methods on a stopped > SparkContext. > This stopped SparkContext was created at: > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:422) > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > py4j.Gateway.invoke(Gateway.java:214) > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > py4j.GatewayConnection.run(GatewayConnection.java:209) > java.lang.Thread.run(Thread.java:745) > The currently active SparkContext was created at: > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:422) > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > py4j.Gateway.invoke(Gateway.java:214) > 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > py4j.GatewayConnection.run(GatewayConnection.java:209) > java.lang.Thread.run(Thread.java:745) > > at > org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:106) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1325) > at > org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:126) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at
[jira] [Created] (SPARK-18843) Fix timeout in awaitResultInForkJoinSafely
Shixiong Zhu created SPARK-18843: Summary: Fix timeout in awaitResultInForkJoinSafely Key: SPARK-18843 URL: https://issues.apache.org/jira/browse/SPARK-18843 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2, 2.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Master has the fix in https://github.com/apache/spark/pull/16230. However, since that PR will not be merged into the 2.0 and 2.1 branches because it's too risky, we should at least fix the timeout value for 2.0 and 2.1.
[jira] [Resolved] (SPARK-18835) Do not expose shaded types in JavaTypeInference API
[ https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-18835. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.1.1 > Do not expose shaded types in JavaTypeInference API > --- > > Key: SPARK-18835 > URL: https://issues.apache.org/jira/browse/SPARK-18835 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.1.1 > > > Currently, {{inferDataType(TypeToken)}} is called from a different Maven > module, and because we shade Guava, that sometimes leads to errors (e.g. when > running tests using Maven): > {noformat} > udf3Test(test.org.apache.spark.sql.JavaUDFSuite) Time elapsed: 0.084 sec > <<< ERROR! > java.lang.NoSuchMethodError: > org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2; > at > test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107) > Results : > Tests in error: > JavaUDFSuite.udf3Test:107 » NoSuchMethod > org.apache.spark.sql.catalyst.JavaTyp... > {noformat} > Instead, we shouldn't expose Guava types in these APIs.
[jira] [Resolved] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-13747. -- Resolution: Fixed Fix Version/s: (was: 2.0.2) (was: 2.1.0) 2.2.0 Issue resolved by pull request 16230 [https://github.com/apache/spark/pull/16230] > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > > Run the following codes may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread, however, the local properties has been polluted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18675) CTAS for hive serde table should work for all hive versions
[ https://issues.apache.org/jira/browse/SPARK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18675. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16104 [https://github.com/apache/spark/pull/16104] > CTAS for hive serde table should work for all hive versions > --- > > Key: SPARK-18675 > URL: https://issues.apache.org/jira/browse/SPARK-18675 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?
[ https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745633#comment-15745633 ] Shivaram Venkataraman commented on SPARK-18823: --- Thanks [~masip85] for verifying this. I think, as [~felixcheung] pointed out, there are two separate issues we can file as feature requests: 1. Supporting assignment of DataFrame columns in `[` and `[[` -- This should be pretty straightforward, I'd guess. 2. Supporting assignment of a local R column using `$` and / or `[[` -- This one I'm less sure about because it will involve determining types, serializing data from local R and splitting it into the existing DataFrame etc. Also, at a higher level, if the DataFrame has 100M rows then it might not be efficient to ship that much data etc. > Assignation by column name variable not available or bug? > - > > Key: SPARK-18823 > URL: https://issues.apache.org/jira/browse/SPARK-18823 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 2.0.2 > Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr > 4. Or databricks (community.cloud.databricks.com) . >Reporter: Vicente Masip > Fix For: 2.0.2 > > Original Estimate: 24h > Remaining Estimate: 24h > > I really don't know if this is a bug or can be done with some function: > Sometimes it is very important to assign something to a column whose name has to > be accessed through a variable. Outside of SparkR, I have always done this with double > brackets, like this: > # df could be the faithful dataset as a normal data frame or data table. > # accessing by variable name: > myname = "waiting" > df[[myname]] <- c(1:nrow(df)) > # or even column number > df[[2]] <- df$eruptions > The error is not caused by the right side of the "<-" assignment operator. > The problem is that I can't assign to a column name using a variable or > column number as I do in these examples outside of Spark. It doesn't matter whether I am > modifying or creating a column. Same problem. 
> I have also tried to use this with no results: > val df2 = withColumn(df,"tmp", df$eruptions) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows
[ https://issues.apache.org/jira/browse/SPARK-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-18842: - Priority: Major (was: Minor) > De-duplicate paths in classpaths in processes for local-cluster mode to work > around the length limitation on Windows > > > Key: SPARK-18842 > URL: https://issues.apache.org/jira/browse/SPARK-18842 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Reporter: Hyukjin Kwon > > Currently, some tests are being failed and hanging on Windows due to this > problem. For the reason in SPARK-18718, some tests using {{local-cluster}} > mode were disabled on Windows due to the length limitation by paths given to > classpaths. > The limitation seems roughly 32K (see > https://blogs.msdn.microsoft.com/oldnewthing/20031210-00/?p=41553/ and > https://support.thoughtworks.com/hc/en-us/articles/213248526-Getting-around-maximum-command-line-length-is-32767-characters-on-Windows) > but executors were being launched with the command such as > https://gist.github.com/HyukjinKwon/5bc81061c250d4af5a180869b59d42ea in > (only) tests. > This length is roughly 40K due to the class paths. However, it seems there > are duplicates more than half. So, if we de-duplicate this paths, it is > reduced to roughly 20K. > Maybe, we should consider as some more paths are added in the future but it > seems better than disabling all the tests for now with minimised changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3012) Standardized Distance Functions between two Vectors for MLlib
[ https://issues.apache.org/jira/browse/SPARK-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745388#comment-15745388 ]

Dhaval Modi commented on SPARK-3012:

I have implemented Mahalanobis Distance for Spark 1.6.2+ using the Breeze 0.12 libraries. The Scala code is available at: https://github.com/dhmodi/MahalanobisDistance

> Standardized Distance Functions between two Vectors for MLlib
> -
>
> Key: SPARK-3012
> URL: https://issues.apache.org/jira/browse/SPARK-3012
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Yu Ishikawa
> Priority: Minor
>
> Most clustering algorithms need distance functions between two Vectors.
> We should include a library of standardized distance functions in MLlib.
> I think standardized distance functions would help us implement more
> machine learning algorithms efficiently.
> h3. For example
> - Chebyshev Distance
> - Cosine Distance
> - Euclidean Distance
> - Mahalanobis Distance
> - Manhattan Distance
> - Minkowski Distance
> - SquaredEuclidean Distance
> - Tanimoto Distance
> - Weighted Distance
> - WeightedEuclidean Distance
> - WeightedManhattan Distance
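To make the request above concrete, here is a minimal sketch of a few of the listed distances over plain Python sequences. It is not MLlib code and not the linked Mahalanobis implementation; the function names are hypothetical, and cosine distance is assumed to mean 1 minus cosine similarity, which is one common definition.

```python
import math
from typing import Sequence


def _check(x: Sequence[float], y: Sequence[float]) -> None:
    # All distances below require vectors of equal dimension.
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")


def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    _check(x, y)
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    _check(x, y)
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(squared_euclidean(x, y))


def chebyshev(x: Sequence[float], y: Sequence[float]) -> float:
    _check(x, y)
    return max(abs(a - b) for a, b in zip(x, y))


def minkowski(x: Sequence[float], y: Sequence[float], p: float) -> float:
    # p = 1 gives Manhattan, p = 2 gives Euclidean.
    _check(x, y)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    # Assumed definition: 1 - cosine similarity. Undefined for zero vectors.
    _check(x, y)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)
```

A shared signature like `(x, y) -> float` is what would let a clustering routine accept any of these interchangeably; the weighted and Mahalanobis variants need extra parameters (weights, a covariance matrix) and are left out of this sketch.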
[jira] [Commented] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows
[ https://issues.apache.org/jira/browse/SPARK-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745350#comment-15745350 ] Hyukjin Kwon commented on SPARK-18842: -- This completes the parent task and I can proceed far more.
[jira] [Assigned] (SPARK-18841) PushProjectionThroughUnion exception when there are same column
[ https://issues.apache.org/jira/browse/SPARK-18841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18841: Assignee: (was: Apache Spark) > PushProjectionThroughUnion exception when there are same column > --- > > Key: SPARK-18841 > URL: https://issues.apache.org/jira/browse/SPARK-18841 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Song Jun > > {noformat} > DROP TABLE IF EXISTS p1 ; > DROP TABLE IF EXISTS p2 ; > DROP TABLE IF EXISTS p3 ; > CREATE TABLE p1 (col STRING) ; > CREATE TABLE p2 (col STRING) ; > CREATE TABLE p3 (col STRING) ; > set spark.sql.crossJoin.enabled = true; > SELECT > 1 as cste, > col > FROM ( > SELECT > col as col > FROM ( > SELECT > p1.col as col > FROM p1 > LEFT JOIN p2 > UNION ALL > SELECT > col > FROM p3 > ) T1 > ) T2 > ; > {noformat} > it will throw exception: > {noformat} > key not found: col#16 > java.util.NoSuchElementException: key not found: col#16 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at > org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at > org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31) > at > org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346) > at > org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281) > at > 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345) > at > org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378) > at > org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378) > at > org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:376) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18841) PushProjectionThroughUnion exception when there are same column
[ https://issues.apache.org/jira/browse/SPARK-18841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745339#comment-15745339 ] Apache Spark commented on SPARK-18841: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/16267
[jira] [Assigned] (SPARK-18841) PushProjectionThroughUnion exception when there are same column
[ https://issues.apache.org/jira/browse/SPARK-18841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18841: Assignee: Apache Spark
[jira] [Updated] (SPARK-18841) PushProjectionThroughUnion exception when there are same column
[ https://issues.apache.org/jira/browse/SPARK-18841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-18841: -- Description: {noformat} DROP TABLE IF EXISTS p1 ; DROP TABLE IF EXISTS p2 ; DROP TABLE IF EXISTS p3 ; CREATE TABLE p1 (col STRING) ; CREATE TABLE p2 (col STRING) ; CREATE TABLE p3 (col STRING) ; set spark.sql.crossJoin.enabled = true; SELECT 1 as cste, col FROM ( SELECT col as col FROM ( SELECT p1.col as col FROM p1 LEFT JOIN p2 UNION ALL SELECT col FROM p3 ) T1 ) T2 ; {noformat} it will throw exception: {noformat} key not found: col#16 java.util.NoSuchElementException: key not found: col#16 at scala.collection.MapLike$class.default(MapLike.scala:228) at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31) at scala.collection.MapLike$class.apply(MapLike.scala:141) at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378) at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:376) {noformat} was: DROP TABLE IF EXISTS p1 ; DROP TABLE IF EXISTS p2 ; DROP TABLE IF EXISTS p3 ; CREATE TABLE p1 (col STRING) ; CREATE TABLE p2 (col STRING) ; CREATE TABLE p3 (col STRING) ; set spark.sql.crossJoin.enabled = true; SELECT 1 as cste, col FROM ( SELECT col as col FROM ( SELECT p1.col as col FROM p1 LEFT JOIN p2 UNION ALL SELECT col FROM p3 ) T1 ) T2 ; it will throw exception: key not found: col#16 java.util.NoSuchElementException: key not found: col#16 at scala.collection.MapLike$class.default(MapLike.scala:228) at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31) at scala.collection.MapLike$class.apply(MapLike.scala:141) at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
[jira] [Assigned] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows
[ https://issues.apache.org/jira/browse/SPARK-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18842: Assignee: Apache Spark
[jira] [Assigned] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows
[ https://issues.apache.org/jira/browse/SPARK-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18842: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows
[ https://issues.apache.org/jira/browse/SPARK-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745319#comment-15745319 ] Apache Spark commented on SPARK-18842: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16266
[jira] [Commented] (SPARK-18609) [SQL] column mixup with CROSS JOIN
[ https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745309#comment-15745309 ] Song Jun commented on SPARK-18609: -- open another jira SPARK-18841 > [SQL] column mixup with CROSS JOIN > -- > > Key: SPARK-18609 > URL: https://issues.apache.org/jira/browse/SPARK-18609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Furcy Pin > > Reproduced on spark-sql v2.0.2 and on branch master. > {code} > DROP TABLE IF EXISTS p1 ; > DROP TABLE IF EXISTS p2 ; > CREATE TABLE p1 (col TIMESTAMP) ; > CREATE TABLE p2 (col TIMESTAMP) ; > set spark.sql.crossJoin.enabled = true; > -- EXPLAIN > WITH CTE AS ( > SELECT > s2.col as col > FROM p1 > CROSS JOIN ( > SELECT > e.col as col > FROM p2 E > ) s2 > ) > SELECT > T1.col as c1, > T2.col as c2 > FROM CTE T1 > CROSS JOIN CTE T2 > ; > {code} > This returns the following stacktrace : > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: col#21 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at >
[jira] [Created] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows
Hyukjin Kwon created SPARK-18842:

Summary: De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows
Key: SPARK-18842
URL: https://issues.apache.org/jira/browse/SPARK-18842
Project: Spark
Issue Type: Sub-task
Components: Spark Core, Tests
Reporter: Hyukjin Kwon
Priority: Minor

Currently, some tests fail and hang on Windows due to this problem. For the reason described in SPARK-18718, some tests using {{local-cluster}} mode were disabled on Windows because of the length limitation on the paths given to classpaths.

The limit seems to be roughly 32K (see https://blogs.msdn.microsoft.com/oldnewthing/20031210-00/?p=41553/ and https://support.thoughtworks.com/hc/en-us/articles/213248526-Getting-around-maximum-command-line-length-is-32767-characters-on-Windows), but in tests (only) executors were being launched with a command such as https://gist.github.com/HyukjinKwon/5bc81061c250d4af5a180869b59d42ea.

That command is roughly 40K long because of the class paths. However, more than half of the entries appear to be duplicates, so de-duplicating the paths reduces it to roughly 20K.

We may need to revisit this as more paths are added in the future, but for now it seems better than disabling all the tests, and it keeps the changes minimal.
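As a language-neutral illustration of the idea (the actual change is in Spark's Scala code and is not reproduced here), de-duplicating a classpath while keeping the first occurrence of each entry could be sketched as follows. The function name and the sample paths are hypothetical. Order is preserved because the JVM searches classpath entries in order, so dropping later repeats should not change which class is found first.

```python
import os


def dedupe_classpath(classpath: str, sep: str = os.pathsep) -> str:
    """Remove repeated classpath entries, keeping the first occurrence of each.

    dict.fromkeys keeps insertion order (Python 3.7+), so the relative order
    of the surviving entries is unchanged. The comparison is an exact string
    match; case-insensitive Windows paths are not normalised in this sketch.
    """
    return sep.join(dict.fromkeys(classpath.split(sep)))


# Hypothetical Windows-style classpath with repeated entries.
original = ";".join([
    r"C:\spark\conf",
    r"C:\spark\jars\a.jar",
    r"C:\spark\conf",
    r"C:\spark\jars\a.jar",
    r"C:\spark\jars\b.jar",
])
deduped = dedupe_classpath(original, sep=";")
print(len(original), "->", len(deduped))
```

Whether the result stays under the limit still depends on how many distinct entries there are, which is the caveat the description raises about paths being added in the future.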
[jira] [Created] (SPARK-18841) PushProjectionThroughUnion exception when there are same column
Song Jun created SPARK-18841:

Summary: PushProjectionThroughUnion exception when there are same column
Key: SPARK-18841
URL: https://issues.apache.org/jira/browse/SPARK-18841
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.2, 2.1.0
Reporter: Song Jun

DROP TABLE IF EXISTS p1 ;
DROP TABLE IF EXISTS p2 ;
DROP TABLE IF EXISTS p3 ;
CREATE TABLE p1 (col STRING) ;
CREATE TABLE p2 (col STRING) ;
CREATE TABLE p3 (col STRING) ;
set spark.sql.crossJoin.enabled = true;
SELECT
  1 as cste,
  col
FROM (
  SELECT
    col as col
  FROM (
    SELECT
      p1.col as col
    FROM p1
    LEFT JOIN p2
    UNION ALL
    SELECT
      col
    FROM p3
  ) T1
) T2
;

Running this query throws the following exception:

key not found: col#16
java.util.NoSuchElementException: key not found: col#16
at scala.collection.MapLike$class.default(MapLike.scala:228)
at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31)
at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346)
at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281)
at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345)
at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378)
at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:376)
[jira] [Commented] (SPARK-17890) scala.ScalaReflectionException
[ https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744964#comment-15744964 ]

Arkadiusz Komarzewski commented on SPARK-17890:
---
I hit this error recently, checked 2.1 (build 2016_12_12_03_37-63693c1-bin) - it's still there.
I can try fixing this - can you give me some tips on what to look for?

> scala.ScalaReflectionException
> --
>
> Key: SPARK-17890
> URL: https://issues.apache.org/jira/browse/SPARK-17890
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.1
> Environment: x86_64 GNU/Linux
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Reporter: Khalid Reid
> Priority: Minor
> Labels: newbie
>
> Hello,
> I am seeing an error message in spark-shell when I map a DataFrame to a Seq\[Foo\]. However, things work fine when I use flatMap.
> {noformat}
> scala> case class Foo(value:String)
> defined class Foo
> scala> val df = sc.parallelize(List(1,2,3)).toDF
> df: org.apache.spark.sql.DataFrame = [value: int]
> scala> df.map{x => Seq.empty[Foo]}
> scala.ScalaReflectionException: object $line14.$read not found.
> at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
> at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
> at $typecreator1$1.apply(:29)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> ... 48 elided
> scala> df.flatMap{_ => Seq.empty[Foo]} //flatMap works
> res2: org.apache.spark.sql.Dataset[Foo] = [value: string]
> {noformat}
> I am seeing the same error reported [here|https://issues.apache.org/jira/browse/SPARK-8465?jql=text%20~%20%22scala.ScalaReflectionException%22] when I use spark-submit.
> I am new to Spark but I don't expect this to throw an exception.
> Thanks.
[jira] [Updated] (SPARK-18796) StreamingQueryManager should not hold a lock when starting a query
[ https://issues.apache.org/jira/browse/SPARK-18796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-18796:
--
Assignee: Shixiong Zhu

> StreamingQueryManager should not hold a lock when starting a query
> --
>
> Key: SPARK-18796
> URL: https://issues.apache.org/jira/browse/SPARK-18796
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.0.2, 2.1.0
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Fix For: 2.1.0
>
> Otherwise, the user cannot start any queries when a query is starting. If a query takes a long time to start, the user experience will be pretty bad.
[jira] [Updated] (SPARK-18797) Update spark.logit in sparkr-vignettes
[ https://issues.apache.org/jira/browse/SPARK-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-18797:
--
Assignee: Miao Wang

> Update spark.logit in sparkr-vignettes
> --
>
> Key: SPARK-18797
> URL: https://issues.apache.org/jira/browse/SPARK-18797
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Reporter: Miao Wang
> Assignee: Miao Wang
> Fix For: 2.1.1, 2.2.0
>
> spark.logit is added in 2.1. We need to update spark-vignettes to reflect the changes. This is part of SparkR QA work.
[jira] [Updated] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns
[ https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-18642:
--
Assignee: Dongjoon Hyun

> Spark SQL: Catalyst is scanning undesired columns
> -
>
> Key: SPARK-18642
> URL: https://issues.apache.org/jira/browse/SPARK-18642
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.2, 1.6.3
> Environment: Ubuntu 14.04
> Spark: Local Mode
> Reporter: Mohit
> Assignee: Dongjoon Hyun
> Labels: performance
> Fix For: 2.0.0
>
> When doing a left-join between two tables, say A and B, Catalyst has information about the projection required for table B. Only the required columns should be scanned.
> Code snippet below explains the scenario:
> scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
> dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]
> scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
> dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]
> scala> dfA.registerTempTable("A")
> scala> dfB.registerTempTable("B")
> scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid where B.bid<2").explain
> == Physical Plan ==
> Project [aid#15,bid#17]
> +- Filter (bid#17 < 2)
>    +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
>       :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA
>       +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB
> This is a watered-down example from a production issue which has a huge performance impact.
> External reference: http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns
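The expectation in SPARK-18642 can be made concrete with a toy model. The sketch below is only an illustration of column pruning in general, not how Catalyst implements it: `required_columns`, the `schema` dictionary and the `referenced` set are all hypothetical names, and the referenced columns are written out by hand from the query in the report rather than derived from a parsed plan.

```python
def required_columns(schema, referenced):
    """Return, per table, only the columns the query references, in schema order."""
    return {
        table: [col for col in cols if (table, col) in referenced]
        for table, cols in schema.items()
    }


# Schemas of the two Parquet tables from the report.
schema = {"A": ["aid", "aVal"], "B": ["bid", "bVal"]}

# Columns used by:
#   select A.aid, B.bid from A left join B on A.aid=B.bid where B.bid<2
# (projection, join keys and filter together).
referenced = {("A", "aid"), ("B", "bid")}

pruned = required_columns(schema, referenced)
print(pruned)
```

Under this model neither `aVal` nor `bVal` needs to be read, whereas both scans in the reported physical plan include them.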
[jira] [Updated] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
[ https://issues.apache.org/jira/browse/SPARK-18752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-18752:
--
Assignee: Marcelo Vanzin

> "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
> --
>
> Key: SPARK-18752
> URL: https://issues.apache.org/jira/browse/SPARK-18752
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Marcelo Vanzin
> Assignee: Marcelo Vanzin
> Priority: Minor
> Fix For: 2.2.0
>
> We ran into an issue with the HiveShim code that calls "loadTable" and "loadPartition" while testing with some recent changes in upstream Hive.
> The semantics in Hive changed slightly, and if you provide the wrong value for "isSrcLocal" you now can end up with an invalid table: the Hive code will move the temp directory to the final destination instead of moving its children.
> The problem in Spark is that HiveShim.scala tries to figure out the value of "isSrcLocal" based on where the source and target directories are; that's not correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA LOCAL" would set it to "true"). So we need to propagate that information from the user query down to HiveShim.
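The point that "isSrcLocal" should follow the user's statement, rather than be inferred from directory locations, can be illustrated with a toy check. This is a sketch under stated assumptions: Spark uses its own SQL parser, not a regular expression, and the helper name `is_src_local` is hypothetical. It only assumes the Hive statement shape `LOAD DATA [LOCAL] INPATH ...`.

```python
import re

# Matches the start of a Hive-style LOAD DATA statement and captures the
# optional LOCAL keyword. Illustrative only; real parsing is done elsewhere.
_LOAD_DATA = re.compile(r"^\s*LOAD\s+DATA\s+(LOCAL\s+)?INPATH\b", re.IGNORECASE)


def is_src_local(query: str) -> bool:
    """Return True when the statement says LOAD DATA LOCAL, False when it omits LOCAL."""
    match = _LOAD_DATA.match(query)
    if match is None:
        raise ValueError("not a LOAD DATA statement")
    return match.group(1) is not None


print(is_src_local("LOAD DATA LOCAL INPATH '/tmp/x' INTO TABLE t"))
print(is_src_local("LOAD DATA INPATH '/warehouse/x' INTO TABLE t"))
```

The flag is decided once, where the query text is known, and would then be passed down unchanged instead of being recomputed from the paths.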
[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?
[ https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744740#comment-15744740 ]

Vicente Masip commented on SPARK-18823:
---
Yes, I have been able to do it with your suggestion. In my case, I use it inside a loop over the column names (i is the loop index):

df <- withColumn(df, myColNames[i], cast(df[[i]],"boolean"))

I am happy to have found a solution. It would be friendlier if [[ were available on the left side. Alternatively, including this example in the documentation would also be a solution. It should be made clear that [[ is available on the right side but not on the left, and that withColumn is the solution if you need it.
I also think that if x$y <- t$q assignment is available, it should be available with all its consequences. Am I wrong?

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
> Issue Type: Question
> Components: SparkR
> Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 4. Or databricks (community.cloud.databricks.com) .
> Reporter: Vicente Masip
> Fix For: 2.0.2
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I really don't know if this is a bug or can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to be accessed through a variable. Outside of SparkR, I have always done this with double brackets, like this:
> # df could be the faithful data set, as a normal data frame or data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even column number
> df[[2]] <- df$eruptions
> The error is not caused by the right side of the "<-" assignment operator. The problem is that I can't assign to a column name using a variable or column number as I do in these examples outside of Spark. It doesn't matter whether I am modifying or creating a column; the problem is the same.
> I have also tried to use this with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)
[jira] [Comment Edited] (SPARK-18650) race condition in FileScanRDD.scala
[ https://issues.apache.org/jira/browse/SPARK-18650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744695#comment-15744695 ]

Soumabrata Chakraborty edited comment on SPARK-18650 at 12/13/16 9:43 AM:
--
I am facing the same issue while trying to load a csv. I am using org.apache.spark:spark-core_2.10:2.0.2 and org.apache.spark:spark-sql_2.10:2.0.2 using the following to read the csv

Dataset dataset = sparkSession.read().schema(schema).options(options).csv(path) ;

The options contain "header":"false" and "sep":"|"
The issue is also reproduced on versions 2.0.0 and 2.0.1 in addition to 2.0.2

was (Author: soumabrata):
I am facing the same issue while trying to load a csv. I am using org.apache.spark:spark-core_2.10:2.0.2 and using the following to read the csv

Dataset dataset = sparkSession.read().schema(schema).options(options).csv(path) ;

The options contain "header":"false" and "sep":"|"

> race condition in FileScanRDD.scala
> ---
>
> Key: SPARK-18650
> URL: https://issues.apache.org/jira/browse/SPARK-18650
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Environment: scala 2.11
> macos 10.11.6
> Reporter: Jay Goldman
>
> I am attempting to create a DataSet from a single CSV file :
> val ss: SparkSession =
> val ddr = ss.read.option("path", path)
> ... (choose between xml vs csv parsing)
> var df = ddr.option("sep", ",")
> .option("quote", "\"")
> .option("escape", "\"") // want to retain backslashes (\) ...
> .option("delimiter", ",")
> .option("comment", "#")
> .option("header", "true")
> .option("format", "csv")
> ddr.csv(path)
> df.count() returns 2 times the number of lines in the CSV file - i.e., each line of the input file shows up as 2 rows in df.
> moreover df.distinct.count has the correct rows.
> There appears to be a problem in FileScanRDD.compute. I am using spark version 2.0.1 with scala 2.11. I am not going to include the entire contents of FileScanRDD.scala here.
> In FileScanRDD.compute there is the following:
> private[this] val files = split.asInstanceOf[FilePartition].files.toIterator
> If i put a breakpoint in either FileScanRDD.compute or FIleScanRDD.nextIterator the resulting dataset has the correct number of rows.
> Moreover, the code in FileScanRDD.scala is:
> private def nextIterator(): Boolean = {
> updateBytesReadWithFileSize()
> if (files.hasNext) { // breakpoint here => works
> currentFile = files.next() // breakpoint here => fails
>
> }
> else { }
>
> }
> if i put a breakpoint on the files.hasNext line all is well; however, if i put a breakpoint on the files.next() line the code will fail when i continue because the files iterator has become empty (see stack trace below). Disabling the breakpoint winds up creating a Dataset with each line of the csv file duplicated.
> So it appears that multiple threads are using the files iterator or the underling split value (an RDDPartition) and timing wise on my system 2 workers wind up processing the same file, with the resulting DataSet having 2 copies of each of the input lines.
> This code is not active when parsing an XML file.
> here is stack trace:
> java.util.NoSuchElementException: next on empty iterator
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:111)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at
[jira] [Commented] (SPARK-18650) race condition in FileScanRDD.scala
[ https://issues.apache.org/jira/browse/SPARK-18650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744695#comment-15744695 ] Soumabrata Chakraborty commented on SPARK-18650: I am facing the same issue while trying to load a csv. I am using org.apache.spark:spark-core_2.10:2.0.2 and using the following to read the csv Dataset dataset = sparkSession.read().schema(schema).options(options).csv(path) ; The options contain "header":"false" and "sep":"|" > race condition in FileScanRDD.scala > --- > > Key: SPARK-18650 > URL: https://issues.apache.org/jira/browse/SPARK-18650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: scala 2.11 > macos 10.11.6 >Reporter: Jay Goldman > > I am attempting to create a DataSet from a single CSV file : > val ss: SparkSession = > val ddr = ss.read.option("path", path) > ... (choose between xml vs csv parsing) > var df = ddr.option("sep", ",") > .option("quote", "\"") > .option("escape", "\"") // want to retain backslashes (\) ... > .option("delimiter", ",") > .option("comment", "#") > .option("header", "true") > .option("format", "csv") >ddr.csv(path) > df.count() returns 2 times the number of lines in the CSV file - i.e., each > line of the input file shows up as 2 rows in df. > moreover df.distinct.count has the correct rows. > There appears to be a problem in FileScanRDD.compute. I am using spark > version 2.0.1 with scala 2.11. I am not going to include the entire contents > of FileScanRDD.scala here. > In FileScanRDD.compute there is the following: > private[this] val files = split.asInstanceOf[FilePartition].files.toIterator > If i put a breakpoint in either FileScanRDD.compute or > FIleScanRDD.nextIterator the resulting dataset has the correct number of rows. 
> Moreover, the code in FileScanRDD.scala is: > private def nextIterator(): Boolean = { > updateBytesReadWithFileSize() > if (files.hasNext) { // breakpoint here => works > currentFile = files.next() // breakpoint here => fails > > } > else { } > > } > if i put a breakpoint on the files.hasNext line all is well; however, if i > put a breakpoint on the files.next() line the code will fail when i continue > because the files iterator has become empty (see stack trace below). > Disabling the breakpoint winds up creating a Dataset with each line of the > csv file duplicated. > So it appears that multiple threads are using the files iterator or the > underling split value (an RDDPartition) and timing wise on my system 2 > workers wind up processing the same file, with the resulting DataSet having 2 > copies of each of the input lines. > This code is not active when parsing an XML file. > here is stack trace: > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:111) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/11/30 09:31:07 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0
[jira] [Comment Edited] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment
[ https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744370#comment-15744370 ]

Saisai Shao edited comment on SPARK-18840 at 12/13/16 9:30 AM:
---
This problem also existed in branch 1.6, but the fix is a little different compared to master.

was (Author: jerryshao):
This problem also existed in branch 1.6, but the fix is a little complicated compared to master.

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.6.3, 2.1.0
> Reporter: Saisai Shao
> Priority: Minor
>
> Currently, the logic in {{HDFSCredentialProvider}} assumes that an HDFS delegation token exists. This is fine in an HDFS environment, but in some cloud environments such as Azure, HDFS is not required, so it throws an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.
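The "head of empty list" failure comes from taking the first element of a collection that may legitimately be empty. The sketch below shows the general guard in Python; it is not the patch in the pull request for this issue. The function names and the `renew_interval_ms` field are hypothetical, and tokens are modelled as plain dictionaries for illustration.

```python
from typing import Optional, Sequence


def first_hdfs_token(tokens: Sequence[dict]) -> Optional[dict]:
    """Return the first HDFS delegation token, or None when there is none."""
    return next((t for t in tokens if t.get("kind") == "HDFS_DELEGATION_TOKEN"), None)


def renewal_interval(tokens: Sequence[dict]) -> Optional[int]:
    """Return the renewal interval of the HDFS token, or None without HDFS."""
    token = first_hdfs_token(tokens)
    if token is None:
        # No HDFS token in this environment (for example a cloud store):
        # there is nothing to renew, so do not fail.
        return None
    return token["renew_interval_ms"]  # hypothetical field name


print(renewal_interval([]))
```

Callers then treat a missing interval as "no renewal needed" instead of receiving an exception from an unconditional first-element lookup.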
[jira] [Assigned] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment
[ https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18840: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment
[ https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744670#comment-15744670 ] Apache Spark commented on SPARK-18840: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/16265
[jira] [Assigned] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment
[ https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18840: Assignee: Apache Spark
[jira] [Updated] (SPARK-18837) Very long stage descriptions do not wrap in the UI
[ https://issues.apache.org/jira/browse/SPARK-18837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-18837:
--
Summary: Very long stage descriptions do not wrap in the UI (was: It will not hidden if job or stage description too long)

> Very long stage descriptions do not wrap in the UI
> --
>
> Key: SPARK-18837
> URL: https://issues.apache.org/jira/browse/SPARK-18837
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.1.0
> Reporter: Yuming Wang
> Priority: Minor
> Attachments: ui-2.0.0.gif, ui-2.1.0.gif
>
> *previous*:
> !ui-2.0.0.gif!
> *current*:
> !ui-2.1.0.gif!
[jira] [Updated] (SPARK-18837) It will not hidden if job or stage description too long
[ https://issues.apache.org/jira/browse/SPARK-18837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-18837:
--
Priority: Minor (was: Major)
Issue Type: Improvement (was: Bug)

I agree that at least this needs to be wrappable

> It will not hidden if job or stage description too long
> ---
>
> Key: SPARK-18837
> URL: https://issues.apache.org/jira/browse/SPARK-18837
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.1.0
> Reporter: Yuming Wang
> Priority: Minor
> Attachments: ui-2.0.0.gif, ui-2.1.0.gif
>
> *previous*:
> !ui-2.0.0.gif!
> *current*:
> !ui-2.1.0.gif!
[jira] [Commented] (SPARK-18835) Do not expose shaded types in JavaTypeInference API
[ https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744600#comment-15744600 ]

Sean Owen commented on SPARK-18835:
---
(Ideally we should fix this for 2.1.0 because it causes the tests to consistently fail in at least a couple envs)

> Do not expose shaded types in JavaTypeInference API
> ---
>
> Key: SPARK-18835
> URL: https://issues.apache.org/jira/browse/SPARK-18835
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Marcelo Vanzin
> Priority: Minor
>
> Currently, {{inferDataType(TypeToken)}} is called from a different maven module, and because we shade Guava, that sometimes leads to errors (e.g. when running tests using maven):
> {noformat}
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite) Time elapsed: 0.084 sec <<< ERROR!
> java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> at test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
> Results :
> Tests in error:
> JavaUDFSuite.udf3Test:107 » NoSuchMethod org.apache.spark.sql.catalyst.JavaTyp...
> {noformat}
> Instead, we shouldn't expose Guava types in these APIs.
[jira] [Closed] (SPARK-18804) Join doesn't work in Spark on Bigger tables
[ https://issues.apache.org/jira/browse/SPARK-18804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen closed SPARK-18804.
-

> Join doesn't work in Spark on Bigger tables
> ---
>
> Key: SPARK-18804
> URL: https://issues.apache.org/jira/browse/SPARK-18804
> Project: Spark
> Issue Type: Question
> Components: Input/Output
> Affects Versions: 1.6.1
> Reporter: Gopal Nagar
>
> Hi All,
> Spark 1.6.1 has been installed on a 3-node AWS EMR cluster with 32 GB RAM and 80 GB storage on each node. I am trying to join two tables (1.2 GB and 900 MB) that have 4607818 and 14273378 rows respectively. It is running in client mode with the YARN cluster manager.
> If I put a limit of 100 in the select query, it works fine. But if I try to join the entire data set, the query runs for 3-4 hours and finally gets terminated. I can always see 18 GB free on each node.
> I have tried increasing the number of executors/cores/partitions, but it still doesn't work. This has been tried in PySpark and submitted using the spark-submit command, but it doesn't run. Please advise.
> Join Query
> --
> select * FROM table1 as t1 join table2 as t2 on t1.col = t2.col limit 100;
[jira] [Resolved] (SPARK-18804) Join doesn't work in Spark on Bigger tables
[ https://issues.apache.org/jira/browse/SPARK-18804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-18804.
---
Resolution: Not A Problem

(Please don't reopen if the discussion has not meaningfully changed. JIRA isn't for questions/discussion -- mailing lists are.)

I already indicated that you need to investigate the basics of why the failure occurred. That detail is not here.

> Join doesn't work in Spark on Bigger tables
> ---
>
> Key: SPARK-18804
> URL: https://issues.apache.org/jira/browse/SPARK-18804
> Project: Spark
> Issue Type: Question
> Components: Input/Output
> Affects Versions: 1.6.1
> Reporter: Gopal Nagar
>
> Hi All,
> Spark 1.6.1 has been installed on a 3-node AWS EMR cluster which has 32 GB RAM
> and 80 GB storage on each node. I am trying to join two tables (1.2 GB & 900 MB)
> that have 4607818 & 14273378 rows respectively. It's running in client mode on
> the YARN cluster manager.
> If I put a limit of 100 in the select query it works fine. But if I try to join
> on the entire data set, the query runs for 3-4 hours and finally gets terminated. I
> can always see 18 GB free on each node.
> I have tried increasing the number of executors/cores/partitions, but it still doesn't
> work. This has been tried in PySpark and submitted using the spark-submit command
> but doesn't run. Please advise.
> Join Query
> --
> select * FROM table1 as t1 join table2 as t2 on t1.col = t2.col limit 100;
[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?
[ https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744519#comment-15744519 ]

Shivaram Venkataraman commented on SPARK-18823:
---
Ah I see your point - `withColumn` does work for this use case? I agree that adding this to `[<-` or `[[<-` would be a better user experience.
{code}
> df2 <- withColumn(df, "tmp1", df$eruptions)
> head(df2)
  eruptions waiting tmp1
1 3.600 79 3.600
...
{code}

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
> Issue Type: Question
> Components: SparkR
> Affects Versions: 2.0.2
> Environment: RStudio Server on EC2 instances (AWS EMR service), EMR
> 4; or Databricks (community.cloud.databricks.com).
> Reporter: Vicente Masip
> Fix For: 2.0.2
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I really don't know if this is a bug or can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to
> be accessed through a variable. Normally, outside of SparkR, I have always done
> this with double brackets, like this:
> # df could be faithful (a normal data frame or data table).
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even column number
> df[[2]] <- df$eruptions
> The error is not caused by the right side of the "<-" assignment operator.
> The problem is that I can't assign to a column name using a variable or
> column number as I do in these examples outside of Spark. It doesn't matter if I am
> modifying or creating a column. Same problem.
> I have also tried to use this with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)