[jira] [Resolved] (SPARK-22348) The table cache providing ColumnarBatch should also do partition batch pruning
[ https://issues.apache.org/jira/browse/SPARK-22348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22348. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.3.0 > The table cache providing ColumnarBatch should also do partition batch pruning > -- > > Key: SPARK-22348 > URL: https://issues.apache.org/jira/browse/SPARK-22348 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > We enable table cache {{InMemoryTableScanExec}} to provide {{ColumnarBatch}} > now. But the cached batches are retrieved without pruning. In this case, we > still need to do partition batch pruning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22344) Prevent R CMD check from using /tmp
[ https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-22344: - Affects Version/s: 2.3.0 1.6.3 2.2.0 > Prevent R CMD check from using /tmp > --- > > Key: SPARK-22344 > URL: https://issues.apache.org/jira/browse/SPARK-22344 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0 >Reporter: Shivaram Venkataraman > > When R CMD check is run on the SparkR package it leaves behind files in /tmp > which is a violation of CRAN policy. We should instead write to Rtmpdir. > Notes from CRAN are below > {code} > Checking this leaves behind dirs >hive/$USER >$USER > and files named like >b4f6459b-0624-4100-8358-7aa7afbda757_resources > in /tmp, in violation of the CRAN Policy. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21616) SparkR 2.3.0 migration guide, release note
[ https://issues.apache.org/jira/browse/SPARK-21616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218104#comment-16218104 ] Felix Cheung commented on SPARK-21616: -- True, I don't know if we are tracking changes to programming guide for "migration guide" section that way though > SparkR 2.3.0 migration guide, release note > -- > > Key: SPARK-21616 > URL: https://issues.apache.org/jira/browse/SPARK-21616 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > > From looking at changes since 2.2.0, this/these should be documented in the > migration guide / release note for the 2.3.0 release, as it is behavior > changes -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe
[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218061#comment-16218061 ] Apache Spark commented on SPARK-15474: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19571 > ORC data source fails to write and read back empty dataframe > - > > Key: SPARK-15474 > URL: https://issues.apache.org/jira/browse/SPARK-15474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.1, 2.2.0 >Reporter: Hyukjin Kwon > > Currently ORC data source fails to write and read empty data. > The code below: > {code} > val emptyDf = spark.range(10).limit(0) > emptyDf.write > .format("orc") > .save(path.getCanonicalPath) > val copyEmptyDf = spark.read > .format("orc") > .load(path.getCanonicalPath) > copyEmptyDf.show() > {code} > throws an exception below: > {code} > Unable to infer schema for ORC at > /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. > It must be specified manually; > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at > /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. > It must be specified manually; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140) > at > org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892) > at > org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884) > at > org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114) > {code} > Note that this is a different case with the data below > {code} > val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema) > {code} > In this case, any writer is not initialised and created. (no calls of > {{WriterContainer.writeRows()}}. > For Parquet and JSON, it works but ORC does not. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22335: Assignee: (was: Apache Spark) > Union for DataSet uses column order instead of types for union > -- > > Key: SPARK-22335 > URL: https://issues.apache.org/jira/browse/SPARK-22335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Carlos Bribiescas > > I see union uses column order for a DF. This to me is "fine" since they > aren't typed. > However, for a dataset which is supposed to be strongly typed it is actually > giving the wrong result. If you try to access the members by name, it will > use the order. Heres is a reproducible case. 2.2.0 > {code:java} > case class AB(a : String, b : String) > val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") > val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") > > abDf.union(baDf).show() // as linked ticket states, its "Not a problem" > > val abDs = abDf.as[AB] > val baDs = baDf.as[AB] > > abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] > should be correctly mapped by type, not by column order > > abDs.union(baDs).map(_.a).show() // This gives wrong result since a > Dataset[AB] should be correctly mapped by type, not by column order >abDs.union(baDs).rdd.take(2) // This also gives wrong result > baDs.map(_.a).show() // However, this gives the correct result, even though > columns were out of order. > abDs.map(_.a).show() // This is correct too > baDs.select("a","b").as[AB].union(abDs).show() // This is the same > workaround for linked issue, slightly modified. However this seems wrong > since its supposed to be strongly typed > > baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct > result, which is logically inconsistent behavior > abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives > correct result > {code} > So its inconsistent and a bug IMO. And I'm not sure that the suggested work > around is really fair, since I'm supposed to be getting of type `AB`. More > importantly I think the issue is bigger when you consider that it happens > even if you read from parquet (as you would expect). And that its > inconsistent when going to/from rdd. > I imagine its just lazily converting to typed DS instead of initially. So > either that typing could be prioritized to happen before the union or > unioning of DF could be done with column order taken into account. Again, > this is speculation.. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22335: Assignee: Apache Spark > Union for DataSet uses column order instead of types for union > -- > > Key: SPARK-22335 > URL: https://issues.apache.org/jira/browse/SPARK-22335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Carlos Bribiescas >Assignee: Apache Spark > > I see union uses column order for a DF. This to me is "fine" since they > aren't typed. > However, for a dataset which is supposed to be strongly typed it is actually > giving the wrong result. If you try to access the members by name, it will > use the order. Heres is a reproducible case. 2.2.0 > {code:java} > case class AB(a : String, b : String) > val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") > val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") > > abDf.union(baDf).show() // as linked ticket states, its "Not a problem" > > val abDs = abDf.as[AB] > val baDs = baDf.as[AB] > > abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] > should be correctly mapped by type, not by column order > > abDs.union(baDs).map(_.a).show() // This gives wrong result since a > Dataset[AB] should be correctly mapped by type, not by column order >abDs.union(baDs).rdd.take(2) // This also gives wrong result > baDs.map(_.a).show() // However, this gives the correct result, even though > columns were out of order. > abDs.map(_.a).show() // This is correct too > baDs.select("a","b").as[AB].union(abDs).show() // This is the same > workaround for linked issue, slightly modified. However this seems wrong > since its supposed to be strongly typed > > baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct > result, which is logically inconsistent behavior > abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives > correct result > {code} > So its inconsistent and a bug IMO. And I'm not sure that the suggested work > around is really fair, since I'm supposed to be getting of type `AB`. More > importantly I think the issue is bigger when you consider that it happens > even if you read from parquet (as you would expect). And that its > inconsistent when going to/from rdd. > I imagine its just lazily converting to typed DS instead of initially. So > either that typing could be prioritized to happen before the union or > unioning of DF could be done with column order taken into account. Again, > this is speculation.. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218035#comment-16218035 ] Apache Spark commented on SPARK-22335: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/19570 > Union for DataSet uses column order instead of types for union > -- > > Key: SPARK-22335 > URL: https://issues.apache.org/jira/browse/SPARK-22335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Carlos Bribiescas > > I see union uses column order for a DF. This to me is "fine" since they > aren't typed. > However, for a dataset which is supposed to be strongly typed it is actually > giving the wrong result. If you try to access the members by name, it will > use the order. Heres is a reproducible case. 2.2.0 > {code:java} > case class AB(a : String, b : String) > val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") > val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") > > abDf.union(baDf).show() // as linked ticket states, its "Not a problem" > > val abDs = abDf.as[AB] > val baDs = baDf.as[AB] > > abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] > should be correctly mapped by type, not by column order > > abDs.union(baDs).map(_.a).show() // This gives wrong result since a > Dataset[AB] should be correctly mapped by type, not by column order >abDs.union(baDs).rdd.take(2) // This also gives wrong result > baDs.map(_.a).show() // However, this gives the correct result, even though > columns were out of order. > abDs.map(_.a).show() // This is correct too > baDs.select("a","b").as[AB].union(abDs).show() // This is the same > workaround for linked issue, slightly modified. However this seems wrong > since its supposed to be strongly typed > > baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct > result, which is logically inconsistent behavior > abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives > correct result > {code} > So its inconsistent and a bug IMO. And I'm not sure that the suggested work > around is really fair, since I'm supposed to be getting of type `AB`. More > importantly I think the issue is bigger when you consider that it happens > even if you read from parquet (as you would expect). And that its > inconsistent when going to/from rdd. > I imagine its just lazily converting to typed DS instead of initially. So > either that typing could be prioritized to happen before the union or > unioning of DF could be done with column order taken into account. Again, > this is speculation.. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218027#comment-16218027 ] Liang-Chi Hsieh commented on SPARK-22335: - [~CBribiescas] The column position in the schema of a Dataset doesn't necessarily match the fields in the typed objects. The document of {{as}} has explained it: {code} Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of `U`: - When `U` is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by `spark.sql.caseSensitive`). {code} {{unionByName}} resolves columns by name. For the typed objects, I think it can satisfy this kind of usage. I remember that this is not the first ticket opened for union behavior on Datasets. The first thing might be that we can better describe this in the document of {{union}}. > Union for DataSet uses column order instead of types for union > -- > > Key: SPARK-22335 > URL: https://issues.apache.org/jira/browse/SPARK-22335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Carlos Bribiescas > > I see union uses column order for a DF. This to me is "fine" since they > aren't typed. > However, for a dataset which is supposed to be strongly typed it is actually > giving the wrong result. If you try to access the members by name, it will > use the order. Heres is a reproducible case. 2.2.0 > {code:java} > case class AB(a : String, b : String) > val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") > val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") > > abDf.union(baDf).show() // as linked ticket states, its "Not a problem" > > val abDs = abDf.as[AB] > val baDs = baDf.as[AB] > > abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] > should be correctly mapped by type, not by column order > > abDs.union(baDs).map(_.a).show() // This gives wrong result since a > Dataset[AB] should be correctly mapped by type, not by column order >abDs.union(baDs).rdd.take(2) // This also gives wrong result > baDs.map(_.a).show() // However, this gives the correct result, even though > columns were out of order. > abDs.map(_.a).show() // This is correct too > baDs.select("a","b").as[AB].union(abDs).show() // This is the same > workaround for linked issue, slightly modified. However this seems wrong > since its supposed to be strongly typed > > baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct > result, which is logically inconsistent behavior > abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives > correct result > {code} > So its inconsistent and a bug IMO. And I'm not sure that the suggested work > around is really fair, since I'm supposed to be getting of type `AB`. More > importantly I think the issue is bigger when you consider that it happens > even if you read from parquet (as you would expect). And that its > inconsistent when going to/from rdd. > I imagine its just lazily converting to typed DS instead of initially. So > either that typing could be prioritized to happen before the union or > unioning of DF could be done with column order taken into account. Again, > this is speculation.. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22348) The table cache providing ColumnarBatch should also do partition batch pruning
[ https://issues.apache.org/jira/browse/SPARK-22348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22348: Assignee: Apache Spark > The table cache providing ColumnarBatch should also do partition batch pruning > -- > > Key: SPARK-22348 > URL: https://issues.apache.org/jira/browse/SPARK-22348 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > We enable table cache {{InMemoryTableScanExec}} to provide {{ColumnarBatch}} > now. But the cached batches are retrieved without pruning. In this case, we > still need to do partition batch pruning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22348) The table cache providing ColumnarBatch should also do partition batch pruning
[ https://issues.apache.org/jira/browse/SPARK-22348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22348: Assignee: (was: Apache Spark) > The table cache providing ColumnarBatch should also do partition batch pruning > -- > > Key: SPARK-22348 > URL: https://issues.apache.org/jira/browse/SPARK-22348 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Liang-Chi Hsieh > > We enable table cache {{InMemoryTableScanExec}} to provide {{ColumnarBatch}} > now. But the cached batches are retrieved without pruning. In this case, we > still need to do partition batch pruning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22348) The table cache providing ColumnarBatch should also do partition batch pruning
[ https://issues.apache.org/jira/browse/SPARK-22348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217992#comment-16217992 ] Apache Spark commented on SPARK-22348: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/19569 > The table cache providing ColumnarBatch should also do partition batch pruning > -- > > Key: SPARK-22348 > URL: https://issues.apache.org/jira/browse/SPARK-22348 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Liang-Chi Hsieh > > We enable table cache {{InMemoryTableScanExec}} to provide {{ColumnarBatch}} > now. But the cached batches are retrieved without pruning. In this case, we > still need to do partition batch pruning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22348) The table cache providing ColumnarBatch should also do partition batch pruning
Liang-Chi Hsieh created SPARK-22348: --- Summary: The table cache providing ColumnarBatch should also do partition batch pruning Key: SPARK-22348 URL: https://issues.apache.org/jira/browse/SPARK-22348 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Liang-Chi Hsieh We enable table cache {{InMemoryTableScanExec}} to provide {{ColumnarBatch}} now. But the cached batches are retrieved without pruning. In this case, we still need to do partition batch pruning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217979#comment-16217979 ] Peng Meng commented on SPARK-22277: --- For problem 1 and 2, could you please post the test code. For problem 1, one possible case is all the feature ChiSquare statistics value is the same, no matter you select which feature, the result is right. To code is helpful for analysis of the problem, > Chi Square selector garbling Vector content. > > > Key: SPARK-22277 > URL: https://issues.apache.org/jira/browse/SPARK-22277 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Cheburakshu > > There is a difference in behavior when Chisquare selector is used v direct > feature use in decision tree classifier. > In the below code, I have used chisquare selector as a thru' pass but the > decision tree classifier is unable to process it. But, it is able to process > when the features are used directly. > The example is pulled out directly from Apache spark python documentation. > Kindly help. > {code:python} > from pyspark.ml.feature import ChiSqSelector > from pyspark.ml.linalg import Vectors > import sys > df = spark.createDataFrame([ > (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), > (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), > (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", > "clicked"]) > # ChiSq selector will just be a pass-through. All four featuresin the i/p > will be in output also. > selector = ChiSqSelector(numTopFeatures=4, featuresCol="features", > outputCol="selectedFeatures", labelCol="clicked") > result = selector.fit(df).transform(df) > print("ChiSqSelector output with top %d features selected" % > selector.getNumTopFeatures()) > from pyspark.ml.classification import DecisionTreeClassifier > try: > # Fails > dt = > DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures") > model = dt.fit(result) > except: > print(sys.exc_info()) > #Works > dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features") > model = dt.fit(df) > > # Make predictions. Using same dataset, not splitting!! > predictions = model.transform(result) > # Select example rows to display. > predictions.select("prediction", "clicked", "features").show(5) > # Select (prediction, true label) and compute test error > evaluator = MulticlassClassificationEvaluator( > labelCol="clicked", predictionCol="prediction", metricName="accuracy") > accuracy = evaluator.evaluate(predictions) > print("Test Error = %g " % (1.0 - accuracy)) > {code} > Output: > ChiSqSelector output with top 4 features selected > (, > IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but > it does not have the number of values specified.', > 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t > at > org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t > at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at > org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t > at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at > org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at > sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t > at java.lang.reflect.Method.invoke(Method.java:498)\n\t at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at > py4j.Gateway.invoke(Gateway.java:280)\n\t at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at > py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at > py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at > java.lang.Thread.run(Thread.java:745)'), ) > +--+---+--+ > |prediction|clicked|
[jira] [Updated] (SPARK-17074) generate equi-height histogram for column
[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-17074: - Affects Version/s: (was: 2.0.0) 2.3.0 > generate equi-height histogram for column > - > > Key: SPARK-17074 > URL: https://issues.apache.org/jira/browse/SPARK-17074 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.3.0 >Reporter: Ron Hu > > Equi-height histogram is effective in handling skewed data distribution. > For equi-height histogram, the heights of all bins(intervals) are the same. > The default number of bins we use is 254. > Now we use a two-step method to generate an equi-height histogram: > 1. use percentile_approx to get percentiles (end points of the equi-height > bin intervals); > 2. use a new aggregate function to get distinct counts in each of these bins. > Note that this method takes two table scans. In the future we may provide > other algorithms which need only one table scan. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Porter updated SPARK-22347: --- Description: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF when the `F.when` condition is false. Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, then the error is not raised and all rows resolve to `null`, which is the expected result. was: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF when the `F.when` condition is false. Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, then the error is not raised and all rows resolve to `null`. > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Priority: Minor > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF when the `F.when` condition is false. > Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, > then the error is not raised and all rows resolve to `null`, which is the > expected result. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Porter updated SPARK-22347: --- Description: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF when the `F.when` condition is false. Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, then the error is not raised and all rows resolve to `null`, which is the expected result. was: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF when the `F.when` condition is false. Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, then the error is not raised and all rows resolve to `null`, which is the expected result. > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Priority: Minor > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF when the `F.when` condition is false. > Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, > then the error is not raised and all rows resolve to `null`, which is the > expected result. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Porter updated SPARK-22347: --- Description: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF when the `F.when` condition is false. Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, then the error is not raised and all rows resolve to `null`. was: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF when the `F.when` condition is false. > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Priority: Minor > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF when the `F.when` condition is false. > Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, > then the error is not raised and all rows resolve to `null`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Porter updated SPARK-22347: --- Description: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF when the `F.when` condition is false. was: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF since the condition fails. > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Priority: Minor > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF when the `F.when` condition is false. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Porter updated SPARK-22347: --- Description: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF since the condition fails. was: Here's a simple example on how to reproduce this: {code:python} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF since the condition fails. > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Priority: Minor > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF since the condition fails. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Porter updated SPARK-22347: --- Description: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF since the condition fails. was: Here's a simple example on how to reproduce this: {code} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF since the condition fails. > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Priority: Minor > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF since the condition fails. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
Nicolas Porter created SPARK-22347: -- Summary: UDF is evaluated when 'F.when' condition is false Key: SPARK-22347 URL: https://issues.apache.org/jira/browse/SPARK-22347 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Reporter: Nicolas Porter Priority: Minor Here's a simple example on how to reproduce this: {code:python} from pyspark.sql import functions as F, Row, types def Divide10(): def fn(value): return 10 / int(value) return F.udf(fn, types.IntegerType()) df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() x = F.col('x') df2 = df.select(F.when((x > 0), Divide10()(x))) df2.show(200) {code} This raises a division by zero error, even if `F.when` is trying to filter out all cases where `x <= 0`. I believe the correct behavior should be not to evaluate the UDF since the condition fails. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22346: Description: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141]. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanism to prevent regressions. 3) Require Vector columns to have fixed length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size. Pros: * We already treat vectors as fixed size, for example VectorAssembler assumes the inputs * output col are fixed size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume. * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known. Cons: * This would require breaking changes. was: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141]. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22346: Description: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141]. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanism to prevent regressions. 3) Require Vector columns to have fixed length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size. Pros: * We already treat vectors as fixed size, for example VectorAssembler assumes the inputs * output col are fixed size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume. * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known. Cons: * This would require breaking changes. was: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [#SPARK-19141]. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22346: Description: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [#SPARK-19141]. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanism to prevent regressions. 3) Require Vector columns to have fixed length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size. Pros: * We already treat vectors as fixed size, for example VectorAssembler assumes the inputs * output col are fixed size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume. * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known. Cons: * This would require breaking changes. was: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler * Current Attributes implementation is also causing other issues, . Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanism to prevent regressions. 3) Require Vector columns to have
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22346: Description: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler * Current Attributes implementation is also causing other issues, . Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanism to prevent regressions. 3) Require Vector columns to have fixed length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size. Pros: * We already treat vectors as fixed size, for example VectorAssembler assumes the inputs * output col are fixed size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume. * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known. Cons: * This would require breaking changes. was: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanism to prevent regressions. 3) Require Vector columns to have fixed length vectors. Most mllib transformers that produce vectors already include the
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22346: Description: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanism to prevent regressions. 3) Require Vector columns to have fixed length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size. Pros: * We already treat vectors as fixed size, for example VectorAssembler assumes the inputs * output col are fixed size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume. * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known. Cons: * This would require breaking changes. was: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanism to prevent regressions. 3) Require Vector columns to have fixed length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to
[jira] [Created] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
Bago Amirbekian created SPARK-22346: --- Summary: Update VectorAssembler to work with StreamingDataframes Key: SPARK-22346 URL: https://issues.apache.org/jira/browse/SPARK-22346 Project: Spark Issue Type: Improvement Components: ML, Structured Streaming Affects Versions: 2.2.0 Reporter: Bago Amirbekian Priority: Critical The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair like was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially, easy short term fix for VectorAssembler Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changings. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanism to prevent regressions. 3) Require Vector columns to have fixed length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size. Pros: * We already treat vectors as fixed size, for example VectorAssembler assumes the inputs * output col are fixed size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume. * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known. Cons: * This would require breaking changes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217835#comment-16217835 ] Leif Walsh commented on SPARK-22340: Ok, this is fairly straightforward. The problem is that from the Python side, {{setJobGroup}} isn't thread-local, it's global. Here is a tight reproducer: {noformat} import concurrent.futures import threading import time executor = concurrent.futures.ThreadPoolExecutor() latch_1 = threading.Event() latch_2 = threading.Event() def wait(x): time.sleep(x) return x def multiple_job_groups(): sc.setJobGroup('imajobgroup', 'helloitme') groups = [] groups.append(get_job_group()) sc.parallelize([1, 1]).map(wait).collect() latch_2.set() latch_1.wait() groups.append(get_job_group()) sc.parallelize([1, 1]).map(wait).collect() groups.append(get_job_group()) return groups def another_job_group(): latch_2.wait() sc.setJobGroup('another', 'itnotme') sc.parallelize([1, 1]).map(wait).collect() latch_1.set() future_1 = executor.submit(multiple_job_groups) future_2 = executor.submit(another_job_group) future_1.result() {noformat} The result is that {{another_job_group}} modifies the local property in between the first and second executions of {{multiple_job_groups}}'s jobs, and we get this result: {noformat} ['imajobgroup', 'another', 'another'] {noformat} I think I can "solve" this by wrapping {{SparkContext}} with a lock (to sequence the execution of {{setJobGroup}} and something in py4j that will release the lock during JVM execution, which feels Very Dangerous. Would greatly appreciate it if we could do something to really solve this inside pyspark, but will attempt the Dangerous on my side for now. > pyspark setJobGroup doesn't match java threads > -- > > Key: SPARK-22340 > URL: https://issues.apache.org/jira/browse/SPARK-22340 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Leif Walsh > > With pyspark, {{sc.setJobGroup}}'s documentation says > {quote} > Assigns a group ID to all the jobs started by this thread until the group ID > is set to a different value or cleared. > {quote} > However, this doesn't appear to be associated with Python threads, only with > Java threads. As such, a Python thread which calls this and then submits > multiple jobs doesn't necessarily get its jobs associated with any particular > spark job group. For example: > {code} > def run_jobs(): > sc.setJobGroup('hello', 'hello jobs') > x = sc.range(100).sum() > y = sc.range(1000).sum() > return x, y > import concurrent.futures > with concurrent.futures.ThreadPoolExecutor() as executor: > future = executor.submit(run_jobs) > sc.cancelJobGroup('hello') > future.result() > {code} > In this example, depending how the action calls on the Python side are > allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be > assigned the job group {{hello}}. > First, we should clarify the docs if this truly is the case. > Second, it would be really helpful if we could make the job group assignment > reliable for a Python thread, though I’m not sure the best way to do this. > As it stands, job groups are pretty useless from the pyspark side, if we > can't rely on this fact. > My only idea so far is to mimic the TLS behavior on the Python side and then > patch every point where job submission may take place to pass that in, but > this feels pretty brittle. In my experience with py4j, controlling threading > there is a challenge. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22345) Sort-merge join generates incorrect code for CodegenFallback filter conditions
[ https://issues.apache.org/jira/browse/SPARK-22345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217772#comment-16217772 ] Apache Spark commented on SPARK-22345: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/19568 > Sort-merge join generates incorrect code for CodegenFallback filter conditions > -- > > Key: SPARK-22345 > URL: https://issues.apache.org/jira/browse/SPARK-22345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Ryan Blue > > I have a job that is producing incorrect results from a sort-merge join with > a filter on an expression with a Hive UDF. The results are correct when > codgen is turned off. > I tracked the problem to the evaluation of the Hive UDF, here: > {code:lang=java} > Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1); > {code} > The UDF references columns from both left and right, so this was clearly > generated incorrectly. I think it is that the expression used for Hive UDFs > is a CodegenFallback, which uses eval instead of local variables. The > non-codegen implementations pass a joined row into eval and creating a joined > row in codegen fixes the problem. I'll post a PR in a minute. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22345) Sort-merge join generates incorrect code for CodegenFallback filter conditions
[ https://issues.apache.org/jira/browse/SPARK-22345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22345: Assignee: Apache Spark > Sort-merge join generates incorrect code for CodegenFallback filter conditions > -- > > Key: SPARK-22345 > URL: https://issues.apache.org/jira/browse/SPARK-22345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Ryan Blue >Assignee: Apache Spark > > I have a job that is producing incorrect results from a sort-merge join with > a filter on an expression with a Hive UDF. The results are correct when > codgen is turned off. > I tracked the problem to the evaluation of the Hive UDF, here: > {code:lang=java} > Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1); > {code} > The UDF references columns from both left and right, so this was clearly > generated incorrectly. I think it is that the expression used for Hive UDFs > is a CodegenFallback, which uses eval instead of local variables. The > non-codegen implementations pass a joined row into eval and creating a joined > row in codegen fixes the problem. I'll post a PR in a minute. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22345) Sort-merge join generates incorrect code for CodegenFallback filter conditions
[ https://issues.apache.org/jira/browse/SPARK-22345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22345: Assignee: (was: Apache Spark) > Sort-merge join generates incorrect code for CodegenFallback filter conditions > -- > > Key: SPARK-22345 > URL: https://issues.apache.org/jira/browse/SPARK-22345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Ryan Blue > > I have a job that is producing incorrect results from a sort-merge join with > a filter on an expression with a Hive UDF. The results are correct when > codgen is turned off. > I tracked the problem to the evaluation of the Hive UDF, here: > {code:lang=java} > Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1); > {code} > The UDF references columns from both left and right, so this was clearly > generated incorrectly. I think it is that the expression used for Hive UDFs > is a CodegenFallback, which uses eval instead of local variables. The > non-codegen implementations pass a joined row into eval and creating a joined > row in codegen fixes the problem. I'll post a PR in a minute. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22345) Sort-merge join generates incorrect code for CodegenFallback filter conditions
[ https://issues.apache.org/jira/browse/SPARK-22345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-22345: -- Description: I have a job that is producing incorrect results from a sort-merge join with a filter on an expression with a Hive UDF. The results are correct when codgen is turned off. I tracked the problem to the evaluation of the Hive UDF, here: {code:lang=java} Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1); {code} The UDF references columns from both left and right, so this was clearly generated incorrectly. I think it is that the expression used for Hive UDFs is a CodegenFallback, which uses eval instead of local variables. The non-codegen implementations pass a joined row into eval and creating a joined row in codegen fixes the problem. I'll post a PR in a minute. was: I have a job that is producing incorrect results from a sort-merge join with a filter on an expression with a Hive UDF. The results are correct when codgen is turned off. I tracked the problem to the evaluation of the Hive UDF, here: ``` Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1); ``` The UDF references columns from both left and right, so this was clearly generated incorrectly. I think it is that the expression used for Hive UDFs is a CodegenFallback, which uses eval instead of local variables. The non-codegen implementations pass a joined row into eval and creating a joined row in codegen fixes the problem. I'll post a PR in a minute. > Sort-merge join generates incorrect code for CodegenFallback filter conditions > -- > > Key: SPARK-22345 > URL: https://issues.apache.org/jira/browse/SPARK-22345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Ryan Blue > > I have a job that is producing incorrect results from a sort-merge join with > a filter on an expression with a Hive UDF. The results are correct when > codgen is turned off. > I tracked the problem to the evaluation of the Hive UDF, here: > {code:lang=java} > Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1); > {code} > The UDF references columns from both left and right, so this was clearly > generated incorrectly. I think it is that the expression used for Hive UDFs > is a CodegenFallback, which uses eval instead of local variables. The > non-codegen implementations pass a joined row into eval and creating a joined > row in codegen fixes the problem. I'll post a PR in a minute. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22316) Cannot Select ReducedAggregator Column
[ https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217666#comment-16217666 ] Russell Spitzer commented on SPARK-22316: - [~hvanhovell] This was the ticket I told you about :) > Cannot Select ReducedAggregator Column > -- > > Key: SPARK-22316 > URL: https://issues.apache.org/jira/browse/SPARK-22316 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Russell Spitzer >Priority: Minor > > Given a dataset which has been run through reduceGroups like this > {code} > case class Person(name: String, age: Int) > case class Customer(id: Int, person: Person) > val ds = spark.createDataset(Seq(Customer(1,Person("russ", 85 > val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x ) > {code} > We end up with a Dataset with the schema > {code} > org.apache.spark.sql.types.StructType = > StructType( > StructField(value,IntegerType,false), > StructField(ReduceAggregator(Customer), > StructType(StructField(id,IntegerType,false), > StructField(person, > StructType(StructField(name,StringType,true), > StructField(age,IntegerType,false)) >,true)) > ,true)) > {code} > The column names are > {code} > Array(value, ReduceAggregator(Customer)) > {code} > But you cannot select the "ReduceAggregatorColumn" > {code} > grouped.select(grouped.columns(1)) > org.apache.spark.sql.AnalysisException: cannot resolve > '`ReduceAggregator(Customer)`' given input columns: [value, > ReduceAggregator(Customer)];; > 'Project ['ReduceAggregator(Customer)] > +- Aggregate [value#338], [value#338, > reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, > Some(newInstance(class Customer)), Some(class Customer), > Some(StructType(StructField(id,IntegerType,false), > StructField(person,StructType(StructField(name,StringType,true), > StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 > AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) > null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, > true]._2)).id AS id#195, person, if > (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, > true]._2)).person)) null else named_struct(name, staticinvoke(class > org.apache.spark.unsafe.types.UTF8String, StringType, fromString, > assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, > true]._2)).person).name, true), age, > assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, > true]._2)).person).age) AS person#196) AS _2#341, newInstance(class > scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS > id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, > true])).person)) null else named_struct(name, staticinvoke(class > org.apache.spark.unsafe.types.UTF8String, StringType, fromString, > assertnotnull(assertnotnull(assertnotnull(input[0, Customer, > true])).person).name, true), age, > assertnotnull(assertnotnull(assertnotnull(input[0, Customer, > true])).person).age) AS person#196, StructField(id,IntegerType,false), > StructField(person,StructType(StructField(name,StringType,true), > StructField(age,IntegerType,false)),true), true, 0, 0) AS > ReduceAggregator(Customer)#346] >+- AppendColumns , class Customer, > [StructField(id,IntegerType,false), > StructField(person,StructType(StructField(name,StringType,true), > StructField(age,IntegerType,false)),true)], newInstance(class Customer), > [input[0, int, false] AS value#338] > +- LocalRelation [id#197, person#198] > {code} > You can work around this by using "toDF" to rename the column > {code} > scala> grouped.toDF("key", "reduced").select("reduced") > res55: org.apache.spark.sql.DataFrame = [reduced: struct struct>] > {code} > I think that all invocations of > {code} > ds.select(ds.columns(i)) > {code} > For all valid i < columns size should work. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22345) Sort-merge join generates incorrect code for CodegenFallback filter conditions
Ryan Blue created SPARK-22345: - Summary: Sort-merge join generates incorrect code for CodegenFallback filter conditions Key: SPARK-22345 URL: https://issues.apache.org/jira/browse/SPARK-22345 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: Ryan Blue I have a job that is producing incorrect results from a sort-merge join with a filter on an expression with a Hive UDF. The results are correct when codgen is turned off. I tracked the problem to the evaluation of the Hive UDF, here: ``` Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1); ``` The UDF references columns from both left and right, so this was clearly generated incorrectly. I think it is that the expression used for Hive UDFs is a CodegenFallback, which uses eval instead of local variables. The non-codegen implementations pass a joined row into eval and creating a joined row in codegen fixes the problem. I'll post a PR in a minute. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217535#comment-16217535 ] Leif Walsh commented on SPARK-22340: This is less spooky than I initially thought, I will explain later tonight. > pyspark setJobGroup doesn't match java threads > -- > > Key: SPARK-22340 > URL: https://issues.apache.org/jira/browse/SPARK-22340 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Leif Walsh > > With pyspark, {{sc.setJobGroup}}'s documentation says > {quote} > Assigns a group ID to all the jobs started by this thread until the group ID > is set to a different value or cleared. > {quote} > However, this doesn't appear to be associated with Python threads, only with > Java threads. As such, a Python thread which calls this and then submits > multiple jobs doesn't necessarily get its jobs associated with any particular > spark job group. For example: > {code} > def run_jobs(): > sc.setJobGroup('hello', 'hello jobs') > x = sc.range(100).sum() > y = sc.range(1000).sum() > return x, y > import concurrent.futures > with concurrent.futures.ThreadPoolExecutor() as executor: > future = executor.submit(run_jobs) > sc.cancelJobGroup('hello') > future.result() > {code} > In this example, depending how the action calls on the Python side are > allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be > assigned the job group {{hello}}. > First, we should clarify the docs if this truly is the case. > Second, it would be really helpful if we could make the job group assignment > reliable for a Python thread, though I’m not sure the best way to do this. > As it stands, job groups are pretty useless from the pyspark side, if we > can't rely on this fact. > My only idea so far is to mimic the TLS behavior on the Python side and then > patch every point where job submission may take place to pass that in, but > this feels pretty brittle. In my experience with py4j, controlling threading > there is a challenge. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217532#comment-16217532 ] Dongjoon Hyun commented on SPARK-22335: --- To be clear, I have no objection on your idea here now. > Union for DataSet uses column order instead of types for union > -- > > Key: SPARK-22335 > URL: https://issues.apache.org/jira/browse/SPARK-22335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Carlos Bribiescas > > I see union uses column order for a DF. This to me is "fine" since they > aren't typed. > However, for a dataset which is supposed to be strongly typed it is actually > giving the wrong result. If you try to access the members by name, it will > use the order. Heres is a reproducible case. 2.2.0 > {code:java} > case class AB(a : String, b : String) > val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") > val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") > > abDf.union(baDf).show() // as linked ticket states, its "Not a problem" > > val abDs = abDf.as[AB] > val baDs = baDf.as[AB] > > abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] > should be correctly mapped by type, not by column order > > abDs.union(baDs).map(_.a).show() // This gives wrong result since a > Dataset[AB] should be correctly mapped by type, not by column order >abDs.union(baDs).rdd.take(2) // This also gives wrong result > baDs.map(_.a).show() // However, this gives the correct result, even though > columns were out of order. > abDs.map(_.a).show() // This is correct too > baDs.select("a","b").as[AB].union(abDs).show() // This is the same > workaround for linked issue, slightly modified. However this seems wrong > since its supposed to be strongly typed > > baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct > result, which is logically inconsistent behavior > abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives > correct result > {code} > So its inconsistent and a bug IMO. And I'm not sure that the suggested work > around is really fair, since I'm supposed to be getting of type `AB`. More > importantly I think the issue is bigger when you consider that it happens > even if you read from parquet (as you would expect). And that its > inconsistent when going to/from rdd. > I imagine its just lazily converting to typed DS instead of initially. So > either that typing could be prioritized to happen before the union or > unioning of DF could be done with column order taken into account. Again, > this is speculation.. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217524#comment-16217524 ] Dongjoon Hyun commented on SPARK-22335: --- Hm. I see your point. What I meant was the following in DataFrame. What you're asking is 'name' matching in DataFrame; "a" to "a" and "b" to "b". {code} scala> val abDf = sc.parallelize(List(("aThing",0))).toDF("a", "b") scala> val baDf = sc.parallelize(List((0,"aThing"))).toDF("b", "a") scala> abDf.union(baDf).show +--+--+ | a| b| +--+--+ |aThing| 0| | 0|aThing| +--+--+ {code} > Union for DataSet uses column order instead of types for union > -- > > Key: SPARK-22335 > URL: https://issues.apache.org/jira/browse/SPARK-22335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Carlos Bribiescas > > I see union uses column order for a DF. This to me is "fine" since they > aren't typed. > However, for a dataset which is supposed to be strongly typed it is actually > giving the wrong result. If you try to access the members by name, it will > use the order. Heres is a reproducible case. 2.2.0 > {code:java} > case class AB(a : String, b : String) > val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") > val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") > > abDf.union(baDf).show() // as linked ticket states, its "Not a problem" > > val abDs = abDf.as[AB] > val baDs = baDf.as[AB] > > abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] > should be correctly mapped by type, not by column order > > abDs.union(baDs).map(_.a).show() // This gives wrong result since a > Dataset[AB] should be correctly mapped by type, not by column order >abDs.union(baDs).rdd.take(2) // This also gives wrong result > baDs.map(_.a).show() // However, this gives the correct result, even though > columns were out of order. > abDs.map(_.a).show() // This is correct too > baDs.select("a","b").as[AB].union(abDs).show() // This is the same > workaround for linked issue, slightly modified. However this seems wrong > since its supposed to be strongly typed > > baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct > result, which is logically inconsistent behavior > abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives > correct result > {code} > So its inconsistent and a bug IMO. And I'm not sure that the suggested work > around is really fair, since I'm supposed to be getting of type `AB`. More > importantly I think the issue is bigger when you consider that it happens > even if you read from parquet (as you would expect). And that its > inconsistent when going to/from rdd. > I imagine its just lazily converting to typed DS instead of initially. So > either that typing could be prioritized to happen before the union or > unioning of DF could be done with column order taken into account. Again, > this is speculation.. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217497#comment-16217497 ] Carlos Bribiescas commented on SPARK-22335: --- I'm not sure I understand what you're asking. I do agree that DS should be consistent with DF when possible, but in this case the more specific functionality (typing) doesn't apply to DF. Again, sorry if I didn't answer your question I didn't quite get what you were asking. Can you clarify? Here is another example that maybe helps. {code:java} case class AB(a : String, b : Int) val abDs = sc.parallelize(List(("aThing",0))).toDF("a", "b").as[AB] val baDs = sc.parallelize(List((0,"aThing"))).toDF("b", "a").as[AB] abDs.show() // works baDs.show() // works abDs.union(baDs).show() // Real error to do with types abDs.rdd.union(baDs.rdd).toDF().as[AB].show() // Works which is inconsistent with last statement IMO {code} > Union for DataSet uses column order instead of types for union > -- > > Key: SPARK-22335 > URL: https://issues.apache.org/jira/browse/SPARK-22335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Carlos Bribiescas > > I see union uses column order for a DF. This to me is "fine" since they > aren't typed. > However, for a dataset which is supposed to be strongly typed it is actually > giving the wrong result. If you try to access the members by name, it will > use the order. Heres is a reproducible case. 2.2.0 > {code:java} > case class AB(a : String, b : String) > val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") > val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") > > abDf.union(baDf).show() // as linked ticket states, its "Not a problem" > > val abDs = abDf.as[AB] > val baDs = baDf.as[AB] > > abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] > should be correctly mapped by type, not by column order > > abDs.union(baDs).map(_.a).show() // This gives wrong result since a > Dataset[AB] should be correctly mapped by type, not by column order >abDs.union(baDs).rdd.take(2) // This also gives wrong result > baDs.map(_.a).show() // However, this gives the correct result, even though > columns were out of order. > abDs.map(_.a).show() // This is correct too > baDs.select("a","b").as[AB].union(abDs).show() // This is the same > workaround for linked issue, slightly modified. However this seems wrong > since its supposed to be strongly typed > > baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct > result, which is logically inconsistent behavior > abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives > correct result > {code} > So its inconsistent and a bug IMO. And I'm not sure that the suggested work > around is really fair, since I'm supposed to be getting of type `AB`. More > importantly I think the issue is bigger when you consider that it happens > even if you read from parquet (as you would expect). And that its > inconsistent when going to/from rdd. > I imagine its just lazily converting to typed DS instead of initially. So > either that typing could be prioritized to happen before the union or > unioning of DF could be done with column order taken into account. Again, > this is speculation.. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22291: Assignee: Apache Spark > Postgresql UUID[] to Cassandra: Conversion Error > > > Key: SPARK-22291 > URL: https://issues.apache.org/jira/browse/SPARK-22291 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, > Cassandra 3 >Reporter: Fabio J. Walter >Assignee: Apache Spark > Labels: patch, postgresql, sql > Attachments: > org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png > > > My job reads data from a PostgreSQL table that contains columns of user_ids > uuid[] type, so that I'm getting the error above when I'm trying to save data > on Cassandra. > However, the creation of this same table on Cassandra works fine! user_ids > list. > I can't change the type on the source table, because I'm reading data from a > legacy system. > I've been looking at point printed on log, on class > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala > Stacktrace on Spark: > {noformat} > Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to > [Ljava.lang.String; > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at >
[jira] [Assigned] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22291: Assignee: (was: Apache Spark) > Postgresql UUID[] to Cassandra: Conversion Error > > > Key: SPARK-22291 > URL: https://issues.apache.org/jira/browse/SPARK-22291 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, > Cassandra 3 >Reporter: Fabio J. Walter > Labels: patch, postgresql, sql > Attachments: > org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png > > > My job reads data from a PostgreSQL table that contains columns of user_ids > uuid[] type, so that I'm getting the error above when I'm trying to save data > on Cassandra. > However, the creation of this same table on Cassandra works fine! user_ids > list. > I can't change the type on the source table, because I'm reading data from a > legacy system. > I've been looking at point printed on log, on class > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala > Stacktrace on Spark: > {noformat} > Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to > [Ljava.lang.String; > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at >
[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217462#comment-16217462 ] Apache Spark commented on SPARK-22291: -- User 'jmchung' has created a pull request for this issue: https://github.com/apache/spark/pull/19567 > Postgresql UUID[] to Cassandra: Conversion Error > > > Key: SPARK-22291 > URL: https://issues.apache.org/jira/browse/SPARK-22291 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, > Cassandra 3 >Reporter: Fabio J. Walter > Labels: patch, postgresql, sql > Attachments: > org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png > > > My job reads data from a PostgreSQL table that contains columns of user_ids > uuid[] type, so that I'm getting the error above when I'm trying to save data > on Cassandra. > However, the creation of this same table on Cassandra works fine! user_ids > list. > I can't change the type on the source table, because I'm reading data from a > legacy system. > I've been looking at point printed on log, on class > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala > Stacktrace on Spark: > {noformat} > Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to > [Ljava.lang.String; > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at
[jira] [Updated] (SPARK-22324) Upgrade Arrow to version 0.8.0
[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-22324: - Description: Arrow version 0.8.0 is slated for release in early November, but I'd like to start discussing to help get all the work that's being done synced up. Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test envs will need to be upgraded as well that will take a fair amount of work and planning. One topic I'd like to discuss is if pyarrow should be an installation requirement for pyspark, i.e. when a user pip installs pyspark, it will also install pyarrow. If not, then is there a minimum version that needs to be supported? We currently have 0.4.1 installed on Jenkins. There are a number of improvements and cleanups in the current code that can happen depending on what we decide (I'll link them all here later, but off the top of my head): * Decimal bug fix and improved support * Improved internal casting between pyarrow and pandas (can clean up some workarounds), this will also verify data bounds if the user specifies a type and data overflows. see https://github.com/apache/spark/pull/19459#discussion_r146421804 * Better type checking when converting Spark types to Arrow * Timestamp conversion to microseconds (for Spark internal format) * Full support for using validity mask with 'object' types https://github.com/apache/spark/pull/18664#discussion_r146567335 was: Arrow version 0.8.0 is slated for release in early November, but I'd like to start discussing to help get all the work that's being done synced up. Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test envs will need to be upgraded as well that will take a fair amount of work and planning. One topic I'd like to discuss is if pyarrow should be an installation requirement for pyspark, i.e. when a user pip installs pyspark, it will also install pyarrow. If not, then is there a minimum version that needs to be supported? We currently have 0.4.1 installed on Jenkins. There are a number of improvements and cleanups in the current code that can happen depending on what we decide (I'll link them all here later, but off the top of my head): * Decimal bug fix and improved support * Improved internal casting between pyarrow and pandas (can clean up some workarounds), this will also verify data bounds if the user specifies a type and data overflows. see https://github.com/apache/spark/pull/19459#discussion_r146421804 * Better type checking when converting Spark types to Arrow * Timestamp conversion to microseconds (for Spark internal format) > Upgrade Arrow to version 0.8.0 > -- > > Key: SPARK-22324 > URL: https://issues.apache.org/jira/browse/SPARK-22324 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > Arrow version 0.8.0 is slated for release in early November, but I'd like to > start discussing to help get all the work that's being done synced up. > Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test > envs will need to be upgraded as well that will take a fair amount of work > and planning. > One topic I'd like to discuss is if pyarrow should be an installation > requirement for pyspark, i.e. when a user pip installs pyspark, it will also > install pyarrow. If not, then is there a minimum version that needs to be > supported? We currently have 0.4.1 installed on Jenkins. > There are a number of improvements and cleanups in the current code that can > happen depending on what we decide (I'll link them all here later, but off > the top of my head): > * Decimal bug fix and improved support > * Improved internal casting between pyarrow and pandas (can clean up some > workarounds), this will also verify data bounds if the user specifies a type > and data overflows. see > https://github.com/apache/spark/pull/19459#discussion_r146421804 > * Better type checking when converting Spark types to Arrow > * Timestamp conversion to microseconds (for Spark internal format) > * Full support for using validity mask with 'object' types > https://github.com/apache/spark/pull/18664#discussion_r146567335 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21043) Add unionByName API to Dataset
[ https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217415#comment-16217415 ] Reynold Xin commented on SPARK-21043: - Because some people expect union by position too. > Add unionByName API to Dataset > -- > > Key: SPARK-21043 > URL: https://issues.apache.org/jira/browse/SPARK-21043 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Takeshi Yamamuro > Fix For: 2.3.0 > > > It would be useful to add unionByName which resolves columns by name, in > addition to the existing union (which resolves by position). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22240) S3 CSV number of partitions incorrectly computed
[ https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217398#comment-16217398 ] Steve Loughran commented on SPARK-22240: no, spark 2.2 doesn't fix this. I have to explicitly define the schema as the list of headers -> String and then reader setup time drops to 6s. {code} 2017-10-24 19:19:31,593 [ScalaTest-main-running-S3ACommitBulkDataSuite] DEBUG fs.FsUrlStreamHandlerFactory (FsUrlStreamHandlerFactory.java:createURLStreamHandler(107)) - Unknown protocol jar, delegating to default implementation 2017-10-24 19:19:36,402 [ScalaTest-main-running-S3ACommitBulkDataSuite] INFO commit.S3ACommitBulkDataSuite (Logging.scala:logInfo(54)) - Duration of set up initial .csv load = 6,195,721,969 nS {code} I'm not sure if this is related to the original bug, though it is potentially part of the issue. As what I'm seeing is that either schema inference always takes place, or taking the first line of a .gz file is enough to force reading the entire .gz source file > S3 CSV number of partitions incorrectly computed > > > Key: SPARK-22240 > URL: https://issues.apache.org/jira/browse/SPARK-22240 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0 >Reporter: Arthur Baudry > > Reading CSV out of S3 using S3A protocol does not compute the number of > partitions correctly in Spark 2.2.0. > With Spark 2.2.0 I get only partition when loading a 14GB file > {code:java} > scala> val input = spark.read.format("csv").option("header", > "true").option("delimiter", "|").option("multiLine", > "true").load("s3a://") > input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: > string ... 36 more fields] > scala> input.rdd.getNumPartitions > res2: Int = 1 > {code} > While in Spark 2.0.2 I had: > {code:java} > scala> val input = spark.read.format("csv").option("header", > "true").option("delimiter", "|").option("multiLine", > "true").load("s3a://") > input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: > string ... 36 more fields] > scala> input.rdd.getNumPartitions > res2: Int = 115 > {code} > This introduces obvious performance issues in Spark 2.2.0. Maybe there is a > property that should be set to have the number of partitions computed > correctly. > I'm aware that the .option("multiline","true") is not supported in Spark > 2.0.2, it's not relevant here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12359) Add showString() to DataSet API.
[ https://issues.apache.org/jira/browse/SPARK-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217366#comment-16217366 ] Alexandre Dupriez edited comment on SPARK-12359 at 10/24/17 6:08 PM: - Looking at [this source|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala] of the {{Dataset}} it seems that {{showString}} is still package-private. What is the reason not to expose a similar method to the public? _Note_: the reason I would like to have such a method is to use the pretty-formatted output of the {{show}} method without being forced to inject it to the standard output. _Edit_: ok, found the PR for this change which was closed: https://github.com/apache/spark/pull/10130 was (Author: hangleton): Looking at [this source|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala] of the {{Dataset}} it seems that {{showString}} is still package-private. What is the reason not to expose a similar method to the public? _Note_: the reason I would like to have such a method is to use the pretty-formatted output of the {{show}} method without being forced to inject it to the standard output. > Add showString() to DataSet API. > > > Key: SPARK-12359 > URL: https://issues.apache.org/jira/browse/SPARK-12359 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dilip Biswal >Priority: Minor > > JIRA 12105 exposed showString and its variants as public API. This adds the > two APIs into DataSet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12359) Add showString() to DataSet API.
[ https://issues.apache.org/jira/browse/SPARK-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217366#comment-16217366 ] Alexandre Dupriez edited comment on SPARK-12359 at 10/24/17 6:01 PM: - Looking at [this source|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala] of the {{Dataset}} it seems that {{showString}} is still package-private. What is the reason not to expose a similar method to the public? _Note_: the reason I would like to have such a method is to use the pretty-formatted output of the {{show}} method without being forced to inject it to the standard output. was (Author: hangleton): Looking at [this source|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala] of the {{Dataset}} it seems that {{showString}} is still package-private. What is the reason not to expose a similar method to the public? > Add showString() to DataSet API. > > > Key: SPARK-12359 > URL: https://issues.apache.org/jira/browse/SPARK-12359 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dilip Biswal >Priority: Minor > > JIRA 12105 exposed showString and its variants as public API. This adds the > two APIs into DataSet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
[ https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22341: Assignee: Apache Spark > [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off > -- > > Key: SPARK-22341 > URL: https://issues.apache.org/jira/browse/SPARK-22341 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński >Assignee: Apache Spark > > I'm trying to run 2.3.0 (from master) on my yarn cluster. > The result is: > {code} > Exception in thread "main" org.apache.hadoop.security.AccessControlException: > Permission denied: user=yarn, access=EXECUTE, > inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} > I think the problem exists, because I'm not using yarn impersonation which > mean that all jobs on cluster are runned from user yarn. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
[ https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22341: Assignee: (was: Apache Spark) > [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off > -- > > Key: SPARK-22341 > URL: https://issues.apache.org/jira/browse/SPARK-22341 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński > > I'm trying to run 2.3.0 (from master) on my yarn cluster. > The result is: > {code} > Exception in thread "main" org.apache.hadoop.security.AccessControlException: > Permission denied: user=yarn, access=EXECUTE, > inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} > I think the problem exists, because I'm not using yarn impersonation which > mean that all jobs on cluster are runned from user yarn. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12359) Add showString() to DataSet API.
[ https://issues.apache.org/jira/browse/SPARK-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217366#comment-16217366 ] Alexandre Dupriez commented on SPARK-12359: --- Looking at [this source|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala] of the {{Dataset}} it seems that {{showString}} is still package-private. What is the reason not to expose a similar method to the public? > Add showString() to DataSet API. > > > Key: SPARK-12359 > URL: https://issues.apache.org/jira/browse/SPARK-12359 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dilip Biswal >Priority: Minor > > JIRA 12105 exposed showString and its variants as public API. This adds the > two APIs into DataSet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
[ https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217368#comment-16217368 ] Apache Spark commented on SPARK-22341: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19566 > [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off > -- > > Key: SPARK-22341 > URL: https://issues.apache.org/jira/browse/SPARK-22341 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński > > I'm trying to run 2.3.0 (from master) on my yarn cluster. > The result is: > {code} > Exception in thread "main" org.apache.hadoop.security.AccessControlException: > Permission denied: user=yarn, access=EXECUTE, > inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} > I think the problem exists, because I'm not using yarn impersonation which > mean that all jobs on cluster are runned from user yarn. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217357#comment-16217357 ] Dongjoon Hyun commented on SPARK-22335: --- [~CBribiescas]. Is this issue about types? For me, it looks like about *names*, not *types*. DataFrame also has names. In that point of view, Dataset had better be consistent with DataFrames. > Union for DataSet uses column order instead of types for union > -- > > Key: SPARK-22335 > URL: https://issues.apache.org/jira/browse/SPARK-22335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Carlos Bribiescas > > I see union uses column order for a DF. This to me is "fine" since they > aren't typed. > However, for a dataset which is supposed to be strongly typed it is actually > giving the wrong result. If you try to access the members by name, it will > use the order. Heres is a reproducible case. 2.2.0 > {code:java} > case class AB(a : String, b : String) > val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") > val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") > > abDf.union(baDf).show() // as linked ticket states, its "Not a problem" > > val abDs = abDf.as[AB] > val baDs = baDf.as[AB] > > abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] > should be correctly mapped by type, not by column order > > abDs.union(baDs).map(_.a).show() // This gives wrong result since a > Dataset[AB] should be correctly mapped by type, not by column order >abDs.union(baDs).rdd.take(2) // This also gives wrong result > baDs.map(_.a).show() // However, this gives the correct result, even though > columns were out of order. > abDs.map(_.a).show() // This is correct too > baDs.select("a","b").as[AB].union(abDs).show() // This is the same > workaround for linked issue, slightly modified. However this seems wrong > since its supposed to be strongly typed > > baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct > result, which is logically inconsistent behavior > abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives > correct result > {code} > So its inconsistent and a bug IMO. And I'm not sure that the suggested work > around is really fair, since I'm supposed to be getting of type `AB`. More > importantly I think the issue is bigger when you consider that it happens > even if you read from parquet (as you would expect). And that its > inconsistent when going to/from rdd. > I imagine its just lazily converting to typed DS instead of initially. So > either that typing could be prioritized to happen before the union or > unioning of DF could be done with column order taken into account. Again, > this is speculation.. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217354#comment-16217354 ] Dan Blanchard commented on SPARK-16367: --- Right, sorry. I just meant to point out that the proposal you have up there has someone installing the {{wheelhouse}} package, which is unnecessary. > Wheelhouse Support for PySpark > -- > > Key: SPARK-16367 > URL: https://issues.apache.org/jira/browse/SPARK-16367 > Project: Spark > Issue Type: New Feature > Components: Deploy, PySpark >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Semet > Labels: newbie, python, python-wheel, wheelhouse > Original Estimate: 168h > Remaining Estimate: 168h > > *Rational* > Is it recommended, in order to deploying Scala packages written in Scala, to > build big fat jar files. This allows to have all dependencies on one package > so the only "cost" is copy time to deploy this file on every Spark Node. > On the other hand, Python deployment is more difficult once you want to use > external packages, and you don't really want to mess with the IT to deploy > the packages on the virtualenv of each nodes. > This ticket proposes to allow users the ability to deploy their job as > "Wheels" packages. The Python community is strongly advocating to promote > this way of packaging and distributing Python application as a "standard way > of deploying Python App". In other word, this is the "Pythonic Way of > Deployment". > *Previous approaches* > I based the current proposal over the two following bugs related to this > point: > - SPARK-6764 ("Wheel support for PySpark") > - SPARK-13587("Support virtualenv in PySpark") > First part of my proposal was to merge, in order to support wheels install > and virtualenv creation > *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* > In Python, the packaging standard is now the "wheels" file format, which goes > further that good old ".egg" files. With a wheel file (".whl"), the package > is already prepared for a given architecture. You can have several wheels for > a given package version, each specific to an architecture, or environment. > For example, look at https://pypi.python.org/pypi/numpy all the different > version of Wheel available. > The {{pip}} tools knows how to select the right wheel file matching the > current system, and how to install this package in a light speed (without > compilation). Said otherwise, package that requires compilation of a C > module, for instance "numpy", does *not* compile anything when installing > from wheel file. > {{pypi.pypthon.org}} already provided wheels for major python version. It the > wheel is not available, pip will compile it from source anyway. Mirroring of > Pypi is possible through projects such as http://doc.devpi.net/latest/ > (untested) or the Pypi mirror support on Artifactory (tested personnally). > {{pip}} also provides the ability to generate easily all wheels of all > packages used for a given project which is inside a "virtualenv". This is > called "wheelhouse". You can even don't mess with this compilation and > retrieve it directly from pypi.python.org. > *Use Case 1: no internet connectivity* > Here my first proposal for a deployment workflow, in the case where the Spark > cluster does not have any internet connectivity or access to a Pypi mirror. > In this case the simplest way to deploy a project with several dependencies > is to build and then send to complete "wheelhouse": > - you are writing a PySpark script that increase in term of size and > dependencies. Deploying on Spark for example requires to build numpy or > Theano and other dependencies > - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script > into a standard Python package: > -- write a {{requirements.txt}}. I recommend to specify all package version. > You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the > requirements.txt > {code} > astroid==1.4.6 # via pylint > autopep8==1.2.4 > click==6.6 # via pip-tools > colorama==0.3.7 # via pylint > enum34==1.1.6 # via hypothesis > findspark==1.0.0 # via spark-testing-base > first==2.0.1 # via pip-tools > hypothesis==3.4.0 # via spark-testing-base > lazy-object-proxy==1.2.2 # via astroid > linecache2==1.0.0 # via traceback2 > pbr==1.10.0 > pep8==1.7.0 # via autopep8 > pip-tools==1.6.5 > py==1.4.31 # via pytest > pyflakes==1.2.3 > pylint==1.5.6 > pytest==2.9.2 # via spark-testing-base > six==1.10.0 # via astroid, pip-tools, pylint, unittest2 > spark-testing-base==0.0.7.post2 > traceback2==1.4.0 # via unittest2 > unittest2==1.1.0 # via spark-testing-base > wheel==0.29.0 > wrapt==1.10.8 # via astroid > {code} > -- write a setup.py with some entry points or package. Use >
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217333#comment-16217333 ] Ashwin Shankar commented on SPARK-18105: Hi [~davies] [~cloud_fan] We hit the same issue. What is the aforementioned workaround that went into 2.1? Any other workaround? > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Assignee: Davies Liu > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22240) S3 CSV number of partitions incorrectly computed
[ https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217325#comment-16217325 ] Steve Loughran commented on SPARK-22240: I'm doing some testing with master & reading files off S3A, with s3a logging file IO in close() statements, and I'm seeing a full load of the file in the executor, even when schema inference is turned off {code} 2017-10-24 18:03:20,085 [ScalaTest-main-running-S3ACommitBulkDataSuite] INFO codegen.CodeGenerator (Logging.scala:logInfo(54)) - Code generated in 149.680004 ms 2017-10-24 18:03:20,423 [Executor task launch worker for task 0] INFO codegen.CodeGenerator (Logging.scala:logInfo(54)) - Code generated in 9.406645 ms 2017-10-24 18:03:20,434 [Executor task launch worker for task 0] DEBUG s3a.S3AFileSystem (S3AFileSystem.java:open(680)) - Opening 's3a://landsat-pds/scene_list.gz' for reading; input policy = sequential 2017-10-24 18:03:20,435 [Executor task launch worker for task 0] DEBUG s3a.S3AFileSystem (S3AFileSystem.java:innerGetFileStatus(2065)) - Getting path status for s3a://landsat-pds/scene_list.gz (scene_list.gz) 2017-10-24 18:03:21,175 [Executor task launch worker for task 0] DEBUG s3a.S3AFileSystem (S3AFileSystem.java:s3GetFileStatus(2133)) - Found exact file: normal file 2017-10-24 18:03:21,184 [Executor task launch worker for task 0] INFO compress.CodecPool (CodecPool.java:getDecompressor(184)) - Got brand-new decompressor [.gz] 2017-10-24 18:03:21,188 [Executor task launch worker for task 0] DEBUG s3a.S3AInputStream (S3AInputStream.java:reopen(174)) - reopen(s3a://landsat-pds/scene_list.gz) for read from new offset range[0-45603307], length=65536, streamPosition=0, nextReadPosition=0, policy=sequential 2017-10-24 18:03:53,447 [Executor task launch worker for task 0] DEBUG s3a.S3AInputStream (S3AInputStream.java:closeStream(490)) - Closing stream close() operation: soft 2017-10-24 18:03:53,447 [Executor task launch worker for task 0] DEBUG s3a.S3AInputStream (S3AInputStream.java:closeStream(503)) - Drained stream of 0 bytes 2017-10-24 18:03:53,448 [Executor task launch worker for task 0] DEBUG s3a.S3AInputStream (S3AInputStream.java:closeStream(524)) - Stream s3a://landsat-pds/scene_list.gz closed: close() operation; remaining=0 streamPos=45603307, nextReadPos=45603307, request range 0-45603307 length=45603307 2017-10-24 18:03:53,448 [Executor task launch worker for task 0] DEBUG s3a.S3AInputStream (S3AInputStream.java:close(463)) - Statistics of stream scene_list.gz StreamStatistics{OpenOperations=1, CloseOperations=1, Closed=1, Aborted=0, SeekOperations=0, ReadExceptions=0, ForwardSeekOperations=0, BackwardSeekOperations=0, BytesSkippedOnSeek=0, BytesBackwardsOnSeek=0, BytesRead=45603307, BytesRead excluding skipped=45603307, ReadOperations=5240, ReadFullyOperations=0, ReadsIncomplete=5240, BytesReadInClose=0, BytesDiscardedInAbort=0, InputPolicy=1, InputPolicySetCount=1} {code} that is, handing in a csv.gz triggers a full read., which is "observably slow" to read 400MB of data from across the planet. Load options were: {code} val csvOptions = Map( "header" -> "true", "ignoreLeadingWhiteSpace" -> "true", "ignoreTrailingWhiteSpace" -> "true", "timestampFormat" -> "-MM-dd HH:mm:ss.SSSZZZ", "inferSchema" -> "false", "mode" -> "DROPMALFORMED", ) {code} I'm going to set a schema to make this go away (it'll become a dataframe one transform later anyway), but I don't believe this used to occur. I'll try building against other spark versions to see. > S3 CSV number of partitions incorrectly computed > > > Key: SPARK-22240 > URL: https://issues.apache.org/jira/browse/SPARK-22240 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0 >Reporter: Arthur Baudry > > Reading CSV out of S3 using S3A protocol does not compute the number of > partitions correctly in Spark 2.2.0. > With Spark 2.2.0 I get only partition when loading a 14GB file > {code:java} > scala> val input = spark.read.format("csv").option("header", > "true").option("delimiter", "|").option("multiLine", > "true").load("s3a://") > input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: > string ... 36 more fields] > scala> input.rdd.getNumPartitions > res2: Int = 1 > {code} > While in Spark 2.0.2 I had: > {code:java} > scala> val input = spark.read.format("csv").option("header", > "true").option("delimiter", "|").option("multiLine", > "true").load("s3a://") > input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: > string ... 36 more fields] > scala> input.rdd.getNumPartitions > res2: Int = 115 > {code} > This introduces obvious performance issues
[jira] [Comment Edited] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217322#comment-16217322 ] Semet edited comment on SPARK-16367 at 10/24/17 5:29 PM: - Yes, I don't use it because it is a feature of {{pip}}: {code} pip wheel --wheel-dir wheelhouse . {code} It is described [in here|https://wheel.readthedocs.io/en/stable/]. was (Author: gae...@xeberon.net): Yes, I don't use it because it is a feature of {{pip}}: {{code} pip wheel --wheel-dir wheelhouse . {{code}} It is described [in here|https://wheel.readthedocs.io/en/stable/]. > Wheelhouse Support for PySpark > -- > > Key: SPARK-16367 > URL: https://issues.apache.org/jira/browse/SPARK-16367 > Project: Spark > Issue Type: New Feature > Components: Deploy, PySpark >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Semet > Labels: newbie, python, python-wheel, wheelhouse > Original Estimate: 168h > Remaining Estimate: 168h > > *Rational* > Is it recommended, in order to deploying Scala packages written in Scala, to > build big fat jar files. This allows to have all dependencies on one package > so the only "cost" is copy time to deploy this file on every Spark Node. > On the other hand, Python deployment is more difficult once you want to use > external packages, and you don't really want to mess with the IT to deploy > the packages on the virtualenv of each nodes. > This ticket proposes to allow users the ability to deploy their job as > "Wheels" packages. The Python community is strongly advocating to promote > this way of packaging and distributing Python application as a "standard way > of deploying Python App". In other word, this is the "Pythonic Way of > Deployment". > *Previous approaches* > I based the current proposal over the two following bugs related to this > point: > - SPARK-6764 ("Wheel support for PySpark") > - SPARK-13587("Support virtualenv in PySpark") > First part of my proposal was to merge, in order to support wheels install > and virtualenv creation > *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* > In Python, the packaging standard is now the "wheels" file format, which goes > further that good old ".egg" files. With a wheel file (".whl"), the package > is already prepared for a given architecture. You can have several wheels for > a given package version, each specific to an architecture, or environment. > For example, look at https://pypi.python.org/pypi/numpy all the different > version of Wheel available. > The {{pip}} tools knows how to select the right wheel file matching the > current system, and how to install this package in a light speed (without > compilation). Said otherwise, package that requires compilation of a C > module, for instance "numpy", does *not* compile anything when installing > from wheel file. > {{pypi.pypthon.org}} already provided wheels for major python version. It the > wheel is not available, pip will compile it from source anyway. Mirroring of > Pypi is possible through projects such as http://doc.devpi.net/latest/ > (untested) or the Pypi mirror support on Artifactory (tested personnally). > {{pip}} also provides the ability to generate easily all wheels of all > packages used for a given project which is inside a "virtualenv". This is > called "wheelhouse". You can even don't mess with this compilation and > retrieve it directly from pypi.python.org. > *Use Case 1: no internet connectivity* > Here my first proposal for a deployment workflow, in the case where the Spark > cluster does not have any internet connectivity or access to a Pypi mirror. > In this case the simplest way to deploy a project with several dependencies > is to build and then send to complete "wheelhouse": > - you are writing a PySpark script that increase in term of size and > dependencies. Deploying on Spark for example requires to build numpy or > Theano and other dependencies > - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script > into a standard Python package: > -- write a {{requirements.txt}}. I recommend to specify all package version. > You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the > requirements.txt > {code} > astroid==1.4.6 # via pylint > autopep8==1.2.4 > click==6.6 # via pip-tools > colorama==0.3.7 # via pylint > enum34==1.1.6 # via hypothesis > findspark==1.0.0 # via spark-testing-base > first==2.0.1 # via pip-tools > hypothesis==3.4.0 # via spark-testing-base > lazy-object-proxy==1.2.2 # via astroid > linecache2==1.0.0 # via traceback2 > pbr==1.10.0 > pep8==1.7.0 # via autopep8 > pip-tools==1.6.5 > py==1.4.31 # via pytest > pyflakes==1.2.3 > pylint==1.5.6 > pytest==2.9.2 # via spark-testing-base > six==1.10.0 # via astroid,
[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217322#comment-16217322 ] Semet commented on SPARK-16367: --- Yes, I don't use it because it is a feature of {{pip}}: {{code} pip wheel --wheel-dir wheelhouse . {{code}} It is described [in here|https://wheel.readthedocs.io/en/stable/]. > Wheelhouse Support for PySpark > -- > > Key: SPARK-16367 > URL: https://issues.apache.org/jira/browse/SPARK-16367 > Project: Spark > Issue Type: New Feature > Components: Deploy, PySpark >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Semet > Labels: newbie, python, python-wheel, wheelhouse > Original Estimate: 168h > Remaining Estimate: 168h > > *Rational* > Is it recommended, in order to deploying Scala packages written in Scala, to > build big fat jar files. This allows to have all dependencies on one package > so the only "cost" is copy time to deploy this file on every Spark Node. > On the other hand, Python deployment is more difficult once you want to use > external packages, and you don't really want to mess with the IT to deploy > the packages on the virtualenv of each nodes. > This ticket proposes to allow users the ability to deploy their job as > "Wheels" packages. The Python community is strongly advocating to promote > this way of packaging and distributing Python application as a "standard way > of deploying Python App". In other word, this is the "Pythonic Way of > Deployment". > *Previous approaches* > I based the current proposal over the two following bugs related to this > point: > - SPARK-6764 ("Wheel support for PySpark") > - SPARK-13587("Support virtualenv in PySpark") > First part of my proposal was to merge, in order to support wheels install > and virtualenv creation > *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* > In Python, the packaging standard is now the "wheels" file format, which goes > further that good old ".egg" files. With a wheel file (".whl"), the package > is already prepared for a given architecture. You can have several wheels for > a given package version, each specific to an architecture, or environment. > For example, look at https://pypi.python.org/pypi/numpy all the different > version of Wheel available. > The {{pip}} tools knows how to select the right wheel file matching the > current system, and how to install this package in a light speed (without > compilation). Said otherwise, package that requires compilation of a C > module, for instance "numpy", does *not* compile anything when installing > from wheel file. > {{pypi.pypthon.org}} already provided wheels for major python version. It the > wheel is not available, pip will compile it from source anyway. Mirroring of > Pypi is possible through projects such as http://doc.devpi.net/latest/ > (untested) or the Pypi mirror support on Artifactory (tested personnally). > {{pip}} also provides the ability to generate easily all wheels of all > packages used for a given project which is inside a "virtualenv". This is > called "wheelhouse". You can even don't mess with this compilation and > retrieve it directly from pypi.python.org. > *Use Case 1: no internet connectivity* > Here my first proposal for a deployment workflow, in the case where the Spark > cluster does not have any internet connectivity or access to a Pypi mirror. > In this case the simplest way to deploy a project with several dependencies > is to build and then send to complete "wheelhouse": > - you are writing a PySpark script that increase in term of size and > dependencies. Deploying on Spark for example requires to build numpy or > Theano and other dependencies > - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script > into a standard Python package: > -- write a {{requirements.txt}}. I recommend to specify all package version. > You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the > requirements.txt > {code} > astroid==1.4.6 # via pylint > autopep8==1.2.4 > click==6.6 # via pip-tools > colorama==0.3.7 # via pylint > enum34==1.1.6 # via hypothesis > findspark==1.0.0 # via spark-testing-base > first==2.0.1 # via pip-tools > hypothesis==3.4.0 # via spark-testing-base > lazy-object-proxy==1.2.2 # via astroid > linecache2==1.0.0 # via traceback2 > pbr==1.10.0 > pep8==1.7.0 # via autopep8 > pip-tools==1.6.5 > py==1.4.31 # via pytest > pyflakes==1.2.3 > pylint==1.5.6 > pytest==2.9.2 # via spark-testing-base > six==1.10.0 # via astroid, pip-tools, pylint, unittest2 > spark-testing-base==0.0.7.post2 > traceback2==1.4.0 # via unittest2 > unittest2==1.1.0 # via spark-testing-base > wheel==0.29.0 > wrapt==1.10.8 # via astroid > {code} > -- write a setup.py with some entry points or package. Use >
[jira] [Created] (SPARK-22344) Prevent R CMD check from using /tmp
Shivaram Venkataraman created SPARK-22344: - Summary: Prevent R CMD check from using /tmp Key: SPARK-22344 URL: https://issues.apache.org/jira/browse/SPARK-22344 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.2 Reporter: Shivaram Venkataraman When R CMD check is run on the SparkR package it leaves behind files in /tmp which is a violation of CRAN policy. We should instead write to Rtmpdir. Notes from CRAN are below {code} Checking this leaves behind dirs hive/$USER $USER and files named like b4f6459b-0624-4100-8358-7aa7afbda757_resources in /tmp, in violation of the CRAN Policy. {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217310#comment-16217310 ] Shivaram Venkataraman commented on SPARK-15799: --- I created https://issues.apache.org/jira/browse/SPARK-22344 > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng >Assignee: Shivaram Venkataraman > Fix For: 2.1.2 > > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217299#comment-16217299 ] Juan Rodríguez Hortalá commented on SPARK-22148: Hi, I've been working on this issue, and I would like to get your feedback on the following approach. The idea is that instead of failing in `TaskSetManager.abortIfCompletelyBlacklisted`, when a task cannot be scheduled in any executor but dynamic allocation is enabled, we will register this task with `ExecutorAllocationManager`. Then `ExecutorAllocationManager` will request additional executors for these "unscheduleable tasks" by increasing the value returned in `ExecutorAllocationManager.maxNumExecutorsNeeded`. This way we are counting these tasks twice, but this makes sense because the current executors don't have any slot for these tasks, so we actually want to get new executors that are able to run these tasks. To avoid a deadlock due to tasks being unscheduleable forever, we store the timestamp when a task was registered as unscheduleable, and in `ExecutorAllocationManager.schedule` we abort the application if there is some task that has been unscheduleable for a configurable age threshold. This way we give an opportunity to dynamic allocation to get more executors that are able to run the tasks, but we don't make the application wait forever. Attached is a patch with a draft for this approach. Looking forward to your feedback on this. > TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current > executors are blacklisted but dynamic allocation is enabled > - > > Key: SPARK-22148 > URL: https://issues.apache.org/jira/browse/SPARK-22148 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Juan Rodríguez Hortalá > Attachments: SPARK-22148_WIP.diff > > > Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and > the whole Spark job with `task X (partition Y) cannot run anywhere due to > node and executor blacklist. Blacklisting behavior can be configured via > spark.blacklist.*.` when all the available executors are blacklisted for a > pending Task or TaskSet. This makes sense for static allocation, where the > set of executors is fixed for the duration of the application, but this might > lead to unnecessary job failures when dynamic allocation is enabled. For > example, in a Spark application with a single job at a time, when a node > fails at the end of a stage attempt, all other executors will complete their > tasks, but the tasks running in the executors of the failing node will be > pending. Spark will keep waiting for those tasks for 2 minutes by default > (spark.network.timeout) until the heartbeat timeout is triggered, and then it > will blacklist those executors for that stage. At that point in time, other > executors would had been released after being idle for 1 minute by default > (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't > started yet and so there are no more tasks available (assuming the default of > spark.speculation = false). So Spark will fail because the only executors > available are blacklisted for that stage. > An alternative is requesting more executors to the cluster manager in this > situation. This could be retried a configurable number of times after a > configurable wait time between request attempts, so if the cluster manager > fails to provide a suitable executor then the job is aborted like in the > previous case. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juan Rodríguez Hortalá updated SPARK-22148: --- Attachment: SPARK-22148_WIP.diff > TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current > executors are blacklisted but dynamic allocation is enabled > - > > Key: SPARK-22148 > URL: https://issues.apache.org/jira/browse/SPARK-22148 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Juan Rodríguez Hortalá > Attachments: SPARK-22148_WIP.diff > > > Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and > the whole Spark job with `task X (partition Y) cannot run anywhere due to > node and executor blacklist. Blacklisting behavior can be configured via > spark.blacklist.*.` when all the available executors are blacklisted for a > pending Task or TaskSet. This makes sense for static allocation, where the > set of executors is fixed for the duration of the application, but this might > lead to unnecessary job failures when dynamic allocation is enabled. For > example, in a Spark application with a single job at a time, when a node > fails at the end of a stage attempt, all other executors will complete their > tasks, but the tasks running in the executors of the failing node will be > pending. Spark will keep waiting for those tasks for 2 minutes by default > (spark.network.timeout) until the heartbeat timeout is triggered, and then it > will blacklist those executors for that stage. At that point in time, other > executors would had been released after being idle for 1 minute by default > (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't > started yet and so there are no more tasks available (assuming the default of > spark.speculation = false). So Spark will fail because the only executors > available are blacklisted for that stage. > An alternative is requesting more executors to the cluster manager in this > situation. This could be retried a configurable number of times after a > configurable wait time between request attempts, so if the cluster manager > fails to provide a suitable executor then the job is aborted like in the > previous case. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217279#comment-16217279 ] Hossein Falaki commented on SPARK-15799: Is there a ticket to follow up on new policy violation issue? The package has been archived by CRAN. > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng >Assignee: Shivaram Venkataraman > Fix For: 2.1.2 > > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217253#comment-16217253 ] Dan Blanchard commented on SPARK-16367: --- You don't actually appear to use the {{wheelhouse}} Python package at all in this proposal. You use {{pip wheel --wheel-dir=/path/to/wheelhouse}} to create a directory with all the wheels for the project, and that just requires {{pip}} and {{wheel}}. {{wheelhouse}} is a separate package for performing operations on those directories, but you do not appear to use it. The whole wheelhouse-as-a-name-for-a-directory-of-wheels thing is just a pun that a lot of people like to use, separate from the {{wheelhouse}} package. > Wheelhouse Support for PySpark > -- > > Key: SPARK-16367 > URL: https://issues.apache.org/jira/browse/SPARK-16367 > Project: Spark > Issue Type: New Feature > Components: Deploy, PySpark >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Semet > Labels: newbie, python, python-wheel, wheelhouse > Original Estimate: 168h > Remaining Estimate: 168h > > *Rational* > Is it recommended, in order to deploying Scala packages written in Scala, to > build big fat jar files. This allows to have all dependencies on one package > so the only "cost" is copy time to deploy this file on every Spark Node. > On the other hand, Python deployment is more difficult once you want to use > external packages, and you don't really want to mess with the IT to deploy > the packages on the virtualenv of each nodes. > This ticket proposes to allow users the ability to deploy their job as > "Wheels" packages. The Python community is strongly advocating to promote > this way of packaging and distributing Python application as a "standard way > of deploying Python App". In other word, this is the "Pythonic Way of > Deployment". > *Previous approaches* > I based the current proposal over the two following bugs related to this > point: > - SPARK-6764 ("Wheel support for PySpark") > - SPARK-13587("Support virtualenv in PySpark") > First part of my proposal was to merge, in order to support wheels install > and virtualenv creation > *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* > In Python, the packaging standard is now the "wheels" file format, which goes > further that good old ".egg" files. With a wheel file (".whl"), the package > is already prepared for a given architecture. You can have several wheels for > a given package version, each specific to an architecture, or environment. > For example, look at https://pypi.python.org/pypi/numpy all the different > version of Wheel available. > The {{pip}} tools knows how to select the right wheel file matching the > current system, and how to install this package in a light speed (without > compilation). Said otherwise, package that requires compilation of a C > module, for instance "numpy", does *not* compile anything when installing > from wheel file. > {{pypi.pypthon.org}} already provided wheels for major python version. It the > wheel is not available, pip will compile it from source anyway. Mirroring of > Pypi is possible through projects such as http://doc.devpi.net/latest/ > (untested) or the Pypi mirror support on Artifactory (tested personnally). > {{pip}} also provides the ability to generate easily all wheels of all > packages used for a given project which is inside a "virtualenv". This is > called "wheelhouse". You can even don't mess with this compilation and > retrieve it directly from pypi.python.org. > *Use Case 1: no internet connectivity* > Here my first proposal for a deployment workflow, in the case where the Spark > cluster does not have any internet connectivity or access to a Pypi mirror. > In this case the simplest way to deploy a project with several dependencies > is to build and then send to complete "wheelhouse": > - you are writing a PySpark script that increase in term of size and > dependencies. Deploying on Spark for example requires to build numpy or > Theano and other dependencies > - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script > into a standard Python package: > -- write a {{requirements.txt}}. I recommend to specify all package version. > You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the > requirements.txt > {code} > astroid==1.4.6 # via pylint > autopep8==1.2.4 > click==6.6 # via pip-tools > colorama==0.3.7 # via pylint > enum34==1.1.6 # via hypothesis > findspark==1.0.0 # via spark-testing-base > first==2.0.1 # via pip-tools > hypothesis==3.4.0 # via spark-testing-base > lazy-object-proxy==1.2.2 # via astroid > linecache2==1.0.0 # via traceback2 > pbr==1.10.0 > pep8==1.7.0 # via autopep8 > pip-tools==1.6.5 > py==1.4.31 # via pytest >
[jira] [Commented] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
[ https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217226#comment-16217226 ] Marcelo Vanzin commented on SPARK-22341: I was messing with this area recently, so let me take a look... > [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off > -- > > Key: SPARK-22341 > URL: https://issues.apache.org/jira/browse/SPARK-22341 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński > > I'm trying to run 2.3.0 (from master) on my yarn cluster. > The result is: > {code} > Exception in thread "main" org.apache.hadoop.security.AccessControlException: > Permission denied: user=yarn, access=EXECUTE, > inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} > I think the problem exists, because I'm not using yarn impersonation which > mean that all jobs on cluster are runned from user yarn. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217221#comment-16217221 ] Semet edited comment on SPARK-13587 at 10/24/17 4:46 PM: - Hello. For me this solution is equivalent with my "Wheelhouse" (SPARK-16367) proposal I made, even without having to modify pyspark at all. I even think you can package a wheelhouse using this {{--archive}} argument. The drawback is indeed your spark-submit has to send this package to each node (1 to n). If Pyspark supported {{requirements.txt}}/{{Pipfile}} dependencies description formats, each node would download by itself the dependencies... The strong argument for wheelhouse is that is only packages the libraries used by the project, not the complete environment. The drawback is that it may not work well with anaconda. was (Author: gae...@xeberon.net): Hello. For me this solution is equivalent with my "Wheelhouse" (SPARK-16367) proposal I made, even without having to modify pyspark at all. I even think you can package a wheelhouse using this {{--archive}} argument. The drawback is indeed your spark-submit has to send this package to each node (1 to n). If Pyspark supported {{requirements.txt}}/{{Pipfile}} dependencies description formats, each node would download by itself the dependencies... > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217221#comment-16217221 ] Semet commented on SPARK-13587: --- Hello. For me this solution is equivalent with my "Wheelhouse" (SPARK-16367) proposal I made, even without having to modify pyspark at all. I even think you can package a wheelhouse using this {{--archive}} argument. The drawback is indeed your spark-submit has to send this package to each node (1 to n). If Pyspark supported {{requirements.txt}}/{{Pipfile}} dependencies description formats, each node would download by itself the dependencies... > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217201#comment-16217201 ] Andriy Kushnir commented on SPARK-9686: --- [~rxin], I did a little research for this error. To invoke {{run()}} → {{runInternal()}} on any {{org.apache.hive.service.cli.operation.Operation}} (for example, {{GetSchemasOperation}}) we need {{IMetaStoreClient}}. Currently it's taken from {{HiveSession}} instance: {code:java} public class GetSchemasOperation extends MetadataOperation { @Override public void runInternal() throws HiveSQLException { IMetaStoreClient metastoreClient = getParentSession().getMetaStoreClient(); } } {code} All opened {{HiveSession}} s are handled by {{org.apache.hive.service.cli.session.SessionManager}} instance. {{SessionManager}}, among with others, implements {{org.apache.hive.service.Service}} interface, and all {{Service}}s initialized with same Hive configuration: {code:java} public interface Service { void init(HiveConf conf); } {code} When {{org.apache.spark.sql.hive.thriftserver.HiveThriftServer2}} initializes, all {{org.apache.hive.service.CompositeService}} s receive same {{HiveConf}}: {code:java} private[hive] class HiveThriftServer2(sqlContext: SQLContext) extends HiveServer2 with ReflectedCompositeService { override def init(hiveConf: HiveConf) { initCompositeService(hiveConf) } } object HiveThriftServer2 extends Logging { @DeveloperApi def startWithContext(sqlContext: SQLContext): Unit = { val server = new HiveThriftServer2(sqlContext) val executionHive = HiveUtils.newClientForExecution( sqlContext.sparkContext.conf, sqlContext.sessionState.newHadoopConf()) server.init(executionHive.conf) } } {code} So, {{HiveUtils#newClientForExecution()}} returns implementation of {{IMetaStoreClient}} which *ALWAYS* points to derby metastore (see dosctrings and comments in {{org.apache.spark.sql.hive.HiveUtils#newTemporaryConfiguration()}}) IMHO, to get correct metadata we need to additionally create another {{IMetaStoreClient}} with {{newClientForMetadata()}}, and pass it's {{HiveConf}} to underlying {{Service}} s. > Spark Thrift server doesn't return correct JDBC metadata > - > > Key: SPARK-9686 > URL: https://issues.apache.org/jira/browse/SPARK-9686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2 >Reporter: pin_zhang >Priority: Critical > Attachments: SPARK-9686.1.patch.txt > > > 1. Start start-thriftserver.sh > 2. connect with beeline > 3. create table > 4.show tables, the new created table returned > 5. > Class.forName("org.apache.hive.jdbc.HiveDriver"); > String URL = "jdbc:hive2://localhost:1/default"; >Properties info = new Properties(); > Connection conn = DriverManager.getConnection(URL, info); > ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), >null, null, null); > Problem: >No tables with returned this API, that work in spark1.3 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22343) Add support for publishing Spark metrics into Prometheus
Janos Matyas created SPARK-22343: Summary: Add support for publishing Spark metrics into Prometheus Key: SPARK-22343 URL: https://issues.apache.org/jira/browse/SPARK-22343 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.2.0 Reporter: Janos Matyas I've created a PR for supporting Prometheus a metric sink for Spark - in the https://github.com/apache-spark-on-k8s/spark fork. Based on the maintainers of the project I should create an issue here as well, in order to be tracked upstream as well. See below the original text of the PR: _ Publishing Spark metrics into Prometheus - as discussed earlier in #384. Implemented a metrics sink that publishes Spark metrics into Prometheus via [Prometheus Pushgateway](https://prometheus.io/docs/instrumenting/pushing/). Metrics data published by Spark is based on [Dropwizard](http://metrics.dropwizard.io/). The format of Spark metrics is not supported natively by Prometheus thus these are converted using [DropwizardExports](https://prometheus.io/client_java/io/prometheus/client/dropwizard/DropwizardExports.html) prior pushing metrics to the pushgateway. Also the default Prometheus pushgateway client API implementation does not support metrics timestamp thus the client API has been ehanced to enrich metrics data with timestamp. _ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22301) Add rule to Optimizer for In with empty list of values
[ https://issues.apache.org/jira/browse/SPARK-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22301. - Resolution: Fixed Assignee: Marco Gaido Fix Version/s: 2.3.0 > Add rule to Optimizer for In with empty list of values > -- > > Key: SPARK-22301 > URL: https://issues.apache.org/jira/browse/SPARK-22301 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Marco Gaido >Assignee: Marco Gaido > Fix For: 2.3.0 > > > For performance reason, we should resolve in operation on an empty list as > false in the optimizations phase. > For further reference, please look at the discussion on PRs: > https://github.com/apache/spark/pull/19522 and > https://github.com/apache/spark/pull/19494. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217134#comment-16217134 ] Cheburakshu commented on SPARK-22277: - There are 2 problems I faced because of which I quit using ChiSqSelector since it is producing inconsistent results. Also, I am not able to find one example in the documentation that uses selected features for training any model. Already, I opened a bug where the transform output is garbling content. 1. If I use numFeatures=5 and numFeatures=3 and examine the selectedFeatures indices, the three features are not a subset of 5 features. 2. If I use VectorIndexer/StringIndexer+OneHotEncoder before using in ChiSqSelector, the selectedFeatures indices of the model go out of bounds. I did not get it when you are saying that my code has some issue when I plucked it just out of Spark ChiSqSelector example. Are you suggesting that Spark example itself is wrong? https://spark.apache.org/docs/latest/ml-features.html#chisqselector [~cheburakshu] Your post code has some issue: {code} df = spark.createDataFrame([ (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"]) {code} Are these features categorical features ? > Chi Square selector garbling Vector content. > > > Key: SPARK-22277 > URL: https://issues.apache.org/jira/browse/SPARK-22277 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Cheburakshu > > There is a difference in behavior when Chisquare selector is used v direct > feature use in decision tree classifier. > In the below code, I have used chisquare selector as a thru' pass but the > decision tree classifier is unable to process it. But, it is able to process > when the features are used directly. > The example is pulled out directly from Apache spark python documentation. > Kindly help. > {code:python} > from pyspark.ml.feature import ChiSqSelector > from pyspark.ml.linalg import Vectors > import sys > df = spark.createDataFrame([ > (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), > (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), > (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", > "clicked"]) > # ChiSq selector will just be a pass-through. All four featuresin the i/p > will be in output also. > selector = ChiSqSelector(numTopFeatures=4, featuresCol="features", > outputCol="selectedFeatures", labelCol="clicked") > result = selector.fit(df).transform(df) > print("ChiSqSelector output with top %d features selected" % > selector.getNumTopFeatures()) > from pyspark.ml.classification import DecisionTreeClassifier > try: > # Fails > dt = > DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures") > model = dt.fit(result) > except: > print(sys.exc_info()) > #Works > dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features") > model = dt.fit(df) > > # Make predictions. Using same dataset, not splitting!! > predictions = model.transform(result) > # Select example rows to display. > predictions.select("prediction", "clicked", "features").show(5) > # Select (prediction, true label) and compute test error > evaluator = MulticlassClassificationEvaluator( > labelCol="clicked", predictionCol="prediction", metricName="accuracy") > accuracy = evaluator.evaluate(predictions) > print("Test Error = %g " % (1.0 - accuracy)) > {code} > Output: > ChiSqSelector output with top 4 features selected > (, > IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but > it does not have the number of values specified.', > 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t > at > org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t > at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at > org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t > at >
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217115#comment-16217115 ] Nicholas Chammas commented on SPARK-13587: -- To follow-up on my [earlier comment|https://issues.apache.org/jira/browse/SPARK-13587?focusedCommentId=15740419=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15740419], I created a [completely self-contained sample repo|https://github.com/massmutual/sample-pyspark-application] demonstrating a technique for bundling PySpark app dependencies in an isolated way. It's the technique that Ben, I, and several others discussed here in this JIRA issue. https://github.com/massmutual/sample-pyspark-application The approach has advantages (like letting you ship a completely isolated Python environment, so you don't even need Python installed on the workers) and disadvantages (requires YARN; increases job startup time). Hope some of you find the sample repo useful until Spark adds more "first-class" support for Python dependency isolation. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22331) Make MLlib string params case-insensitive
[ https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216224#comment-16216224 ] Weichen Xu edited comment on SPARK-22331 at 10/24/17 2:19 PM: -- OK, it seems not breaking current code in spark, but it is possible to influence user extension code. e.g., some user create his own `evaluator` class and in `metricName` param he use `ParamValidators.inArray(Array("aa", "AA"))` as allowedParams.But `Param` is develop api so it also doesn't matter. OK I agree to make this change, if we do not find it break something. was (Author: weichenxu123): OK, it seems not breaking current code in spark, but it is possible to influence user extension code. e.g., some user create his own `evaluator` class and in `metricName` param he use `ParamValidators.inArray(Array("aa", "AA"))` as allowedParams. > Make MLlib string params case-insensitive > - > > Key: SPARK-22331 > URL: https://issues.apache.org/jira/browse/SPARK-22331 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > Some String params in ML are still case-sensitive, as they are checked by > ParamValidators.inArray. > For consistency in user experience, there should be some general guideline in > whether String params in Spark MLlib are case-insensitive or not. > I'm leaning towards making all String params case-insensitive where possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-22342) refactor schedulerDriver registration
[ https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22342: Comment: was deleted (was: @Arthur Rand fyi) > refactor schedulerDriver registration > - > > Key: SPARK-22342 > URL: https://issues.apache.org/jira/browse/SPARK-22342 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Stavros Kontopoulos > > This is an umbrella issue for working on: > https://github.com/apache/spark/pull/13143 > and handle the multiple re-registration issue which invalidates an offer. > To test: > dcos spark run --verbose --name=spark-nohive --submit-args="--driver-cores > 1 --conf spark.cores.max=1 --driver-memory 512M --class > org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar; > master log: > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3085 hierarchical.cpp:303] Added framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3085 hierarchical.cpp:412] Deactivated framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3090 hierarchical.cpp:380] Activated framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating
[jira] [Comment Edited] (SPARK-22342) refactor schedulerDriver registration
[ https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216959#comment-16216959 ] Stavros Kontopoulos edited comment on SPARK-22342 at 10/24/17 2:08 PM: --- @Arthur Rand fyi was (Author: skonto): @Arthur Rand > refactor schedulerDriver registration > - > > Key: SPARK-22342 > URL: https://issues.apache.org/jira/browse/SPARK-22342 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Stavros Kontopoulos > > This is an umbrella issue for working on: > https://github.com/apache/spark/pull/13143 > and handle the multiple re-registration issue which invalidates an offer. > To test: > dcos spark run --verbose --name=spark-nohive --submit-args="--driver-cores > 1 --conf spark.cores.max=1 --driver-memory 512M --class > org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar; > master log: > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3085 hierarchical.cpp:303] Added framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3085 hierarchical.cpp:412] Deactivated framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3090 hierarchical.cpp:380] Activated framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with
[jira] [Commented] (SPARK-22342) refactor schedulerDriver registration
[ https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216959#comment-16216959 ] Stavros Kontopoulos commented on SPARK-22342: - @Arthur Rand > refactor schedulerDriver registration > - > > Key: SPARK-22342 > URL: https://issues.apache.org/jira/browse/SPARK-22342 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Stavros Kontopoulos > > This is an umbrella issue for working on: > https://github.com/apache/spark/pull/13143 > and handle the multiple re-registration issue which invalidates an offer. > To test: > dcos spark run --verbose --name=spark-nohive --submit-args="--driver-cores > 1 --conf spark.cores.max=1 --driver-memory 512M --class > org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar; > master log: > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3085 hierarchical.cpp:303] Added framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3085 hierarchical.cpp:412] Deactivated framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3090 hierarchical.cpp:380] Activated framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618]
[jira] [Updated] (SPARK-22342) refactor schedulerDriver registration
[ https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22342: Description: This is an umbrella issue for working on: https://github.com/apache/spark/pull/13143 and handle the multiple re-registration issue which invalidates an offer. To test: dcos spark run --verbose --name=spark-nohive --submit-args="--driver-cores 1 --conf spark.cores.max=1 --driver-memory 512M --class org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar; master log: I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3085 hierarchical.cpp:303] Added framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3085 hierarchical.cpp:412] Deactivated framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3090 hierarchical.cpp:380] Activated framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ] I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ] I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ] I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039 I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038 I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037 I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036 I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035 I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034 I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ] I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ] I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework
[jira] [Created] (SPARK-22342) refactor schedulerDriver registration
Stavros Kontopoulos created SPARK-22342: --- Summary: refactor schedulerDriver registration Key: SPARK-22342 URL: https://issues.apache.org/jira/browse/SPARK-22342 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.2.0 Reporter: Stavros Kontopoulos This is an umbrella issue for working on: https://github.com/apache/spark/pull/13143 and handle the multiple re-registration issue which invalidates an offer. ``` I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3085 hierarchical.cpp:303] Added framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3085 hierarchical.cpp:412] Deactivated framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3090 hierarchical.cpp:380] Activated framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ] I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ] I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ] I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039 I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038 I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037 I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036 I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035 I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034 I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ] I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020 13:49:05.00 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ] I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 I1020
[jira] [Commented] (SPARK-21043) Add unionByName API to Dataset
[ https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216953#comment-16216953 ] Carlos Bribiescas commented on SPARK-21043: --- I really like this feature. Is there a motivation not to replace union with this functionality, other than backwards compatibility? > Add unionByName API to Dataset > -- > > Key: SPARK-21043 > URL: https://issues.apache.org/jira/browse/SPARK-21043 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Takeshi Yamamuro > Fix For: 2.3.0 > > > It would be useful to add unionByName which resolves columns by name, in > addition to the existing union (which resolves by position). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216950#comment-16216950 ] Carlos Bribiescas commented on SPARK-22335: --- I think if unionByName replaced union then it would be a solution. Its definitely a workaround... But as the api stands it feels like a bug since Dataset is supposed to be typed. Again, I suspect it has to do with the optimizer pushing the typing to a later step, after the union by column order happens. If this is the root cause of the bug I worry how else its being manifested, that is, what other bugs it may cause. I'll have to think about it a bit more. > Union for DataSet uses column order instead of types for union > -- > > Key: SPARK-22335 > URL: https://issues.apache.org/jira/browse/SPARK-22335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Carlos Bribiescas > > I see union uses column order for a DF. This to me is "fine" since they > aren't typed. > However, for a dataset which is supposed to be strongly typed it is actually > giving the wrong result. If you try to access the members by name, it will > use the order. Heres is a reproducible case. 2.2.0 > {code:java} > case class AB(a : String, b : String) > val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") > val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") > > abDf.union(baDf).show() // as linked ticket states, its "Not a problem" > > val abDs = abDf.as[AB] > val baDs = baDf.as[AB] > > abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] > should be correctly mapped by type, not by column order > > abDs.union(baDs).map(_.a).show() // This gives wrong result since a > Dataset[AB] should be correctly mapped by type, not by column order >abDs.union(baDs).rdd.take(2) // This also gives wrong result > baDs.map(_.a).show() // However, this gives the correct result, even though > columns were out of order. > abDs.map(_.a).show() // This is correct too > baDs.select("a","b").as[AB].union(abDs).show() // This is the same > workaround for linked issue, slightly modified. However this seems wrong > since its supposed to be strongly typed > > baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct > result, which is logically inconsistent behavior > abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives > correct result > {code} > So its inconsistent and a bug IMO. And I'm not sure that the suggested work > around is really fair, since I'm supposed to be getting of type `AB`. More > importantly I think the issue is bigger when you consider that it happens > even if you read from parquet (as you would expect). And that its > inconsistent when going to/from rdd. > I imagine its just lazily converting to typed DS instead of initially. So > either that typing could be prioritized to happen before the union or > unioning of DF could be done with column order taken into account. Again, > this is speculation.. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22111) OnlineLDAOptimizer should filter out empty documents beforehand
[ https://issues.apache.org/jira/browse/SPARK-22111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22111: Assignee: (was: Apache Spark) > OnlineLDAOptimizer should filter out empty documents beforehand > > > Key: SPARK-22111 > URL: https://issues.apache.org/jira/browse/SPARK-22111 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Minor > > OnlineLDAOptimizer should filter out empty documents beforehand in order to > make corpusSize, batchSize, and nonEmptyDocsN all refer to the same filtered > corpus with all non-empty docs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22111) OnlineLDAOptimizer should filter out empty documents beforehand
[ https://issues.apache.org/jira/browse/SPARK-22111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22111: Assignee: Apache Spark > OnlineLDAOptimizer should filter out empty documents beforehand > > > Key: SPARK-22111 > URL: https://issues.apache.org/jira/browse/SPARK-22111 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Minor > > OnlineLDAOptimizer should filter out empty documents beforehand in order to > make corpusSize, batchSize, and nonEmptyDocsN all refer to the same filtered > corpus with all non-empty docs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22111) OnlineLDAOptimizer should filter out empty documents beforehand
[ https://issues.apache.org/jira/browse/SPARK-22111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216948#comment-16216948 ] Apache Spark commented on SPARK-22111: -- User 'akopich' has created a pull request for this issue: https://github.com/apache/spark/pull/19565 > OnlineLDAOptimizer should filter out empty documents beforehand > > > Key: SPARK-22111 > URL: https://issues.apache.org/jira/browse/SPARK-22111 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Minor > > OnlineLDAOptimizer should filter out empty documents beforehand in order to > make corpusSize, batchSize, and nonEmptyDocsN all refer to the same filtered > corpus with all non-empty docs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22118) Should prevent change epoch in success stage while there is some running stage
[ https://issues.apache.org/jira/browse/SPARK-22118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216908#comment-16216908 ] Maciej Bryński commented on SPARK-22118: I think this problem is resolved by: SPARK-20715 > Should prevent change epoch in success stage while there is some running stage > -- > > Key: SPARK-22118 > URL: https://issues.apache.org/jira/browse/SPARK-22118 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: SuYan >Priority: Critical > > In 2.1, will change epoch if stage success, and will trigger mapoutTracker to > clean cache broadcast > but when there have other running stage, and want to get mapstatus broadcast > value...it will has the possibility that occurs failed to get broadcast? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
[ https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216877#comment-16216877 ] Maciej Bryński edited comment on SPARK-22341 at 10/24/17 1:29 PM: -- Because Spark 2.2.0 is working perfectly fine with such a configuration. The problem is that Spark 2.3.0 is using wrong user on AM when trying to read files from hdfs. was (Author: maver1ck): Because Spark 2.2.0 is working perfectly fine with such a configuration. The problem is that Spark 2.3.0 is using wrong user on AM. > [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off > -- > > Key: SPARK-22341 > URL: https://issues.apache.org/jira/browse/SPARK-22341 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński > > I'm trying to run 2.3.0 (from master) on my yarn cluster. > The result is: > {code} > Exception in thread "main" org.apache.hadoop.security.AccessControlException: > Permission denied: user=yarn, access=EXECUTE, > inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} > I think the problem exists, because I'm not using yarn impersonation which > mean that all jobs on cluster are runned from user yarn. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leif Walsh updated SPARK-22340: --- Description: With pyspark, {{sc.setJobGroup}}'s documentation says {quote} Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared. {quote} However, this doesn't appear to be associated with Python threads, only with Java threads. As such, a Python thread which calls this and then submits multiple jobs doesn't necessarily get its jobs associated with any particular spark job group. For example: {code} def run_jobs(): sc.setJobGroup('hello', 'hello jobs') x = sc.range(100).sum() y = sc.range(1000).sum() return x, y import concurrent.futures with concurrent.futures.ThreadPoolExecutor() as executor: future = executor.submit(run_jobs) sc.cancelJobGroup('hello') future.result() {code} In this example, depending how the action calls on the Python side are allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be assigned the job group {{hello}}. First, we should clarify the docs if this truly is the case. Second, it would be really helpful if we could make the job group assignment reliable for a Python thread, though I’m not sure the best way to do this. As it stands, job groups are pretty useless from the pyspark side, if we can't rely on this fact. My only idea so far is to mimic the TLS behavior on the Python side and then patch every point where job submission may take place to pass that in, but this feels pretty brittle. In my experience with py4j, controlling threading there is a challenge. was: With pyspark, {{sc.setJobGroup}}'s documentation says {quote} Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared. {quote} However, this doesn't appear to be associated with Python threads, only with Java threads. As such, a Python thread which calls this and then submits multiple jobs doesn't necessarily get its jobs associated with any particular spark job group. For example: {code} def run_jobs(): sc.setJobGroup('hello', 'hello jobs') x = sc.range(100).sum() y = sc.range(1000).sum() return x, y import concurrent.futures with concurrent.futures.ThreadPoolExecutor() as executor: future = executor.submit(run_jobs) sc.cancelJobGroup('hello') future.result() {code} In this example, depending how the action calls on the Python side are allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be assigned the job group {{hello}}. First, we should clarify the docs if this truly is the case. Second, it would be really helpful if this could be made the case. As it stands, job groups are pretty useless from the pyspark side, if we can't rely on this fact. > pyspark setJobGroup doesn't match java threads > -- > > Key: SPARK-22340 > URL: https://issues.apache.org/jira/browse/SPARK-22340 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Leif Walsh > > With pyspark, {{sc.setJobGroup}}'s documentation says > {quote} > Assigns a group ID to all the jobs started by this thread until the group ID > is set to a different value or cleared. > {quote} > However, this doesn't appear to be associated with Python threads, only with > Java threads. As such, a Python thread which calls this and then submits > multiple jobs doesn't necessarily get its jobs associated with any particular > spark job group. For example: > {code} > def run_jobs(): > sc.setJobGroup('hello', 'hello jobs') > x = sc.range(100).sum() > y = sc.range(1000).sum() > return x, y > import concurrent.futures > with concurrent.futures.ThreadPoolExecutor() as executor: > future = executor.submit(run_jobs) > sc.cancelJobGroup('hello') > future.result() > {code} > In this example, depending how the action calls on the Python side are > allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be > assigned the job group {{hello}}. > First, we should clarify the docs if this truly is the case. > Second, it would be really helpful if we could make the job group assignment > reliable for a Python thread, though I’m not sure the best way to do this. > As it stands, job groups are pretty useless from the pyspark side, if we > can't rely on this fact. > My only idea so far is to mimic the TLS behavior on the Python side and then > patch every point where job submission may take place to pass that in, but > this feels pretty brittle. In my experience with py4j, controlling threading > there is a challenge. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
[ https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216877#comment-16216877 ] Maciej Bryński edited comment on SPARK-22341 at 10/24/17 1:20 PM: -- Because Spark 2.2.0 is working perfectly fine with such a configuration. The problem is that Spark 2.3.0 is using wrong user on AM. was (Author: maver1ck): Because Spark 2.2.0 is working perfectly fine with such a configuration. > [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off > -- > > Key: SPARK-22341 > URL: https://issues.apache.org/jira/browse/SPARK-22341 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński > > I'm trying to run 2.3.0 (from master) on my yarn cluster. > The result is: > {code} > Exception in thread "main" org.apache.hadoop.security.AccessControlException: > Permission denied: user=yarn, access=EXECUTE, > inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} > I think the problem exists, because I'm not using yarn impersonation which > mean that all jobs on cluster are runned from user yarn. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail:
[jira] [Commented] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
[ https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216877#comment-16216877 ] Maciej Bryński commented on SPARK-22341: Because Spark 2.2.0 is working perfectly fine with such a configuration. > [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off > -- > > Key: SPARK-22341 > URL: https://issues.apache.org/jira/browse/SPARK-22341 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński > > I'm trying to run 2.3.0 (from master) on my yarn cluster. > The result is: > {code} > Exception in thread "main" org.apache.hadoop.security.AccessControlException: > Permission denied: user=yarn, access=EXECUTE, > inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} > I think the problem exists, because I'm not using yarn impersonation which > mean that all jobs on cluster are runned from user yarn. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
[ https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22341: -- Priority: Major (was: Blocker) > [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off > -- > > Key: SPARK-22341 > URL: https://issues.apache.org/jira/browse/SPARK-22341 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński > > I'm trying to run 2.3.0 (from master) on my yarn cluster. > The result is: > {code} > Exception in thread "main" org.apache.hadoop.security.AccessControlException: > Permission denied: user=yarn, access=EXECUTE, > inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} > I think the problem exists, because I'm not using yarn impersonation which > mean that all jobs on cluster are runned from user yarn. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
[ https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216868#comment-16216868 ] Sean Owen commented on SPARK-22341: --- Yes, but why is that a Spark problem? > [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off > -- > > Key: SPARK-22341 > URL: https://issues.apache.org/jira/browse/SPARK-22341 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński >Priority: Blocker > > I'm trying to run 2.3.0 (from master) on my yarn cluster. > The result is: > {code} > Exception in thread "main" org.apache.hadoop.security.AccessControlException: > Permission denied: user=yarn, access=EXECUTE, > inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} > I think the problem exists, because I'm not using yarn impersonation which > mean that all jobs on cluster are runned from user yarn. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
[ https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216862#comment-16216862 ] Pranav Singhania commented on SPARK-16599: -- I've been temporarily able to avoid the error using a local SparkSession declared in the function rather than declaring it globally accessible to all the functions in my object/class. If required in other functions, passed it as a parameter to the particular function. Hope it helps to resolve the bug. Cheers! > java.util.NoSuchElementException: None.get at at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343) > -- > > Key: SPARK-16599 > URL: https://issues.apache.org/jira/browse/SPARK-16599 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: centos 6.7 spark 2.0 >Reporter: binde > > run a spark job with spark 2.0, error message > Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343) > at > org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
Maciej Bryński created SPARK-22341: -- Summary: [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off Key: SPARK-22341 URL: https://issues.apache.org/jira/browse/SPARK-22341 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 2.3.0 Reporter: Maciej Bryński Priority: Blocker I'm trying to run 2.3.0 (from master) on my yarn cluster. The result is: {code} Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=EXECUTE, inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx-- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990) at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118) at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216) at scala.Option.foreach(Option.scala:257) at org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821) at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842) at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) {code} I think the problem exists, because I'm not using yarn impersonation which mean that all jobs on cluster are runned from user yarn. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216623#comment-16216623 ] Weichen Xu edited comment on SPARK-22277 at 10/24/17 12:45 PM: --- [~cheburakshu] Your post code has some issue: {code} df = spark.createDataFrame([ (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"]) {code} Are these features categorical features ? If yes, then you cannot directly training them by DecisionTreeClassifier, because they lack of metadata(Binary / Nominal) and will be regarded as continuous features. You need use `VectorIndexer` to process the categorical features first. If no, then you cannot use ChiSqSelector to select topFeatures, because ChiSqSelector is only available for categorical features. But your above code compare the two cases, they're contrary, so I am confused. If these features are categorical, you can process them first by VectorIndexer, it will fill metadata for the features first, and then process by ChiSqSelector/DecisionTreeClassifier, then I think no error will occur. was (Author: weichenxu123): [~cheburakshu] Your post code has some issue: {code} df = spark.createDataFrame([ (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"]) {code} Are these features categorical features ? If yes, then you cannot directly training them by DecisionTreeClassifier, because they lack of metadata(Binary / Nominal) and will be regarded as continuous features. If no, then you cannot use ChiSqSelector to select topFeatures, because ChiSqSelector is only available for categorical features. But your above code compare the two cases, they're contrary, so I am confused. If these features are categorical, you can process them first by VectorIndexer, it will fill metadata for the features first, and then process by ChiSqSelector/DecisionTreeClassifier, then I think no error will occur. > Chi Square selector garbling Vector content. > > > Key: SPARK-22277 > URL: https://issues.apache.org/jira/browse/SPARK-22277 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Cheburakshu > > There is a difference in behavior when Chisquare selector is used v direct > feature use in decision tree classifier. > In the below code, I have used chisquare selector as a thru' pass but the > decision tree classifier is unable to process it. But, it is able to process > when the features are used directly. > The example is pulled out directly from Apache spark python documentation. > Kindly help. > {code:python} > from pyspark.ml.feature import ChiSqSelector > from pyspark.ml.linalg import Vectors > import sys > df = spark.createDataFrame([ > (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), > (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), > (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", > "clicked"]) > # ChiSq selector will just be a pass-through. All four featuresin the i/p > will be in output also. > selector = ChiSqSelector(numTopFeatures=4, featuresCol="features", > outputCol="selectedFeatures", labelCol="clicked") > result = selector.fit(df).transform(df) > print("ChiSqSelector output with top %d features selected" % > selector.getNumTopFeatures()) > from pyspark.ml.classification import DecisionTreeClassifier > try: > # Fails > dt = > DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures") > model = dt.fit(result) > except: > print(sys.exc_info()) > #Works > dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features") > model = dt.fit(df) > > # Make predictions. Using same dataset, not splitting!! > predictions = model.transform(result) > # Select example rows to display. > predictions.select("prediction", "clicked", "features").show(5) > # Select (prediction, true label) and compute test error > evaluator = MulticlassClassificationEvaluator( > labelCol="clicked", predictionCol="prediction", metricName="accuracy") > accuracy = evaluator.evaluate(predictions) > print("Test Error = %g " % (1.0 - accuracy)) > {code} > Output: > ChiSqSelector output with top 4 features selected > (, > IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but > it does not have the number of values specified.', > 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t > at >
[jira] [Updated] (SPARK-22118) Should prevent change epoch in success stage while there is some running stage
[ https://issues.apache.org/jira/browse/SPARK-22118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-22118: --- Priority: Critical (was: Major) > Should prevent change epoch in success stage while there is some running stage > -- > > Key: SPARK-22118 > URL: https://issues.apache.org/jira/browse/SPARK-22118 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: SuYan >Priority: Critical > > In 2.1, will change epoch if stage success, and will trigger mapoutTracker to > clean cache broadcast > but when there have other running stage, and want to get mapstatus broadcast > value...it will has the possibility that occurs failed to get broadcast? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22118) Should prevent change epoch in success stage while there is some running stage
[ https://issues.apache.org/jira/browse/SPARK-22118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-22118: --- Affects Version/s: 2.2.0 > Should prevent change epoch in success stage while there is some running stage > -- > > Key: SPARK-22118 > URL: https://issues.apache.org/jira/browse/SPARK-22118 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: SuYan >Priority: Critical > > In 2.1, will change epoch if stage success, and will trigger mapoutTracker to > clean cache broadcast > but when there have other running stage, and want to get mapstatus broadcast > value...it will has the possibility that occurs failed to get broadcast? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22118) Should prevent change epoch in success stage while there is some running stage
[ https://issues.apache.org/jira/browse/SPARK-22118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216768#comment-16216768 ] Maciej Bryński commented on SPARK-22118: I think I have such a problem {code} 2017-10-24 11:23:47,482 DEBUG [org.apache.spark.scheduler.DAGScheduler] - submitStage(ShuffleMapStage 52) 2017-10-24 11:23:47,484 DEBUG [org.apache.hadoop.hdfs.DFSClient] - DFSClient seqno: 2164 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1761269 2017-10-24 11:23:47,484 DEBUG [org.apache.hadoop.hdfs.DFSClient] - DFSClient seqno: 2165 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1631471 2017-10-24 11:23:47,485 DEBUG [org.apache.hadoop.hdfs.DFSClient] - DFSClient seqno: 2166 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1608203 2017-10-24 11:23:47,544 INFO [org.apache.spark.storage.memory.MemoryStore] - Block broadcast_69 stored as values in memory (estimated size 514.2 KB, free 897.0 MB) 2017-10-24 11:23:47,544 DEBUG [org.apache.spark.storage.BlockManager] - Put block broadcast_69 locally took 0 ms 2017-10-24 11:23:47,544 DEBUG [org.apache.spark.storage.BlockManager] - Putting block broadcast_69 without replication took 0 ms 2017-10-24 11:23:47,549 INFO [org.apache.spark.storage.memory.MemoryStore] - Block broadcast_69_piece0 stored as bytes in memory (estimated size 514.6 KB, free 896.5 MB) 2017-10-24 11:23:47,550 INFO [org.apache.spark.storage.BlockManagerInfo] - Added broadcast_69_piece0 in memory on 10.32.32.227:36396 (size: 514.6 KB, free: 910.7 MB) 2017-10-24 11:23:47,550 DEBUG [org.apache.spark.storage.BlockManagerMaster] - Updated info of block broadcast_69_piece0 2017-10-24 11:23:47,550 DEBUG [org.apache.spark.storage.BlockManager] - Told master about block broadcast_69_piece0 2017-10-24 11:23:47,550 DEBUG [org.apache.spark.storage.BlockManager] - Put block broadcast_69_piece0 locally took 1 ms 2017-10-24 11:23:47,550 DEBUG [org.apache.spark.storage.BlockManager] - Putting block broadcast_69_piece0 without replication took 1 ms 2017-10-24 11:23:47,550 INFO [org.apache.spark.MapOutputTracker] - Broadcast mapstatuses size = 430, actual size = 526530 2017-10-24 11:23:47,550 INFO [org.apache.spark.MapOutputTrackerMaster] - Size of output statuses for shuffle 14 is 430 bytes 2017-10-24 11:23:47,550 INFO [org.apache.spark.MapOutputTrackerMaster] - Epoch changed, not caching! 2017-10-24 11:23:47,550 DEBUG [org.apache.spark.broadcast.TorrentBroadcast] - Unpersisting TorrentBroadcast 69 2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.BlockManagerSlaveEndpoint] - removing broadcast 69 2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.BlockManager] - Removing broadcast 69 2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.BlockManager] - Removing block broadcast_69 2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.memory.MemoryStore] - Block broadcast_69 of size 526576 dropped from memory (free 940623361) 2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.BlockManager] - Removing block broadcast_69_piece0 2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.memory.MemoryStore] - Block broadcast_69_piece0 of size 526913 dropped from memory (free 941150274) 2017-10-24 11:23:47,551 DEBUG [org.apache.spark.MapOutputTrackerMaster] - cached status not found for : 14 2017-10-24 11:23:47,552 INFO [org.apache.spark.storage.BlockManagerInfo] - Removed broadcast_69_piece0 on 10.32.32.227:36396 in memory (size: 514.6 KB, free: 911.2 MB) 2017-10-24 11:23:47,554 DEBUG [org.apache.spark.storage.BlockManagerMaster] - Updated info of block broadcast_69_piece0 2017-10-24 11:23:47,554 DEBUG [org.apache.spark.storage.BlockManager] - Told master about block broadcast_69_piece0 2017-10-24 11:23:47,555 DEBUG [org.apache.spark.storage.BlockManagerSlaveEndpoint] - Done removing broadcast 69, response is 0 2017-10-24 11:23:47,555 DEBUG [org.apache.spark.storage.BlockManagerSlaveEndpoint] - Sent response: 0 to 10.32.32.227:53171 2017-10-24 11:23:47,556 INFO [org.apache.spark.MapOutputTrackerMasterEndpoint] - Asked to send map output locations for shuffle 14 to 10.32.32.28:39424 2017-10-24 11:23:47,556 DEBUG [org.apache.spark.MapOutputTrackerMaster] - Handling request to send map output locations for shuffle 14 to 10.32.32.28:39424 2017-10-24 11:23:47,556 DEBUG [org.apache.spark.MapOutputTrackerMaster] - cached status not found for : 14 2017-10-24 11:23:47,607 DEBUG [org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint] - Launching task 31490 on executor id: 2 hostname: dwh-hn28.adpilot.co. 2017-10-24 11:23:47,608 DEBUG [org.apache.spark.ExecutorAllocationManager] - Clearing idle timer for 2 because it is now running a task 2017-10-24 11:23:47,633 DEBUG [org.apache.hadoop.ipc.Client] - IPC Client (766216029) connection to master/10.32.32.3:8032 from bi sending #1418 2017-10-24
[jira] [Updated] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22323: -- Fix Version/s: (was: 2.3.0) > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216683#comment-16216683 ] Apache Spark commented on SPARK-21936: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/19564 > backward compatibility test framework for HiveExternalCatalog > - > > Key: SPARK-21936 > URL: https://issues.apache.org/jira/browse/SPARK-21936 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.1, 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe
[ https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216634#comment-16216634 ] Andreas Maier commented on SPARK-22249: --- Thank you for solving this issue so quickly. Not every open source project is reacting so fast. > UnsupportedOperationException: empty.reduceLeft when caching a dataframe > > > Key: SPARK-22249 > URL: https://issues.apache.org/jira/browse/SPARK-22249 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: $ uname -a > Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 > 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.2.0 > /_/ > > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92 > Branch > Compiled by user jenkins on 2017-06-30T22:58:04Z > Revision > Url >Reporter: Andreas Maier >Assignee: Marco Gaido > Fix For: 2.2.1, 2.3.0 > > > It seems that the {{isin()}} method with an empty list as argument only > works, if the dataframe is not cached. If it is cached, it results in an > exception. To reproduce > {code:java} > $ pyspark > >>> df = spark.createDataFrame([pyspark.Row(KEY="value")]) > >>> df.where(df["KEY"].isin([])).show() > +---+ > |KEY| > +---+ > +---+ > >>> df.cache() > DataFrame[KEY: string] > >>> df.where(df["KEY"].isin([])).show() > Traceback (most recent call last): > File "", line 1, in > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py", > line 336, in show > print(self._jdf.showString(n, 20)) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", > line 1133, in __call__ > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString. > : java.lang.UnsupportedOperationException: empty.reduceLeft > at > scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180) > at > scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74) > at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) > at scala.collection.AbstractTraversable.reduce(Traversable.scala:104) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) > at >
[jira] [Comment Edited] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216623#comment-16216623 ] Weichen Xu edited comment on SPARK-22277 at 10/24/17 9:58 AM: -- [~cheburakshu] Your post code has some issue: {code} df = spark.createDataFrame([ (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"]) {code} Are these features categorical features ? If yes, then you cannot directly training them by DecisionTreeClassifier, because they lack of metadata(Binary / Nominal) and will be regarded as continuous features. If no, then you cannot use ChiSqSelector to select topFeatures, because ChiSqSelector is only available for categorical features. But your above code compare the two cases, they're contrary, so I am confused. If these features are categorical, you can process them first by VectorIndexer, it will fill metadata for the features first, and then process by ChiSqSelector/DecisionTreeClassifier, then I think no error will occur. was (Author: weichenxu123): [~cheburakshu] Your post code has some issue: {code} df = spark.createDataFrame([ (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"]) {code} Are these features categorical features ? If yes, then you cannot directly training them by DecisionTreeClassifier, because they lack of metadata(Binary / Nominal) and will be regarded as continuous features. If no, then you cannot use ChiSqSelector to select topFeatures, because ChiSqSelector is only available for categorical features. But your above code compare the two cases, they're contrary, so I am confused. > Chi Square selector garbling Vector content. > > > Key: SPARK-22277 > URL: https://issues.apache.org/jira/browse/SPARK-22277 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Cheburakshu > > There is a difference in behavior when Chisquare selector is used v direct > feature use in decision tree classifier. > In the below code, I have used chisquare selector as a thru' pass but the > decision tree classifier is unable to process it. But, it is able to process > when the features are used directly. > The example is pulled out directly from Apache spark python documentation. > Kindly help. > {code:python} > from pyspark.ml.feature import ChiSqSelector > from pyspark.ml.linalg import Vectors > import sys > df = spark.createDataFrame([ > (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), > (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), > (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", > "clicked"]) > # ChiSq selector will just be a pass-through. All four featuresin the i/p > will be in output also. > selector = ChiSqSelector(numTopFeatures=4, featuresCol="features", > outputCol="selectedFeatures", labelCol="clicked") > result = selector.fit(df).transform(df) > print("ChiSqSelector output with top %d features selected" % > selector.getNumTopFeatures()) > from pyspark.ml.classification import DecisionTreeClassifier > try: > # Fails > dt = > DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures") > model = dt.fit(result) > except: > print(sys.exc_info()) > #Works > dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features") > model = dt.fit(df) > > # Make predictions. Using same dataset, not splitting!! > predictions = model.transform(result) > # Select example rows to display. > predictions.select("prediction", "clicked", "features").show(5) > # Select (prediction, true label) and compute test error > evaluator = MulticlassClassificationEvaluator( > labelCol="clicked", predictionCol="prediction", metricName="accuracy") > accuracy = evaluator.evaluate(predictions) > print("Test Error = %g " % (1.0 - accuracy)) > {code} > Output: > ChiSqSelector output with top 4 features selected > (, > IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but > it does not have the number of values specified.', > 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t > at > org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at >
[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216623#comment-16216623 ] Weichen Xu commented on SPARK-22277: [~cheburakshu] Your post code has some issue: {code} df = spark.createDataFrame([ (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"]) {code} Are these features categorical features ? If yes, then you cannot directly training them by DecisionTreeClassifier, because they lack of metadata(Binary / Nominal) and will be regarded as continuous features. If no, then you cannot use ChiSqSelector to select topFeatures, because ChiSqSelector is only available for categorical features. But your above code compare the two cases, they're contrary, so I am confused. > Chi Square selector garbling Vector content. > > > Key: SPARK-22277 > URL: https://issues.apache.org/jira/browse/SPARK-22277 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Cheburakshu > > There is a difference in behavior when Chisquare selector is used v direct > feature use in decision tree classifier. > In the below code, I have used chisquare selector as a thru' pass but the > decision tree classifier is unable to process it. But, it is able to process > when the features are used directly. > The example is pulled out directly from Apache spark python documentation. > Kindly help. > {code:python} > from pyspark.ml.feature import ChiSqSelector > from pyspark.ml.linalg import Vectors > import sys > df = spark.createDataFrame([ > (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), > (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), > (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", > "clicked"]) > # ChiSq selector will just be a pass-through. All four featuresin the i/p > will be in output also. > selector = ChiSqSelector(numTopFeatures=4, featuresCol="features", > outputCol="selectedFeatures", labelCol="clicked") > result = selector.fit(df).transform(df) > print("ChiSqSelector output with top %d features selected" % > selector.getNumTopFeatures()) > from pyspark.ml.classification import DecisionTreeClassifier > try: > # Fails > dt = > DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures") > model = dt.fit(result) > except: > print(sys.exc_info()) > #Works > dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features") > model = dt.fit(df) > > # Make predictions. Using same dataset, not splitting!! > predictions = model.transform(result) > # Select example rows to display. > predictions.select("prediction", "clicked", "features").show(5) > # Select (prediction, true label) and compute test error > evaluator = MulticlassClassificationEvaluator( > labelCol="clicked", predictionCol="prediction", metricName="accuracy") > accuracy = evaluator.evaluate(predictions) > print("Test Error = %g " % (1.0 - accuracy)) > {code} > Output: > ChiSqSelector output with top 4 features selected > (, > IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but > it does not have the number of values specified.', > 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t > at > org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t > at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at > org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t > at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at > org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at > sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t > at java.lang.reflect.Method.invoke(Method.java:498)\n\t at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at >
[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216611#comment-16216611 ] Apache Spark commented on SPARK-22284: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/19563 > Code of class > \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" > grows beyond 64 KB > -- > > Key: SPARK-22284 > URL: https://issues.apache.org/jira/browse/SPARK-22284 > Project: Spark > Issue Type: Bug > Components: Optimizer, PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Ben > Attachments: 64KB Error.log > > > I am using pySpark 2.1.0 in a production environment, and trying to join two > DataFrames, one of which is very large and has complex nested structures. > Basically, I load both DataFrames and cache them. > Then, in the large DataFrame, I extract 3 nested values and save them as > direct columns. > Finally, I join on these three columns with the smaller DataFrame. > This would be a short code for this: > {code} > dataFrame.read..cache() > dataFrameSmall.read...cache() > dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS > Value1','nested.Value2 AS Value2','nested.Value3 AS Value3']) > dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, > ['Value1','Value2',Value3']) > dataFrame.count() > {code} > And this is the error I get when it gets to the count(): > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in > stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 > (TID 11234, somehost.com, executor 10): > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\" > of class > \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" > grows beyond 64 KB > {code} > I have seen many tickets with similar issues here, but no proper solution. > Most of the fixes are until Spark 2.1.0 so I don't know if running it on > Spark 2.2.0 would fix it. In any case I cannot change the version of Spark > since it is in production. > I have also tried setting > {code:java} > spark.sql.codegen.wholeStage=false > {code} > but still the same error. > The job worked well up to now, also with large datasets, but apparently this > batch got larger, and that is the only thing that changed. Is there any > workaround for this? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org