[jira] [Assigned] (SPARK-18726) Filesystem unnecessarily scanned twice during creation of non-catalog table
[ https://issues.apache.org/jira/browse/SPARK-18726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-18726: --- Assignee: Song Jun > Filesystem unnecessarily scanned twice during creation of non-catalog table > --- > > Key: SPARK-18726 > URL: https://issues.apache.org/jira/browse/SPARK-18726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Song Jun > Fix For: 2.2.0 > > > It seems that for non-catalog tables (e.g. spark.read.parquet(...)), we scan > the filesystem twice, once for schema inference, and another to create a > FileIndex class for the relation. > It would be better to combine these scans somehow, since this is the most > costly step of creating a table. This is a follow-up ticket to > https://github.com/apache/spark/pull/16090. > cc [~cloud_fan] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18726) Filesystem unnecessarily scanned twice during creation of non-catalog table
[ https://issues.apache.org/jira/browse/SPARK-18726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18726. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17081 [https://github.com/apache/spark/pull/17081] > Filesystem unnecessarily scanned twice during creation of non-catalog table > --- > > Key: SPARK-18726 > URL: https://issues.apache.org/jira/browse/SPARK-18726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang > Fix For: 2.2.0 > > > It seems that for non-catalog tables (e.g. spark.read.parquet(...)), we scan > the filesystem twice, once for schema inference, and another to create a > FileIndex class for the relation. > It would be better to combine these scans somehow, since this is the most > costly step of creating a table. This is a follow-up ticket to > https://github.com/apache/spark/pull/16090. > cc [~cloud_fan] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
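A side note on the cost described above: until the two scans are combined, supplying the schema up front avoids the schema-inference pass entirely, leaving only the FileIndex listing. A minimal sketch, with a made-up schema and path:
{code}
import org.apache.spark.sql.types._

// Made-up schema and path, purely for illustration.
val knownSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", StringType)))

// With an explicit schema the reader skips the schema-inference pass over the files;
// only the listing that builds the relation's FileIndex remains.
val df = spark.read
  .schema(knownSchema)
  .parquet("/data/example_table")
{code}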
[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors
[ https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893877#comment-15893877 ] Paul Lysak commented on SPARK-19371: I'm observing similar behavior in Spark 2.1 - unfortunately, due to the complex workflow of our application I haven't yet been able to identify after exactly which operation all the partitions of the DataFrame end up on a single executor, so no matter how big the cluster is, only one executor picks up all the work. > Cannot spread cached partitions evenly across executors > --- > > Key: SPARK-19371 > URL: https://issues.apache.org/jira/browse/SPARK-19371 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Thunder Stumpges > > Before running an intensive iterative job (in this case a distributed topic > model training), we need to load a dataset and persist it across executors. > After loading from HDFS and persisting, the partitions are spread unevenly > across executors (based on the initial scheduling of the reads which are not > data-locality sensitive). The partition sizes are even, just not their > distribution over executors. We currently have no way to force the partitions > to spread evenly, and as the iterative algorithm begins, tasks are > distributed to executors based on this initial load, forcing some very > unbalanced work. > This has been mentioned a > [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059] > of > [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html] > in > [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html] > user/dev group threads. > None of the discussions I could find had solutions that worked for me. Here > are examples of things I have tried. All resulted in partitions in memory > that were NOT evenly distributed to executors, causing future tasks to be > imbalanced across executors as well. > *Reduce Locality* > {code}spark.shuffle.reduceLocality.enabled=false/true{code} > *"Legacy" memory mode* > {code}spark.memory.useLegacyMode = true/false{code} > *Basic load and repartition* > {code} > val numPartitions = 48*16 > val df = sqlContext.read. > parquet("/data/folder_to_load"). > repartition(numPartitions). > persist > df.count > {code} > *Load and repartition to 2x partitions, then shuffle repartition down to > desired partitions* > {code} > val numPartitions = 48*16 > val df2 = sqlContext.read. > parquet("/data/folder_to_load"). > repartition(numPartitions*2) > val df = df2.repartition(numPartitions). > persist > df.count > {code} > It would be great if, when persisting an RDD/DataFrame, we could request > that those partitions be stored evenly across executors in preparation for > future tasks. > I'm not sure if this is a more general issue (i.e. not just involving > persisting RDDs), but for the persisted in-memory case, it can make a HUGE > difference in the over-all running time of the remaining work. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19339) StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next on empty iterator
[ https://issues.apache.org/jira/browse/SPARK-19339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893858#comment-15893858 ] Nick Pentreath commented on SPARK-19339: This should be addressed by SPARK-19573 - empty (or all null) columns will return empty Array rather than throw exception. > StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next > on empty iterator > - > > Key: SPARK-19339 > URL: https://issues.apache.org/jira/browse/SPARK-19339 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.2, 2.1.0 >Reporter: Barry Becker >Priority: Minor > > This problem is easy to reproduce by running > StatFunctions.multipleApproxQuantiles on an empty dataset, but I think it can > occur in other cases, like if the column is all null or all one value. > I have unit tests that can hit it in several different cases. > The fix that I have introduced locally is to return > {code} > if (sampled.length == 0) 0 else sampled.last.value > {code} > instead of > {code} > sampled.last.value > {code} > at the end of QuantileSummaries.query. > Below is the exception: > {code} > next on empty iterator > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.last(TraversableLike.scala:459) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132) > at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207) > at > org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(SparkPercentileCalculator.scala:91) > at > org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(SparkPercentileCalculator.scala:91) > at > org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(SparkPercentileCalculator.scala:91) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1.apply(SparkPercentileCalculator.scala:91) > at > org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1.apply(SparkPercentileCalculator.scala:91) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.SparkPercentileCalculator.multipleApproxQuantiles(SparkPercentileCalculator.scala:91) > at > com.mineset.spark.statistics.model.ContinuousMinesetStats.quartiles$lzycompute(ContinuousMinesetStats.scala:274) > at > com.mineset.spark.statistics.model.ContinuousMinesetStats.quartiles(ContinuousMinesetStats.scala:272) > at > com.mineset.spark.statistics.model.MinesetStats.com$mineset$spark$statistics$model$MinesetStats$$serializeContinuousFeature$1(MinesetStats.scala:66) > at > com.mineset.spark.statistics.model.MinesetStats$$anonfun$calculateWithColumns$1.apply(MinesetStats.scala:118) > at > com.mineset.spark.statistics.model.MinesetStats$$anonfun$calculateWithColumns$1.apply(MinesetStats.scala:114) >
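A minimal repro of the failure mode described in this issue, using only the public DataFrame API (the column name is arbitrary):
{code}
// Empty DataFrame with a single double column; df.stat.approxQuantile goes through
// StatFunctions.multipleApproxQuantiles and QuantileSummaries.query internally.
val emptyDf = spark.range(0).selectExpr("cast(id as double) as x")

// Per the report above, affected versions throw NoSuchElementException here;
// with the SPARK-19573 change the expected result is an empty Array instead.
val quartiles = emptyDf.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.0)
{code}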
[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893821#comment-15893821 ] Nick Pentreath commented on SPARK-19714: If you feel that handling values outside the bucket ranges as "invalid" is reasonable - specifically, including them in the special "invalid" bucket - then we can discuss if and how that could be implemented. I agree it's quite a large departure, but we could support it with a further param value such as "keepAll" which keeps both {{NaN}} and out-of-range values in the special bucket. I don't see a compelling reason that this is a bug, so if you want to motivate for a change then propose an approach. I do think we should update the doc for {{handleInvalid}} - [~wojtek-szymanski] feel free to open a PR for that. > Bucketizer Bug Regarding Handling Unbucketed Inputs > --- > > Key: SPARK-19714 > URL: https://issues.apache.org/jira/browse/SPARK-19714 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Bill Chambers > > {code} > val contDF = spark.range(500).selectExpr("cast(id as double) as id") > import org.apache.spark.ml.feature.Bucketizer > val splits = Array(5.0, 10.0, 250.0, 500.0) > val bucketer = new Bucketizer() > .setSplits(splits) > .setInputCol("id") > .setHandleInvalid("skip") > bucketer.transform(contDF).show() > {code} > You would expect that this would handle the invalid buckets. However, it fails: > {code} > Caused by: org.apache.spark.SparkException: Feature value 0.0 out of > Bucketizer bounds [5.0, 500.0]. Check your features, or loosen the > lower/upper bound constraints. > {code} > It seems strange that handleInvalid doesn't actually handle invalid inputs. > Thoughts anyone? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators
[ https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893811#comment-15893811 ] Nick Pentreath commented on SPARK-19747: Also agree we should be able to extract out the penalty / regularization term. I know Seth's done that in the WIP PR for L2 - but L1 is interesting because currently it is more tightly baked into the Breeze optimizers... > Consolidate code in ML aggregators > -- > > Key: SPARK-19747 > URL: https://issues.apache.org/jira/browse/SPARK-19747 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Seth Hendrickson >Priority: Minor > > Many algorithms in Spark ML are posed as optimization of a differentiable > loss function over a parameter vector. We implement these by having a loss > function accumulate the gradient using an Aggregator class which has methods > that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm > that obeys this form implements a cost function class and an aggregator > class, which are completely separate from one another but share probably 80% > of the same code. > I think it is important to clean things like this up, and if we can do it > properly it will make the code much more maintainable, readable, and bug > free. It will also help reduce the overhead of future implementations. > The design is of course open for discussion, but I think we should aim to: > 1. Have all aggregators share parent classes, so that they only need to > implement the {{add}} function. This is really the only difference in the > current aggregators. > 2. Have a single, generic cost function that is parameterized by the > aggregator type. This reduces the many places we implement cost functions and > greatly reduces the amount of duplicated code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators
[ https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893810#comment-15893810 ] Nick Pentreath commented on SPARK-19747: [~yuhaoyan] for {{SGDClassifier}} it would be interesting to look at Vowpal Wabbit - style normalized adaptive gradient descent approach which does the normalization / standardization on the fly during training. > Consolidate code in ML aggregators > -- > > Key: SPARK-19747 > URL: https://issues.apache.org/jira/browse/SPARK-19747 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Seth Hendrickson >Priority: Minor > > Many algorithms in Spark ML are posed as optimization of a differentiable > loss function over a parameter vector. We implement these by having a loss > function accumulate the gradient using an Aggregator class which has methods > that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm > that obeys this form implements a cost function class and an aggregator > class, which are completely separate from one another but share probably 80% > of the same code. > I think it is important to clean things like this up, and if we can do it > properly it will make the code much more maintainable, readable, and bug > free. It will also help reduce the overhead of future implementations. > The design is of course open for discussion, but I think we should aim to: > 1. Have all aggregators share parent classes, so that they only need to > implement the {{add}} function. This is really the only difference in the > current aggregators. > 2. Have a single, generic cost function that is parameterized by the > aggregator type. This reduces the many places we implement cost functions and > greatly reduces the amount of duplicated code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
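For readers following along, a rough Scala sketch of the shape being proposed in the description's two points; the names and signatures below are illustrative only, not the actual Spark ML classes:
{code}
import org.apache.spark.ml.linalg.Vector

// Hypothetical per-instance record, mirroring the shape of ml.feature.Instance.
case class Instance(label: Double, weight: Double, features: Vector)

// 1. Shared base class: concrete aggregators only implement `add` (the seqOp);
//    `merge` (the combOp) and the loss accessor live here once.
abstract class DifferentiableLossAggregator[Agg <: DifferentiableLossAggregator[Agg]] {
  self: Agg =>

  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected val gradientSumArray: Array[Double]

  def add(instance: Instance): Agg

  def merge(other: Agg): Agg = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    var i = 0
    while (i < gradientSumArray.length) {   // both arrays have the same length
      gradientSumArray(i) += other.gradientSumArray(i)
      i += 1
    }
    this
  }

  def loss: Double = lossSum / weightSum
}

// 2. A single generic cost function, parameterized by the aggregator type; the
//    per-algorithm treeAggregate(seqOp = _.add(_), combOp = _.merge(_)) boilerplate
//    and the gradient extraction would live here instead of in every algorithm.
class RDDLossFunction[Agg <: DifferentiableLossAggregator[Agg]](
    newAggregator: Vector => Agg)
{code}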
[jira] [Closed] (SPARK-18478) Support codegen for Hive UDFs
[ https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro closed SPARK-18478. Resolution: Won't Fix > Support codegen for Hive UDFs > - > > Key: SPARK-18478 > URL: https://issues.apache.org/jira/browse/SPARK-18478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently does not codegen Hive UDFs in hiveUDFs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19779) structured streaming exist needless tmp file
[ https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-19779: Assignee: Feng Gui > structured streaming exist needless tmp file > - > > Key: SPARK-19779 > URL: https://issues.apache.org/jira/browse/SPARK-19779 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.3, 2.1.1, 2.2.0 >Reporter: Feng Gui >Assignee: Feng Gui >Priority: Minor > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > The PR (https://github.com/apache/spark/pull/17012) fixes restarting a > Structured Streaming application that uses HDFS as the file system, but a > problem remains: a tmp file for the delta file is still left behind in HDFS. > Structured Streaming never deletes the tmp file generated when the streaming > job is restarted, so we need to delete the tmp file after restarting the > streaming job. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19779) structured streaming exist needless tmp file
[ https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19779. -- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 2.0.3 > structured streaming exist needless tmp file > - > > Key: SPARK-19779 > URL: https://issues.apache.org/jira/browse/SPARK-19779 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.3, 2.1.1, 2.2.0 >Reporter: Feng Gui >Priority: Minor > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > The PR (https://github.com/apache/spark/pull/17012) fixes restarting a > Structured Streaming application that uses HDFS as the file system, but a > problem remains: a tmp file for the delta file is still left behind in HDFS. > Structured Streaming never deletes the tmp file generated when the streaming > job is restarted, so we need to delete the tmp file after restarting the > streaming job. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19805) Log the row type when query result dose not match
[ https://issues.apache.org/jira/browse/SPARK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu updated SPARK-19805: -- Summary: Log the row type when query result dose not match (was: Log the row type when query result dose match) > Log the row type when query result dose not match > - > > Key: SPARK-19805 > URL: https://issues.apache.org/jira/browse/SPARK-19805 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893736#comment-15893736 ] Shivaram Venkataraman commented on SPARK-19796: --- I think (a) is worth exploring in a new JIRA -- We should try to avoid sending data that we dont need on the executors during task execution. > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Priority: Blocker > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
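For context on the 64Kb limit mentioned above: {{java.io.DataOutputStream.writeUTF}} stores the encoded length in an unsigned two-byte prefix. A small sketch of the failure in isolation, reusing the statement from the repro:
{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}

// writeUTF prefixes the payload with a two-byte length, so any string whose
// modified-UTF-8 encoding exceeds 65535 bytes is rejected.
val out = new DataOutputStream(new ByteArrayOutputStream())
val r = 1 to 128000
val longSql = "insert into table test values (" + r.mkString("),(") + ")"
out.writeUTF(longSql)  // throws java.io.UTFDataFormatException: encoded string too long
{code}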
[jira] [Assigned] (SPARK-19806) PySpark GLR supports tweedie distribution
[ https://issues.apache.org/jira/browse/SPARK-19806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-19806: --- Assignee: Yanbo Liang > PySpark GLR supports tweedie distribution > - > > Key: SPARK-19806 > URL: https://issues.apache.org/jira/browse/SPARK-19806 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > > PySpark {{GeneralizedLinearRegression}} supports tweedie distribution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19806) PySpark GLR supports tweedie distribution
[ https://issues.apache.org/jira/browse/SPARK-19806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19806: Assignee: (was: Apache Spark) > PySpark GLR supports tweedie distribution > - > > Key: SPARK-19806 > URL: https://issues.apache.org/jira/browse/SPARK-19806 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Yanbo Liang >Priority: Minor > > PySpark {{GeneralizedLinearRegression}} supports tweedie distribution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19806) PySpark GLR supports tweedie distribution
[ https://issues.apache.org/jira/browse/SPARK-19806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19806: Assignee: Apache Spark > PySpark GLR supports tweedie distribution > - > > Key: SPARK-19806 > URL: https://issues.apache.org/jira/browse/SPARK-19806 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > PySpark {{GeneralizedLinearRegression}} supports tweedie distribution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19806) PySpark GLR supports tweedie distribution
[ https://issues.apache.org/jira/browse/SPARK-19806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893733#comment-15893733 ] Apache Spark commented on SPARK-19806: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/17146 > PySpark GLR supports tweedie distribution > - > > Key: SPARK-19806 > URL: https://issues.apache.org/jira/browse/SPARK-19806 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Yanbo Liang >Priority: Minor > > PySpark {{GeneralizedLinearRegression}} supports tweedie distribution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19806) PySpark GLR supports tweedie distribution
Yanbo Liang created SPARK-19806: --- Summary: PySpark GLR supports tweedie distribution Key: SPARK-19806 URL: https://issues.apache.org/jira/browse/SPARK-19806 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Yanbo Liang Priority: Minor PySpark {{GeneralizedLinearRegression}} supports tweedie distribution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe
[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893713#comment-15893713 ] Hyukjin Kwon commented on SPARK-15474: -- Let me leave some pointer - https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java#L96-L106 but it seems now it does not write the empty one in the master -https://github.com/apache/hive/blob/4a42bec6ba4cb8257dec517bc7c45b6a8f5a9e67/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java#L116 > ORC data source fails to write and read back empty dataframe > - > > Key: SPARK-15474 > URL: https://issues.apache.org/jira/browse/SPARK-15474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently ORC data source fails to write and read empty data. > The code below: > {code} > val emptyDf = spark.range(10).limit(0) > emptyDf.write > .format("orc") > .save(path.getCanonicalPath) > val copyEmptyDf = spark.read > .format("orc") > .load(path.getCanonicalPath) > copyEmptyDf.show() > {code} > throws an exception below: > {code} > Unable to infer schema for ORC at > /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. > It must be specified manually; > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at > /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. > It must be specified manually; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140) > at > org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892) > at > org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884) > at > org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114) > {code} > Note that this is a different case with the data below > {code} > val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema) > {code} > In this case, any writer is not initialised and created. (no calls of > {{WriterContainer.writeRows()}}. > For Parquet and JSON, it works but ORC does not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10294) When Parquet writer's close method throws an exception, we will call close again and trigger a NPE
[ https://issues.apache.org/jira/browse/SPARK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893709#comment-15893709 ] Hyukjin Kwon commented on SPARK-10294: -- Maybe, we could resolve this as a duplicate of SPARK-13127 if it is true because I see SPARK-18140 is also resolved as a duplicate. > When Parquet writer's close method throws an exception, we will call close > again and trigger a NPE > -- > > Key: SPARK-10294 > URL: https://issues.apache.org/jira/browse/SPARK-10294 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai > Attachments: screenshot-1.png > > > When a task saves a large parquet file (larger than the S3 file size limit) > to S3, looks like we still call parquet writer's close twice and triggers NPE > reported in SPARK-7837. Eventually, job failed and I got NPE as the > exception. Actually, the real problem was that the file was too large for S3. > {code} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1818) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1831) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1908) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > com.databricks.spark.sql.perf.tpcds.Tables$Table.genData(Tables.scala:147) > at > com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:192) > at > com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Ta
[jira] [Commented] (SPARK-10294) When Parquet writer's close method throws an exception, we will call close again and trigger a NPE
[ https://issues.apache.org/jira/browse/SPARK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893706#comment-15893706 ] Hyukjin Kwon commented on SPARK-10294: -- Hi [~yhuai], it seems this issue refers PARQUET-544 which is fixed in Parquet 1.9.0. > When Parquet writer's close method throws an exception, we will call close > again and trigger a NPE > -- > > Key: SPARK-10294 > URL: https://issues.apache.org/jira/browse/SPARK-10294 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai > Attachments: screenshot-1.png > > > When a task saves a large parquet file (larger than the S3 file size limit) > to S3, looks like we still call parquet writer's close twice and triggers NPE > reported in SPARK-7837. Eventually, job failed and I got NPE as the > exception. Actually, the real problem was that the file was too large for S3. > {code} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1818) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1831) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1908) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > com.databricks.spark.sql.perf.tpcds.Tables$Table.genData(Tables.scala:147) > at > com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:192) > at > com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:190) > at scala.collection.imm
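This is not the actual Parquet/Spark fix (that is tracked in the issues referenced above), merely a sketch of the general guard that makes a second {{close()}} call a no-op instead of a NullPointerException:
{code}
import java.io.Closeable

// Illustrative wrapper: the first close() releases the resource and clears the
// reference; any later close() becomes a no-op instead of dereferencing null.
class SafeWriter(private var underlying: Closeable) extends Closeable {
  override def close(): Unit = {
    if (underlying != null) {
      try {
        underlying.close()
      } finally {
        underlying = null  // even if close() threw, a retry must not touch it again
      }
    }
  }
}
{code}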
[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe
[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893703#comment-15893703 ] Nicholas Chammas commented on SPARK-15474: -- cc [~owen.omalley] > ORC data source fails to write and read back empty dataframe > - > > Key: SPARK-15474 > URL: https://issues.apache.org/jira/browse/SPARK-15474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently ORC data source fails to write and read empty data. > The code below: > {code} > val emptyDf = spark.range(10).limit(0) > emptyDf.write > .format("orc") > .save(path.getCanonicalPath) > val copyEmptyDf = spark.read > .format("orc") > .load(path.getCanonicalPath) > copyEmptyDf.show() > {code} > throws an exception below: > {code} > Unable to infer schema for ORC at > /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. > It must be specified manually; > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at > /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. > It must be specified manually; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140) > at > org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892) > at > org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884) > at > org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114) > {code} > Note that this is a different case with the data below > {code} > val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema) > {code} > In this case, any writer is not initialised and created. (no calls of > {{WriterContainer.writeRows()}}. > For Parquet and JSON, it works but ORC does not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19805) Log the row type when query result dose match
[ https://issues.apache.org/jira/browse/SPARK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19805: Assignee: Apache Spark > Log the row type when query result dose match > - > > Key: SPARK-19805 > URL: https://issues.apache.org/jira/browse/SPARK-19805 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19805) Log the row type when query result dose match
[ https://issues.apache.org/jira/browse/SPARK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893693#comment-15893693 ] Apache Spark commented on SPARK-19805: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/17145 > Log the row type when query result dose match > - > > Key: SPARK-19805 > URL: https://issues.apache.org/jira/browse/SPARK-19805 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19805) Log the row type when query result dose match
[ https://issues.apache.org/jira/browse/SPARK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19805: Assignee: (was: Apache Spark) > Log the row type when query result dose match > - > > Key: SPARK-19805 > URL: https://issues.apache.org/jira/browse/SPARK-19805 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19805) Log the row type when query result dose match
Genmao Yu created SPARK-19805: - Summary: Log the row type when query result dose match Key: SPARK-19805 URL: https://issues.apache.org/jira/browse/SPARK-19805 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.1.0, 2.0.2 Reporter: Genmao Yu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe
[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893691#comment-15893691 ] Hyukjin Kwon commented on SPARK-15474: -- It seems an issue related with Hive's {{OrcOutputFormat}}. It seems the record writer does not write the footer if any row is not written but it writes an empty one when it closes. Currently, we are lazily creating the {{RecordWriter}} in Spark side in {{OrcFileFormat}} so currently the empty one is not being created if any row is not written. > ORC data source fails to write and read back empty dataframe > - > > Key: SPARK-15474 > URL: https://issues.apache.org/jira/browse/SPARK-15474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently ORC data source fails to write and read empty data. > The code below: > {code} > val emptyDf = spark.range(10).limit(0) > emptyDf.write > .format("orc") > .save(path.getCanonicalPath) > val copyEmptyDf = spark.read > .format("orc") > .load(path.getCanonicalPath) > copyEmptyDf.show() > {code} > throws an exception below: > {code} > Unable to infer schema for ORC at > /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. > It must be specified manually; > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at > /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. > It must be specified manually; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140) > at > org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892) > at > org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884) > at > org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114) > {code} > Note that this is a different case with the data below > {code} > val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema) > {code} > In this case, any writer is not initialised and created. (no calls of > {{WriterContainer.writeRows()}}. > For Parquet and JSON, it works but ORC does not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
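As a stopgap, the error message's own suggestion works: pass the schema explicitly when reading the (possibly empty) ORC output back. An untested sketch reusing {{path}} and {{emptyDf}} from the snippet in the description:
{code}
val emptyDf = spark.range(10).limit(0)
emptyDf.write
  .format("orc")
  .save(path.getCanonicalPath)

// Supplying the schema avoids the failing inference step when no ORC footer was
// written; the round-tripped result is simply an empty DataFrame with that schema.
val copyEmptyDf = spark.read
  .schema(emptyDf.schema)
  .format("orc")
  .load(path.getCanonicalPath)
copyEmptyDf.show()
{code}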
[jira] [Resolved] (SPARK-19745) SVCAggregator serializes coefficients
[ https://issues.apache.org/jira/browse/SPARK-19745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-19745. - Resolution: Fixed Fix Version/s: 2.2.0 > SVCAggregator serializes coefficients > - > > Key: SPARK-19745 > URL: https://issues.apache.org/jira/browse/SPARK-19745 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > Fix For: 2.2.0 > > > Similar to [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], > the SVC aggregator captures the coefficients in the class closure, and > therefore ships them around during optimization. We can prevent this with a > bit of reorganization of the aggregator class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893640#comment-15893640 ] Apache Spark commented on SPARK-19803: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/17144 > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sital Kedia > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19803: Assignee: (was: Apache Spark) > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sital Kedia > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19803: Assignee: Apache Spark > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sital Kedia >Assignee: Apache Spark > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data
[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893596#comment-15893596 ] zhengruifeng commented on SPARK-18608: -- [~mlnick] [~yuhaoyan] [~srowen] I think using {{train(dataset: Dataset[_], handlePersistence: Boolean)}} instead of {{train(dataset: Dataset[_])}} may result in extra problems for external implementers, because existing external algorithms overriding {{Predictor.train}} will no longer work. I think we can do it another way: {code} abstract class Predictor[ FeaturesType, Learner <: Predictor[FeaturesType, Learner, M], M <: PredictionModel[FeaturesType, M]] extends Estimator[M] with PredictorParams { protected var storageLevel = StorageLevel.NONE override def fit(dataset: Dataset[_]): M = { storageLevel = dataset.storageLevel ... } protected def train(dataset: Dataset[_]): M } {code} so in algorithm implementations we can use the original storageLevel of the input dataset. > Spark ML algorithms that check RDD cache level for internal caching > double-cache data > - > > Key: SPARK-18608 > URL: https://issues.apache.org/jira/browse/SPARK-18608 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Nick Pentreath > > Some algorithms in Spark ML (e.g. {{LogisticRegression}}, > {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence > internally. They check whether the input dataset is cached, and if not they > cache it for performance. > However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. > This will actually always be true, since even if the dataset itself is > cached, the RDD returned by {{dataset.rdd}} will not be cached. > Hence if the input dataset is cached, the data will end up being cached > twice, which is wasteful. > To see this: > {code} > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val df = spark.range(10).toDF("num") > df: org.apache.spark.sql.DataFrame = [num: bigint] > scala> df.storageLevel == StorageLevel.NONE > res0: Boolean = true > scala> df.persist > res1: df.type = [num: bigint] > scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK > res2: Boolean = true > scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK > res3: Boolean = false > scala> df.rdd.getStorageLevel == StorageLevel.NONE > res4: Boolean = true > {code} > Before SPARK-16063, there was no way to check the storage level of the input > {{DataSet}}, but now we can, so the checks should be migrated to use > {{dataset.storageLevel}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
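A condensed sketch of the handlePersistence pattern under discussion, based on the {{Dataset.storageLevel}} API from SPARK-16063 rather than {{dataset.rdd.getStorageLevel}}; the helper name and shape here are illustrative, not an actual Spark API:
{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Only cache inside the algorithm when the caller has not already persisted the
// input, and remember that decision so we only unpersist what we created ourselves.
def withHandlePersistence[T](dataset: Dataset[_])(train: Dataset[_] => T): T = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    train(dataset)
  } finally {
    if (handlePersistence) dataset.unpersist()
  }
}
{code}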
[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893584#comment-15893584 ] Mridul Muralidharan commented on SPARK-19796: - I would prefer not to do (b) - if we are worried that users are depending on a private property, sending a truncated version of it would only aggravate the problem! I would rather fail fast with a missing value. Having said that, while we should limit our internal usage of properties, this mechanism is also used to propagate user-specified key-value pairs, so adding limits or log messages might not be optimal. Worst case, if we start detecting that the properties Map is growing really large, we could broadcast it (ugh?). > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Priority: Blocker > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19802) Remote History Server
[ https://issues.apache.org/jira/browse/SPARK-19802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893580#comment-15893580 ] Saisai Shao commented on SPARK-19802: - Spark's {{ApplicationHistoryProvider}} is pluggable; users can implement their own provider and plug it into Spark's history server. So you could implement the {{HistoryProvider}} you want outside of Spark. From your description, this is more like a Hadoop ATS (Hadoop Application Timeline Server). We have an implementation of a Timeline-based history provider for Spark's history server. The main features are what you mentioned: query through TCP, get the events and display them on the UI. > Remote History Server > - > > Key: SPARK-19802 > URL: https://issues.apache.org/jira/browse/SPARK-19802 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Ben Barnard > > Currently the history server expects to find history in a filesystem > somewhere. It would be nice to have a history server that listens for > application events on a TCP port, and have an EventLoggingListener that sends > events to the listening history server instead of writing to a file. This > would allow the history server to show up-to-date history for past and > running jobs in a cluster environment that lacks a shared filesystem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-19804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893579#comment-15893579 ] Marcelo Vanzin commented on SPARK-19804: For posterity, the error you get looks like this: {noformat} java.lang.ExceptionInInitializerError: null at java.lang.Class.getConstructor0(Class.java:2892) at java.lang.Class.getDeclaredConstructor(Class.java:2058) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1541) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:67) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:82) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3220) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3239) at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3464) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:226) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:210) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:333) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:294) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:269) at org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:272) {noformat} Which is rather cryptic; it's caused by one of the classes in the constructor being loaded by two different class loaders, so {{getDeclaredConstructor}} fails to find the right constructor and returns null. > HiveClientImpl does not work with Hive 2.2.0 metastore > -- > > Key: SPARK-19804 > URL: https://issues.apache.org/jira/browse/SPARK-19804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Priority: Minor > > I know that Spark currently does not officially support Hive 2.2 (perhaps > because it hasn't been released yet); but we have some 2.2 patches in CDH and > the current code in the isolated client fails. The most probably culprit are > changes added in HIVE-13149. > The fix is simple, and here's the patch we applied in CDH: > https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0 > Fixing that doesn't affect any existing Hive version support, but will make > it easier to support 2.2 when it's out. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
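A small Scala sketch of the class-loader effect described above; the jar path and class name are placeholders, and the point is only that a class defined by two different loaders yields two distinct {{Class}} objects, so reflective constructor lookups that mix them cannot match:
{code}
import java.net.{URL, URLClassLoader}

// Hypothetical jar containing com.example.Widget.
val jar = new URL("file:/tmp/example.jar")

// Two isolated loaders with no shared parent.
val loaderA = new URLClassLoader(Array(jar), null)
val loaderB = new URLClassLoader(Array(jar), null)

val widgetA = loaderA.loadClass("com.example.Widget")
val widgetB = loaderB.loadClass("com.example.Widget")

// Class identity includes the defining loader, so these are different classes,
// and a getDeclaredConstructor call whose parameter types come from the "other"
// loader finds no matching constructor.
println(widgetA == widgetB)  // false
{code}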
[jira] [Commented] (SPARK-14698) CREATE FUNCTION could not add function to hive metastore
[ https://issues.apache.org/jira/browse/SPARK-14698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893565#comment-15893565 ] poseidon commented on SPARK-14698: -- [~azeroth2b] I think this was done on purpose in Spark 1.6.1. If this bug were fixed, the function could be stored in the DB, but it could not be loaded again after a thrift-server restart. I can upload the patch anyway. In spark-1.6.1\sql\hive\src\main\scala\org\apache\spark\sql\hive\HiveContext.scala: private def functionOrMacroDDLPattern(command: String) = Pattern.compile( ".*(create|drop)\\s+(temporary\\s+)(function|macro).+", Pattern.DOTALL).matcher(command) This is the correct regular expression to make the CREATE FUNCTION command get stored in the DB. > CREATE FUNCTION could not add function to hive metastore > > > Key: SPARK-14698 > URL: https://issues.apache.org/jira/browse/SPARK-14698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: spark1.6.1 >Reporter: poseidon > Labels: easyfix > > Built Spark 1.6.1 and ran it with Hive 1.2.1, with MySQL configured as the > metastore server. > Started a thrift server, then in beeline tried to CREATE FUNCTION as a Hive SQL > UDF. > Found that the FUNCTION cannot be added to the MySQL metastore, but the function > itself works fine. > If you try to add it again, the thrift server throws an AlreadyExists exception. > [SPARK-10151][SQL] Support invocation of hive macro > added an if condition in runSqlHive, which executes create function in > hiveexec, and caused this problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
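A quick self-contained check of what the quoted pattern accepts: with {{temporary\s+}} mandatory rather than optional, only temporary function/macro DDL matches, so (per the comment) a plain CREATE FUNCTION would no longer be intercepted:
{code}
import java.util.regex.Pattern

// The pattern quoted in the comment above: "temporary" is required, not optional.
val p = Pattern.compile(".*(create|drop)\\s+(temporary\\s+)(function|macro).+", Pattern.DOTALL)

println(p.matcher("create temporary function f as 'com.example.UDF'").matches())  // true
println(p.matcher("create function f as 'com.example.UDF'").matches())            // false
{code}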
[jira] [Closed] (SPARK-19349) Check resource ready to avoid multiple receivers to be scheduled on the same node.
[ https://issues.apache.org/jira/browse/SPARK-19349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu closed SPARK-19349. - Resolution: Won't Fix > Check resource ready to avoid multiple receivers to be scheduled on the same > node. > -- > > Key: SPARK-19349 > URL: https://issues.apache.org/jira/browse/SPARK-19349 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu > > Currently, we can only ensure registered resource satisfy the > "spark.scheduler.minRegisteredResourcesRatio". But if > "spark.scheduler.minRegisteredResourcesRatio" is set too small, receivers may > still be scheduled to few nodes. In fact, we can give once more chance to > wait for sufficient resource to schedule receiver evenly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19750) Spark UI http -> https redirect error
[ https://issues.apache.org/jira/browse/SPARK-19750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19750. Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.1.1 2.0.3 > Spark UI http -> https redirect error > - > > Key: SPARK-19750 > URL: https://issues.apache.org/jira/browse/SPARK-19750 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.2, 2.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 2.0.3, 2.1.1 > > > Spark's http redirect uses port 0 as a secure port to do redirect if http > port is not set, this will introduce {{ java.net.NoRouteToHostException: > Can't assign requested address }}, so here fixed to use bound port for > redirect. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893508#comment-15893508 ] Yun Ni commented on SPARK-19771: [~merlin] What you are suggesting is to hash each AND hash vector into a single integer, which I don't think makes sense. It does little to improve running time, since Spark SQL does a hash join and the number of vector comparisons is already close to minimal. It reduces the memory cost of each transformed row from O(NumHashFunctions*NumHashTables) to O(NumHashTables), but at the cost of increasing the false positive rate, especially when NumHashFunctions is large. From a user experience perspective, hiding the actual hash values from users is bad practice because users need to run their own algorithms based on the hash values. Besides that, we expect users to increase the number of hash functions when they want to lower the false positive rate; hashing the vector would increase the false positive rate again, which is not what they would expect. > Support OR-AND amplification in Locality Sensitive Hashing (LSH) > > > Key: SPARK-19771 > URL: https://issues.apache.org/jira/browse/SPARK-19771 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Yun Ni > > The current LSH implementation only supports AND-OR amplification. We need to > discuss the following questions before we goes to implementations: > (1) Whether we should support OR-AND amplification > (2) What API changes we need for OR-AND amplification > (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
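For readers following the amplification discussion, a minimal Scala sketch of the candidate-pair test under the two schemes. It assumes each point has already been transformed into NumHashTables hash vectors of length NumHashFunctions; the names and representation are illustrative, not the MLlib API:
{code}
// One Array[Int] of length numHashFunctions per hash table.
type Hashes = Array[Array[Int]]

// AND-OR (what the current implementation supports): every hash function must
// agree within at least one table.
def andOrCandidate(a: Hashes, b: Hashes): Boolean =
  a.zip(b).exists { case (ha, hb) => ha.sameElements(hb) }

// OR-AND (what this ticket proposes): at least one hash function must agree
// within every table.
def orAndCandidate(a: Hashes, b: Hashes): Boolean =
  a.zip(b).forall { case (ha, hb) => ha.zip(hb).exists { case (x, y) => x == y } }
{code}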
[jira] [Resolved] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling
[ https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19276. Resolution: Fixed Assignee: Imran Rashid Fix Version/s: 2.2.0 > FetchFailures can be hidden by user (or sql) exception handling > --- > > Key: SPARK-19276 > URL: https://issues.apache.org/jira/browse/SPARK-19276 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Critical > Fix For: 2.2.0 > > > The scheduler handles node failures by looking for a special > {{FetchFailedException}} thrown by the shuffle block fetcher. This is > handled in {{Executor}} and then passed as a special msg back to the driver: > https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/core/src/main/scala/org/apache/spark/executor/Executor.scala#L403 > However, user code exists in between the shuffle block fetcher and that catch > block -- it could intercept the exception, wrap it with something else, and > throw a different exception. If that happens, spark treats it as an ordinary > task failure, and retries the task, rather than regenerating the missing > shuffle data. The task eventually is retried 4 times, its doomed to fail > each time, and the job is failed. > You might think that no user code should do that -- but even sparksql does it: > https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L214 > Here's an example stack trace. This is from Spark 1.6, so the sql code is > not the same, but the problem is still there: > {noformat} > 17/01/13 19:18:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage > 1983.0 (TID 304851, xxx): org.apache.spark.SparkException: Task failed while > writing rows. > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect > to xxx/yyy:zzz > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) > ... > 17/01/13 19:19:29 ERROR scheduler.TaskSetManager: Task 0 in stage 1983.0 > failed 4 times; aborting job > {noformat} > I think the right fix here is to also set a fetch failure status in the > {{TaskContextImpl}}, so the executor can check that instead of just one > exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
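A minimal Scala sketch of the failure mode this issue describes; the method and names are illustrative, and the pattern to notice is the broad catch-and-wrap around code that consumes a shuffle iterator:
{code}
// `rows` stands for an iterator that reads shuffle blocks and can therefore throw
// org.apache.spark.shuffle.FetchFailedException mid-iteration.
def writeRows(rows: Iterator[String]): Unit = {
  try {
    rows.foreach(row => ())  // consume the shuffle output, write it somewhere
  } catch {
    case t: Throwable =>
      // Wrapping hides the FetchFailedException from the executor's special-case
      // handling, so the scheduler retries the doomed task instead of
      // regenerating the lost shuffle data.
      throw new RuntimeException("Task failed while writing rows.", t)
  }
}
{code}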
[jira] [Created] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore
Marcelo Vanzin created SPARK-19804: -- Summary: HiveClientImpl does not work with Hive 2.2.0 metastore Key: SPARK-19804 URL: https://issues.apache.org/jira/browse/SPARK-19804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Marcelo Vanzin Priority: Minor I know that Spark currently does not officially support Hive 2.2 (perhaps because it hasn't been released yet); but we have some 2.2 patches in CDH and the current code in the isolated client fails. The most probably culprit are changes added in HIVE-13149. The fix is simple, and here's the patch we applied in CDH: https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0 Fixing that doesn't affect any existing Hive version support, but will make it easier to support 2.2 when it's out. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893376#comment-15893376 ] Kay Ousterhout commented on SPARK-19796: Do you think we should (separately) fix the underlying problem? Specifically, we could: (a) not send the SPARK_JOB_DESCRIPTION property to the workers, since it's only used on the master for the UI (and while users *could* access it, the variable name SPARK_JOB_DESCRIPTION is spark-private, which suggests that it shouldn't be used by users). Perhaps this is too risky because users could be using it? (b) Truncate SPARK_JOB_DESCRIPTION to something reasonable (100 characters?) before sending it to the workers. This is more backwards compatible if users are actually reading the property, but maybe a useless intermediate approach? (c) (Possibly in addition to one of the above) Log a warning if any of the properties is longer than 100 characters (or some threshold). Thoughts? I can file a JIRA if you think any of these is worthwhile. > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Priority: Blocker > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
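A rough Scala sketch of what options (b) and (c) above could look like; the threshold and log wording are assumptions, not an actual patch:
{code}
val maxLen = 100  // threshold is an assumption

// Option (c): warn about unexpectedly large properties; option (b): additionally
// truncate the value before it is shipped to executors with every task.
def capProperties(props: Map[String, String]): Map[String, String] =
  props.map { case (k, v) =>
    if (v.length > maxLen) {
      Console.err.println(s"WARN: property $k is ${v.length} chars long; truncating to $maxLen")
      k -> v.take(maxLen)
    } else {
      k -> v
    }
  }
{code}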
[jira] [Assigned] (SPARK-19631) OutputCommitCoordinator should not allow commits for already failed tasks
[ https://issues.apache.org/jira/browse/SPARK-19631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout reassigned SPARK-19631: -- Assignee: Patrick Woody > OutputCommitCoordinator should not allow commits for already failed tasks > - > > Key: SPARK-19631 > URL: https://issues.apache.org/jira/browse/SPARK-19631 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Patrick Woody >Assignee: Patrick Woody > Fix For: 2.2.0 > > > This is similar to SPARK-6614, but there a race condition where a task may > fail (e.g. Executor heartbeat timeout) and still manage to go through the > commit protocol successfully. After this any retries of the task will fail > indefinitely because of TaskCommitDenied. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19631) OutputCommitCoordinator should not allow commits for already failed tasks
[ https://issues.apache.org/jira/browse/SPARK-19631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19631. Resolution: Fixed Fix Version/s: 2.2.0 > OutputCommitCoordinator should not allow commits for already failed tasks > - > > Key: SPARK-19631 > URL: https://issues.apache.org/jira/browse/SPARK-19631 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Patrick Woody > Fix For: 2.2.0 > > > This is similar to SPARK-6614, but there a race condition where a task may > fail (e.g. Executor heartbeat timeout) and still manage to go through the > commit protocol successfully. After this any retries of the task will fail > indefinitely because of TaskCommitDenied. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893222#comment-15893222 ] Andrew Ash commented on SPARK-18113: We discovered another bug related to committing that causes task deadloop and have work being done in SPARK-19631 to fix it. > Sending AskPermissionToCommitOutput failed, driver enter into task deadloop > --- > > Key: SPARK-18113 > URL: https://issues.apache.org/jira/browse/SPARK-18113 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 > Environment: # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: xuqing >Assignee: jin xing > Fix For: 2.2.0 > > > Executor sends *AskPermissionToCommitOutput* to driver failed, and retry > another sending. Driver receives 2 AskPermissionToCommitOutput messages and > handles them. But executor ignores the first response(true) and receives the > second response(false). The TaskAttemptNumber for this partition in > authorizedCommittersByStage is locked forever. Driver enters into infinite > loop. > h4. Driver Log: > {noformat} > 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID > 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 0 > ... > 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 0 > ... > 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID > 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 1 > 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 1 > ... > 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 > (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > {noformat} > h4. Executor Log: > {noformat} > ... > 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) > ... > 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = > AskPermissionToCommitOutput(2,24,0)] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) > at > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$Def
[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893117#comment-15893117 ] Mingjie Tang commented on SPARK-19771: -- (1) Because you need to explode each tuple. In the example mentioned above, for one input tuple you have to build 3 rows, and each hash value contains a vector whose length is the number of hash functions; thus, for one tuple, the memory overhead is NumHashFunctions*NumHashTables=15. So if the number of input tuples is N, the overhead is NumHashFunctions*NumHashTables*N. (2) Yes, the hash value can be anything, depending on your input bucket width W. In fact, it should be very large to reduce collisions. (3) I am not sure hashCode would work, because we need to use this function for multi-probe searching. > Support OR-AND amplification in Locality Sensitive Hashing (LSH) > > > Key: SPARK-19771 > URL: https://issues.apache.org/jira/browse/SPARK-19771 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Yun Ni > > The current LSH implementation only supports AND-OR amplification. We need to > discuss the following questions before we goes to implementations: > (1) Whether we should support OR-AND amplification > (2) What API changes we need for OR-AND amplification > (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893097#comment-15893097 ] Yun Ni edited comment on SPARK-19771 at 3/2/17 9:55 PM: [~merlin] (1) The computation cost is NumHashFunctions because we go through each index only once. I don't know what's N in the memory overhead? (2) The hash values are not necessarily 0, 1, -1. (3) If we really want a hash function of Vector, why not use Vector.hashCode? was (Author: yunn): [~merlin] (1) The computation cost is NumHashFunctions because we go through each index only once. I don't know what's N in the memory overhead? (2) The hash values are not necessarily {0, 1, -1}. (3) If we really want a hash function of Vector, why not use Vector.hashCode? > Support OR-AND amplification in Locality Sensitive Hashing (LSH) > > > Key: SPARK-19771 > URL: https://issues.apache.org/jira/browse/SPARK-19771 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Yun Ni > > The current LSH implementation only supports AND-OR amplification. We need to > discuss the following questions before we goes to implementations: > (1) Whether we should support OR-AND amplification > (2) What API changes we need for OR-AND amplification > (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893097#comment-15893097 ] Yun Ni commented on SPARK-19771: [~merlin] (1) The computation cost is NumHashFunctions because we go through each index only once. I don't know what's N in the memory overhead? (2) The hash values are not necessarily {0, 1, -1}. (3) If we really want a hash function of Vector, why not use Vector.hashCode? > Support OR-AND amplification in Locality Sensitive Hashing (LSH) > > > Key: SPARK-19771 > URL: https://issues.apache.org/jira/browse/SPARK-19771 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Yun Ni > > The current LSH implementation only supports AND-OR amplification. We need to > discuss the following questions before we goes to implementations: > (1) Whether we should support OR-AND amplification > (2) What API changes we need for OR-AND amplification > (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893087#comment-15893087 ] Mingjie Tang commented on SPARK-18454: -- [~yunn] the current multi-probe NNS can be improved without building an index. > Changes to improve Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) What are the issues and how we should change the current Nearest Neighbor > implementation -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
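A sketch of the agreed approxQuantile idea from the description above; {{df}} is assumed to already have a {{hashDistance}} column for the query point, and the quantile and relative error values are illustrative:
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Pick the k-NN distance cutoff from an approximate quantile instead of fully
// sorting hashDistance over the whole dataset.
def thresholdByQuantile(df: DataFrame, k: Int): DataFrame = {
  val q = math.min(1.0, k.toDouble / df.count())
  val Array(threshold) = df.stat.approxQuantile("hashDistance", Array(q), 0.001)
  df.filter(col("hashDistance") <= threshold)
}
{code}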
[jira] [Commented] (SPARK-1693) Dependent on multiple versions of servlet-api jars lead to throw an SecurityException when Spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893074#comment-15893074 ] Andrew Otto commented on SPARK-1693: We just upgraded to CDH 5.10, which has Spark 1.6.0, Hadoop 2.6.0, Hive 1.1.0, and Oozie 4.1.0. We are having trouble running Spark jobs that use HiveContext from Oozie. They run perfectly fine from the CLI with spark-submit, just not in Oozie. We aren't certain that HiveContext is related, but we can reproduce regularly with a job that uses HiveContext. Anyway, I post this here, because the error we are getting is the same that started this issue: {code}class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package{code} I've noticed that the Oozie sharelib includes javax.servlet-3.0.0.v201112011016.jar. I also see that spark-assembly.jar includes a javax.servlet.FilterRegistration class, although its hard for me to tell which version. The jetty pom.xml files in spark-assembly.jar seem to say {{javax.servlet.*;version="2.6.0"}}, but I'm a little green on how all these dependencies get resolved. I don't see any javax.servlet .jars in any of /usr/lib/hadoop* (where CDH installs hadoop jars). Help! :) If this is not related to this issue, I'll open a new one. > Dependent on multiple versions of servlet-api jars lead to throw an > SecurityException when Spark built for hadoop 2.3.0 , 2.4.0 > > > Key: SPARK-1693 > URL: https://issues.apache.org/jira/browse/SPARK-1693 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Blocker > Fix For: 1.0.0 > > Attachments: log.txt > > > {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 > > log.txt{code} > The log: > {code} > UnpersistSuite: > - unpersist RDD *** FAILED *** > java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s > signer information does not match signer information of other classes in the > same package > at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) > at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) > at java.lang.ClassLoader.defineClass(ClassLoader.java:794) > at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1693) Dependent on multiple versions of servlet-api jars lead to throw an SecurityException when Spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893074#comment-15893074 ] Andrew Otto edited comment on SPARK-1693 at 3/2/17 9:42 PM: We just upgraded to CDH 5.10, which has Spark 1.6.0, Hadoop 2.6.0, Hive 1.1.0, and Oozie 4.1.0. We are having trouble running Spark jobs that use HiveContext from Oozie. They run perfectly fine from the CLI with spark-submit, just not in Oozie. We aren't certain that HiveContext is related, but we can reproduce regularly with a job that uses HiveContext. Anyway, I post this here, because the error we are getting is the same that started this issue: {code}class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package{code} I've noticed that the Oozie sharelib includes javax.servlet-3.0.0.v201112011016.jar. I also see that spark-assembly.jar includes a javax.servlet.FilterRegistration class, although its hard for me to tell which version. The jetty pom.xml files in spark-assembly.jar seem to say {{javax.servlet.\*;version="2.6.0"}}, but I'm a little green on how all these dependencies get resolved. I don't see any javax.servlet .jars in any of /usr/lib/hadoop* (where CDH installs hadoop jars). Help! :) If this is not related to this issue, I'll open a new one. was (Author: ottomata): We just upgraded to CDH 5.10, which has Spark 1.6.0, Hadoop 2.6.0, Hive 1.1.0, and Oozie 4.1.0. We are having trouble running Spark jobs that use HiveContext from Oozie. They run perfectly fine from the CLI with spark-submit, just not in Oozie. We aren't certain that HiveContext is related, but we can reproduce regularly with a job that uses HiveContext. Anyway, I post this here, because the error we are getting is the same that started this issue: {code}class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package{code} I've noticed that the Oozie sharelib includes javax.servlet-3.0.0.v201112011016.jar. I also see that spark-assembly.jar includes a javax.servlet.FilterRegistration class, although its hard for me to tell which version. The jetty pom.xml files in spark-assembly.jar seem to say {{javax.servlet.*;version="2.6.0"}}, but I'm a little green on how all these dependencies get resolved. I don't see any javax.servlet .jars in any of /usr/lib/hadoop* (where CDH installs hadoop jars). Help! :) If this is not related to this issue, I'll open a new one. 
> Dependent on multiple versions of servlet-api jars lead to throw an > SecurityException when Spark built for hadoop 2.3.0 , 2.4.0 > > > Key: SPARK-1693 > URL: https://issues.apache.org/jira/browse/SPARK-1693 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Blocker > Fix For: 1.0.0 > > Attachments: log.txt > > > {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 > > log.txt{code} > The log: > {code} > UnpersistSuite: > - unpersist RDD *** FAILED *** > java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s > signer information does not match signer information of other classes in the > same package > at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) > at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) > at java.lang.ClassLoader.defineClass(ClassLoader.java:794) > at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
Sital Kedia created SPARK-19803: --- Summary: Flaky BlockManagerProactiveReplicationSuite tests Key: SPARK-19803 URL: https://issues.apache.org/jira/browse/SPARK-19803 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Sital Kedia The tests added for BlockManagerProactiveReplicationSuite has made the jenkins build flaky. Please refer to the build for more details - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19802) Remote History Server
Ben Barnard created SPARK-19802: --- Summary: Remote History Server Key: SPARK-19802 URL: https://issues.apache.org/jira/browse/SPARK-19802 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0 Reporter: Ben Barnard Currently the history server expects to find history in a filesystem somewhere. It would be nice to have a history server that listens for application events on a TCP port, and have a EventLoggingListener that sends events to the listening history server instead of writing to a file. This would allow the history server to show up-to-date history for past and running jobs in a cluster environment that lacks a shared filesystem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19801) Remove JDK7 from Travis CI
[ https://issues.apache.org/jira/browse/SPARK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892817#comment-15892817 ] Apache Spark commented on SPARK-19801: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/17143 > Remove JDK7 from Travis CI > -- > > Key: SPARK-19801 > URL: https://issues.apache.org/jira/browse/SPARK-19801 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR > verification (JDK7/JDK8 maven compilation and Java Linter) and contributors > can see the additional result via their Travis CI dashboard (or PC). > This issue aims to make `.travis.yml` up-to-date by removing JDK7 which was > removed via SPARK-19550. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19801) Remove JDK7 from Travis CI
[ https://issues.apache.org/jira/browse/SPARK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19801: Assignee: Apache Spark > Remove JDK7 from Travis CI > -- > > Key: SPARK-19801 > URL: https://issues.apache.org/jira/browse/SPARK-19801 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR > verification (JDK7/JDK8 maven compilation and Java Linter) and contributors > can see the additional result via their Travis CI dashboard (or PC). > This issue aims to make `.travis.yml` up-to-date by removing JDK7 which was > removed via SPARK-19550. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19801) Remove JDK7 from Travis CI
[ https://issues.apache.org/jira/browse/SPARK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19801: Assignee: (was: Apache Spark) > Remove JDK7 from Travis CI > -- > > Key: SPARK-19801 > URL: https://issues.apache.org/jira/browse/SPARK-19801 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR > verification (JDK7/JDK8 maven compilation and Java Linter) and contributors > can see the additional result via their Travis CI dashboard (or PC). > This issue aims to make `.travis.yml` up-to-date by removing JDK7 which was > removed via SPARK-19550. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19801) Remove JDK7 from Travis CI
Dongjoon Hyun created SPARK-19801: - Summary: Remove JDK7 from Travis CI Key: SPARK-19801 URL: https://issues.apache.org/jira/browse/SPARK-19801 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.1.0 Reporter: Dongjoon Hyun Priority: Minor Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR verification (JDK7/JDK8 maven compilation and Java Linter) and contributors can see the additional result via their Travis CI dashboard (or PC). This issue aims to make `.travis.yml` up-to-date by removing JDK7 which was removed via SPARK-19550. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19720) Redact sensitive information from SparkSubmit console output
[ https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19720. Resolution: Fixed Assignee: Mark Grover Fix Version/s: 2.2.0 > Redact sensitive information from SparkSubmit console output > > > Key: SPARK-19720 > URL: https://issues.apache.org/jira/browse/SPARK-19720 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.2.0 >Reporter: Mark Grover >Assignee: Mark Grover > Fix For: 2.2.0 > > > SPARK-18535 took care of redacting sensitive information from Spark event > logs and UI. However, it intentionally didn't bother redacting the same > sensitive information from SparkSubmit's console output because it was on the > client's machine, which already had the sensitive information on disk (in > spark-defaults.conf) or on terminal (spark-submit command line). > However, it seems now that it's better to redact information from > SparkSubmit's console output as well because orchestration software like > Oozie usually expose SparkSubmit's console output via a UI. To make matters > worse, Oozie, in particular, always sets the {{--verbose}} flag on > SparkSubmit invocation, making the sensitive information readily available in > its UI (see > [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248] > here). > This is a JIRA for tracking redaction of sensitive information from > SparkSubmit's console output. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
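A simplified Scala sketch of the kind of redaction this ticket describes: mask config values whose keys look sensitive before echoing them to the console. The regex and placeholder text are assumptions, not the actual patch:
{code}
val redactionPattern = "(?i)secret|password|token".r

def redact(kvs: Seq[(String, String)]): Seq[(String, String)] =
  kvs.map { case (k, v) =>
    if (redactionPattern.findFirstIn(k).isDefined) (k, "*********(redacted)") else (k, v)
  }

// Only the value of the sensitive-looking key is masked.
redact(Seq(
  "spark.executor.memory" -> "4g",
  "spark.hadoop.fs.s3a.secret.key" -> "not-so-secret"
)).foreach(println)
{code}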
[jira] [Commented] (SPARK-11197) Run SQL query on files directly without create a table
[ https://issues.apache.org/jira/browse/SPARK-11197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892706#comment-15892706 ] Ladislav Jech commented on SPARK-11197: --- Great stuff! > Run SQL query on files directly without create a table > -- > > Key: SPARK-11197 > URL: https://issues.apache.org/jira/browse/SPARK-11197 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.6.0 > > > It's useful to run SQL query directly on files without creating a table, as > people done with Apache Drill. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
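An example of the feature this ticket added (the path is a placeholder): the file format prefix and a backticked path go straight into the FROM clause, so no table needs to be created first:
{code}
// Assumes a SparkSession named `spark` and a Parquet directory at the given path.
spark.sql("SELECT * FROM parquet.`/data/events/2017-03-02`").show()
{code}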
[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed
[ https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892679#comment-15892679 ] Apache Spark commented on SPARK-18699: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17142 > Spark CSV parsing types other than String throws exception when malformed > - > > Key: SPARK-18699 > URL: https://issues.apache.org/jira/browse/SPARK-18699 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Jakub Nowacki >Assignee: Takeshi Yamamuro > Fix For: 2.2.0 > > > If CSV is read and the schema contains any other type than String, exception > is thrown when the string value in CSV is malformed; e.g. if the timestamp > does not match the defined one, an exception is thrown: > {code} > Caused by: java.lang.IllegalArgumentException > at java.sql.Date.valueOf(Date.java:143) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at scala.util.Try.getOrElse(Try.scala:79) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > {code} > It behaves similarly with Integer and Long types, from what I've seen. > To my understanding modes PERMISSIVE and DROPMALFORMED should just null the > value or drop the line, but instead they kill the job. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
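The parse modes discussed in SPARK-18699 above are selected through the CSV reader's {{mode}} option; a minimal sketch (schema and path are placeholders, and a SparkSession named {{spark}} is assumed):
{code}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("ts", TimestampType)
))

// PERMISSIVE is meant to null out malformed fields, DROPMALFORMED to skip the
// whole line, and FAILFAST to fail on the first malformed record.
val df = spark.read
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .csv("/tmp/input.csv")
{code}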
[jira] [Assigned] (SPARK-19800) Implement one kind of streaming sampling - reservoir sampling
[ https://issues.apache.org/jira/browse/SPARK-19800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19800: Assignee: (was: Apache Spark) > Implement one kind of streaming sampling - reservoir sampling > - > > Key: SPARK-19800 > URL: https://issues.apache.org/jira/browse/SPARK-19800 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19800) Implement one kind of streaming sampling - reservoir sampling
[ https://issues.apache.org/jira/browse/SPARK-19800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892571#comment-15892571 ] Apache Spark commented on SPARK-19800: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/17141 > Implement one kind of streaming sampling - reservoir sampling > - > > Key: SPARK-19800 > URL: https://issues.apache.org/jira/browse/SPARK-19800 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19800) Implement one kind of streaming sampling - reservoir sampling
[ https://issues.apache.org/jira/browse/SPARK-19800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19800: Assignee: Apache Spark > Implement one kind of streaming sampling - reservoir sampling > - > > Key: SPARK-19800 > URL: https://issues.apache.org/jira/browse/SPARK-19800 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19800) Implement one kind of streaming sampling - reservoir sampling
Genmao Yu created SPARK-19800: - Summary: Implement one kind of streaming sampling - reservoir sampling Key: SPARK-19800 URL: https://issues.apache.org/jira/browse/SPARK-19800 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.0, 2.0.2 Reporter: Genmao Yu -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892547#comment-15892547 ] Imran Rashid commented on SPARK-19796: -- [~kayousterhout] [~shivaram] here's another example of serializing lots of pointless data in each task -- in this case, {{TaskDescription.properties}} contains lots of data which the executors don't care about. and this gets serialized once per task. For this jira, I'll just do a small fix, but I thought you might be interested in this. > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Priority: Blocker > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19766) INNER JOIN on constant alias columns return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19766: Fix Version/s: 2.0.3 > INNER JOIN on constant alias columns return incorrect results > - > > Key: SPARK-19766 > URL: https://issues.apache.org/jira/browse/SPARK-19766 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Assignee: StanZhai >Priority: Critical > Labels: Correctness > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > We can demonstrate the problem with the following data set and query: > {code} > val spark = > SparkSession.builder().appName("test").master("local").getOrCreate() > val sql1 = > """ > |create temporary view t1 as select * from values > |(1) > |as grouping(a) > """.stripMargin > val sql2 = > """ > |create temporary view t2 as select * from values > |(1) > |as grouping(a) > """.stripMargin > val sql3 = > """ > |create temporary view t3 as select * from values > |(1), > |(1) > |as grouping(a) > """.stripMargin > val sql4 = > """ > |create temporary view t4 as select * from values > |(1), > |(1) > |as grouping(a) > """.stripMargin > val sqlA = > """ > |create temporary view ta as > |select a, 'a' as tag from t1 union all > |select a, 'b' as tag from t2 > """.stripMargin > val sqlB = > """ > |create temporary view tb as > |select a, 'a' as tag from t3 union all > |select a, 'b' as tag from t4 > """.stripMargin > val sql = > """ > |select tb.* from ta inner join tb on > |ta.a = tb.a and > |ta.tag = tb.tag > """.stripMargin > spark.sql(sql1) > spark.sql(sql2) > spark.sql(sql3) > spark.sql(sql4) > spark.sql(sqlA) > spark.sql(sqlB) > spark.sql(sql).show() > {code} > The results which is incorrect: > {code} > +---+---+ > | a|tag| > +---+---+ > | 1| b| > | 1| b| > | 1| a| > | 1| a| > | 1| b| > | 1| b| > | 1| a| > | 1| a| > +---+---+ > {code} > The correct results should be: > {code} > +---+---+ > | a|tag| > +---+---+ > | 1| a| > | 1| a| > | 1| b| > | 1| b| > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19796: Assignee: Apache Spark > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Assignee: Apache Spark >Priority: Blocker > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19796: Assignee: (was: Apache Spark) > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Priority: Blocker > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892482#comment-15892482 ] Apache Spark commented on SPARK-19796: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/17140 > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Priority: Blocker > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19799) Support WITH clause in subqueries
Giambattista created SPARK-19799: Summary: Support WITH clause in subqueries Key: SPARK-19799 URL: https://issues.apache.org/jira/browse/SPARK-19799 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Giambattista Because of SPARK-17590, it should be relatively easy to support the WITH clause in subqueries besides nested CTE definitions. Here is an example of a query that does not run on Spark: create table test (seqno int, k string, v int) using parquet; insert into TABLE test values (1,'a', 99),(2, 'b', 88),(3, 'a', 77),(4, 'b', 66),(5, 'c', 55),(6, 'a', 44),(7, 'b', 33); SELECT percentile(b, 0.5) FROM (WITH mavg AS (SELECT k, AVG(v) OVER (PARTITION BY k ORDER BY seqno ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as b FROM test ORDER BY seqno) SELECT k, MAX(b) as b FROM mavg GROUP BY k); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
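Until such support lands, one possible workaround (a sketch, assuming the rewrite is equivalent for this query; the ORDER BY inside the CTE does not change MAX or percentile) is to hoist the CTE out of the subquery, since a top-level WITH is already accepted:
{code}
spark.sql("""
  WITH mavg AS (
    SELECT k, AVG(v) OVER (PARTITION BY k ORDER BY seqno
                           ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS b
    FROM test
  )
  SELECT percentile(b, 0.5)
  FROM (SELECT k, MAX(b) AS b FROM mavg GROUP BY k) t
""").show()
{code}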
[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892390#comment-15892390 ] Imran Rashid commented on SPARK-19796: -- Since its a regression, I'm making this a blocker for 2.2.0 (or else we revert SPARK-17931, but the fix should be simple). > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Priority: Blocker > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-19796: - Priority: Blocker (was: Major) > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Priority: Blocker > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892363#comment-15892363 ] Apache Spark commented on SPARK-18890: -- User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/17139 > Do all task serialization in CoarseGrainedExecutorBackend thread (rather than > TaskSchedulerImpl) > > > Key: SPARK-18890 > URL: https://issues.apache.org/jira/browse/SPARK-18890 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > As part of benchmarking this change: > https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and > I found that moving task serialization from TaskSetManager (which happens as > part of the TaskSchedulerImpl's thread) to CoarseGranedSchedulerBackend leads > to approximately a 10% reduction in job runtime for a job that counted 10,000 > partitions (that each had 1 int) using 20 machines. Similar performance > improvements were reported in the pull request linked above. This would > appear to be because the TaskSchedulerImpl thread is the bottleneck, so > moving serialization to CGSB reduces runtime. This change may *not* improve > runtime (and could potentially worsen runtime) in scenarios where the CGSB > thread is the bottleneck (e.g., if tasks are very large, so calling launch to > send the tasks to the executor blocks on the network). > One benefit of implementing this change is that it makes it easier to > parallelize the serialization of tasks (different tasks could be serialized > by different threads). Another benefit is that all of the serialization > occurs in the same place (currently, the Task is serialized in > TaskSetManager, and the TaskDescription is serialized in CGSB). > I'm not totally convinced we should fix this because it seems like there are > better ways of reducing the serialization time (e.g., by re-using a single > serialized object with the Task/jars/files and broadcasting it for each > stage) but I wanted to open this JIRA to document the discussion. > cc [~witgo] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
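A toy sketch of the idea being discussed (hypothetical types and names, not Spark's internal API): the point is simply that per-task serialization can be taken off the single scheduler thread and done on a thread, or pool, owned by the backend.
{code}
import java.nio.ByteBuffer
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Stand-ins for the real task and serializer (hypothetical).
case class TaskPayload(taskId: Long, body: Array[Byte])
def serialize(t: TaskPayload): ByteBuffer = ByteBuffer.wrap(t.body)

// Serialize each task on a dedicated pool instead of the scheduler's
// event-processing thread, then hand the bytes to the transport callback.
implicit val serializationPool: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

def launchTasks(tasks: Seq[TaskPayload])(send: ByteBuffer => Unit): Unit =
  tasks.foreach { t => Future(serialize(t)).foreach(send) }
{code}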
[jira] [Commented] (SPARK-17080) join reorder
[ https://issues.apache.org/jira/browse/SPARK-17080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892325#comment-15892325 ] Apache Spark commented on SPARK-17080: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/17138 > join reorder > > > Key: SPARK-17080 > URL: https://issues.apache.org/jira/browse/SPARK-17080 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu > > We decide the join order of a multi-way join query based on the cost function > defined in the spec. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
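The one-line description is terse, so here is a greedy toy sketch of what "deciding a join order from a cost function" means. This is illustrative only: the actual work relies on table statistics and the cost function defined in the design spec, while the sketch just orders joins by estimated intermediate result size under an assumed selectivity.
{code}
// Each relation carries an estimated row count; the "cost" of a join is the
// estimated size of its intermediate result under an assumed selectivity.
case class Rel(name: String, rows: Long)

def joinRows(a: Rel, b: Rel, selectivity: Double = 0.1): Long =
  math.max(1L, (a.rows * b.rows * selectivity).toLong)

// Greedily start from the smallest relation and repeatedly join the relation
// that keeps the intermediate result smallest.
def reorder(relations: Seq[Rel]): Seq[String] = {
  val remaining = relations.toBuffer
  val order = scala.collection.mutable.ArrayBuffer[String]()
  var current = remaining.minBy(_.rows)
  remaining -= current
  order += current.name
  while (remaining.nonEmpty) {
    val next = remaining.minBy(r => joinRows(current, r))
    remaining -= next
    order += next.name
    current = Rel(s"(${current.name} JOIN ${next.name})", joinRows(current, next))
  }
  order
}

// Example: reorder(Seq(Rel("fact", 1000000L), Rel("dim1", 100L), Rel("dim2", 10L)))
{code}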
[jira] [Assigned] (SPARK-17080) join reorder
[ https://issues.apache.org/jira/browse/SPARK-17080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17080: Assignee: Apache Spark > join reorder > > > Key: SPARK-17080 > URL: https://issues.apache.org/jira/browse/SPARK-17080 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu >Assignee: Apache Spark > > We decide the join order of a multi-way join query based on the cost function > defined in the spec. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19798) Query returns stale results when tables are modified on other sessions
Giambattista created SPARK-19798: Summary: Query returns stale results when tables are modified on other sessions Key: SPARK-19798 URL: https://issues.apache.org/jira/browse/SPARK-19798 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Giambattista I observed the problem on master branch with thrift server in multisession mode (default), but I was able to replicate also with spark-shell as well (see below the sequence for replicating). I observed cases where changes made in a session (table insert, table renaming) are not visible to other derived sessions (created with session.newSession). The problem seems due to the fact that each session has its own tableRelationCache and it does not get refreshed. IMO tableRelationCache should be shared in sharedState, maybe in the cacheManager so that refresh of caches for data that is not session-specific such as temporary tables gets centralized. --- Spark shell script val spark2 = spark.newSession spark.sql("CREATE TABLE test (a int) using parquet") spark2.sql("select * from test").show // OK returns empty spark.sql("select * from test").show // OK returns empty spark.sql("insert into TABLE test values 1,2,3") spark2.sql("select * from test").show // ERROR returns empty spark.sql("select * from test").show // OK returns 3,2,1 spark.sql("create table test2 (a int) using parquet") spark.sql("insert into TABLE test2 values 4,5,6") spark2.sql("select * from test2").show // OK returns 6,4,5 spark.sql("select * from test2").show // OK returns 6,4,5 spark.sql("alter table test rename to test3") spark.sql("alter table test2 rename to test") spark.sql("alter table test3 rename to test2") spark2.sql("select * from test").show // ERROR returns empty spark.sql("select * from test").show // OK returns 6,4,5 spark2.sql("select * from test2").show // ERROR throws java.io.FileNotFoundException spark.sql("select * from test2").show // OK returns 3,1,2 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
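A hedged note on the repro above: per-session staleness of this kind can sometimes be cleared manually with the existing REFRESH TABLE command, which drops that session's cached metadata for the table. Whether this is an acceptable interim workaround for the thrift-server case is an assumption, not something verified here.
{code}
// Continuing the spark-shell script above: force the second session to drop
// its cached metadata for the table before querying it again.
spark2.sql("REFRESH TABLE test")
spark2.sql("select * from test").show()   // expected to show the freshly inserted rows
{code}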
[jira] [Assigned] (SPARK-17080) join reorder
[ https://issues.apache.org/jira/browse/SPARK-17080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17080: Assignee: (was: Apache Spark) > join reorder > > > Key: SPARK-17080 > URL: https://issues.apache.org/jira/browse/SPARK-17080 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu > > We decide the join order of a multi-way join query based on the cost function > defined in the spec. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18769) Spark to be smarter about what the upper bound is and to restrict number of executor when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-18769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892318#comment-15892318 ] Thomas Graves commented on SPARK-18769: --- I definitely understand there is an actual problem here, but I think the problem is more with Spark and its event processing/synchronization than with the fact that we are asking for more containers. As I mentioned, I agree with doing the JIRA; I just want to clarify why we are doing it and make sure we do it in a way that doesn't hurt our container allocation. It's always good to play nice in the YARN environment and not ask for more containers than the entire cluster can handle, for instance, but at the same time, if we limit the container requests early on, YARN could easily free up resources and make them available to us; if we don't have our request in, YARN could give those resources to someone else. There are a lot of configs in the YARN schedulers and many different situations. If you look at some other apps on YARN (MR and Tez), both immediately ask for all of their resources. MR is definitely different since it doesn't reuse containers; Tez does. By asking for everything immediately you can definitely hit issues where, if your tasks run really fast, you don't need all of those containers, but the exponential ramp-up of our allocation now gets you there really quickly anyway, and I think you can hit the same issue. Note that in our clusters we set the upper limit by default to something reasonable (a couple thousand), and if someone has a really large job they can reconfigure. > Spark to be smarter about what the upper bound is and to restrict number of > executor when dynamic allocation is enabled > > > Key: SPARK-18769 > URL: https://issues.apache.org/jira/browse/SPARK-18769 > Project: Spark > Issue Type: New Feature >Reporter: Neerja Khattar > > Currently when dynamic allocation is enabled, max.executor is infinite and > Spark creates so many executors that it can even exceed the YARN NodeManager memory > limit and vcores. > It should have a check to not exceed the YARN resource limit. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
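Independent of how the ticket is resolved, the upper bound Thomas mentions can already be pinned per application with existing configuration; a minimal sketch (the numbers are arbitrary placeholders):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("capped-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")         // external shuffle service is required on YARN
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "2000")  // default is effectively unbounded
  .getOrCreate()
{code}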
[jira] [Resolved] (SPARK-19345) Add doc for "coldStartStrategy" usage in ALS
[ https://issues.apache.org/jira/browse/SPARK-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19345. Resolution: Fixed Fix Version/s: 2.2.0 > Add doc for "coldStartStrategy" usage in ALS > > > Key: SPARK-19345 > URL: https://issues.apache.org/jira/browse/SPARK-19345 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Nick Pentreath >Assignee: Nick Pentreath > Fix For: 2.2.0 > > > SPARK-14489 adds the ability to skip {{NaN}} predictions during > {{ALS.transform}}. This can be useful in production scenarios but is > particularly useful when trying to use the cross-validation classes with ALS, > since in many cases the test set will have users/items that are not in the > training set, leading to evaluation metrics that are all {{NaN}} and making > cross-validation unusable. > Add an explanation for the {{coldStartStrategy}} param to the ALS > documentation, and add example code to illustrate the usage. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
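For readers of this digest, a condensed sketch of the usage being documented (assumes a `ratings` DataFrame with userId/movieId/rating columns exists; Spark 2.2+):
{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

val als = new ALS()
  .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
  .setColdStartStrategy("drop")   // drop rows whose prediction would be NaN (unseen users/items)

val model = als.fit(training)
val predictions = model.transform(test)

// Without "drop", unseen users/items in `test` would yield NaN predictions
// and the RMSE below would itself be NaN.
val rmse = new RegressionEvaluator()
  .setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")
  .evaluate(predictions)
{code}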
[jira] [Updated] (SPARK-19345) Add doc for "coldStartStrategy" usage in ALS
[ https://issues.apache.org/jira/browse/SPARK-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-19345: --- Priority: Minor (was: Major) > Add doc for "coldStartStrategy" usage in ALS > > > Key: SPARK-19345 > URL: https://issues.apache.org/jira/browse/SPARK-19345 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Nick Pentreath >Assignee: Nick Pentreath >Priority: Minor > Fix For: 2.2.0 > > > SPARK-14489 adds the ability to skip {{NaN}} predictions during > {{ALS.transform}}. This can be useful in production scenarios but is > particularly useful when trying to use the cross-validation classes with ALS, > since in many cases the test set will have users/items that are not in the > training set, leading to evaluation metrics that are all {{NaN}} and making > cross-validation unusable. > Add an explanation for the {{coldStartStrategy}} param to the ALS > documentation, and add example code to illustrate the usage. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892216#comment-15892216 ] Sean Owen commented on SPARK-19797: --- Yes, it's not true of scoring though, and the difference in fitting won't matter to the caller though. > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Priority: Trivial > Labels: documentation > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. > The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892189#comment-15892189 ] Zhe Sun edited comment on SPARK-19797 at 3/2/17 12:52 PM: -- Hi Sean, thanks for your quick reply. bq. If the Pipeline had more stages, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage. Let's use IDF as an example. If the pipeline is like: bq. Tokenizer -> HashingTF -> IDF -> LogisticRegression When we fit this pipeline, *IDF* will first call _fit_, then call _transform_ and pass the idf result to LogisticRegression. Because LogisticRegression is an Estimator and _fit_ of LogisticRegression needs the data from _transformer_ of *IDF*. However, if the last stage of pipeline is Normalizer (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer) bq. Tokenizer -> HashingTF -> IDF -> Normalizer When fitting this pipeline, *IDF* will only call _fit_, and do not need to call _transform_ That's why I think it is better to modify the description as below to make it accurate. bq. If the Pipeline had more Estimators, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage. was (Author: ymwdalex): Hi Sean, thanks for your quick reply. bq. If the Pipeline had more stages, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage. Let's use IDF as an example. If the pipeline is like: bq. Tokenizer -> HashingTF -> IDF -> LogisticRegression When we fit this pipeline, *IDF* will first call _fit_, then call _transform_ and pass the idf result to LogisticRegression. Because LogisticRegression is an Estimator and _fit_ of LogisticRegression needs the data from _transformer_ of *IDF*. However, if the last stage of pipeline is Normalizer (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer) bq. Tokenizer -> HashingTF -> IDF -> Normalizer When fitting this pipeline, *IDF* will only call _fit_, and do not need to call _transform_ That's why I think it is better to correct the description as bq. If the Pipeline had more Estimators, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage. > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Priority: Trivial > Labels: documentation > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. 
> The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
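To make the discussion concrete, here is the pipeline shape being debated, sketched in Scala (assumes `training` and `test` DataFrames with "text" and "label" columns exist):
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, lr))

// During fit(), an upstream stage's transform() is only invoked when a later
// stage still needs to be fitted on its output; the last fitted model is not
// asked to transform anything at fit time.
val model = pipeline.fit(training)

// At scoring time, every stage of the fitted PipelineModel runs transform().
val scored = model.transform(test)
{code}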
[jira] [Commented] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892189#comment-15892189 ] Zhe Sun commented on SPARK-19797: - Hi Sean, thanks for your quick reply. bq. If the Pipeline had more stages, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage. Let's use IDF as an example. If the pipeline is like: bq. Tokenizer -> HashingTF -> IDF -> LogisticRegression When we fit this pipeline, *IDF* will first call _fit_, then call _transform_ and pass the idf result to LogisticRegression. Because LogisticRegression is an Estimator and _fit_ of LogisticRegression needs the data from _transformer_ of *IDF*. However, if the last stage of pipeline is Normalizer (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer) bq. Tokenizer -> HashingTF -> IDF -> Normalizer When fitting this pipeline, *IDF* will only call _fit_, and do not need to call _transform_ That's why I think it is better to correct the description as bq. If the Pipeline had more Estimators, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage. > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Priority: Trivial > Labels: documentation > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. > The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19797: Assignee: Apache Spark > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Assignee: Apache Spark >Priority: Trivial > Labels: documentation > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. > The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19797: Assignee: (was: Apache Spark) > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Priority: Trivial > Labels: documentation > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. > The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()
[ https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892169#comment-15892169 ] Takeshi Yamamuro commented on SPARK-19503: -- I'm not sure this should be fixed though, postgresql leaves this kind of sorting as it is; {code} postgres=# \d testTable Table "public.testTable" Column | Type | Modifiers +-+--- key| integer | value | integer | postgres=# select count(*) from (select * from testTable order by key) t; count --- 1 (1 row) postgres=# explain select count(*) from (select * from testTable order by key) t; QUERY PLAN Aggregate (cost=192.41..192.42 rows=1 width=0) -> Sort (cost=158.51..164.16 rows=2260 width=4) Sort Key: testTable.key -> Seq Scan on testTable (cost=0.00..32.60 rows=2260 width=4) (4 rows) {code} > Execution Plan Optimizer: avoid sort or shuffle when it does not change end > result such as df.sort(...).count() > --- > > Key: SPARK-19503 > URL: https://issues.apache.org/jira/browse/SPARK-19503 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 > Environment: Perhaps only a pyspark or databricks AWS issue >Reporter: R >Priority: Minor > Labels: execution, optimizer, plan, query > > df.sort(...).count() > performs shuffle and sort and then count! This is wasteful as sort is not > required here and makes me wonder how smart the algebraic optimiser is > indeed! The data may be partitioned by known count (such as parquet files) > and we should not shuffle to just perform count. > This may look trivial, but if optimiser fails to recognise this, I wonder > what else is it missing especially in more complex operations. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()
[ https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892169#comment-15892169 ] Takeshi Yamamuro edited comment on SPARK-19503 at 3/2/17 12:39 PM: --- I'm not sure this should be fixed though, postgresql leaves this kind of sorting as it is...; {code} postgres=# \d testTable Table "public.testTable" Column | Type | Modifiers +-+--- key| integer | value | integer | postgres=# select count(*) from (select * from testTable order by key) t; count --- 1 (1 row) postgres=# explain select count(*) from (select * from testTable order by key) t; QUERY PLAN Aggregate (cost=192.41..192.42 rows=1 width=0) -> Sort (cost=158.51..164.16 rows=2260 width=4) Sort Key: testTable.key -> Seq Scan on testTable (cost=0.00..32.60 rows=2260 width=4) (4 rows) {code} was (Author: maropu): I'm not sure this should be fixed though, postgresql leaves this kind of sorting as it is; {code} postgres=# \d testTable Table "public.testTable" Column | Type | Modifiers +-+--- key| integer | value | integer | postgres=# select count(*) from (select * from testTable order by key) t; count --- 1 (1 row) postgres=# explain select count(*) from (select * from testTable order by key) t; QUERY PLAN Aggregate (cost=192.41..192.42 rows=1 width=0) -> Sort (cost=158.51..164.16 rows=2260 width=4) Sort Key: testTable.key -> Seq Scan on testTable (cost=0.00..32.60 rows=2260 width=4) (4 rows) {code} > Execution Plan Optimizer: avoid sort or shuffle when it does not change end > result such as df.sort(...).count() > --- > > Key: SPARK-19503 > URL: https://issues.apache.org/jira/browse/SPARK-19503 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 > Environment: Perhaps only a pyspark or databricks AWS issue >Reporter: R >Priority: Minor > Labels: execution, optimizer, plan, query > > df.sort(...).count() > performs shuffle and sort and then count! This is wasteful as sort is not > required here and makes me wonder how smart the algebraic optimiser is > indeed! The data may be partitioned by known count (such as parquet files) > and we should not shuffle to just perform count. > This may look trivial, but if optimiser fails to recognise this, I wonder > what else is it missing especially in more complex operations. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
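For anyone who wants to check the behaviour the reporter describes, a small spark-shell sketch; whether the Sort/Exchange survives in the plan depends on the Spark version and optimizer rules, so treat the outcome as something to inspect rather than a given:
{code}
val df = spark.range(0L, 1000000L).toDF("key")

// Inspect the physical plan for Sort / Exchange nodes that a plain count
// would not need.
df.sort("key").groupBy().count().explain()

// The count itself never depends on the ordering of its input.
df.sort("key").count()
{code}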
[jira] [Commented] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892170#comment-15892170 ] Zhe Sun commented on SPARK-19797: - A pull request was created https://github.com/apache/spark/pull/17137 > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Priority: Trivial > Labels: documentation > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. > The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892168#comment-15892168 ] Apache Spark commented on SPARK-19797: -- User 'ymwdalex' has created a pull request for this issue: https://github.com/apache/spark/pull/17137 > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Priority: Trivial > Labels: documentation > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. > The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892157#comment-15892157 ] Sean Owen commented on SPARK-19797: --- Hm, on second look, the placement of the sentence suggest it applies to fitting. It is a bit of an implementation detail that this is optimized away, and the user won't actually care whether the pointless transforms happen during fitting or not. It is probably OK as is, but, might be clearer to say something like, "has more stages that require the output of the LogisticRegressionModel to fit" or something? > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Priority: Trivial > Labels: documentation > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. > The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19778) alias cannot be used in group by
[ https://issues.apache.org/jira/browse/SPARK-19778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19778. -- Resolution: Duplicate I am resolving this as a duplicate of SPARK-14471. Please reopen this if I misunderstood. > alias cannot be used in group by > > > Key: SPARK-19778 > URL: https://issues.apache.org/jira/browse/SPARK-19778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: xukun > > Does not support “select key as key1 from src group by key1”. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19797) ML pipelines document error
[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892149#comment-15892149 ] Sean Owen commented on SPARK-19797: --- I don't think that's true. The resulting pipeline would contain a LogisticRegressionModel, and when invoked, its transform() method would be called, and the result passed to subsequent transformers if any. You are pointing out that the trailing transformations aren't necessary to compute when _fitting_ the pipeline. That's what the if statement here is optimizing away. > ML pipelines document error > --- > > Key: SPARK-19797 > URL: https://issues.apache.org/jira/browse/SPARK-19797 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zhe Sun >Priority: Trivial > Labels: documentation > Original Estimate: 5m > Remaining Estimate: 5m > > Description about pipeline in this paragraph is incorrect > https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which > misleads the user > bq. If the Pipeline had more *stages*, it would call the > LogisticRegressionModel’s transform() method on the DataFrame before passing > the DataFrame to the next stage. > The description is not accurate, because *Transformer* could also be a stage. > But only another Estimator will invoke an extra transform call. > So, the description should be corrected as: *If the Pipeline had more > _Estimators_*. > The code to prove it is here > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19783) Treat shorter/longer lengths of tokens as malformed records in CSV parser
[ https://issues.apache.org/jira/browse/SPARK-19783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19783: Assignee: Apache Spark > Treat shorter/longer lengths of tokens as malformed records in CSV parser > - > > Key: SPARK-19783 > URL: https://issues.apache.org/jira/browse/SPARK-19783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark > > If a length of tokens does not match an expected length in a schema, we > probably need to treat it as a malformed record. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L239 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19783) Treat shorter/longer lengths of tokens as malformed records in CSV parser
[ https://issues.apache.org/jira/browse/SPARK-19783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19783: Assignee: (was: Apache Spark) > Treat shorter/longer lengths of tokens as malformed records in CSV parser > - > > Key: SPARK-19783 > URL: https://issues.apache.org/jira/browse/SPARK-19783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro > > If a length of tokens does not match an expected length in a schema, we > probably need to treat it as a malformed record. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L239 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19783) Treat shorter/longer lengths of tokens as malformed records in CSV parser
[ https://issues.apache.org/jira/browse/SPARK-19783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892146#comment-15892146 ] Apache Spark commented on SPARK-19783: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/17136 > Treat shorter/longer lengths of tokens as malformed records in CSV parser > - > > Key: SPARK-19783 > URL: https://issues.apache.org/jira/browse/SPARK-19783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro > > If a length of tokens does not match an expected length in a schema, we > probably need to treat it as a malformed record. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L239 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
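A small sketch of the situation in question (the path and data are made up): a row with fewer tokens than the schema, read under the existing parse modes whose treatment of such rows is what this ticket proposes to tighten.
{code}
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType()
  .add("a", IntegerType).add("b", IntegerType).add("c", IntegerType)

// The second record has only two tokens for a three-column schema.
Seq("1,2,3", "4,5").toDS.write.mode("overwrite").text("/tmp/spark-19783-tokens")

spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")        // also try DROPMALFORMED or FAILFAST
  .csv("/tmp/spark-19783-tokens")
  .show()
{code}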
[jira] [Created] (SPARK-19797) ML pipelines document error
Zhe Sun created SPARK-19797: --- Summary: ML pipelines document error Key: SPARK-19797 URL: https://issues.apache.org/jira/browse/SPARK-19797 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.0 Reporter: Zhe Sun Priority: Trivial Description about pipeline in this paragraph is incorrect https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which misleads the user bq. If the Pipeline had more *stages*, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage. The description is not accurate, because *Transformer* could also be a stage. But only another Estimator will invoke an extra transform call. So, the description should be corrected as: *If the Pipeline had more _Estimators_*. The code to prove it is here https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol
[ https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-19704: --- Fix Version/s: 2.2.0 > AFTSurvivalRegression should support numeric censorCol > -- > > Key: SPARK-19704 > URL: https://issues.apache.org/jira/browse/SPARK-19704 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 2.2.0 > > > AFTSurvivalRegression should support numeric censorCol -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol
[ https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19704: -- Assignee: zhengruifeng > AFTSurvivalRegression should support numeric censorCol > -- > > Key: SPARK-19704 > URL: https://issues.apache.org/jira/browse/SPARK-19704 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > AFTSurvivalRegression should support numeric censorCol -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol
[ https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19704. Resolution: Fixed > AFTSurvivalRegression should support numeric censorCol > -- > > Key: SPARK-19704 > URL: https://issues.apache.org/jira/browse/SPARK-19704 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > AFTSurvivalRegression should support numeric censorCol -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
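For context on what the censor column is, a sketch adapted from the standard AFT example: censor is 1.0 for uncensored and 0.0 for censored observations, and accepting non-Double numeric types for this column is what the ticket adds.
{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression
import spark.implicits._

val training = Seq(
  (1.218, 1.0, Vectors.dense(1.560, -0.605)),
  (2.949, 0.0, Vectors.dense(0.346,  2.158)),
  (3.627, 0.0, Vectors.dense(1.380,  0.231)),
  (0.273, 1.0, Vectors.dense(0.520,  1.151)),
  (4.199, 0.0, Vectors.dense(0.795, -0.226))
).toDF("label", "censor", "features")

val aft = new AFTSurvivalRegression()
  .setCensorCol("censor")                       // with this change, an Int column would also be accepted
  .setQuantileProbabilities(Array(0.3, 0.6))
  .setQuantilesCol("quantiles")

val model = aft.fit(training)
model.transform(training).show(false)
{code}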
[jira] [Assigned] (SPARK-19733) ALS performs unnecessary casting on item and user ids
[ https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19733: -- Assignee: Vasilis Vryniotis > ALS performs unnecessary casting on item and user ids > - > > Key: SPARK-19733 > URL: https://issues.apache.org/jira/browse/SPARK-19733 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Vasilis Vryniotis >Assignee: Vasilis Vryniotis > Fix For: 2.2.0 > > > ALS performs unnecessary casting of the user and item ids (to > double). I believe this is because the protected checkedCast() method > requires a double input. This can be avoided by refactoring the code of the > checkedCast() method. > Issue resolved by pull request 17059: > https://github.com/apache/spark/pull/17059 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19733) ALS performs unnecessary casting on item and user ids
[ https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19733. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17059 [https://github.com/apache/spark/pull/17059] > ALS performs unnecessary casting on item and user ids > - > > Key: SPARK-19733 > URL: https://issues.apache.org/jira/browse/SPARK-19733 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Vasilis Vryniotis > Fix For: 2.2.0 > > > ALS performs unnecessary casting of the user and item ids (to > double). I believe this is because the protected checkedCast() method > requires a double input. This can be avoided by refactoring the code of the > checkedCast() method. > Issue resolved by pull request 17059: > https://github.com/apache/spark/pull/17059 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
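An illustrative sketch of the kind of refactoring the description mentions (hypothetical standalone code, not the actual ALS source): validate that an id fits in an Int per input type instead of first forcing everything through Double, which can lose precision for large longs.
{code}
def checkedCast(id: Any): Int = id match {
  case i: Int  => i                              // already an Int: no cast needed
  case l: Long =>
    require(l.isValidInt, s"ALS id $l was out of Integer range")
    l.toInt
  case d: Double =>
    require(d.isValidInt, s"ALS id $d was not an integer in Integer range")
    d.toInt
  case other =>
    throw new IllegalArgumentException(s"Unsupported id type: ${other.getClass.getName}")
}
{code}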