[jira] [Updated] (SPARK-23766) Not able to execute multiple queries in spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apeksha Agnihotri updated SPARK-23766: -- Component/s: (was: Spark Core) Structured Streaming
> Not able to execute multiple queries in spark structured streaming
> --
>
> Key: SPARK-23766
> URL: https://issues.apache.org/jira/browse/SPARK-23766
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Apeksha Agnihotri
>Priority: Major
>
> I am able to receive output of the first query (i.e. the reader) only, although all the queries show up as running in the logs. No data is stored in HDFS either.
>
> {code:java}
> public class A extends D implements Serializable {
>     public Dataset<Row> getDataSet(SparkSession session) {
>         Dataset<Row> dfs = session.readStream().format("socket")
>                 .option("host", hostname).option("port", port).load();
>         publish(dfs.toDF(), "reader");
>         return dfs;
>     }
> }
>
> public class B extends D implements Serializable {
>     public Dataset<Row> execute(Dataset<Row> ds) {
>         Dataset<Row> d = ds.select(functions.explode(functions.split(ds.col("value"), "\\s+")));
>         publish(d.toDF(), "component");
>         return d;
>     }
> }
>
> public class C extends D implements Serializable {
>     public Dataset<Row> execute(Dataset<Row> ds) {
>         publish(ds.toDF(), "console");
>         ds.writeStream().format("csv").option("path", "hdfs://hostname:9000/user/abc/data1/")
>                 .option("checkpointLocation", "hdfs://hostname:9000/user/abc/cp")
>                 .outputMode("append").start();
>         return ds;
>     }
> }
>
> public class D {
>     protected String hostname; // socket source host, set elsewhere
>     protected int port;        // socket source port, set elsewhere
>
>     public void publish(Dataset<Row> dataset, String name) {
>         dataset.writeStream().format("csv").queryName(name)
>                 .option("path", "hdfs://hostname:9000/user/abc/" + name)
>                 .option("checkpointLocation", "hdfs://hostname:9000/user/abc/checkpoint/" + name)
>                 .outputMode("append").start();
>     }
> }
>
> public static void main(String[] args) {
>     SparkSession session = createSession();
>     try {
>         A a = new A();
>         Dataset<Row> records = a.getDataSet(session);
>         B b = new B();
>         Dataset<Row> ds = b.execute(records);
>         C c = new C();
>         c.execute(ds);
>         session.streams().awaitAnyTermination();
>     } catch (StreamingQueryException e) {
>         e.printStackTrace();
>     }
> }
> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
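For reference, a minimal sketch of the multi-query pattern discussed above is shown below. This is not the reporter's application: the host, port and HDFS paths are placeholders, and the point is only that each started query needs its own queryName and, in particular, its own checkpointLocation so the sinks can make progress independently.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.StreamingQueryException;

public class MultiQueryExample {
  public static void main(String[] args) throws StreamingQueryException {
    SparkSession session = SparkSession.builder().appName("multi-query").getOrCreate();

    // Single socket input (placeholder host/port).
    Dataset<Row> lines = session.readStream()
        .format("socket").option("host", "hostname").option("port", 9999).load();

    Dataset<Row> words = lines.select(
        functions.explode(functions.split(lines.col("value"), "\\s+")).as("word"));

    // Query 1: raw lines, with its own output path and checkpoint directory.
    lines.writeStream().format("csv").queryName("reader")
        .option("path", "hdfs://hostname:9000/user/abc/reader")
        .option("checkpointLocation", "hdfs://hostname:9000/user/abc/checkpoint/reader")
        .outputMode("append").start();

    // Query 2: exploded words, with a different output path and checkpoint directory.
    words.writeStream().format("csv").queryName("component")
        .option("path", "hdfs://hostname:9000/user/abc/component")
        .option("checkpointLocation", "hdfs://hostname:9000/user/abc/checkpoint/component")
        .outputMode("append").start();

    // Block until any of the started queries terminates.
    session.streams().awaitAnyTermination();
  }
}
{code}

Note that the socket source is intended for testing only, and each running query instantiates its own source (and therefore its own socket connection), which may explain why only one query appears to receive data.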
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415093#comment-16415093 ] Davide Isoardi commented on SPARK-23739: We know that the Kafka 0.8 client does not have this class, but the Druid package does. Is it possible that this issue is caused by the Druid package? If so, can one not install Druid and still use Structured Streaming to get data from Kafka? > Spark structured streaming long running problem > --- > > Key: SPARK-23739 > URL: https://issues.apache.org/jira/browse/SPARK-23739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Florencio >Priority: Critical > Labels: spark, streaming, structured > > I had a problem with long running spark structured streaming in spark 2.1. > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.common.requests.LeaveGroupResponse. > The detailed error is the following: > 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. > Metadata OffsetSeqMetadata(0,1521216656590) > 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = > Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = > \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}} > 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map() > 18/03/16 16:10:57 ERROR StreamExecution: Query [id = > a233b9ff-cc39-44d3-b953-a255986c04bf, runId = > 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error > java.util.zip.ZipException: invalid code lengths set > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2101) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.map(RDD.scala:369) > at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator > java.lang.NoClassDefFoundError: > org/apache/kafka/common/requests/LeaveGroupResponse > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377) > at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66) > at > org.apache.kafka.clients.consu
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-1359: --- Remaining Estimate: (was: 168h) Original Estimate: (was: 168h) > SGD implementation is not efficient > --- > > Key: SPARK-1359 > URL: https://issues.apache.org/jira/browse/SPARK-1359 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0, 1.0.0 >Reporter: Xiangrui Meng >Priority: Major > > The SGD implementation samples a mini-batch to compute the stochastic > gradient. This is not efficient because examples are provided via an iterator > interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-1359: --- Remaining Estimate: 168h Original Estimate: 168h > SGD implementation is not efficient > --- > > Key: SPARK-1359 > URL: https://issues.apache.org/jira/browse/SPARK-1359 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0, 1.0.0 >Reporter: Xiangrui Meng >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > The SGD implementation samples a mini-batch to compute the stochastic > gradient. This is not efficient because examples are provided via an iterator > interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-1359: --- Remaining Estimate: (was: 504h) Original Estimate: (was: 504h) > SGD implementation is not efficient > --- > > Key: SPARK-1359 > URL: https://issues.apache.org/jira/browse/SPARK-1359 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0, 1.0.0 >Reporter: Xiangrui Meng >Priority: Major > > The SGD implementation samples a mini-batch to compute the stochastic > gradient. This is not efficient because examples are provided via an iterator > interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-1359: --- Remaining Estimate: 504h Original Estimate: 504h > SGD implementation is not efficient > --- > > Key: SPARK-1359 > URL: https://issues.apache.org/jira/browse/SPARK-1359 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0, 1.0.0 >Reporter: Xiangrui Meng >Priority: Major > Original Estimate: 504h > Remaining Estimate: 504h > > The SGD implementation samples a mini-batch to compute the stochastic > gradient. This is not efficient because examples are provided via an iterator > interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
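For context, the pattern being criticized is mini-batch sampling over an RDD, roughly as sketched below (a simplified illustration, not MLlib's actual GradientDescent code): every iteration calls sample() on the full data set, and because the examples are only reachable through each partition's iterator, every element is scanned just to keep a small fraction of them.

{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;

// Simplified sketch of a mini-batch SGD loop; gradient math and weight updates are elided.
public class MiniBatchSgdSketch {
  public static void run(JavaRDD<LabeledPoint> data, int numIterations, double miniBatchFraction) {
    for (int i = 1; i <= numIterations; i++) {
      // sample() still walks every partition's iterator, even though only
      // miniBatchFraction of the examples are kept for this iteration.
      JavaRDD<LabeledPoint> miniBatch = data.sample(false, miniBatchFraction, 42 + i);
      // Compute the stochastic gradient on miniBatch and update the weights here;
      // an action such as count() is what actually triggers the scan.
      long examplesUsed = miniBatch.count();
    }
  }
}
{code}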
[jira] [Commented] (SPARK-23598) WholeStageCodegen can lead to IllegalAccessError calling append for HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-23598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414984#comment-16414984 ] Dongjoon Hyun commented on SPARK-23598: --- Thank you, [~hvanhovell] ! > WholeStageCodegen can lead to IllegalAccessError calling append for > HashAggregateExec > -- > > Key: SPARK-23598 > URL: https://issues.apache.org/jira/browse/SPARK-23598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: David Vogelbacher >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > Got the following stacktrace for a large QueryPlan using WholeStageCodeGen: > {noformat} > java.lang.IllegalAccessError: tried to access method > org.apache.spark.sql.execution.BufferedRowIterator.append(Lorg/apache/spark/sql/catalyst/InternalRow;)V > from class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass.agg_doAggregateWithKeysOutput$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345){noformat} > After disabling codegen, everything works. > The root cause seems to be that we are trying to call the protected _append_ > method of > [BufferedRowIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java#L68] > from an inner-class of a sub-class that is loaded by a different > class-loader (after codegen compilation). > [https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-5.html#jvms-5.4.4] > states that a protected method _R_ can be accessed only if one of the > following two conditions is fulfilled: > # R is protected and is declared in a class C, and D is either a subclass of > C or C itself. Furthermore, if R is not static, then the symbolic reference > to R must contain a symbolic reference to a class T, such that T is either a > subclass of D, a superclass of D, or D itself. > # R is either protected or has default access (that is, neither public nor > protected nor private), and is declared by a class in the same run-time > package as D. > 2.) doesn't apply as we have loaded the class with a different class loader > (and are in a different package) and 1.) doesn't apply because we are > apparently trying to call the method from an inner class of a subclass of > _BufferedRowIterator_. 
> Looking at the Code path of _WholeStageCodeGen_, the following happens: > # In > [WholeStageCodeGen|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L527], > we create the subclass of _BufferedRowIterator_, along with a _processNext_ > method for processing the output of the child plan. > # In the child, which is a > [HashAggregateExec|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L517], > we create the method which shows up at the top of the stack trace (called > _doAggregateWithKeysOutput_ ) > # We add this method to the compiled code invoking _addNewFunction_ of > [CodeGenerator|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L460] > In the generated function body we call the _append_ method.| > Now, the _addNewFunction_ method states that: > {noformat} > If the code for the `OuterClass` grows too large, the function will be > inlined into a new private, inner class > {noformat} > This indeed seems to happen: the _doAggregateWithKeysOutput_ method is put > into a new private inner class. Thus, it doesn't have access to the protected > _append_ method anymore but still tries to call it, which results in the > _IllegalAccessError._ > Possible fixes: > * Pass in the _inlineTo
[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.x final
[ https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414831#comment-16414831 ] ASF GitHub Bot commented on SPARK-19552: Github user robertdale commented on the issue: https://github.com/apache/tinkerpop/pull/826 Looks like netty is upgraded in Spark 2.3.0 only. https://issues.apache.org/jira/browse/SPARK-19552 > Upgrade Netty version to 4.1.x final > > > Key: SPARK-19552 > URL: https://issues.apache.org/jira/browse/SPARK-19552 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Adam Roberts >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.3.0 > > > Netty 4.1.8 was recently released but isn't API compatible with previous > major versions (like Netty 4.0.x), see > http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details. > This version does include a fix for a security concern but not one we'd be > exposed to with Spark "out of the box". Let's upgrade the version we use to > be on the safe side as the security fix I'm especially interested in is not > available in the 4.0.x release line. > We should move up anyway to take on a bunch of other big fixes cited in the > release notes (and if anyone were to use Spark with netty and tcnative, they > shouldn't be exposed to the security problem) - we should be good citizens > and make this change. > As this 4.1 version involves API changes we'll need to implement a few > methods and possibly adjust the Sasl tests. This JIRA and associated pull > request starts the process which I'll work on - and any help would be much > appreciated! Currently I know: > {code} > @Override > public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise > promise) > throws Exception { > if (!foundEncryptionHandler) { > foundEncryptionHandler = > ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this > returns false and causes test failures > } > ctx.write(msg, promise); > } > {code} > Here's what changes will be required (at least): > {code} > common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code} > requires touch, retain and transferred methods > {code} > common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code} > requires the above methods too > {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code} > With "dummy" implementations so we can at least compile and test, we'll see > five new test failures to address. > These are > {code} > org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption > org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption > org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption > org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption > org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23797) SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
Tin Vu created SPARK-23797: -- Summary: SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto Key: SPARK-23797 URL: https://issues.apache.org/jira/browse/SPARK-23797 Project: Spark Issue Type: Bug Components: Optimizer, Spark Submit, SQL Affects Versions: 2.3.0 Reporter: Tin Vu I am executing a benchmark to compare the performance of SparkSQL, Apache Drill, and Presto. My experimental setup: * TPCDS dataset with scale factor 100 (size 100GB). * Spark, Drill, and Presto have the same number of workers: 12. * Each worker has the same allocated amount of memory: 4GB. * Data is stored by Hive in ORC format. I executed a very simple SQL query: "SELECT * from table_name" The issue is that for some small tables (even tables with a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto needed less than 1 second. For other large tables with billions of records, SparkSQL performance was reasonable: it required 20-30 seconds to scan the whole table. Do you have any idea or reasonable explanation for this issue? Thanks, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
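For what it's worth, a measurement along the lines described could look like the sketch below (the table name is a placeholder and Hive support is assumed; this is not the reporter's actual harness). It times a full scan of one table, which is the operation being compared across engines.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Minimal timing sketch for a full-table scan through SparkSQL.
public class ScanTiming {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("scan-timing").enableHiveSupport().getOrCreate();

    long start = System.nanoTime();
    Dataset<Row> rows = spark.sql("SELECT * FROM table_name");
    long count = rows.count(); // force the scan to actually run
    double seconds = (System.nanoTime() - start) / 1e9;

    System.out.printf("scanned %d rows in %.1f s%n", count, seconds);
  }
}
{code}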
[jira] [Commented] (SPARK-22839) Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction
[ https://issues.apache.org/jira/browse/SPARK-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414739#comment-16414739 ] Apache Spark commented on SPARK-22839: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/20910 > Refactor Kubernetes code for configuring driver/executor pods to use > consistent and cleaner abstraction > --- > > Key: SPARK-22839 > URL: https://issues.apache.org/jira/browse/SPARK-22839 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Priority: Major > Fix For: 2.4.0 > > > As discussed in https://github.com/apache/spark/pull/19954, the current code > for configuring the driver pod vs the code for configuring the executor pods > are not using the same abstraction. Besides that, the current code leaves a > lot to be desired in terms of the level and cleaness of abstraction. For > example, the current code is passing around many pieces of information around > different class hierarchies, which makes code review and maintenance > challenging. We need some thorough refactoring of the current code to achieve > better, cleaner, and consistent abstraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22839) Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction
[ https://issues.apache.org/jira/browse/SPARK-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414738#comment-16414738 ] Matt Cheah commented on SPARK-22839: Design was proposed and agreed upon in [https://docs.google.com/document/d/1XPLh3E2JJ7yeJSDLZWXh_lUcjZ1P0dy9QeUEyxIlfak/edit#.] Will be posting a pull request with the refactor shortly. > Refactor Kubernetes code for configuring driver/executor pods to use > consistent and cleaner abstraction > --- > > Key: SPARK-22839 > URL: https://issues.apache.org/jira/browse/SPARK-22839 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Priority: Major > Fix For: 2.4.0 > > > As discussed in https://github.com/apache/spark/pull/19954, the current code > for configuring the driver pod vs the code for configuring the executor pods > are not using the same abstraction. Besides that, the current code leaves a > lot to be desired in terms of the level and cleaness of abstraction. For > example, the current code is passing around many pieces of information around > different class hierarchies, which makes code review and maintenance > challenging. We need some thorough refactoring of the current code to achieve > better, cleaner, and consistent abstraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23776) pyspark-sql tests should display build instructions when components are missing
[ https://issues.apache.org/jira/browse/SPARK-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23776: Assignee: Apache Spark > pyspark-sql tests should display build instructions when components are > missing > --- > > Key: SPARK-23776 > URL: https://issues.apache.org/jira/browse/SPARK-23776 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > This is a follow up to SPARK-23417. > The pyspark-streaming tests print useful build instructions when certain > components are missing in the build. > pyspark-sql's udf and readwrite tests also have specific build requirements: > the build must compile test scala files, and the build must also create the > Hive assembly. When those class or jar files are not created, the tests throw > only partially helpful exceptions, e.g.: > {noformat} > AnalysisException: u'Can not load class > test.org.apache.spark.sql.JavaStringLength, please make sure it is on the > classpath;' > {noformat} > or > {noformat} > IllegalArgumentException: u"Error while instantiating > 'org.apache.spark.sql.hive.HiveExternalCatalog':" > {noformat} > You end up in this situation when you follow Spark's build instructions and > then attempt to run the pyspark tests. > It would be nice if pyspark-sql tests provide helpful build instructions in > these cases. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23776) pyspark-sql tests should display build instructions when components are missing
[ https://issues.apache.org/jira/browse/SPARK-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23776: Assignee: (was: Apache Spark) > pyspark-sql tests should display build instructions when components are > missing > --- > > Key: SPARK-23776 > URL: https://issues.apache.org/jira/browse/SPARK-23776 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > This is a follow up to SPARK-23417. > The pyspark-streaming tests print useful build instructions when certain > components are missing in the build. > pyspark-sql's udf and readwrite tests also have specific build requirements: > the build must compile test scala files, and the build must also create the > Hive assembly. When those class or jar files are not created, the tests throw > only partially helpful exceptions, e.g.: > {noformat} > AnalysisException: u'Can not load class > test.org.apache.spark.sql.JavaStringLength, please make sure it is on the > classpath;' > {noformat} > or > {noformat} > IllegalArgumentException: u"Error while instantiating > 'org.apache.spark.sql.hive.HiveExternalCatalog':" > {noformat} > You end up in this situation when you follow Spark's build instructions and > then attempt to run the pyspark tests. > It would be nice if pyspark-sql tests provide helpful build instructions in > these cases. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23776) pyspark-sql tests should display build instructions when components are missing
[ https://issues.apache.org/jira/browse/SPARK-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414718#comment-16414718 ] Apache Spark commented on SPARK-23776: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/20909 > pyspark-sql tests should display build instructions when components are > missing > --- > > Key: SPARK-23776 > URL: https://issues.apache.org/jira/browse/SPARK-23776 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > This is a follow up to SPARK-23417. > The pyspark-streaming tests print useful build instructions when certain > components are missing in the build. > pyspark-sql's udf and readwrite tests also have specific build requirements: > the build must compile test scala files, and the build must also create the > Hive assembly. When those class or jar files are not created, the tests throw > only partially helpful exceptions, e.g.: > {noformat} > AnalysisException: u'Can not load class > test.org.apache.spark.sql.JavaStringLength, please make sure it is on the > classpath;' > {noformat} > or > {noformat} > IllegalArgumentException: u"Error while instantiating > 'org.apache.spark.sql.hive.HiveExternalCatalog':" > {noformat} > You end up in this situation when you follow Spark's build instructions and > then attempt to run the pyspark tests. > It would be nice if pyspark-sql tests provide helpful build instructions in > these cases. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj
[ https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23162. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20842 [https://github.com/apache/spark/pull/20842] > PySpark ML LinearRegressionSummary missing r2adj > > > Key: SPARK-23162 > URL: https://issues.apache.org/jira/browse/SPARK-23162 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: kevin yu >Priority: Minor > Labels: starter > Fix For: 2.4.0 > > > Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj
[ https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-23162: Assignee: kevin yu > PySpark ML LinearRegressionSummary missing r2adj > > > Key: SPARK-23162 > URL: https://issues.apache.org/jira/browse/SPARK-23162 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: kevin yu >Priority: Minor > Labels: starter > > Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23572) Update security.md to cover new features
[ https://issues.apache.org/jira/browse/SPARK-23572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23572. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20742 [https://github.com/apache/spark/pull/20742] > Update security.md to cover new features > > > Key: SPARK-23572 > URL: https://issues.apache.org/jira/browse/SPARK-23572 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > I just took a look at {{security.md}} and while it is correct, it covers > functionality that is now sort of obsolete (such as SASL-based encryption > instead of the newer AES encryption support). > We should go over that document and make sure everything is up to date. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23572) Update security.md to cover new features
[ https://issues.apache.org/jira/browse/SPARK-23572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23572: -- Assignee: Marcelo Vanzin > Update security.md to cover new features > > > Key: SPARK-23572 > URL: https://issues.apache.org/jira/browse/SPARK-23572 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > > I just took a look at {{security.md}} and while it is correct, it covers > functionality that is now sort of obsolete (such as SASL-based encryption > instead of the newer AES encryption support). > We should go over that document and make sure everything is up to date. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23736) Extension of the concat function to support array columns
[ https://issues.apache.org/jira/browse/SPARK-23736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marek Novotny updated SPARK-23736: -- Description: Extend the _concat_ function to also support array columns. Example: {{concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3, 10, 20, 30, 100, 200] }} was: Extend the _concat_ function to also support array columns. Example: concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3, 10, 20, 30, 100, 200] > Extension of the concat function to support array columns > - > > Key: SPARK-23736 > URL: https://issues.apache.org/jira/browse/SPARK-23736 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Extend the _concat_ function to also support array columns. > Example: > {{concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3, 10, > 20, 30, 100, 200] }} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
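Once the extension is in place, usage from the DataFrame API would presumably look like the following sketch (a hypothetical example: the column names are made up and the behaviour shown is the proposed one, not something available in released versions):

{code:java}
import static org.apache.spark.sql.functions.array;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ArrayConcatExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("array-concat").master("local[*]").getOrCreate();

    // One row with three array columns, mirroring the example in the description.
    Dataset<Row> df = spark.range(1).select(
        array(lit(1), lit(2), lit(3)).as("a"),
        array(lit(10), lit(20), lit(30)).as("b"),
        array(lit(100), lit(200)).as("c"));

    // With the proposed extension, concat accepts array columns as well as strings.
    df.select(concat(col("a"), col("b"), col("c")).as("merged")).show(false);
    // Expected output: [1, 2, 3, 10, 20, 30, 100, 200]
  }
}
{code}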
[jira] [Updated] (SPARK-23736) Extension of the concat function to support array columns
[ https://issues.apache.org/jira/browse/SPARK-23736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marek Novotny updated SPARK-23736: -- Description: Extend the _concat_ function to also support array columns. Example: concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3,10, 20, 30,100, 200] was: Implement the _concat_arrays_ function that merges two or more array columns into one. If any of children values is null, the function should return null. {{def concat_arrays(columns : Column*): Column }} Example: [1, 2, 3], [10, 20, 30], [100, 200] => [1, 2, 3,10, 20, 30,100, 200] Summary: Extension of the concat function to support array columns (was: collection function: concat_arrays) > Extension of the concat function to support array columns > - > > Key: SPARK-23736 > URL: https://issues.apache.org/jira/browse/SPARK-23736 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Extend the _concat_ function to also support array columns. > Example: > concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3,10, > 20, 30,100, 200] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23672) Document Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23672: Assignee: (was: Apache Spark) > Document Support returning lists in Arrow UDFs > -- > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Documenting the support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23672) Document Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23672: Assignee: Apache Spark > Document Support returning lists in Arrow UDFs > -- > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Assignee: Apache Spark >Priority: Major > > Documenting the support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23672) Document Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414348#comment-16414348 ] Apache Spark commented on SPARK-23672: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/20908 > Document Support returning lists in Arrow UDFs > -- > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Documenting the support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11237) PMML export for ML KMeans
[ https://issues.apache.org/jira/browse/SPARK-11237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414344#comment-16414344 ] Apache Spark commented on SPARK-11237: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/20907 > PMML export for ML KMeans > - > > Key: SPARK-11237 > URL: https://issues.apache.org/jira/browse/SPARK-11237 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk >Priority: Major > > Add PMML export for ML KMeans -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414326#comment-16414326 ] Cody Koeninger commented on SPARK-23739: Ok, the OutOfMemoryError is probably a separate and unrelated issue. > Spark structured streaming long running problem > --- > > Key: SPARK-23739 > URL: https://issues.apache.org/jira/browse/SPARK-23739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Florencio >Priority: Critical > Labels: spark, streaming, structured > > I had a problem with long running spark structured streaming in spark 2.1. > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.common.requests.LeaveGroupResponse. > The detailed error is the following: > 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. > Metadata OffsetSeqMetadata(0,1521216656590) > 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = > Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = > \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}} > 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map() > 18/03/16 16:10:57 ERROR StreamExecution: Query [id = > a233b9ff-cc39-44d3-b953-a255986c04bf, runId = > 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error > java.util.zip.ZipException: invalid code lengths set > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2101) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.map(RDD.scala:369) > at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator > java.lang.NoClassDefFoundError: > org/apache/kafka/common/requests/LeaveGroupResponse > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377) > at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1383) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1364) > at org.apache.sp
[jira] [Updated] (SPARK-23599) The UUID() expression is too non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-23599: -- Fix Version/s: 2.3.1 > The UUID() expression is too non-deterministic > -- > > Key: SPARK-23599 > URL: https://issues.apache.org/jira/browse/SPARK-23599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Liang-Chi Hsieh >Priority: Critical > Fix For: 2.3.1, 2.4.0 > > > The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID > generation. There are a couple of major problems with this: > - It is non-deterministic across task retries. This breaks Spark's processing > model, and this will lead to bugs that are very hard to trace, like non-deterministic > shuffles, duplicates and missing rows. > - It uses a single secure random for UUID generation. This uses a single JVM-wide > lock, and this can lead to lock contention and other performance > problems. > We should move to something that is deterministic between retries. This can > be done by using seeded PRNGs for which we set the seed during planning. It > is important here to use a PRNG that provides enough entropy for creating a > proper UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
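A minimal sketch of the "seeded PRNG" idea follows (not Spark's implementation; java.util.Random stands in for whatever PRNG is ultimately chosen). The seed is fixed up front, so replaying the same sequence after a task retry yields the same UUIDs, and the version/variant bits are set so each value is still a well-formed version-4-style UUID.

{code:java}
import java.util.Random;
import java.util.UUID;

// Deterministic UUID generation from a seed fixed at planning time (illustrative only).
public class SeededUuid {
  private final Random rng;

  public SeededUuid(long seed) {
    // A stronger seeded PRNG could be substituted here if more entropy is needed.
    this.rng = new Random(seed);
  }

  public UUID next() {
    long mostSig = rng.nextLong();
    long leastSig = rng.nextLong();
    // Set the version (4) and variant (IETF) bits so the result is a valid UUID.
    mostSig = (mostSig & ~0x000000000000F000L) | 0x0000000000004000L;
    leastSig = (leastSig & ~0xC000000000000000L) | 0x8000000000000000L;
    return new UUID(mostSig, leastSig);
  }
}
{code}

Re-running a task with the same seed then reproduces the same UUID sequence, which is the determinism property the description asks for.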
[jira] [Comment Edited] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414291#comment-16414291 ] Stavros Kontopoulos edited comment on SPARK-23790 at 3/26/18 6:29 PM: -- Yes that is what I am saying. The initial fix here: [https://github.com/apache/spark/pull/17333] does the trick but I want to have a similar approach with yarn:[https://github.com/apache/spark/pull/17335] that adds delegation tokens in current user's ugi. When I did that I hit the issue with HadoopRDD which fetches its delegation tokens on its own. was (Author: skonto): Yes that is what I am saying. The initial fix here: [https://github.com/apache/spark/pull/17333] does the trick but I want to have a similar approach with yarn that adds delegation tokens in current user's ugi. When I did that I hit the issue with HadoopRDD which fetches its delegation tokens on its own. > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This appeared at a customer trying to integrate with a kerberized hdfs > cluster. > This can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333] and the problem was > reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335] . The last fixes the > problem but leads to a failure when someones uses a HadoopRDD because the > latter, uses FileInputFormat to get the splits which calls the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authenticationat > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that security mode is SIMPLE and hadoop libs there are not aware > of kerberos. > This is related to this issue the workaround decided was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414291#comment-16414291 ] Stavros Kontopoulos edited comment on SPARK-23790 at 3/26/18 6:29 PM: -- Yes that is what I am saying. The initial fix here: [https://github.com/apache/spark/pull/17333] does the trick but I wanted to have a similar approach with yarn:[https://github.com/apache/spark/pull/17335] that adds delegation tokens in current user's ugi. When I did that I hit the issue with HadoopRDD which fetches its delegation tokens on its own. was (Author: skonto): Yes that is what I am saying. The initial fix here: [https://github.com/apache/spark/pull/17333] does the trick but I want to have a similar approach with yarn:[https://github.com/apache/spark/pull/17335] that adds delegation tokens in current user's ugi. When I did that I hit the issue with HadoopRDD which fetches its delegation tokens on its own. > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This appeared at a customer trying to integrate with a kerberized hdfs > cluster. > This can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333] and the problem was > reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335] . The last fixes the > problem but leads to a failure when someones uses a HadoopRDD because the > latter, uses FileInputFormat to get the splits which calls the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authenticationat > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that security mode is SIMPLE and hadoop libs there are not aware > of kerberos. > This is related to this issue the workaround decided was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414291#comment-16414291 ] Stavros Kontopoulos commented on SPARK-23790: - Yes, that is what I am saying. The initial fix here, [https://github.com/apache/spark/pull/17333], does the trick, but I want to take a similar approach to the YARN one and add the delegation tokens to the current user's UGI. When I did that I hit the issue with HadoopRDD, which fetches its delegation tokens on its own. > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This appeared at a customer trying to integrate with a kerberized HDFS > cluster. > It can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333]; the problem was > first reported [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the > problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD > uses FileInputFormat to get the splits, which consults the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authentication at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that the security mode is SIMPLE and the hadoop libs there are not aware > of kerberos. > Related to this, the workaround that was decided on was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
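For illustration, the "add delegation tokens to the current user's UGI" approach discussed above could be sketched as follows (simplified; the renewer principal and paths are placeholders, and error handling is omitted). The point is that tokens placed on the current UGI become visible to code such as FileInputFormat's split computation, which reads them via TokenCache.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;

// Rough sketch: obtain HDFS delegation tokens and attach them to the current user's UGI.
public class AddTokensToUgi {
  public static void addHdfsTokens(Configuration conf, String renewer, Path... paths)
      throws IOException {
    Credentials creds = new Credentials();
    for (Path p : paths) {
      FileSystem fs = p.getFileSystem(conf);
      fs.addDelegationTokens(renewer, creds); // fetch delegation tokens for this filesystem
    }
    // Make the tokens visible to downstream code that reads them from the current UGI.
    UserGroupInformation.getCurrentUser().addCredentials(creds);
  }
}
{code}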
[jira] [Updated] (SPARK-23672) Document Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-23672: Description: Documenting the support for returning lists for individual inputs on non-grouped data inside of PySpark UDFs to better support the wordcount example (and other things but wordcount is the simplest I can think of). (was: Consider to add support for returning lists for individual inputs on non-grouped data inside of PySpark UDFs to better support the wordcount example (and other things but wordcount is the simplest I can think of).) > Document Support returning lists in Arrow UDFs > -- > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Documenting the support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23672) Document Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-23672: Summary: Document Support returning lists in Arrow UDFs (was: Support returning lists in Arrow UDFs) > Document Support returning lists in Arrow UDFs > -- > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Consider to add support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23561) make StreamWriter not a DataSourceWriter subclass
[ https://issues.apache.org/jira/browse/SPARK-23561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23561: Assignee: (was: Apache Spark) > make StreamWriter not a DataSourceWriter subclass > - > > Key: SPARK-23561 > URL: https://issues.apache.org/jira/browse/SPARK-23561 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The inheritance makes little sense now; they've almost entirely diverged. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23561) make StreamWriter not a DataSourceWriter subclass
[ https://issues.apache.org/jira/browse/SPARK-23561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23561: Assignee: Apache Spark > make StreamWriter not a DataSourceWriter subclass > - > > Key: SPARK-23561 > URL: https://issues.apache.org/jira/browse/SPARK-23561 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Apache Spark >Priority: Major > > The inheritance makes little sense now; they've almost entirely diverged. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23561) make StreamWriter not a DataSourceWriter subclass
[ https://issues.apache.org/jira/browse/SPARK-23561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414209#comment-16414209 ] Apache Spark commented on SPARK-23561: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20906 > make StreamWriter not a DataSourceWriter subclass > - > > Key: SPARK-23561 > URL: https://issues.apache.org/jira/browse/SPARK-23561 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The inheritance makes little sense now; they've almost entirely diverged. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414208#comment-16414208 ] Marcelo Vanzin commented on SPARK-23790: BTW if what you're saying is that Yuming's fix also works for the issue you're seeing, we should probably dupe this to the other bug. > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This appeared at a customer trying to integrate with a kerberized hdfs > cluster. > This can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333] and the problem was > reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335] . The last fixes the > problem but leads to a failure when someones uses a HadoopRDD because the > latter, uses FileInputFormat to get the splits which calls the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authenticationat > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that security mode is SIMPLE and hadoop libs there are not aware > of kerberos. > This is related to this issue the workaround decided was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414190#comment-16414190 ] Nicholas Chammas commented on SPARK-22513: -- Thanks for the breakdown. This will be handy for reference. So I guess at the summary level Sean was correct. :D > Provide build profile for hadoop 2.8 > > > Key: SPARK-22513 > URL: https://issues.apache.org/jira/browse/SPARK-22513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Christine Koppelt >Priority: Major > > hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. > Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8. > [1] > https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23672) Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-23672: Description: Consider to add support for returning lists for individual inputs on non-grouped data inside of PySpark UDFs to better support the wordcount example (and other things but wordcount is the simplest I can think of). (was: Consider to add support for returning lists inside of PySpark UDFs to better support the wordcount example.) > Support returning lists in Arrow UDFs > - > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Consider to add support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414176#comment-16414176 ] Marcelo Vanzin commented on SPARK-23790: I haven't had the time to see exactly what spark-cli is doing. This looks the same as SPARK-23639, and I don't like the place where the fix is being made. But I don't know enough about spark-cli yet to suggest something different. > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This appeared at a customer trying to integrate with a kerberized hdfs > cluster. > This can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333] and the problem was > reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335] . The last fixes the > problem but leads to a failure when someones uses a HadoopRDD because the > latter, uses FileInputFormat to get the splits which calls the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authenticationat > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that security mode is SIMPLE and hadoop libs there are not aware > of kerberos. > This is related to this issue the workaround decided was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23796) There's no API to change state RDD's name
[ https://issues.apache.org/jira/browse/SPARK-23796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] István Gansperger updated SPARK-23796: -- Description: I use a few {{mapWithState}} stream oparations in my application and at some point it became a minor inconvenience that I could not figure out how to set the state RDDs name or serialization level. Searching around didn't really help and I have not come across any issues regarding this (pardon my inability to find it if there's one). It could be useful to see how much memory each state uses if the user has multiple such transformations. I have used some ugly reflection based code to be able to set the name of the state RDD and also the serialization level. I understand that the latter may be intentionally limited, but I haven't come across any issues caused by this apart from slightly degraded performance in exchange for a bit less memory usage. Are these limitations in place intentionally or is it just an oversight? Having some extra methods for these on {{StateSpec}} could be useful in my opinion. was: I use a few {{mapWithState}} stream oparations in my application and at some point it became a minor inconvenience that I could not figure out how to set the state RDDs name or serialization level. Searching around didn't really help and I have not come across any issues regarding this (pardon my inability to find it if there's one). It could be useful to see how much memory each state uses if the user has multiple such transformations. I have used some ugly reflection based code to be able to set the name of the state RDD and also the serialization level. I understand that the latter may be intentionally limited, but I haven't come across any issues caused by this apart from sightly degraded performance in exchange for a bit less memory usage. Are these limitations in place intentionally or is it just an oversight? Having some extra methods for these on {{StateSpec}} could be useful in my opinion. > There's no API to change state RDD's name > - > > Key: SPARK-23796 > URL: https://issues.apache.org/jira/browse/SPARK-23796 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: István Gansperger >Priority: Minor > > I use a few {{mapWithState}} stream oparations in my application and at some > point it became a minor inconvenience that I could not figure out how to set > the state RDDs name or serialization level. Searching around didn't really > help and I have not come across any issues regarding this (pardon my > inability to find it if there's one). It could be useful to see how much > memory each state uses if the user has multiple such transformations. > I have used some ugly reflection based code to be able to set the name of the > state RDD and also the serialization level. I understand that the latter may > be intentionally limited, but I haven't come across any issues caused by this > apart from slightly degraded performance in exchange for a bit less memory > usage. Are these limitations in place intentionally or is it just an > oversight? Having some extra methods for these on {{StateSpec}} could be > useful in my opinion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23796) There's no API to change state RDD's name
István Gansperger created SPARK-23796: - Summary: There's no API to change state RDD's name Key: SPARK-23796 URL: https://issues.apache.org/jira/browse/SPARK-23796 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.3.0 Reporter: István Gansperger I use a few {{mapWithState}} stream operations in my application and at some point it became a minor inconvenience that I could not figure out how to set the state RDD's name or serialization level. Searching around didn't really help and I have not come across any issues regarding this (pardon my inability to find it if there's one). It could be useful to see how much memory each state uses if the user has multiple such transformations. I have used some ugly reflection-based code to be able to set the name of the state RDD and also the serialization level. I understand that the latter may be intentionally limited, but I haven't come across any issues caused by this apart from slightly degraded performance in exchange for a bit less memory usage. Are these limitations in place intentionally or is it just an oversight? Having some extra methods for these on {{StateSpec}} could be useful in my opinion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
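For readers hitting the same limitation, a small sketch of the public {{StateSpec}} surface, showing that partitioning, timeout and initial state can be configured but there is no hook for the state RDD's name or storage level. The stream source, paths and numbers below are made up for illustration.
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateSketch {
  // Running word count kept in state; key/value types are illustrative.
  def trackCount(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
    val newCount = one.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(newCount)
    (word, newCount)
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("mapWithState-sketch").setMaster("local[2]"), Seconds(10))
    ssc.checkpoint("/tmp/checkpoint")   // placeholder path; mapWithState requires checkpointing

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+")).map((_, 1))

    // Everything StateSpec exposes today: partitioning, timeout, initial state.
    // There is no setter for the state RDD's name or its storage level.
    val spec = StateSpec.function(trackCount _).numPartitions(8).timeout(Seconds(600))
    val stateful = words.mapWithState(spec)
    stateful.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}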
[jira] [Resolved] (SPARK-23795) AbstractLauncher is not extendable
[ https://issues.apache.org/jira/browse/SPARK-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23795. Resolution: Not A Problem That class is not meant to be extended by outside libraries. > AbstractLauncher is not extendable > -- > > Key: SPARK-23795 > URL: https://issues.apache.org/jira/browse/SPARK-23795 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Dan Sanduleac >Priority: Minor > > The class is {{public abstract}} but because {{self()}} is package-private, > it cannot actually be implemented, which seems like an oversight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23737) Scala API documentation leads to nonexistent pages for sources
[ https://issues.apache.org/jira/browse/SPARK-23737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414111#comment-16414111 ] Alexander Bessonov commented on SPARK-23737: [~sameerag], Making a wild guess the username in the URL is yours. > Scala API documentation leads to nonexistent pages for sources > -- > > Key: SPARK-23737 > URL: https://issues.apache.org/jira/browse/SPARK-23737 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Alexander Bessonov >Priority: Minor > > h3. Steps to reproduce: > # Go to [Scala API > homepage|[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package]]. > # Click "Source: package.scala" > h3. Result: > The link leads to nonexistent page: > [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/package.scala] > h3. Expected result: > The link leads to proper page: > [https://github.com/apache/spark/tree/v2.3.0/core/src/main/scala/org/apache/spark/package.scala] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23737) Scala API documentation leads to nonexistent pages for sources
[ https://issues.apache.org/jira/browse/SPARK-23737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Bessonov reopened SPARK-23737: Okay. The bug isn't fixed and it affects everyone who wants to jump to the source code from ScalaDocs. > Scala API documentation leads to nonexistent pages for sources > -- > > Key: SPARK-23737 > URL: https://issues.apache.org/jira/browse/SPARK-23737 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Alexander Bessonov >Priority: Minor > > h3. Steps to reproduce: > # Go to [Scala API > homepage|[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package]]. > # Click "Source: package.scala" > h3. Result: > The link leads to nonexistent page: > [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/package.scala] > h3. Expected result: > The link leads to proper page: > [https://github.com/apache/spark/tree/v2.3.0/core/src/main/scala/org/apache/spark/package.scala] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23795) AbstractLauncher is not extendable
[ https://issues.apache.org/jira/browse/SPARK-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23795: Assignee: (was: Apache Spark) > AbstractLauncher is not extendable > -- > > Key: SPARK-23795 > URL: https://issues.apache.org/jira/browse/SPARK-23795 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Dan Sanduleac >Priority: Minor > > The class is {{public abstract}} but because {{self()}} is package-private, > it cannot actually be implemented, which seems like an oversight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23795) AbstractLauncher is not extendable
[ https://issues.apache.org/jira/browse/SPARK-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23795: Assignee: Apache Spark > AbstractLauncher is not extendable > -- > > Key: SPARK-23795 > URL: https://issues.apache.org/jira/browse/SPARK-23795 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Dan Sanduleac >Assignee: Apache Spark >Priority: Minor > > The class is {{public abstract}} but because {{self()}} is package-private, > it cannot actually be implemented, which seems like an oversight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23795) AbstractLauncher is not extendable
[ https://issues.apache.org/jira/browse/SPARK-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414092#comment-16414092 ] Apache Spark commented on SPARK-23795: -- User 'dansanduleac' has created a pull request for this issue: https://github.com/apache/spark/pull/20905 > AbstractLauncher is not extendable > -- > > Key: SPARK-23795 > URL: https://issues.apache.org/jira/browse/SPARK-23795 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Dan Sanduleac >Priority: Minor > > The class is {{public abstract}} but because {{self()}} is package-private, > it cannot actually be implemented, which seems like an oversight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23795) AbstractLauncher is not extendable
Dan Sanduleac created SPARK-23795: - Summary: AbstractLauncher is not extendable Key: SPARK-23795 URL: https://issues.apache.org/jira/browse/SPARK-23795 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0, 2.4.0 Reporter: Dan Sanduleac The class is {{public abstract}} but because {{self()}} is package-private, it cannot actually be implemented, which seems like an oversight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
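To illustrate the pattern being described (this is a self-contained sketch, not Spark's actual launcher source), here is a self-typed builder whose package-private {{self()}} prevents subclassing from outside the package:
{code:scala}
package launcher {
  // Mirrors the shape of the reported problem: a public abstract builder whose
  // self() is package-private, so only same-package classes can implement it.
  abstract class AbstractBuilder[T <: AbstractBuilder[T]] {
    private[launcher] def self(): T
    def setConf(key: String, value: String): T = {
      println(s"$key=$value")
      self()
    }
  }

  class InPackageBuilder extends AbstractBuilder[InPackageBuilder] {
    private[launcher] override def self(): InPackageBuilder = this   // compiles: same package
  }
}

object LauncherSketch extends App {
  // A builder defined in any other package could not supply self(), so it could
  // not extend AbstractBuilder at all, which is the complaint in this ticket.
  new launcher.InPackageBuilder().setConf("spark.app.name", "demo")
}
{code}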
[jira] [Commented] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413704#comment-16413704 ] Apache Spark commented on SPARK-23751: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/20904 > Kolmogorov-Smirnoff test Python API in pyspark.ml > - > > Key: SPARK-23751 > URL: https://issues.apache.org/jira/browse/SPARK-23751 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Python wrapper for new DataFrame-based API for KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
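For background on what the Python wrapper would expose, a hedged sketch using the long-standing RDD-based Kolmogorov-Smirnov test in spark.mllib; the new DataFrame-based API in spark.ml that this ticket wraps follows the same idea. The sample data and session settings are made up.
{code:scala}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.SparkSession

object KsTestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ks-sketch").getOrCreate()

    // Sample drawn for illustration; test it against a standard normal N(0, 1).
    val sample = spark.sparkContext.parallelize(Seq(0.1, -0.4, 1.2, 0.3, -0.9, 0.05))
    val result = Statistics.kolmogorovSmirnovTest(sample, "norm", 0.0, 1.0)

    // The result carries the KS statistic, the p-value and the null-hypothesis text.
    println(result.statistic)
    println(result.pValue)

    spark.stop()
  }
}
{code}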
[jira] [Comment Edited] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ] Steve Loughran edited comment on SPARK-22513 at 3/26/18 11:13 AM: -- API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. * There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave" —the eternal losing battle of software engineering. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/Insighs, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535 as the big columnar storage speedups. AWS EMR and google dataproc are both 2.8.x —no idea about changes made. You can build spark against any Hadoop version you like on the 2.x line without problems. {code:java} mvn package -Phadoop-2.7,hadoop-cloud,yarn Dhadoop.version=2.9.0 {code} Against 3.x things compile but Hive is unhappy unless you have one of: a spark hive module with a patch to hive's version check case statement or apache hadoop trunk pretending to be a branch-2 line `-Ddeclared.hadoop.version=2.11`. That works OK for spark build & test but MUST NOT be deployed as HDFS version checking will be unhappy. Clear :)? ps: don't mention Java 9 (HADOOP-11123) 10 (HADOOP-11423) or 11 (HADOOP-15338)]. thanks. was (Author: ste...@apache.org): API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. 
* There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH
[jira] [Comment Edited] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ] Steve Loughran edited comment on SPARK-22513 at 3/26/18 11:10 AM: -- API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) 3.x, which the authors would like backported to the 2.x line. Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. * There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535 as the big columnar storage speedups. Otherwise, you can build against any version you like on the 2.x line without problems. {code:java} mvn install -Phadoop-2.7,hadoop-cloud,yarn Dhadoop.version=2.9.0 {code} Against 3.x things compile but Hive is unhappy unless you have one of: a spark hive module with a patch to hive's version check case statement or apache 3.x branch pretending to be a branch-2 line `-Ddeclared.hadoop.version=2.11`, which works OK for spark build & test but MUST NOT be deployed as HDFS version checking will be unhappy. Clear :)? ps: don't mention Java 9 (HADOOP-11123) 10 (HADOOP-11423) or 11 (HADOOP-15338)]. thanks. was (Author: ste...@apache.org): API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) 3.x, which the authors would like backported to the 2.x line. Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. 
And I have the task of shading that soon. * There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP
[jira] [Assigned] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23751: Assignee: (was: Apache Spark) > Kolmogorov-Smirnoff test Python API in pyspark.ml > - > > Key: SPARK-23751 > URL: https://issues.apache.org/jira/browse/SPARK-23751 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Python wrapper for new DataFrame-based API for KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23751: Assignee: Apache Spark > Kolmogorov-Smirnoff test Python API in pyspark.ml > - > > Key: SPARK-23751 > URL: https://issues.apache.org/jira/browse/SPARK-23751 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Major > > Python wrapper for new DataFrame-based API for KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ] Steve Loughran edited comment on SPARK-22513 at 3/26/18 11:10 AM: -- API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. * There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535 as the big columnar storage speedups. Otherwise, you can build against any version you like on the 2.x line without problems. {code:java} mvn install -Phadoop-2.7,hadoop-cloud,yarn Dhadoop.version=2.9.0 {code} Against 3.x things compile but Hive is unhappy unless you have one of: a spark hive module with a patch to hive's version check case statement or apache 3.x branch pretending to be a branch-2 line `-Ddeclared.hadoop.version=2.11`, which works OK for spark build & test but MUST NOT be deployed as HDFS version checking will be unhappy. Clear :)? ps: don't mention Java 9 (HADOOP-11123) 10 (HADOOP-11423) or 11 (HADOOP-15338)]. thanks. was (Author: ste...@apache.org): API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) 3.x, which the authors would like backported to the 2.x line. Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. 
* There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will ne
[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ] Steve Loughran commented on SPARK-22513: API-wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles, where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8. There's the hadoop-aliyun module in recent 2.x+ (forthcoming 2.9?) and 3.x, which the authors would like backported to the 2.x line. Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. * There are also the inevitable changes in versions of things, jackson inevitably being the most visible. For Hadoop 3 (and 2.9+?) we've moved to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere, but you can bump it up to at least Guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave" —the eternal losing battle of software engineering. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH, HDP and Microsoft HDInsight, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HDInsight, 2.9. I don't know about CDH there; Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535 as the big columnar storage speedups. Otherwise, you can build against any version you like on the 2.x line without problems. {code:java} mvn install -Phadoop-2.7,hadoop-cloud,yarn -Dhadoop.version=2.9.0 {code} Against 3.x things compile, but Hive is unhappy unless you have one of: a spark hive module with a patch to Hive's version-check case statement, or the apache 3.x branch pretending to be a branch-2 line (`-Ddeclared.hadoop.version=2.11`), which works OK for spark build & test but MUST NOT be deployed, as HDFS version checking will be unhappy. Clear :)? ps: don't mention Java 9, 10 or 11. thanks. > Provide build profile for hadoop 2.8 > > > Key: SPARK-22513 > URL: https://issues.apache.org/jira/browse/SPARK-22513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Christine Koppelt >Priority: Major > > hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. > Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] > https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ] Steve Loughran edited comment on SPARK-22513 at 3/26/18 11:11 AM: -- API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. * There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535 as the big columnar storage speedups. Otherwise, you can build spark against any version you like on the 2.x line without problems. {code:java} mvn install -Phadoop-2.7,hadoop-cloud,yarn Dhadoop.version=2.9.0 {code} Against 3.x things compile but Hive is unhappy unless you have one of: a spark hive module with a patch to hive's version check case statement or apache 3.x branch pretending to be a branch-2 line `-Ddeclared.hadoop.version=2.11`, which works OK for spark build & test but MUST NOT be deployed as HDFS version checking will be unhappy. Clear :)? ps: don't mention Java 9 (HADOOP-11123) 10 (HADOOP-11423) or 11 (HADOOP-15338)]. thanks. was (Author: ste...@apache.org): API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. 
* There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535
[jira] [Created] (SPARK-23794) UUID() should be stateful
Herman van Hovell created SPARK-23794: - Summary: UUID() should be stateful Key: SPARK-23794 URL: https://issues.apache.org/jira/browse/SPARK-23794 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Herman van Hovell The UUID() expression is stateful and should implement the Stateful trait instead of the Nondeterministic trait. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413620#comment-16413620 ] Apache Spark commented on SPARK-23599: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/20903 > The UUID() expression is too non-deterministic > -- > > Key: SPARK-23599 > URL: https://issues.apache.org/jira/browse/SPARK-23599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Liang-Chi Hsieh >Priority: Critical > Fix For: 2.4.0 > > > The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID > generation. There are a couple of major problems with this: > - It is non-deterministic across task retries. This breaks Spark's processing > model, and this will lead to very hard to trace bugs, like non-deterministic > shuffles, duplicates and missing rows. > - It uses a single secure random for UUID generation. This uses a single JVM-wide > lock, and this can lead to lock contention and other performance > problems. > We should move to something that is deterministic between retries. This can > be done by using seeded PRNGs for which we set the seed during planning. It > is important here to use a PRNG that provides enough entropy for creating a > proper UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
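A rough sketch of the seeded-PRNG idea described above (an illustration of the approach, not the code from the linked pull request): fix a seed at planning time and re-derive the generator per partition, so a retried task reproduces exactly the same UUIDs. The class and method names are made up.
{code:scala}
import java.util.UUID
import scala.util.Random   // stand-in for Spark's internal XORShiftRandom

// Assumption-labelled sketch: planSeed would be chosen during planning and
// serialized with the expression; initialize() would be called once per partition.
class DeterministicUuidGen(planSeed: Long) extends Serializable {
  @transient private var rng: Random = _

  def initialize(partitionIndex: Int): Unit = {
    rng = new Random(planSeed + partitionIndex)
  }

  def next(): UUID = {
    // Force version 4 and the IETF variant bits so the output is a valid random UUID.
    val msb = (rng.nextLong() & ~0xF000L) | 0x4000L
    val lsb = (rng.nextLong() & 0x3FFFFFFFFFFFFFFFL) | Long.MinValue
    new UUID(msb, lsb)
  }
}

object DeterministicUuidGenDemo {
  def main(args: Array[String]): Unit = {
    val gen = new DeterministicUuidGen(planSeed = 42L)
    gen.initialize(partitionIndex = 0)
    println(gen.next())   // same seed + partition index => same UUID on a retry
  }
}
{code}
java.util.UUID.randomUUID, by contrast, draws from a single JVM-wide SecureRandom, which is both non-deterministic across retries and a lock-contention point, as the description notes.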
[jira] [Commented] (SPARK-23565) Improved error message for when the number of sources for a query changes
[ https://issues.apache.org/jira/browse/SPARK-23565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413570#comment-16413570 ] Roman Maier commented on SPARK-23565: - I can not continue to work on this task in the foreseeable future. Those who wish can take it. > Improved error message for when the number of sources for a query changes > - > > Key: SPARK-23565 > URL: https://issues.apache.org/jira/browse/SPARK-23565 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Patrick McGloin >Priority: Minor > > If you change the number of sources for a Structured Streaming query then you > will get an assertion error as the number of sources in the checkpoint does > not match the number of sources in the query that is starting. This can > happen if, for example, you add a union to the input of the query. This is > of course correct but the error is a bit cryptic and requires investigation. > Suggestion for a more informative error message => > The number of sources for this query has changed. There are [x] sources in > the checkpoint offsets and now there are [y] sources requested by the query. > Cannot continue. > This is the current message. > 02-03-2018 13:14:22 ERROR StreamExecution:91 - Query ORPositionsState to > Kafka [id = 35f71e63-dbd0-49e9-98b2-a4c72a7da80e, runId = > d4439aca-549c-4ef6-872e-29fbfde1df78] terminated with error > java.lang.AssertionError: assertion failed at > scala.Predef$.assert(Predef.scala:156) at > org.apache.spark.sql.execution.streaming.OffsetSeq.toStreamProgress(OffsetSeq.scala:38) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$populateStartOffsets(StreamExecution.scala:429) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:297) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
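A small sketch of the friendlier check the reporter is asking for. This is illustrative only: the real assertion lives in OffsetSeq.toStreamProgress, and the method shape below is simplified; the message text is taken from the suggestion above.
{code:scala}
// Simplified stand-in for the check that currently fails with a bare "assertion failed".
object SourceCountCheck {
  def requireSameSourceCount(checkpointedSources: Int, querySources: Int): Unit = {
    if (checkpointedSources != querySources) {
      throw new IllegalStateException(
        s"The number of sources for this query has changed. There are $checkpointedSources " +
        s"sources in the checkpoint offsets and now there are $querySources sources " +
        "requested by the query. Cannot continue.")
    }
  }

  def main(args: Array[String]): Unit = {
    // Simulates restarting from a one-source checkpoint after a union added a second source.
    try requireSameSourceCount(checkpointedSources = 1, querySources = 2)
    catch { case e: IllegalStateException => println(e.getMessage) }
  }
}
{code}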
[jira] [Commented] (SPARK-23793) Handle database names in spark.udf.register()
[ https://issues.apache.org/jira/browse/SPARK-23793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413542#comment-16413542 ] Takeshi Yamamuro commented on SPARK-23793: -- [~smilegator] This behaviour is expected? > Handle database names in spark.udf.register() > - > > Key: SPARK-23793 > URL: https://issues.apache.org/jira/browse/SPARK-23793 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > spark.udf.register currently ignores database names in function names; > {code} > scala> sql("create database testdb") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.udf.register("testdb.testfunc", (a: Int) => a + 1) > res1: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,IntegerType,Some(List(IntegerType))) > scala> sql("select testdb.testfunc(1)").show > org.apache.spark.sql.AnalysisException: Undefined function: 'testfunc'. This > function is neither a registered temporary function nor a permanent function > registered in the database 'testdb'.; line 1 pos 7 > at > org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1198) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1198) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23793) Handle database names in spark.udf.register()
Takeshi Yamamuro created SPARK-23793: Summary: Handle database names in spark.udf.register() Key: SPARK-23793 URL: https://issues.apache.org/jira/browse/SPARK-23793 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takeshi Yamamuro spark.udf.register currently ignores database names in function names; {code} scala> sql("create database testdb") res0: org.apache.spark.sql.DataFrame = [] scala> spark.udf.register("testdb.testfunc", (a: Int) => a + 1) res1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(,IntegerType,Some(List(IntegerType))) scala> sql("select testdb.testfunc(1)").show org.apache.spark.sql.AnalysisException: Undefined function: 'testfunc'. This function is neither a registered temporary function nor a permanent function registered in the database 'testdb'.; line 1 pos 7 at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1198) at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1198) at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
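For comparison, a hedged sketch of the two registration paths: {{spark.udf.register}} creates a session-scoped temporary function whose name is looked up as-is (so the database qualifier is effectively ignored), while a database-qualified permanent function is created through SQL against a UDF class on the classpath. The class name and jar path below are placeholders.
{code:scala}
import org.apache.spark.sql.SparkSession

object UdfRegisterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()

    // Temporary function: only the bare, unqualified name resolves.
    spark.udf.register("testfunc", (a: Int) => a + 1)
    spark.sql("SELECT testfunc(1)").show()

    // Permanent, database-qualified function: registered through SQL against a
    // Hive UDF implementation (class and jar are hypothetical).
    // spark.sql("CREATE FUNCTION testdb.testfunc AS 'com.example.TestFunc' USING JAR '/path/to/udf.jar'")

    spark.stop()
  }
}
{code}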
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413530#comment-16413530 ] Florencio commented on SPARK-23739: --- Thanks Cody. The Kafka version is 0.10.0.1, and I checked that the class org.apache.kafka.common.requests.LeaveGroupResponse is inside the assembly. Additionally, we found that Kafka 0.8 had been installed in the cluster for use by Druid; after uninstalling that Kafka version the application no longer hits the ClassNotFoundException, but now I get java.lang.OutOfMemoryError: Java heap space. I attach part of the error: _18/03/23 15:20:08 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:20:12 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM_ _18/03/23 15:20:12 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:20:16 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:20:21 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:20:25 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:20:28 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException_ _java.util.concurrent.TimeoutException_ _at java.util.concurrent.FutureTask.get(FutureTask.java:205)_ _at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)_ _18/03/23 15:20:33 WARN TransportChannelHandler: Exception in connection from NODE/NODE_IP:33018_ _java.io.IOException: Connection reset by peer_ _at sun.nio.ch.FileDispatcherImpl.read0(Native Method)_ _at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)_ _at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)_ _at sun.nio.ch.IOUtil.read(IOUtil.java:192)_ _at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)_ _at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)_ _at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)_ _at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)_ _at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)_ _at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)_ _at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)_ _at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)_ _at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)_ _at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)_ _at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)_ _at java.lang.Thread.run(Thread.java:745)_ _18/03/23 15:20:47 ERROR TransportResponseHandler: Still have 4 requests outstanding when connection from NODE/NODE_IP:33018 is closed_ _18/03/23 15:20:47 ERROR OneForOneBlockFetcher: Failed while starting block fetches_ _java.io.IOException: Connection reset by peer_ _at sun.nio.ch.FileDispatcherImpl.read0(Native Method)_ _at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)_ _at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)_ _at sun.nio.ch.IOUtil.read(IOUtil.java:192)_ _at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)_ _at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)_ _18/03/23 15:21:02 
INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms_ _18/03/23 15:21:01 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:21:05 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:21:05 ERROR TransportRequestHandler: Error sending result RpcResponse\{requestId=5287967802569519759, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to /172.20.13.58:33820; closing connection_ _io.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Java heap space_ _at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)_ _at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)_ _at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735)_ _at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820)_ _at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728)_ _at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284)_ _at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)_ _at io.netty.channel.AbstractChannelHa