[jira] [Commented] (SPARK-23594) Add interpreted execution for GetExternalRowField expression
[ https://issues.apache.org/jira/browse/SPARK-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387396#comment-16387396 ] Apache Spark commented on SPARK-23594: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/20746 > Add interpreted execution for GetExternalRowField expression > > > Key: SPARK-23594 > URL: https://issues.apache.org/jira/browse/SPARK-23594 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23594) Add interpreted execution for GetExternalRowField expression
[ https://issues.apache.org/jira/browse/SPARK-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23594: Assignee: Apache Spark > Add interpreted execution for GetExternalRowField expression > > > Key: SPARK-23594 > URL: https://issues.apache.org/jira/browse/SPARK-23594 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Major
[jira] [Assigned] (SPARK-23594) Add interpreted execution for GetExternalRowField expression
[ https://issues.apache.org/jira/browse/SPARK-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23594: Assignee: (was: Apache Spark) > Add interpreted execution for GetExternalRowField expression > > > Key: SPARK-23594 > URL: https://issues.apache.org/jira/browse/SPARK-23594 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major
[jira] [Commented] (SPARK-23594) Add interpreted execution for GetExternalRowField expression
[ https://issues.apache.org/jira/browse/SPARK-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387334#comment-16387334 ] Takeshi Yamamuro commented on SPARK-23594: -- If nobody is working on this, I'll take it. > Add interpreted execution for GetExternalRowField expression > > > Key: SPARK-23594 > URL: https://issues.apache.org/jira/browse/SPARK-23594 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major
[jira] [Commented] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387315#comment-16387315 ] Takeshi Yamamuro commented on SPARK-23580: -- I'll join this work, too. > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenarios where code generation does > not work, or blows past the JVM class limits. We currently cannot gracefully > fall back. > This ticket is an umbrella to fix this class of problem in Spark SQL. This > work can be divided into two main areas: > - Add interpreted versions for all dataset related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}.
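The graceful fallback asked for in SPARK-23580 can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not Spark's implementation: the class `FallbackSketch`, the method `createWithFallback`, and both suppliers are hypothetical names. The sketch only shows the control flow of trying the code-generated path first and using an interpreted implementation when that path fails.

```java
import java.util.function.Supplier;

/** Hypothetical sketch: prefer a code-generated implementation, fall back to an interpreted one. */
final class FallbackSketch {
    private FallbackSketch() {}

    /**
     * Returns the object built by {@code codegen}; if that supplier throws a
     * RuntimeException (standing in for a failed compilation), returns the
     * object built by {@code interpreted} instead.
     */
    static <T> T createWithFallback(Supplier<T> codegen, Supplier<T> interpreted) {
        try {
            return codegen.get();
        } catch (RuntimeException e) {
            // Assumption: code generation failures surface as runtime exceptions here.
            return interpreted.get();
        }
    }
}
```

A caller would pass the factory for the generated projection as the first argument and the factory for its interpreted counterpart as the second, so the choice is made once when the object is created rather than per row.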
[jira] [Commented] (SPARK-23545) [Spark-Core] port opened by the SparkDriver is vulnerable for flooding attacks
[ https://issues.apache.org/jira/browse/SPARK-23545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387307#comment-16387307 ] sandeep katta commented on SPARK-23545: --- I will be working on this bug. The proposed solution is as follows: 1. Send a heartbeat (a one-way message) from the application master to the driver, so that the driver treats this channel as active. 2. The driver can then close all inactive channels. If you have any questions about this solution, please feel free to comment. > [Spark-Core] port opened by the SparkDriver is vulnerable for flooding attacks > -- > > Key: SPARK-23545 > URL: https://issues.apache.org/jira/browse/SPARK-23545 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sandeep katta >Priority: Major > > The port opened by the SparkDriver is vulnerable to flooding attacks > *Steps*: > set spark.network.timeout=60s //can be any value > Start the thriftserver in client mode; the logs below show that the > Spark driver opens the port for the AM and executors to communicate. > Logs: > 018-03-01 16:11:16,497 | INFO | [main] | Successfully started service > *'sparkDriver'* on port *22643*. | > org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) > 2018-03-01 16:11:17,265 | INFO | [main] | Successfully started service > 'SparkUI' on port 22950. | > org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) > 2018-03-01 16:11:44,640 | INFO | [main] | Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 22663.
| > org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) > 2018-03-01 16:11:52,822 | INFO | [Thread-56] | Starting > ThriftBinaryCLIService on port 22550 with 5...501 worker threads | > org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:111) > Do telnet to this port using *telnet IP 22643* command and keep it idle, > after 60 seconds check the status, connection is still established, it should > be terminated > *lsof command output along with the date* > > host1:/var/ # date > Thu Mar 1 *16:12:55* CST 2018 > host1:/var/ # lsof | grep 22643 > java 66730 user1 292u IPv6 1482635919 0t0 TCP > host1:22643->*10.18.152.191:59297* (ESTABLISHED) > java 66730 user1 297u IPv6 1482374122 0t0 TCP > host1:22643->BLR118529:43894 (ESTABLISHED) > java 66730 user1 346u IPv6 1482314249 0t0 TCP host1:22643 (LISTEN) > host1:/var/ # date > Thu Mar 1 16:13:43 CST 2018 > host1:/var/ # date > Thu Mar 1 *16:16:55* CST 2018 > host1:/var/ # lsof | grep 22643 > java 66730 user1 292u IPv6 1482635919 0t0 TCP > host1:22643->*10.18.152.191:59297* (ESTABLISHED) > java 66730 user1 297u IPv6 1482374122 0t0 TCP > host1:22643->BLR118529:43894 (ESTABLISHED) > java 66730 user1 346u IPv6 1482314249 0t0 TCP host1:22643 (LISTEN) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
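The two-step proposal in the comment on SPARK-23545 (heartbeats mark a channel as active; the driver closes channels that stay silent) can be sketched as below. This is only an illustration of the bookkeeping, assuming a hypothetical `IdleChannelTracker` class with string channel ids and caller-supplied timestamps; it does not use or describe Spark's actual network layer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of heartbeat-based idle-channel tracking; not Spark's network code. */
final class IdleChannelTracker {
    private final long timeoutMs;
    // channel id -> timestamp (ms) of the last message or heartbeat seen on that channel
    private final Map<String, Long> lastActive = new ConcurrentHashMap<>();

    IdleChannelTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Records activity (a regular message or a one-way heartbeat) on a channel. */
    void touch(String channelId, long nowMs) {
        lastActive.put(channelId, nowMs);
    }

    /** Removes and returns the channels that have been silent for longer than the timeout. */
    List<String> evictIdle(long nowMs) {
        List<String> idle = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastActive.entrySet()) {
            boolean expired = nowMs - e.getValue() > timeoutMs;
            // remove(key, value) only succeeds if no heartbeat arrived in the meantime
            if (expired && lastActive.remove(e.getKey(), e.getValue())) {
                idle.add(e.getKey());
            }
        }
        return idle;
    }
}
```

In the scenario from the report, the idle telnet connection never calls `touch`, so it would be returned by `evictIdle` once the 60-second timeout has elapsed, while a channel that keeps sending heartbeats stays registered. The caller is responsible for actually closing the returned channels.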
[jira] [Commented] (SPARK-23537) Logistic Regression without standardization
[ https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387276#comment-16387276 ] Teng Peng commented on SPARK-23537: --- This is quite an interesting question and I do not have an answer yet: Do we need standardization for L-BFGS in the first place? > Logistic Regression without standardization > --- > > Key: SPARK-23537 > URL: https://issues.apache.org/jira/browse/SPARK-23537 > Project: Spark > Issue Type: Bug > Components: ML, Optimizer >Affects Versions: 2.0.2, 2.2.1 >Reporter: Jordi >Priority: Major > Attachments: non-standardization.log, standardization.log > > > I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer > not to use standardization since all my features are binary, using the > hashing trick (2^20 sparse vector). > I trained two models to compare results. I was expecting to end up with two > similar models since it seems that internally the optimizer performs > standardization and "de-standardization" (when it's deactivated) in order to > improve the convergence. > Here is the code I used: > {code:java} > val lr = new org.apache.spark.ml.classification.LogisticRegression() > .setRegParam(0.05) > .setElasticNetParam(0.0) > .setFitIntercept(true) > .setMaxIter(5000) > .setStandardization(false) > val model = lr.fit(data) > {code} > The results puzzle me: I end up with two significantly different models. > *Standardization:* > Training time: 8min.
> Iterations: 37 > Intercept: -4.386090107224499 > Max weight: 4.724752299455218 > Min weight: -3.560570478164854 > Mean weight: -0.049325201841722795 > l1 norm: 116710.39522171849 > l2 norm: 402.2581552373957 > Non zero weights: 128084 > Non zero ratio: 0.12215042114257812 > Last 10 LBFGS Val and Grand Norms: > {code:java} > 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) > 0.000559057 > 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) > 0.000267527 > 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) > 0.000205888 > 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) > 0.000144173 > 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) > 0.000140296 > 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) > 0.000122709 > 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) > 3.08789e-05 > 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) > 2.23806e-05 > 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) > 1.47422e-05 > 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) > 2.37442e-05 > {code} > *No standardization:* > Training time: 7h 14 min. 
> Iterations: 4992 > Intercept: -4.216690468849263 > Max weight: 0.41930559767624725 > Min weight: -0.5949182537565524 > Mean weight: -1.2659769019012E-6 > l1 norm: 14.262025330648694 > l2 norm: 1.2508777025612263 > Non zero weights: 128955 > Non zero ratio: 0.12298107147216797 > Last 10 LBFGS Val and Grand Norms: > {code:java} > 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) > 0.217581 > 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) > 0.185812 > 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) > 0.214570 > 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) > 0.489464 > 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) > 0.178448 > 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) > 0.172527 > 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) > 0.189389 > 18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) > 0.480678 > 18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) > 0.184529 > 18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) > 0.154329 > {code} > Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
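One way to see why the two settings in SPARK-23537 need not produce the same model is to write out an L2-regularized logistic objective in both feature scalings. The following is a general derivation, not taken from Spark's source. The symbols are notation introduced here for illustration: $\sigma_j$ is the standard deviation of feature $j$, $\lambda$ is the regularization strength (0.05 in the report), $\ell$ is the logistic loss, and $b$ is the intercept.

```latex
\begin{aligned}
\tilde{x}_j &= x_j / \sigma_j, \qquad \tilde{w}_j = \sigma_j w_j
  \;\Longrightarrow\; \tilde{w}^\top \tilde{x} = w^\top x, \\
J_{\text{std}}(w) &= \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(y_i,\; w^\top x_i + b\right)
  + \frac{\lambda}{2} \sum_j \tilde{w}_j^2
  = \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(y_i,\; w^\top x_i + b\right)
  + \frac{\lambda}{2} \sum_j \sigma_j^2 w_j^2, \\
J_{\text{raw}}(w) &= \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(y_i,\; w^\top x_i + b\right)
  + \frac{\lambda}{2} \sum_j w_j^2 .
\end{aligned}
```

The data term is identical in both objectives, but the penalty terms agree only if every $\sigma_j = 1$, so with a nonzero regularization parameter the two minimizers generally differ. For a binary feature that equals 1 with frequency $p$, $\sigma^2 = p(1-p) \le 1/4$, so penalizing the standardized weights is the weaker penalty on $w$; that would be consistent with the larger weights reported for the standardized run. This observation does not by itself account for the difference in iteration counts.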
[jira] [Commented] (SPARK-23288) Incorrect number of written records in structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387266#comment-16387266 ] Apache Spark commented on SPARK-23288: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/20745 > Incorrect number of written records in structured streaming > --- > > Key: SPARK-23288 > URL: https://issues.apache.org/jira/browse/SPARK-23288 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Yuriy Bondaruk >Priority: Major > Labels: Metrics, metrics > > I'm using SparkListener.onTaskEnd() to capture input and output metrics but > it seems that number of written records > ('taskEnd.taskMetrics().outputMetrics().recordsWritten()') is incorrect. Here > is my stream construction: > > {code:java} > StreamingQuery writeStream = session > .readStream() > .schema(RecordSchema.fromClass(TestRecord.class)) > .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB) > .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF) > .csv(inputFolder.getRoot().toPath().toString()) > .as(Encoders.bean(TestRecord.class)) > .flatMap( > ((FlatMapFunction) (u) -> { > List resultIterable = new ArrayList<>(); > try { > TestVendingRecord result = transformer.convert(u); > resultIterable.add(result); > } catch (Throwable t) { > System.err.println("Ooops"); > t.printStackTrace(); > } > return resultIterable.iterator(); > }), > Encoders.bean(TestVendingRecord.class)) > .writeStream() > .outputMode(OutputMode.Append()) > .format("parquet") > .option("path", outputFolder.getRoot().toPath().toString()) > .option("checkpointLocation", > checkpointFolder.getRoot().toPath().toString()) > .start(); > writeStream.processAllAvailable(); > writeStream.stop(); > {code} > Tested it with one good and one bad (throwing exception in > transformer.convert(u)) input records and it produces following metrics: > > {code:java} > 
(TestMain.java:onTaskEnd(73)) - ---status--> SUCCESS > (TestMain.java:onTaskEnd(75)) - ---recordsWritten--> 0 > (TestMain.java:onTaskEnd(76)) - ---recordsRead-> 2 > (TestMain.java:onTaskEnd(83)) - taskEnd.taskInfo().accumulables(): > (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max) > (TestMain.java:onTaskEnd(85)) - value = 323 > (TestMain.java:onTaskEnd(84)) - name = number of output rows > (TestMain.java:onTaskEnd(85)) - value = 2 > (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max) > (TestMain.java:onTaskEnd(85)) - value = 364 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.recordsRead > (TestMain.java:onTaskEnd(85)) - value = 2 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.bytesRead > (TestMain.java:onTaskEnd(85)) - value = 157 > (TestMain.java:onTaskEnd(84)) - name = > internal.metrics.resultSerializationTime > (TestMain.java:onTaskEnd(85)) - value = 3 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.resultSize > (TestMain.java:onTaskEnd(85)) - value = 2396 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorCpuTime > (TestMain.java:onTaskEnd(85)) - value = 633807000 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorRunTime > (TestMain.java:onTaskEnd(85)) - value = 683 > (TestMain.java:onTaskEnd(84)) - name = > internal.metrics.executorDeserializeCpuTime > (TestMain.java:onTaskEnd(85)) - value = 55662000 > (TestMain.java:onTaskEnd(84)) - name = > internal.metrics.executorDeserializeTime > (TestMain.java:onTaskEnd(85)) - value = 58 > (TestMain.java:onTaskEnd(89)) - input records 2 > Streaming query made progress: { > "id" : "1231f9cb-b2e8-4d10-804d-73d7826c1cb5", > "runId" : "bd23b60c-93f9-4e17-b3bc-55403edce4e7", > "name" : null, > "timestamp" : "2018-01-26T14:44:05.362Z", > "numInputRows" : 2, > "processedRowsPerSecond" : 0.8163265306122448, > "durationMs" : { > "addBatch" : 1994, > "getBatch" : 126, > "getOffset" : 52, > "queryPlanning" : 220, > 
"triggerExecution" : 2450, > "walCommit" : 41 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : > "FileStreamSource[file:/var/folders/4w/zks_kfls2s3glmrj3f725p7hllyb5_/T/junit3661035412295337071]", > "startOffset" : null, > "endOffset" : { > "logOffset" : 0 > }, > "numInputRows" : 2, > "processedRowsPerSecond" : 0.8163265306122448 > } ], > "sink" : { >
[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage
[ https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387247#comment-16387247 ] guoxiaolongzte commented on SPARK-23433: I also encountered the same problem. Who can solve it? > java.lang.IllegalStateException: more than one active taskSet for stage > --- > > Key: SPARK-23433 > URL: https://issues.apache.org/jira/browse/SPARK-23433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shixiong Zhu >Priority: Major > > The following error, thrown by DAGScheduler, stopped the cluster: > {code} > 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage > 7580621: 7580621.2,7580621.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880) > at >
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387245#comment-16387245 ] Cody Koeninger commented on SPARK-18057: It's probably easiest to keep the KIP discussion on the Kafka mailing list. I personally think overloads taking a timeout is probably the most flexible option, but I think any of the options under discussion (overloads, reuse existing config, new config) are workable from a Spark integration point of view. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Commented] (SPARK-23582) Add interpreted execution to StaticInvoke expression
[ https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387230#comment-16387230 ] Kazuaki Ishizaki commented on SPARK-23582: -- I am working on this > Add interpreted execution to StaticInvoke expression > > > Key: SPARK-23582 > URL: https://issues.apache.org/jira/browse/SPARK-23582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major
[jira] [Commented] (SPARK-23583) Add interpreted execution to Invoke expression
[ https://issues.apache.org/jira/browse/SPARK-23583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387231#comment-16387231 ] Kazuaki Ishizaki commented on SPARK-23583: -- I am working on this > Add interpreted execution to Invoke expression > -- > > Key: SPARK-23583 > URL: https://issues.apache.org/jira/browse/SPARK-23583 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387217#comment-16387217 ] Cody Koeninger commented on SPARK-19767: [~nafshartous] I think at this point people are more likely to need easy access to the API docs for the 0.10 artifact rather than the 0.8. Do you agree? If so, look at commit 1f0d0213 for what I did to skip 0.10. If you want to change that to skip 0.8 instead, and include 0.10, I'd support that. Although now that I look at it, the latest api docs [http://spark.apache.org/docs/latest/api/scala/index.html] no longer have the kafka namespace at all, whereas [http://spark.apache.org/docs/2.2.1/api/scala/index.html] still did. Haven't dug into why. > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > leads to API pages that do not have class ConsumerStrategies. The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka10{code} > as in the code examples on streaming-kafka-0-10-integration.html.
[jira] [Commented] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387191#comment-16387191 ] Liang-Chi Hsieh commented on SPARK-22446: - Yeah, sounds good. > Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" > exception incorrectly for filtered data. > --- > > Key: SPARK-22446 > URL: https://issues.apache.org/jira/browse/SPARK-22446 > Project: Spark > Issue Type: Bug > Components: ML, Optimizer >Affects Versions: 2.0.2, 2.1.2, 2.2.1 > Environment: spark-shell, local mode, macOS Sierra 10.12.6 >Reporter: Greg Bellchambers >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.3.0 > > > In the following, the `indexer` UDF defined inside the > `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an > "Unseen label" error, despite the label not being present in the transformed > DataFrame. > Here is the definition of the indexer UDF in the transform method: > {code:java} > val indexer = udf { label: String => > if (labelToIndex.contains(label)) { > labelToIndex(label) > } else { > throw new SparkException(s"Unseen label: $label.") > } > } > {code} > We can demonstrate the error with a very simple example DataFrame. > {code:java} > scala> import org.apache.spark.ml.feature.StringIndexer > import org.apache.spark.ml.feature.StringIndexer > scala> // first we create a DataFrame with three cities > scala> val df = List( > | ("A", "London", "StrA"), > | ("B", "Bristol", null), > | ("C", "New York", "StrC") > | ).toDF("ID", "CITY", "CONTENT") > df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 
1 more > field] > scala> df.show > +---++---+ > | ID|CITY|CONTENT| > +---++---+ > | A| London| StrA| > | B| Bristol| null| > | C|New York| StrC| > +---++---+ > scala> // then we remove the row with null in CONTENT column, which removes > Bristol > scala> val dfNoBristol = df.filter($"CONTENT".isNotNull) > dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: > string, CITY: string ... 1 more field] > scala> dfNoBristol.show > +---++---+ > | ID|CITY|CONTENT| > +---++---+ > | A| London| StrA| > | C|New York| StrC| > +---++---+ > scala> // now create a StringIndexer for the CITY column and fit to > dfNoBristol > scala> val model = { > | new StringIndexer() > | .setInputCol("CITY") > | .setOutputCol("CITYIndexed") > | .fit(dfNoBristol) > | } > model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb > scala> // the StringIndexerModel has only two labels: "London" and "New York" > scala> model.labels foreach println > London > New York > scala> // transform our DataFrame to add an index column > scala> val dfWithIndex = model.transform(dfNoBristol) > dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 > more fields] > scala> dfWithIndex.show > +---++---+---+ > | ID|CITY|CONTENT|CITYIndexed| > +---++---+---+ > | A| London| StrA|0.0| > | C|New York| StrC|1.0| > +---++---+---+ > {code} > The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` > equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method > throws an exception reporting unseen label "Bristol".
This is surprising > behaviour as far as the user of the API is concerned, because there is no > such value as "Bristol" when we show all rows of `dfWithIndex`: > {code:java} > scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count > 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40) > org.apache.spark.SparkException: Failed to execute user defined > function($anonfun$5: (string) => double) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) >
[jira] [Updated] (SPARK-23603) When the length of the json is in a range,get_json_object will result in missing tail data
[ https://issues.apache.org/jira/browse/SPARK-23603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-23603: --- Labels: (was: ca) > When the length of the json is in a range,get_json_object will result in > missing tail data > -- > > Key: SPARK-23603 > URL: https://issues.apache.org/jira/browse/SPARK-23603 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.2.0, 2.3.0 >Reporter: dzcxzl >Priority: Major > > Jackson(>=2.7.7) fixes the possibility of missing tail data when the length > of the value is in a range > [https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7] > [https://github.com/FasterXML/jackson-core/issues/307] > > spark-shell: > > {code:java} > val value = "x" * 3000 > val json = s"""{"big": "$value"}""" > spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect > res0: Array[org.apache.spark.sql.Row] = Array([2991]) > {code} > correct result : 3000 > > > There are two solutions > One is > bump jackson version to 2.7.7 > The other one is > Replace writeRaw(char[] text, int offset, int len) with writeRaw(String text) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23603) When the length of the json is in a range,get_json_object will result in missing tail data
[ https://issues.apache.org/jira/browse/SPARK-23603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-23603: --- Labels: ca (was: ) > When the length of the json is in a range,get_json_object will result in > missing tail data > -- > > Key: SPARK-23603 > URL: https://issues.apache.org/jira/browse/SPARK-23603 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.2.0, 2.3.0 >Reporter: dzcxzl >Priority: Major > Labels: ca > > Jackson(>=2.7.7) fixes the possibility of missing tail data when the length > of the value is in a range > [https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7] > [https://github.com/FasterXML/jackson-core/issues/307] > > spark-shell: > > {code:java} > val value = "x" * 3000 > val json = s"""{"big": "$value"}""" > spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect > res0: Array[org.apache.spark.sql.Row] = Array([2991]) > {code} > correct result : 3000 > > > There are two solutions > One is > bump jackson version to 2.7.7 > The other one is > Replace writeRaw(char[] text, int offset, int len) with writeRaw(String text) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23608) SHS needs synchronization between attachSparkUI and detachSparkUI functions
[ https://issues.apache.org/jira/browse/SPARK-23608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23608: Assignee: Apache Spark > SHS needs synchronization between attachSparkUI and detachSparkUI functions > --- > > Key: SPARK-23608 > URL: https://issues.apache.org/jira/browse/SPARK-23608 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0 >Reporter: Ye Zhou >Assignee: Apache Spark >Priority: Minor > > We repeatedly hit an issue with SHS after it has run for a while and has served some > REST API calls. SHS suddenly shows an empty home page with 0 > applications. It is caused by unexpected data returned from the REST call > "api/v1/applications?limit=8000". This REST call returns the home page HTML > instead of the list of application summaries. Some other REST calls that ask > for detailed application information also return the home page HTML, but > some REST calls still work. We have to restart SHS to resolve the > issue. > We attached a remote debugger to the problematic process and checked the > tree of attached Jetty handlers in the web server. We found that the Jetty > handler added by "attachHandler(ApiRootResource.getServletHandler(this))" is > missing from the tree, as are some other handlers. Without the root resource > servlet handler, SHS cannot correctly serve either UI or REST calls. > SHS returns the HistoryServerPage HTML directly to the user because it cannot find > a handler for the request. > Spark History Server has to call attachSparkUI in order to serve user requests. > The application's SparkUI gets attached when the application details data > is loaded into the Guava cache. While attaching the SparkUI, SHS attaches > all of its Jetty handlers to the current web service. When the data is > cleared out of the Guava cache, SHS detaches all of the application's SparkUI > Jetty handlers.
Due to the asynchronous feature in Guava Cache, the clear out > from cache is not synchronized with loading into cache. The actual clear out > in Guava Cache which triggers detachSparkUI might be detaching the handlers > while the attachSparkUI is attaching jetty handlers. > After adding synchronization between attachSparkUI and detachSparkUI in > history server, this issue never happens again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
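The race described in this report, and the effect of serializing the two operations, can be sketched roughly as follows. This is an illustration only: HandlerTreeSketch, its methods, the handler paths, and the application IDs are all hypothetical stand-ins, not the actual HistoryServer code. It also assumes the fix takes the form of one shared lock around both operations, which the report itself does not spell out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the history server's Jetty handler tree.
// None of these names come from Spark; they only illustrate the locking idea.
final class HandlerTreeSketch {
    private final Object lock = new Object();
    // A plain ArrayList is not thread-safe, so all access goes through the lock.
    private final List<String> handlers = new ArrayList<>();

    // Called when an application's data is loaded into the cache.
    void attachSparkUI(String appId, List<String> paths) {
        synchronized (lock) {
            for (String path : paths) {
                handlers.add(appId + path);
            }
        }
    }

    // Called when an application's data is cleared out of the cache.
    void detachSparkUI(String appId, List<String> paths) {
        synchronized (lock) {
            for (String path : paths) {
                handlers.remove(appId + path);
            }
        }
    }

    // Returns a copy so callers never see the list mid-update.
    List<String> snapshot() {
        synchronized (lock) {
            return new ArrayList<>(handlers);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HandlerTreeSketch tree = new HandlerTreeSketch();
        List<String> paths = Arrays.asList("/jobs", "/stages", "/environment");
        tree.attachSparkUI("app-1", paths);

        // Load app-2 while app-1 is being cleared out, as the cache may do.
        Thread loader = new Thread(() -> tree.attachSparkUI("app-2", paths));
        Thread evictor = new Thread(() -> tree.detachSparkUI("app-1", paths));
        loader.start();
        evictor.start();
        loader.join();
        evictor.join();

        System.out.println(tree.snapshot());
    }
}
```

Because both methods take the same lock, each attach or detach runs to completion before the other starts, so the two threads cannot interleave their updates to the shared handler list.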
[jira] [Assigned] (SPARK-23608) SHS needs synchronization between attachSparkUI and detachSparkUI functions
[ https://issues.apache.org/jira/browse/SPARK-23608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23608:

    Assignee:     (was: Apache Spark)
[jira] [Commented] (SPARK-23608) SHS needs synchronization between attachSparkUI and detachSparkUI functions
[ https://issues.apache.org/jira/browse/SPARK-23608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387176#comment-16387176 ]

Apache Spark commented on SPARK-23608:

User 'zhouyejoe' has created a pull request for this issue:
https://github.com/apache/spark/pull/20744
[jira] [Commented] (SPARK-23608) SHS needs synchronization between attachSparkUI and detachSparkUI functions
[ https://issues.apache.org/jira/browse/SPARK-23608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387175#comment-16387175 ]

Ye Zhou commented on SPARK-23608:

Pull Request: https://github.com/apache/spark/pull/20744
[jira] [Commented] (SPARK-23608) SHS needs synchronization between attachSparkUI and detachSparkUI functions
[ https://issues.apache.org/jira/browse/SPARK-23608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387162#comment-16387162 ]

Ye Zhou commented on SPARK-23608:

I will post a pull request for this minor change. [~vanzin] [~zsxwing] Any comments? Thanks.
[jira] [Created] (SPARK-23608) SHS needs synchronization between attachSparkUI and detachSparkUI functions
Ye Zhou created SPARK-23608:

             Summary: SHS needs synchronization between attachSparkUI and detachSparkUI functions
                 Key: SPARK-23608
                 URL: https://issues.apache.org/jira/browse/SPARK-23608
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, Web UI
    Affects Versions: 2.3.0
            Reporter: Ye Zhou
[jira] [Comment Edited] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387126#comment-16387126 ]

Ye Zhou edited comment on SPARK-23607 at 3/6/18 1:42 AM:

[~vanzin] [~zsxwing] Any comments? Thanks.

was (Author: zhouyejoe):
[~vanzin] Any comments? Thanks.

> Use HDFS extended attributes to store application summary to improve the
> Spark History Server performance
>
>                 Key: SPARK-23607
>                 URL: https://issues.apache.org/jira/browse/SPARK-23607
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Web UI
>    Affects Versions: 2.3.0
>            Reporter: Ye Zhou
>            Priority: Major
>             Fix For: 2.4.0
>
> Currently, the checkForLogs thread in the Spark History Server (SHS)
> creates replay tasks for log files whose size has changed. A replay task
> filters out most of the log file content and keeps only the application
> summary: applicationId, user, attemptACL, start time, and end time. The
> summary data is written to listing.ldb and serves the application list
> on the SHS home page. For a long-running application, the log file whose
> name ends with "inprogress" is replayed multiple times to obtain this
> summary. This wastes computing and data-reading resources in SHS and
> delays the application's appearance on the home page. Internally we have
> a patch that uses HDFS extended attributes to speed up retrieval of the
> application summary in SHS. With this patch, the driver writes the
> application summary into extended attributes as key/value pairs, and SHS
> tries to read the summary from them. If SHS fails to read the extended
> attributes, it falls back to reading the log file content as usual. The
> feature can be enabled or disabled through configuration.
> The patch has been running fine internally for 4 months, and the
> last-updated timestamp on SHS stays within 1 minute, matching the
> 1-minute interval we configure. Previously, at our scale, with a large
> number of Spark applications running per day, the delay could be as long
> as 30 minutes.
> We would like to know whether this approach is acceptable to the
> community. Please comment. If so, I will post a pull request for the
> changes. Thanks.
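The read path proposed in this issue (try the extended attributes first, fall back to replaying the log) might look roughly like the sketch below. It is not the internal patch, which is not shown in this thread. The attribute key, the "applicationId=" log line format, and all class and method names are assumptions made for illustration, and a plain Map stands in for the extended attributes of the log file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: the attribute key, the log line format, and every name here
// are assumptions; a Map stands in for the file's HDFS extended attributes.
final class AppSummarySketch {
    static final String APP_ID_KEY = "user.spark.applicationId"; // hypothetical key

    // Fast path: read the value the driver stored as an extended attribute.
    static Optional<String> fromXAttrs(Map<String, byte[]> xattrs) {
        byte[] raw = xattrs.get(APP_ID_KEY);
        return raw == null
            ? Optional.empty()
            : Optional.of(new String(raw, StandardCharsets.UTF_8));
    }

    // Slow path: scan the event log lines, keeping only the summary field.
    static Optional<String> fromLogReplay(List<String> logLines) {
        String prefix = "applicationId=";
        for (String line : logLines) {
            if (line.startsWith(prefix)) {
                return Optional.of(line.substring(prefix.length()));
            }
        }
        return Optional.empty();
    }

    // Try the attributes first when the feature is enabled, else replay the log.
    static Optional<String> applicationId(
            boolean xattrsEnabled, Map<String, byte[]> xattrs, List<String> logLines) {
        if (xattrsEnabled) {
            Optional<String> fast = fromXAttrs(xattrs);
            if (fast.isPresent()) {
                return fast;
            }
        }
        return fromLogReplay(logLines);
    }

    public static void main(String[] args) {
        Map<String, byte[]> xattrs = new HashMap<>();
        xattrs.put(APP_ID_KEY, "app-42".getBytes(StandardCharsets.UTF_8));
        List<String> log = Arrays.asList("event=start", "applicationId=app-42");

        // Attribute present: no log scan is needed.
        System.out.println(applicationId(true, xattrs, log));
        // Attribute missing: falls back to scanning the log.
        System.out.println(applicationId(true, new HashMap<>(), log));
    }
}
```

The configuration switch is modeled as a boolean argument so that disabling the feature leaves the existing replay path as the only code that runs.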
[jira] [Updated] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ye Zhou updated SPARK-23607:

    Shepherd:   (was: Marcelo Vanzin)
[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387126#comment-16387126 ]

Ye Zhou commented on SPARK-23607:

[~vanzin] Any comments? Thanks.
[jira] [Created] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance
Ye Zhou created SPARK-23607:

             Summary: Use HDFS extended attributes to store application summary to improve the Spark History Server performance
                 Key: SPARK-23607
                 URL: https://issues.apache.org/jira/browse/SPARK-23607
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, Web UI
    Affects Versions: 2.3.0
            Reporter: Ye Zhou
             Fix For: 2.4.0
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387083#comment-16387083 ]

Edwina Lu commented on SPARK-23206:

[~assia6], sorry for the delay. I'm planning to submit a pull request in a couple of weeks.

> Additional Memory Tuning Metrics
>
>                 Key: SPARK-23206
>                 URL: https://issues.apache.org/jira/browse/SPARK-23206
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Edwina Lu
>            Priority: Major
>         Attachments: ExecutorsTab.png, ExecutorsTab2.png,
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
> At LinkedIn, we have multiple clusters, running thousands of Spark
> applications, and these numbers are growing rapidly. We need to ensure
> that these Spark applications are well tuned: cluster resources,
> including memory, should be used efficiently so that the cluster can
> support running more applications concurrently, and applications should
> run quickly and reliably.
> Currently there is limited visibility into how much memory executors are
> using, and users are guessing numbers for executor and driver memory
> sizing. These estimates are often much larger than needed, leading to
> memory wastage. Examining the metrics for one cluster for a month, the
> average percentage of used executor memory (max JVM used memory across
> executors / spark.executor.memory) is 35%, leading to an average of
> 591GB unused memory per application (number of executors *
> (spark.executor.memory - max JVM used memory)). Spark has multiple
> memory regions (user memory, execution memory, storage memory, and
> overhead memory), and to understand how memory is being used and
> fine-tune allocation between regions, it would be useful to have
> information about how much memory is being used for the different
> regions.
> To improve visibility into memory usage for the driver and executors and
> different memory regions, the following additional memory metrics can be
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins,
> sorts and aggregations.
>  * Storage memory: memory used for caching and propagating internal data
> across the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor,
> and also per stage. This information can be shown in the Spark UI and
> the REST APIs. Information for peak JVM used memory can help with
> determining appropriate values for spark.executor.memory and
> spark.driver.memory, and information about the unified memory region can
> help with determining appropriate values for spark.memory.fraction and
> spark.memory.storageFraction. Stage memory information can help identify
> which stages are most memory intensive, and users can look into the
> relevant code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used
> memory, execution memory and storage memory to the heartbeat.
> SparkListeners are modified to collect the new metrics for the
> executors, stages and Spark history log. Only interesting values (peak
> values per stage per executor) are recorded in the Spark history log, to
> minimize the amount of additional logging.
> We have attached our design documentation with this ticket and would
> like to receive feedback from the community for this proposal.
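The two formulas quoted in this description (used fraction and unused memory per application) can be written out as a small sketch. The class and method names are hypothetical, and the sample values in main are invented for illustration; they are not the measurements reported in the issue.

```java
// Restates the two formulas from the description; the sample numbers in main
// are invented for illustration and are not measurements from the report.
final class MemoryWasteSketch {
    // max JVM used memory across executors / spark.executor.memory
    static double usedFraction(long maxJvmUsedMb, long executorMemoryMb) {
        return (double) maxJvmUsedMb / executorMemoryMb;
    }

    // number of executors * (spark.executor.memory - max JVM used memory)
    static long unusedMb(int numExecutors, long executorMemoryMb, long maxJvmUsedMb) {
        return numExecutors * (executorMemoryMb - maxJvmUsedMb);
    }

    public static void main(String[] args) {
        int executors = 100;          // made-up executor count
        long executorMemoryMb = 8192; // made-up spark.executor.memory, in MB
        long maxJvmUsedMb = 2048;     // made-up peak JVM used memory, in MB

        System.out.println(usedFraction(maxJvmUsedMb, executorMemoryMb));
        System.out.println(unusedMb(executors, executorMemoryMb, maxJvmUsedMb));
    }
}
```

Both formulas rely on the peak JVM used memory across executors, which is one of the metrics the proposal suggests tracking.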
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387071#comment-16387071 ]

Attila Zsolt Piros commented on SPARK-16630:

Of course we can consider "spark.yarn.executor.failuresValidityInterval" for "yarn-level" blacklisting too.

> Blacklist a node if executors won't launch on it.
>
>                 Key: SPARK-16630
>                 URL: https://issues.apache.org/jira/browse/SPARK-16630
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 1.6.2
>            Reporter: Thomas Graves
>            Priority: Major
>
> On YARN, it's possible that a node is messed up or misconfigured such
> that a container won't launch on it, for instance if the Spark external
> shuffle handler didn't get loaded on it, or because of some other
> hardware or Hadoop configuration issue.
> It would be nice if we could recognize this happening and stop trying to
> launch executors on that node, since repeated failures could cause us to
> hit our max number of executor failures and then kill the job.
[jira] [Commented] (SPARK-21812) PySpark ML Models should not depend transfering params from Java
[ https://issues.apache.org/jira/browse/SPARK-21812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387068#comment-16387068 ]

Bryan Cutler commented on SPARK-21812:

Adding SPARK-15009 as an example of how to restructure the model class hierarchy, using CountVectorizer, to own params instead of depending on transfer from Scala.

> PySpark ML Models should not depend transfering params from Java
>
>                 Key: SPARK-21812
>                 URL: https://issues.apache.org/jira/browse/SPARK-21812
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 2.3.0
>            Reporter: holdenk
>            Priority: Major
>
> After SPARK-10931 we should fix this so that the Python parameters are
> explicitly defined instead of relying on copying them from Java. This
> can be done in batches of models as sub-issues, since the number of
> params to be updated could be quite large.
[jira] [Assigned] (SPARK-23604) ParquetInteroperabilityTest timestamp test should use Statistics.hasNonNullValue
[ https://issues.apache.org/jira/browse/SPARK-23604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin reassigned SPARK-23604:

    Assignee: Henry Robinson

> ParquetInteroperabilityTest timestamp test should use
> Statistics.hasNonNullValue
>
>                 Key: SPARK-23604
>                 URL: https://issues.apache.org/jira/browse/SPARK-23604
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Henry Robinson
>            Assignee: Henry Robinson
>            Priority: Minor
>             Fix For: 2.4.0
>
> We ran into an issue with a downstream build of Spark running against a
> custom Parquet build where {{ParquetInteroperabilityTestSuite}} started
> failing because {{Statistics.isEmpty}} changed its behavior as of
> PARQUET-1217. ({{isEmpty()}} now considers whether there are 0 or more
> nulls, and by default {{num_nulls}} is 0 for 'empty' stats objects).
> The test really cares about whether the statistics object has values, so
> a very simple fix to use {{hasNonNullValue}} instead corrects the issue.
> Filing it now because it's a backwards-compatible fix to the current
> Parquet version so we can fix it right now before we hit the issue in
> the future.
[jira] [Resolved] (SPARK-23604) ParquetInteroperabilityTest timestamp test should use Statistics.hasNonNullValue
[ https://issues.apache.org/jira/browse/SPARK-23604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-23604.

       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 20740
[https://github.com/apache/spark/pull/20740]
[jira] [Created] (SPARK-23606) Flakey FileBasedDataSourceSuite
Henry Robinson created SPARK-23606: -- Summary: Flakey FileBasedDataSourceSuite Key: SPARK-23606 URL: https://issues.apache.org/jira/browse/SPARK-23606 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Henry Robinson I've seen the following exception twice today in PR builds (one example: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87978/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/). It's not deterministic, as I've had one PR build pass in the same span. {code:java} sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 15 times over 10.016101897 seconds. Last failure message: There are 1 possibly leaked file streams.. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439) at org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:30) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308) at org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:30) at org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114) at org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:30) at org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234) at org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379) at org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375) at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454) at org.scalatest.Status$class.withAfterEffect(Status.scala:375) at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232) at 
org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:30) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) at scala.collection.immutable.List.foreach(List.scala:381) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) at org.scalatest.Suite$class.run(Suite.scala:1147) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.SuperEngine.runImpl(Engine.scala:521) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) at sbt.ForkMain$Run$2.call(ForkMain.java:296) at sbt.ForkMain$Run$2.call(ForkMain.java:286) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are 1 possibly leaked file streams. at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54) at org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply$mcV$sp(SharedSparkSession.scala:115) at org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply(SharedSparkSession.scala:115)
[jira] [Resolved] (SPARK-23586) Add interpreted execution for WrapOption expression
[ https://issues.apache.org/jira/browse/SPARK-23586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-23586. --- Resolution: Fixed Assignee: Marco Gaido Fix Version/s: 2.4.0 > Add interpreted execution for WrapOption expression > --- > > Key: SPARK-23586 > URL: https://issues.apache.org/jira/browse/SPARK-23586 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387047#comment-16387047 ] Apache Spark commented on SPARK-23020: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20743 > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.4.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.
[ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387045#comment-16387045 ] Joseph K. Bradley commented on SPARK-23562: --- [~yogeshgarg] is going to work on this > RFormula handleInvalid should handle invalid values in non-string columns. > -- > > Key: SPARK-23562 > URL: https://issues.apache.org/jira/browse/SPARK-23562 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > Currently when handleInvalid is set to 'keep' or 'skip' this only applies to > String fields. Numeric fields that are null will either cause the transformer > to fail or might be null in the resulting label column. > I'm not sure what the semantics of keep might be for numeric columns with > null values, but we should be able to at least support skip for these types. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387036#comment-16387036 ] Marcelo Vanzin commented on SPARK-23020: Things look pretty stable on master, so I'll post a backport for 2.3.1 so we get the fixes in the next maintenance release. https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-master-test-maven-hadoop-2.7/4571/testReport/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.4.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23581) Add an interpreted version of GenerateUnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-23581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reassigned SPARK-23581: - Assignee: Herman van Hovell > Add an interpreted version of GenerateUnsafeProjection > -- > > Key: SPARK-23581 > URL: https://issues.apache.org/jira/browse/SPARK-23581 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > > GenerateUnsafeProjection should have an interpreted cousin. See the parent > ticket for the motivation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386991#comment-16386991 ] Kazuaki Ishizaki commented on SPARK-23580: -- Sure, I will work on them > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenarios where code generation does > not work, or blows past the JVM class limits. We currently cannot gracefully > fall back. > This ticket is an umbrella to fix this class of problem in Spark SQL. This > work can be divided into two main areas: > - Add interpreted versions for all dataset related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
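The graceful fallback that SPARK-23580 asks for can be illustrated with a small sketch. Everything here is hypothetical: `Projection`, `compiled`, `interpreted` and `createWithFallback` are made-up names, and the simulated failure merely stands in for a code-generation problem. It is not how Spark implements the fallback.

```scala
import scala.util.control.NonFatal

object FallbackSketch {
  // Hypothetical projection type: maps an input row to an output row.
  type Projection = Seq[Any] => Seq[Any]

  // Stand-in for a code-generated path that may fail while it is being built.
  def compiled(fail: Boolean): Projection = {
    if (fail) throw new IllegalStateException("simulated codegen failure")
    row => row.map(identity)
  }

  // Stand-in for the slower but always available interpreted path.
  val interpreted: Projection = row => row.map(identity)

  // Try the generated path first; on a non-fatal failure use the interpreted one.
  def createWithFallback(fail: Boolean): (String, Projection) = {
    try {
      ("codegen", compiled(fail))
    } catch {
      case NonFatal(_) => ("interpreted", interpreted)
    }
  }

  def main(args: Array[String]): Unit = {
    val (mode, projection) = createWithFallback(fail = true)
    println(s"mode = $mode, result = ${projection(Seq(1, "a"))}")
  }
}
```

The point of the sketch is only that both paths must exist and produce the same rows, which is what the umbrella ticket's sub-tasks add piece by piece.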
[jira] [Resolved] (SPARK-18630) PySpark ML memory leak
[ https://issues.apache.org/jira/browse/SPARK-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18630. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20724 [https://github.com/apache/spark/pull/20724] > PySpark ML memory leak > -- > > Key: SPARK-18630 > URL: https://issues.apache.org/jira/browse/SPARK-18630 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: holdenk >Assignee: yogesh garg >Priority: Minor > Fix For: 2.4.0 > > > After SPARK-18274 is fixed by https://github.com/apache/spark/pull/15843, it > would be good to follow up and address the potential memory leak for all > items handled by the `JavaWrapper`, not just `JavaParams`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18630) PySpark ML memory leak
[ https://issues.apache.org/jira/browse/SPARK-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-18630: - Assignee: yogesh garg > PySpark ML memory leak > -- > > Key: SPARK-18630 > URL: https://issues.apache.org/jira/browse/SPARK-18630 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: holdenk >Assignee: yogesh garg >Priority: Minor > > After SPARK-18274 is fixed by https://github.com/apache/spark/pull/15843, it > would be good to follow up and address the potential memory leak for all > items handled by the `JavaWrapper`, not just `JavaParams`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386982#comment-16386982 ] Richard Yu edited comment on SPARK-18057 at 3/5/18 11:48 PM: - Hi all, I'm currently working on KAFKA-6608 which is part of KAFKA-4879. Do you have any thoughts on how {{position()}} should be bounded using time constraints? The discussion about this change is on the following thread: [https://www.mail-archive.com/dev@kafka.apache.org/msg85757.html] Thanks was (Author: yohan123): Hi all, I'm currently working on KAFKA-6608 which is part of KAFKA-4879. Do you have any thoughts on how {{position()}} should be bounded using time constraints? Thanks > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386982#comment-16386982 ] Richard Yu commented on SPARK-18057: Hi all, I'm currently working on KAFKA-6608 which is part of KAFKA-4879. Do you have any thoughts on how {{position()}} should be bounded using time constraints? Thanks > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23604) ParquetInteroperabilityTest timestamp test should use Statistics.hasNonNullValue
[ https://issues.apache.org/jira/browse/SPARK-23604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henry Robinson updated SPARK-23604: --- Description: We ran into an issue with a downstream build of Spark running against a custom Parquet build where {{ParquetInteroperabilityTestSuite}} started failing because {{Statistics.isEmpty}} changed its behavior as of PARQUET-1217. ({{isEmpty()}} now considers whether there are 0 or more nulls, and by default {{num_nulls}} is 0 for 'empty' stats objects). The test really cares about whether the statistics object has values, so a very simple fix to use {{hasNonNullValue}} instead corrects the issue. Filing it now because it's a backwards-compatible fix to the current Parquet version so we can fix it right now before we hit the issue in the future. was: We ran into an issue with a downstream build of Spark running against a custom Parquet build where {{ParquetInteroperabilityTestSuite}} started failing because {{Statistics.isEmpty}} changed its behavior as of PARQUET-1217. ({{isEmpty() now considers whether there are 0 or more nulls, and by default {{num_nulls}} is 0 for 'empty' stats objects). The test really cares about whether the statistics object has values, so a very simple fix to use {{hasNonNullValue}} instead corrects the issue. Filing it now because it's a backwards-compatible fix to the current Parquet version so we can fix it right now before we hit the issue in the future. > ParquetInteroperabilityTest timestamp test should use > Statistics.hasNonNullValue > > > Key: SPARK-23604 > URL: https://issues.apache.org/jira/browse/SPARK-23604 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Priority: Minor > > We ran into an issue with a downstream build of Spark running against a > custom Parquet build where {{ParquetInteroperabilityTestSuite}} started > failing because {{Statistics.isEmpty}} changed its behavior as of > PARQUET-1217. 
({{isEmpty()}} now considers whether there are 0 or more nulls, > and by default {{num_nulls}} is 0 for 'empty' stats objects). > The test really cares about whether the statistics object has values, so a > very simple fix to use {{hasNonNullValue}} instead corrects the issue. Filing > it now because it's a backwards-compatible fix to the current Parquet version > so we can fix it right now before we hit the issue in the future. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23572) Update security.md to cover new features
[ https://issues.apache.org/jira/browse/SPARK-23572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386935#comment-16386935 ] Apache Spark commented on SPARK-23572: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20742 > Update security.md to cover new features > > > Key: SPARK-23572 > URL: https://issues.apache.org/jira/browse/SPARK-23572 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Priority: Major > > I just took a look at {{security.md}} and while it is correct, it covers > functionality that is now sort of obsolete (such as SASL-based encryption > instead of the newer AES encryption support). > We should go over that document and make sure everything is up to date. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23572) Update security.md to cover new features
[ https://issues.apache.org/jira/browse/SPARK-23572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23572: Assignee: (was: Apache Spark) > Update security.md to cover new features > > > Key: SPARK-23572 > URL: https://issues.apache.org/jira/browse/SPARK-23572 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Priority: Major > > I just took a look at {{security.md}} and while it is correct, it covers > functionality that is now sort of obsolete (such as SASL-based encryption > instead of the newer AES encryption support). > We should go over that document and make sure everything is up to date. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23572) Update security.md to cover new features
[ https://issues.apache.org/jira/browse/SPARK-23572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23572: Assignee: Apache Spark > Update security.md to cover new features > > > Key: SPARK-23572 > URL: https://issues.apache.org/jira/browse/SPARK-23572 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > I just took a look at {{security.md}} and while it is correct, it covers > functionality that is now sort of obsolete (such as SASL-based encryption > instead of the newer AES encryption support). > We should go over that document and make sure everything is up to date. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23538) Simplify SSL configuration for https client
[ https://issues.apache.org/jira/browse/SPARK-23538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23538. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20723 [https://github.com/apache/spark/pull/20723] > Simplify SSL configuration for https client > --- > > Key: SPARK-23538 > URL: https://issues.apache.org/jira/browse/SPARK-23538 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.4.0 > > > There's code in {{SecurityManager}} that is used to configure SSL for the > code that downloads dependencies from https servers: > {code} > // SSL configuration for the file server. This is used by > Utils.setupSecureURLConnection(). > val fileServerSSLOptions = getSSLOptions("fs") > val (sslSocketFactory, hostnameVerifier) = if > (fileServerSSLOptions.enabled) { > ... > {code} > It was added for an old feature that doesn't exist anymore (the "file server" > referenced in the comment), but can still be used to configure the built-in > JRE SSL code with a custom trust store, for example. > We should instead: > - move this code out of SecurityManager, and place it where it's actually > used ({{Utils.setupSecureURLConnection}}). > - remove the dummy trust manager / host verifier since they don't make a lot > of sense for the client code (and only made slightly more sense for the file > server case). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
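As an illustration of the "custom trust store" use case mentioned in SPARK-23538, the following sketch configures an HTTPS connection using only standard JDK classes. `HttpsClientSketch`, `socketFactoryFor` and `configure` are hypothetical names, the trust store is an empty in-memory one, and the URL is a placeholder. This is not the code from `SecurityManager` or `Utils.setupSecureURLConnection`.

```scala
import java.net.{URL, URLConnection}
import java.security.KeyStore
import javax.net.ssl.{HttpsURLConnection, SSLContext, SSLSocketFactory, TrustManagerFactory}

object HttpsClientSketch {
  // Build a socket factory whose trust decisions come from the given key store.
  def socketFactoryFor(trustStore: KeyStore): SSLSocketFactory = {
    val tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm)
    tmf.init(trustStore)
    val ctx = SSLContext.getInstance("TLS")
    ctx.init(null, tmf.getTrustManagers, null)
    ctx.getSocketFactory
  }

  // Apply the factory to HTTPS connections only. No dummy trust manager or
  // hostname verifier is installed, so the JRE's default verification stays on.
  def configure(conn: URLConnection, factory: SSLSocketFactory): Unit = conn match {
    case https: HttpsURLConnection => https.setSSLSocketFactory(factory)
    case _ => ()
  }

  def main(args: Array[String]): Unit = {
    val trustStore = KeyStore.getInstance(KeyStore.getDefaultType)
    trustStore.load(null, null) // empty in-memory store; a real one would be read from a file
    val factory = socketFactoryFor(trustStore)
    // openConnection() only creates the connection object; nothing is sent yet.
    val conn = new URL("https://example.invalid/").openConnection()
    configure(conn, factory)
    println(s"configured: ${conn.getClass.getName}")
  }
}
```

Keeping the configuration next to the code that opens the connection is the design choice the ticket argues for; the sketch deliberately leaves hostname verification untouched.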
[jira] [Assigned] (SPARK-23538) Simplify SSL configuration for https client
[ https://issues.apache.org/jira/browse/SPARK-23538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23538: -- Assignee: Marcelo Vanzin > Simplify SSL configuration for https client > --- > > Key: SPARK-23538 > URL: https://issues.apache.org/jira/browse/SPARK-23538 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.4.0 > > > There's code in {{SecurityManager}} that is used to configure SSL for the > code that downloads dependencies from https servers: > {code} > // SSL configuration for the file server. This is used by > Utils.setupSecureURLConnection(). > val fileServerSSLOptions = getSSLOptions("fs") > val (sslSocketFactory, hostnameVerifier) = if > (fileServerSSLOptions.enabled) { > ... > {code} > It was added for an old feature that doesn't exist anymore (the "file server" > referenced in the comment), but can still be used to configure the built-in > JRE SSL code with a custom trust store, for example. > We should instead: > - move this code out of SecurityManager, and place it where it's actually > used ({{Utils.setupSecureURLConnection}}). > - remove the dummy trust manager / host verifier since they don't make a lot > of sense for the client code (and only made slightly more sense for the file > server case). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23040) BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified
[ https://issues.apache.org/jira/browse/SPARK-23040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23040. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20449 [https://github.com/apache/spark/pull/20449] > BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator > or ordering is specified > > > Key: SPARK-23040 > URL: https://issues.apache.org/jira/browse/SPARK-23040 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > Fix For: 2.4.0 > > > For example, if ordering is specified, the returned iterator is a > CompletionIterator > {code:java} > dep.keyOrdering match { > case Some(keyOrd: Ordering[K]) => > // Create an ExternalSorter to sort the data. > val sorter = > new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), > serializer = dep.serializer) > sorter.insertAll(aggregatedIter) > context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled) > context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled) > > context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes) > CompletionIterator[Product2[K, C], Iterator[Product2[K, > C]]](sorter.iterator, sorter.stop()) > case None => > aggregatedIter > } > {code} > However, the sorter consumes the aggregatedIter (which may be interruptible) > in sorter.insertAll, then creates an iterator which > isn't interruptible. > The problem with this is that a Spark task cannot be cancelled when its stage > fails (unless interruptThread is enabled, which it is not by default), which > wastes executor resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
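The problem described in SPARK-23040 can be pictured with a small wrapper that re-checks a cancellation flag on every `hasNext` call. `KillableIterator` and the `killed` flag are hypothetical names used only for this sketch, and the sorted sequence stands in for `sorter.iterator`. It is an illustration of the idea, not the change made in the pull request.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical wrapper: consults a kill flag before handing out each element,
// so a consumer of the sorted output can still be stopped between elements.
final class KillableIterator[T](killed: AtomicBoolean, underlying: Iterator[T])
    extends Iterator[T] {
  override def hasNext: Boolean = {
    if (killed.get()) throw new InterruptedException("task was killed")
    underlying.hasNext
  }
  override def next(): T = underlying.next()
}

object KillableDemo {
  def main(args: Array[String]): Unit = {
    val killed = new AtomicBoolean(false)
    // Sorting consumes the input eagerly, just as insertAll does in the ticket;
    // the iterator that comes out afterwards is what needs the wrapper.
    val sorted = Seq(3, 1, 2).sorted.iterator
    val it = new KillableIterator(killed, sorted)
    println(s"first element: ${it.next()}")
    killed.set(true)
    try {
      it.hasNext
      println("still running")
    } catch {
      case _: InterruptedException => println("iteration stopped after kill")
    }
  }
}
```

Without such a wrapper, the consumer of the sorted iterator in this sketch would run to completion even after the flag is set, which is the wasted executor time the report complains about.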
[jira] [Assigned] (SPARK-23040) BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified
[ https://issues.apache.org/jira/browse/SPARK-23040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23040: --- Assignee: Xianjin YE > BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator > or ordering is specified > > > Key: SPARK-23040 > URL: https://issues.apache.org/jira/browse/SPARK-23040 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > Fix For: 2.4.0 > > > For example, if ordering is specified, the returned iterator is a > CompletionIterator > {code:java} > dep.keyOrdering match { > case Some(keyOrd: Ordering[K]) => > // Create an ExternalSorter to sort the data. > val sorter = > new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), > serializer = dep.serializer) > sorter.insertAll(aggregatedIter) > context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled) > context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled) > > context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes) > CompletionIterator[Product2[K, C], Iterator[Product2[K, > C]]](sorter.iterator, sorter.stop()) > case None => > aggregatedIter > } > {code} > However, the sorter consumes the aggregatedIter (which may be interruptible) > in sorter.insertAll, then creates an iterator which > isn't interruptible. > The problem with this is that a Spark task cannot be cancelled when its stage > fails (unless interruptThread is enabled, which it is not by default), which > wastes executor resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path
[ https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23434: Fix Version/s: 2.2.2 > Spark should not warn `metadata directory` for a HDFS file path > --- > > Key: SPARK-23434 > URL: https://issues.apache.org/jira/browse/SPARK-23434 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > In a kerberized cluster, when Spark reads a file path (e.g. `people.json`), > it warns with a wrong error message while looking up > `people.json/_spark_metadata`. The root cause of this situation is the > difference between `LocalFileSystem` and `DistributedFileSystem`. > `LocalFileSystem.exists()` returns `false`, but > `DistributedFileSystem.exists` raises Exception. > {code} > scala> spark.version > res0: String = 2.4.0-SNAPSHOT > scala> > spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show > ++---+ > | age| name| > ++---+ > |null|Michael| > | 30| Andy| > | 19| Justin| > ++---+ > scala> spark.read.json("hdfs:///tmp/people.json") > 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for > metadata directory. > 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for > metadata directory. > res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string] > {code} > {code} > scala> spark.version > res0: String = 2.2.1 > scala> spark.read.json("hdfs:///tmp/people.json").show > 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata > directory. > 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata > directory. > {code} > {code} > scala> spark.version > res0: String = 2.1.2 > scala> spark.read.json("hdfs:///tmp/people.json").show > 18/02/15 05:29:53 WARN DataSource: Error while looking for metadata directory. 
> ++---+ > | age| name| > ++---+ > |null|Michael| > | 30| Andy| > | 19| Justin| > ++---+ > {code} > {code} > scala> spark.version > res0: String = 2.0.2 > scala> spark.read.json("hdfs:///tmp/people.json").show > 18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
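The difference between the two file systems in SPARK-23434 can be modelled with a toy interface. `ToyFs`, `LenientFs`, `StrictFs` and `hasMetadataDir` are hypothetical names and do not correspond to the Hadoop `FileSystem` API. The guard shown is one possible way to avoid the spurious warning under those assumptions, not necessarily the fix that was merged.

```scala
import java.io.IOException

object MetadataLookupSketch {
  // Hypothetical, minimal file-system interface for illustration only.
  trait ToyFs {
    def isDirectory(path: String): Boolean
    def exists(path: String): Boolean
  }

  // Models the local case in the report: a child of a plain file simply does not exist.
  object LenientFs extends ToyFs {
    def isDirectory(path: String): Boolean = false
    def exists(path: String): Boolean = false
  }

  // Models the distributed case in the report: the same lookup raises an exception.
  object StrictFs extends ToyFs {
    def isDirectory(path: String): Boolean = false
    def exists(path: String): Boolean = throw new IOException(s"cannot look up $path")
  }

  // Probe for the metadata directory only when the path is itself a directory,
  // so a plain file such as people.json never triggers the failing lookup.
  def hasMetadataDir(fs: ToyFs, path: String): Boolean =
    fs.isDirectory(path) && fs.exists(path + "/_spark_metadata")

  def main(args: Array[String]): Unit = {
    println(hasMetadataDir(LenientFs, "/tmp/people.json"))
    println(hasMetadataDir(StrictFs, "/tmp/people.json"))
  }
}
```

Because `&&` short-circuits, the strict file system in this model is never asked about a child of a plain file, so both calls return without an exception.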
[jira] [Updated] (SPARK-23605) Conflicting dependencies for janino in 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-23605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Liu updated SPARK-23605: Labels: maven (was: ) > Conflicting dependencies for janino in 2.3.0 > > > Key: SPARK-23605 > URL: https://issues.apache.org/jira/browse/SPARK-23605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Tao Liu >Priority: Minor > Labels: maven > > spark-catalyst_2.11 2.3.0 has both a janino 2.7.8 and a commons-compiler > 3.0.8 dependency which are conflicting with one another resulting in > ClassNotFoundExceptions. > java.lang.ClassNotFoundException: > org.codehaus.janino.InternalCompilerException > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1421) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) > at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) > at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1369) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:412) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:366) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:32) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1325) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.extractProjection$lzycompute(ExpressionEncoder.scala:264) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.extractProjection(ExpressionEncoder.scala:264) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:288) > at > org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468) > at > org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:468) > at > org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:507) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
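When a `ClassNotFoundException` like the one in this report points at a dependency conflict, a quick diagnostic is to ask the JVM whether a class is loadable and which jar supplied it. The following is a minimal sketch under that assumption. `ClasspathProbe`, `isLoadable`, and `whereIs` are hypothetical helper names and are not part of Spark or Janino.

```scala
object ClasspathProbe {
  // Try to load the class without running its static initializers.
  private def load(className: String): Option[Class[_]] =
    try Some(Class.forName(className, false, getClass.getClassLoader))
    catch {
      case _: ClassNotFoundException => None
      case _: LinkageError           => None // e.g. NoClassDefFoundError
    }

  /** True if the class can be loaded from the current classpath. */
  def isLoadable(className: String): Boolean = load(className).isDefined

  /** Location (usually a jar URL) the class was loaded from, if the JVM reports one. */
  def whereIs(className: String): Option[String] =
    for {
      cls <- load(className)
      pd  <- Option(cls.getProtectionDomain)
      cs  <- Option(pd.getCodeSource)
      loc <- Option(cs.getLocation)
    } yield loc.toString
}
```

Calling `ClasspathProbe.whereIs("org.codehaus.janino.InternalCompilerException")` in the failing application would show which jar, if any, provides the class named in the stack trace. Classes loaded by the JDK bootstrap loader typically report no location.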
[jira] [Created] (SPARK-23605) Conflicting dependencies for janino in 2.3.0
Tao Liu created SPARK-23605: --- Summary: Conflicting dependencies for janino in 2.3.0 Key: SPARK-23605 URL: https://issues.apache.org/jira/browse/SPARK-23605 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Tao Liu spark-catalyst_2.11 2.3.0 has both a janino 2.7.8 and a commons-compiler 3.0.8 dependency which are conflicting with one another resulting in ClassNotFoundExceptions. java.lang.ClassNotFoundException: org.codehaus.janino.InternalCompilerException at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1421) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494) at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1369) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:412) at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:366) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:32) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1325) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.extractProjection$lzycompute(ExpressionEncoder.scala:264) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.extractProjection(ExpressionEncoder.scala:264) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:288) at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468) at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:468) at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:507) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path
[ https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23434: Fix Version/s: 2.3.1 > Spark should not warn `metadata directory` for a HDFS file path > --- > > Key: SPARK-23434 > URL: https://issues.apache.org/jira/browse/SPARK-23434 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > In a kerberized cluster, when Spark reads a file path (e.g. `people.json`), > it logs a misleading warning while looking up > `people.json/_spark_metadata`. The root cause of this situation is the > difference between `LocalFileSystem` and `DistributedFileSystem`. > `LocalFileSystem.exists()` returns `false`, but > `DistributedFileSystem.exists` raises an exception. > {code} > scala> spark.version > res0: String = 2.4.0-SNAPSHOT > scala> > spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show > +----+-------+ > | age|   name| > +----+-------+ > |null|Michael| > |  30|   Andy| > |  19| Justin| > +----+-------+ > scala> spark.read.json("hdfs:///tmp/people.json") > 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for > metadata directory. > 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for > metadata directory. > res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string] > {code} > {code} > scala> spark.version > res0: String = 2.2.1 > scala> spark.read.json("hdfs:///tmp/people.json").show > 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata > directory. > 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata > directory. > {code} > {code} > scala> spark.version > res0: String = 2.1.2 > scala> spark.read.json("hdfs:///tmp/people.json").show > 18/02/15 05:29:53 WARN DataSource: Error while looking for metadata directory. > +----+-------+ > | age|   name| > +----+-------+ > |null|Michael| > |  30|   Andy| > |  19| Justin| > +----+-------+ > {code} > {code} > scala> spark.version > res0: String = 2.0.2 > scala> spark.read.json("hdfs:///tmp/people.json").show > 18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory. > {code}
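The description attributes the warning to `exists()` behaving differently across file systems when the parent path is a plain file. One way to avoid the spurious lookup can be sketched as follows: probe for `_spark_metadata` only under directories, and treat a failed probe as "no metadata". `SimpleFs`, `StrictFs`, and `MetadataProbe` are hypothetical stand-ins for illustration. This is not Hadoop's `FileSystem` API and not necessarily how the ticket was fixed.

```scala
import scala.util.control.NonFatal

// Hypothetical, simplified stand-in for a Hadoop-style file system.
trait SimpleFs {
  def isDirectory(path: String): Boolean
  def exists(path: String): Boolean // may throw, as the ticket describes for HDFS
}

object MetadataProbe {
  val MetadataDir = "_spark_metadata"

  // Only look below directories; a failing probe counts as "no metadata".
  def hasMetadata(fs: SimpleFs, path: String): Boolean =
    try fs.isDirectory(path) && fs.exists(s"$path/$MetadataDir")
    catch { case NonFatal(_) => false }
}

// Toy file system that throws when asked about a child of a regular file.
object StrictFs extends SimpleFs {
  private val dirs  = Set("/tmp/sink", "/tmp/sink/_spark_metadata")
  private val files = Set("/tmp/people.json")

  def isDirectory(path: String): Boolean = dirs.contains(path)

  def exists(path: String): Boolean = {
    val parent = path.substring(0, path.lastIndexOf('/'))
    if (files.contains(parent))
      throw new java.io.IOException(s"$parent is not a directory")
    dirs.contains(path) || files.contains(path)
  }
}
```

With this guard, a plain file such as `/tmp/people.json` never triggers the child lookup, so there is nothing to warn about.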
[jira] [Commented] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386839#comment-16386839 ] Xiao Li commented on SPARK-23580: - cc [~kiszk] Could you help with this umbrella ticket? > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenarios where code generation does > not work, or blows past the JVM class limits. We currently cannot gracefully > fall back. > This ticket is an umbrella to fix this class of problem in Spark SQL. This > work can be divided into two main areas: > - Add interpreted versions for all Dataset-related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}.
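The graceful fallback this ticket asks for can be sketched as a small wrapper that tries the code-generated path first and switches to an interpreted implementation if that fails. This is a minimal illustration only. `FallbackFactory`, `createWithFallback`, and the toy `Projection` type are hypothetical names, not Spark classes.

```scala
import scala.util.control.NonFatal

object FallbackFactory {
  // Try the code-generated object first; fall back to the interpreted one on failure.
  def createWithFallback[T](codegen: => T)(interpreted: => T): T =
    try codegen
    catch { case NonFatal(_) => interpreted }
}

object FallbackDemo {
  // Toy "projection": maps an input row to an output row.
  type Projection = Seq[Int] => Seq[Int]

  // Simulates code generation failing, e.g. because a JVM limit was exceeded.
  def compiledProjection(): Projection =
    throw new IllegalStateException("generated class exceeds JVM limits")

  // Slower but always available interpreted version.
  def interpretedProjection(): Projection = row => row.map(_ + 1)

  val projection: Projection =
    FallbackFactory.createWithFallback(compiledProjection())(interpretedProjection())
}
```

The fallback only helps if every expression has an interpreted implementation, which is why the umbrella ticket splits the work per expression.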
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386789#comment-16386789 ] Attila Zsolt Piros commented on SPARK-16630: I have checked the existing sources and I would like to open a discussion about a possible solution. As far as I can see, YarnAllocator#processCompletedContainers could be extended to track the number of failures by host. YarnAllocator is also responsible for updating YARN with the task-level blacklisted nodes (by calling AMRMClient#updateBlacklist). So a relatively easy solution would be to keep a separate counter here (independent of task-level failures) with its own configured limit, and to update YARN with the union of the task-level blacklisted nodes and the "allocator-level" blacklisted nodes. What is your opinion? > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major > > On YARN, it's possible that a node is messed up or misconfigured such that a > container won't launch on it. For instance, the Spark external shuffle > handler may not have been loaded on it, or it may just be some other hardware or > Hadoop configuration issue. > It would be nice if we could recognize this happening and stop trying to launch > executors on it, since that could end up causing us to hit our max number of > executor failures and then kill the job.
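The proposal in the comment above (a per-host failure counter with its own limit, reported to YARN together with the task-level blacklist) can be sketched as below. `AllocatorBlacklist` and its methods are hypothetical names for illustration. The real change would live in `YarnAllocator`, and the actual limit would come from a configuration key that the comment does not name.

```scala
import scala.collection.mutable

// Hypothetical allocator-level tracker, independent of task-level failures.
final class AllocatorBlacklist(maxFailuresPerHost: Int) {
  private val failuresByHost = mutable.Map.empty[String, Int]

  // Call once for every container that failed to launch on `host`.
  def recordLaunchFailure(host: String): Unit =
    failuresByHost(host) = failuresByHost.getOrElse(host, 0) + 1

  // Hosts that reached the configured failure limit.
  def allocatorLevelBlacklist: Set[String] =
    failuresByHost.collect { case (host, n) if n >= maxFailuresPerHost => host }.toSet

  // Union of both blacklists: the set that would be passed on to YARN.
  def nodesToReport(taskLevelBlacklist: Set[String]): Set[String] =
    taskLevelBlacklist union allocatorLevelBlacklist
}
```

Keeping the counter separate means a host can be excluded for launch failures even if no task ever ran on it.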
[jira] [Resolved] (SPARK-23559) add epoch ID to data writer factory
[ https://issues.apache.org/jira/browse/SPARK-23559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23559. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 20710 [https://github.com/apache/spark/pull/20710] > add epoch ID to data writer factory > --- > > Key: SPARK-23559 > URL: https://issues.apache.org/jira/browse/SPARK-23559 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 3.0.0 > > > To support the StreamWriter lifecycle described in SPARK-22910, epoch ID has > to be specifiable at DataWriter creation.
[jira] [Assigned] (SPARK-23559) add epoch ID to data writer factory
[ https://issues.apache.org/jira/browse/SPARK-23559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-23559: - Assignee: Jose Torres > add epoch ID to data writer factory > --- > > Key: SPARK-23559 > URL: https://issues.apache.org/jira/browse/SPARK-23559 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 3.0.0 > > > To support the StreamWriter lifecycle described in SPARK-22910, epoch ID has > to be specifiable at DataWriter creation.
[jira] [Commented] (SPARK-23586) Add interpreted execution for WrapOption expression
[ https://issues.apache.org/jira/browse/SPARK-23586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386780#comment-16386780 ] Apache Spark commented on SPARK-23586: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20741 > Add interpreted execution for WrapOption expression > --- > > Key: SPARK-23586 > URL: https://issues.apache.org/jira/browse/SPARK-23586 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major
[jira] [Assigned] (SPARK-23604) ParquetInteroperabilityTest timestamp test should use Statistics.hasNonNullValue
[ https://issues.apache.org/jira/browse/SPARK-23604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23604: Assignee: Apache Spark > ParquetInteroperabilityTest timestamp test should use > Statistics.hasNonNullValue > > > Key: SPARK-23604 > URL: https://issues.apache.org/jira/browse/SPARK-23604 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Assignee: Apache Spark >Priority: Minor > > We ran into an issue with a downstream build of Spark running against a > custom Parquet build where {{ParquetInteroperabilityTestSuite}} started > failing because {{Statistics.isEmpty}} changed its behavior as of > PARQUET-1217. ({{isEmpty()}} now considers whether there are 0 or more nulls, > and by default {{num_nulls}} is 0 for 'empty' stats objects). > The test really cares about whether the statistics object has values, so a > very simple fix to use {{hasNonNullValue}} instead corrects the issue. Filing > it now because it's a backwards-compatible fix to the current Parquet version > so we can fix it right now before we hit the issue in the future.
[jira] [Commented] (SPARK-23604) ParquetInteroperabilityTest timestamp test should use Statistics.hasNonNullValue
[ https://issues.apache.org/jira/browse/SPARK-23604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386769#comment-16386769 ] Apache Spark commented on SPARK-23604: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/20740 > ParquetInteroperabilityTest timestamp test should use > Statistics.hasNonNullValue > > > Key: SPARK-23604 > URL: https://issues.apache.org/jira/browse/SPARK-23604 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Priority: Minor > > We ran into an issue with a downstream build of Spark running against a > custom Parquet build where {{ParquetInteroperabilityTestSuite}} started > failing because {{Statistics.isEmpty}} changed its behavior as of > PARQUET-1217. ({{isEmpty()}} now considers whether there are 0 or more nulls, > and by default {{num_nulls}} is 0 for 'empty' stats objects). > The test really cares about whether the statistics object has values, so a > very simple fix to use {{hasNonNullValue}} instead corrects the issue. Filing > it now because it's a backwards-compatible fix to the current Parquet version > so we can fix it right now before we hit the issue in the future.
[jira] [Assigned] (SPARK-23604) ParquetInteroperabilityTest timestamp test should use Statistics.hasNonNullValue
[ https://issues.apache.org/jira/browse/SPARK-23604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23604: Assignee: (was: Apache Spark) > ParquetInteroperabilityTest timestamp test should use > Statistics.hasNonNullValue > > > Key: SPARK-23604 > URL: https://issues.apache.org/jira/browse/SPARK-23604 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Priority: Minor > > We ran into an issue with a downstream build of Spark running against a > custom Parquet build where {{ParquetInteroperabilityTestSuite}} started > failing because {{Statistics.isEmpty}} changed its behavior as of > PARQUET-1217. ({{isEmpty()}} now considers whether there are 0 or more nulls, > and by default {{num_nulls}} is 0 for 'empty' stats objects). > The test really cares about whether the statistics object has values, so a > very simple fix to use {{hasNonNullValue}} instead corrects the issue. Filing > it now because it's a backwards-compatible fix to the current Parquet version > so we can fix it right now before we hit the issue in the future.
[jira] [Created] (SPARK-23604) ParquetInteroperabilityTest timestamp test should use Statistics.hasNonNullValue
Henry Robinson created SPARK-23604: -- Summary: ParquetInteroperabilityTest timestamp test should use Statistics.hasNonNullValue Key: SPARK-23604 URL: https://issues.apache.org/jira/browse/SPARK-23604 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Henry Robinson We ran into an issue with a downstream build of Spark running against a custom Parquet build where {{ParquetInteroperabilityTestSuite}} started failing because {{Statistics.isEmpty}} changed its behavior as of PARQUET-1217. ({{isEmpty()}} now considers whether there are 0 or more nulls, and by default {{num_nulls}} is 0 for 'empty' stats objects). The test really cares about whether the statistics object has values, so a very simple fix to use {{hasNonNullValue}} instead corrects the issue. Filing it now because it's a backwards-compatible fix to the current Parquet version so we can fix it right now before we hit the issue in the future.
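To see why the test should ask `hasNonNullValue` rather than `isEmpty`, consider the toy model below. It is one reading of the description and not the Parquet `Statistics` class. `ColumnStats`, `isEmptyOld`, and `isEmptyNew` are hypothetical names, and the exact semantics of the changed `isEmpty` are an assumption based on the ticket text.

```scala
// Hypothetical, simplified model of column statistics; not the Parquet API.
final case class ColumnStats(minMax: Option[(Long, Long)], numNulls: Option[Long]) {
  // True only when actual min/max values were recorded.
  def hasNonNullValue: Boolean = minMax.isDefined

  // Assumed behavior before the change: only min/max matter.
  def isEmptyOld: Boolean = !hasNonNullValue

  // Assumed behavior after the change: a recorded null count also makes the stats "non-empty".
  def isEmptyNew: Boolean = !hasNonNullValue && numNulls.isEmpty
}

object StatsDemo {
  // A stats object with no values whose null count defaults to 0.
  val defaultStats: ColumnStats = ColumnStats(minMax = None, numNulls = Some(0L))
}
```

For `StatsDemo.defaultStats`, the two `isEmpty` variants disagree while `hasNonNullValue` gives the same answer under either behavior, which is the property the test cares about.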
[jira] [Updated] (SPARK-23457) Register task completion listeners first for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23457: Fix Version/s: 2.3.1 > Register task completion listeners first for ParquetFileFormat > -- > > Key: SPARK-23457 > URL: https://issues.apache.org/jira/browse/SPARK-23457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > ParquetFileFormat leaks open files in some cases. This issue aims to register > task completion listener first. > {code} > test("SPARK-23390 Register task completion listeners first in > ParquetFileFormat") { > withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> > s"${Int.MaxValue}") { > withTempDir { dir => > val basePath = dir.getCanonicalPath > Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, > "first").toString) > Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, > "second").toString) > val df = spark.read.parquet( > new Path(basePath, "first").toString, > new Path(basePath, "second").toString) > val e = intercept[SparkException] { > df.collect() > } > assert(e.getCause.isInstanceOf[OutOfMemoryError]) > } > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
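The idea of registering the completion listener first can be illustrated without Spark: if cleanup is registered before the step that may throw, the resource is still released when the task ends. The sketch below uses hypothetical stand-ins (`MiniTaskContext`, `FakeReader`, `ListenerOrdering`) and is not the `ParquetFileFormat` code.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical mini task context that runs listeners when the task completes.
final class MiniTaskContext {
  private val listeners = ListBuffer.empty[() => Unit]
  def addTaskCompletionListener(f: () => Unit): Unit = { listeners += f; () }
  def markTaskCompleted(): Unit = listeners.foreach(_.apply())
}

// Hypothetical reader whose initialization fails after the file was opened.
final class FakeReader {
  var closed = false
  def initialize(): Unit =
    throw new RuntimeException("simulated failure during initialize")
  def close(): Unit = closed = true
}

object ListenerOrdering {
  def openReader(ctx: MiniTaskContext, reader: FakeReader): Unit = {
    // Register cleanup first, so the reader is closed even if initialize() throws.
    ctx.addTaskCompletionListener(() => reader.close())
    reader.initialize()
  }
}
```

If the two statements in `openReader` were swapped, the exception would skip the registration and the reader would stay open.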
[jira] [Resolved] (SPARK-23585) Add interpreted execution for UnwrapOption expression
[ https://issues.apache.org/jira/browse/SPARK-23585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-23585. --- Resolution: Fixed Assignee: Marco Gaido Fix Version/s: 2.4.0 > Add interpreted execution for UnwrapOption expression > - > > Key: SPARK-23585 > URL: https://issues.apache.org/jira/browse/SPARK-23585 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0
[jira] [Resolved] (SPARK-22882) ML test for StructuredStreaming: spark.ml.classification
[ https://issues.apache.org/jira/browse/SPARK-22882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22882. --- Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 20121 [https://github.com/apache/spark/pull/20121] > ML test for StructuredStreaming: spark.ml.classification > > > Key: SPARK-22882 > URL: https://issues.apache.org/jira/browse/SPARK-22882 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23500) Filters on named_structs could be pushed into scans
[ https://issues.apache.org/jira/browse/SPARK-23500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23500: Assignee: Apache Spark > Filters on named_structs could be pushed into scans > --- > > Key: SPARK-23500 > URL: https://issues.apache.org/jira/browse/SPARK-23500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Assignee: Apache Spark >Priority: Major > > Simple filters on dataframes joined with {{joinWith()}} are missing an > opportunity to get pushed into the scan because they're written in terms of > {{named_struct}} that could be removed by the optimizer. > Given the following simple query over two dataframes: > {code:java} > scala> val df = spark.read.parquet("one_million") > df: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df2 = spark.read.parquet("one_million") > df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> df.joinWith(df2, df2.col("id") === df.col("id2")).filter("_2.id > > 30").explain > == Physical Plan == > *(2) BroadcastHashJoin [_1#94.id2], [_2#95.id], Inner, BuildRight > :- *(2) Project [named_struct(id, id#0L, id2, id2#1L) AS _1#94] > : +- *(2) FileScan parquet [id#0L,id2#1L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > struct, false].id)) >+- *(1) Project [named_struct(id, id#90L, id2, id2#91L) AS _2#95] > +- *(1) Filter (named_struct(id, id#90L, id2, id2#91L).id > 30) > +- *(1) FileScan parquet [id#90L,id2#91L] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct > {code} > Using {{joinWith}} means that the filter is placed on a {{named_struct}}, and > is then pushed down. 
When the filter is just above the scan, the > wrapping-and-projection of {{named_struct(id...).id}} is a no-op and could be > removed. Then the filter can be pushed down to Parquet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23500) Filters on named_structs could be pushed into scans
[ https://issues.apache.org/jira/browse/SPARK-23500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386604#comment-16386604 ] Apache Spark commented on SPARK-23500: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/20687 > Filters on named_structs could be pushed into scans > --- > > Key: SPARK-23500 > URL: https://issues.apache.org/jira/browse/SPARK-23500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Priority: Major > > Simple filters on dataframes joined with {{joinWith()}} are missing an > opportunity to get pushed into the scan because they're written in terms of > {{named_struct}} that could be removed by the optimizer. > Given the following simple query over two dataframes: > {code:java} > scala> val df = spark.read.parquet("one_million") > df: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df2 = spark.read.parquet("one_million") > df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> df.joinWith(df2, df2.col("id") === df.col("id2")).filter("_2.id > > 30").explain > == Physical Plan == > *(2) BroadcastHashJoin [_1#94.id2], [_2#95.id], Inner, BuildRight > :- *(2) Project [named_struct(id, id#0L, id2, id2#1L) AS _1#94] > : +- *(2) FileScan parquet [id#0L,id2#1L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > struct, false].id)) >+- *(1) Project [named_struct(id, id#90L, id2, id2#91L) AS _2#95] > +- *(1) Filter (named_struct(id, id#90L, id2, id2#91L).id > 30) > +- *(1) FileScan parquet [id#90L,id2#91L] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct > {code} > 
Using {{joinWith}} means that the filter is placed on a {{named_struct}}, and > is then pushed down. When the filter is just above the scan, the > wrapping-and-projection of {{named_struct(id...).id}} is a no-op and could be > removed. Then the filter can be pushed down to Parquet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23500) Filters on named_structs could be pushed into scans
[ https://issues.apache.org/jira/browse/SPARK-23500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23500: Assignee: (was: Apache Spark) > Filters on named_structs could be pushed into scans > --- > > Key: SPARK-23500 > URL: https://issues.apache.org/jira/browse/SPARK-23500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Priority: Major > > Simple filters on dataframes joined with {{joinWith()}} are missing an > opportunity to get pushed into the scan because they're written in terms of > {{named_struct}} that could be removed by the optimizer. > Given the following simple query over two dataframes: > {code:java} > scala> val df = spark.read.parquet("one_million") > df: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df2 = spark.read.parquet("one_million") > df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> df.joinWith(df2, df2.col("id") === df.col("id2")).filter("_2.id > > 30").explain > == Physical Plan == > *(2) BroadcastHashJoin [_1#94.id2], [_2#95.id], Inner, BuildRight > :- *(2) Project [named_struct(id, id#0L, id2, id2#1L) AS _1#94] > : +- *(2) FileScan parquet [id#0L,id2#1L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > struct, false].id)) >+- *(1) Project [named_struct(id, id#90L, id2, id2#91L) AS _2#95] > +- *(1) Filter (named_struct(id, id#90L, id2, id2#91L).id > 30) > +- *(1) FileScan parquet [id#90L,id2#91L] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct > {code} > Using {{joinWith}} means that the filter is placed on a {{named_struct}}, and > is then pushed down. 
When the filter is just above the scan, the > wrapping-and-projection of {{named_struct(id...).id}} is a no-op and could be > removed. Then the filter can be pushed down to Parquet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
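The simplification described here, collapsing `named_struct(id, ...).id` to the underlying column so the filter refers to a plain attribute, can be sketched on a toy expression tree. The case classes and `SimplifyStructAccess` below are hypothetical and far simpler than Catalyst's expressions. They only illustrate the rewrite.

```scala
// Hypothetical toy expression tree; not Catalyst.
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class Lit(value: Long) extends Expr
final case class NamedStruct(fields: Seq[(String, Expr)]) extends Expr
final case class GetField(child: Expr, field: String) extends Expr
final case class GreaterThan(left: Expr, right: Expr) extends Expr

object SimplifyStructAccess {
  // Bottom-up rewrite: GetField(NamedStruct(..., f -> e, ...), f) becomes e.
  def apply(e: Expr): Expr = e match {
    case GetField(child, name) =>
      apply(child) match {
        case s @ NamedStruct(fields) =>
          fields.collectFirst { case (n, v) if n == name => v }
            .getOrElse(GetField(s, name))
        case simplified => GetField(simplified, name)
      }
    case NamedStruct(fields) =>
      NamedStruct(fields.map { case (n, v) => (n, apply(v)) })
    case GreaterThan(l, r) => GreaterThan(apply(l), apply(r))
    case leaf => leaf
  }
}
```

Applied to the filter from the plan above, `named_struct(id, id#90L, id2, id2#91L).id > 30` becomes `id#90L > 30`, a predicate on a plain column that a data source could accept as a pushed filter.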
[jira] [Commented] (SPARK-18791) Stream-Stream Joins
[ https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386583#comment-16386583 ]

Yuriy Bondaruk commented on SPARK-18791:
----------------------------------------

Shouldn't it be marked as resolved? Stream-stream joins are already supported in Spark 2.3:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins

> Stream-Stream Joins
> -------------------
>
>                 Key: SPARK-18791
>                 URL: https://issues.apache.org/jira/browse/SPARK-18791
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>            Reporter: Michael Armbrust
>            Assignee: Tathagata Das
>            Priority: Major
>
> Stream-stream join is a much requested, but missing, feature in Structured Streaming. While the join API exists in Datasets and DataFrames, it throws UnsupportedOperationException when applied between two streaming Datasets/DataFrames. To support this, we have to maintain the same semantics as other Structured Streaming operations: the result of the operation after consuming the data of two streams up to positions/offsets X and Y, respectively, must be the same as a single batch join operation on all the data up to positions X and Y, respectively. To achieve this, the execution has to buffer past data (i.e. streaming state) from each stream, so that future data can be matched against past data. Here is a set of high-level requirements.
> - Buffer past rows as streaming state (using StateStore), and join incoming rows with the past rows.
> - Support state cleanup using the event time watermark when possible.
> - Support different types of joins (inner, left outer and right outer are in highest demand for ETL/enrichment type use cases [kafka -> best-effort enrich -> write to S3])
> - Support cascading join operations (i.e. joining more than 2 streams)
> - Support multiple output modes (Append mode is in highest demand for enabling ETL/enrichment type use cases)
> All the work to incrementally build this is going to be represented by this JIRA, with specific subtasks for each step. At this point, the rough direction is as follows:
> - Implement stream-stream inner join in Append Mode, supporting multiple cascaded joins.
> - Extend it to stream-stream left/right outer join in Append Mode
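The first two requirements in the quoted description (buffer past rows as state, clean up state with a watermark) can be illustrated with a minimal sketch in plain Scala collections. This is not Spark's implementation and does not use StateStore; `Event`, `SymmetricJoin`, `onLeft`, `onRight` and `evictBefore` are hypothetical names chosen for the illustration.

```scala
import scala.collection.mutable

// Hypothetical event type: a join key, an event time, and a payload.
final case class Event(key: String, time: Long, payload: String)

// Sketch of a symmetric inner join: each side buffers its past rows,
// and every new row is matched against the other side's buffer.
final class SymmetricJoin {
  private val leftState  = mutable.Map.empty[String, Vector[Event]]
  private val rightState = mutable.Map.empty[String, Vector[Event]]

  // A new left row is buffered, then joined with all buffered right rows.
  def onLeft(e: Event): Seq[(Event, Event)] = {
    leftState(e.key) = leftState.getOrElse(e.key, Vector.empty) :+ e
    rightState.getOrElse(e.key, Vector.empty).map(r => (e, r))
  }

  // A new right row is buffered, then joined with all buffered left rows.
  def onRight(e: Event): Seq[(Event, Event)] = {
    rightState(e.key) = rightState.getOrElse(e.key, Vector.empty) :+ e
    leftState.getOrElse(e.key, Vector.empty).map(l => (l, e))
  }

  // State cleanup: drop buffered rows whose event time is below the watermark.
  def evictBefore(watermark: Long): Unit =
    for (state <- Seq(leftState, rightState); k <- state.keys.toList) {
      val kept = state(k).filter(_.time >= watermark)
      if (kept.isEmpty) state.remove(k) else state(k) = kept
    }

  // Total number of rows currently held as state on both sides.
  def bufferedRows: Int =
    leftState.values.map(_.size).sum + rightState.values.map(_.size).sum
}
```

Because both sides buffer, the joined pairs do not depend on which stream delivers its row first, which is the batch-equivalence property the description asks for. Without `evictBefore`, the buffers would grow without bound.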
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386554#comment-16386554 ]

ASF GitHub Bot commented on SPARK-15343:
----------------------------------------

Github user kr-arjun commented on the issue:

    https://github.com/apache/drill/pull/1011

@paul-rogers I was able to work around this issue by setting 'yarn.timeline-service.enabled' to false (I copied a yarn-site.xml with this property set into the $DRILL_SITE directory). The issue is specific to environments where the Timeline server is enabled. Initially, it failed with 'java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig'. After copying the required jars to the Drill classpath, it failed with the exception I shared in the previous attachment. The same issue is reported in Spark as well (https://issues.apache.org/jira/browse/SPARK-15343).
To find the error stack trace, I had to modify DrillOnYarn.java to print the stack trace. It would be useful if the stack trace could be logged for troubleshooting purposes.

> NoClassDefFoundError when initializing Spark with YARN
> ------------------------------------------------------
>
>                 Key: SPARK-15343
>                 URL: https://issues.apache.org/jira/browse/SPARK-15343
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.0.0
>            Reporter: Maciej Bryński
>            Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting the following error:
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has been deprecated as of Spark 2.0 and may be removed in the future. Please use the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in <module>
>     sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
>     conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
>     self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in _initialize_context
>     return self._jvm.JavaSparkContext(jconf)
>   File "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 1183, in __call__
>   File "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
>   at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
>   at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
>   at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:502)
>   at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoke
[jira] [Assigned] (SPARK-22882) ML test for StructuredStreaming: spark.ml.classification
[ https://issues.apache.org/jira/browse/SPARK-22882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reassigned SPARK-22882:
-----------------------------------------

    Assignee: Weichen Xu

> ML test for StructuredStreaming: spark.ml.classification
> --------------------------------------------------------
>
>                 Key: SPARK-22882
>                 URL: https://issues.apache.org/jira/browse/SPARK-22882
>             Project: Spark
>          Issue Type: Test
>          Components: ML, Tests
>    Affects Versions: 2.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Weichen Xu
>            Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Commented] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386522#comment-16386522 ]

Joseph K. Bradley commented on SPARK-22883:
-------------------------------------------

Preparing part 2

> ML test for StructuredStreaming: spark.ml.feature, A-M
> ------------------------------------------------------
>
>                 Key: SPARK-22883
>                 URL: https://issues.apache.org/jira/browse/SPARK-22883
>             Project: Spark
>          Issue Type: Test
>          Components: ML, Tests
>    Affects Versions: 2.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Major
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Commented] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386520#comment-16386520 ]

Joseph K. Bradley commented on SPARK-22446:
-------------------------------------------

Maybe not then unless someone complains?

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22446
>                 URL: https://issues.apache.org/jira/browse/SPARK-22446
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, Optimizer
>    Affects Versions: 2.0.2, 2.1.2, 2.2.1
>         Environment: spark-shell, local mode, macOS Sierra 10.12.6
>            Reporter: Greg Bellchambers
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>             Fix For: 2.3.0
>
> In the following, the `indexer` UDF defined inside the `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an "Unseen label" error, despite the label not being present in the transformed DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
>     labelToIndex(label)
>   } else {
>     throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>      | ("A", "London", "StrA"),
>      | ("B", "Bristol", null),
>      | ("C", "New York", "StrC")
>      | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more field]
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // then we remove the row with null in CONTENT column, which removes Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string, CITY: string ... 1 more field]
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // now create a StringIndexer for the CITY column and fit to dfNoBristol
> scala> val model = {
>      | new StringIndexer()
>      | .setInputCol("CITY")
>      | .setOutputCol("CITYIndexed")
>      | .fit(dfNoBristol)
>      | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 more fields]
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method throws an exception reporting unseen label "Bristol". This is irrational behaviour as far as the user of the API is concerned, because there is no such value as "Bristol" when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (string) => double)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(Shuffle
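The symptom reported in this issue can be modelled without Spark. The sketch below assumes, based only on the issue title, that the optimizer ends up evaluating the index predicate on rows that the earlier null filter had not yet removed. It is a plain-Scala model of that evaluation order, not Catalyst's actual plan. `UnseenLabelSketch`, `filterThenIndex` and `indexThenFilter` are hypothetical names.

```scala
object UnseenLabelSketch {
  final case class Row(id: String, city: String, content: String)

  // The three rows from the report; Bristol has a null CONTENT.
  val rows = Seq(
    Row("A", "London", "StrA"),
    Row("B", "Bristol", null),
    Row("C", "New York", "StrC"))

  // Labels learned from the data that no longer contains Bristol.
  val labelToIndex = Map("London" -> 0.0, "New York" -> 1.0)

  // Same shape as the indexer UDF in the issue: throws on an unknown label.
  def indexer(label: String): Double =
    labelToIndex.getOrElse(label, throw new IllegalArgumentException(s"Unseen label: $label."))

  // Order the user wrote: drop null CONTENT first, then test the index.
  def filterThenIndex: Int =
    rows.filter(_.content != null).count(r => indexer(r.city) == 1.0)

  // Reordered evaluation: the index predicate runs before the null check,
  // so the indexer sees "Bristol" and throws.
  def indexThenFilter: Int =
    rows.filter(r => indexer(r.city) == 1.0).count(_.content != null)
}
```

Both methods would return the same count if `indexer` were total. Reordering becomes visible only because the function throws for some inputs. Wrapping the second call in `scala.util.Try` shows the failure without aborting the session.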
[jira] [Assigned] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1
[ https://issues.apache.org/jira/browse/SPARK-22430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reassigned SPARK-22430: Assignee: Rekha Joshi > Unknown tag warnings when building R docs with Roxygen 6.0.1 > > > Key: SPARK-22430 > URL: https://issues.apache.org/jira/browse/SPARK-22430 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 > Environment: Roxygen 6.0.1 >Reporter: Joel Croteau >Assignee: Rekha Joshi >Priority: Trivial > Fix For: 2.4.0 > > > When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of > unknown tag warnings are generated: > {noformat} > Warning: @export [schema.R#33]: unknown tag > Warning: @export [schema.R#53]: unknown tag > Warning: @export [schema.R#63]: unknown tag > Warning: @export [schema.R#80]: unknown tag > Warning: @export [schema.R#123]: unknown tag > Warning: @export [schema.R#141]: unknown tag > Warning: @export [schema.R#216]: unknown tag > Warning: @export [generics.R#388]: unknown tag > Warning: @export [generics.R#403]: unknown tag > Warning: @export [generics.R#407]: unknown tag > Warning: @export [generics.R#414]: unknown tag > Warning: @export [generics.R#418]: unknown tag > Warning: @export [generics.R#422]: unknown tag > Warning: @export [generics.R#428]: unknown tag > Warning: @export [generics.R#432]: unknown tag > Warning: @export [generics.R#438]: unknown tag > Warning: @export [generics.R#442]: unknown tag > Warning: @export [generics.R#446]: unknown tag > Warning: @export [generics.R#450]: unknown tag > Warning: @export [generics.R#454]: unknown tag > Warning: @export [generics.R#459]: unknown tag > Warning: @export [generics.R#467]: unknown tag > Warning: @export [generics.R#475]: unknown tag > Warning: @export [generics.R#479]: unknown tag > Warning: @export [generics.R#483]: unknown tag > Warning: @export [generics.R#487]: unknown tag > Warning: @export [generics.R#498]: unknown tag > Warning: @export [generics.R#502]: 
unknown tag > Warning: @export [generics.R#506]: unknown tag > Warning: @export [generics.R#512]: unknown tag > Warning: @export [generics.R#518]: unknown tag > Warning: @export [generics.R#526]: unknown tag > Warning: @export [generics.R#530]: unknown tag > Warning: @export [generics.R#534]: unknown tag > Warning: @export [generics.R#538]: unknown tag > Warning: @export [generics.R#542]: unknown tag > Warning: @export [generics.R#549]: unknown tag > Warning: @export [generics.R#556]: unknown tag > Warning: @export [generics.R#560]: unknown tag > Warning: @export [generics.R#567]: unknown tag > Warning: @export [generics.R#571]: unknown tag > Warning: @export [generics.R#575]: unknown tag > Warning: @export [generics.R#579]: unknown tag > Warning: @export [generics.R#583]: unknown tag > Warning: @export [generics.R#587]: unknown tag > Warning: @export [generics.R#591]: unknown tag > Warning: @export [generics.R#595]: unknown tag > Warning: @export [generics.R#599]: unknown tag > Warning: @export [generics.R#603]: unknown tag > Warning: @export [generics.R#607]: unknown tag > Warning: @export [generics.R#611]: unknown tag > Warning: @export [generics.R#615]: unknown tag > Warning: @export [generics.R#619]: unknown tag > Warning: @export [generics.R#623]: unknown tag > Warning: @export [generics.R#627]: unknown tag > Warning: @export [generics.R#631]: unknown tag > Warning: @export [generics.R#635]: unknown tag > Warning: @export [generics.R#639]: unknown tag > Warning: @export [generics.R#643]: unknown tag > Warning: @export [generics.R#647]: unknown tag > Warning: @export [generics.R#654]: unknown tag > Warning: @export [generics.R#658]: unknown tag > Warning: @export [generics.R#663]: unknown tag > Warning: @export [generics.R#667]: unknown tag > Warning: @export [generics.R#672]: unknown tag > Warning: @export [generics.R#676]: unknown tag > Warning: @export [generics.R#680]: unknown tag > Warning: @export [generics.R#684]: unknown tag > Warning: @export 
[generics.R#690]: unknown tag > Warning: @export [generics.R#696]: unknown tag > Warning: @export [generics.R#702]: unknown tag > Warning: @export [generics.R#706]: unknown tag > Warning: @export [generics.R#710]: unknown tag > Warning: @export [generics.R#716]: unknown tag > Warning: @export [generics.R#720]: unknown tag > Warning: @export [generics.R#726]: unknown tag > Warning: @export [generics.R#730]: unknown tag > Warning: @export [generics.R#734]: unknown tag > Warning: @export [generics.R#738]: unknown tag > Warning: @export [generics.R#742]: unknown tag > Warning: @export [generics.R#750]: unknown tag > Warning: @export [generics.R#754]: unknown tag > Warning: @export [generics.R#758]: unknown tag > Warning: @export [generics.R#766]: unknown tag > Warnin
[jira] [Resolved] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1
[ https://issues.apache.org/jira/browse/SPARK-22430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung resolved SPARK-22430.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0
 Target Version/s: 2.4.0
[jira] [Commented] (SPARK-23528) Expose vital statistics of GaussianMixtureModel
[ https://issues.apache.org/jira/browse/SPARK-23528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386318#comment-16386318 ]

Erich Schubert commented on SPARK-23528:
----------------------------------------

I had only been looking at the mllib API. There is no summary there. What a mess that is.

> Expose vital statistics of GaussianMixtureModel
> -----------------------------------------------
>
>                 Key: SPARK-23528
>                 URL: https://issues.apache.org/jira/browse/SPARK-23528
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.1
>            Reporter: Erich Schubert
>            Priority: Minor
>
> Spark ML should expose vital statistics of the GMM model:
> * *Number of iterations* (actual, not max) until the tolerance threshold was hit: we can set a maximum, but how do we know the limit was large enough, and how many iterations it really took?
> * Final *log likelihood* of the model: if we run multiple times with different starting conditions, how do we know which run converged to the better fit?
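To make the two requested statistics concrete, here is a small sketch of an EM loop that reports them. It is not Spark's GaussianMixture. It is a deliberately restricted one-dimensional, two-component mixture with fixed equal weights and unit variance, so only the means are re-estimated. `FitSummary` and `TinyGmm` are hypothetical names.

```scala
// Hypothetical result type carrying the two statistics the issue asks for.
final case class FitSummary(iterations: Int, logLikelihood: Double)

object TinyGmm {
  // Density of a 1-D Gaussian with the given mean and variance.
  private def pdf(x: Double, mean: Double, variance: Double): Double =
    math.exp(-(x - mean) * (x - mean) / (2 * variance)) / math.sqrt(2 * math.Pi * variance)

  // EM restricted to two components with weights 0.5 and variance 1.0.
  def fit(data: Seq[Double], init: (Double, Double), maxIter: Int, tol: Double): FitSummary = {
    def logLik(a: Double, b: Double): Double =
      data.map(x => math.log(0.5 * pdf(x, a, 1.0) + 0.5 * pdf(x, b, 1.0))).sum

    var (m1, m2) = init
    var previous = logLik(m1, m2)
    var iterations = 0
    var converged = false
    while (!converged && iterations < maxIter) {
      // E-step: responsibility of component 1 for each point.
      val r1 = data.map { x =>
        val a = pdf(x, m1, 1.0)
        val b = pdf(x, m2, 1.0)
        a / (a + b)
      }
      // M-step: responsibility-weighted means.
      m1 = data.zip(r1).map { case (x, r) => x * r }.sum / r1.sum
      m2 = data.zip(r1).map { case (x, r) => x * (1 - r) }.sum / r1.map(1 - _).sum
      val current = logLik(m1, m2)
      iterations += 1
      // Stop once the log likelihood changes by less than the tolerance.
      converged = math.abs(current - previous) < tol
      previous = current
    }
    FitSummary(iterations, previous)
  }
}
```

With such a summary, a caller can check `iterations < maxIter` to see whether the limit was large enough, and can pick the best of several runs from different starting points by taking the one with the largest `logLikelihood`.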
[jira] [Commented] (SPARK-23585) Add interpreted execution for UnwrapOption expression
[ https://issues.apache.org/jira/browse/SPARK-23585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386275#comment-16386275 ]

Apache Spark commented on SPARK-23585:
--------------------------------------

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20736

> Add interpreted execution for UnwrapOption expression
> -----------------------------------------------------
>
>                 Key: SPARK-23585
>                 URL: https://issues.apache.org/jira/browse/SPARK-23585
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Priority: Major
[jira] [Assigned] (SPARK-23585) Add interpreted execution for UnwrapOption expression
[ https://issues.apache.org/jira/browse/SPARK-23585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23585:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-23585) Add interpreted execution for UnwrapOption expression
[ https://issues.apache.org/jira/browse/SPARK-23585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23585:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Commented] (SPARK-23603) When the length of the json is in a range,get_json_object will result in missing tail data
[ https://issues.apache.org/jira/browse/SPARK-23603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386239#comment-16386239 ]

Apache Spark commented on SPARK-23603:
--------------------------------------

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/20739

> When the length of the json is in a range,get_json_object will result in missing tail data
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23603
>                 URL: https://issues.apache.org/jira/browse/SPARK-23603
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.2.0, 2.3.0
>            Reporter: dzcxzl
>            Priority: Major
>
> Jackson(>=2.7.7) fixes the possibility of missing tail data when the length of the value is in a range
> [https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7]
> [https://github.com/FasterXML/jackson-core/issues/307]
>
> spark-shell:
> {code:java}
> val value = "x" * 3000
> val json = s"""{"big": "$value"}"""
> spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect
> res0: Array[org.apache.spark.sql.Row] = Array([2991])
> {code}
> correct result : 3000
>
> There are two solutions
> One is
> bump jackson version to 2.7.7
> The other one is
> Replace writeRaw(char[] text, int offset, int len) with writeRaw(String text)
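Since the report says the tail is lost only when the value length falls in a certain range, a length sweep is a natural way to locate that range and to confirm a fix. The following is a sketch of such a sweep in plain Scala. `TailLossSweep`, `truncatedLengths` and `referenceExtract` are hypothetical helpers, and `extract` is a stand-in for whatever function is under test. Nothing here calls Spark or Jackson.

```scala
object TailLossSweep {
  // Builds the same JSON document shape as the spark-shell reproduction.
  def jsonFor(length: Int): String = {
    val value = "x" * length
    s"""{"big": "$value"}"""
  }

  // Returns every length for which `extract` does not return the full value.
  def truncatedLengths(lengths: Range, extract: String => String): Seq[Int] =
    lengths.filter(n => extract(jsonFor(n)).length != n)

  // Reference extractor for this fixed document shape only; not a JSON parser.
  def referenceExtract(json: String): String =
    json.stripPrefix("{\"big\": \"").stripSuffix("\"}")
}
```

An empty result from `truncatedLengths` means no length in the swept range lost data. With a Spark session available, `extract` could wrap the `get_json_object` query from the report, though that wiring is left out here.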
[jira] [Assigned] (SPARK-23603) When the length of the json is in a range,get_json_object will result in missing tail data
[ https://issues.apache.org/jira/browse/SPARK-23603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23603:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-23603) When the length of the json is in a range,get_json_object will result in missing tail data
[ https://issues.apache.org/jira/browse/SPARK-23603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386235#comment-16386235 ]

Apache Spark commented on SPARK-23603:
--------------------------------------

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/20738
[jira] [Assigned] (SPARK-23603) When the length of the json is in a range,get_json_object will result in missing tail data
[ https://issues.apache.org/jira/browse/SPARK-23603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23603:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Commented] (SPARK-23598) WholeStageCodegen can lead to IllegalAccessError calling append for HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-23598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386234#comment-16386234 ]

Marco Gaido commented on SPARK-23598:

Thanks for reporting this. The option you suggested as a possible fix isn't really viable, because it would basically mean inlining everything into the outer class, which causes other problems (namely the Java limit on constant pool entries). Redeclaring the method seems a bit hacky to me (and it causes an extra function call for each row). I'd go for making the method public, if no better option comes up.
[jira] [Created] (SPARK-23603) When the length of the json is in a range,get_json_object will result in missing tail data
dzcxzl created SPARK-23603:

Summary: When the length of the json is in a range,get_json_object will result in missing tail data
Key: SPARK-23603
URL: https://issues.apache.org/jira/browse/SPARK-23603
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0, 2.2.0, 2.0.0
Reporter: dzcxzl

Jackson (>= 2.7.7) fixes a bug that could drop the tail of a value when the length of the value falls within a certain range:
[https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7]
[https://github.com/FasterXML/jackson-core/issues/307]

spark-shell:
{code:java}
val value = "x" * 3000
val json = s"""{"big": "$value"}"""
spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect
res0: Array[org.apache.spark.sql.Row] = Array([2991])
{code}
The correct result is 3000.

There are two possible solutions:
* Bump the Jackson version to 2.7.7.
* Replace writeRaw(char[] text, int offset, int len) with writeRaw(String text).
[jira] [Created] (SPARK-23602) PrintToStderr should behave the same in interpreted mode
Herman van Hovell created SPARK-23602:

Summary: PrintToStderr should behave the same in interpreted mode
Key: SPARK-23602
URL: https://issues.apache.org/jira/browse/SPARK-23602
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Herman van Hovell

The {{PrintToStderr}} expression behaves differently in the interpreted and code-generated code paths. We should fix this.
[jira] [Commented] (SPARK-23600) conda_panda_example test fails to import panda lib with Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-23600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386177#comment-16386177 ]

Hyukjin Kwon commented on SPARK-23600:

BTW, where is "conda_panda_example.py"?

> conda_panda_example test fails to import panda lib with Spark 2.3
>
> Key: SPARK-23600
> URL: https://issues.apache.org/jira/browse/SPARK-23600
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 2.3.0
> Environment: ambari-server --version 2.7.0.2-64
> HDP-3.0.0.2-132
> Reporter: Supreeth Sharma
> Priority: Major
>
> With Spark 2.3, the conda panda test fails to import pandas.
> python version: Python 2.7.5
> 1) Create the requirements file.
> virtual_env_type : Native
> {code:java}
> packaging==16.8
> panda==0.3.1
> pyparsing==2.1.10
> requests==2.13.0
> six==1.10.0
> numpy==1.12.0
> pandas==0.19.2
> python-dateutil==2.6.0
> pytz==2016.10
> {code}
> virtual_env_type : conda
> {code:java}
> mkl=2017.0.1=0
> numpy=1.12.0=py27_0
> openssl=1.0.2k=0
> pandas=0.19.2=np112py27_1
> pip=9.0.1=py27_1
> python=2.7.13=0
> python-dateutil=2.6.0=py27_0
> pytz=2016.10=py27_0
> readline=6.2=2
> setuptools=27.2.0=py27_0
> six=1.10.0=py27_0
> sqlite=3.13.0=0
> tk=8.5.18=0
> wheel=0.29.0=py27_0
> zlib=1.2.8=3
> {code}
> 2) Run the conda panda test.
> {code:java}
> spark-submit --master yarn-client --jars /usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.3.0.0.2-132.jar --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/tmp/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv /hwqe/hadoopqe/tests/spark/data/conda_panda_example.py 2>&1 | tee /tmp/1/Spark_clientLogs/pyenv_conda_panda_example_native_yarn-client.log
> {code}
> 3) The application fails to import pandas.
> {code:java}
> 2018-03-05 13:43:31,493|INFO|MainThread|machine.py:167 - run()||GUID=a3cb88f7-bf55-4d9e-9cfe-3e44eae3a72b|18/03/05 13:43:31 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 2018-03-05 13:43:31,527|INFO|MainThread|machine.py:167 - run()||GUID=a3cb88f7-bf55-4d9e-9cfe-3e44eae3a72b|Traceback (most recent call last):
> 2018-03-05 13:43:31,527|INFO|MainThread|machine.py:167 - run()||GUID=a3cb88f7-bf55-4d9e-9cfe-3e44eae3a72b|File "/hwqe/hadoopqe/tests/spark/data/conda_panda_example.py", line 5, in
> 2018-03-05 13:43:31,528|INFO|MainThread|machine.py:167 - run()||GUID=a3cb88f7-bf55-4d9e-9cfe-3e44eae3a72b|import pandas as pd
> 2018-03-05 13:43:31,528|INFO|MainThread|machine.py:167 - run()||GUID=a3cb88f7-bf55-4d9e-9cfe-3e44eae3a72b|ImportError: No module named pandas
> 2018-03-05 13:43:31,547|INFO|MainThread|machine.py:167 - run()||GUID=a3cb88f7-bf55-4d9e-9cfe-3e44eae3a72b|18/03/05 13:43:31 INFO BlockManagerMasterEndpoint: Registering block manager ctr-e138-1518143905142-67599-01-05.hwx.site:44861 with 366.3 MB RAM, BlockManagerId(2, ctr-e138-1518143905142-67599-01-05.hwx.site, 44861, None)
> {code}
[jira] [Comment Edited] (SPARK-23600) conda_panda_example test fails to import panda lib with Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-23600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386169#comment-16386169 ]

Hyukjin Kwon edited comment on SPARK-23600 at 3/5/18 3:01 PM:

Let's not set the fix version; we usually set it only when the issue is actually fixed.

was (Author: hyukjin.kwon):
Let's don't set the fix version which we usually set.
[jira] [Updated] (SPARK-23600) conda_panda_example test fails to import panda lib with Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-23600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-23600:

Priority: Major (was: Critical)
Fix Version/s: (was: 2.3.0)

Let's not set the fix version; we usually set it only when the issue is actually fixed.
[jira] [Resolved] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions
[ https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-23329.

Resolution: Fixed
Fix Version/s: 2.4.0, 2.3.1

Fixed in https://github.com/apache/spark/pull/20618

> Update the function descriptions with the arguments and returned values of the trigonometric functions
>
> Key: SPARK-23329
> URL: https://issues.apache.org/jira/browse/SPARK-23329
> Project: Spark
> Issue Type: Documentation
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Xiao Li
> Assignee: Mihaly Toth
> Priority: Minor
> Labels: starter
> Fix For: 2.3.1, 2.4.0
>
> We need to update the function descriptions of all the trigonometric functions, for example {{cos}}, {{sin}}, and {{cot}}. Internally, the implementation is based on java.lang.Math. We need a clear description of the units of the input arguments and the returned values.
> For example, the following descriptions lack such information:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555
> https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978
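To illustrate the kind of unit information the issue asks for: the trigonometric functions in java.lang.Math take angles in radians, and Python's math module follows the same convention. The following is a minimal sketch in plain Python, not Spark code; the helper `cot` is a hypothetical name introduced only for this example.

```python
import math


def cot(x: float) -> float:
    """Cotangent of x, where x is an angle in radians."""
    return 1.0 / math.tan(x)


# Inputs are angles in radians, so a value in degrees must be converted first.
angle_deg = 60.0
angle_rad = math.radians(angle_deg)

cos_value = math.cos(angle_rad)     # unitless ratio in [-1, 1]
acos_value = math.acos(cos_value)   # inverse functions return radians

print(cos_value)
print(math.degrees(acos_value))
```

A description that states "angle in radians" for the argument and "radians" for the result of the inverse functions would remove the ambiguity the issue describes.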
[jira] [Assigned] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions
[ https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-23329:

Assignee: Mihaly Toth
[jira] [Resolved] (SPARK-23566) Arguement name fix
[ https://issues.apache.org/jira/browse/SPARK-23566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-23566.

Resolution: Fixed
Fix Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/20716

> Arguement name fix
>
> Key: SPARK-23566
> URL: https://issues.apache.org/jira/browse/SPARK-23566
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 1.6.2, 2.3.0
> Reporter: Anirudh
> Assignee: Anirudh
> Priority: Minor
> Labels: docs, minor, newbie
> Fix For: 2.4.0
>
> Original Estimate: 1m
> Remaining Estimate: 1m
>
> The docstring of `withColumnRenamed` documents one argument under the wrong name.
>
> {{ def withColumnRenamed(self, existing, new):}}
> {{ """Returns a new :class:`DataFrame` by renaming an existing column.}}
> {{ This is a no-op if schema doesn't contain the given column name.}}
> {{ :param existing: string, name of the existing column to rename.}}
> {{ :param col: string, new name of the column.}}
> {{ >>> df.withColumnRenamed('age', 'age2').collect()}}
> {{ [Row(age2=2, name=u'Alice'), Row(age2=5, name=u'Bob')]}}
> {{ """}}
> The variable is `new` in the argument list but `col` in the docstring.
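A mismatch like this one can be detected mechanically. The sketch below is only an illustration, not part of the Spark fix: `undocumented_params` is a hypothetical helper, and the `withColumnRenamed` defined here is a stand-in that copies the signature and the `:param` lines from the report rather than the real PySpark method.

```python
import inspect
import re


def undocumented_params(func):
    """Compare a function's signature with the `:param name:` entries in its docstring.

    Returns (missing, unknown): parameters with no `:param` entry, and
    `:param` entries that name no parameter of the function.
    """
    sig_params = [p for p in inspect.signature(func).parameters if p != "self"]
    doc = inspect.getdoc(func) or ""
    doc_params = re.findall(r":param\s+(\w+)\s*:", doc)
    missing = [p for p in sig_params if p not in doc_params]
    unknown = [p for p in doc_params if p not in sig_params]
    return missing, unknown


# Stand-in with the signature and docstring shape quoted in the report.
def withColumnRenamed(self, existing, new):
    """Returns a new :class:`DataFrame` by renaming an existing column.

    :param existing: string, name of the existing column to rename.
    :param col: string, new name of the column.
    """


missing, unknown = undocumented_params(withColumnRenamed)
print("missing from docstring:", missing)
print("not in signature:", unknown)
```

Run against the stand-in, the helper reports `new` as undocumented and `col` as a documented name that is absent from the signature, which is the discrepancy the issue describes.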
[jira] [Assigned] (SPARK-23566) Arguement name fix
[ https://issues.apache.org/jira/browse/SPARK-23566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-23566:

Assignee: Anirudh
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386125#comment-16386125 ]

Yuming Wang commented on SPARK-20202:

How about upgrading Hive directly to 2.3.2? In fact, I've completed the initial work, and it has been running for a few days:
[https://github.com/apache/spark/pull/20659]

> Remove references to org.spark-project.hive
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
> Issue Type: Bug
> Components: Build, SQL
> Affects Versions: 1.6.4, 2.0.3, 2.1.1
> Reporter: Owen O'Malley
> Priority: Major
>
> Spark can't continue to depend on its fork of Hive and must move to standard Hive versions.
[jira] [Assigned] (SPARK-23601) Remove .md5 files from release
[ https://issues.apache.org/jira/browse/SPARK-23601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23601:

Assignee: Apache Spark (was: Sean Owen)
[jira] [Commented] (SPARK-23601) Remove .md5 files from release
[ https://issues.apache.org/jira/browse/SPARK-23601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386106#comment-16386106 ]

Apache Spark commented on SPARK-23601:

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/20737

> Remove .md5 files from release
>
> Key: SPARK-23601
> URL: https://issues.apache.org/jira/browse/SPARK-23601
> Project: Spark
> Issue Type: Task
> Components: Build
> Affects Versions: 2.4.0
> Reporter: Sean Owen
> Assignee: Sean Owen
> Priority: Minor
[jira] [Assigned] (SPARK-23601) Remove .md5 files from release
[ https://issues.apache.org/jira/browse/SPARK-23601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23601:

Assignee: Sean Owen (was: Apache Spark)
[jira] [Commented] (SPARK-23598) WholeStageCodegen can lead to IllegalAccessError calling append for HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-23598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386100#comment-16386100 ]

Herman van Hovell commented on SPARK-23598:

[~dvogelbacher] Do you have some code which we can use to reproduce this?

> WholeStageCodegen can lead to IllegalAccessError calling append for HashAggregateExec
>
> Key: SPARK-23598
> URL: https://issues.apache.org/jira/browse/SPARK-23598
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: David Vogelbacher
> Priority: Major
>
> Got the following stacktrace for a large QueryPlan using WholeStageCodeGen:
> {noformat}
> java.lang.IllegalAccessError: tried to access method org.apache.spark.sql.execution.BufferedRowIterator.append(Lorg/apache/spark/sql/catalyst/InternalRow;)V from class org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass.agg_doAggregateWithKeysOutput$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> {noformat}
> After disabling codegen, everything works.
> The root cause seems to be that we are trying to call the protected _append_ method of [BufferedRowIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java#L68] from an inner-class of a sub-class that is loaded by a different class-loader (after codegen compilation).
> [https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-5.html#jvms-5.4.4] states that a protected method _R_ can be accessed only if one of the following two conditions is fulfilled:
> # R is protected and is declared in a class C, and D is either a subclass of C or C itself. Furthermore, if R is not static, then the symbolic reference to R must contain a symbolic reference to a class T, such that T is either a subclass of D, a superclass of D, or D itself.
> # R is either protected or has default access (that is, neither public nor protected nor private), and is declared by a class in the same run-time package as D.
> 2.) doesn't apply as we have loaded the class with a different class loader (and are in a different package) and 1.) doesn't apply because we are apparently trying to call the method from an inner class of a subclass of _BufferedRowIterator_.
> Looking at the code path of _WholeStageCodeGen_, the following happens:
> # In [WholeStageCodeGen|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L527], we create the subclass of _BufferedRowIterator_, along with a _processNext_ method for processing the output of the child plan.
> # In the child, which is a [HashAggregateExec|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L517], we create the method which shows up at the top of the stack trace (called _doAggregateWithKeysOutput_).
> # We add this method to the compiled code invoking _addNewFunction_ of [CodeGenerator|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L460]. In the generated function body we call the _append_ method.
> Now, the _addNewFunction_ method states that:
> {noformat}
> If the code for the `OuterClass` grows too large, the function will be inlined into a new private, inner class
> {noformat}
> This indeed seems to happen: the _doAggregateWithKeysOutput_ method is put into a new private inner class. Thus, it doesn't have access to the protected _append_ method anymore but still tries to call it, which results in the _IllegalAccessError_.
> Possible fixes:
> * Pass in the _inlineToOuterClass_ flag when i
[jira] [Created] (SPARK-23601) Remove .md5 files from release
Sean Owen created SPARK-23601:

Summary: Remove .md5 files from release
Key: SPARK-23601
URL: https://issues.apache.org/jira/browse/SPARK-23601
Project: Spark
Issue Type: Task
Components: Build
Affects Versions: 2.4.0
Reporter: Sean Owen
Assignee: Sean Owen

Per email from Henk to PMCs:
{code}
The Release Distribution Policy[1] changed regarding checksum files.
See under "Cryptographic Signatures and Checksums Requirements" [2].

MD5-file == a .md5 file
SHA-file == a .sha1, sha256 or .sha512 file

Old policy :
  -- MUST provide a MD5-file
  -- SHOULD provide a SHA-file [SHA-512 recommended]

New policy :
  -- MUST provide a SHA- or MD5-file
  -- SHOULD provide a SHA-file
  -- SHOULD NOT provide a MD5-file

Providing MD5 checksum files is now discouraged for new releases,
but still allowed for past releases.

Why this change :
  -- MD5 is broken for many purposes ; we should move away from it.
     https://en.wikipedia.org/wiki/MD5#Overview_of_security_issues

Impact for PMCs :
  -- for new releases :
     -- please do provide a SHA-file (one or more, if you like)
     -- do NOT provide a MD5-file
  -- for past releases :
     -- you are not required to change anything
     -- for artifacts accompanied by a SHA-file /and/ a MD5-file,
        it would be nice if you removed the MD5-file
  -- if, at the moment, you provide MD5-files,
     please adjust your release tooling.
{code}
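As a rough illustration of what "provide a SHA-file" can look like in release tooling, here is a minimal Python sketch that writes a `.sha512` file next to an artifact. It is not Spark's release script. The function names, the example artifact name, and the `<hex digest>  <file name>` line layout (the layout `sha512sum` prints) are assumptions made for this example.

```python
import hashlib
import os
import tempfile


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sha512_file(path):
    """Write `<path>.sha512` next to the artifact and return its path."""
    checksum_path = path + ".sha512"
    with open(checksum_path, "w") as out:
        # Assumed layout: "<hex digest>  <file name>".
        out.write(f"{sha512_of_file(path)}  {os.path.basename(path)}\n")
    return checksum_path


# Demo on a throwaway file; a real release would point at the built artifacts.
with tempfile.TemporaryDirectory() as tmp:
    artifact = os.path.join(tmp, "example-artifact.tgz")  # hypothetical name
    with open(artifact, "wb") as f:
        f.write(b"example release contents")
    checksum_file = write_sha512_file(artifact)
    with open(checksum_file) as f:
        recorded = f.read().split()[0]
    matches = recorded == hashlib.sha512(b"example release contents").hexdigest()
    print("checksum file matches:", matches)
```

Reading the file in chunks keeps memory use flat for large release tarballs. No `.md5` file is written, in line with the new policy quoted above.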