[jira] [Commented] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs
[ https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255980#comment-15255980 ] Prashant Sharma commented on SPARK-14597: - {quote} On digging further, we found that processingDelay is only clocking time spent in the ForEachRDD closure of the Streaming application and that JobGenerator's graph.generateJobs {quote} Correct me if I am wrong, but generateJobs also includes the time spent scheduling and running the actual tasks and waiting for them to finish. A major portion of the time should be spent here. > Streaming Listener timing metrics should include time spent in JobGenerator's > graph.generateJobs > > > Key: SPARK-14597 > URL: https://issues.apache.org/jira/browse/SPARK-14597 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Streaming >Affects Versions: 1.6.1, 2.0.0 >Reporter: Sachin Aggarwal >Priority: Minor > > While looking to tune our streaming application, the piece of info we were > looking for was the actual processing time per batch. The > StreamingListener.onBatchCompleted event provides a BatchInfo object that > provides this information. It provides the following data: > - processingDelay > - schedulingDelay > - totalDelay > - Submission Time > The above are essentially calculated from the streaming JobScheduler > clocking the processingStartTime and processingEndTime for each JobSet. > Another metric available is submissionTime, which is when a JobSet was put on > the Streaming Scheduler's Queue. > > So we took processing delay as our actual processing time per batch. However, > to maintain a stable streaming application, we found that our batch > interval had to be a little less than DOUBLE the processingDelay metric > reported. (We are using a DirectKafkaInputStream.) On digging further, we > found that processingDelay is only clocking time spent in the ForEachRDD > closure of the Streaming application and that JobGenerator's > graph.generateJobs > (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248) > method takes a significantly larger amount of time. > Thus a true reflection of processing time is: > a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay) > b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay) > c - Time spent in JobScheduler Queue for a JobSet (existing schedulingDelay > metric) > d - Time spent in the JobSet's job run (existing processingDelay metric) > > Additionally, a JobGeneratorQueue delay (#a) could be due to either > graph.generateJobs taking longer than batchInterval or other JobGenerator > events like checkpointing adding up time. Thus it would be beneficial to > report the time taken by the checkpointing job as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
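For reference, a minimal sketch of reading the existing BatchInfo metrics from a StreamingListener, assuming an already-created StreamingContext named ssc; the proposed JobGeneratorQueueDelay and JobSetCreationDelay metrics do not exist yet and are not shown.
{code}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs the per-batch delays discussed above; all values are in milliseconds.
class BatchTimingListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // schedulingDelay/processingDelay/totalDelay are Options because a batch
    // may not have started or finished processing yet.
    println(s"batch=${info.batchTime} " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)}ms " +
      s"processingDelay=${info.processingDelay.getOrElse(-1L)}ms " +
      s"totalDelay=${info.totalDelay.getOrElse(-1L)}ms")
  }
}

// ssc.addStreamingListener(new BatchTimingListener)
{code}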
[jira] [Updated] (SPARK-14193) Skip unnecessary sorts if input data have been already ordered in InMemoryRelation
[ https://issues.apache.org/jira/browse/SPARK-14193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-14193: - Description: This ticket describes an opportunity to skip unnecessary sorts if input data have already been ordered in InMemoryTable. Let's say we have a cached table with column 'a' sorted: {code} val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b") val df2 = df1.sort("a").cache df2.show // just cache data {code} If you say `df2.sort("a")`, the current Spark generates a plan like: {code} == Physical Plan == Sort [a#13 ASC], true, 0 +- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None {code} This happens because the current implementation cannot tell the difference between globally sorted columns and partition-locally sorted ones from `SparkPlan#outputOrdering`. was: This ticket is to skip unnecessary sorts if input data have been already ordered in InMemoryTable. Let's say we have a cached table with column 'a' sorted; {code} val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b") val df2 = df1.sort("a").cache df2.show // just cache data {code} If you say `df2.sort("a")`, the current spark generates a plan like; {code} == Physical Plan == Sort [a#13 ASC], true, 0 +- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None {code} This ticket targets at removing this unncessary sort. > Skip unnecessary sorts if input data have been already ordered in > InMemoryRelation > -- > > Key: SPARK-14193 > URL: https://issues.apache.org/jira/browse/SPARK-14193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > This ticket describes an opportunity to skip unnecessary sorts if input data > have already been ordered in InMemoryTable. > Let's say we have a cached table with column 'a' sorted: > {code} > val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b") > val df2 = df1.sort("a").cache > df2.show // just cache data > {code} > If you say `df2.sort("a")`, the current Spark generates a plan like: > {code} > == Physical Plan == > Sort [a#13 ASC], true, 0 > +- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, > 1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, > None > {code} > This happens because the current implementation cannot tell the difference between globally > sorted columns and partition-locally sorted ones from > `SparkPlan#outputOrdering`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
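A minimal sketch of how to observe the redundant sort described above, assuming a spark-shell session where sqlContext and its implicits are in scope:
{code}
import sqlContext.implicits._

val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
val df2 = df1.sort("a").cache()
df2.show()   // materializes the cache, so the stored data is already sorted by 'a'

// The printed physical plan still places a Sort above InMemoryColumnarTableScan,
// which is the unnecessary work this ticket proposes to skip.
df2.sort("a").explain()
{code}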
[jira] [Commented] (SPARK-14751) SparkR fails on Cassandra map with numeric key
[ https://issues.apache.org/jira/browse/SPARK-14751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255959#comment-15255959 ] Michał Matłoka commented on SPARK-14751: Even toString with a warning would be much better than the current ERROR with a message that does not explain why it does not work :) > SparkR fails on Cassandra map with numeric key > -- > > Key: SPARK-14751 > URL: https://issues.apache.org/jira/browse/SPARK-14751 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Michał Matłoka > > Hi, > I have created an issue for the spark cassandra connector ( > https://datastax-oss.atlassian.net/projects/SPARKC/issues/SPARKC-366 ) but > after a bit of digging it seems this is a better place for this issue: > {code} > CREATE TABLE test.map ( > id text, > somemap map, > PRIMARY KEY (id) > ); > insert into test.map(id, somemap) values ('a', { 0 : 12 }); > {code} > {code} > sqlContext <- sparkRSQL.init(sc) > test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", > keyspace = "test", table = "map") > head(test) > {code} > Results in: > {code} > 16/04/19 14:47:02 ERROR RBackendHandler: dfToCols on > org.apache.spark.sql.api.r.SQLUtils failed > Error in readBin(con, raw(), stringLen, endian = "big") : > invalid 'n' argument > {code} > The problem occurs even for an int key. For a text key it works. Every scenario works > under Scala & Python. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
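Since the reporter notes the same read works from Scala, a quick cross-check through the Scala DataFrame API might look like the sketch below; "keyspace" and "table" are the spark-cassandra-connector's documented option names, and sqlContext is assumed to exist.
{code}
// Read the same Cassandra table through the Scala API, where the map column
// with a numeric key is reported to deserialize correctly.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "map"))
  .load()
df.show()
{code}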
[jira] [Commented] (SPARK-14759) After join one cannot drop dynamically added column
[ https://issues.apache.org/jira/browse/SPARK-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255935#comment-15255935 ] praveen dareddy commented on SPARK-14759: - Hi Tomasz, I am new to Spark but would like to help on this issue. I have the environment set up in my local system. I have quite recently started going through the code base and am eager to contribute. Can you point me towards the specific module I need to understand to solve this issue? Thanks, Red > After join one cannot drop dynamically added column > --- > > Key: SPARK-14759 > URL: https://issues.apache.org/jira/browse/SPARK-14759 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Running the following code: > {code} > from pyspark.sql.functions import * > df1 = sqlContext.createDataFrame([(1,10,)], ['any','hour']) > df2 = sqlContext.createDataFrame([(1,)], ['any']).withColumn('hour',lit(10)) > j = df1.join(df2,[df1.hour == df2.hour],how='left') > print("columns after join:{0}".format(j.columns)) > jj = j.drop(df2.hour) > print("columns after removing 'hour':{0}".format(jj.columns)) > {code} > should show that after the join and dropping df2.hour I end up with only one 'hour' > column in the dataframe. > Unfortunately this column is not dropped. > {code} > columns after join:['any', 'hour', 'any', 'hour'] > columns after removing 'hour': ['any', 'hour', 'any', 'hour'] > {code} > I found out that it behaves like that only when the column is added > dynamically before the join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14887) Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits
fang fang chen created SPARK-14887: -- Summary: Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits Key: SPARK-14887 URL: https://issues.apache.org/jira/browse/SPARK-14887 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Reporter: fang fang chen Similar issue to SPARK-14138 and SPARK-8443: with a large SQL statement (673K), the following error happened: Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
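The reporter's 673K query is not attached, but a very wide projection along the following lines may reproduce the same class of failure on affected versions; the column count and names here are purely illustrative.
{code}
import org.apache.spark.sql.functions.col

// Project thousands of derived columns so the generated
// SpecificUnsafeProjection method body grows very large.
val df = sqlContext.range(10).toDF("id")
val wide = df.select((1 to 3000).map(i => (col("id") + i).as(s"c$i")): _*)
wide.collect()   // may fail with the "grows beyond 64 KB" JaninoRuntimeException
{code}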
[jira] [Resolved] (SPARK-14870) NPE in generate aggregate
[ https://issues.apache.org/jira/browse/SPARK-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-14870. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12651 [https://github.com/apache/spark/pull/12651] > NPE in generate aggregate > - > > Key: SPARK-14870 > URL: https://issues.apache.org/jira/browse/SPARK-14870 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > > When ran TPCDS Q14a > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 > (TID 234, localhost): java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576) > at > org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:1780) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1806) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:357) > at org.apache.spark.rdd.RDD.collect(RDD.scala:879) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) > at > org.apache.spark.sql.execution.SQLExecution$.withNewE
[jira] [Resolved] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala
[ https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-14881. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12648 [https://github.com/apache/spark/pull/12648] > pyspark and sparkR shell default log level should match spark-shell/Scala > - > > Key: SPARK-14881 > URL: https://issues.apache.org/jira/browse/SPARK-14881 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell, SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung >Priority: Minor > Fix For: 2.0.0 > > > Scala spark-shell defaults to log level WARN. pyspark and sparkR should match > that by default (user can change it later) > # ./bin/spark-shell > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
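Until the defaults are aligned, the level can be raised explicitly right after the shell starts; a one-line sketch (the same method exists on the PySpark SparkContext).
{code}
// sc is the SparkContext created by the shell.
sc.setLogLevel("WARN")
{code}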
[jira] [Resolved] (SPARK-14883) Fix wrong R examples and make them up-to-date
[ https://issues.apache.org/jira/browse/SPARK-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-14883. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12649 [https://github.com/apache/spark/pull/12649] > Fix wrong R examples and make them up-to-date > - > > Key: SPARK-14883 > URL: https://issues.apache.org/jira/browse/SPARK-14883 > Project: Spark > Issue Type: Bug > Components: Documentation, Examples >Reporter: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue aims to fix some errors in R examples and make them up-to-date in > docs and example modules. > - Remove the wrong usage of map. We need to use `lapply` in `SparkR` if > needed. However, `lapply` is private now. The correct usage will be added > later. > {code} > -teenNames <- map(teenagers, function(p) { paste("Name:", p$name)}) > ... > {code} > - Fix the wrong example in Section `Generic Load/Save Functions` of > `docs/sql-programming-guide.md` for consistency. > {code} > -df <- loadDF(sqlContext, "people.parquet") > -saveDF(select(df, "name", "age"), "namesAndAges.parquet") > +df <- read.df(sqlContext, "examples/src/main/resources/users.parquet") > +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet") > {code} > - Fix datatypes in `sparkr.md`. > {code} > -# |-- age: integer (nullable = true) > +# |-- age: long (nullable = true) > {code} > {code} > -## DataFrame[eruptions:double, waiting:double] > +## SparkDataFrame[eruptions:double, waiting:double] > {code} > - Update data results > {code} > head(summarize(groupBy(df, df$waiting), count = n(df$waiting))) > ## waiting count > -##1 8113 > -##2 60 6 > -##3 68 1 > +##1 70 4 > +##2 67 1 > +##3 69 2 > {code} > - Replace deprecated functions: jsonFile -> read.json, parquetFile -> > read.parquet > {code} > df <- jsonFile(sqlContext, "examples/src/main/resources/people.json") > Warning message: > 'jsonFile' is deprecated. > Use 'read.json' instead. > See help("Deprecated") > {code} > - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, > saveAsParquetFile -> write.parquet > - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and > `data-manipulation.R`. > - Other minor syntax fixes and typos. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14883) Fix wrong R examples and make them up-to-date
[ https://issues.apache.org/jira/browse/SPARK-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-14883: -- Assignee: Dongjoon Hyun > Fix wrong R examples and make them up-to-date > - > > Key: SPARK-14883 > URL: https://issues.apache.org/jira/browse/SPARK-14883 > Project: Spark > Issue Type: Bug > Components: Documentation, Examples >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue aims to fix some errors in R examples and make them up-to-date in > docs and example modules. > - Remove the wrong usage of map. We need to use `lapply` in `SparkR` if > needed. However, `lapply` is private now. The correct usage will be added > later. > {code} > -teenNames <- map(teenagers, function(p) { paste("Name:", p$name)}) > ... > {code} > - Fix the wrong example in Section `Generic Load/Save Functions` of > `docs/sql-programming-guide.md` for consistency. > {code} > -df <- loadDF(sqlContext, "people.parquet") > -saveDF(select(df, "name", "age"), "namesAndAges.parquet") > +df <- read.df(sqlContext, "examples/src/main/resources/users.parquet") > +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet") > {code} > - Fix datatypes in `sparkr.md`. > {code} > -# |-- age: integer (nullable = true) > +# |-- age: long (nullable = true) > {code} > {code} > -## DataFrame[eruptions:double, waiting:double] > +## SparkDataFrame[eruptions:double, waiting:double] > {code} > - Update data results > {code} > head(summarize(groupBy(df, df$waiting), count = n(df$waiting))) > ## waiting count > -##1 8113 > -##2 60 6 > -##3 68 1 > +##1 70 4 > +##2 67 1 > +##3 69 2 > {code} > - Replace deprecated functions: jsonFile -> read.json, parquetFile -> > read.parquet > {code} > df <- jsonFile(sqlContext, "examples/src/main/resources/people.json") > Warning message: > 'jsonFile' is deprecated. > Use 'read.json' instead. > See help("Deprecated") > {code} > - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, > saveAsParquetFile -> write.parquet > - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and > `data-manipulation.R`. > - Other minor syntax fixes and typos. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.
[ https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-13902: -- Description: {{DAGScheduler}} sometimes generate incorrect stage graph. Some stages are generated for the same shuffleId twice or more and they are referenced by the child stages because the building order of the graph is not correct. Here, we submit an RDD\[F\] having a linage of RDDs as follows (please see this in {{monospaced}} font): {noformat} < / \ [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F] \ / < {noformat} {{DAGScheduler}} generates the following stages and their parents for each shuffle id: | shuffle id | stage | parents | | 0 | ShuffleMapStage 2 | List() | | 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) | | 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) | | 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) | | 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) | | \- | ResultStage 6 | List(ShuffleMapStage 5) | The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage {{ShuffleMap Stage1}} keeps referring the _old_ stage {{ShuffleMapStage 0}}. was: {{DAGScheduler}} sometimes generate incorrect stage graph. Some stages are generated for the same shuffleId twice or more and they are referenced by the child stages because the building order of the graph is not correct. Here, we submit an RDD\[F\] having a linage of RDDs as follows (please see this in {{monospace}} font): {noformat} < / \ [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F] \ / < {noformat} {{DAGScheduler}} generates the following stages and their parents for each shuffle id: | shuffle id | stage | parents | | 0 | ShuffleMapStage 2 | List() | | 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) | | 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) | | 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) | | 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) | | \- | ResultStage 6 | List(ShuffleMapStage 5) | The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage {{ShuffleMap Stage1}} keeps referring the _old_ stage {{ShuffleMapStage 0}}. > Make DAGScheduler.getAncestorShuffleDependencies() return in topological > order to ensure building ancestor stages first. > > > Key: SPARK-13902 > URL: https://issues.apache.org/jira/browse/SPARK-13902 > Project: Spark > Issue Type: Bug > Components: Scheduler >Reporter: Takuya Ueshin > > {{DAGScheduler}} sometimes generate incorrect stage graph. > Some stages are generated for the same shuffleId twice or more and they are > referenced by the child stages because the building order of the graph is not > correct. 
> Here, we submit an RDD\[F\] having a linage of RDDs as follows (please see > this in {{monospaced}} font): > {noformat} > < > / \ > [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F] >\ / > < > {noformat} > {{DAGScheduler}} generates the following stages and their parents for each > shuffle id: > | shuffle id | stage | parents | > | 0 | ShuffleMapStage 2 | List() | > | 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) | > | 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) | > | 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) | > | 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) | > | \- | ResultStage 6 | List(ShuffleMapStage 5) | > The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage > for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and > {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage > {{ShuffleMap Stage1}} keeps referring the _old_ stage {{ShuffleMapStage 0}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To un
[jira] [Updated] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.
[ https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-13902: -- Description: {{DAGScheduler}} sometimes generate incorrect stage graph. Some stages are generated for the same shuffleId twice or more and they are referenced by the child stages because the building order of the graph is not correct. Here, we submit an RDD\[F\] having a linage of RDDs as follows (please see this in {{monospace}} font): {noformat} < / \ [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F] \ / < {noformat} {{DAGScheduler}} generates the following stages and their parents for each shuffle id: | shuffle id | stage | parents | | 0 | ShuffleMapStage 2 | List() | | 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) | | 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) | | 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) | | 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) | | \- | ResultStage 6 | List(ShuffleMapStage 5) | The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage {{ShuffleMap Stage1}} keeps referring the _old_ stage {{ShuffleMapStage 0}}. was: {{DAGScheduler}} sometimes generate incorrect stage graph. Some stages are generated for the same shuffleId twice or more and they are referenced by the child stages because the building order of the graph is not correct. Here, we submit an RDD\[F\] having a linage of RDDs as follows: {code} < / \ [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F] \ / < {code} I added the sample RDD graph to show the illegal stage graph to {{DAGSchedulerSuite}} and then fixed it. > Make DAGScheduler.getAncestorShuffleDependencies() return in topological > order to ensure building ancestor stages first. > > > Key: SPARK-13902 > URL: https://issues.apache.org/jira/browse/SPARK-13902 > Project: Spark > Issue Type: Bug > Components: Scheduler >Reporter: Takuya Ueshin > > {{DAGScheduler}} sometimes generate incorrect stage graph. > Some stages are generated for the same shuffleId twice or more and they are > referenced by the child stages because the building order of the graph is not > correct. > Here, we submit an RDD\[F\] having a linage of RDDs as follows (please see > this in {{monospace}} font): > {noformat} > < > / \ > [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F] >\ / > < > {noformat} > {{DAGScheduler}} generates the following stages and their parents for each > shuffle id: > | shuffle id | stage | parents | > | 0 | ShuffleMapStage 2 | List() | > | 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) | > | 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) | > | 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) | > | 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) | > | \- | ResultStage 6 | List(ShuffleMapStage 5) | > The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage > for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and > {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage > {{ShuffleMap Stage1}} keeps referring the _old_ stage {{ShuffleMapStage 0}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite
[ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemant Bhanawat reopened SPARK-13693: - > Flaky test: o.a.s.streaming.MapWithStateSuite > - > > Key: SPARK-13693 > URL: https://issues.apache.org/jira/browse/SPARK-13693 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.0.0 > > > Fixed the following flaky test: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/ > {code} > sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite
[ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255881#comment-15255881 ] Hemant Bhanawat commented on SPARK-13693: - Latest Jenkins builds are failing with this issue. See: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2863 https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2865 > Flaky test: o.a.s.streaming.MapWithStateSuite > - > > Key: SPARK-13693 > URL: https://issues.apache.org/jira/browse/SPARK-13693 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.0.0 > > > Fixed the following flaky test: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/ > {code} > sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14883) Fix wrong R examples and make them up-to-date
[ https://issues.apache.org/jira/browse/SPARK-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14883: -- Description: This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules. - Remove the wrong usage of map. We need to use `lapply` in `SparkR` if needed. However, `lapply` is private now. The correct usage will be added later. {code} -teenNames <- map(teenagers, function(p) { paste("Name:", p$name)}) ... {code} - Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency. {code} -df <- loadDF(sqlContext, "people.parquet") -saveDF(select(df, "name", "age"), "namesAndAges.parquet") +df <- read.df(sqlContext, "examples/src/main/resources/users.parquet") +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet") {code} - Fix datatypes in `sparkr.md`. {code} -# |-- age: integer (nullable = true) +# |-- age: long (nullable = true) {code} {code} -## DataFrame[eruptions:double, waiting:double] +## SparkDataFrame[eruptions:double, waiting:double] {code} - Update data results {code} head(summarize(groupBy(df, df$waiting), count = n(df$waiting))) ## waiting count -##1 8113 -##2 60 6 -##3 68 1 +##1 70 4 +##2 67 1 +##3 69 2 {code} - Replace deprecated functions: jsonFile -> read.json, parquetFile -> read.parquet {code} df <- jsonFile(sqlContext, "examples/src/main/resources/people.json") Warning message: 'jsonFile' is deprecated. Use 'read.json' instead. See help("Deprecated") {code} - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`. - Other minor syntax fixes and typos. was: This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules. - Fix the wrong usage of map. We need to use `lapply` if needed. However, the usage of `lapply` also needs to be reviewed since it's private. {code} -teenNames <- map(teenagers, function(p) { paste("Name:", p$name)}) +teenNames <- SparkR:::lapply(teenagers, function(p) { paste("Name:", p$name) }) {code} - Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency. {code} -df <- loadDF(sqlContext, "people.parquet") -saveDF(select(df, "name", "age"), "namesAndAges.parquet") +df <- read.df(sqlContext, "examples/src/main/resources/users.parquet") +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet") {code} - Fix datatypes in `sparkr.md`. {code} -# |-- age: integer (nullable = true) +# |-- age: long (nullable = true) {code} {code} -## DataFrame[eruptions:double, waiting:double] +## SparkDataFrame[eruptions:double, waiting:double] {code} - Update data results {code} head(summarize(groupBy(df, df$waiting), count = n(df$waiting))) ## waiting count -##1 8113 -##2 60 6 -##3 68 1 +##1 70 4 +##2 67 1 +##3 69 2 {code} - Replace deprecated functions: jsonFile -> read.json, parquetFile -> read.parquet {code} df <- jsonFile(sqlContext, "examples/src/main/resources/people.json") Warning message: 'jsonFile' is deprecated. Use 'read.json' instead. See help("Deprecated") {code} - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`. - Other minor syntax fixes and typos. 
> Fix wrong R examples and make them up-to-date > - > > Key: SPARK-14883 > URL: https://issues.apache.org/jira/browse/SPARK-14883 > Project: Spark > Issue Type: Bug > Components: Documentation, Examples >Reporter: Dongjoon Hyun > > This issue aims to fix some errors in R examples and make them up-to-date in > docs and example modules. > - Remove the wrong usage of map. We need to use `lapply` in `SparkR` if > needed. However, `lapply` is private now. The correct usage will be added > later. > {code} > -teenNames <- map(teenagers, function(p) { paste("Name:", p$name)}) > ... > {code} > - Fix the wrong example in Section `Generic Load/Save Functions` of > `docs/sql-programming-guide.md` for consistency. > {code} > -df <- loadDF(sqlContext, "people.parquet") > -saveDF(select(df, "name", "age"), "namesAndAges.parquet") > +df <- read.df(sqlContext, "examples/src/main/resources/users.parquet") > +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet") > {code} > - Fix datatypes in `sparkr.md`. > {code} > -# |-- age: integer (nullable = true) > +# |-- age: long (nullable = true) > {code} > {code} > -## DataFrame[eruptions:double, w
[jira] [Resolved] (SPARK-14885) When creating a CatalogColumn, we should use catalogString of a DataType
[ https://issues.apache.org/jira/browse/SPARK-14885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14885. - Resolution: Fixed Assignee: Yin Huai Fix Version/s: 2.0.0 > When creating a CatalogColumn, we should use catalogString of a DataType > > > Key: SPARK-14885 > URL: https://issues.apache.org/jira/browse/SPARK-14885 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > Fix For: 2.0.0 > > > Right now, the data type field of a CatalogColumn is using the string > representation. When we create this string from a DataType object, there are > places where we use simpleString. Although catalogString is the same as > simpleString right now, it is better to use catalogString. So, we will not > introduce issues when we change the semantic of simpleString or the > implementation of catalogString. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
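A small sketch of the two representations in question, assuming a Spark 2.0 build where DataType.catalogString is available; as the description says, the two strings currently coincide, and the change is only about which one the catalog should rely on.
{code}
import org.apache.spark.sql.types._

val dt: DataType = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType)))

// Today both return the same text for this type (e.g. struct<a:int,b:string>),
// but only catalogString is meant to stay stable for catalog storage.
println(dt.simpleString)
println(dt.catalogString)
{code}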
[jira] [Resolved] (SPARK-14868) Enable NewLineAtEofChecker in checkstyle and fix lint-java errors
[ https://issues.apache.org/jira/browse/SPARK-14868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14868. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 > Enable NewLineAtEofChecker in checkstyle and fix lint-java errors > - > > Key: SPARK-14868 > URL: https://issues.apache.org/jira/browse/SPARK-14868 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java > code also comply with the rule. This issue aims to enforce the same rule > `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java > errors since SPARK-14465. > This issue does the following items. > * Adds a new line at the end of the files (19 files) > * Fixes 25 lint-java errors (12 RedundantModifier, 6 ArrayTypeStyle, 2 > LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.
[ https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-13902: -- Description: {{DAGScheduler}} sometimes generate incorrect stage graph. Some stages are generated for the same shuffleId twice or more and they are referenced by the child stages because the building order of the graph is not correct. Here, we submit an RDD\[F\] having a linage of RDDs as follows: {code} < / \ [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F] \ / < {code} I added the sample RDD graph to show the illegal stage graph to {{DAGSchedulerSuite}} and then fixed it. was: {{DAGScheduler}} sometimes generate incorrect stage graph. Some stages are generated for the same shuffleId twice or more and they are referenced by the child stages because the building order of the graph is not correct. I added the sample RDD graph to show the illegal stage graph to {{DAGSchedulerSuite}} and then fixed it. > Make DAGScheduler.getAncestorShuffleDependencies() return in topological > order to ensure building ancestor stages first. > > > Key: SPARK-13902 > URL: https://issues.apache.org/jira/browse/SPARK-13902 > Project: Spark > Issue Type: Bug > Components: Scheduler >Reporter: Takuya Ueshin > > {{DAGScheduler}} sometimes generate incorrect stage graph. > Some stages are generated for the same shuffleId twice or more and they are > referenced by the child stages because the building order of the graph is not > correct. > Here, we submit an RDD\[F\] having a linage of RDDs as follows: > {code} > < > / \ > [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F] > \ / > < > {code} > I added the sample RDD graph to show the illegal stage graph to > {{DAGSchedulerSuite}} and then fixed it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255840#comment-15255840 ] holdenk commented on SPARK-14654: - Ah yes, in DAGScheduler right now we cast it to [Any, Any] when we get it from the map so I suppose the type safety would be lost and we just hide the cast from the developer which probably isn't worth it. > New accumulator API > --- > > Key: SPARK-14654 > URL: https://issues.apache.org/jira/browse/SPARK-14654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The current accumulator API has a few problems: > 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, > AccumulatorParam, AccumulableParam, etc. > 2. The intermediate buffer type must be the same as the output type, so there > is no way to define an accumulator that computes averages. > 3. It is very difficult to specialize the methods, leading to excessive > boxing and making accumulators bad for metrics that change for each record. > 4. There is not a single coherent API that works for both Java and Scala. > This is a proposed new API that addresses all of the above. In this new API: > 1. There is only a single class (Accumulator) that is user facing > 2. The intermediate value is stored in the accumulator itself and can be > different from the output type. > 3. Concrete implementations can provide its own specialized methods. > 4. Designed to work for both Java and Scala. > {code} > abstract class Accumulator[IN, OUT] extends Serializable { > def isRegistered: Boolean = ... > def register(metadata: AccumulatorMetadata): Unit = ... > def metadata: AccumulatorMetadata = ... > def reset(): Unit > def add(v: IN): Unit > def merge(other: Accumulator[IN, OUT]): Unit > def value: OUT > def localValue: OUT = value > final def registerAccumulatorOnExecutor(): Unit = { > // Automatically register the accumulator when it is deserialized with > the task closure. > // This is for external accumulators and internal ones that do not > represent task level > // metrics, e.g. internal SQL metrics, which are per-operator. > val taskContext = TaskContext.get() > if (taskContext != null) { > taskContext.registerAccumulator(this) > } > } > // Called by Java when deserializing an object > private def readObject(in: ObjectInputStream): Unit = > Utils.tryOrIOException { > in.defaultReadObject() > registerAccumulator() > } > } > {code} > Metadata, provided by Spark after registration: > {code} > class AccumulatorMetadata( > val id: Long, > val name: Option[String], > val countFailedValues: Boolean > ) extends Serializable > {code} > and an implementation that also offers specialized getters and setters > {code} > class LongAccumulator extends Accumulator[jl.Long, jl.Long] { > private[this] var _sum = 0L > override def reset(): Unit = _sum = 0L > override def add(v: jl.Long): Unit = { > _sum += v > } > override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other > match { > case o: LongAccumulator => _sum += o.sum > case _ => throw new UnsupportedOperationException( > s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}") > } > override def value: jl.Long = _sum > def sum: Long = _sum > } > {code} > and SparkContext... > {code} > class SparkContext { > ... 
> def newLongAccumulator(): LongAccumulator > def newLongAccumulator(name: Long): LongAccumulator > def newLongAccumulator(name: Long, dedup: Boolean): LongAccumulator > def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): > Accumulator[IN, OUT] > ... > } > {code} > To use it ... > {code} > val acc = sc.newLongAccumulator() > sc.parallelize(1 to 1000).map { i => > acc.add(1) > i > } > {code} > A work-in-progress prototype here: > https://github.com/rxin/spark/tree/accumulator-refactor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
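As an illustration of point 2 (a buffer different from the output type, e.g. averages), an average accumulator could be sketched against the proposed Accumulator[IN, OUT] API quoted above; AverageAccumulator is a hypothetical name, and this only compiles once the proposed class exists.
{code}
import java.{lang => jl}

// Buffers a sum and a count, but exposes their ratio (the mean) as the value.
class AverageAccumulator extends Accumulator[jl.Double, jl.Double] {
  private[this] var _sum = 0.0
  private[this] var _count = 0L

  def sum: Double = _sum
  def count: Long = _count

  override def reset(): Unit = { _sum = 0.0; _count = 0L }

  override def add(v: jl.Double): Unit = { _sum += v; _count += 1 }

  override def merge(other: Accumulator[jl.Double, jl.Double]): Unit = other match {
    case o: AverageAccumulator => _sum += o.sum; _count += o.count
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }

  // Returns NaN until at least one value has been added.
  override def value: jl.Double = if (_count == 0) Double.NaN else _sum / _count
}
{code}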
[jira] [Updated] (SPARK-14886) RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-14886: --- Description: @Since("1.2.0") def ndcgAt(k: Int): Double = { require(k > 0, "ranking position k should be positive") predictionAndLabels.map { case (pred, lab) => val labSet = lab.toSet if (labSet.nonEmpty) { val labSetSize = labSet.size val n = math.min(math.max(pred.length, labSetSize), k) var maxDcg = 0.0 var dcg = 0.0 var i = 0 while (i < n) { val gain = 1.0 / math.log(i + 2) if (labSet.contains(pred(i))) { dcg += gain } if (i < labSetSize) { maxDcg += gain } i += 1 } dcg / maxDcg } else { logWarning("Empty ground truth set, check input data") 0.0 } }.mean() } "if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException when pred's size is less than k. That means the prediction list is shorter than the param k (and shorter than the set of true relevant documents), so the loop bound n can exceed pred.length. Just try this with sample_movielens_data.txt. precisionAt is OK just because it has val n = math.min(pred.length, k) was: @Since("1.2.0") def ndcgAt(k: Int): Double = { require(k > 0, "ranking position k should be positive") predictionAndLabels.map { case (pred, lab) => val labSet = lab.toSet if (labSet.nonEmpty) { val labSetSize = labSet.size val n = math.min(math.max(pred.length, labSetSize), k) var maxDcg = 0.0 var dcg = 0.0 var i = 0 while (i < n) { val gain = 1.0 / math.log(i + 2) if (labSet.contains(pred(i))) { dcg += gain } if (i < labSetSize) { maxDcg += gain } i += 1 } dcg / maxDcg } else { logWarning("Empty ground truth set, check input data") 0.0 } }.mean() } if (labSet.contains(pred(i))) will throw ArrayIndexOutOfBoundsException when the true relevant documents has less size the the param k. just try this with sample_movielens_data.txt precisionAt is ok just because it has val n = math.min(pred.length, k) > RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException > -- > > Key: SPARK-14886 > URL: https://issues.apache.org/jira/browse/SPARK-14886 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: lichenglin > > @Since("1.2.0") > def ndcgAt(k: Int): Double = { > require(k > 0, "ranking position k should be positive") > predictionAndLabels.map { case (pred, lab) => > val labSet = lab.toSet > if (labSet.nonEmpty) { > val labSetSize = labSet.size > val n = math.min(math.max(pred.length, labSetSize), k) > var maxDcg = 0.0 > var dcg = 0.0 > var i = 0 > while (i < n) { > val gain = 1.0 / math.log(i + 2) > if (labSet.contains(pred(i))) { > dcg += gain > } > if (i < labSetSize) { > maxDcg += gain > } > i += 1 > } > dcg / maxDcg > } else { > logWarning("Empty ground truth set, check input data") > 0.0 > } > }.mean() > } > "if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException > when pred's size is less than k. > That means the prediction list is shorter than the param k (and shorter than the set of > true relevant documents), so the loop bound n can exceed pred.length. > Just try this with sample_movielens_data.txt. > precisionAt is OK just because it has > val n = math.min(pred.length, k) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14886) RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException
lichenglin created SPARK-14886: -- Summary: RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException Key: SPARK-14886 URL: https://issues.apache.org/jira/browse/SPARK-14886 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.6.1 Reporter: lichenglin @Since("1.2.0") def ndcgAt(k: Int): Double = { require(k > 0, "ranking position k should be positive") predictionAndLabels.map { case (pred, lab) => val labSet = lab.toSet if (labSet.nonEmpty) { val labSetSize = labSet.size val n = math.min(math.max(pred.length, labSetSize), k) var maxDcg = 0.0 var dcg = 0.0 var i = 0 while (i < n) { val gain = 1.0 / math.log(i + 2) if (labSet.contains(pred(i))) { dcg += gain } if (i < labSetSize) { maxDcg += gain } i += 1 } dcg / maxDcg } else { logWarning("Empty ground truth set, check input data") 0.0 } }.mean() } if (labSet.contains(pred(i))) will throw ArrayIndexOutOfBoundsException when pred's size is less than the param k (and less than the number of true relevant documents). Just try this with sample_movielens_data.txt. precisionAt is OK just because it has val n = math.min(pred.length, k) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
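A standalone sketch of the guard being suggested, specialized to Int document ids for brevity; this only illustrates bounding the index by pred.length and is not necessarily the fix that ends up in MLlib.
{code}
// NDCG at position k for a single (prediction, ground-truth) pair.
def ndcgAtK(pred: Array[Int], lab: Array[Int], k: Int): Double = {
  require(k > 0, "ranking position k should be positive")
  val labSet = lab.toSet
  if (labSet.isEmpty) return 0.0
  val labSetSize = labSet.size
  val n = math.min(math.max(pred.length, labSetSize), k)
  var maxDcg = 0.0
  var dcg = 0.0
  var i = 0
  while (i < n) {
    val gain = 1.0 / math.log(i + 2)
    // Guard against reading past the end of pred when pred.length < n.
    if (i < pred.length && labSet.contains(pred(i))) dcg += gain
    if (i < labSetSize) maxDcg += gain
    i += 1
  }
  dcg / maxDcg
}
{code}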
[jira] [Commented] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255820#comment-15255820 ] Reynold Xin commented on SPARK-14654: - But Holden the call sites in Spark are already erasing the types, because the accumulators are put into a collection. > New accumulator API > --- > > Key: SPARK-14654 > URL: https://issues.apache.org/jira/browse/SPARK-14654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The current accumulator API has a few problems: > 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, > AccumulatorParam, AccumulableParam, etc. > 2. The intermediate buffer type must be the same as the output type, so there > is no way to define an accumulator that computes averages. > 3. It is very difficult to specialize the methods, leading to excessive > boxing and making accumulators bad for metrics that change for each record. > 4. There is not a single coherent API that works for both Java and Scala. > This is a proposed new API that addresses all of the above. In this new API: > 1. There is only a single class (Accumulator) that is user facing > 2. The intermediate value is stored in the accumulator itself and can be > different from the output type. > 3. Concrete implementations can provide its own specialized methods. > 4. Designed to work for both Java and Scala. > {code} > abstract class Accumulator[IN, OUT] extends Serializable { > def isRegistered: Boolean = ... > def register(metadata: AccumulatorMetadata): Unit = ... > def metadata: AccumulatorMetadata = ... > def reset(): Unit > def add(v: IN): Unit > def merge(other: Accumulator[IN, OUT]): Unit > def value: OUT > def localValue: OUT = value > final def registerAccumulatorOnExecutor(): Unit = { > // Automatically register the accumulator when it is deserialized with > the task closure. > // This is for external accumulators and internal ones that do not > represent task level > // metrics, e.g. internal SQL metrics, which are per-operator. > val taskContext = TaskContext.get() > if (taskContext != null) { > taskContext.registerAccumulator(this) > } > } > // Called by Java when deserializing an object > private def readObject(in: ObjectInputStream): Unit = > Utils.tryOrIOException { > in.defaultReadObject() > registerAccumulator() > } > } > {code} > Metadata, provided by Spark after registration: > {code} > class AccumulatorMetadata( > val id: Long, > val name: Option[String], > val countFailedValues: Boolean > ) extends Serializable > {code} > and an implementation that also offers specialized getters and setters > {code} > class LongAccumulator extends Accumulator[jl.Long, jl.Long] { > private[this] var _sum = 0L > override def reset(): Unit = _sum = 0L > override def add(v: jl.Long): Unit = { > _sum += v > } > override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other > match { > case o: LongAccumulator => _sum += o.sum > case _ => throw new UnsupportedOperationException( > s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}") > } > override def value: jl.Long = _sum > def sum: Long = _sum > } > {code} > and SparkContext... > {code} > class SparkContext { > ... > def newLongAccumulator(): LongAccumulator > def newLongAccumulator(name: Long): LongAccumulator > def newLongAccumulator(name: Long, dedup: Boolean): LongAccumulator > def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): > Accumulator[IN, OUT] > ... > } > {code} > To use it ... 
> {code} > val acc = sc.newLongAccumulator() > sc.parallelize(1 to 1000).map { i => > acc.add(1) > i > } > {code} > A work-in-progress prototype here: > https://github.com/rxin/spark/tree/accumulator-refactor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255819#comment-15255819 ] holdenk commented on SPARK-14654: - Even if merge isn't supposed to be called by the end user it would be nice to have it be type safe - we also replace a conditional with reflection or cast we expect the developers to remember with a type parameter. By adding this I'd argue we reduce complexity/cost of the API. > New accumulator API > --- > > Key: SPARK-14654 > URL: https://issues.apache.org/jira/browse/SPARK-14654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The current accumulator API has a few problems: > 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, > AccumulatorParam, AccumulableParam, etc. > 2. The intermediate buffer type must be the same as the output type, so there > is no way to define an accumulator that computes averages. > 3. It is very difficult to specialize the methods, leading to excessive > boxing and making accumulators bad for metrics that change for each record. > 4. There is not a single coherent API that works for both Java and Scala. > This is a proposed new API that addresses all of the above. In this new API: > 1. There is only a single class (Accumulator) that is user facing > 2. The intermediate value is stored in the accumulator itself and can be > different from the output type. > 3. Concrete implementations can provide its own specialized methods. > 4. Designed to work for both Java and Scala. > {code} > abstract class Accumulator[IN, OUT] extends Serializable { > def isRegistered: Boolean = ... > def register(metadata: AccumulatorMetadata): Unit = ... > def metadata: AccumulatorMetadata = ... > def reset(): Unit > def add(v: IN): Unit > def merge(other: Accumulator[IN, OUT]): Unit > def value: OUT > def localValue: OUT = value > final def registerAccumulatorOnExecutor(): Unit = { > // Automatically register the accumulator when it is deserialized with > the task closure. > // This is for external accumulators and internal ones that do not > represent task level > // metrics, e.g. internal SQL metrics, which are per-operator. > val taskContext = TaskContext.get() > if (taskContext != null) { > taskContext.registerAccumulator(this) > } > } > // Called by Java when deserializing an object > private def readObject(in: ObjectInputStream): Unit = > Utils.tryOrIOException { > in.defaultReadObject() > registerAccumulator() > } > } > {code} > Metadata, provided by Spark after registration: > {code} > class AccumulatorMetadata( > val id: Long, > val name: Option[String], > val countFailedValues: Boolean > ) extends Serializable > {code} > and an implementation that also offers specialized getters and setters > {code} > class LongAccumulator extends Accumulator[jl.Long, jl.Long] { > private[this] var _sum = 0L > override def reset(): Unit = _sum = 0L > override def add(v: jl.Long): Unit = { > _sum += v > } > override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other > match { > case o: LongAccumulator => _sum += o.sum > case _ => throw new UnsupportedOperationException( > s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}") > } > override def value: jl.Long = _sum > def sum: Long = _sum > } > {code} > and SparkContext... > {code} > class SparkContext { > ... 
> def newLongAccumulator(): LongAccumulator > def newLongAccumulator(name: Long): LongAccumulator > def newLongAccumulator(name: Long, dedup: Boolean): LongAccumulator > def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): > Accumulator[IN, OUT] > ... > } > {code} > To use it ... > {code} > val acc = sc.newLongAccumulator() > sc.parallelize(1 to 1000).map { i => > acc.add(1) > i > } > {code} > A work-in-progress prototype here: > https://github.com/rxin/spark/tree/accumulator-refactor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
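For SPARK-14654 above, here is a minimal sketch of what holdenk's type-safety suggestion could look like. It is my own illustration, not part of the proposal: a self-referential type parameter (called {{SELF}} here, a name I made up) lets {{merge}} accept only the same concrete accumulator type, replacing the runtime match/cast in {{LongAccumulator.merge}} with a compile-time constraint.

{code}
// Hypothetical variant of the proposed API; TypedAccumulator and SELF are my own names.
abstract class TypedAccumulator[IN, OUT, SELF <: TypedAccumulator[IN, OUT, SELF]]
  extends Serializable {
  def reset(): Unit
  def add(v: IN): Unit
  // Only an accumulator of the same concrete type can be merged, checked at compile time.
  def merge(other: SELF): Unit
  def value: OUT
}

class TypedLongAccumulator extends TypedAccumulator[Long, Long, TypedLongAccumulator] {
  private[this] var _sum = 0L
  def sum: Long = _sum
  override def reset(): Unit = _sum = 0L
  override def add(v: Long): Unit = _sum += v
  // No reflection, no cast, no UnsupportedOperationException path.
  override def merge(other: TypedLongAccumulator): Unit = _sum += other.sum
  override def value: Long = _sum
}
{code}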
[jira] [Assigned] (SPARK-14845) spark.files in properties file is not distributed to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14845: Assignee: Apache Spark > spark.files in properties file is not distributed to driver in yarn-cluster > mode > > > Key: SPARK-14845 > URL: https://issues.apache.org/jira/browse/SPARK-14845 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Minor > > I use the following command to run the SparkPi example, and in this properties > file I define spark.files. It turns out that in yarn-cluster mode this > README.md is only distributed to the executors but not to the driver. > content of properties file > {noformat} > spark.files=/Users/jzhang/github/spark/README.md > {noformat} > launch command > {noformat} > bin/spark-submit --master yarn-cluster --properties-file a.properties --class > org.apache.spark.examples.SparkPi > examples/target/original-spark-examples_2.11-2.0.0-SNAPSHOT.jar > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14845) spark.files in properties file is not distributed to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255810#comment-15255810 ] Apache Spark commented on SPARK-14845: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/12656 > spark.files in properties file is not distributed to driver in yarn-cluster > mode > > > Key: SPARK-14845 > URL: https://issues.apache.org/jira/browse/SPARK-14845 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Reporter: Jeff Zhang >Priority: Minor > > I use the following command to run the SparkPi example, and in this properties > file I define spark.files. It turns out that in yarn-cluster mode this > README.md is only distributed to the executors but not to the driver. > content of properties file > {noformat} > spark.files=/Users/jzhang/github/spark/README.md > {noformat} > launch command > {noformat} > bin/spark-submit --master yarn-cluster --properties-file a.properties --class > org.apache.spark.examples.SparkPi > examples/target/original-spark-examples_2.11-2.0.0-SNAPSHOT.jar > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14845) spark.files in properties file is not distributed to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14845: Assignee: (was: Apache Spark) > spark.files in properties file is not distributed to driver in yarn-cluster > mode > > > Key: SPARK-14845 > URL: https://issues.apache.org/jira/browse/SPARK-14845 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Reporter: Jeff Zhang >Priority: Minor > > I use the following command to run the SparkPi example, and in this properties > file I define spark.files. It turns out that in yarn-cluster mode this > README.md is only distributed to the executors but not to the driver. > content of properties file > {noformat} > spark.files=/Users/jzhang/github/spark/README.md > {noformat} > launch command > {noformat} > bin/spark-submit --master yarn-cluster --properties-file a.properties --class > org.apache.spark.examples.SparkPi > examples/target/original-spark-examples_2.11-2.0.0-SNAPSHOT.jar > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14845) spark.files in properties file is not distributed to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-14845: --- Component/s: Spark Submit > spark.files in properties file is not distributed to driver in yarn-cluster > mode > > > Key: SPARK-14845 > URL: https://issues.apache.org/jira/browse/SPARK-14845 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Reporter: Jeff Zhang >Priority: Minor > > I use the following command to run the SparkPi example, and in this properties > file I define spark.files. It turns out that in yarn-cluster mode this > README.md is only distributed to the executors but not to the driver. > content of properties file > {noformat} > spark.files=/Users/jzhang/github/spark/README.md > {noformat} > launch command > {noformat} > bin/spark-submit --master yarn-cluster --properties-file a.properties --class > org.apache.spark.examples.SparkPi > examples/target/original-spark-examples_2.11-2.0.0-SNAPSHOT.jar > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
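For SPARK-14845 above, a small diagnostic can confirm on which side the file actually arrives. This is my own sketch, not from the report; it assumes an active SparkContext named {{sc}} and that the distributed file keeps the name README.md:

{code}
import java.io.File
import org.apache.spark.SparkFiles

// Resolve README.md through SparkFiles on the driver and inside one task.
val onDriver = new File(SparkFiles.get("README.md")).exists()
val onExecutors = sc.parallelize(Seq(1)).map { _ =>
  new File(SparkFiles.get("README.md")).exists()
}.first()
println(s"README.md on driver: $onDriver, on executors: $onExecutors")
{code}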
[jira] [Resolved] (SPARK-14876) SparkSession should be case insensitive by default
[ https://issues.apache.org/jira/browse/SPARK-14876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14876. - Resolution: Fixed Fix Version/s: 2.0.0 > SparkSession should be case insensitive by default > -- > > Key: SPARK-14876 > URL: https://issues.apache.org/jira/browse/SPARK-14876 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > This would match most database systems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
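To make the intent of SPARK-14876 concrete, here is an illustrative snippet (my own example, assuming a Spark 2.0 SparkSession named {{spark}} and the existing {{spark.sql.caseSensitive}} flag): with case-insensitive resolution as the default, a column registered as {{word}} resolves when referenced as {{WORD}}.

{code}
import spark.implicits._

val df = Seq(("a", 1)).toDF("word", "n")
df.createOrReplaceTempView("t")

// Default (case-insensitive) resolution: WORD resolves to the column "word".
spark.sql("SELECT WORD FROM t").show()

// Opting back into case sensitivity makes the same query fail to resolve WORD.
spark.conf.set("spark.sql.caseSensitive", "true")
// spark.sql("SELECT WORD FROM t")   // would now fail to resolve the column
{code}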
[jira] [Commented] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.
[ https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255798#comment-15255798 ] Apache Spark commented on SPARK-13902: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/12655 > Make DAGScheduler.getAncestorShuffleDependencies() return in topological > order to ensure building ancestor stages first. > > > Key: SPARK-13902 > URL: https://issues.apache.org/jira/browse/SPARK-13902 > Project: Spark > Issue Type: Bug > Components: Scheduler >Reporter: Takuya Ueshin > > {{DAGScheduler}} sometimes generate incorrect stage graph. > Some stages are generated for the same shuffleId twice or more and they are > referenced by the child stages because the building order of the graph is not > correct. > I added the sample RDD graph to show the illegal stage graph to > {{DAGSchedulerSuite}} and then fixed it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation
[ https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255765#comment-15255765 ] Jeff Zhang edited comment on SPARK-14834 at 4/25/16 1:09 AM: - This is to enforce that users add Python docs when adding a new Python API with the @since annotation. Thinking about it again, though, this is only suitable when adding a new API to an existing Python module. If it is a new Python module migrated from the Scala API, Python docs are not mandatory. was (Author: zjffdu): This is to enforce that users add Python docs when adding a new Python API with the @since annotation. > Force adding doc for new api in pyspark with @since annotation > -- > > Key: SPARK-14834 > URL: https://issues.apache.org/jira/browse/SPARK-14834 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation
[ https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255765#comment-15255765 ] Jeff Zhang commented on SPARK-14834: This is to enforce that users add Python docs when adding a new Python API with the @since annotation. > Force adding doc for new api in pyspark with @since annotation > -- > > Key: SPARK-14834 > URL: https://issues.apache.org/jira/browse/SPARK-14834 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
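For SPARK-14834, a sketch of the pattern the proposed check would enforce (my own example; the class and method are hypothetical, only the {{@since}} decorator is the real PySpark one): every newly added API carries both a docstring and a version annotation.

{code}
from pyspark import since


class ExampleDataFrame(object):
    """Stand-in class for illustration only."""

    @since(2.0)
    def hint(self, name):
        """Specifies a hint on the current DataFrame (hypothetical new API).

        The check proposed in this ticket would flag this method if the
        docstring were missing while @since is present.
        """
        return self
{code}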
[jira] [Commented] (SPARK-14693) Spark Streaming Context Hangs on Start
[ https://issues.apache.org/jira/browse/SPARK-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255734#comment-15255734 ] Evan Oman commented on SPARK-14693: --- I do get the following error after waiting for two hours: {code} java.rmi.RemoteException: java.util.concurrent.TimeoutException: Timed out retrying send to http://10.210.224.74:7070: 2 hours; nested exception is: java.util.concurrent.TimeoutException: Timed out retrying send to http://10.210.224.74:7070: 2 hours at com.databricks.backend.daemon.data.client.DbfsClient.send0(DbfsClient.scala:71) at com.databricks.backend.daemon.data.client.DbfsClient.sendIdempotent(DbfsClient.scala:40) at com.databricks.backend.daemon.data.client.DatabricksFileSystem.listStatus(DatabricksFileSystem.scala:189) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.initializeOrRecover(FileBasedWriteAheadLog.scala:228) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.(FileBasedWriteAheadLog.scala:72) at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:141) at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:141) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLog(WriteAheadLogUtils.scala:140) at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLogForDriver(WriteAheadLogUtils.scala:98) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:254) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:252) at scala.Option.map(Option.scala:145) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.createWriteAheadLog(ReceivedBlockTracker.scala:252) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.(ReceivedBlockTracker.scala:75) at org.apache.spark.streaming.scheduler.ReceiverTracker.(ReceiverTracker.scala:106) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:80) at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:610) at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:606) at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:606) at ... run in separate thread using org.apache.spark.util.ThreadUtils ... 
() at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:606) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600) Caused by: java.util.concurrent.TimeoutException: Timed out retrying send to http://10.210.224.74:7070: 2 hours at com.databricks.rpc.ReliableJettyClient.retryOnNetworkError(ReliableJettyClient.scala:138) at com.databricks.rpc.ReliableJettyClient.sendIdempotent(ReliableJettyClient.scala:46) at com.databricks.backend.daemon.data.client.DbfsClient.doSend(DbfsClient.scala:83) at com.databricks.backend.daemon.data.client.DbfsClient.send0(DbfsClient.scala:60) at com.databricks.backend.daemon.data.client.DbfsClient.sendIdempotent(DbfsClient.scala:40) at com.databricks.backend.daemon.data.client.DatabricksFileSystem.listStatus(DatabricksFileSystem.scala:189) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.initializeOrRecover(FileBasedWriteAheadLog.scala:228) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.(FileBasedWriteAheadLog.scala:72) at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:141) at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:141) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLog(WriteAheadLogUtils.scala:140) at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLogForDriver(WriteAheadLogUtils.scala:98) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:254) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:252) at scala.Option.map(Option.scala:145) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.createWriteAheadLog(ReceivedBlockTracker.scala:252) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.(ReceivedBlockTracker.scala:75) at org.apache.spark.streaming.scheduler.ReceiverTracker.(Recei
[jira] [Assigned] (SPARK-14885) When creating a CatalogColumn, we should use catalogString of a DataType
[ https://issues.apache.org/jira/browse/SPARK-14885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14885: Assignee: Apache Spark > When creating a CatalogColumn, we should use catalogString of a DataType > > > Key: SPARK-14885 > URL: https://issues.apache.org/jira/browse/SPARK-14885 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Minor > > Right now, the data type field of a CatalogColumn is using the string > representation. When we create this string from a DataType object, there are > places where we use simpleString. Although catalogString is the same as > simpleString right now, it is better to use catalogString. So, we will not > introduce issues when we change the semantic of simpleString or the > implementation of catalogString. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14885) When creating a CatalogColumn, we should use catalogString of a DataType
[ https://issues.apache.org/jira/browse/SPARK-14885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14885: Assignee: (was: Apache Spark) > When creating a CatalogColumn, we should use catalogString of a DataType > > > Key: SPARK-14885 > URL: https://issues.apache.org/jira/browse/SPARK-14885 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Minor > > Right now, the data type field of a CatalogColumn is using the string > representation. When we create this string from a DataType object, there are > places where we use simpleString. Although catalogString is the same as > simpleString right now, it is better to use catalogString. So, we will not > introduce issues when we change the semantic of simpleString or the > implementation of catalogString. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14885) When creating a CatalogColumn, we should use catalogString of a DataType
[ https://issues.apache.org/jira/browse/SPARK-14885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255708#comment-15255708 ] Apache Spark commented on SPARK-14885: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/12654 > When creating a CatalogColumn, we should use catalogString of a DataType > > > Key: SPARK-14885 > URL: https://issues.apache.org/jira/browse/SPARK-14885 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Minor > > Right now, the data type field of a CatalogColumn is using the string > representation. When we create this string from a DataType object, there are > places where we use simpleString. Although catalogString is the same as > simpleString right now, it is better to use catalogString. So, we will not > introduce issues when we change the semantic of simpleString or the > implementation of catalogString. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14885) When creating a CatalogColumn, we should use catalogString of a DataType
Yin Huai created SPARK-14885: Summary: When creating a CatalogColumn, we should use catalogString of a DataType Key: SPARK-14885 URL: https://issues.apache.org/jira/browse/SPARK-14885 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Minor Right now, the data type field of a CatalogColumn is using the string representation. When we create this string from a DataType object, there are places where we use simpleString. Although catalogString is the same as simpleString right now, it is better to use catalogString. So, we will not introduce issues when we change the semantic of simpleString or the implementation of catalogString. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
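A minimal illustration of the two methods discussed in SPARK-14885 (the type here is my own pick): they currently return the same string, so the change is about not depending on that coincidence if {{simpleString}} ever diverges from what the catalog should store.

{code}
import org.apache.spark.sql.types._

val dt: DataType = DecimalType(10, 2)
println(dt.simpleString)    // decimal(10,2)
println(dt.catalogString)   // currently identical, but this is the representation meant for the catalog
{code}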
[jira] [Commented] (SPARK-14761) PySpark DataFrame.join should reject invalid join methods even when join columns are not specified
[ https://issues.apache.org/jira/browse/SPARK-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255697#comment-15255697 ] Josh Rosen commented on SPARK-14761: There's already logic for handling this in the JVM Spark SQL's analyzer, so we don't need to add any specific error-handling logic in PySpark. Instead, we need to make sure that we always pass the complete set of arguments to the JVM API so that we can rely on it to perform the validation for us. > PySpark DataFrame.join should reject invalid join methods even when join > columns are not specified > -- > > Key: SPARK-14761 > URL: https://issues.apache.org/jira/browse/SPARK-14761 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Priority: Minor > Labels: starter > > In PySpark, the following invalid DataFrame join will not result an error: > {code} > df1.join(df2, how='not-a-valid-join-type') > {code} > The signature for `join` is > {code} > def join(self, other, on=None, how=None): > {code} > and its code ends up completely skipping handling of the `how` parameter when > `on` is `None`: > {code} > if on is not None and not isinstance(on, list): > on = [on] > if on is None or len(on) == 0: > jdf = self._jdf.join(other._jdf) > elif isinstance(on[0], basestring): > if how is None: > jdf = self._jdf.join(other._jdf, self._jseq(on), "inner") > else: > assert isinstance(how, basestring), "how should be basestring" > jdf = self._jdf.join(other._jdf, self._jseq(on), how) > else: > {code} > Given that this behavior can mask user errors (as in the above example), I > think that we should refactor this to first process all arguments and then > call the three-argument {{_.jdf.join}}. This would handle the above invalid > example by passing all arguments to the JVM DataFrame for analysis. > I'm not planning to work on this myself, so this bugfix (+ regression test!) > is up for grabs in case someone else wants to do it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
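To make the asymmetry described in SPARK-14761 concrete, here is a small PySpark sketch (data and column names are my own; it assumes a SQLContext named {{sqlContext}}): the same invalid {{how}} is silently ignored in one form but rejected in the other, because only the second form reaches the JVM-side validation that the comment above refers to.

{code}
df1 = sqlContext.createDataFrame([(1, "a")], ["id", "x"])
df2 = sqlContext.createDataFrame([(1, "b")], ["id", "y"])

# No join columns: `how` is dropped on the floor and this runs as a plain join.
df1.join(df2, how="not-a-valid-join-type").collect()

# With join columns, `how` is passed through to the JVM and rejected there.
df1.join(df2, on="id", how="not-a-valid-join-type").collect()
{code}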
[jira] [Commented] (SPARK-14708) Repl Serialization Issue
[ https://issues.apache.org/jira/browse/SPARK-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255690#comment-15255690 ] Josh Rosen commented on SPARK-14708: [~srowen], I helped [~bill_chambers] to look into this. My best guess as to what's happening here is that the {{IntWrapper}} case class defined in the REPL cell is capturing a reference to the enclosing REPL cell, which itself references all previous REPL cells. The ClosureCleaner only runs on closures, not arbitrary Java objects passed to {{parallelize}}, so I think that serializing the entries of the {{pairs}} RDD ends up serializing a huge object graph. If I'm right, this may not be easily fixable short of fixing the REPL so that it omits the hidden outer pointer from the case class when it can be proven to be unnecessary. > Repl Serialization Issue > > > Key: SPARK-14708 > URL: https://issues.apache.org/jira/browse/SPARK-14708 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Bill Chambers >Priority: Minor > > Run this code 6 times with the :paste command in Spark. You'll see > exponential slowdowns. > class IntWrapper(val i: Int) extends Serializable { } > var pairs = sc.parallelize(Array((0, new IntWrapper(0)))) > for (_ <- 0 until 3) { > val wrapper = pairs.values.reduce((x,_) => x) > pairs = pairs.mapValues(_ => wrapper) > } > val result = pairs.collect() > https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
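A quick way to test the hypothesis in the comment above (my own diagnostic, meant to be pasted into the same REPL session as the reproduction, so {{IntWrapper}} is already defined): serialize one {{IntWrapper}} with plain Java serialization and look at the byte count. A few dozen bytes means no outer capture; kilobytes that grow as more cells are evaluated points at the enclosing REPL object graph being dragged in.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Measure the Java-serialized size of any object.
def serializedSize(o: AnyRef): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(o)
  out.close()
  bytes.size()
}

println(serializedSize(new IntWrapper(0)))
{code}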
[jira] [Resolved] (SPARK-14548) Support !> and !< operator in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14548. - Resolution: Fixed Assignee: Jia Li Fix Version/s: 2.0.0 > Support !> and !< operator in Spark SQL > --- > > Key: SPARK-14548 > URL: https://issues.apache.org/jira/browse/SPARK-14548 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jia Li >Assignee: Jia Li >Priority: Minor > Fix For: 2.0.0 > > > !< means not less than which is equivalent to >= > !> means not greater than which is equivalent to <= > I'd to create a PR to support these two operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
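For reference, an illustrative use of the operators added by SPARK-14548 (table and column names are made up; the rewrites follow directly from the equivalences stated above):

{code}
SELECT name FROM people WHERE age !> 30   -- "not greater than", i.e. age <= 30
SELECT name FROM people WHERE age !< 30   -- "not less than",    i.e. age >= 30
{code}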
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255658#comment-15255658 ] Romi Kuntsman commented on SPARK-4452: -- Hi, what's the reason this will only be available in Spark 2.0.0, and not 1.6.4 or 1.7.0? > Shuffle data structures can starve others on the same thread for memory > > > Key: SPARK-4452 > URL: https://issues.apache.org/jira/browse/SPARK-4452 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Tianshuo Deng >Assignee: Tianshuo Deng > Fix For: 2.0.0 > > > When an Aggregator is used with ExternalSorter in a task, spark will create > many small files and could cause too many files open error during merging. > Currently, ShuffleMemoryManager does not work well when there are 2 spillable > objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used > by Aggregator) in this case. Here is an example: Due to the usage of mapside > aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may > ask as much memory as it can, which is totalMem/numberOfThreads. Then later > on when ExternalSorter is created in the same thread, the > ShuffleMemoryManager could refuse to allocate more memory to it, since the > memory is already given to the previous requested > object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling > small files(due to the lack of memory) > I'm currently working on a PR to address these two issues. It will include > following changes: > 1. The ShuffleMemoryManager should not only track the memory usage for each > thread, but also the object who holds the memory > 2. The ShuffleMemoryManager should be able to trigger the spilling of a > spillable object. In this way, if a new object in a thread is requesting > memory, the old occupant could be evicted/spilled. Previously the spillable > objects trigger spilling by themselves. So one may not trigger spilling even > if another object in the same thread needs more memory. After this change The > ShuffleMemoryManager could trigger the spilling of an object if it needs to. > 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously > ExternalAppendOnlyMap returns an destructive iterator and can not be spilled > after the iterator is returned. This should be changed so that even after the > iterator is returned, the ShuffleMemoryManager can still spill it. > Currently, I have a working branch in progress: > https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made > change 3 and have a prototype of change 1 and 2 to evict spillable from > memory manager, still in progress. I will send a PR when it's done. > Any feedback or thoughts on this change is highly appreciated ! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
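For readers trying to picture the scenario in SPARK-4452, here is one job shape that can put both spillable structures in the same task. This is my reading of the description above and assumes sort-based shuffle, so treat it as a sketch rather than a guaranteed reproduction: in the middle stage, each task reads the {{reduceByKey}} shuffle with map-side aggregation (ExternalAppendOnlyMap) and then writes the range-partitioned shuffle for {{sortByKey}} (ExternalSorter), so the two compete for the same per-thread memory budget.

{code}
// The input path is a placeholder.
val counts = sc.textFile("hdfs:///logs/*")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)   // stage boundary: the next stage aggregates while reading
  .sortByKey()          // the same tasks then write a sorted, range-partitioned shuffle
counts.count()
{code}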
[jira] [Resolved] (SPARK-14691) Simplify and Unify Error Generation for Unsupported Alter Table DDL
[ https://issues.apache.org/jira/browse/SPARK-14691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-14691. --- Resolution: Fixed Assignee: Xiao Li Fix Version/s: 2.0.0 > Simplify and Unify Error Generation for Unsupported Alter Table DDL > --- > > Key: SPARK-14691 > URL: https://issues.apache.org/jira/browse/SPARK-14691 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > So far, we are capturing each unsupported Alter Table in separate visit > functions. They should be unified and issue a ParseException instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]
[ https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14875: Assignee: Apache Spark (was: Cheng Lian) > OutputWriterFactory.newInstance shouldn't be private[sql] > - > > Key: SPARK-14875 > URL: https://issues.apache.org/jira/browse/SPARK-14875 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > Existing packages like spark-avro need to access > {{OutputFactoryWriter.newInstance}}, but it's marked as {{private\[sql\]}} in > Spark 2.0. Should make it public again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]
[ https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255633#comment-15255633 ] Apache Spark commented on SPARK-14875: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/12652 > OutputWriterFactory.newInstance shouldn't be private[sql] > - > > Key: SPARK-14875 > URL: https://issues.apache.org/jira/browse/SPARK-14875 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Existing packages like spark-avro need to access > {{OutputFactoryWriter.newInstance}}, but it's marked as {{private\[sql\]}} in > Spark 2.0. Should make it public again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]
[ https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14875: Assignee: Cheng Lian (was: Apache Spark) > OutputWriterFactory.newInstance shouldn't be private[sql] > - > > Key: SPARK-14875 > URL: https://issues.apache.org/jira/browse/SPARK-14875 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Existing packages like spark-avro need to access > {{OutputFactoryWriter.newInstance}}, but it's marked as {{private\[sql\]}} in > Spark 2.0. Should make it public again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14761) PySpark DataFrame.join should reject invalid join methods even when join columns are not specified
[ https://issues.apache.org/jira/browse/SPARK-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255595#comment-15255595 ] Krishna Kalyan edited comment on SPARK-14761 at 4/24/16 2:04 PM: - Could someone please advice on how the error should be handled, if its an invalid join type?. I am assuming it should return an error message like "Incorrect join parameter", is that okay?. Thanks, Krishna was (Author: krishnakalyan3): Could someone please advice on how the error should be handled, if its an invalid join type?. I am assuming it should return an error message like "Please specify a correct join parameter", is that okay?. Thanks, Krishna > PySpark DataFrame.join should reject invalid join methods even when join > columns are not specified > -- > > Key: SPARK-14761 > URL: https://issues.apache.org/jira/browse/SPARK-14761 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Priority: Minor > Labels: starter > > In PySpark, the following invalid DataFrame join will not result an error: > {code} > df1.join(df2, how='not-a-valid-join-type') > {code} > The signature for `join` is > {code} > def join(self, other, on=None, how=None): > {code} > and its code ends up completely skipping handling of the `how` parameter when > `on` is `None`: > {code} > if on is not None and not isinstance(on, list): > on = [on] > if on is None or len(on) == 0: > jdf = self._jdf.join(other._jdf) > elif isinstance(on[0], basestring): > if how is None: > jdf = self._jdf.join(other._jdf, self._jseq(on), "inner") > else: > assert isinstance(how, basestring), "how should be basestring" > jdf = self._jdf.join(other._jdf, self._jseq(on), how) > else: > {code} > Given that this behavior can mask user errors (as in the above example), I > think that we should refactor this to first process all arguments and then > call the three-argument {{_.jdf.join}}. This would handle the above invalid > example by passing all arguments to the JVM DataFrame for analysis. > I'm not planning to work on this myself, so this bugfix (+ regression test!) > is up for grabs in case someone else wants to do it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14761) PySpark DataFrame.join should reject invalid join methods even when join columns are not specified
[ https://issues.apache.org/jira/browse/SPARK-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255595#comment-15255595 ] Krishna Kalyan commented on SPARK-14761: Could someone please advice on how the error should be handled, if its an invalid join type?. I am assuming it should return an error message like "Please specify a correct join parameter", is that okay?. Thanks, Krishna > PySpark DataFrame.join should reject invalid join methods even when join > columns are not specified > -- > > Key: SPARK-14761 > URL: https://issues.apache.org/jira/browse/SPARK-14761 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Priority: Minor > Labels: starter > > In PySpark, the following invalid DataFrame join will not result an error: > {code} > df1.join(df2, how='not-a-valid-join-type') > {code} > The signature for `join` is > {code} > def join(self, other, on=None, how=None): > {code} > and its code ends up completely skipping handling of the `how` parameter when > `on` is `None`: > {code} > if on is not None and not isinstance(on, list): > on = [on] > if on is None or len(on) == 0: > jdf = self._jdf.join(other._jdf) > elif isinstance(on[0], basestring): > if how is None: > jdf = self._jdf.join(other._jdf, self._jseq(on), "inner") > else: > assert isinstance(how, basestring), "how should be basestring" > jdf = self._jdf.join(other._jdf, self._jseq(on), how) > else: > {code} > Given that this behavior can mask user errors (as in the above example), I > think that we should refactor this to first process all arguments and then > call the three-argument {{_.jdf.join}}. This would handle the above invalid > example by passing all arguments to the JVM DataFrame for analysis. > I'm not planning to work on this myself, so this bugfix (+ regression test!) > is up for grabs in case someone else wants to do it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255587#comment-15255587 ] Herman van Hovell edited comment on SPARK-14591 at 4/24/16 1:42 PM: [~yhuai] We define the reserved keywords in here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4. Any keyword that is not a part of the {{nonReserved}} rule is a reserved keyword. The current reserved keywords can be divided into three groups: - Symbols: {{+, -, /, !=, <>, ...}} - Keywords that cannot be made non-reserved (confirmed): {{ANTI, FULL, INNER, LEFT, SEMI, RIGHT, NATURAL, UNION, INTERSECT, EXCEPT, DATABASE, SCHEMA}} - Keyword that can probably be made non-reserved: {{AND, CASE, CAST, CROSS, DISTINCT, DIV, ELSE, END, FROM, FUNCTION, HAVING, INTERVAL, JOIN, MACRO, NOT, ON, OR, SELECT, STRATIFY, THEN, UNBOUNDED, WHEN, WHERE}} We could add all reserved keywords to the {{looseNonReserved}} rule and use that rule for all names except table aliases (which is where we run into trouble). was (Author: hvanhovell): [~yhuai] We define the reserved keywords in here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4. Any keyword that is not a part of the {{nonReserved}} rule is a reserved keyword. The current reserved keywords can be divided into three groups: - Symbols: {{+, -, /, !=, <>, ...}} - Keywords that cannot be made non-reserved (confirmed): {{ANTI, FULL, INNER, LEFT, SEMI, RIGHT, NATURAL, UNION, INTERSECT, EXCEPT, DATABASE, SCHEMA}} - Keyword that can probably be made non-reserved: {{AND, CASE, CAST, CROSS, DISTINCT, DIV, ELSE, END, FROM, FUNCTION, HAVING, INTERVAL, JOIN, MACRO, NOT, ON, OR, SELECT, STRATIFY, THEN, UNBOUNDED, WHEN, WHERE}} We could add all reserved keywords to the {{looseNonReserved}} rule and use that rule for all names except aliases (which is where we run into trouble). > Remove org.apache.spark.sql.catalyst.parser.DataTypeParser > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > Since our parser defined based on antlr 4 can parse data type (see > CatalystSqlParser), we can remove > org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new > parser's functionality is a super set of DataTypeParser. Then, we can remove > DataTypeParser. For the object DataTypeParser, we can keep it and let it just > call the parserDataType method of CatalystSqlParser. > *The original description is shown below* > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255587#comment-15255587 ] Herman van Hovell commented on SPARK-14591: --- [~yhuai] We define the reserved keywords in here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4. Any keyword that is not a part of the {{nonReserved}} rule is a reserved keyword. The current reserved keywords can be divided into three groups: - Symbols: {{+, -, /, !=, <>, ...}} - Keywords that cannot be made non-reserved (confirmed): {{ANTI, FULL, INNER, LEFT, SEMI, RIGHT, NATURAL, UNION, INTERSECT, EXCEPT, DATABASE, SCHEMA}} - Keyword that can probably be made non-reserved: {{AND, CASE, CAST, CROSS, DISTINCT, DIV, ELSE, END, FROM, FUNCTION, HAVING, INTERVAL, JOIN, MACRO, NOT, ON, OR, SELECT, STRATIFY, THEN, UNBOUNDED, WHEN, WHERE}} We could add all reserved keywords to the {{looseNonReserved}} rule and use that rule for all names except aliases (which is where we run into trouble). > Remove org.apache.spark.sql.catalyst.parser.DataTypeParser > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > Since our parser defined based on antlr 4 can parse data type (see > CatalystSqlParser), we can remove > org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new > parser's functionality is a super set of DataTypeParser. Then, we can remove > DataTypeParser. For the object DataTypeParser, we can keep it and let it just > call the parserDataType method of CatalystSqlParser. > *The original description is shown below* > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
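To illustrate what "reserved" means in practice for the keyword lists above (the examples are mine and purely illustrative): an identifier that is not a keyword, such as {{count}}, can be used as an alias directly, while a reserved keyword such as {{SELECT}} only works when quoted.

{code}
SELECT 1 AS count        -- fine: count is not a reserved keyword
SELECT 1 AS select       -- parse error: SELECT is currently reserved
SELECT 1 AS `select`     -- fine: backquoting sidesteps the reserved keyword
{code}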
[jira] [Updated] (SPARK-13267) Document ?params for the v1 REST API
[ https://issues.apache.org/jira/browse/SPARK-13267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13267: -- Assignee: Steve Loughran > Document ?params for the v1 REST API > > > Key: SPARK-13267 > URL: https://issues.apache.org/jira/browse/SPARK-13267 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 2.0.0 > > > There's some various ? param options in the v1 rest API, which don't get any > mention except in the HistoryServerSuite. They should be documented in > monitoring.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13267) Document ?params for the v1 REST API
[ https://issues.apache.org/jira/browse/SPARK-13267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13267. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11152 [https://github.com/apache/spark/pull/11152] > Document ?params for the v1 REST API > > > Key: SPARK-13267 > URL: https://issues.apache.org/jira/browse/SPARK-13267 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Steve Loughran >Priority: Minor > Fix For: 2.0.0 > > > There's some various ? param options in the v1 rest API, which don't get any > mention except in the HistoryServerSuite. They should be documented in > monitoring.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
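For context, the kind of query parameters SPARK-13267 wants documented look like the following (endpoints and parameter names reproduced from memory of HistoryServerSuite, so treat them as illustrative rather than authoritative):

{noformat}
http://<history-server>:18080/api/v1/applications?status=completed&minDate=2016-01-01
http://<history-server>:18080/api/v1/applications/<app-id>/stages?status=complete
{noformat}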
[jira] [Assigned] (SPARK-14870) NPE in generate aggregate
[ https://issues.apache.org/jira/browse/SPARK-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14870: Assignee: Apache Spark (was: Sameer Agarwal) > NPE in generate aggregate > - > > Key: SPARK-14870 > URL: https://issues.apache.org/jira/browse/SPARK-14870 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > When ran TPCDS Q14a > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 > (TID 234, localhost): java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576) > at > org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1780) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793) > at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:1806) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:357) > at org.apache.spark.rdd.RDD.collect(RDD.scala:879) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:
[jira] [Assigned] (SPARK-14870) NPE in generate aggregate
[ https://issues.apache.org/jira/browse/SPARK-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14870: Assignee: Sameer Agarwal (was: Apache Spark) > NPE in generate aggregate > - > > Key: SPARK-14870 > URL: https://issues.apache.org/jira/browse/SPARK-14870 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Sameer Agarwal > > When ran TPCDS Q14a > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 > (TID 234, localhost): java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576) > at > org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1780) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793) > at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:1806) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:357) > at org.apache.spark.rdd.RDD.collect(RDD.scala:879) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scal
[jira] [Commented] (SPARK-14870) NPE in generate aggregate
[ https://issues.apache.org/jira/browse/SPARK-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255528#comment-15255528 ] Apache Spark commented on SPARK-14870: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/12651 > NPE in generate aggregate > - > > Key: SPARK-14870 > URL: https://issues.apache.org/jira/browse/SPARK-14870 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Sameer Agarwal > > When ran TPCDS Q14a > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 > (TID 234, localhost): java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576) > at > org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:1780) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1806) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:357) > at org.apache.spark.rdd.RDD.collect(RDD.scala:879) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) > at > org.apache.spark.sql.execution.SQLExecution$.withN
[jira] [Updated] (SPARK-14884) Fix call site for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-14884: -- Description: Since we've been processing continuous queries in separate threads, the call sites are then +run at :0+. It's not wrong but provides very little information; in addition, we can not distinguish two queries only from their call sites. !https://cloud.githubusercontent.com/assets/15843379/14766101/a47246b2-0a30-11e6-8d81-06a9a600113b.png! !https://cloud.githubusercontent.com/assets/15843379/14766102/a4750226-0a30-11e6-9ada-773d977d902b.png! was:Since we've been processing continuous queries in separate threads, the call sites are then +run at :0+. It's not wrong but provides very little information; in addition, we can not distinguish two queries only from their call sites. > Fix call site for continuous queries > > > Key: SPARK-14884 > URL: https://issues.apache.org/jira/browse/SPARK-14884 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Priority: Minor > > Since we've been processing continuous queries in separate threads, the call > sites are then +run at :0+. It's not wrong but provides very little > information; in addition, we can not distinguish two queries only from their > call sites. > !https://cloud.githubusercontent.com/assets/15843379/14766101/a47246b2-0a30-11e6-8d81-06a9a600113b.png! > !https://cloud.githubusercontent.com/assets/15843379/14766102/a4750226-0a30-11e6-9ada-773d977d902b.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14884) Fix call site for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14884: Assignee: Apache Spark > Fix call site for continuous queries > > > Key: SPARK-14884 > URL: https://issues.apache.org/jira/browse/SPARK-14884 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Assignee: Apache Spark >Priority: Minor > > Since we've been processing continuous queries in separate threads, the call > sites all show up as +run at <unknown>:0+. This is not wrong, but it provides > very little information; in addition, we cannot distinguish between two > queries based only on their call sites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14884) Fix call site for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14884: Assignee: (was: Apache Spark) > Fix call site for continuous queries > > > Key: SPARK-14884 > URL: https://issues.apache.org/jira/browse/SPARK-14884 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Priority: Minor > > Since we've been processing continuous queries in separate threads, the call > sites all show up as +run at <unknown>:0+. This is not wrong, but it provides > very little information; in addition, we cannot distinguish between two > queries based only on their call sites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14884) Fix call site for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255525#comment-15255525 ] Apache Spark commented on SPARK-14884: -- User 'lw-lin' has created a pull request for this issue: https://github.com/apache/spark/pull/12650 > Fix call site for continuous queries > > > Key: SPARK-14884 > URL: https://issues.apache.org/jira/browse/SPARK-14884 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Priority: Minor > > Since we've been processing continuous queries in separate threads, the call > sites all show up as +run at <unknown>:0+. This is not wrong, but it provides > very little information; in addition, we cannot distinguish between two > queries based only on their call sites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14884) Fix call site for continuous queries
Liwei Lin created SPARK-14884: - Summary: Fix call site for continuous queries Key: SPARK-14884 URL: https://issues.apache.org/jira/browse/SPARK-14884 Project: Spark Issue Type: Bug Components: SQL, Web UI Affects Versions: 2.0.0 Reporter: Liwei Lin Priority: Minor Since we've been processing continuous queries in separate threads, the call sites all show up as +run at <unknown>:0+. This is not wrong, but it provides very little information; in addition, we cannot distinguish between two queries based only on their call sites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org