[jira] [Closed] (SPARK-15183) Adding outputMode to Structured Streaming experimental API
[ https://issues.apache.org/jira/browse/SPARK-15183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sachin Aggarwal closed SPARK-15183.
-----------------------------------
    Resolution: Duplicate

> Adding outputMode to Structured Streaming experimental API
> ----------------------------------------------------------
>
>                 Key: SPARK-15183
>                 URL: https://issues.apache.org/jira/browse/SPARK-15183
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Streaming
>            Reporter: Sachin Aggarwal
>            Priority: Trivial
>
> While experimenting with Structured Streaming, I found that mode() is used
> for non-continuous queries while outputMode() is used for continuous
> queries. outputMode is not defined, so I have written a raw implementation
> and test cases just to make sure the streaming app works.
> Note:
> {code}
> /** Start a query */
> private[sql] def startQuery(
>     name: String,
>     checkpointLocation: String,
>     df: DataFrame,
>     sink: Sink,
>     trigger: Trigger = ProcessingTime(0),
>     triggerClock: Clock = new SystemClock(),
>     outputMode: OutputMode = Append): ContinuousQuery = {
> {code}
> In my view, outputMode should be declared before triggerClock: the overload
> with outputMode specified will be used more often than the one with
> triggerClock. I have also added a triggerClock() method.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
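[Editorial note] The parameter reordering proposed above can be sketched with stub types. Everything below is illustrative: the stub `OutputMode`, `Clock`, and `ProcessingTime` definitions and the `String` return stand in for Spark's real types, and the real `startQuery` is `private[sql]` and also takes `df` and `sink`.

```scala
// Stub stand-ins for Spark's types (hypothetical, not the real private API).
sealed trait OutputMode
case object Append extends OutputMode
class Clock
class SystemClock extends Clock
case class ProcessingTime(intervalMs: Long)

// With outputMode declared before triggerClock, a caller overriding only
// the output mode never has to name or construct a clock at all.
def startQuery(
    name: String,
    checkpointLocation: String,
    trigger: ProcessingTime = ProcessingTime(0),
    outputMode: OutputMode = Append,
    triggerClock: Clock = new SystemClock()): String =
  s"$name started in $outputMode mode"
```

For example, `startQuery("counts", "/tmp/cp")` picks up the `Append` default, while a custom clock, the rarer case, must be passed last or by name.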
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316222#comment-15316222 ]

Gaurav Shah commented on SPARK-2984:
------------------------------------

Seeing similar errors with Spark 1.6:

{noformat}
App > Caused by: java.io.FileNotFoundException: File s3n://{bucket}.../_temporary/0/task_201606041650_0053_m_00/year=2016 does not exist.
App >     at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:1471)
App >     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:366)
App >     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:368)
App >     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:315)
App >     at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
App >     at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
App >     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
{noformat}

> FileNotFoundException on _temporary directory
> ---------------------------------------------
>
>                 Key: SPARK-2984
>                 URL: https://issues.apache.org/jira/browse/SPARK-2984
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Josh Rosen
>            Priority: Critical
>             Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where
> people are having issues with a {{FileNotFoundException}} stemming from an
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}. I think the
> error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1, so it moves its data from {{_temporary}} to the final
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but
> those files no longer exist: hence the exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 140774430 ms.0
> java.io.FileNotFoundException: File hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 does not exist.
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
>     at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
>     at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
>     at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
>     at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
>     at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
>     at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
>     at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
>     at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
>     at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
>     at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>     at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>     at scala.util.Try$.apply(Try.scala:161)
>     at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
>     at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at
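[Editorial note] The four-step race described in the issue can be sketched with a toy in-memory "filesystem". This is a hypothetical model, not Spark's actual `FileOutputCommitter` code: two attempts write under `_temporary`, and whichever commits first also deletes the whole `_temporary` tree during cleanup, destroying the other attempt's files.

```scala
import scala.collection.mutable

// Toy model of the speculative-commit race (illustrative only).
object SpeculationRace {
  // Both attempts have finished writing their task output.
  val fs = mutable.Set(
    "_temporary/T/part-0",   // attempt T on executor E1
    "_temporary/Tp/part-0")  // speculative attempt T' on executor E2

  def commit(attempt: String): Boolean = {
    val mine = fs.filter(_.startsWith(s"_temporary/$attempt/"))
    if (mine.isEmpty) {
      false // files vanished: the real committer throws FileNotFoundException here
    } else {
      // Move this attempt's output to the final destination ...
      mine.foreach { p =>
        fs -= p
        fs += p.replace(s"_temporary/$attempt/", "final/")
      }
      // ... then cleanup deletes _temporary wholesale, taking the other
      // attempt's files with it.
      fs --= fs.filter(_.startsWith("_temporary/"))
      true
    }
  }
}
```

Running `commit("T")` succeeds and wipes `_temporary`; the subsequent `commit("Tp")` finds its files gone, which is the step-4 failure mode above.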
[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316187#comment-15316187 ]

menda venugopal commented on SPARK-13525:
-----------------------------------------

Hi,

We are hitting a similar issue: Spark ML does not work from SparkR in
yarn-client mode, though it does in local mode. Spark Core and Spark SQL
work without problems from R in both local and yarn-client mode.

{noformat}
> head(filter(df01, df01$C3 < 014001))
            C0 C1                         C2    C3                        C4  C5        C6       C7
1 01001A008W-1  1       Lotissement Bellevue 01400 L'Abergement-Cl\xe9menciat CAD 46.134565 4.924122
2 01001A008W-2  2       Lotissement Bellevue 01400 L'Abergement-Cl\xe9menciat CAD 46.134633 4.924168
3 01001A008W-4  4       Lotissement Bellevue 01400 L'Abergement-Cl\xe9menciat CAD 46.134637 4.924353
4 01001A008W-5  5       Lotissement Bellevue 01400 L'Abergement-Cl\xe9menciat CAD 46.134518 4.924333
5 01001A008W-6  6       Lotissement Bellevue 01400 L'Abergement-Cl\xe9menciat CAD 46.134345 4.924229
6 01001A015D-1  1 Lotissement Les Charmilles 01400 L'Abergement-Cl\xe9menciat C+O 46.151573 4.921591
> groupByCode_Postal <- groupBy(df01, df01$C3)
> code_Postal_G <- summarize(groupByCode_Postal, count = count(df01$C3))
> df <- createDataFrame(sqlContext, iris)
Warning messages:
1: In FUN(X[[5L]], ...) : Use Sepal_Length instead of Sepal.Length as column name
2: In FUN(X[[5L]], ...) : Use Sepal_Width instead of Sepal.Width as column name
3: In FUN(X[[5L]], ...) : Use Petal_Length instead of Petal.Length as column name
4: In FUN(X[[5L]], ...) : Use Petal_Width instead of Petal.Width as column name
> model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
16/05/27 12:33:59 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, host-172-30-125-248.openstacklocal): java.net.SocketTimeoutException: Accept timed out
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
    at java.net.ServerSocket.implAccept(ServerSocket.java:530)
    at java.net.ServerSocket.accept(ServerSocket.java:498)
16/05/27 12:34:29 ERROR TaskSetManager: Task 0 in stage 5.0 failed 4 times; aborting job
16/05/27 12:34:29 ERROR RBackendHandler: fitRModelFormula on org.apache.spark.ml.api.r.SparkRWrappers failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 8, host-172-30-125-248.openstacklocal): java.net.SocketTimeoutException: Accept timed out
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
    at java.net.ServerSocket.implAccept(ServerSocket.java:530)
    at java.net.ServerSocket.accept(ServerSocket.java:498)
    at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
    at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(Ma
{noformat}

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any
> dataframe function
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-13525
>                 URL: https://issues.apache.org/jira/browse/SPARK-13525
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Shubhanshu Mishra
>              Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues:
> 1. The head, summary, and filter methods are not overridden by Spark, so I
> need to call them using the `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri
[jira] [Commented] (SPARK-15757) Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc file
[ https://issues.apache.org/jira/browse/SPARK-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316170#comment-15316170 ]

marymwu commented on SPARK-15757:
---------------------------------

Actually, the field "inv_date_sk" does exist! We executed "desc inventory"; the result is attached.

> Error occurs when using Spark sql "select" statement on orc file after hive
> sql "insert overwrite tb1 select * from sourcTb" has been executed on this
> orc file
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-15757
>                 URL: https://issues.apache.org/jira/browse/SPARK-15757
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: marymwu
>         Attachments: Result.png
>
>
> Error occurs when using Spark sql "select" statement on orc file after hive
> sql "insert overwrite tb1 select * from sourcTb" has been executed
> 0: jdbc:hive2://172.19.200.158:40099/default> select * from inventory;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 7.0 failed 8 times, most recent failure: Lost task 0.7 in
> stage 7.0 (TID 2532, smokeslave5.avatar.lenovomm.com):
> java.lang.IllegalArgumentException: Field "inv_date_sk" does not exist.
>     at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>     at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>     at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>     at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>     at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:251)
>     at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>     at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>     at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>     at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>     at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
>     at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:361)
>     at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:123)
>     at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:112)
>     at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:278)
>     at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:262)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:114)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:246)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:240)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>     at org.apache.spark.scheduler.Task.run(Task.scala:85)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace: (state=,code=0)
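[Editorial note] The failure point in the trace is StructType.fieldIndex at StructType.scala:252, a name-to-index map lookup with a throwing getOrElse. A minimal model of that lookup (the column names other than inv_date_sk are hypothetical; the real schema is in the attachment) shows how a reader schema that lacks the column produces exactly this message:

```scala
// Minimal model of StructType.fieldIndex (illustrative; the real one lives
// in org.apache.spark.sql.types.StructType). If the ORC reader's schema is
// missing a column the metastore schema has, the lookup throws.
val fieldNames = Seq("inv_item_sk", "inv_warehouse_sk", "inv_quantity_on_hand") // hypothetical stale schema
val index: Map[String, Int] = fieldNames.zipWithIndex.toMap

def fieldIndex(name: String): Int =
  index.getOrElse(name,
    throw new IllegalArgumentException(s"""Field "$name" does not exist."""))
```

With this model, `fieldIndex("inv_item_sk")` succeeds while `fieldIndex("inv_date_sk")` throws the IllegalArgumentException seen above, which is consistent with the column existing in the metastore ("desc inventory") but not in the schema the ORC path resolves.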
[jira] [Issue Comment Deleted] (SPARK-15757) Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc
[ https://issues.apache.org/jira/browse/SPARK-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

marymwu updated SPARK-15757:
----------------------------
    Comment: was deleted

(was: Actually, the field "inv_date_sk" does exist! We have executed "desc inventory"; the result is attached.)

> Error occurs when using Spark sql "select" statement on orc file after hive
> sql "insert overwrite tb1 select * from sourcTb" has been executed on this
> orc file
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-15757
>                 URL: https://issues.apache.org/jira/browse/SPARK-15757
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: marymwu
>         Attachments: Result.png
>
>
> Error occurs when using Spark sql "select" statement on orc file after hive
> sql "insert overwrite tb1 select * from sourcTb" has been executed
> 0: jdbc:hive2://172.19.200.158:40099/default> select * from inventory;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 7.0 failed 8 times, most recent failure: Lost task 0.7 in
> stage 7.0 (TID 2532, smokeslave5.avatar.lenovomm.com):
> java.lang.IllegalArgumentException: Field "inv_date_sk" does not exist.
[jira] [Updated] (SPARK-15757) Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc file
[ https://issues.apache.org/jira/browse/SPARK-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

marymwu updated SPARK-15757:
----------------------------
    Attachment: Result.png

> Error occurs when using Spark sql "select" statement on orc file after hive
> sql "insert overwrite tb1 select * from sourcTb" has been executed on this
> orc file
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-15757
>                 URL: https://issues.apache.org/jira/browse/SPARK-15757
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: marymwu
>         Attachments: Result.png
>
>
> Error occurs when using Spark sql "select" statement on orc file after hive
> sql "insert overwrite tb1 select * from sourcTb" has been executed
> 0: jdbc:hive2://172.19.200.158:40099/default> select * from inventory;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 7.0 failed 8 times, most recent failure: Lost task 0.7 in
> stage 7.0 (TID 2532, smokeslave5.avatar.lenovomm.com):
> java.lang.IllegalArgumentException: Field "inv_date_sk" does not exist.
[jira] [Commented] (SPARK-15757) Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc file
[ https://issues.apache.org/jira/browse/SPARK-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316168#comment-15316168 ]

marymwu commented on SPARK-15757:
---------------------------------

Actually, the field "inv_date_sk" does exist! We have executed "desc inventory"; the result is attached.

> Error occurs when using Spark sql "select" statement on orc file after hive
> sql "insert overwrite tb1 select * from sourcTb" has been executed on this
> orc file
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-15757
>                 URL: https://issues.apache.org/jira/browse/SPARK-15757
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: marymwu
>
>
> Error occurs when using Spark sql "select" statement on orc file after hive
> sql "insert overwrite tb1 select * from sourcTb" has been executed
> 0: jdbc:hive2://172.19.200.158:40099/default> select * from inventory;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 7.0 failed 8 times, most recent failure: Lost task 0.7 in
> stage 7.0 (TID 2532, smokeslave5.avatar.lenovomm.com):
> java.lang.IllegalArgumentException: Field "inv_date_sk" does not exist.
[jira] [Commented] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException
[ https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316163#comment-15316163 ]

Jie Huang commented on SPARK-15046:
-----------------------------------

The interesting thing is that the bug does not reproduce after switching back
to version 1.6, even though that code snippet has been there since 1.6. My
understanding is that when we submit a new job we do not yet have the
credential file (the first time), so this function is not called at that
point. That is why the issue was not noticed before.

> When running hive-thriftserver with yarn on a secure cluster the workers fail
> with java.lang.NumberFormatException
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-15046
>                 URL: https://issues.apache.org/jira/browse/SPARK-15046
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Trystan Leftwich
>
>
> When running hive-thriftserver with yarn on a secure cluster
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception:
> java.lang.NumberFormatException: For input string: "86400079ms"
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>     at java.lang.Long.parseLong(Long.java:441)
>     at java.lang.Long.parseLong(Long.java:483)
>     at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>     at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>     at org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>     at org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>     at scala.Option.map(Option.scala:146)
>     at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>     at org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>     at org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>     at org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>     at scala.Option.foreach(Option.scala:257)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>     at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>     at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>     at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>     at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>     at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
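[Editorial note] The root cause visible in the trace is SparkConf.getLong calling a plain String.toLong on a value that was stored with a time-unit suffix. That parse failure reproduces in plain Scala, no Spark required (stripping the suffix below is just a stand-in for a unit-aware read, not Spark's actual fix):

```scala
// The renewal interval was stored as "86400079ms", but getLong reads it
// back with String.toLong, which cannot parse the unit suffix and throws
// NumberFormatException, exactly as in the trace above.
val stored = "86400079ms"

val asLong: Option[Long] =
  try Some(stored.toLong)
  catch { case _: NumberFormatException => None }
// asLong is None: the "ms" suffix makes toLong throw

// A unit-aware read succeeds; here we simply strip the known suffix as a
// minimal stand-in for parsing time-suffixed configuration values.
val millis = stored.stripSuffix("ms").toLong
```

This is why the fix is to read the value with a time-aware accessor (or store it without the suffix) rather than with getLong.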
[jira] [Commented] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException
[ https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316162#comment-15316162 ] Jie Huang commented on SPARK-15046: --- Yes. I met the similar issue. And after that patch applied, it works. > When running hive-thriftserver with yarn on a secure cluster the workers fail > with java.lang.NumberFormatException > -- > > Key: SPARK-15046 > URL: https://issues.apache.org/jira/browse/SPARK-15046 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Trystan Leftwich > > When running hive-thriftserver with yarn on a secure cluster > (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with > the following error. > {code} > 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: > java.lang.NumberFormatException: For input string: "86400079ms" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Long.parseLong(Long.java:441) > at java.lang.Long.parseLong(Long.java:483) > at > scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276) > at scala.collection.immutable.StringOps.toLong(StringOps.scala:29) > at > org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380) > at > org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380) > at scala.Option.map(Option.scala:146) > at org.apache.spark.SparkConf.getLong(SparkConf.scala:380) > at > org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289) > at > org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89) > at > org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243) > at > 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
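The failure above boils down to reading back a duration that was stored with a time-unit suffix: `SparkConf.getLong` parses the raw string as a plain long, so a value like "86400079ms" cannot be parsed. A minimal Java sketch of that failure shape (the `getLong` helper here is an illustrative stand-in, not Spark's actual implementation):

```java
public class DurationParseDemo {
    // Stand-in for SparkConf.getLong: parses the raw string as a plain long,
    // so a unit suffix such as "ms" makes Long.parseLong throw.
    static long getLong(String rawValue) {
        return Long.parseLong(rawValue);
    }

    public static void main(String[] args) {
        System.out.println(getLong("86400079")); // parses fine
        try {
            getLong("86400079ms"); // suffix survives into the parse
        } catch (NumberFormatException e) {
            // Same message shape as the stack trace above:
            // For input string: "86400079ms"
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```

This is why the fix is to read such properties with a time-aware getter (e.g. one that understands "ms"/"s" suffixes) rather than a raw long parse.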
[jira] [Updated] (SPARK-15704) TungstenAggregate crashes
[ https://issues.apache.org/jira/browse/SPARK-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15704: Assignee: (was: Koert Kuipers) > TungstenAggregate crashes > -- > > Key: SPARK-15704 > URL: https://issues.apache.org/jira/browse/SPARK-15704 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hiroshi Inoue > Fix For: 2.0.0 > > > When I run DatasetBenchmark, the JVM crashes while executing "Dataset complex > Aggregator" test case due to IndexOutOfBoundsException. > The error happens in TungstenAggregate; the mappings between bufferSerializer > and bufferDeserializer are broken due to unresolved attribute. > {quote} > 16/06/02 01:41:05 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID > 232) > java.lang.IndexOutOfBoundsException: -1 > at > scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65) > at scala.collection.immutable.List.apply(List.scala:84) > at > org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate$RichAttribute.right(interfaces.scala:389) > at > org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:110) > at > org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:109) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:336) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:334) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at
[jira] [Updated] (SPARK-15704) TungstenAggregate crashes
[ https://issues.apache.org/jira/browse/SPARK-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15704: Assignee: Koert Kuipers > TungstenAggregate crashes > -- > > Key: SPARK-15704 > URL: https://issues.apache.org/jira/browse/SPARK-15704 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hiroshi Inoue >Assignee: Koert Kuipers > Fix For: 2.0.0 > > > When I run DatasetBenchmark, the JVM crashes while executing "Dataset complex > Aggregator" test case due to IndexOutOfBoundsException. > The error happens in TungstenAggregate; the mappings between bufferSerializer > and bufferDeserializer are broken due to unresolved attribute. > {quote} > 16/06/02 01:41:05 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID > 232) > java.lang.IndexOutOfBoundsException: -1 > at > scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65) > at scala.collection.immutable.List.apply(List.scala:84) > at > org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate$RichAttribute.right(interfaces.scala:389) > at > org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:110) > at > org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:109) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:336) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:334) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at
[jira] [Resolved] (SPARK-15704) TungstenAggregate crashes
[ https://issues.apache.org/jira/browse/SPARK-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15704. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13446 [https://github.com/apache/spark/pull/13446] > TungstenAggregate crashes > -- > > Key: SPARK-15704 > URL: https://issues.apache.org/jira/browse/SPARK-15704 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hiroshi Inoue > Fix For: 2.0.0 > > > When I run DatasetBenchmark, the JVM crashes while executing "Dataset complex > Aggregator" test case due to IndexOutOfBoundsException. > The error happens in TungstenAggregate; the mappings between bufferSerializer > and bufferDeserializer are broken due to unresolved attribute. > {quote} > 16/06/02 01:41:05 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID > 232) > java.lang.IndexOutOfBoundsException: -1 > at > scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65) > at scala.collection.immutable.List.apply(List.scala:84) > at > org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate$RichAttribute.right(interfaces.scala:389) > at > org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:110) > at > org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:109) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:336) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:334) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at >
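The `IndexOutOfBoundsException: -1` above is the classic signature of an `indexOf` lookup that fails (here because an attribute in the serializer/deserializer mapping stayed unresolved) and the resulting -1 being used directly as a list index. A simplified Java sketch of that failure shape (the attribute names and the `lookup` helper are made up for illustration; the real code indexes parallel attribute lists in `DeclarativeAggregate`):

```java
import java.util.Arrays;
import java.util.List;

public class UnresolvedLookupDemo {
    // Mimics the broken mapping: find an attribute's position in one list
    // and use it to index a parallel list. If the attribute is missing
    // (e.g. it was never resolved), indexOf returns -1 and get(-1) throws.
    static String lookup(List<String> keys, List<String> values, String key) {
        return values.get(keys.indexOf(key));
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("count", "sum");
        List<String> values = Arrays.asList("count#1", "sum#2");
        System.out.println(lookup(keys, values, "sum")); // found: sum#2
        try {
            lookup(keys, values, "avg"); // missing key -> get(-1)
        } catch (IndexOutOfBoundsException e) {
            System.out.println("IndexOutOfBoundsException: " + e.getMessage());
        }
    }
}
```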
[jira] [Resolved] (SPARK-15748) Replace inefficient foldLeft() call in PartitionStatistics
[ https://issues.apache.org/jira/browse/SPARK-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15748. - Resolution: Fixed Fix Version/s: 2.0.0 > Replace inefficient foldLeft() call in PartitionStatistics > -- > > Key: SPARK-15748 > URL: https://issues.apache.org/jira/browse/SPARK-15748 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > PartitionStatistics uses foldLeft and list concatenation to flatten an > iterator of lists, but this is extremely inefficient compared to simply doing > flatMap/flatten because it performs many unnecessary object allocations. > Simply replacing this foldLeft by a flatMap results in fair performance gains > when constructing PartitionStatistics instances for tables with many columns.
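The inefficiency described above is worth seeing concretely: flattening a list of lists by fold-style concatenation copies everything accumulated so far at each step (quadratic allocations), while flatMap makes a single pass into one output list. A Java approximation (the Scala original folds with `++`; here a fresh copy per step plays that role):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlattenDemo {
    // foldLeft-style: each step copies the accumulator into a brand-new
    // list, so flattening n lists costs O(n^2) element copies.
    static List<Integer> flattenByFold(List<List<Integer>> lists) {
        List<Integer> acc = new ArrayList<>();
        for (List<Integer> l : lists) {
            List<Integer> next = new ArrayList<>(acc); // fresh allocation per step
            next.addAll(l);
            acc = next;
        }
        return acc;
    }

    // flatMap-style: one pass, one output list.
    static List<Integer> flattenByFlatMap(List<List<Integer>> lists) {
        return lists.stream().flatMap(List::stream).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> lists =
                Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5));
        System.out.println(flattenByFold(lists));    // [1, 2, 3, 4, 5]
        System.out.println(flattenByFlatMap(lists)); // [1, 2, 3, 4, 5]
    }
}
```

Both produce the same result; only the allocation behavior differs, which is exactly the gain the issue reports for tables with many columns.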
[jira] [Resolved] (SPARK-15657) RowEncoder should validate the data type of input object
[ https://issues.apache.org/jira/browse/SPARK-15657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-15657. Resolution: Fixed Fix Version/s: 2.0.0 Resolved by https://github.com/apache/spark/pull/13401 > RowEncoder should validate the data type of input object > > > Key: SPARK-15657 > URL: https://issues.apache.org/jira/browse/SPARK-15657 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > >
[jira] [Commented] (SPARK-12923) Optimize successive dapply() calls in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316062#comment-15316062 ] Shivaram Venkataraman commented on SPARK-12923: --- [~sunrui] Can we move this from the UDFs umbrella and retarget this to 2.1.0 ? If this change is small it'll also be cool to have this in 2.0 > Optimize successive dapply() calls in SparkR > > > Key: SPARK-12923 > URL: https://issues.apache.org/jira/browse/SPARK-12923 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui > > For consecutive dapply() calls on a same DataFrame, optimize them to launch R > worker once instead of multiple times for performance improvement
[jira] [Commented] (SPARK-15521) Add high level APIs based on dapply and gapply for easier usage
[ https://issues.apache.org/jira/browse/SPARK-15521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316060#comment-15316060 ] Shivaram Venkataraman commented on SPARK-15521: --- [~sunrui] Can we move this from the UDFs umbrella and retarget this to 2.1.0 ? > Add high level APIs based on dapply and gapply for easier usage > --- > > Key: SPARK-15521 > URL: https://issues.apache.org/jira/browse/SPARK-15521 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Sun Rui > > dapply() and gapply() of SparkDataFrame are two basic functions. For easier > usage to users in the R community, some high level functions can be added > based on them. > Candidates are: > http://exposurescience.org/heR.doc/library/heR.Misc/html/dapply.html > http://exposurescience.org/heR.doc/library/stats/html/aggregate.html
[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316050#comment-15316050 ] Yan Chen commented on SPARK-15716: -- We think the behavior of using local directory is due to this piece of code in RDD class: {code:java} def getCheckpointFile: Option[String] = { checkpointData match { case Some(reliable: ReliableRDDCheckpointData[T]) => reliable.getCheckpointDir case _ => None } } {code} Apparently local directory is not reliable one. > Memory usage of driver keeps growing up in Spark Streaming > -- > > Key: SPARK-15716 > URL: https://issues.apache.org/jira/browse/SPARK-15716 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0 > Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92 > SUSE Linux, CentOS 6 and CentOS 7 >Reporter: Yan Chen > Original Estimate: 48h > Remaining Estimate: 48h > > Code: > {code:java} > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.streaming.Durations; > import org.apache.spark.streaming.StreamingContext; > import org.apache.spark.streaming.api.java.JavaPairDStream; > import org.apache.spark.streaming.api.java.JavaStreamingContext; > import org.apache.spark.streaming.api.java.JavaStreamingContextFactory; > public class App { > public static void main(String[] args) { > final String input = args[0]; > final String check = args[1]; > final long interval = Long.parseLong(args[2]); > final SparkConf conf = new SparkConf(); > conf.set("spark.streaming.minRememberDuration", "180s"); > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true"); > conf.set("spark.streaming.unpersist", "true"); > conf.set("spark.streaming.ui.retainedBatches", "10"); > conf.set("spark.ui.retainedJobs", "10"); > 
conf.set("spark.ui.retainedStages", "10"); > conf.set("spark.worker.ui.retainedExecutors", "10"); > conf.set("spark.worker.ui.retainedDrivers", "10"); > conf.set("spark.sql.ui.retainedExecutions", "10"); > JavaStreamingContextFactory jscf = () -> { > SparkContext sc = new SparkContext(conf); > sc.setCheckpointDir(check); > StreamingContext ssc = new StreamingContext(sc, > Durations.milliseconds(interval)); > JavaStreamingContext jssc = new JavaStreamingContext(ssc); > jssc.checkpoint(check); > // setup pipeline here > JavaPairDStreaminputStream = > jssc.fileStream( > input, > LongWritable.class, > Text.class, > TextInputFormat.class, > (filepath) -> Boolean.TRUE, > false > ); > JavaPairDStream usbk = inputStream > .updateStateByKey((current, state) -> state); > usbk.checkpoint(Durations.seconds(10)); > usbk.foreachRDD(rdd -> { > rdd.count(); > System.out.println("usbk: " + rdd.toDebugString().split("\n").length); > return null; > }); > return jssc; > }; > JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf); > jssc.start(); > jssc.awaitTermination(); > } > } > {code} > Command used to run the code > {code:none} > spark-submit --keytab [keytab] --principal [principal] --class [package].App > --master yarn --driver-memory 1g --executor-memory 1G --conf > "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf > "spark.executor.instances=2" --conf > "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf > "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log > -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy > -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] > [dir-on-hdfs] 200 > {code} > It's a very simple piece of code, when I ran it, the memory usage of driver > keeps going up. 
There is no file input in our runs. Batch interval is set to > 200 milliseconds; processing time for each batch is below 150 milliseconds, > with most batches below 70 milliseconds. > !http://i.imgur.com/uSzUui6.png! > The rightmost four red triangles are full GCs, which were triggered manually > by using the "jcmd pid GC.run" command. > I also did more experiments in the second and third comments I posted.
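The Scala `getCheckpointFile` snippet quoted in the comment above returns a path only when the checkpoint data is the reliable (HDFS-backed) kind; a local checkpoint yields `None`, which is why switching to a local checkpoint directory changes the behavior. A rough Java rendering of that pattern match (the class names here are illustrative stand-ins, not Spark's actual Java API):

```java
import java.util.Optional;

public class CheckpointDemo {
    // Illustrative stand-ins for Spark's checkpoint-data hierarchy.
    interface CheckpointData {}
    static class ReliableCheckpointData implements CheckpointData {
        final String dir;
        ReliableCheckpointData(String dir) { this.dir = dir; }
    }
    static class LocalCheckpointData implements CheckpointData {}

    // Mirrors the quoted RDD.getCheckpointFile: only reliable
    // (externally stored) checkpoints expose a checkpoint path.
    static Optional<String> getCheckpointFile(CheckpointData data) {
        if (data instanceof ReliableCheckpointData) {
            return Optional.of(((ReliableCheckpointData) data).dir);
        }
        return Optional.empty(); // local checkpoints are not considered reliable
    }

    public static void main(String[] args) {
        System.out.println(getCheckpointFile(new ReliableCheckpointData("hdfs:///ckpt")));
        System.out.println(getCheckpointFile(new LocalCheckpointData()));
    }
}
```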
[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316045#comment-15316045 ] Yan Chen edited comment on SPARK-15716 at 6/5/16 10:11 PM: --- The same thing happens on 1.5.0, 1.6.0, 1.6.1, 2.0.0 preview. However, if I set the checkpointing directory to local one when only using one node as both master and worker, the issue goes aways. The behavior is the same for all the versions, regardless of which environment it's running. We already tried standalone and yarn-client mode. was (Author: yani.chen): The same thing happens on 1.6.0, 1.6.1, 2.0.0 preview. However, if I set the checkpointing directory to local one when only using one node as both master and worker, the issue goes aways. The behavior is the same for all the versions, regardless of which environment it's running. We already tried standalone and yarn-client mode. > Memory usage of driver keeps growing up in Spark Streaming > -- > > Key: SPARK-15716 > URL: https://issues.apache.org/jira/browse/SPARK-15716 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0 > Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92 > SUSE Linux, CentOS 6 and CentOS 7 >Reporter: Yan Chen > Original Estimate: 48h > Remaining Estimate: 48h > > Code: > {code:java} > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.streaming.Durations; > import org.apache.spark.streaming.StreamingContext; > import org.apache.spark.streaming.api.java.JavaPairDStream; > import org.apache.spark.streaming.api.java.JavaStreamingContext; > import org.apache.spark.streaming.api.java.JavaStreamingContextFactory; > public class App { > public static void main(String[] args) { > final String input = args[0]; > final String 
check = args[1]; > final long interval = Long.parseLong(args[2]); > final SparkConf conf = new SparkConf(); > conf.set("spark.streaming.minRememberDuration", "180s"); > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true"); > conf.set("spark.streaming.unpersist", "true"); > conf.set("spark.streaming.ui.retainedBatches", "10"); > conf.set("spark.ui.retainedJobs", "10"); > conf.set("spark.ui.retainedStages", "10"); > conf.set("spark.worker.ui.retainedExecutors", "10"); > conf.set("spark.worker.ui.retainedDrivers", "10"); > conf.set("spark.sql.ui.retainedExecutions", "10"); > JavaStreamingContextFactory jscf = () -> { > SparkContext sc = new SparkContext(conf); > sc.setCheckpointDir(check); > StreamingContext ssc = new StreamingContext(sc, > Durations.milliseconds(interval)); > JavaStreamingContext jssc = new JavaStreamingContext(ssc); > jssc.checkpoint(check); > // setup pipeline here > JavaPairDStreaminputStream = > jssc.fileStream( > input, > LongWritable.class, > Text.class, > TextInputFormat.class, > (filepath) -> Boolean.TRUE, > false > ); > JavaPairDStream usbk = inputStream > .updateStateByKey((current, state) -> state); > usbk.checkpoint(Durations.seconds(10)); > usbk.foreachRDD(rdd -> { > rdd.count(); > System.out.println("usbk: " + rdd.toDebugString().split("\n").length); > return null; > }); > return jssc; > }; > JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf); > jssc.start(); > jssc.awaitTermination(); > } > } > {code} > Command used to run the code > {code:none} > spark-submit --keytab [keytab] --principal [principal] --class [package].App > --master yarn --driver-memory 1g --executor-memory 1G --conf > "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf > "spark.executor.instances=2" --conf > "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf > 
"spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log > -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy > -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] > [dir-on-hdfs] 200 > {code} > It's a very simple piece of code, when I ran it, the memory usage of driver > keeps going up. There is no file input in our runs. Batch interval
[jira] [Updated] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Chen updated SPARK-15716: - Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92 SUSE Linux, CentOS 6 and CentOS 7 was:Oracle Java 1.8.0_51, SUSE Linux > Memory usage of driver keeps growing up in Spark Streaming > -- > > Key: SPARK-15716 > URL: https://issues.apache.org/jira/browse/SPARK-15716 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0 > Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92 > SUSE Linux, CentOS 6 and CentOS 7 >Reporter: Yan Chen > Original Estimate: 48h > Remaining Estimate: 48h > > Code: > {code:java} > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.streaming.Durations; > import org.apache.spark.streaming.StreamingContext; > import org.apache.spark.streaming.api.java.JavaPairDStream; > import org.apache.spark.streaming.api.java.JavaStreamingContext; > import org.apache.spark.streaming.api.java.JavaStreamingContextFactory; > public class App { > public static void main(String[] args) { > final String input = args[0]; > final String check = args[1]; > final long interval = Long.parseLong(args[2]); > final SparkConf conf = new SparkConf(); > conf.set("spark.streaming.minRememberDuration", "180s"); > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true"); > conf.set("spark.streaming.unpersist", "true"); > conf.set("spark.streaming.ui.retainedBatches", "10"); > conf.set("spark.ui.retainedJobs", "10"); > conf.set("spark.ui.retainedStages", "10"); > conf.set("spark.worker.ui.retainedExecutors", "10"); > conf.set("spark.worker.ui.retainedDrivers", "10"); > conf.set("spark.sql.ui.retainedExecutions", "10"); > 
JavaStreamingContextFactory jscf = () -> { > SparkContext sc = new SparkContext(conf); > sc.setCheckpointDir(check); > StreamingContext ssc = new StreamingContext(sc, > Durations.milliseconds(interval)); > JavaStreamingContext jssc = new JavaStreamingContext(ssc); > jssc.checkpoint(check); > // setup pipeline here > JavaPairDStream<LongWritable, Text> inputStream = > jssc.fileStream( > input, > LongWritable.class, > Text.class, > TextInputFormat.class, > (filepath) -> Boolean.TRUE, > false > ); > JavaPairDStream<LongWritable, Text> usbk = inputStream > .updateStateByKey((current, state) -> state); > usbk.checkpoint(Durations.seconds(10)); > usbk.foreachRDD(rdd -> { > rdd.count(); > System.out.println("usbk: " + rdd.toDebugString().split("\n").length); > return null; > }); > return jssc; > }; > JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf); > jssc.start(); > jssc.awaitTermination(); > } > } > {code} > Command used to run the code: > {code:none} > spark-submit --keytab [keytab] --principal [principal] --class [package].App > --master yarn --driver-memory 1g --executor-memory 1G --conf > "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf > "spark.executor.instances=2" --conf > "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf > "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log > -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy > -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] > [dir-on-hdfs] 200 > {code} > It's a very simple piece of code; when I ran it, the memory usage of the driver kept going up. There is no file input in our runs. The batch interval is set to 200 milliseconds; processing time for each batch is below 150 milliseconds, and most batches are below 70 milliseconds. 
> !http://i.imgur.com/uSzUui6.png! > The rightmost four red triangles are full GCs, which were triggered manually with the "jcmd <pid> GC.run" command. > I also did more experiments in the second and third comments I posted.
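The heap-usage graph linked above was produced by sampling the driver JVM over time. As a point of reference, here is a minimal, Spark-independent sketch of that kind of sampling using the standard `java.lang.management` API; the class name, loop count, and sampling interval are illustrative, not taken from the reporter's setup:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapSampler {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        // Sample heap usage a few times; in a real streaming driver this
        // would run on a daemon thread for the lifetime of the job.
        for (int i = 0; i < 3; i++) {
            MemoryUsage heap = memoryBean.getHeapMemoryUsage();
            System.out.println("heap used: " + heap.getUsed() / (1024 * 1024)
                + " MiB / committed: " + heap.getCommitted() / (1024 * 1024) + " MiB");
            Thread.sleep(100);
        }
    }
}
```

A steady climb in the "used" figure across many batches, surviving full GCs, is the symptom this issue describes.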
[jira] [Updated] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Chen updated SPARK-15716: - Affects Version/s: 1.5.0 > Memory usage of driver keeps growing up in Spark Streaming > --
[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316045#comment-15316045 ] Yan Chen edited comment on SPARK-15716 at 6/5/16 10:07 PM: --- The same thing happens on 1.6.0, 1.6.1, and the 2.0.0 preview. However, if I set the checkpointing directory to a local one when using only one node as both master and worker, the issue goes away. The behavior is the same for all the versions, regardless of which environment it runs in. We have already tried standalone and yarn-client modes. was (Author: yani.chen): The same thing happens on 1.6.0, 1.6.1, 2.0.0 preview. However, if I set the checkpointing directory to local one when only using one node as both master and worker, the issue goes aways. > Memory usage of driver keeps growing up in Spark Streaming > --
[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316047#comment-15316047 ] Yan Chen commented on SPARK-15716: -- The memory usage goes up and eventually makes the whole streaming process unresponsive. > Memory usage of driver keeps growing up in Spark Streaming > --
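The manual full GCs shown in the graph were requested externally with `jcmd <pid> GC.run`. The same kind of collection request can be issued from inside the JVM, and the GC MXBeans can confirm whether a collection actually ran; this sketch is hypothetical and not part of the reported code, and note that both `System.gc()` and `jcmd ... GC.run` are only requests, which the JVM ignores when started with `-XX:+DisableExplicitGC`:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {
    // Sum collection counts across all registered collectors
    // (young- and old-generation collectors are reported separately).
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {  // -1 means the count is unavailable
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalCollections();
        System.gc();  // in-process analogue of "jcmd <pid> GC.run"
        long after = totalCollections();
        System.out.println("GC count before=" + before + ", after=" + after);
    }
}
```

If the heap-used figure does not drop after such a forced full GC, the retained objects are still strongly reachable from the driver, which is what the graph in this report shows.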
[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316045#comment-15316045 ] Yan Chen commented on SPARK-15716: -- The same thing happens on 1.6.0, 1.6.1, and the 2.0.0 preview. However, if I set the checkpointing directory to a local one when using only one node as both master and worker, the issue goes away. > Memory usage of driver keeps growing up in Spark Streaming > --
[jira] [Updated] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Chen updated SPARK-15716: - Affects Version/s: 2.0.0 1.6.0 1.6.1 > Memory usage of driver keeps growing up in Spark Streaming > --
[jira] [Commented] (SPARK-15773) Avoid creating local variable `sc` in examples if possible
[ https://issues.apache.org/jira/browse/SPARK-15773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316040#comment-15316040 ] Apache Spark commented on SPARK-15773: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/13520 > Avoid creating local variable `sc` in examples if possible > -- > > Key: SPARK-15773 > URL: https://issues.apache.org/jira/browse/SPARK-15773 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > Instead of using a local variable `sc` like the following example, this issue > uses `spark.sparkContext`. This makes the examples more concise and also fixes > a misleading pattern, i.e., creating a SparkContext from a SparkSession. > {code} > -println("Creating SparkContext") > -val sc = spark.sparkContext > - > println("Writing local file to DFS") > val dfsFilename = dfsDirPath + "/dfs_read_write_test" > -val fileRDD = sc.parallelize(fileContents) > +val fileRDD = spark.sparkContext.parallelize(fileContents) > {code} > This will change 12 files (+30 lines, -52 lines).
[jira] [Assigned] (SPARK-15773) Avoid creating local variable `sc` in examples if possible
[ https://issues.apache.org/jira/browse/SPARK-15773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15773: Assignee: (was: Apache Spark) > Avoid creating local variable `sc` in examples if possible > --
[jira] [Assigned] (SPARK-15773) Avoid creating local variable `sc` in examples if possible
[ https://issues.apache.org/jira/browse/SPARK-15773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15773: Assignee: Apache Spark > Avoid creating local variable `sc` in examples if possible > --
[jira] [Commented] (SPARK-13480) Regression with percentile() + function in GROUP BY
[ https://issues.apache.org/jira/browse/SPARK-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316037#comment-15316037 ] Abhinit Kumar Sinha commented on SPARK-13480: - Can I work on this issue? > Regression with percentile() + function in GROUP BY > --- > > Key: SPARK-13480 > URL: https://issues.apache.org/jira/browse/SPARK-13480 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jaka Jancar > > {code} > SELECT > percentile(load_time, 0.50) > FROM > ( > select '2000-01-01' queued_at, 100 load_time > union all > select '2000-01-01' queued_at, 110 load_time > union all > select '2000-01-01' queued_at, 120 load_time > ) t > GROUP BY > year(queued_at) > {code} > fails with > {code} > Error in SQL statement: SparkException: Job aborted due to stage failure: > Task 0 in stage 6067.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 6067.0 (TID 268774, ip-10-0-163-203.ec2.internal): > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: year(cast(queued_at#78201 as date))#78209 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:86) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:233) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:85) > at > 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:62) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:62) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(Projection.scala:62) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.Exchange.org$apache$spark$sql$execution$Exchange$$getPartitionKeyExtractor$1(Exchange.scala:197) > at > org.apache.spark.sql.execution.Exchange$$anonfun$3.apply(Exchange.scala:209) > at > org.apache.spark.sql.execution.Exchange$$anonfun$3.apply(Exchange.scala:208) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.RuntimeException: Couldn't find > year(cast(queued_at#78201 as date))#78209 in [queued_at#78201,load_time#78202] > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:92) > at >
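A workaround that is sometimes effective for this class of binding failure is to materialize the grouping expression as a named column inside the subquery and GROUP BY that alias, so the aggregate never has to re-bind `year(cast(queued_at as date))`. A minimal sketch of the rewrite, using SQLite with `strftime` and `avg()` as stand-ins (SQLite ships no `percentile()`), not the reporter's exact Spark SQL:

```python
import sqlite3

# Workaround sketch: push the grouping expression into the subquery as a
# named column (yr), then GROUP BY the alias instead of the raw expression.
# avg() stands in for percentile(), which SQLite lacks.
conn = sqlite3.connect(":memory:")
query = """
SELECT avg(load_time) AS p50_proxy
FROM (
    SELECT strftime('%Y', queued_at) AS yr, load_time
    FROM (
        SELECT '2000-01-01' AS queued_at, 100 AS load_time
        UNION ALL SELECT '2000-01-01', 110
        UNION ALL SELECT '2000-01-01', 120
    )
) t
GROUP BY yr
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(110.0,)] -- a single group for year 2000
```

On Spark 1.6 the analogous shape, `SELECT percentile(load_time, 0.50) FROM (SELECT year(queued_at) AS yr, load_time FROM ...) GROUP BY yr`, may sidestep the `BindReferences` failure, though that is untested here.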
[jira] [Updated] (SPARK-15773) Avoid creating local variable `sc` in examples if possible
[ https://issues.apache.org/jira/browse/SPARK-15773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15773: -- Summary: Avoid creating local variable `sc` in examples if possible (was: Avoid creating local variable `sc` in examples) > Avoid creating local variable `sc` in examples if possible > -- > > Key: SPARK-15773 > URL: https://issues.apache.org/jira/browse/SPARK-15773 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > Instead of using the local variable `sc` as in the following example, this issue > uses `spark.sparkContext`. This makes examples more concise, and also fixes > a misleading pattern, i.e., creating a SparkContext from a SparkSession. > {code} > -println("Creating SparkContext") > -val sc = spark.sparkContext > - > println("Writing local file to DFS") > val dfsFilename = dfsDirPath + "/dfs_read_write_test" > -val fileRDD = sc.parallelize(fileContents) > +val fileRDD = spark.sparkContext.parallelize(fileContents) > {code} > This will change 12 files (+30 lines, -52 lines). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15773) Avoid creating local variable `sc` in examples
Dongjoon Hyun created SPARK-15773: - Summary: Avoid creating local variable `sc` in examples Key: SPARK-15773 URL: https://issues.apache.org/jira/browse/SPARK-15773 Project: Spark Issue Type: Task Components: Examples Reporter: Dongjoon Hyun Priority: Minor Instead of using the local variable `sc` as in the following example, this issue uses `spark.sparkContext`. This makes examples more concise, and also fixes a misleading pattern, i.e., creating a SparkContext from a SparkSession. {code} -println("Creating SparkContext") -val sc = spark.sparkContext - println("Writing local file to DFS") val dfsFilename = dfsDirPath + "/dfs_read_write_test" -val fileRDD = sc.parallelize(fileContents) +val fileRDD = spark.sparkContext.parallelize(fileContents) {code} This will change 12 files (+30 lines, -52 lines). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15159) SparkSession R API
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316031#comment-15316031 ] Shivaram Venkataraman commented on SPARK-15159: --- [~felixcheung] Are you working on this? Just checking, as this would be great to have for the 2.0 RCs. cc [~rxin] > SparkSession R API > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui >Priority: Blocker > > HiveContext is to be deprecated in 2.0. Replace it with > SparkSession.builder.enableHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15772) Improve Scala API docs
nirav patel created SPARK-15772: --- Summary: Improve Scala API docs Key: SPARK-15772 URL: https://issues.apache.org/jira/browse/SPARK-15772 Project: Spark Issue Type: Improvement Components: docs, Documentation Reporter: nirav patel Hi, I just found out that the Spark Python APIs are much more elaborate than their Scala counterparts, e.g. https://spark.apache.org/docs/1.4.1/api/python/pyspark.html?highlight=treereduce#pyspark.RDD.treeReduce https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=treereduce#pyspark.RDD There are clear explanations of parameters, and there are examples as well. The same would be a great improvement to the Scala API: it would make the API friendlier in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
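For contrast, the pyspark pages the reporter links pair each method with parameter descriptions and a runnable example. A rough plain-Python illustration of that documentation register (the function below is illustrative only, not Spark code):

```python
from functools import reduce

def tree_reduce(f, values):
    """Reduce ``values`` with ``f`` in a multi-level tree pattern.

    Written in the documentation style being requested: a short
    description, parameter notes, and a worked example.

    :param f: associative binary function, e.g. ``operator.add``
    :param values: non-empty list of elements to combine
    :return: the combined value

    >>> import operator
    >>> tree_reduce(operator.add, list(range(1, 11)))
    55
    """
    if not values:
        raise ValueError("cannot reduce an empty collection")
    while len(values) > 1:
        # combine adjacent pairs; each pass is one level of the tree
        values = [reduce(f, values[i:i + 2]) for i in range(0, len(values), 2)]
    return values[0]
```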
[jira] [Commented] (SPARK-15691) Refactor and improve Hive support
[ https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316017#comment-15316017 ] Xiao Li commented on SPARK-15691: - Sure, will follow your suggestions. Thanks! > Refactor and improve Hive support > - > > Key: SPARK-15691 > URL: https://issues.apache.org/jira/browse/SPARK-15691 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Hive support is important to Spark SQL, as many Spark users use it to read > from Hive. The current architecture is very difficult to maintain, and this > ticket tracks progress towards getting us to a sane state. > A number of things we want to accomplish are: > - Move the Hive specific catalog logic into HiveExternalCatalog. > -- Remove HiveSessionCatalog. All Hive-related stuff should go into > HiveExternalCatalog. This would require moving caching either into > HiveExternalCatalog, or just into SessionCatalog. > -- Move using properties to store data source options into > HiveExternalCatalog. > -- Potentially more. > - Remove Hive's specific ScriptTransform implementation and make it more > general so we can put it in sql/core. > - Implement HiveTableScan (and write path) as a data source, so we don't need > a special planner rule for HiveTableScan. > - Remove HiveSharedState and HiveSessionState. > One thing that is still unclear to me is how to work with Hive UDF support. > We might still need a special planner rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15691) Refactor and improve Hive support
[ https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316010#comment-15316010 ] Reynold Xin commented on SPARK-15691: - [~smilegator] -- do you mind using google docs for the design doc? It'd be easier to leave comments in line. One high level feedback I have is that the current doc is very bottom up: it talks about functions, apis, code to move from one place to another. These are great, but it'd be great to start the doc with something high level, e.g. what are the components/classes that we should have in an ideal end-state. Another high-level comment (you might be thinking about some of it already but it is not super clear from looking at your current doc): often there might be multiple alternatives and it is good to discuss their tradeoffs (or if one dominates the other). For example, parquet/orc conversion -- I can think of two ways to do it. One is to put it in HiveExternalCatalog, and another is to move all of those into the more general handling in SessionCatalog. The two have their own pros and cons. > Refactor and improve Hive support > - > > Key: SPARK-15691 > URL: https://issues.apache.org/jira/browse/SPARK-15691 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Hive support is important to Spark SQL, as many Spark users use it to read > from Hive. The current architecture is very difficult to maintain, and this > ticket tracks progress towards getting us to a sane state. > A number of things we want to accomplish are: > - Move the Hive specific catalog logic into HiveExternalCatalog. > -- Remove HiveSessionCatalog. All Hive-related stuff should go into > HiveExternalCatalog. This would require moving caching either into > HiveExternalCatalog, or just into SessionCatalog. > -- Move using properties to store data source options into > HiveExternalCatalog. > -- Potentially more. 
> - Remove Hive's specific ScriptTransform implementation and make it more > general so we can put it in sql/core. > - Implement HiveTableScan (and write path) as a data source, so we don't need > a special planner rule for HiveTableScan. > - Remove HiveSharedState and HiveSessionState. > One thing that is still unclear to me is how to work with Hive UDF support. > We might still need a special planner rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
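The first bullet in the ticket, isolating Hive-specific catalog logic behind the general external-catalog interface, can be pictured with a small sketch. The classes and methods below are hypothetical plain Python, not Spark's actual ExternalCatalog API; the point is only that callers see one contract while all Hive-specific handling (e.g. encoding data-source options into table properties) lives in one subclass:

```python
from abc import ABC, abstractmethod

class ExternalCatalog(ABC):
    """General catalog contract; callers never see Hive-specific types."""
    @abstractmethod
    def create_table(self, db, name, properties): ...
    @abstractmethod
    def get_table(self, db, name): ...

class InMemoryCatalog(ExternalCatalog):
    """Non-Hive backend: stores table metadata as-is."""
    def __init__(self):
        self._tables = {}
    def create_table(self, db, name, properties):
        self._tables[(db, name)] = dict(properties)
    def get_table(self, db, name):
        return self._tables[(db, name)]

class HiveExternalCatalog(ExternalCatalog):
    """All Hive-specific handling is confined here, invisible to callers."""
    def __init__(self, metastore):
        self._metastore = metastore  # hypothetical metastore client (a dict here)
    def create_table(self, db, name, properties):
        # Hive-specific: encode data-source options into table properties
        encoded = {f"spark.sql.sources.{k}": v for k, v in properties.items()}
        self._metastore.setdefault((db, name), {}).update(encoded)
    def get_table(self, db, name):
        return self._metastore[(db, name)]
```

Either backend can sit behind the same session-level catalog, which is the separation the ticket asks for.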
[jira] [Issue Comment Deleted] (SPARK-14761) PySpark DataFrame.join should reject invalid join methods even when join columns are not specified
[ https://issues.apache.org/jira/browse/SPARK-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bijay Kumar Pathak updated SPARK-14761: --- Comment: was deleted (was: Hi [~joshrosen], how do we handle the on=None while passing to JVM api, since passing None throws {{java.lang.NullPointerException}} in my regression test.) > PySpark DataFrame.join should reject invalid join methods even when join > columns are not specified > -- > > Key: SPARK-14761 > URL: https://issues.apache.org/jira/browse/SPARK-14761 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Priority: Minor > Labels: starter > > In PySpark, the following invalid DataFrame join will not result an error: > {code} > df1.join(df2, how='not-a-valid-join-type') > {code} > The signature for `join` is > {code} > def join(self, other, on=None, how=None): > {code} > and its code ends up completely skipping handling of the `how` parameter when > `on` is `None`: > {code} > if on is not None and not isinstance(on, list): > on = [on] > if on is None or len(on) == 0: > jdf = self._jdf.join(other._jdf) > elif isinstance(on[0], basestring): > if how is None: > jdf = self._jdf.join(other._jdf, self._jseq(on), "inner") > else: > assert isinstance(how, basestring), "how should be basestring" > jdf = self._jdf.join(other._jdf, self._jseq(on), how) > else: > {code} > Given that this behavior can mask user errors (as in the above example), I > think that we should refactor this to first process all arguments and then > call the three-argument {{_.jdf.join}}. This would handle the above invalid > example by passing all arguments to the JVM DataFrame for analysis. > I'm not planning to work on this myself, so this bugfix (+ regression test!) > is up for grabs in case someone else wants to do it. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
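The refactoring Josh suggests, processing all arguments up front and then making a single three-argument dispatch, can be sketched in plain Python. The helper name and the set of join types below are illustrative, and in real PySpark the final validation happens on the JVM side once the arguments are forwarded:

```python
# Illustrative subset of join types; the real list lives in the JVM analyzer.
VALID_JOIN_TYPES = {"inner", "outer", "full", "left", "right",
                    "left_outer", "right_outer", "leftsemi"}

def normalize_join_args(on=None, how=None):
    """Validate and normalize join arguments *before* dispatch, so an
    invalid `how` is rejected even when `on` is None (the bug reported)."""
    if on is not None and not isinstance(on, list):
        on = [on]
    if how is None:
        how = "inner"
    if not isinstance(how, str):
        raise TypeError("how should be a string")
    if how not in VALID_JOIN_TYPES:
        raise ValueError(f"invalid join type: {how!r}")
    return on, how
```

With this shape, `df1.join(df2, how='not-a-valid-join-type')` fails fast instead of silently falling into the no-`on` branch.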
[jira] [Commented] (SPARK-15545) R remove non-exported unused methods, like jsonRDD
[ https://issues.apache.org/jira/browse/SPARK-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315991#comment-15315991 ] Shivaram Venkataraman commented on SPARK-15545: --- That discussion was before we had UDFs implemented in dapply, gapply etc. I think it's fine to start working on this. But we might not pull this into 2.0 and will probably include the removals in 2.1.0 (it'll also give us a chance to announce their deprecation with the 2.0.0 release) > R remove non-exported unused methods, like jsonRDD > -- > > Key: SPARK-15545 > URL: https://issues.apache.org/jira/browse/SPARK-15545 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Felix Cheung >Priority: Minor > > Need to review what should be removed. > one reason to not remove this right away is because we have been talking > about calling internal methods via `SparkR:::jsonRDD` for this and other RDD > methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15770) annotation audit for Experimental and DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-15770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15770. - Resolution: Fixed Assignee: zhengruifeng Fix Version/s: 2.0.0 > annotation audit for Experimental and DeveloperApi > -- > > Key: SPARK-15770 > URL: https://issues.apache.org/jira/browse/SPARK-15770 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.0.0 > > > 1, remove comments {{:: Experimental ::}} for non-experimental API > 2, add comments {{:: Experimental ::}} for experimental API > 3, add comments {{:: DeveloperApi ::}} for developerApi API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
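In Scala these markers are scaladoc banner comments (`:: Experimental ::`) paired with the `@Experimental` / `@DeveloperApi` annotations, and the audit is about keeping the comments in sync with the annotations. As a loose analogy only, not Spark's actual mechanism, a marker that is both visible in docs and checkable at runtime might look like:

```python
import warnings
from functools import wraps

def experimental(func):
    """Mark an API as experimental: prepend a doc banner (analogous to
    Spark's `:: Experimental ::` scaladoc marker) and warn when called."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(f"{func.__name__} is experimental and may change",
                      FutureWarning, stacklevel=2)
        return func(*args, **kwargs)
    wrapper.__doc__ = ":: Experimental ::\n" + (func.__doc__ or "")
    return wrapper

@experimental
def tree_aggregate(x):
    """Aggregate with a tree pattern (hypothetical example API)."""
    return x
```

The advantage of deriving the banner from the annotation (here, the decorator) is that the two can never drift apart, which is exactly the drift this audit had to fix by hand.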
[jira] [Issue Comment Deleted] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Chen updated SPARK-15716: - Comment: was deleted (was: Actually I tried in another environment with version 1.6.1, memory leak showed up again. We are still investigating in it. The run of 1.6.0 was done on hortonworks sandbox image version 2.4 in virtualbox, the 1.6.1 run is done on both yarn and standalone mode on a real cluster. I also tried to run it on a single machine with master local[2] with version 1.4.1, memory leak did not appear.) > Memory usage of driver keeps growing up in Spark Streaming > -- > > Key: SPARK-15716 > URL: https://issues.apache.org/jira/browse/SPARK-15716 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1 > Environment: Oracle Java 1.8.0_51, SUSE Linux >Reporter: Yan Chen > Original Estimate: 48h > Remaining Estimate: 48h > > Code: > {code:java} > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.streaming.Durations; > import org.apache.spark.streaming.StreamingContext; > import org.apache.spark.streaming.api.java.JavaPairDStream; > import org.apache.spark.streaming.api.java.JavaStreamingContext; > import org.apache.spark.streaming.api.java.JavaStreamingContextFactory; > public class App { > public static void main(String[] args) { > final String input = args[0]; > final String check = args[1]; > final long interval = Long.parseLong(args[2]); > final SparkConf conf = new SparkConf(); > conf.set("spark.streaming.minRememberDuration", "180s"); > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true"); > conf.set("spark.streaming.unpersist", "true"); > conf.set("spark.streaming.ui.retainedBatches", "10"); > conf.set("spark.ui.retainedJobs", "10"); > conf.set("spark.ui.retainedStages", 
"10"); > conf.set("spark.worker.ui.retainedExecutors", "10"); > conf.set("spark.worker.ui.retainedDrivers", "10"); > conf.set("spark.sql.ui.retainedExecutions", "10"); > JavaStreamingContextFactory jscf = () -> { > SparkContext sc = new SparkContext(conf); > sc.setCheckpointDir(check); > StreamingContext ssc = new StreamingContext(sc, > Durations.milliseconds(interval)); > JavaStreamingContext jssc = new JavaStreamingContext(ssc); > jssc.checkpoint(check); > // setup pipeline here > JavaPairDStream<LongWritable, Text> inputStream = > jssc.fileStream( > input, > LongWritable.class, > Text.class, > TextInputFormat.class, > (filepath) -> Boolean.TRUE, > false > ); > JavaPairDStream<LongWritable, Text> usbk = inputStream > .updateStateByKey((current, state) -> state); > usbk.checkpoint(Durations.seconds(10)); > usbk.foreachRDD(rdd -> { > rdd.count(); > System.out.println("usbk: " + rdd.toDebugString().split("\n").length); > return null; > }); > return jssc; > }; > JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf); > jssc.start(); > jssc.awaitTermination(); > } > } > {code} > Command used to run the code > {code:none} > spark-submit --keytab [keytab] --principal [principal] --class [package].App > --master yarn --driver-memory 1g --executor-memory 1G --conf > "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf > "spark.executor.instances=2" --conf > "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf > "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log > -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy > -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] > [dir-on-hdfs] 200 > {code} > It's a very simple piece of code, when I ran it, the memory usage of driver > keeps going up. There is no file input in our runs. 
Batch interval is set to > 200 milliseconds; processing time for each batch is below 150 milliseconds, > while most of which are below 70 milliseconds. > !http://i.imgur.com/uSzUui6.png! > The right most four red triangles are full GC's which are triggered manually > by using "jcmd pid GC.run" command. > I also did more experiments in the second and third comment I posted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
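The `spark.*.retained*` settings in the reproduction are all attempts to bound how much completed job/stage/batch metadata the driver keeps; growth like the reporter's graph suggests some structure is not subject to such a cap. The bounded-retention idea itself is just a fixed-size buffer, sketched here in plain Python (not Spark's implementation):

```python
from collections import deque

class RetainedStore:
    """Keep only the newest `limit` entries, the way the spark.*.retained*
    settings are meant to cap UI/job metadata; older entries are dropped
    so driver memory stays bounded regardless of how many batches run."""
    def __init__(self, limit):
        self._entries = deque(maxlen=limit)  # deque evicts oldest at maxlen
    def add(self, entry):
        self._entries.append(entry)
    def snapshot(self):
        return list(self._entries)

store = RetainedStore(limit=10)
for batch_id in range(1000):  # simulate 1000 completed streaming batches
    store.add({"batch": batch_id})
print(len(store.snapshot()))  # 10 -- memory does not grow with batch count
```

With a 200 ms batch interval the driver completes ~18,000 batches an hour, so any per-batch record kept outside a cap like this grows without bound.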
[jira] [Issue Comment Deleted] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Chen updated SPARK-15716: - Comment: was deleted (was: We already tried 1.6.0, it does not happen. Could I know what happens if the bug only happens on an old version? Do we fix it?) > Memory usage of driver keeps growing up in Spark Streaming > -- > > Key: SPARK-15716 > URL: https://issues.apache.org/jira/browse/SPARK-15716 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1 > Environment: Oracle Java 1.8.0_51, SUSE Linux >Reporter: Yan Chen > Original Estimate: 48h > Remaining Estimate: 48h > > Code: > {code:java} > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.streaming.Durations; > import org.apache.spark.streaming.StreamingContext; > import org.apache.spark.streaming.api.java.JavaPairDStream; > import org.apache.spark.streaming.api.java.JavaStreamingContext; > import org.apache.spark.streaming.api.java.JavaStreamingContextFactory; > public class App { > public static void main(String[] args) { > final String input = args[0]; > final String check = args[1]; > final long interval = Long.parseLong(args[2]); > final SparkConf conf = new SparkConf(); > conf.set("spark.streaming.minRememberDuration", "180s"); > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true"); > conf.set("spark.streaming.unpersist", "true"); > conf.set("spark.streaming.ui.retainedBatches", "10"); > conf.set("spark.ui.retainedJobs", "10"); > conf.set("spark.ui.retainedStages", "10"); > conf.set("spark.worker.ui.retainedExecutors", "10"); > conf.set("spark.worker.ui.retainedDrivers", "10"); > conf.set("spark.sql.ui.retainedExecutions", "10"); > JavaStreamingContextFactory jscf = () -> { > SparkContext sc = new SparkContext(conf); > 
sc.setCheckpointDir(check); > StreamingContext ssc = new StreamingContext(sc, > Durations.milliseconds(interval)); > JavaStreamingContext jssc = new JavaStreamingContext(ssc); > jssc.checkpoint(check); > // setup pipeline here > JavaPairDStreaminputStream = > jssc.fileStream( > input, > LongWritable.class, > Text.class, > TextInputFormat.class, > (filepath) -> Boolean.TRUE, > false > ); > JavaPairDStream usbk = inputStream > .updateStateByKey((current, state) -> state); > usbk.checkpoint(Durations.seconds(10)); > usbk.foreachRDD(rdd -> { > rdd.count(); > System.out.println("usbk: " + rdd.toDebugString().split("\n").length); > return null; > }); > return jssc; > }; > JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf); > jssc.start(); > jssc.awaitTermination(); > } > } > {code} > Command used to run the code > {code:none} > spark-submit --keytab [keytab] --principal [principal] --class [package].App > --master yarn --driver-memory 1g --executor-memory 1G --conf > "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf > "spark.executor.instances=2" --conf > "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf > "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log > -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy > -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] > [dir-on-hdfs] 200 > {code} > It's a very simple piece of code, when I ran it, the memory usage of driver > keeps going up. There is no file input in our runs. Batch interval is set to > 200 milliseconds; processing time for each batch is below 150 milliseconds, > while most of which are below 70 milliseconds. > !http://i.imgur.com/uSzUui6.png! 
> The right most four red triangles are full GC's which are triggered manually > by using "jcmd pid GC.run" command. > I also did more experiments in the second and third comment I posted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315972#comment-15315972 ] Yan Chen edited comment on SPARK-15716 at 6/5/16 6:29 PM: -- Actually I tried in another environment with version 1.6.1, memory leak showed up again. We are still investigating in it. The run of 1.6.0 was done on hortonworks sandbox image version 2.4 in virtualbox, the 1.6.1 run is done on both yarn and standalone mode on a real cluster. I also tried to run it on a single machine with master local[2] with version 1.4.1, memory leak did not appear. was (Author: yani.chen): Actually I tried in another environment with version 1.6.1, memory leak showed up again. We are still investigating in it. The run of 1.6.0 was done on hortonworks sandbox image version 2.4 in virtualbox, the 1.6.1 run is done on both yarn and standalone mode on a real cluster. > Memory usage of driver keeps growing up in Spark Streaming > -- > > Key: SPARK-15716 > URL: https://issues.apache.org/jira/browse/SPARK-15716 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1 > Environment: Oracle Java 1.8.0_51, SUSE Linux >Reporter: Yan Chen > Original Estimate: 48h > Remaining Estimate: 48h > > Code: > {code:java} > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.streaming.Durations; > import org.apache.spark.streaming.StreamingContext; > import org.apache.spark.streaming.api.java.JavaPairDStream; > import org.apache.spark.streaming.api.java.JavaStreamingContext; > import org.apache.spark.streaming.api.java.JavaStreamingContextFactory; > public class App { > public static void main(String[] args) { > final String input = args[0]; > final String check = args[1]; > final long interval = Long.parseLong(args[2]); > final SparkConf 
conf = new SparkConf(); > conf.set("spark.streaming.minRememberDuration", "180s"); > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true"); > conf.set("spark.streaming.unpersist", "true"); > conf.set("spark.streaming.ui.retainedBatches", "10"); > conf.set("spark.ui.retainedJobs", "10"); > conf.set("spark.ui.retainedStages", "10"); > conf.set("spark.worker.ui.retainedExecutors", "10"); > conf.set("spark.worker.ui.retainedDrivers", "10"); > conf.set("spark.sql.ui.retainedExecutions", "10"); > JavaStreamingContextFactory jscf = () -> { > SparkContext sc = new SparkContext(conf); > sc.setCheckpointDir(check); > StreamingContext ssc = new StreamingContext(sc, > Durations.milliseconds(interval)); > JavaStreamingContext jssc = new JavaStreamingContext(ssc); > jssc.checkpoint(check); > // setup pipeline here > JavaPairDStreaminputStream = > jssc.fileStream( > input, > LongWritable.class, > Text.class, > TextInputFormat.class, > (filepath) -> Boolean.TRUE, > false > ); > JavaPairDStream usbk = inputStream > .updateStateByKey((current, state) -> state); > usbk.checkpoint(Durations.seconds(10)); > usbk.foreachRDD(rdd -> { > rdd.count(); > System.out.println("usbk: " + rdd.toDebugString().split("\n").length); > return null; > }); > return jssc; > }; > JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf); > jssc.start(); > jssc.awaitTermination(); > } > } > {code} > Command used to run the code > {code:none} > spark-submit --keytab [keytab] --principal [principal] --class [package].App > --master yarn --driver-memory 1g --executor-memory 1G --conf > "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf > "spark.executor.instances=2" --conf > "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf > "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log > -XX:+PrintFlagsFinal 
-XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy > -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] > [dir-on-hdfs] 200 > {code} > It's a very simple piece of code, when I ran it, the memory usage of driver > keeps going up. There is no file input in our runs. Batch interval is set to > 200 milliseconds; processing time for each batch is below 150 milliseconds,
[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315972#comment-15315972 ] Yan Chen edited comment on SPARK-15716 at 6/5/16 6:25 PM: -- Actually I tried in another environment with version 1.6.1, memory leak showed up again. We are still investigating in it. The run of 1.6.0 was done on hortonworks sandbox image version 2.4 in virtualbox, the 1.6.1 run is done on both yarn and standalone mode on a real cluster. was (Author: yani.chen): Actually I tried in another environment with version 1.6.1, memory leak showed up again. We are still investigating in it. > Memory usage of driver keeps growing up in Spark Streaming > -- > > Key: SPARK-15716 > URL: https://issues.apache.org/jira/browse/SPARK-15716 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1 > Environment: Oracle Java 1.8.0_51, SUSE Linux >Reporter: Yan Chen > Original Estimate: 48h > Remaining Estimate: 48h > > Code: > {code:java} > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.streaming.Durations; > import org.apache.spark.streaming.StreamingContext; > import org.apache.spark.streaming.api.java.JavaPairDStream; > import org.apache.spark.streaming.api.java.JavaStreamingContext; > import org.apache.spark.streaming.api.java.JavaStreamingContextFactory; > public class App { > public static void main(String[] args) { > final String input = args[0]; > final String check = args[1]; > final long interval = Long.parseLong(args[2]); > final SparkConf conf = new SparkConf(); > conf.set("spark.streaming.minRememberDuration", "180s"); > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true"); > conf.set("spark.streaming.unpersist", "true"); > conf.set("spark.streaming.ui.retainedBatches", "10"); > 
conf.set("spark.ui.retainedJobs", "10"); > conf.set("spark.ui.retainedStages", "10"); > conf.set("spark.worker.ui.retainedExecutors", "10"); > conf.set("spark.worker.ui.retainedDrivers", "10"); > conf.set("spark.sql.ui.retainedExecutions", "10"); > JavaStreamingContextFactory jscf = () -> { > SparkContext sc = new SparkContext(conf); > sc.setCheckpointDir(check); > StreamingContext ssc = new StreamingContext(sc, > Durations.milliseconds(interval)); > JavaStreamingContext jssc = new JavaStreamingContext(ssc); > jssc.checkpoint(check); > // setup pipeline here > JavaPairDStreaminputStream = > jssc.fileStream( > input, > LongWritable.class, > Text.class, > TextInputFormat.class, > (filepath) -> Boolean.TRUE, > false > ); > JavaPairDStream usbk = inputStream > .updateStateByKey((current, state) -> state); > usbk.checkpoint(Durations.seconds(10)); > usbk.foreachRDD(rdd -> { > rdd.count(); > System.out.println("usbk: " + rdd.toDebugString().split("\n").length); > return null; > }); > return jssc; > }; > JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf); > jssc.start(); > jssc.awaitTermination(); > } > } > {code} > Command used to run the code > {code:none} > spark-submit --keytab [keytab] --principal [principal] --class [package].App > --master yarn --driver-memory 1g --executor-memory 1G --conf > "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf > "spark.executor.instances=2" --conf > "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf > "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log > -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy > -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] > [dir-on-hdfs] 200 > {code} > It's a very simple piece of code, when I ran it, the 
memory usage of driver > keeps going up. There is no file input in our runs. Batch interval is set to > 200 milliseconds; processing time for each batch is below 150 milliseconds, > while most of which are below 70 milliseconds. > !http://i.imgur.com/uSzUui6.png! > The right most four red triangles are full GC's which are triggered manually > by using "jcmd pid GC.run" command. > I also did more experiments in the second and third comment I
[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315972#comment-15315972 ] Yan Chen edited comment on SPARK-15716 at 6/5/16 6:19 PM: -- Actually I tried in another environment with version 1.6.1, memory leak showed up again. We are still investigating in it. was (Author: yani.chen): Actually I tried in another environment with version 1.6.1, the same behaviour showed up. We are still investigating in it. > Memory usage of driver keeps growing up in Spark Streaming > -- > > Key: SPARK-15716 > URL: https://issues.apache.org/jira/browse/SPARK-15716 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1 > Environment: Oracle Java 1.8.0_51, SUSE Linux >Reporter: Yan Chen > Original Estimate: 48h > Remaining Estimate: 48h > > Code: > {code:java} > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.streaming.Durations; > import org.apache.spark.streaming.StreamingContext; > import org.apache.spark.streaming.api.java.JavaPairDStream; > import org.apache.spark.streaming.api.java.JavaStreamingContext; > import org.apache.spark.streaming.api.java.JavaStreamingContextFactory; > public class App { > public static void main(String[] args) { > final String input = args[0]; > final String check = args[1]; > final long interval = Long.parseLong(args[2]); > final SparkConf conf = new SparkConf(); > conf.set("spark.streaming.minRememberDuration", "180s"); > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true"); > conf.set("spark.streaming.unpersist", "true"); > conf.set("spark.streaming.ui.retainedBatches", "10"); > conf.set("spark.ui.retainedJobs", "10"); > conf.set("spark.ui.retainedStages", "10"); > conf.set("spark.worker.ui.retainedExecutors", "10"); > 
conf.set("spark.worker.ui.retainedDrivers", "10"); > conf.set("spark.sql.ui.retainedExecutions", "10"); > JavaStreamingContextFactory jscf = () -> { > SparkContext sc = new SparkContext(conf); > sc.setCheckpointDir(check); > StreamingContext ssc = new StreamingContext(sc, > Durations.milliseconds(interval)); > JavaStreamingContext jssc = new JavaStreamingContext(ssc); > jssc.checkpoint(check); > // setup pipeline here > JavaPairDStreaminputStream = > jssc.fileStream( > input, > LongWritable.class, > Text.class, > TextInputFormat.class, > (filepath) -> Boolean.TRUE, > false > ); > JavaPairDStream usbk = inputStream > .updateStateByKey((current, state) -> state); > usbk.checkpoint(Durations.seconds(10)); > usbk.foreachRDD(rdd -> { > rdd.count(); > System.out.println("usbk: " + rdd.toDebugString().split("\n").length); > return null; > }); > return jssc; > }; > JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf); > jssc.start(); > jssc.awaitTermination(); > } > } > {code} > Command used to run the code > {code:none} > spark-submit --keytab [keytab] --principal [principal] --class [package].App > --master yarn --driver-memory 1g --executor-memory 1G --conf > "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf > "spark.executor.instances=2" --conf > "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf > "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log > -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy > -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] > [dir-on-hdfs] 200 > {code} > It's a very simple piece of code, when I ran it, the memory usage of driver > keeps going up. There is no file input in our runs. 
Batch interval is set to > 200 milliseconds; processing time for each batch is below 150 milliseconds, > and most batches are below 70 milliseconds. > !http://i.imgur.com/uSzUui6.png! > The rightmost four red triangles are full GCs, which were triggered manually > with the "jcmd <pid> GC.run" command. > I also did more experiments in the second and third comments I posted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,
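One thing worth noting about the snippet in the report: the update function `(current, state) -> state` never returns an empty result, so `updateStateByKey` keeps an entry for every key it has ever seen. That alone makes driver-side state grow monotonically with the key space. A minimal plain-Java sketch of the per-key contract (class and method names are mine, not Spark API; this only models the semantics, it is not Spark code):

```java
import java.util.*;
import java.util.function.BiFunction;

/** Toy model of updateStateByKey's per-key contract: a key's state is
 *  retained until the update function returns Optional.empty(). */
public class StateRetention {
    static <K, V, S> Map<K, S> updateState(
            Map<K, S> state,
            Map<K, List<V>> batch,
            BiFunction<List<V>, Optional<S>, Optional<S>> update) {
        Map<K, S> next = new HashMap<>();
        Set<K> keys = new HashSet<>(state.keySet());
        keys.addAll(batch.keySet());
        for (K k : keys) {
            List<V> values = batch.getOrDefault(k, Collections.emptyList());
            Optional<S> s = update.apply(values, Optional.ofNullable(state.get(k)));
            s.ifPresent(v -> next.put(k, v));   // empty => key is dropped
        }
        return next;
    }

    public static void main(String[] args) {
        // Like the JIRA snippet: always keep the old state (or 0 if new).
        BiFunction<List<Integer>, Optional<Integer>, Optional<Integer>> keepForever =
                (values, s) -> Optional.of(s.orElse(0));
        Map<String, Integer> state = new HashMap<>();
        // Feed 3 batches, each introducing one new key.
        for (int i = 0; i < 3; i++) {
            Map<String, List<Integer>> batch = new HashMap<>();
            batch.put("key" + i, Arrays.asList(1));
            state = updateState(state, batch, keepForever);
        }
        System.out.println(state.size()); // prints 3: one entry per key ever seen
    }
}
```

With an update function that eventually returns `Optional.empty()` for idle keys, the map shrinks again; whether that explains the growth reported here is a separate question, since the reporter says there was no file input at all.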
[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315972#comment-15315972 ] Yan Chen commented on SPARK-15716: -- Actually I tried in another environment with version 1.6.1, the same behaviour showed up. We are still investigating in it. > Memory usage of driver keeps growing up in Spark Streaming > -- > > Key: SPARK-15716 > URL: https://issues.apache.org/jira/browse/SPARK-15716 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1 > Environment: Oracle Java 1.8.0_51, SUSE Linux >Reporter: Yan Chen > Original Estimate: 48h > Remaining Estimate: 48h > > Code: > {code:java} > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.streaming.Durations; > import org.apache.spark.streaming.StreamingContext; > import org.apache.spark.streaming.api.java.JavaPairDStream; > import org.apache.spark.streaming.api.java.JavaStreamingContext; > import org.apache.spark.streaming.api.java.JavaStreamingContextFactory; > public class App { > public static void main(String[] args) { > final String input = args[0]; > final String check = args[1]; > final long interval = Long.parseLong(args[2]); > final SparkConf conf = new SparkConf(); > conf.set("spark.streaming.minRememberDuration", "180s"); > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true"); > conf.set("spark.streaming.unpersist", "true"); > conf.set("spark.streaming.ui.retainedBatches", "10"); > conf.set("spark.ui.retainedJobs", "10"); > conf.set("spark.ui.retainedStages", "10"); > conf.set("spark.worker.ui.retainedExecutors", "10"); > conf.set("spark.worker.ui.retainedDrivers", "10"); > conf.set("spark.sql.ui.retainedExecutions", "10"); > JavaStreamingContextFactory jscf = () -> { > SparkContext sc = new SparkContext(conf); > 
sc.setCheckpointDir(check); > StreamingContext ssc = new StreamingContext(sc, > Durations.milliseconds(interval)); > JavaStreamingContext jssc = new JavaStreamingContext(ssc); > jssc.checkpoint(check); > // setup pipeline here > JavaPairDStreaminputStream = > jssc.fileStream( > input, > LongWritable.class, > Text.class, > TextInputFormat.class, > (filepath) -> Boolean.TRUE, > false > ); > JavaPairDStream usbk = inputStream > .updateStateByKey((current, state) -> state); > usbk.checkpoint(Durations.seconds(10)); > usbk.foreachRDD(rdd -> { > rdd.count(); > System.out.println("usbk: " + rdd.toDebugString().split("\n").length); > return null; > }); > return jssc; > }; > JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf); > jssc.start(); > jssc.awaitTermination(); > } > } > {code} > Command used to run the code > {code:none} > spark-submit --keytab [keytab] --principal [principal] --class [package].App > --master yarn --driver-memory 1g --executor-memory 1G --conf > "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf > "spark.executor.instances=2" --conf > "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf > "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log > -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy > -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] > [dir-on-hdfs] 200 > {code} > It's a very simple piece of code, when I ran it, the memory usage of driver > keeps going up. There is no file input in our runs. Batch interval is set to > 200 milliseconds; processing time for each batch is below 150 milliseconds, > while most of which are below 70 milliseconds. > !http://i.imgur.com/uSzUui6.png! 
> The rightmost four red triangles are full GCs, which were triggered manually > with the "jcmd <pid> GC.run" command. > I also did more experiments in the second and third comments I posted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15691) Refactor and improve Hive support
[ https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315964#comment-15315964 ] Xiao Li commented on SPARK-15691: - In the PDF document, all the underlined text consists of hyperlinks that point to the related content. > Refactor and improve Hive support > - > > Key: SPARK-15691 > URL: https://issues.apache.org/jira/browse/SPARK-15691 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Hive support is important to Spark SQL, as many Spark users use it to read > from Hive. The current architecture is very difficult to maintain, and this > ticket tracks progress towards getting us to a sane state. > A number of things we want to accomplish are: > - Move the Hive specific catalog logic into HiveExternalCatalog. > -- Remove HiveSessionCatalog. All Hive-related stuff should go into > HiveExternalCatalog. This would require moving caching either into > HiveExternalCatalog, or just into SessionCatalog. > -- Move using properties to store data source options into > HiveExternalCatalog. > -- Potentially more. > - Remove Hive's specific ScriptTransform implementation and make it more > general so we can put it in sql/core. > - Implement HiveTableScan (and write path) as a data source, so we don't need > a special planner rule for HiveTableScan. > - Remove HiveSharedState and HiveSessionState. > One thing that is still unclear to me is how to work with Hive UDF support. > We might still need a special planner rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15691) Refactor and improve Hive support
[ https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315960#comment-15315960 ] Xiao Li commented on SPARK-15691: - For refactoring {{HiveMetastoreCatalog.scala}}, I just finished the design doc. The PDF version is available in the following link: https://www.dropbox.com/s/tsaoq2joegkdh1h/2016.06.05.HiveMetastoreCatalog.scala%20Refactoring.pdf?dl=0 The original Markdown file can be downloaded via https://www.dropbox.com/s/uita63wkdrmuqr2/2016.06.05.HiveMetastoreCatalog.scala%20Refactoring.md?dl=0 Please let me know if this is the right direction, and correct me if anything is not appropriate. [~rxin] Thank you very much! > Refactor and improve Hive support > - > > Key: SPARK-15691 > URL: https://issues.apache.org/jira/browse/SPARK-15691 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Hive support is important to Spark SQL, as many Spark users use it to read > from Hive. The current architecture is very difficult to maintain, and this > ticket tracks progress towards getting us to a sane state. > A number of things we want to accomplish are: > - Move the Hive specific catalog logic into HiveExternalCatalog. > -- Remove HiveSessionCatalog. All Hive-related stuff should go into > HiveExternalCatalog. This would require moving caching either into > HiveExternalCatalog, or just into SessionCatalog. > -- Move using properties to store data source options into > HiveExternalCatalog. > -- Potentially more. > - Remove Hive's specific ScriptTransform implementation and make it more > general so we can put it in sql/core. > - Implement HiveTableScan (and write path) as a data source, so we don't need > a special planner rule for HiveTableScan. > - Remove HiveSharedState and HiveSessionState. > One thing that is still unclear to me is how to work with Hive UDF support. > We might still need a special planner rule there. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15723) SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ name
[ https://issues.apache.org/jira/browse/SPARK-15723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315938#comment-15315938 ] Brett Randall commented on SPARK-15723: --- Thanks for merging. And thanks for the Scala repl test - I can confirm that this is driven by a combination of *both* default TimeZone and default Locale - the default Locale impacts the interpretation of the short TZ code, which makes sense. {{Australia/Sydney/en_AU}} -> {color:red}*false*{color} {noformat} scala -J-Duser.timezone="Australia/Sydney" -J-Duser.country=AU < val time = (new java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSz")).parse("2015-02-20T17:21:17.190EST").getTime time: Long = 1424413277190 scala> time == 1424470877190L res0: Boolean = false {noformat} {{Australia/Sydney/en_US}} -> {color:red}*false*{color} {noformat} scala -J-Duser.timezone="Australia/Sydney" -J-Duser.country=US < val time = (new java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSz")).parse("2015-02-20T17:21:17.190EST").getTime time: Long = 1424413277190 scala> time == 1424470877190L res0: Boolean = false {noformat} {{America/New_York/en_US}} -> {color:green}*true*{color} {noformat} scala -J-Duser.timezone="America/New_York" -J-Duser.country=US < val time = (new java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSz")).parse("2015-02-20T17:21:17.190EST").getTime time: Long = 1424470877190 scala> time == 1424470877190L res0: Boolean = true {noformat} So you were correct - this _can_ be disambiguated by applying a bias to the SDF in the code, but this would be necessarily a fixed bias, and it has to be done with a {{Calendar}} not a {{TimeZone}}: {code} sdf.setCalendar(Calendar.getInstance(TimeZone.getTimeZone("America/New_York"), new Locale("en_US"))) {code} I'm not certain this is better or more correct though, but it would remove any ambiguity in the short TZ codes - could be documented - all short TZ codes are evaluated as if they were in this default TZ/Locale. 
That might upset someone deploying that wants {{MST}} = Malaysia Standard Time and not Mountain Time. Make a note here if you think it is worth pursuing further, but I suspect we just have to honour the local env defaults and discourage abbreviated TZs. And the test fix is merged now, so all-good, thanks. > SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ > name > -- > > Key: SPARK-15723 > URL: https://issues.apache.org/jira/browse/SPARK-15723 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Brett Randall >Assignee: Brett Randall >Priority: Minor > Labels: test > Fix For: 1.6.2, 2.0.0 > > > {{org.apache.spark.status.api.v1.SimpleDateParamSuite}} has this assertion: > {code} > new SimpleDateParam("2015-02-20T17:21:17.190EST").timestamp should be > (1424470877190L) > {code} > This test is fragile and fails when executing in an environment where the > local default timezone causes {{EST}} to be interpreted as something other > than US Eastern Standard Time. If your local timezone is > {{Australia/Sydney}}, then {{EST}} equates to {{GMT+10}} and you will get: > {noformat} > date parsing *** FAILED *** > 1424413277190 was not equal to 1424470877190 (SimpleDateParamSuite.scala:29) > {noformat} > In short, {{SimpleDateFormat}} is sensitive to the local default {{TimeZone}} > when interpreting short zone names. According to the {{TimeZone}} javadoc, > they ought not be used: > {quote} > Three-letter time zone IDs > For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such > as "PST", "CTT", "AST") are also supported. However, their use is deprecated > because the same abbreviation is often used for multiple time zones (for > example, "CST" could be U.S. "Central Standard Time" and "China Standard > Time"), and the Java platform can then only recognize one of them. 
> {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
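The ambiguity discussed in this thread comes entirely from resolving three-letter zone abbreviations against the default TimeZone/Locale; numeric offsets avoid it. A small plain-JDK sketch (class and method names are mine; the expected epoch value is the one stated in the ticket) showing that parsing with an explicit `-0500` offset via pattern letter `Z` is deterministic regardless of the default zone:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class UnambiguousZoneParse {
    // Parse with an explicit numeric offset (pattern letter Z) instead of
    // a short zone name (pattern letter z).
    static long parseWithOffset(String s) {
        try {
            SimpleDateFormat sdf =
                    new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ", Locale.US);
            return sdf.parse(s).getTime();
        } catch (ParseException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The instant the test expects: 2015-02-20T17:21:17.190 US Eastern (-05:00).
        long expected = 1424470877190L;

        // Same result no matter which default zone is in effect.
        TimeZone saved = TimeZone.getDefault();
        try {
            for (String zone : new String[] {"Australia/Sydney", "America/New_York"}) {
                TimeZone.setDefault(TimeZone.getTimeZone(zone));
                long t = parseWithOffset("2015-02-20T17:21:17.190-0500");
                System.out.println(zone + " -> " + (t == expected)); // true both times
            }
        } finally {
            TimeZone.setDefault(saved);
        }
    }
}
```

This doesn't resolve the `MST`-style ambiguity for inputs that really do carry abbreviations, but it is the reason ISO-style offsets are the safer wire format for an API like `SimpleDateParam`.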
[jira] [Updated] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-15771: Description: Since SPARK-15617 deprecated {{precision}} in {{MulticlassClassificationEvaluator}}, many ML examples broken. {code} pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.' {code} We should use {{accuracy}} to replace {{precision}} in these examples. was:Since SPARK-15617 deprecated {{precision}} in {{MulticlassClassificationEvaluator}}, many ML examples broken. We should use {{accuracy}} to replace {{precision}} in these examples. > Many ML examples broken since we deprecated `precision` in > MulticlassClassificationEvaluator > > > Key: SPARK-15771 > URL: https://issues.apache.org/jira/browse/SPARK-15771 > Project: Spark > Issue Type: Bug > Components: Examples, ML >Reporter: Yanbo Liang > > Since SPARK-15617 deprecated {{precision}} in > {{MulticlassClassificationEvaluator}}, many ML examples broken. > {code} > pyspark.sql.utils.IllegalArgumentException: > u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName > given invalid value precision.' > {code} > We should use {{accuracy}} to replace {{precision}} in these examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
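For context on why SPARK-15617 could deprecate {{precision}} at all: for single-label multiclass predictions, micro-averaged precision (total true positives over total predictions) is arithmetically the same number as accuracy, so nothing is lost by the rename. A plain-Java illustration with toy data (class and method names are mine, not the Spark evaluator API):

```java
/** Toy check that micro-averaged precision equals accuracy for
 *  single-label multiclass predictions. Not Spark code. */
public class MicroPrecisionIsAccuracy {
    static double accuracy(int[] labels, int[] preds) {
        int correct = 0;
        for (int i = 0; i < labels.length; i++) if (labels[i] == preds[i]) correct++;
        return (double) correct / labels.length;
    }

    static double microPrecision(int[] labels, int[] preds, int numClasses) {
        int tp = 0, predicted = 0;
        for (int c = 0; c < numClasses; c++) {
            for (int i = 0; i < labels.length; i++) {
                if (preds[i] == c) {
                    predicted++;            // every prediction counted exactly once
                    if (labels[i] == c) tp++;
                }
            }
        }
        return (double) tp / predicted;
    }

    public static void main(String[] args) {
        int[] labels = {0, 1, 2, 2, 1, 0, 2};
        int[] preds  = {0, 2, 2, 2, 1, 1, 0};
        System.out.println(accuracy(labels, preds));          // 4/7
        System.out.println(microPrecision(labels, preds, 3)); // same value, 4/7
    }
}
```

The fix in the examples is accordingly mechanical: construct the evaluator with `metricName="accuracy"` instead of `"precision"`.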
[jira] [Comment Edited] (SPARK-9834) Normal equation solver for ordinary least squares
[ https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315935#comment-15315935 ] Debasish Das edited comment on SPARK-9834 at 6/5/16 4:49 PM: - Do you have runtime comparisons that when features <= 4096, OLS using Normal Equations is faster than BFGS ? I am extending OLS for sparse features and it will be great if you can point to the runtime experiments you have done... was (Author: debasish83): Do you have runtime comparisons that when features <= 4096, OLS using Normal Equations is faster than BFGS ? > Normal equation solver for ordinary least squares > - > > Key: SPARK-9834 > URL: https://issues.apache.org/jira/browse/SPARK-9834 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.6.0 > > > Add normal equation solver for ordinary least squares with not many features. > The approach requires one pass to collect AtA and Atb, then solve the problem > on driver. It works well when the problem is not very ill-conditioned and not > having many columns. It also provides R-like summary statistics. > We can hide this implementation under LinearRegression. It is triggered when > there are no more than, e.g., 4096 features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-15771: Description: Since SPARK-15617 deprecated {{precision}} in {{MulticlassClassificationEvaluator}}, many ML examples broken. We should use {{accuracy}} to replace {{precision}} in these examples. (was: Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator.) > Many ML examples broken since we deprecated `precision` in > MulticlassClassificationEvaluator > > > Key: SPARK-15771 > URL: https://issues.apache.org/jira/browse/SPARK-15771 > Project: Spark > Issue Type: Bug > Components: Examples, ML >Reporter: Yanbo Liang > > Since SPARK-15617 deprecated {{precision}} in > {{MulticlassClassificationEvaluator}}, many ML examples broken. We should use > {{accuracy}} to replace {{precision}} in these examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15771: Assignee: Apache Spark > Many ML examples broken since we deprecated `precision` in > MulticlassClassificationEvaluator > > > Key: SPARK-15771 > URL: https://issues.apache.org/jira/browse/SPARK-15771 > Project: Spark > Issue Type: Bug > Components: Examples, ML >Reporter: Yanbo Liang >Assignee: Apache Spark > > Many ML examples broken since we deprecated `precision` in > MulticlassClassificationEvaluator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315936#comment-15315936 ] Apache Spark commented on SPARK-15771: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/13519 > Many ML examples broken since we deprecated `precision` in > MulticlassClassificationEvaluator > > > Key: SPARK-15771 > URL: https://issues.apache.org/jira/browse/SPARK-15771 > Project: Spark > Issue Type: Bug > Components: Examples, ML >Reporter: Yanbo Liang > > Many ML examples broken since we deprecated `precision` in > MulticlassClassificationEvaluator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15771: Assignee: (was: Apache Spark) > Many ML examples broken since we deprecated `precision` in > MulticlassClassificationEvaluator > > > Key: SPARK-15771 > URL: https://issues.apache.org/jira/browse/SPARK-15771 > Project: Spark > Issue Type: Bug > Components: Examples, ML >Reporter: Yanbo Liang > > Many ML examples broken since we deprecated `precision` in > MulticlassClassificationEvaluator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9834) Normal equation solver for ordinary least squares
[ https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315935#comment-15315935 ] Debasish Das commented on SPARK-9834: - Do you have runtime comparisons that when features <= 4096, OLS using Normal Equations is faster than BFGS ? > Normal equation solver for ordinary least squares > - > > Key: SPARK-9834 > URL: https://issues.apache.org/jira/browse/SPARK-9834 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.6.0 > > > Add normal equation solver for ordinary least squares with not many features. > The approach requires one pass to collect AtA and Atb, then solve the problem > on driver. It works well when the problem is not very ill-conditioned and not > having many columns. It also provides R-like summary statistics. > We can hide this implementation under LinearRegression. It is triggered when > there are no more than, e.g., 4096 features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
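The one-pass scheme the ticket describes — accumulate AtA and Atb while streaming over the rows, then solve the small k-by-k system on the driver — can be sketched in plain Java. This is a toy, not Spark's implementation (names are mine; Spark solves the symmetric system with specialized linear algebra and adds regularization and summary statistics), but it shows why only k*k + k numbers ever need to reach the driver:

```java
/** One-pass normal-equations OLS sketch: accumulate AtA and Atb over rows,
 *  then solve the small k x k system with Gaussian elimination and
 *  partial pivoting. */
public class NormalEquationsOLS {
    static double[] fit(double[][] rows, double[] y) {
        int k = rows[0].length;
        double[][] ata = new double[k][k];
        double[] atb = new double[k];
        // Single pass over the data: in Spark this would be a distributed
        // aggregate, shipping only k*k + k doubles back to the driver.
        for (int i = 0; i < rows.length; i++) {
            for (int a = 0; a < k; a++) {
                atb[a] += rows[i][a] * y[i];
                for (int b = 0; b < k; b++) ata[a][b] += rows[i][a] * rows[i][b];
            }
        }
        return solve(ata, atb);
    }

    static double[] solve(double[][] m, double[] v) {
        int n = v.length;
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(m[r][col]) > Math.abs(m[piv][col])) piv = r;
            double[] tmpRow = m[col]; m[col] = m[piv]; m[piv] = tmpRow;
            double tmpVal = v[col]; v[col] = v[piv]; v[piv] = tmpVal;
            for (int r = col + 1; r < n; r++) {
                double f = m[r][col] / m[col][col];
                for (int c = col; c < n; c++) m[r][c] -= f * m[col][c];
                v[r] -= f * v[col];
            }
        }
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = v[r];
            for (int c = r + 1; c < n; c++) s -= m[r][c] * x[c];
            x[r] = s / m[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        // y = 1 + 2x sampled at x = 0, 1, 2; first column is the intercept.
        double[][] rows = {{1, 0}, {1, 1}, {1, 2}};
        double[] y = {1, 3, 5};
        double[] coef = fit(rows, y);
        System.out.println(coef[0] + " " + coef[1]); // ~1.0 ~2.0
    }
}
```

The cost asymmetry also frames the runtime question asked here: the data pass is O(n*k^2) and the driver solve O(k^3), whereas L-BFGS pays O(n*k) per iteration — so which wins at k <= 4096 depends on iteration count and conditioning.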
[jira] [Created] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator
Yanbo Liang created SPARK-15771: --- Summary: Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator Key: SPARK-15771 URL: https://issues.apache.org/jira/browse/SPARK-15771 Project: Spark Issue Type: Bug Components: Examples, ML Reporter: Yanbo Liang Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15723) SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ name
[ https://issues.apache.org/jira/browse/SPARK-15723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15723. --- Resolution: Fixed Assignee: Brett Randall Fix Version/s: 2.0.0 1.6.2 Resolved by https://github.com/apache/spark/pull/13462 > SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ > name > -- > > Key: SPARK-15723 > URL: https://issues.apache.org/jira/browse/SPARK-15723 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Brett Randall >Assignee: Brett Randall >Priority: Minor > Labels: test > Fix For: 1.6.2, 2.0.0 > > > {{org.apache.spark.status.api.v1.SimpleDateParamSuite}} has this assertion: > {code} > new SimpleDateParam("2015-02-20T17:21:17.190EST").timestamp should be > (1424470877190L) > {code} > This test is fragile and fails when executing in an environment where the > local default timezone causes {{EST}} to be interpreted as something other > than US Eastern Standard Time. If your local timezone is > {{Australia/Sydney}}, then {{EST}} equates to {{GMT+10}} and you will get: > {noformat} > date parsing *** FAILED *** > 1424413277190 was not equal to 1424470877190 (SimpleDateParamSuite.scala:29) > {noformat} > In short, {{SimpleDateFormat}} is sensitive to the local default {{TimeZone}} > when interpreting short zone names. According to the {{TimeZone}} javadoc, > they ought not be used: > {quote} > Three-letter time zone IDs > For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such > as "PST", "CTT", "AST") are also supported. However, their use is deprecated > because the same abbreviation is often used for multiple time zones (for > example, "CST" could be U.S. "Central Standard Time" and "China Standard > Time"), and the Java platform can then only recognize one of them. 
> {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15723) SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ name
[ https://issues.apache.org/jira/browse/SPARK-15723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315898#comment-15315898 ] Sean Owen commented on SPARK-15723: --- Yeah I get the ambiguity problem. OK, so you're saying it doesn't help to set the time zone on the SimpleDateFormat, that the disambiguation is always relative to the machine's time zone and not what is configured for the SimpleDateFormat? I tried this FWIW and it gave the 'right' answer according to the current test, not sure what's different: {code} $ scala -J-Duser.timezone="Australia/Sydney" scala> System.getProperty("user.timezone") res0: String = Australia/Sydney ... scala> val sdf = new SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSz") sdf: java.text.SimpleDateFormat = java.text.SimpleDateFormat@8a9df61b scala> sdf.parse("2015-02-20T17:21:17.190EST").getTime res1: Long = 1424470877190 {code} Anyway I think just patching up the test is OK for now. > SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ > name > -- > > Key: SPARK-15723 > URL: https://issues.apache.org/jira/browse/SPARK-15723 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Brett Randall >Priority: Minor > Labels: test > > {{org.apache.spark.status.api.v1.SimpleDateParamSuite}} has this assertion: > {code} > new SimpleDateParam("2015-02-20T17:21:17.190EST").timestamp should be > (1424470877190L) > {code} > This test is fragile and fails when executing in an environment where the > local default timezone causes {{EST}} to be interpreted as something other > than US Eastern Standard Time. If your local timezone is > {{Australia/Sydney}}, then {{EST}} equates to {{GMT+10}} and you will get: > {noformat} > date parsing *** FAILED *** > 1424413277190 was not equal to 1424470877190 (SimpleDateParamSuite.scala:29) > {noformat} > In short, {{SimpleDateFormat}} is sensitive to the local default {{TimeZone}} > when interpreting short zone names. 
According to the {{TimeZone}} javadoc, > they ought not be used: > {quote} > Three-letter time zone IDs > For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such > as "PST", "CTT", "AST") are also supported. However, their use is deprecated > because the same abbreviation is often used for multiple time zones (for > example, "CST" could be U.S. "Central Standard Time" and "China Standard > Time"), and the Java platform can then only recognize one of them. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
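A related wrinkle behind the javadoc warning quoted above: `TimeZone.getTimeZone` resolves a legacy three-letter ID to one fixed zone (e.g. "EST" becomes a fixed UTC-5 zone with no DST rules at all), and silently falls back to GMT for any ID it cannot understand — two more reasons to prefer full region/city IDs. A quick plain-JDK check (class name is mine):

```java
import java.util.TimeZone;

public class ZoneIdPitfalls {
    public static void main(String[] args) {
        TimeZone est = TimeZone.getTimeZone("EST");          // legacy 3-letter ID
        TimeZone ny  = TimeZone.getTimeZone("America/New_York");

        System.out.println(est.getRawOffset());    // -18000000 (fixed UTC-5)
        System.out.println(est.useDaylightTime()); // false: no DST rules at all
        System.out.println(ny.useDaylightTime());  // true

        // Unknown IDs are not an error -- they silently become GMT.
        System.out.println(TimeZone.getTimeZone("No/Such_Zone").getID()); // GMT
    }
}
```

So even when "EST" resolves "correctly", it is not equivalent to America/New_York: a summer timestamp parsed through it will silently miss the DST shift.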
[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315895#comment-15315895 ] Sean Owen commented on SPARK-15716: --- [~HalfVim] are you experiencing out of memory errors? it's normal for the JVM to keep using more memory even after GCs; it's of course not normal if it seems to use a bunch of memory and then be unable to free it. Example: run with a smaller heap? is it still fine? then this isn't a problem > Memory usage of driver keeps growing up in Spark Streaming > -- > > Key: SPARK-15716 > URL: https://issues.apache.org/jira/browse/SPARK-15716 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1 > Environment: Oracle Java 1.8.0_51, SUSE Linux >Reporter: Yan Chen > Original Estimate: 48h > Remaining Estimate: 48h > > Code: > {code:java} > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.streaming.Durations; > import org.apache.spark.streaming.StreamingContext; > import org.apache.spark.streaming.api.java.JavaPairDStream; > import org.apache.spark.streaming.api.java.JavaStreamingContext; > import org.apache.spark.streaming.api.java.JavaStreamingContextFactory; > public class App { > public static void main(String[] args) { > final String input = args[0]; > final String check = args[1]; > final long interval = Long.parseLong(args[2]); > final SparkConf conf = new SparkConf(); > conf.set("spark.streaming.minRememberDuration", "180s"); > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true"); > conf.set("spark.streaming.unpersist", "true"); > conf.set("spark.streaming.ui.retainedBatches", "10"); > conf.set("spark.ui.retainedJobs", "10"); > conf.set("spark.ui.retainedStages", "10"); > conf.set("spark.worker.ui.retainedExecutors", "10"); > 
conf.set("spark.worker.ui.retainedDrivers", "10"); > conf.set("spark.sql.ui.retainedExecutions", "10"); > JavaStreamingContextFactory jscf = () -> { > SparkContext sc = new SparkContext(conf); > sc.setCheckpointDir(check); > StreamingContext ssc = new StreamingContext(sc, > Durations.milliseconds(interval)); > JavaStreamingContext jssc = new JavaStreamingContext(ssc); > jssc.checkpoint(check); > // setup pipeline here > JavaPairDStreaminputStream = > jssc.fileStream( > input, > LongWritable.class, > Text.class, > TextInputFormat.class, > (filepath) -> Boolean.TRUE, > false > ); > JavaPairDStream usbk = inputStream > .updateStateByKey((current, state) -> state); > usbk.checkpoint(Durations.seconds(10)); > usbk.foreachRDD(rdd -> { > rdd.count(); > System.out.println("usbk: " + rdd.toDebugString().split("\n").length); > return null; > }); > return jssc; > }; > JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf); > jssc.start(); > jssc.awaitTermination(); > } > } > {code} > Command used to run the code > {code:none} > spark-submit --keytab [keytab] --principal [principal] --class [package].App > --master yarn --driver-memory 1g --executor-memory 1G --conf > "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf > "spark.executor.instances=2" --conf > "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf > "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log > -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy > -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] > [dir-on-hdfs] 200 > {code} > It's a very simple piece of code, when I ran it, the memory usage of driver > keeps going up. There is no file input in our runs. 
The batch interval is set to > 200 milliseconds; processing time for each batch is below 150 milliseconds, > and most batches are below 70 milliseconds. > !http://i.imgur.com/uSzUui6.png! > The rightmost four red triangles are full GCs, which were triggered manually > using the "jcmd pid GC.run" command. > I also did more experiments in the second and third comment I posted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
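Aside on the configuration in the repro above: the various retained* settings (spark.ui.retainedJobs, spark.streaming.ui.retainedBatches, and so on) each cap how many finished entries the driver keeps for its UI, which is one of the usual ways to bound driver-side bookkeeping. As a rough stand-alone sketch of the eviction behavior those caps imply (hypothetical Java, not Spark's actual implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a "retained" store: once more than maxRetained
// entries have been inserted, the oldest are evicted, which is the kind
// of bound a setting like spark.ui.retainedJobs=10 places on the driver.
class RetainedStore<K, V> extends LinkedHashMap<K, V> {
    private final int maxRetained;

    RetainedStore(int maxRetained) {
        this.maxRetained = maxRetained; // insertion-ordered by default
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; returning true drops the oldest entry.
        return size() > maxRetained;
    }
}
```

If memory still grows without bound with all such caps set to small values, as reported here, the leak is elsewhere in the driver, which is what makes this issue interesting.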
[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314190#comment-15314190 ] Yan Chen edited comment on SPARK-15716 at 6/5/16 2:05 PM: -- I also tried to run Scala code adapted from the example code in https://github.com/apache/spark/blob/branch-1.4/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala. The code: {code:java} import org.apache.spark.SparkConf import org.apache.spark.HashPartitioner import org.apache.spark.streaming._ import org.apache.log4j.{Level, Logger} object StatefulNetworkWordCount { def main(args: Array[String]) { val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements if (!log4jInitialized) { Logger.getRootLogger.setLevel(Level.WARN) } val updateFunc = (values: Seq[Int], state: Option[Int]) => { val currentCount = values.sum val previousCount = state.getOrElse(0) Some(currentCount + previousCount) } val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => { iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s))) } val sparkConf = new SparkConf() .setAppName("StatefulNetworkWordCount") sparkConf.set("spark.streaming.minRememberDuration", "180s") sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") sparkConf.set("spark.streaming.unpersist", "true") sparkConf.set("spark.streaming.ui.retainedBatches", "10") sparkConf.set("spark.ui.retainedJobs", "10") sparkConf.set("spark.ui.retainedStages", "10") sparkConf.set("spark.worker.ui.retainedExecutors", "10") sparkConf.set("spark.worker.ui.retainedDrivers", "10") sparkConf.set("spark.sql.ui.retainedExecutions", "10") sparkConf.set("spark.cleaner.ttl", "240s") // Create the context with a 1 second batch size val ssc = new StreamingContext(sparkConf, Milliseconds(args(2).toLong)) ssc.checkpoint(args(1)) // Initial RDD input to updateStateByKey val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1))) // 
Create a ReceiverInputDStream on target ip:port and count the // words in input stream of \n delimited text (eg. generated by 'nc') val lines = ssc.textFileStream(args(0))//ssc.socketTextStream(args(0), args(1).toInt) val words = lines.flatMap(_.split(" ")) val wordDstream = words.map(x => (x, 1)) // Update the cumulative count using updateStateByKey // This will give a Dstream made of state (which is the cumulative count of the words) val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc, new HashPartitioner (ssc.sparkContext.defaultParallelism), true, initialRDD) stateDstream.print() ssc.start() ssc.awaitTermination() } } {code} The same behavior happened.
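The update function in both repro programs folds a key's new batch values into its previously checkpointed state: the new count is the sum of the batch's values plus the prior count. A plain-Java sketch of that fold (hypothetical names, no Spark dependency, mirroring the updateStateByKey semantics used above):

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-alone model of the repro's updateFunc: sum the
// values seen in this batch and add the previously stored count.
class WordState {
    static Optional<Integer> update(List<Integer> values, Optional<Integer> state) {
        int currentCount = values.stream().mapToInt(Integer::intValue).sum();
        int previousCount = state.orElse(0);
        return Optional.of(currentCount + previousCount);
    }
}
```

Each batch then contributes only its delta, while the checkpoint carries the accumulated state, which is why the state DStream must be checkpointed at all.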
[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315890#comment-15315890 ] Yan Chen commented on SPARK-15716: -- We already tried 1.6.0; the issue does not happen there. Could I ask what happens when a bug only occurs in an old version? Does it get fixed? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315887#comment-15315887 ] Ran Haim commented on SPARK-15731: -- Fine, you can close it if it works; I will use a workaround. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving orc files with partitions, the partition directories created do > not have x permission (even though umask is 002), so no other users can get > inside those directories to read the orc file. > When writing parquet files there is no such issue. > code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315885#comment-15315885 ] Sean Owen commented on SPARK-15731: --- [~ran.h...@optimalplus.com] no, I removed "Target Version". Only committers set that, and it doesn't make sense to target an already-released version. Given that [~kevinyu98] sees different behavior on what I presume is a later version, I think it's probably fixed in a more up-to-date version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315884#comment-15315884 ] Ran Haim commented on SPARK-15731: -- I am using the MapR distribution, and it ships Spark 1.5.1 :/ so not really. I did specify the version, but you removed it a few days ago :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15731: -- Affects Version/s: 1.5.1 [~ran.h...@optimalplus.com] best to indicate that in the JIRA then. Can you try master or at least the 2.0 preview? 1.5.1 is pretty old now, and there's a good chance this was fixed along the way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
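For reference on what the reporter expects from a 002 umask: a newly created directory should get mode 0777 & ~umask = 0775, which keeps the x (search) bit for group and other, so other users can descend into the partition directories. A small sketch of that POSIX permission arithmetic (plain Java, illustrative only, not the ORC writer's code):

```java
// Sketch of POSIX umask arithmetic: directories start from 0777 (files
// from 0666) and the umask's bits are cleared, so umask 002 should
// leave the execute (search) bit set for group and other.
class Umask {
    static int dirMode(int umask) {
        return 0777 & ~umask;
    }

    static boolean othersCanSearch(int mode) {
        return (mode & 0001) != 0; // the o+x bit
    }
}
```

The report is that the ORC writer's partition directories come out without that x bit despite the umask, so other users cannot enter them, while the Parquet path behaves as expected.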
[jira] [Commented] (SPARK-15472) Add partitioned `csv`, `json`, `text` format support for FileStreamSink
[ https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315842#comment-15315842 ] Apache Spark commented on SPARK-15472: -- User 'lw-lin' has created a pull request for this issue: https://github.com/apache/spark/pull/13518 > Add partitioned `csv`, `json`, `text` format support for FileStreamSink > --- > > Key: SPARK-15472 > URL: https://issues.apache.org/jira/browse/SPARK-15472 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin > > Support for the partitioned `parquet` format in FileStreamSink was added in > SPARK-14716; now let's add support for the partitioned `csv`, `json`, and `text` > formats. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI
[ https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315797#comment-15315797 ] Liang-Chi Hsieh commented on SPARK-15433: - [~davies] Can you set the assignee to me? Thanks. > PySpark core test should not use SerDe from PythonMLLibAPI > -- > > Key: SPARK-15433 > URL: https://issues.apache.org/jira/browse/SPARK-15433 > Project: Spark > Issue Type: Test > Components: PySpark >Reporter: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.1.0 > > > Currently the PySpark core tests use the SerDe from PythonMLLibAPI, which pulls in many MLlib dependencies. They should use SerDeUtil instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14839) Support for other types as option in OPTIONS clause
[ https://issues.apache.org/jira/browse/SPARK-14839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14839: Assignee: (was: Apache Spark) > Support for other types as option in OPTIONS clause > --- > > Key: SPARK-14839 > URL: https://issues.apache.org/jira/browse/SPARK-14839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > This was found in https://github.com/apache/spark/pull/12494. > Currently, Spark SQL does not support other types or {{null}} as option values. > For example, > {code} > CREATE ... > USING csv > OPTIONS (path "your-path", quote null) > {code} > throws the exception below > {code} > Unsupported SQL statement > == SQL == > CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, > modelName string, comments string, grp string) USING csv OPTIONS (path > "your-path", quote null) > org.apache.spark.sql.catalyst.parser.ParseException: > Unsupported SQL statement > == SQL == > CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, > modelName string, comments string, grp string) USING csv OPTIONS (path > "your-path", quote null) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:195) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:764) > ... 
> {code} > Currently, the Scala API accepts options of the types {{String}}, > {{Long}}, {{Double}}, and {{Boolean}}, and the Python API also supports other > types. I think this would let us support data sources in a consistent way. > It looks okay to provide other types as arguments, just as > [Microsoft SQL|https://msdn.microsoft.com/en-us/library/ms190322.aspx] does, > because the [SQL-1992|http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] > standard mentions options as below: > {quote} > An implementation remains conforming even if it provides user options > to process nonconforming SQL language or to process conforming > SQL language in a nonconforming manner. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
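The proposal above is that OPTIONS-clause values parse as typed literals rather than strings only. A hypothetical stand-alone sketch of such a literal mapper (illustrative only, not Spark's actual parser; the method name is made up):

```java
// Hypothetical mapper from an OPTIONS-clause literal token to a typed
// value, sketching the proposal that values like `quote null`,
// `header true`, or `samplingRatio 0.5` be accepted alongside
// double-quoted strings.
class OptionLiteral {
    static Object parse(String token) {
        if (token.equalsIgnoreCase("null")) return null;
        if (token.equalsIgnoreCase("true") || token.equalsIgnoreCase("false")) {
            return Boolean.parseBoolean(token);
        }
        if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
            return token.substring(1, token.length() - 1); // quoted string
        }
        try {
            return Long.parseLong(token); // integral literal
        } catch (NumberFormatException ignored) { }
        return Double.parseDouble(token); // fall back to a double literal
    }
}
```

With something of this shape, {{quote null}} would map to an absent value instead of tripping the parser as it does in the stack trace above.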
[jira] [Commented] (SPARK-14839) Support for other types as option in OPTIONS clause
[ https://issues.apache.org/jira/browse/SPARK-14839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315783#comment-15315783 ] Apache Spark commented on SPARK-14839: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/13517 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14839) Support for other types as option in OPTIONS clause
[ https://issues.apache.org/jira/browse/SPARK-14839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14839: Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315782#comment-15315782 ] Ran Haim commented on SPARK-15731: -- Hi Kevin, it is pretty much the same; the only difference I see is that I created the dataframe with a given schema using createDataFrame(Rdd, StructType). I am using Spark 1.5.1 - what version did you use? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15720) MLLIB Word2Vec loading large number of vectors in the model results in java.lang.NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-15720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315776#comment-15315776 ] yuhao yang commented on SPARK-15720: Thanks for the feedback. I'll try to investigate possible solutions and post an update here. > MLLIB Word2Vec loading large number of vectors in the model results in > java.lang.NegativeArraySizeException > --- > > Key: SPARK-15720 > URL: https://issues.apache.org/jira/browse/SPARK-15720 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Linux >Reporter: Rohan G Patil > > Loading a large number of pre-trained vectors into Spark MLlib's > Word2Vec model results in a java.lang.NegativeArraySizeException. > Code - > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L597 > Test with a number of vectors greater than 16777215 and a vector size of > 128 or more. > There is an integer overflow happening here; it should be an easy fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
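The thresholds in the report line up with 32-bit arithmetic: the word vectors are stored in one flat array sized roughly numWords * vectorSize, and 16777216 * 128 = 2^31, which overflows int. A minimal stand-alone illustration (plain Java with hypothetical method names, not the Word2VecModel code):

```java
// Demonstrates the 32-bit overflow behind the NegativeArraySizeException:
// sizing a flat word-vector array as numWords * vectorSize overflows int
// exactly when the product reaches 2^31 (16777216 * 128).
class Word2VecOverflow {
    static int flatSizeInt(int numWords, int vectorSize) {
        return numWords * vectorSize; // overflows silently to a negative int
    }

    static long flatSizeLong(int numWords, int vectorSize) {
        return (long) numWords * vectorSize; // widen before multiplying
    }
}
```

Allocating `new float[flatSizeInt(...)]` with that negative size is what would throw NegativeArraySizeException; a fix would presumably need to widen the arithmetic and either chunk the storage or reject sizes that cannot fit in one Java array.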